Power State Manipulation commands
From Grid5000
Motivations
With the rise of green computing as a research topic, users need a command more powerful than kareboot to manipulate power states. At the same time, the variety of implementations of power state management commands needs to be hidden from administrators working across all the sites, who should be given uniform high-level tools. On this page, we attempt to bring together the best ideas from the different ad-hoc tools developed by admins to answer this need, to help converge towards a comprehensive set of tools.
Constraints
There are a few things to take into account when designing such a set of tools:
- Power state can be changed at different levels
  - halt or another command (suspend?) on the host
  - IPMI or RSA control of the chassis
  - PDU control of power distribution
- PDU control can affect more than one host at a time
- At a higher level, the power state manipulation tools need to work with different groups of nodes
  - user-defined groups, with a flexible command line syntax to describe the hosts to manipulate, or read from a file
  - all the nodes of a given reservation
  - all the nodes in a given OAR state
  - all the nodes in a given IPMI status
  - the intersection of two or more of the above (e.g. all the nodes of reservation 345284 that are currently off at chassis level)
- Power state change can be requested by programs (OAR, phoenix) as well as by users
- Rights to change the power state
  - When run as g5kadmin or root, the tool should not check rights to manipulate the power state of nodes, but could optionally check that the nodes are not currently reserved by someone other than the admin running the command.
  - When run as an ordinary user, the tool should check that the user has deploy rights on the nodes.
We therefore need to design very carefully how such a set of tools should behave when a command on a PDU has to wait until the other impacted hosts are free, and how it should retry in the event that the desired power state cannot be reached.
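The retry half of this requirement can be sketched as a bounded polling loop. This is only an illustration under stated assumptions, not a proposed implementation: wait_for_state and fake_status are hypothetical names, and the status command is assumed to print 'on', 'off' or 'unknown' on standard output. Waiting for the other hosts behind a shared PDU to become free is not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): poll a status command
# until it prints the wanted state, giving up after a bounded number of tries.
# Usage: wait_for_state WANTED MAX_TRIES DELAY_SECONDS STATUS_CMD [ARGS...]
wait_for_state() {
    wanted=$1
    max_tries=$2
    delay=$3
    shift 3
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # Only the printed state is used; the exit code is ignored here.
        state=$("$@" 2>/dev/null || true)
        if [ "$state" = "$wanted" ]; then
            return 0
        fi
        # Sleep between tries, but not after the last one.
        if [ "$try" -lt "$max_tries" ]; then
            sleep "$delay"
        fi
        try=$((try + 1))
    done
    return 1
}

# Demo with a fake status command that reports 'on' twice, then 'off'.
# A file holds the call counter because the command runs in a subshell.
counter_file=$(mktemp)
echo 0 > "$counter_file"
fake_status() {
    n=$(cat "$counter_file")
    echo $((n + 1)) > "$counter_file"
    if [ "$n" -lt 2 ]; then echo on; else echo off; fi
}

if wait_for_state off 5 0 fake_status; then
    echo "node reached state 'off'"
else
    echo "gave up"
fi
```

A real tool would replace fake_status with the site's own status command and choose the number of tries and the delay according to how long the hardware takes to change state.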
This diagram is an overview of how the tools could be combined with the expected users and hosts:
Specification
Command to list nodes and execute other windowed commands
A small tool designed by and for the technical staff, nodes-g5k can send requests to both the reference and the monitoring API, read node lists from external files, print the list of selected nodes, and execute arbitrary windowed commands on them.
Command interface
nodes-g5k:
-h, --help Show this help
-s, --site SITE Select nodes in specified site
-c, --cluster CLUSTER Select nodes in specified cluster
-n, --nodes NODES Select nodes in a series, using c3 syntax
-j, --oarjob JOBID Select nodes from a OAR job
-o, --oarstate [!]free|busy|besteffort|unknown
Select all nodes with specified OAR system state
-r, --rawstate [!]alive|suspected|absent|dead
Select all nodes with specified OAR hardware state
-f, --file FILE Select nodes out of a file ('-' means stdin)
-l, --logic and|or The logical connective between the various filters, 'and' by default
-d, --debug Turn on debug mode and print verbose output
-v, --version Print the version of nodes-g5k
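One way to read the --logic option is as a choice between the intersection ('and') and the union ('or') of the node lists produced by each filter. The sketch below illustrates that reading with plain text files holding one node name per line. It is not the nodes-g5k implementation, and the combine function and the sample files are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: combine two node lists (one node name per line).
# Usage: combine and|or FILE_A FILE_B
combine() {
    logic=$1
    a=$2
    b=$3
    case $logic in
        and)
            # Deduplicate each list first, so that a name seen twice
            # afterwards is necessarily present in both lists.
            { sort -u "$a"; sort -u "$b"; } | sort | uniq -d
            ;;
        or)
            # Union: merge both lists and drop duplicates.
            sort -u "$a" "$b"
            ;;
        *)
            echo "unknown logic: $logic" >&2
            return 2
            ;;
    esac
}

# Demo data: nodes of a job and nodes of a cluster (made-up names).
tmp=$(mktemp -d)
printf '%s\n' node-1 node-2 node-3 > "$tmp/job"
printf '%s\n' node-2 node-3 node-4 > "$tmp/cluster"

echo "and:"
combine and "$tmp/job" "$tmp/cluster"   # prints node-2 and node-3
echo "or:"
combine or "$tmp/job" "$tmp/cluster"    # prints node-1 to node-4
```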
Configuration file
There is no need for a configuration file for this tool.
Use cases
# List all nodes from node-1 to node-30 and node-50
$ nodes-g5k --nodes node-[1-30,50]
# List all nodes in site siteA
$ nodes-g5k --site siteA
# List all nodes in cluster clusterZ
$ nodes-g5k --cluster clusterZ
# List all nodes in both job 123456 and clusterX
$ nodes-g5k --oarjob 123456 --cluster=clusterX
# List all dead nodes in site siteB
$ nodes-g5k --site siteB --rawstate dead
# List all nodes in clusterY OR not free
$ nodes-g5k --cluster clusterY --oarstate '!free' --logic or
# List all nodes in both file /tmp/nodes and in besteffort
$ nodes-g5k --file /tmp/nodes --oarstate besteffort --logic and
# Ping of all alive nodes in siteC
$ nodes-g5k --site siteC --rawstate alive | xargs -P 1 -I {} ping -c 1 {}
--- node-1.siteC.grid5000.fr ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.190/2.190/2.190/0.000 ms
PING node-1.siteC.grid5000.fr (172.24.1.12) 56(84) bytes of data.
64 bytes from node-1.siteC.grid5000.fr (172.24.1.12): icmp_seq=1 ttl=64 time=2.53 ms
--- node-2.siteC.grid5000.fr ---
[...]
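The bracketed form used with --nodes in the first use case, node-[1-30,50], can be expanded with a few lines of shell. The sketch below handles only that simple form (one bracket group containing comma-separated numbers or ranges) and may not cover everything the c3 syntax allows. The expand_nodes function is a hypothetical name, not part of nodes-g5k.

```shell
#!/bin/sh
# Hypothetical helper: expand a spec such as 'node-[1-3,7]' into one
# node name per line. Specs without brackets are printed unchanged.
expand_nodes() {
    spec=$1
    case $spec in
        *\[*\]*) ;;
        *) printf '%s\n' "$spec"; return 0 ;;
    esac
    prefix=${spec%%\[*}      # text before the opening bracket
    rest=${spec#*\[}         # text after the opening bracket
    ranges=${rest%%\]*}      # comma-separated items inside the brackets
    suffix=${rest#*\]}       # text after the closing bracket
    old_ifs=$IFS
    IFS=,
    for r in $ranges; do
        case $r in
            *-*) first=${r%-*}; last=${r#*-} ;;
            *)   first=$r; last=$r ;;
        esac
        i=$first
        while [ "$i" -le "$last" ]; do
            printf '%s%s%s\n' "$prefix" "$i" "$suffix"
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
}

expand_nodes 'node-[1-3,7]'   # prints node-1, node-2, node-3, node-7
```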
High level commands
Command interface
A small tool designed by the technical staff, ilanpower abstracts all the differences between the various remote power control technologies (ipmitool, RSA).
Usage: /usr/bin/ilanpower (options)
-c, --command ACTION Action to perform [on, off, cycle, safety_cycle]
--config FILE Configuration file to use (default: /etc/ilanpower.json)
-d, --debug Active debug mode
-m, --machine MACHINE Machine to work on
-n, --no-wait Do not wait the node to have the required power status
--sleep SEC Set how many seconds to sleep in mode safety_cycle (override value in /etc/ilanpower.json)
-s, --state Show the power status of the node
-v, --version Show ilanpower version
-h, --help Show this message
Configuration file
A configuration file /etc/ilanpower.json gives all the necessary passwords, IP addresses, mappings, etc. for all the nodes of the site.
{
  "clusters": {
    "capricorne": {
      "series": "1-56",
      "user": "root",
      "password": "XXXXX",
      "bmc": "rsa",
      "suffix": "-bmc",
      "sleep": "8"
    },
    "sagittaire": {
      "series": "1-79",
      "user": "NULL",
      "password": "XXXXX",
      "bmc": "ipmi",
      "suffix": "-bmc",
      "sleep": "8"
    }
  }
}
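The configuration does not say how the suffix field is used. One plausible reading, which is an assumption and not something this page confirms, is that the name of a node's management card is obtained by appending the suffix to the short host name. The sketch below shows that derivation; bmc_name is a hypothetical helper and the host names are examples only.

```shell
#!/bin/sh
# Hypothetical helper: derive a management-card host name from a node name
# by appending SUFFIX to the short host name (assumed convention).
# Usage: bmc_name HOST SUFFIX
bmc_name() {
    host=$1
    suffix=$2
    case $host in
        # Fully qualified name: insert the suffix before the first dot.
        *.*) printf '%s%s.%s\n' "${host%%.*}" "$suffix" "${host#*.}" ;;
        # Short name: simply append the suffix.
        *)   printf '%s%s\n' "$host" "$suffix" ;;
    esac
}

bmc_name sagittaire-12 -bmc                  # prints sagittaire-12-bmc
bmc_name sagittaire-12.siteA.example -bmc    # prints sagittaire-12-bmc.siteA.example
```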
Use cases
# Print the power status of a node.
# Output could be 'on', 'off' or 'unknown'
# Return values are:
# - 0 for 'on'
# - 1 for 'off'
# - 2 for 'unknown'
$ ilanpower --state --machine node-1
node-1: on
$ echo $?
0
# Turn off a node
# Output could be 'ok' or 'unknown'
# Return values are:
# - 0 for 'ok'
# - 1 for 'unknown'
$ ilanpower --command off --machine node-1
node-1 off
User accessible command
Command interface
Kadeploy3 provides a command, kapower3, that can power on, shut down, and get the power status of a set of nodes:
$ kapower3 -h
Usage: kapower3 [options]
Contact: kadeploy3-users@lists.gforge.inria.fr
General options:
-d, --debug-mode Activate the debug mode
-f, --file MACHINELIST Files containing list of nodes (- means stdin)
-l, --level VALUE Level (soft, hard, very_hard)
-m, --machine MACHINE Operate on the given machines
--multi-server Activate the multi-server mode
-n, --output-ko-nodes FILENAME File that will contain the nodes on which the operation has not been correctly performed
-o, --output-ok-nodes FILENAME File that will contain the nodes on which the operation has been correctly performed
--off Shutdown the nodes
--on Power on the nodes
--status Get the status of the nodes
-v, --version Get the version
--no-wait Do not wait the end of the power operation
--server STRING Specify the Kadeploy server to use
-V, --verbose-level VALUE Verbose level between 0 to 4
Configuration file
# timeout after which harder shutdown command is tried if status is still 'on'
timeout_shutdown = 200
#
# power on
#
# soft poweron command
cmd_soft_poweron = none
# hard poweron command
cmd_hard_poweron = ilanpower --command on --machine HOSTNAME_FQDN
# very hard poweron command
cmd_very_hard_poweron = power-g5k --command very-hard-off --machine HOSTNAME_FQDN
#
# reboot
#
cmd_soft_reboot_ssh = ssh -q [...] root@HOSTNAME_FQDN "nohup /sbin/reboot -f >/dev/null &"
cmd_soft_reboot_rsh = rsh -l root HOSTNAME_FQDN /usr/local/bin/reboot_detach
cmd_hard_reboot = ilanpower --command reboot --machine HOSTNAME_FQDN
cmd_very_hard_reboot = power-g5k --command very-hard-reboot --machine HOSTNAME_FQDN
#
# shutdown
#
# soft shutdown command for both ssh and rsh
cmd_soft_shutdown_ssh = ssh -q [...] root@HOSTNAME_FQDN "nohup /sbin/halt >/dev/null &"
cmd_soft_shutdown_rsh = rsh -l root HOSTNAME_FQDN /usr/local/bin/shutdown_detach
# hard shutdown command
cmd_hard_shutdown = ilanpower --command off --machine HOSTNAME_FQDN
# very hard shutdown command
cmd_very_hard_shutdown = power-g5k --command very-hard-off --machine HOSTNAME_FQDN
#
# status
#
# Command to get the power status of a node
# return values should be:
# - 0 if 'on'
# - 1 if 'off'
# - 2 if 'unknown'
cmd_power_status = ilanpower --status --type hard --machine HOSTNAME_FQDN
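The configuration above implies an escalation: try the soft shutdown, and if the node is still not off once the timeout expires, move on to the hard and then the very hard command. The sketch below shows that control flow for a single node. It is a simplified illustration, not kapower3's code: soft_off, hard_off, very_hard_off and node_status are hypothetical hooks standing in for the configured commands, and node_status is assumed to follow the exit-code convention given for cmd_power_status (0 = on, 1 = off, 2 = unknown).

```shell
#!/bin/sh
# Hypothetical sketch: shut a node down, escalating through three levels.
# Usage: power_off_escalating NODE TIMEOUT_SECONDS
power_off_escalating() {
    node=$1
    timeout=$2
    for level in soft hard very_hard; do
        # Call the hook for this level (soft_off, hard_off, very_hard_off);
        # a failing hook simply leads to the next level after the timeout.
        "${level}_off" "$node" || true
        waited=0
        while :; do
            rc=0
            node_status "$node" || rc=$?
            if [ "$rc" -eq 1 ]; then
                echo "$node off (level: $level)"
                return 0
            fi
            if [ "$waited" -ge "$timeout" ]; then
                break
            fi
            sleep 1
            waited=$((waited + 1))
        done
    done
    echo "$node NOT properly shut down"
    return 1
}

# Demo with simulated hooks: the node ignores the soft request and only
# goes off at the hard level. A file stands in for the real power state.
state_file=$(mktemp)
echo on > "$state_file"
soft_off()      { :; }
hard_off()      { echo off > "$state_file"; }
very_hard_off() { echo off > "$state_file"; }
node_status() {
    case $(cat "$state_file") in
        on)  return 0 ;;
        off) return 1 ;;
        *)   return 2 ;;
    esac
}

power_off_escalating node-6 0   # prints: node-6 off (level: hard)
```

With a timeout of 0 the status is checked once per level, which keeps the demo instantaneous. A real deployment would pass the configured timeout_shutdown value instead.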
Use cases
$ kapower3 --machine node-[1-3,7] --status
node-1 on
node-2 on
node-3 off
node-7 on
$ seq -f "node-%g" 5 9 | kapower3 --file - --status
node-5 on
node-6 off
node-7 on
node-8 unknown
node-9 off
$ seq -f "node-%g" 5 9 | kapower3 --file - --off
# try shutdown node-[5-9] with 'cmd_soft_shutdown_ssh'
# for all nodes not in status 'off' after 'timeout_shutdown' try shutdown node-[5-9] with 'cmd_hard_shutdown'
# for all nodes not in status 'off' after 'timeout_shutdown' try shutdown node-[5-9] with 'cmd_very_hard_shutdown'
nodes properly shutdown:
node-5
node-7
node-8
nodes NOT properly shutdown:
node-6 on
node-9 unknown
$ seq -f "node-%g" 5 9 | kapower3 --file - --off --level hard --output-ko-nodes /tmp/nodes
# try shutdown node-[5-9] with 'cmd_hard_shutdown'
# for all nodes not in status 'off' after 'timeout_shutdown' try shutdown node-[5-9] with 'cmd_very_hard_shutdown'
nodes properly shutdown:
node-5
node-7
node-8
nodes NOT properly shutdown:
node-6 on
node-9 unknown
# save the list of nodes still in status 'on' or 'unknown' in /tmp/nodes
$ seq -f "node-%g" 5 9 | kapower3 --file - --off --level soft --output-ko-nodes /tmp/nodes --no-wait
# try shutdown node-[5-9] with 'cmd_soft_shutdown_ssh'
# save the list of nodes where 'cmd_soft_shutdown_ssh' returns a value other than 0 in /tmp/nodes



