Power Monitoring Devices

From Grid5000


Note

This page is only useful as a reference for users who need low-level information about power monitoring devices, e.g. how to access them using SNMP. It contains site-specific information and describes known problems and shortcomings of power monitoring. Users should follow Energy_consumption_monitoring_tutorial for up-to-date instructions on accessing power measurements.

Monitoring devices available

Lyon and Grenoble

These sites have dedicated "wattmetre" devices for power monitoring (see Lyon:Wattmetre and Grenoble:Wattmetre). Their usage is described in the Energy_consumption_monitoring_tutorial.

Nancy

Nancy has two kinds of PDUs, from the vendors EATON and APC. Some of these PDUs allow per-outlet power monitoring.

Rennes

Rennes has two kinds of PDUs, from the vendors EATON and APC. Some of these PDUs allow per-outlet power monitoring.

Lille

Lille has PDUs from the vendor APC. These PDUs allow per-outlet power monitoring. A single node in Lille may be attached to two PDUs.

How to access measures

You should first read the Energy consumption monitoring tutorial.

Using Grid'5000 API

For instance, using curl on a frontend:

Additional information at: API_all_in_one_Tutorial.
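
As a rough illustration of what can be done with the API from Python, the sketch below builds the URL of a site's PDU listing and summarizes a decoded JSON answer. The resource path sites/<site>/pdus and the fields uid, vendor, model and ports are the ones used by the execo_g5k one-liners further down this page. The base URL, the helper names and the sample document are assumptions made for illustration and should be checked against the API tutorial.

```python
import json

# Assumed base URL of the Grid'5000 reference API (not stated on this page).
API_BASE = "https://api.grid5000.fr/stable"


def pdus_url(site):
    """Return the URL of the PDU collection of a site."""
    return "%s/sites/%s/pdus" % (API_BASE, site)


def summarize_pdus(document):
    """Return (uid, vendor, model, number of ports) for each PDU of a decoded listing."""
    return [
        (p.get("uid"), p.get("vendor"), p.get("model"), len(p.get("ports") or {}))
        for p in document.get("items", [])
    ]


# Hypothetical answer: PDU and node names are made up, only the shape matters.
SAMPLE = json.loads("""
{"items": [
  {"uid": "example-pdu1", "vendor": "APC", "model": "AP8659",
   "ports": {"1": "example-1", "2": "example-2"}},
  {"uid": "example-pdu2", "vendor": "Eaton Corporation", "model": null, "ports": {}}
]}
""")

if __name__ == "__main__":
    print(pdus_url("nancy"))
    for row in summarize_pdus(SAMPLE):
        print(row)
```

In real use, the document passed to summarize_pdus would be the JSON body fetched from the URL returned by pdus_url, for example with curl on a frontend.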


Using Kwapi

Kwapi is enabled on clusters where sufficiently accurate PDUs are available. The list of clusters where Kwapi is activated is available at: https://intranet.grid5000.fr/jenkins-status/?job=test_kwapi

You have access to:

  • Live monitoring: view power consumption in real time.
  • Long-term storage: retrieve unaltered raw measurements from past experiments.
  • API: access measurements through a JSON API.

Additional information at: Kwapi.

Using SNMP

SNMP can usually be used to query a PDU directly. Use the Grid'5000 API to retrieve the PDU names and the mapping of nodes to PDU outlets.

For instance:

  • Make an OAR reservation
frontend:
oarsub -I -t deploy -l nodes=1,walltime=2
  • Deploy a reference image (e.g. debian10-x64-nfs)
frontend:
kadeploy3 -e debian10-x64-nfs -f $OAR_NODE_FILE -k
  • Log in to the deployed node as root
frontend:
ssh root@node.nancy.grid5000.fr
  • Update the package index files and install the SNMP packages
node:
apt-get update && apt-get install snmp snmp-mibs-downloader
  • Create the MIBs directory (-p avoids an error if it already exists)
node:
mkdir -p /usr/share/snmp/mibs
  • Move to that directory
node:
cd /usr/share/snmp/mibs
  • Download the SNMP MIB
  • Update the SNMP client configuration
node:
echo "mibs EATON-EPDU-MIB" >> /etc/snmp/snmp.conf
node:
echo "mibs PowerNet-MIB" >> /etc/snmp/snmp.conf
  • Use the snmpget command to retrieve information from a PDU (graphene-pdu[7-9] stands for one of graphene-pdu7 to graphene-pdu9; substitute a single PDU name)
node:
snmpget -v1 -c public graphene-pdu[7-9].nancy.grid5000.fr outletWatts.0.1

The outletWatts entry depends on the PDU model. The entry may have a different name on PDUs from other vendors; see the PDU's MIB.

In this case, outletWatts gives a single value for the active power sensor attached to the outlet. This value is reported in watts.
The index 0.1 corresponds to the first outlet of the PDU graphene-pdu9.nancy.grid5000.fr. According to the mapping of nodes to PDUs, it corresponds to the outlet ID A1.

  • This command gives the active power of each outlet on the ePDU.
node:
snmpwalk -v1 -c public graphene-pdu[7-9].nancy.grid5000.fr outletWatts.0
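
The snmpwalk call above can also be scripted. The following is a minimal sketch, not a reference implementation: it builds the same command line and parses the textual output into a dictionary keyed by outlet number. The output format assumed here ("<MIB>::<entry>.<index> = INTEGER: <value>") is the usual net-snmp rendering once the MIB is loaded; the sample lines and their values are made up.

```python
import re

# Assumed shape of one output line, e.g.
# "EATON-EPDU-MIB::outletWatts.0.1 = INTEGER: 85"
LINE_RE = re.compile(
    r"^(?P<oid>\S+?)\.(?P<outlet>\d+)\s*=\s*[\w-]+:\s*(?P<value>-?\d+)"
)


def snmpwalk_command(pdu_host, entry="outletWatts.0", community="public"):
    """Build the argument list of the snmpwalk call shown above."""
    return ["snmpwalk", "-v1", "-c", community, pdu_host, entry]


def parse_outlet_watts(output):
    """Map the last OID index (the outlet number) to its integer value in watts."""
    readings = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not look like a reading are ignored
            readings[int(match.group("outlet"))] = int(match.group("value"))
    return readings


# Hypothetical output, for illustration only.
SAMPLE_OUTPUT = (
    "EATON-EPDU-MIB::outletWatts.0.1 = INTEGER: 85\n"
    "EATON-EPDU-MIB::outletWatts.0.2 = INTEGER: 0\n"
)

if __name__ == "__main__":
    print(snmpwalk_command("graphene-pdu9.nancy.grid5000.fr"))
    print(parse_outlet_watts(SAMPLE_OUTPUT))
```

On a node where the SNMP tools are installed, the command list can be run with subprocess.run(cmd, capture_output=True, text=True) and its stdout passed to parse_outlet_watts.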

Measurement artifacts and pitfalls

Kwapi handling of nodes with multiple power supplies

Kwapi currently has only partial support for nodes with multiple power supplies. Kwapi has two user APIs for retrieving measurements:

  • The "near real-time" API (on port 5000), which returns the current (last) value of a power probe. For this API, kwapi contains code that can aggregate measurements, but only when the power supplies are connected to the same PDU. In almost all situations, however, multiple power supplies are used for redundancy and are therefore connected to different PDUs, so this code is of no use. As a result, when asked for the measurements of a node, the "near real-time" API returns the value of whichever measurement was probed last. Because measurements are collected by independent processes for each PDU, there is no synchronization, so the last measurement can come from either PDU, depending on which one was polled most recently. This can clearly be seen on the following plot, where the "kwapi api" value jumps randomly between the values of pdu-b3p1 and those of pdu-b3p2-1:

chetemi-15 power plot

  • The "historical" API (on port 12000, also known as the "HDF5 API"), which returns values collected in the past and stored in databases (currently HDF5 files). For this API, multiple probes are correctly aggregated for a node, as can be seen on the "kwapi hdf5" curve of the plot above.
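
The difference between the two behaviours can be reproduced with a toy model. This is not kwapi's code: it only contrasts a "last measurement wins" view with a view that sums the latest known reading of every PDU. The PDU names are those of the plot above; the timestamps and wattages are invented.

```python
def last_value(events):
    """Value seen by a 'last measurement wins' reader after each event.

    events: list of (timestamp, pdu, watts) tuples, sorted by timestamp.
    """
    return [(t, watts) for t, _pdu, watts in events]


def aggregated(events):
    """Sum of the latest known reading of every PDU after each event."""
    latest = {}
    series = []
    for t, pdu, watts in events:
        latest[pdu] = watts  # remember the most recent value of this PDU
        series.append((t, sum(latest.values())))
    return series


# Two unsynchronized pollers, one per PDU (hypothetical values).
EVENTS = [
    (0.0, "pdu-b3p1", 60),
    (0.4, "pdu-b3p2-1", 45),
    (1.1, "pdu-b3p1", 62),
    (1.3, "pdu-b3p2-1", 44),
]

if __name__ == "__main__":
    print(last_value(EVENTS))
    print(aggregated(EVENTS))
```

The first series alternates between the two per-PDU levels, whereas the second one stays close to the node's total consumption once both PDUs have reported.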

The list of nodes with multiple power supplies:

$ python -c "from execo_g5k import * ; import pprint ; pprint.pprint([ n for n in get_g5k_hosts(queues=None) if len(get_host_attributes(n)['sensors'].get('power',{}).get('via',{}).get('pdu',[])) > 1 ])"

Kwapi has been disabled for clusters with multiple power supplies per node.
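
For readers who find the one-liner above hard to follow, here is a spelled-out sketch of the same filter. It works on plain dictionaries rather than on execo_g5k calls, so it can be run anywhere. The nesting sensors, power, via, pdu is taken from the one-liner; the host names and the content of the pdu entries are hypothetical.

```python
def pdu_links(attributes):
    """Return the list stored at sensors/power/via/pdu, or [] if any level is missing."""
    return (
        attributes.get("sensors", {})
        .get("power", {})
        .get("via", {})
        .get("pdu", [])
    )


def hosts_with_multiple_supplies(host_attributes):
    """Return the sorted names of hosts attached to more than one PDU outlet."""
    return sorted(
        host for host, attrs in host_attributes.items() if len(pdu_links(attrs)) > 1
    )


# Hypothetical attributes: only the nesting mirrors the one-liner above.
SAMPLE_HOSTS = {
    "example-1": {"sensors": {"power": {"via": {"pdu": [{"uid": "pdu-a"}]}}}},
    "example-2": {"sensors": {"power": {"via": {"pdu": [{"uid": "pdu-a"}, {"uid": "pdu-b"}]}}}},
    "example-3": {"sensors": {}},
}

if __name__ == "__main__":
    print(hosts_with_multiple_supplies(SAMPLE_HOSTS))
```

With execo_g5k available, the dictionary could be built from get_g5k_hosts and get_host_attributes, as the one-liner does.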

Measurement thresholds on PDUs

APC PDUs have a threshold (around 50 W) below which they report a power consumption of zero, whatever the actual power consumption is. This has been observed, for example, on APC model AP8659 and model AP8653 PDUs, and is likely to occur on all APC PDUs. More information can be found in the APC FAQ. This can of course only be observed when the power drawn is below that threshold, so it does not occur for nodes whose idle consumption is above it. It is therefore more likely to occur on nodes with multiple power supplies, because in such situations the power is shared between the power supplies. One such example is:

chetemi-13 power plot
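
The effect of the threshold on nodes with several power supplies can be illustrated with a small model. It assumes a hard cutoff at exactly 50 W and a load split evenly between the supplies; both are simplifications, since the text above only says "around 50W" and real supplies do not necessarily share the load equally.

```python
THRESHOLD_W = 50.0  # approximate figure quoted above; the real cutoff may differ


def reported_power(actual_watts, threshold=THRESHOLD_W):
    """Value a thresholding PDU outlet would report for a given real load."""
    return 0.0 if actual_watts < threshold else actual_watts


def node_reading(total_watts, supplies=1):
    """Sum of the per-outlet reports, assuming the load is split evenly."""
    share = total_watts / supplies
    return sum(reported_power(share) for _ in range(supplies))


if __name__ == "__main__":
    for total in (80.0, 120.0):
        print(total, node_reading(total, supplies=1), node_reading(total, supplies=2))
```

In this model, a node drawing 80 W is measured correctly through a single outlet, but reads as zero with two supplies, because each outlet only sees 40 W.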

Note that this currently causes the kwapi jenkins test to fail with various messages such as:

ERROR: <node name>: initial value = 0

Note also that this behaviour, in conjunction with the aforementioned issue of kwapi not correctly handling nodes with multiple power supplies, can cause strange effects such as measurements from the kwapi API randomly jumping between a correct value and zero, because the kwapi API jumps between PDUs. This is another cause of the kwapi jenkins test failure, with messages such as:

values for grele-4 (kwapi uid: nancy.grele-4): 215 (idle) -> 0 (busy), ratio=0.0 (SNMP:  -> )
ERROR: ratio is too low

The list of all APC PDUs with attached clusters / nodes:

$ python -c "from execo_g5k import * ; import pprint ; pprint.pprint([ (site, p.get('uid'), p.get('vendor'), p.get('model'), ', '.join(['port %s / %s' % (n,m) for n,m in p.get('ports').items()])) for site in get_g5k_sites() if site not in ['luxembourg', 'nantes', 'sophia'] for p in get_resource_attributes('sites/%s/pdus' % (site,))['items'] if p.get('vendor') == 'APC' ])"

The list of all APC PDUs with attached nodes that have dual power supplies:

$ python -c "from execo_g5k import * ; import pprint ; pprint.pprint([ (site, p.get('uid'), p.get('vendor'), p.get('model'), ', '.join(['port %s / %s' % (n,m) for n,m in p.get('ports').items() if m in get_g5k_hosts(queues=None) and len(get_host_attributes(m)['sensors']['power']['via']['pdu']) > 1 ])) for site in get_g5k_sites() if site not in ['luxembourg', 'nantes', 'sophia'] for p in get_resource_attributes('sites/%s/pdus' % (site,))['items'] if p.get('vendor') == 'APC' ])"

The list of all nodes connected to an APC PDU:

$ python -c "from execo_g5k import * ; import pprint,itertools ; pprint.pprint(set([ n for site in get_g5k_sites() if site not in ['luxembourg', 'nantes', 'sophia'] for p in get_resource_attributes('sites/%s/pdus' % (site,))['items'] if p.get('vendor') == 'APC' for _,n in p.get('ports').items() ]))"

Measurement resolution and smoothing on some PDUs

Eaton PDUs seem to have a time resolution of around 20 seconds and seem to return smoothed values, using something like a weighted moving average over the current and the two previous values. No documentation could be found about this, but it can be guessed from the data returned during a stress test of a connected node:

graphene-105 power plot

This used to cause the jenkins kwapi test to fail with message "ratio is too low", because in the reference API these PDUs were supposed to have a resolution of 1 second. This is now fixed: the resolution of these PDUs has been set to 60 seconds in the Grid5000 Reference API, and the jenkins check has been updated to take this resolution into account. Note that it causes the jenkins check to take much longer on these clusters.
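
To give an idea of how such smoothing distorts a load step, the sketch below applies a weighted moving average over the current and the two previous samples. The weights are purely illustrative, since the actual filter used by the PDUs is unknown; each sample stands for one reading at the observed resolution of about 20 seconds.

```python
def smooth(samples, weights=(0.5, 0.3, 0.2)):
    """Weighted moving average over the current and the two previous samples.

    weights[0] applies to the current sample. The weights are illustrative
    only. Missing history at the start is filled with the first sample.
    """
    out = []
    for i in range(len(samples)):
        window = [samples[max(i - k, 0)] for k in range(len(weights))]
        out.append(sum(w * s for w, s in zip(weights, window)))
    return out


if __name__ == "__main__":
    # Hypothetical stress test: the node jumps from 100 W to 200 W.
    step = [100, 100, 200, 200, 200]
    print(smooth(step))
```

With these weights, the reported value only reaches the new level two samples after the load change, i.e. roughly 40 seconds later at a 20-second resolution. A short idle/busy comparison can therefore see a ratio that is too low.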

The list of Eaton PDUs and their attached nodes:

$ python -c "from execo_g5k import * ; import pprint ; pprint.pprint([ (site, p.get('uid'), p.get('vendor'), p.get('model'), ', '.join(['port %s / %s' % (n,m) for n,m in p.get('ports').items()])) for site in get_g5k_sites() if site not in ['luxembourg', 'nantes', 'sophia'] for p in get_resource_attributes('sites/%s/pdus' % (site,))['items'] if p.get('vendor') == 'Eaton Corporation' ])"

The list of all nodes connected to an Eaton PDU:

$ python -c "from execo_g5k import * ; import pprint,itertools ; pprint.pprint(set([ n for site in get_g5k_sites() if site not in ['luxembourg', 'nantes', 'sophia'] for p in get_resource_attributes('sites/%s/pdus' % (site,))['items'] if p.get('vendor') == 'Eaton Corporation' for _,n in p.get('ports').items() ]))"

Checking and calibration of the measurements

(WIP)

The measurements of an accurate power meter were used as a reference to check the measurements of the wattmetre in Lyon. This power meter is a Zimmer LMG450, inserted between the wattmetre and the nova-19 node:

zimmer setup

The power consumption was monitored at the same time by:

  • the wattmetre, with a measuring frequency of 50Hz
  • kwapi, which gets its data from the wattmetre, but averaged each second (kwapi cannot handle the 50Hz measuring frequency)
  • the Zimmer LMG450, on one experiment with a 20Hz frequency, and on another experiment with a 10Hz frequency. The LMG450 has a high internal sampling frequency, but we can only get data from it at a 20Hz frequency, maximum.
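
When comparing these sources, keep in mind what per-second averaging does to a 50 Hz signal. The sketch below is only an illustration of that averaging, not kwapi's implementation: it groups (timestamp, watts) samples by whole second and returns the mean of each group. The sample values are synthetic.

```python
from collections import defaultdict


def per_second_mean(samples):
    """Average (timestamp_in_seconds, watts) samples over each whole second.

    Timestamps are assumed to be non-negative.
    Returns a list of (second, mean_watts) tuples sorted by second.
    """
    buckets = defaultdict(list)
    for t, watts in samples:
        buckets[int(t)].append(watts)
    return [(sec, sum(vals) / len(vals)) for sec, vals in sorted(buckets.items())]


if __name__ == "__main__":
    # One second of synthetic 50 Hz data with a single 300 W spike.
    spike = [(i / 50.0, 300.0 if i == 10 else 100.0) for i in range(50)]
    print(per_second_mean(spike))
```

A spike lasting a single 50 Hz sample is visible in the raw data but is almost entirely absorbed by the one-second mean.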

(Details about the Zimmer LMG450, the experiment, and the raw data are available if needed; contact the Grid5000 support team.)

This gave the following results.

First experiment (Zimmer at 20Hz)

Power-nova-19.lyon.grid5000.fr-2018-11-21 14-59-04 large.png

Power-nova-19.lyon.grid5000.fr-2018-11-21 14-59-04 zoom1.png

Power-nova-19.lyon.grid5000.fr-2018-11-21 14-59-04 zoom2.png

Second experiment (Zimmer at 10Hz)

Power-nova-19.lyon.grid5000.fr-2019-01-24 16-06-05 large.png

Power-nova-19.lyon.grid5000.fr-2019-01-24 16-06-05 zoom1.png

Power-nova-19.lyon.grid5000.fr-2019-01-24 16-06-05 zoom2.png