Energy consumption monitoring tutorial

From Grid5000

Revision as of 15:19, 28 March 2018

Introduction

This tutorial shows how to monitor energy consumption on Grid'5000. The electrical power consumption of nodes can be retrieved from their Power Distribution Units (PDUs), the devices that supply them with electrical power.

On the Lyon site, special devices (called "wattmeters") allow fine-grained measurements (one measure each second, with sub-watt resolution). While less precise, many clusters from other sites also have monitoring capabilities.

Grid'5000 also uses the Kwapi tool to provide a convenient and consistent way to monitor energy consumption in experiments.

In this tutorial, you will learn how to retrieve the energy consumed by Grid'5000 nodes by querying their PDUs or by using Kwapi. The power consumption will be studied under various workload scenarios and combinations of CPU energy-saving parameters (P-State, C-State, etc.).

This tutorial requires a basic knowledge of Grid'5000 usage (i.e. having completed the Getting Started tutorial).


Retrieving energy consumption data

By querying Power Distribution Units devices

On Lyon site

The Lyon site provides dedicated devices, called "wattmeters", to monitor energy consumption (more information here). At Lyon, each node is supplied with electricity by a single plug coming from one of these devices.

Each node is monitored individually by wattmeters which provide the average electrical power consumed each second, with a precision of 0.01 watt.

From inside Grid'5000 network, wattmeters can be queried in real-time to retrieve current monitored values at URL http://wattmetre.lyon.grid5000.fr/GetWatts-json.php

For instance, from a frontend, use:

$ curl http://wattmetre.lyon.grid5000.fr/GetWatts-json.php | json_pp | less


Note: In this command, we used curl to perform the HTTP request, json_pp to format the returned JSON text and less to ease reading the output. Of course you can use any of your preferred tools instead. In the following examples we will only provide URLs and won't mention those tools anymore, but you can keep using them that way. It will also be assumed that commands must be run from inside Grid'5000 (on frontends or nodes).


On other sites

Some clusters are supplied by PDUs which allow per-plug electrical consumption monitoring. These are documented in the reference API.

The monitoring device used for a specific node is listed in the node's description. For instance, for Lille's chifflet-1, it is at URL:

https://api.grid5000.fr/stable/sites/lille/clusters/chifflet/nodes

In "sensors" entry of the returned JSON:

"sensors": {
  "power": {
    "available": true,
    "via": {
      "api": {
        "metric": "power"
      },
      "pdu": [
        {
          "port": 10,
          "uid": "pdu-b3p1"
        },
        {
          "port": 10,
          "uid": "pdu-b3p2-1"
        }
      ]
    }
  }
},


This means that chifflet-1 has two power supply units, one connected to port 10 of the PDU called "pdu-b3p1", the other to port 10 of "pdu-b3p2-1".
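As a minimal sketch, the PDU plugs feeding a node can be extracted from its description with a few lines of Python. The function name pdu_ports is ours (it is not part of any Grid'5000 library), and the sample dictionary simply reproduces the "sensors" entry shown above; fetching the description from the API is left out.

```python
def pdu_ports(node_description):
    """Return (pdu_uid, port) pairs for a node, or [] if no per-plug PDU is listed."""
    power = node_description.get("sensors", {}).get("power", {})
    pdus = power.get("via", {}).get("pdu", [])
    return [(pdu["uid"], pdu["port"]) for pdu in pdus]


# "sensors" entry of chifflet-1, copied from the JSON shown above
chifflet_1 = {
    "sensors": {
        "power": {
            "available": True,
            "via": {
                "api": {"metric": "power"},
                "pdu": [
                    {"port": 10, "uid": "pdu-b3p1"},
                    {"port": 10, "uid": "pdu-b3p2-1"},
                ],
            },
        }
    }
}

print(pdu_ports(chifflet_1))  # [('pdu-b3p1', 10), ('pdu-b3p2-1', 10)]
```

A node with one entry in the list has a single monitored power supply; an empty list means no per-plug PDU is declared for it.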

For nodes which don't have monitoring of their power consumption, the "sensors" part of the JSON is empty. Some clusters only have "grouped" monitoring capabilities, meaning that energy consumption values are only available for groups of nodes and not for individual nodes. Such nodes have a wattmeter: shared entry in their API description (as of March 2018, only hercule, parapide, griffon and some graphene nodes are concerned).

Now that we know which PDUs and ports are used to supply a node, how do we get the power consumed? The way to build the appropriate request is also documented in the reference API, in the entry dedicated to the PDU. For instance, for the "pdu-b3p1" PDU used by chifflet-1, the URL is:

https://api.grid5000.fr/stable/sites/lille/pdus/pdu-b3p1

This returns a JSON document containing various information about the PDU, including:

 "sensors": [
   {
     "power": {
       "per_outlets": true,
       "resolution": 1,
       "snmp": {
         "available": true,
         "outlet_prefix_oid": "iso.3.6.1.4.1.318.1.1.26.9.4.3.1.7",
         "total_oids": [
           "iso.3.6.1.4.1.318.1.1.12.1.16.0"
         ],
         "unit": "W"
       }
     }
   }
 ],


The power consumption is exposed over SNMP at the OID given in the "outlet_prefix_oid" field. This OID is a prefix, to which the number of the PDU port to monitor must be appended. For instance, we have seen that chifflet-1 is connected to PDU "pdu-b3p1" on port number 10. So the corresponding OID is:

iso.3.6.1.4.1.318.1.1.26.9.4.3.1.7.10

We are now able to fetch the power consumption by using an SNMP request:

snmpget -v2c -c public pdu-b3p1.lille.grid5000.fr iso.3.6.1.4.1.318.1.1.26.9.4.3.1.7.10

Remember that chifflet-1 has two power supply units, thus its total power consumption is the sum of the power delivered by both PDU plugs it uses.

The power consumption value of the second PSU of chifflet-1 is available using the following SNMP request:

snmpget -v2c -c public pdu-b3p2-1.lille.grid5000.fr iso.3.6.1.4.1.318.1.1.26.9.4.3.1.7.10
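
The two requests above can also be generated programmatically. The sketch below only builds the snmpget command lines and parses a reply line. The helper names are ours, and the parser assumes the usual net-snmp reply format "<OID> = INTEGER: <value>", which should be checked against the actual output of your PDU. It also assumes, as in the commands above, that both PDUs use the same OID prefix.

```python
import re


def snmpget_command(pdu_uid, site, outlet_prefix_oid, port, community="public"):
    """Build the snmpget argument list for one PDU outlet."""
    host = "%s.%s.grid5000.fr" % (pdu_uid, site)
    oid = "%s.%d" % (outlet_prefix_oid, port)  # prefix + port number
    return ["snmpget", "-v2c", "-c", community, host, oid]


def parse_snmp_value(reply):
    """Extract the integer value from a reply line.

    Assumed format (to be verified): '<OID> = INTEGER: <n>'.
    """
    match = re.search(r"=\s*\w+:\s*(-?\d+)", reply)
    if match is None:
        raise ValueError("unexpected SNMP reply: %r" % reply)
    return int(match.group(1))


PREFIX = "iso.3.6.1.4.1.318.1.1.26.9.4.3.1.7"
commands = [
    snmpget_command(uid, "lille", PREFIX, port)
    for uid, port in [("pdu-b3p1", 10), ("pdu-b3p2-1", 10)]
]
for command in commands:
    print(" ".join(command))
```

Each command can then be executed with subprocess.run(command, capture_output=True, text=True); the total power of chifflet-1 is the sum of parse_snmp_value applied to both replies.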

Using Kwapi

Warning

Kwapi is known to have performance and accuracy problems with power measurements. See bug #7815 and its Jenkins report. As of March 2018, it may be considered reliable on Lyon only and values retrieved from other sites should be carefully verified.

Kwapi is a tool dedicated to electrical power consumption and network traffic monitoring. On Grid'5000, it permanently collects this information (using the same HTTP and SNMP requests presented above) on every node and stores it in long-term storage (one year of data is kept). Collected metrics are exposed to users through several interfaces:

  • Grid'5000 API
  • Kwapi internal API
  • Web interface (for example at Lille)

The Web interface provides a "live" view of the energy being consumed by a node or by a group of nodes inside an OAR reservation. However, for experimental purposes, it may be more useful to access the raw values available through the APIs.

The Grid'5000 API is particularly suited to retrieving measurements performed in the past. For instance, to get the power consumed by nodes "nova-1" and "nova-2" at Lyon, between 10:35 and 10:40 on March 21, use the URL:

https://api.grid5000.fr/stable/sites/lyon/metrics/power/timeseries?resolution=1&only=nova-1,nova-2&from=1521624864&to=1521625164

(beware: if you use this URL on a command line, quote it to prevent '&' from being interpreted as the shell operator that puts the command in the background)

(values 1521624864 and 1521625164 are the Unix timestamps for March 21, 10:35 and March 21, 10:40)
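
A minimal sketch of how such a URL and its timestamps can be built in Python. The helper name timeseries_url is ours, and the fixed UTC+1 offset is an assumption standing in for French local time in March (CET); adapt it to the period you are interested in.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

CET = timezone(timedelta(hours=1))  # assumption: French winter time, UTC+1


def timeseries_url(site, nodes, start, stop, metric="power", resolution=1):
    """Build the Grid'5000 API timeseries URL for a list of nodes."""
    query = urlencode(
        {
            "resolution": resolution,
            "only": ",".join(nodes),
            "from": int(start),
            "to": int(stop),
        },
        safe=",",  # keep the comma between node names unescaped
    )
    return "https://api.grid5000.fr/stable/sites/%s/metrics/%s/timeseries?%s" % (
        site, metric, query)


# Unix timestamps for 10:35 and 10:40 on March 21, 2018 (local time, see CET above)
start = datetime(2018, 3, 21, 10, 35, tzinfo=CET).timestamp()
stop = datetime(2018, 3, 21, 10, 40, tzinfo=CET).timestamp()
print(timeseries_url("lyon", ["nova-1", "nova-2"], start, stop))
```

The timestamps computed this way fall exactly on the minute, so they differ by less than a minute from the values quoted above. Building the URL in a program also avoids the shell quoting issue with '&'.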

The Kwapi internal API is more appropriate if you need "instantaneous" values of the power currently consumed. On a particular site, all values collected by Kwapi are available at URL http://kwapi.<SITE>.grid5000.fr:5000/probes/. For instance, at Lyon it is:

http://kwapi.lyon:5000/probes/

It returns, for each available metric, the list of available probes. Note that Kwapi stores not only power measurements but also network measurements. For example, to get the power metric for lyon.nova-23:

http://kwapi.lyon:5000/probes/lyon.nova-23/power/

Power consumption under different workloads

In the previous section, we have learned how to retrieve energy consumption information: find on which nodes it is available, build requests to get consumption from PDU devices, use Kwapi to get the data.

In this part, we will illustrate these monitoring features with an example scenario: we will show how energy consumption evolves under different workloads, and the impact of various energy-related CPU parameters.

Preliminary remarks

  • In the examples given in this part, we will use the Kwapi interface exposed in the Grid'5000 API. As stated earlier, Kwapi is currently only reliable on the Lyon site. So if you follow our implementation example, we encourage you to use the Lyon site with a recent cluster such as nova.
  • In this scenario, you need to reserve one node and install some additional tools on it. As you will need root privileges, you can use sudo-g5k to get sudo rights, or use kadeploy to deploy your own environment. Then, you can install the required tools with the following command:
apt update && apt install linux-cpupower sysbench
  • The solutions are given in Python 3 and can easily be copied and pasted into the ipython3 interpreter.

Workload examples

We will consider 3 different workloads:

  1. Idle: Nothing is running on the machine.
  2. CPU intensive, mono-threaded: The machine runs a CPU-intensive application on one of its cores. We will use the "sysbench" benchmarking tool to mimic this workload, invoked with:
sysbench --test=cpu --cpu-max-prime=50000 --num-threads=1 run
  3. CPU intensive, multi-threaded: The machine runs a CPU-intensive application on all of its cores. We will also use "sysbench", invoked with:
NUM_THREADS=$(getconf _NPROCESSORS_ONLN)
sysbench --test=cpu --cpu-max-prime=50000 --num-threads=$NUM_THREADS run

($NUM_THREADS is the number of threads to run; we use the number of cores available on the node)

Impact of CPU parameters

Several CPU parameters are available to lower the energy consumed under certain workloads. In particular:

  • C-States configuration is the ability for processors and cores to go to energy saver "sleep states" when not being used.
  • P-States policy dynamically adjusts voltage and frequency of cores to fit workload
  • Turboboost allows cores to run at higher frequency while they stay under temperature specification limits.

In this example scenario, we will investigate two different C-States configurations: partially enabled (the deepest authorized sleep state is C1, which is the default on Grid'5000) and fully enabled (all sleep states are allowed; the deepest sleep state on modern machines is usually C6). To change the deepest allowed sleep state, we will use the cpupower command. For instance, to allow all available sleep states, use:

cpupower idle-set -E

To disable the sleep states that take 20 microseconds or more to wake up from (i.e. disable C-States deeper than C1):

cpupower idle-set -D 20

We will also study the impact of turboboost by enabling (which is the default on Grid'5000) or disabling it. To disable turboboost, the following command must be used:

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo


Scenario implementation

We propose to study the following metrics:

  • Average electrical power required to run the workload
  • Time needed to run the CPU workload
  • The ops per watt value, i.e. the average number of operations per second and per watt, a metric reflecting the "energy efficiency" of machines

The average electrical power required to run the workload is the amount of electrical energy spent during its execution divided by the execution time. Its value can be approximated as the average of the power values which have been monitored during execution.
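
To make these definitions concrete, here is a small worked example. The power samples, the duration and the operation count are made-up values, not measurements, and the function names are ours. It assumes evenly spaced power samples covering the whole run.

```python
from statistics import mean


def average_power(samples_w):
    """Approximate the average power (W) as the mean of the monitored values."""
    return mean(samples_w)


def energy_joules(samples_w, duration_s):
    """Energy (J) = average power (W) x execution time (s)."""
    return average_power(samples_w) * duration_s


def ops_per_watt(total_ops, duration_s, avg_power_w):
    """Operations per second and per watt (equivalently, operations per joule)."""
    return (total_ops / duration_s) / avg_power_w


samples = [100.0, 110.0, 120.0, 110.0]  # hypothetical readings, in watts
duration = 4.0                          # hypothetical run time, in seconds
total_ops = 8800                        # hypothetical number of operations

avg = average_power(samples)
print(avg)                                      # 110.0
print(energy_joules(samples, duration))         # 440.0
print(ops_per_watt(total_ops, duration, avg))   # 20.0
```

Since power multiplied by time is energy, the ops per watt value is the same as the number of operations divided by the energy spent, which is why it reflects energy efficiency.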

Using your favorite programming language, write a function that queries the Grid'5000 API to return the average power used by a Grid'5000 node between two dates (as Unix timestamps).


Solution (in Python)

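import requests
# you may need to install requests for python3 with: sudo-g5k apt install python3-requests
from statistics import mean

def get_power(node, site, start, stop):
    """Return the average power (in watts) used by a node between two Unix timestamps."""
    url = "https://api.grid5000.fr/stable/sites/%s/metrics/power/timeseries?resolution=1&only=%s&from=%s&to=%s" \
            % (site, node, int(start), int(stop))
    # verify=False skips TLS certificate verification
    data = requests.get(url, verify=False).json()
    # 'items' holds one entry per requested node; only one node was requested
    return mean(data['items'][0]['values'])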


Idle workload

First, we are going to investigate how C-States influence the energy consumed when the machine is idle.

Turn off C-States and leave the machine idle. What is the average power consumed over a ten-second window? Turn on C-States and repeat. How many watts have been saved by C-States?


Solution

from os import system
from time import sleep, time

# Turn off C-States
system("sudo cpupower idle-set -D0")
sleep(20)
power_cstate_off = get_power("nova-1", "lyon", time()-20, time()-10)

# Turn on C-States
system("sudo cpupower idle-set -E")
sleep(20)
power_cstate_on = get_power("nova-1", "lyon", time()-20, time()-10)

print(power_cstate_off - power_cstate_on)


CPU intensive, mono-threaded, workload

We are now going to run a CPU-intensive workload and see how CPU parameters influence the average power consumption as well as the time spent executing the workload.

For instance, turn off C-States and Turboboost, measure the workload runtime, and then get the average power consumed. Repeat with C-States turned on, with and without Turboboost. Which combination consumes the least power? Which one runs fastest? Which one has the best ops/watt ratio?


Solution

from os import system
from time import sleep, time

# Turn off C-States and Turboboost
system("sudo cpupower idle-set -D0")
system("echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo")

# Run workload
start = time()
system("sysbench --test=cpu --cpu-max-prime=20000 run")
stop = time()

# Get results
sleep(5)
power = get_power("nova-6", "lyon", start, stop)
result_1 = "C-States OFF, Turbo OFF, Duration: %f, Power: %f" % (stop-start, power)


# Turn on C-States
system("sudo cpupower idle-set -E")

# Run workload
start = time()
system("sysbench --test=cpu --cpu-max-prime=20000 run")
stop = time()

# Get results
sleep(5)
power = get_power("nova-6", "lyon", start, stop)
result_2 = "C-States ON, Turbo OFF, Duration: %f, Power: %f" % (stop-start, power)


# Turn on Turboboost
system("echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo")

# Run workload
start = time()
system("sysbench --test=cpu --cpu-max-prime=20000 run")
stop = time()

# Get results
sleep(5)
power = get_power("nova-6", "lyon", start, stop)
result_3 = "C-States ON, Turbo ON, Duration: %f, Power: %f" % (stop-start, power)

# Print results
print(result_1)
print(result_2)
print(result_3)


CPU intensive, multi-threaded, workload

We are now going to repeat the same experiment with a multi-threaded workload, running on every core of the machine. Run the workload with and without C-States and Turboboost activated and observe the runtime and the power consumed. What can you say about the influence of CPU parameters on a multi-threaded, CPU-intensive workload? Is running multi-threaded more energy efficient?


Solution

from os import system
from time import sleep, time
import requests

# Get core count
core_count = requests.get(
                 "https://api.grid5000.fr/stable/sites/lyon/clusters/nova/nodes/nova-1",
                 verify=False
                 ).json()['architecture']['nb_cores']

# Turn off C-States and Turboboost
system("sudo cpupower idle-set -D0")
system("echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo")

# Run workload
start = time()
system("sysbench --test=cpu --cpu-max-prime=50000 --num-threads=%s run" % core_count)
stop = time()

# Get results
sleep(5)
power = get_power("nova-6", "lyon", start, stop)
result_1 = "C-States OFF, Turbo OFF, Duration: %f, Power: %f" % (stop-start, power)


# Turn on C-States
system("sudo cpupower idle-set -E")

# Run workload
start = time()
system("sysbench --test=cpu --cpu-max-prime=50000 --num-threads=%s run" % core_count)
stop = time()

# Get results
sleep(5)
power = get_power("nova-6", "lyon", start, stop)
result_2 = "C-States ON, Turbo OFF, Duration: %f, Power: %f" % (stop-start, power)


# Turn on Turboboost
system("echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo")

# Run workload
start = time()
system("sysbench --test=cpu --cpu-max-prime=50000 --num-threads=%s run" % core_count)
stop = time()

# Get results
sleep(5)
power = get_power("nova-6", "lyon", start, stop)
result_3 = "C-States ON, Turbo ON, Duration: %f, Power: %f" % (stop-start, power)

# Print results
print(result_1)
print(result_2)
print(result_3)
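
The three measurement blocks in the solutions above share the same structure, so they can be factored into one helper. This is only a sketch: measure, format_result and their parameters are our own names, and the power lookup is passed in as a function so that the get_power function written earlier can be plugged in. The stand-in workload and power values below are placeholders that let the sketch run anywhere.

```python
from time import sleep, time


def measure(label, run_workload, power_fn, settle_s=5.0):
    """Time a workload, wait for the last samples to be stored, then fetch the power."""
    start = time()
    run_workload()
    stop = time()
    sleep(settle_s)  # same pause as the sleep(5) used in the solutions above
    return {"label": label, "duration": stop - start, "power": power_fn(start, stop)}


def format_result(result):
    """Format a result the same way as the solutions above."""
    return "%s, Duration: %f, Power: %f" % (
        result["label"], result["duration"], result["power"])


# Placeholders so that the sketch runs without a Grid'5000 node. On a real node,
# use for instance:
#   run_workload = lambda: system("sysbench --test=cpu --cpu-max-prime=50000 run")
#   power_fn     = lambda start, stop: get_power("nova-6", "lyon", start, stop)
result = measure(
    "C-States OFF, Turbo OFF",
    run_workload=lambda: sleep(0.1),      # fake workload
    power_fn=lambda start, stop: 120.0,   # fake power value, in watts
    settle_s=0,
)
print(format_result(result))
```

Calling measure once per CPU configuration, after applying the corresponding cpupower and no_turbo settings, collects the results in a uniform way and makes it easier to add further configurations.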


Going further

  • Another tutorial about measurements (not specific to energy) is available: Measurements_tutorial
  • The various monitoring devices used in Grid'5000 are presented in this page: Power_Measurement
  • More details about Kwapi's monitoring capabilities are available at: Monitoring
  • More information about modifying CPU parameters on Grid'5000: CPU_parameters