API Metrology Practical

From Grid5000

Presentation

This practical explains how to use the Metrology API to fetch and graph node metrics. The Grid'5000 infrastructure uses two backends to record metrics: Ganglia for general-purpose metrics and Kwapi for network and energy monitoring. A predefined set of metrics is recorded for each node, for instance memory consumption, CPU consumption, bytes in, bytes out, etc. Perhaps more interesting, the existing Ganglia infrastructure can also be used to register your own metrics: if you want to regularly record the number of requests per second for your HTTP server, or any other kind of experiment-specific metric, this practical will show you how.

This practical is aimed at people who are already comfortable with the Grid'5000 APIs and want to fetch, register, and graph node metrics remotely and programmatically.

Prerequisites

  • It is highly recommended that you complete the Measurements tutorial first, most notably the cURL tutorial, so that you have your configuration in place.
  • A basic knowledge of the RRD format and the rrdtool package will help.

Getting Started

Like all Grid'5000 APIs, the Metrology API is available over HTTP. Metrics are registered per site, so the URI for a site's metrics has the following form (replace site with the name of the site, for example rennes):

 https://api.grid5000.fr/2.0/grid5000/sites/site/metrics

Note that you can also find this link by navigating from the API entry point (https://api.grid5000.fr/2.0/grid5000), and following the links.

Let's see what we can get from the rennes site:

 curl -kn https://api.grid5000.fr/2.0/grid5000/sites/rennes/metrics

You should get back a large, poorly formatted payload. This is where cURL shows its limits, so let's use a tool that abstracts the HTTP requests and deserializes the responses for us. If you have taken the API Main Practical, you should already know the Restfully tool. If not, follow the first steps of the Restfully tutorial to get your configuration set up.

So, let's start the Restfully tool in interactive mode:

 restfully -c ~/.restfully/api.grid5000.fr.yml

=>

 Restfully/0.6.1 - The root resource is available in the 'root' variable.
 ruby-1.8.7-p249> 

Let's fetch the rennes metrics again:

 pp root.sites[:rennes].metrics

You should get back a much more digestible list of metrics for the rennes site:

 #<Restfully::Collection:0x8133d598
   @uri=#<URI::HTTPS:0x10267aae0 URL:https://api.grid5000.fr/2.0/grid5000/sites/rennes/metrics>
   LINKS
     @parent=#<Restfully::Resource:0x813110ec>
   PROPERTIES
     "total"=>40,
     "offset"=>0
   ITEMS (0..40)/40
     #<Restfully::Resource:0x81353b04 uid="cpu_idle">,
     #<Restfully::Resource:0x81351fac uid="mem_free">,
     #<Restfully::Resource:0x813505bc uid="mem_cached">,
     #<Restfully::Resource:0x8134ea8c uid="custom_metric_crohr">,
     #<Restfully::Resource:0x8134cfd4 uid="ib_bytes_in">,
     #<Restfully::Resource:0x8134b51c uid="load_one">,
     #<Restfully::Resource:0x813499ec uid="boottime">,
     #<Restfully::Resource:0x81347fc0 uid="pkts_out">,
     #<Restfully::Resource:0x813464b8 uid="swap_free">,
     #<Restfully::Resource:0x813449b0 uid="mem_buffers">,
     #<Restfully::Resource:0x81342f70 uid="proc_run">,
     #<Restfully::Resource:0x81341418 uid="disk_total">,
     #<Restfully::Resource:0x8133f974 uid="pkts_in">,
     #<Restfully::Resource:0x8133dea8 uid="mtu">,
     #<Restfully::Resource:0x8133c3dc uid="ib_pkts_out">,
     #<Restfully::Resource:0x8133a9d8 uid="mem_shared">,
     #<Restfully::Resource:0x81338ea8 uid="cpu_num">,
     #<Restfully::Resource:0x81337418 uid="load_five">,
     #<Restfully::Resource:0x81335960 uid="part_max_used">,
     #<Restfully::Resource:0x81333e08 uid="ambient_temp">,
     #<Restfully::Resource:0x81331ef0 uid="cpu_speed">,
     #<Restfully::Resource:0x8132ffd8 uid="cpu_wio">,
     #<Restfully::Resource:0x8132e458 uid="cpu_system">,
     #<Restfully::Resource:0x8132c9c8 uid="cpu_user">,
     #<Restfully::Resource:0x8132af4c uid="swap_total">,
     #<Restfully::Resource:0x813294d0 uid="sys_clock">,
     #<Restfully::Resource:0x81327a40 uid="my_metric">,
     #<Restfully::Resource:0x81325fd8 uid="custom_metric_pmorillo">,
     #<Restfully::Resource:0x81324548 uid="dummy_metric">,
     #<Restfully::Resource:0x81322ab8 uid="disk_free">,
     #<Restfully::Resource:0x8132103c uid="bytes_in">,
     #<Restfully::Resource:0x8131f5c0 uid="ib_pkts_in">,
     #<Restfully::Resource:0x8131db30 uid="bytes_out">,
     #<Restfully::Resource:0x8131c0a0 uid="load_fifteen">,
     #<Restfully::Resource:0x8131a638 uid="mem_total">,
     #<Restfully::Resource:0x81318ba8 uid="cpu_aidle">,
     #<Restfully::Resource:0x81317000 uid="ib_bytes_out">,
     #<Restfully::Resource:0x81315818 uid="proc_total">,
     #<Restfully::Resource:0x81314030 uid="cpu_nice">,
     #<Restfully::Resource:0x81312848 uid="custom_metric_crohr_2">>
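
The same list of uids can also be extracted from the raw JSON payload without Restfully. The following is a minimal sketch only: it assumes the collection is serialized with an items array whose entries carry a uid key (mirroring the ITEMS and uid shown above), and the sample payload is hand-written for illustration, not a real API response. The names SAMPLE_PAYLOAD and metric_uids are hypothetical.

```ruby
require "json"

# Hand-written sample shaped like the collection printed above.
# The "items" key name is an assumption about the JSON serialization.
SAMPLE_PAYLOAD = '{
  "total": 3,
  "offset": 0,
  "items": [
    {"uid": "cpu_idle"},
    {"uid": "mem_free"},
    {"uid": "bytes_in"}
  ]
}'

# Return the sorted list of metric uids found in a metrics collection payload.
def metric_uids(json_text)
  collection = JSON.parse(json_text)
  collection.fetch("items", []).map { |item| item["uid"] }.sort
end

puts metric_uids(SAMPLE_PAYLOAD)
```

In practice the text passed to metric_uids would be the body returned by the curl command shown earlier.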

You can request a specific metric as follows:

 pp root.sites[:rennes].metrics[:cpu_idle]

Result:

 #<Restfully::Resource:0x81353b04 uid="cpu_idle"
   @uri=#<URI::HTTPS:0x1026a72c0 URL:https://api.grid5000.fr/2.0/grid5000/sites/rennes/metrics/cpu_idle>
   LINKS
     @parent=#<Restfully::Resource:0x81352768>,
     @timeseries=#<Restfully::Collection:0x81352de4>
   PROPERTIES
     "step"=>15,
     "available_on"=>["paraquad-13.rennes.grid5000.fr",
      "paraquad-39.rennes.grid5000.fr",
      ...
      "parapide-13.rennes.grid5000.fr",
      "parapluie-25.rennes.grid5000.fr"],
     "uid"=>"cpu_idle",
     "type"=>"metric",
     "timeseries"=>[{"xff"=>0.5,
       "cf"=>"AVERAGE",
       "rows"=>244,
       "pdp_per_row"=>1},
      {"xff"=>0.5, "cf"=>"AVERAGE", "rows"=>244, "pdp_per_row"=>24},
      {"xff"=>0.5, "cf"=>"AVERAGE", "rows"=>244, "pdp_per_row"=>168},
      {"xff"=>0.5, "cf"=>"AVERAGE", "rows"=>244, "pdp_per_row"=>672},
      {"xff"=>0.5, "cf"=>"AVERAGE", "rows"=>374, "pdp_per_row"=>5760}]>

As you can see, a metric has a few properties:

  • uid: the name of the metric;
  • available_on: the list of nodes on which this metric is available;
  • step: the number of seconds between two successive acquisitions of a value;
  • timeseries: the list of timeseries that are registered for this metric (at different resolutions). Each entry has the following attributes:
    • pdp_per_row: the number of Primary Data Points (PDPs) that compose a row. A value of 1 means that a new row is added every step seconds. A value of 10 means that a new row is added after 10 new values have been registered, and the value of the row is obtained by applying the cf function to those 10 values;
    • cf: the consolidation function of the archive. There are several consolidation functions that consolidate primary data points via an aggregate function: AVERAGE, MIN, MAX, LAST;
    • xff: the xfiles factor defines what part of a consolidation interval may be made up from *UNKNOWN* (null) data while the consolidated value is still regarded as known. It is given as the ratio of allowed *UNKNOWN* PDPs to the number of PDPs in the interval. Thus, it ranges from 0 to 1 (exclusive).
    • rows: the number of rows in the timeseries.
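
To make the interplay of pdp_per_row, cf and xff concrete, here is a simplified model of how one row could be consolidated from its PDPs. This is a sketch written for this tutorial, not rrdtool's actual implementation: the function name consolidate is hypothetical, nil stands for an *UNKNOWN* PDP, and the rule "unknown when the unknown ratio exceeds xff" is our reading of the definition above.

```ruby
# Consolidate one row from its primary data points (PDPs).
# nil stands for an *UNKNOWN* PDP. Simplified model, not rrdtool's code.
def consolidate(pdps, cf: "AVERAGE", xff: 0.5)
  known = pdps.compact
  unknown_ratio = (pdps.size - known.size).to_f / pdps.size
  # Too many unknown PDPs (or nothing known at all): the row is unknown too.
  return nil if known.empty? || unknown_ratio > xff

  case cf
  when "AVERAGE" then known.sum / known.size.to_f
  when "MIN"     then known.min
  when "MAX"     then known.max
  when "LAST"    then known.last
  else raise ArgumentError, "unsupported consolidation function: #{cf}"
  end
end

p consolidate([98.0, 100.0, nil, 99.0]) # 1 of 4 unknown: within xff, averaged
p consolidate([nil, nil, nil, 99.0])    # 3 of 4 unknown: exceeds xff, nil
p consolidate([1, 5, 3], cf: "MAX")
```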

Another property that can be computed from the pdp_per_row and step properties is the resolution of the timeseries:

  resolution = pdp_per_row*step

More information can be found at http://oss.oetiker.ch/rrdtool/doc/rrdcreate.en.html.
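
As a worked example, the resolution formula can be applied to the cpu_idle archives listed above. The sketch below also derives how long each archive keeps data, taken as rows × resolution; the constant and function names are hypothetical and chosen for this illustration only.

```ruby
STEP = 15 # seconds, the "step" property of cpu_idle

# The "timeseries" property of cpu_idle, copied from the output above.
CPU_IDLE_ARCHIVES = [
  { "rows" => 244, "pdp_per_row" => 1 },
  { "rows" => 244, "pdp_per_row" => 24 },
  { "rows" => 244, "pdp_per_row" => 168 },
  { "rows" => 244, "pdp_per_row" => 672 },
  { "rows" => 374, "pdp_per_row" => 5760 }
]

# Number of seconds represented by one row of the archive.
def resolution(archive, step)
  archive["pdp_per_row"] * step
end

# Time span (in seconds) covered by the archive before rows are overwritten.
def retention(archive, step)
  archive["rows"] * resolution(archive, step)
end

CPU_IDLE_ARCHIVES.each do |archive|
  res = resolution(archive, STEP)
  span = retention(archive, STEP)
  puts format("resolution %6d s, retention %9d s (%.1f hours)", res, span, span / 3600.0)
end
```

With these figures, 15-second data for cpu_idle is kept for 3660 seconds (about one hour), while the 360-second archive covers roughly one day.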

Note

The timeseries always keep the same number of rows because they are stored in a Round Robin Database (RRD), which guarantees that the database never grows beyond a fixed size. It also means that old values are constantly overwritten to make room for new ones: the finer the resolution of a timeseries (the smaller its value in seconds), the sooner its values are replaced. So do not wait too long before fetching your timeseries, or you will lose resolution (or may even lose all the data for the period that interests you). The finest resolution (that is, the acquisition period) in Grid'5000 is 15 seconds.
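
To get a feel for how quickly resolution is lost, the sketch below estimates the finest resolution still available for data of a given age, using the cpu_idle archives shown earlier. It is an approximation under stated assumptions: each archive is taken to hold rows × resolution seconds of history, row alignment is ignored, and the names RESOLUTION_AND_ROWS and finest_resolution_for are hypothetical.

```ruby
# [resolution in seconds, number of rows] for each cpu_idle archive
# (resolution = pdp_per_row * step, with step = 15).
RESOLUTION_AND_ROWS = [[15, 244], [360, 244], [2520, 244], [10080, 244], [86400, 374]]

# Finest resolution still holding data that is age_seconds old, or nil
# if that data has been overwritten in every archive.
def finest_resolution_for(age_seconds, archives = RESOLUTION_AND_ROWS)
  candidates = archives.select { |res, rows| res * rows >= age_seconds }
  candidates.map(&:first).min
end

p finest_resolution_for(30 * 60)      # data from half an hour ago
p finest_resolution_for(2 * 3600)     # data from two hours ago
p finest_resolution_for(3 * 86_400)   # data from three days ago
```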

To request the timeseries for a metric, just dereference the timeseries link:

 pp root.sites[:rennes].metrics[:cpu_idle].timeseries

Result:

 #<Restfully::Collection:0x81352de4
   @uri=#<URI::HTTPS:0x1026a5b78 URL:https://api.grid5000.fr/2.0/grid5000/sites/rennes/metrics/cpu_idle/timeseries>
   LINKS
     @parent=#<Restfully::Resource:0x811adf20>
   PROPERTIES
     "total"=>263,
     "offset"=>0
   ITEMS (0..263)/263
     #<Restfully::Resource:0x81322518 uid="paraquad-13">,
     #<Restfully::Resource:0x8131f8cc uid="paraquad-39">,
     #<Restfully::Resource:0x8131d338 uid="paramount-26">,
     ...
     #<Restfully::Resource:0x811b165c uid="parapide-7">,
     #<Restfully::Resource:0x811b043c uid="parapide-13">,
     #<Restfully::Resource:0x811af21c uid="parapluie-25">>
  => nil 
 

You can see that the timeseries for a metric are indexed by node, so let's choose parapluie-25, for example:

 pp root.sites[:rennes].metrics[:cpu_idle].timeseries[:'parapluie-25']

Result:

 #<Restfully::Resource:0x811af21c uid="parapluie-25"
   @uri=#<URI::HTTPS:0x10235e118 URL:https://api.grid5000.fr/2.0/grid5000/sites/rennes/metrics/cpu_idle/timeseries/parapluie-25>
   LINKS
     @parent=#<Restfully::Resource:0x811ae63c>
   PROPERTIES
     "resolution"=>360,
     "from"=>1303087320,
     "metric_uid"=>"cpu_idle",
     "to"=>1303130880,
     "uid"=>"parapluie-25",
     "type"=>"timeseries",
     "values"=>[nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      nil,
      99.0030769230769,
      100.0,
      99.9527777777778,
      99.9155555555556,
      100.0,
      99.9111111111111,
      99.9430555555556,
      99.9588888888889,
      99.9661111111111,
      100.0,
      100.0,
      99.9325,
      99.9836111111111,
      99.8875,
      99.955,
      100.0,
      100.0,
      99.9533333333333,
      100.0,
      99.9205555555556,
      100.0,
      99.9791666666667,
      100.0,
      99.9083333333333,
      99.9769444444445,
      99.9066666666667,
      99.96,
      100.0,
      99.9458333333333,
      99.9011111111111,
      100.0,
      99.9627777777778,
      99.9805555555556,
      100.0,
      99.9625,
      99.9555555555556,
      99.9077777777778,
      99.9375,
      100.0,
      100.0,
      99.9388888888889,
      99.9316666666667,
      100.0,
      99.9527777777778,
      100.0,
      99.9366666666667,
      100.0,
      99.9366666666667,
      99.9369444444444,
      100.0,
      99.9505555555555,
      99.9533333333333,
      99.965,
      99.9286111111111,
      99.9494444444444,
      99.9391666666667,
      100.0,
      100.0,
      100.0,
      99.9766666666667,
      99.9622222222222,
      98.7733333333334,
      99.8022222222222,
      99.8988888888889,
      99.9016666666667,
      100.0,
      99.9161111111111,
      99.9616666666667,
      99.9677777777778],
     "hostname"=>"parapluie-25.rennes.grid5000.fr">
  => nil 

Note

Sometimes, you will observe that the values property of the timeseries is only composed of nil (UNKNOWN) values. This happens when the node has been reserved and a custom environment has replaced the standard environment (the sending of the metrics is disabled in all environments other than the standard environment).
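
The from, to, resolution and values properties are enough to rebuild a timestamped series. In the parapluie-25 output above, (to - from) / resolution = (1303130880 - 1303087320) / 360 = 121, which matches the number of entries in values. The sketch below assumes that the i-th value corresponds to the timestamp from + i × resolution (an assumption, not something the output states), and uses a hand-shortened sample; the helper names are hypothetical.

```ruby
# Hand-shortened sample built from the parapluie-25 output above.
SAMPLE_TIMESERIES = {
  "resolution" => 360,
  "from" => 1303087320,
  "values" => [nil, nil, 99.0030769230769, 100.0]
}

# Pair each value with a timestamp, assuming the i-th value
# corresponds to from + i * resolution.
def timestamped_values(timeseries)
  from = timeseries["from"]
  resolution = timeseries["resolution"]
  timeseries["values"].each_with_index.map do |value, i|
    [from + i * resolution, value]
  end
end

# Same pairs, without the UNKNOWN (nil) samples.
def known_values(timeseries)
  timestamped_values(timeseries).reject { |_, value| value.nil? }
end

known_values(SAMPLE_TIMESERIES).each do |timestamp, value|
  puts "#{Time.at(timestamp).utc} #{value}"
end
```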

If you read the documentation for the Metrology API, you will learn that the timeseries resource accepts a few parameters:

  • only: return the timeseries only for the given comma-separated list of nodes;
  • resolution: the resolution (in seconds) that you want for your timeseries (the API will automatically adjust it if it is not available for the requested period);
  • from and to: these two options respectively specify the beginning and end of the period of time the timeseries must be restricted to. Both are expressed as UNIX timestamps (in seconds).

For example you can request a specific period of time at a specific resolution:

 pp root.sites[:rennes].metrics[:cpu_idle].timeseries.load(:query => {:resolution => 15, :from => Time.now.to_i-3600*1})[:'parapluie-25']
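
If you are not using Restfully, the same parameters can be passed as a query string. The following sketch builds such a URI with Ruby's standard URI library; the helper name timeseries_uri is hypothetical, and only the parameters listed above (only, resolution, from, to) are assumed to be meaningful to the API.

```ruby
require "uri"

API_BASE = "https://api.grid5000.fr/2.0/grid5000"

# Build the timeseries URI for a metric, with optional query parameters.
# Array values (e.g. a list of nodes) are joined with commas.
def timeseries_uri(site, metric, params = {})
  uri = URI("#{API_BASE}/sites/#{site}/metrics/#{metric}/timeseries")
  query = params.reject { |_, value| value.nil? }
  query = query.transform_values { |value| value.is_a?(Array) ? value.join(",") : value }
  uri.query = URI.encode_www_form(query) unless query.empty?
  uri
end

# Last hour of cpu_idle on parapluie-25, at 15-second resolution.
puts timeseries_uri("rennes", "cpu_idle",
                    only: ["parapluie-25"],
                    resolution: 15,
                    from: Time.now.to_i - 3600)
```

Note that encode_www_form percent-encodes the commas of a multi-node only list (as %2C), which a standard HTTP server decodes back to commas.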

We will now demonstrate how you can register your own custom metrics using the Grid'5000 infrastructure (Ganglia + Metrology API).

Register your Own Metrics

To register metrics, you must have the ganglia-monitor package installed on your nodes, so that you can get access to the gmetric binary. On the default (production) environment, this is already done, so in this tutorial we'll start a job with the default environment.

We'll reserve 2 nodes and launch a very simple script that will do the following:

    #!/bin/bash

    if [ -n "$OAR_NODE_FILE" ]; then
      # launch the script on all the nodes of the job
      for node in $(uniq "$OAR_NODE_FILE"); do
        echo "Launching on $node"
        # run with bash: the loop below relies on $RANDOM, which plain sh may not provide
        oarsh "$node" "bash $0" &
      done
    else
      # launch the infinite loop that sends a new value for dummy_metric every 20 seconds
      while true; do
        gmetric --name dummy_metric --type uint16 --value "$RANDOM"
        sleep 20
      done
    fi
  

In short, the master node will SSH to each node of the reservation to launch an infinite loop that will register a random value under the name dummy_metric, every 20 seconds.
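
If your experiment is driven from Ruby rather than from a shell script, the same gmetric call can be wrapped in a small helper. This is only a sketch: the function names are hypothetical, it uses just the --name, --type and --value options shown in the script above, and send_metric only works on a node where the gmetric binary is installed.

```ruby
# Build the argument list for the gmetric invocation used in the script above.
def gmetric_args(name, value, type = "uint16")
  ["gmetric", "--name", name.to_s, "--type", type.to_s, "--value", value.to_s]
end

# Send one value; returns true on success, false or nil otherwise.
# Requires the gmetric binary (ganglia-monitor package) on the node.
def send_metric(name, value, type = "uint16")
  system(*gmetric_args(name, value, type))
end

# Show the command that would be run for one random sample.
puts gmetric_args("dummy_metric", rand(32_768)).join(" ")
```

Passing the arguments as a list to system avoids going through a shell, so metric names and values need no quoting.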

For ease of use, I have put this script in my public home directory in lille, so that it can be easily reused:

 curl -kn https://api.grid5000.fr/2.0/grid5000/sites/lille/public/crohr/send-dummy-metric.sh

So, let's submit a job that will execute this script (notice how we encode the job description in JSON here):

 curl -kni https://api.grid5000.fr/2.0/grid5000/sites/nancy/jobs  -d '{"command": "curl -k https://api.grid5000.fr/2.0/grid5000/sites/lille/public/crohr/send-dummy-metric.sh > send-dummy-metric.sh && sh send-dummy-metric.sh && sleep 1800", "resources": "nodes=2"}' -H'Content-Type: application/json'
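
Hand-writing JSON inside a shell command is error-prone because of the nested quoting. As an illustration, the same job description can be generated with Ruby's json library; the helper name job_payload and its keyword arguments are hypothetical, while the command and resources keys are the ones used in the curl call above.

```ruby
require "json"

SCRIPT_URL = "https://api.grid5000.fr/2.0/grid5000/sites/lille/public/crohr/send-dummy-metric.sh"

# Build the JSON job description used in the curl command above:
# download the script, run it, then keep the job alive for a while.
def job_payload(script_url, nodes: 2, hold_seconds: 1800)
  script = File.basename(script_url)
  command = "curl -k #{script_url} > #{script} && sh #{script} && sleep #{hold_seconds}"
  JSON.generate("command" => command, "resources" => "nodes=#{nodes}")
end

puts job_payload(SCRIPT_URL)
```

The printed document can then be saved to a file or piped to curl, whose -d @- form reads the request body from standard input.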

See if our job is running (replace the job ID with your job number):

 curl -kni https://api.grid5000.fr/2.0/grid5000/sites/nancy/jobs/325613

Once the job is running, wait a few minutes, then fetch our dummy metric (replace griffon-9,griffon-90 with the assigned_nodes of your job, and 1303141284 with the started_at value):

 curl -kn "https://api.grid5000.fr/2.0/grid5000/sites/nancy/metrics/dummy_metric/timeseries?resolution=15&from=1303141284&only=griffon-9,griffon-90"

If you regularly retry that query, you should see that the timeseries values are slowly updated with random values generated by our script.
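
Rather than copying the values by hand, the query parameters can be derived from the job description. The sketch below works on a hand-written sample containing only the two properties mentioned above (assigned_nodes and started_at); a real job description contains more fields. It assumes that assigned_nodes may hold fully qualified host names and keeps only the short name, and the helper name metric_query_for is hypothetical.

```ruby
# Hand-written sample with the two job properties mentioned above.
SAMPLE_JOB = {
  "assigned_nodes" => ["griffon-9.nancy.grid5000.fr", "griffon-90.nancy.grid5000.fr"],
  "started_at" => 1303141284
}

# Build the query string for the timeseries request from a job description.
def metric_query_for(job, resolution = 15)
  # Assumption: host names may be fully qualified; keep the short name only.
  nodes = job["assigned_nodes"].map { |host| host.split(".").first }
  "resolution=#{resolution}&from=#{job["started_at"]}&only=#{nodes.join(",")}"
end

puts metric_query_for(SAMPLE_JOB)
```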

Note

If you send real values, don't be surprised if the values stored in the RRD database are not exactly the ones you sent with the gmetric tool. The time at which you send a value is likely to differ from the time at which a new row is automatically added to the RRD database (which depends on the acquisition frequency of the Ganglia daemons), so the stored value may be an average that includes preceding values.

Automate the Whole Thing

In one of the tutorials of the API Main Practical, you were introduced to the g5k-campaign tool. The example engine already automates the process of installing the required packages and sending metrics, and you can pipe the output of this script to the g5k-graph tool, which takes a list of nodes as input and graphs a set of requested metrics.

So, install g5k-graph and g5k-campaign:

 gem install g5k-graph
 gem install g5k-campaign --source http://g5k-campaign.gforge.inria.fr/pkg

You can see the g5k-graph usage with:

 g5k-graph -h

This tool also requires rrdtool and unzip, so if you don't have them on your system, install them now. On Debian-based systems they can be installed with:

 apt-get install rrdtool unzip

Finally, just reuse the command given in the g5k-campaign tutorial, piped into g5k-graph (you should replace crohr with your Grid'5000 username):

 g5k-campaign -i https://github.com/grid5000/tutorials/raw/master/api/2.0/g5k-campaign-tutorial.rb \
 --gateway access.lille.grid5000.fr SimpleCustomEngine | g5k-graph \
 --from=`date +%s` --resolution=15 \
 --metrics=bytes_in,bytes_out,mem_free,cpu_idle,custom_metric_crohr

At the end of the engine, you should see a message telling you that a number of PNG graphs are available in ./data. The program will also display a URL to the API metrics user interface.

You can have a look at the g5k-graph source code if you need more details about how this is done.

Note

Note that since you get access to the raw data, you can use any tool you want to graph your metrics (gnuplot for instance). You can also use the user interface: https://api.grid5000.fr/sid/ui/metrics.html.
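
As an example of preparing the raw data for gnuplot, the sketch below turns a list of [timestamp, value] pairs into a two-column data file. The helper name gnuplot_data and the sample pairs are hypothetical. Unknown (nil) values are written as blank lines, which gnuplot treats as a discontinuity, so the plotted line shows a gap instead of joining the neighbouring points.

```ruby
# Format [timestamp, value] pairs as gnuplot data: one "timestamp value"
# line per known sample, and a blank line for each unknown (nil) value.
def gnuplot_data(pairs)
  pairs.map { |timestamp, value| value.nil? ? "" : "#{timestamp} #{value}" }.join("\n") + "\n"
end

# Hypothetical samples at a 15-second resolution, with one unknown value.
samples = [[1303141284, 99.5], [1303141299, nil], [1303141314, 100.0]]
File.write("cpu_idle.dat", gnuplot_data(samples))
puts File.read("cpu_idle.dat")
```

The resulting file can be drawn in gnuplot with a command such as plot "cpu_idle.dat" using 1:2 with lines.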

Conclusion

This practical demonstrates how you can easily fetch and/or register standard and custom metrics using the Metrology API and the Ganglia infrastructure provided by Grid'5000.