G5k-checks

Description

Overview

  • g5k-checks is expected to be integrated into the standard environment of the Grid'5000 computational nodes. It checks that a node meets several basic requirements before it declares itself as available to the OAR server.
  • It also lets the admins enable checkers that may be very specific to the hardware of a cluster.

Architecture

G5k-checks is based on the RSpec test suite. This is a slight detour from RSpec's primary mission, testing a program: here we use RSpec to test all node characteristics. The first step is to retrieve node information with Ohai. By default, Ohai provides a large set of machine characteristics; in addition, we have developed plugins to fill in missing information (particularly for the disks, the CPU and the network). The second step is to compare those characteristics with the Grid5000 Reference Repository: g5k-checks takes each value from the API and compares it with the value given by Ohai. If the values do not match, an error is raised through the RSpec process.
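
To illustrate this pattern, here is a minimal, hypothetical RSpec sketch of such a comparison. The hard-coded values and the spec layout are illustrative only; the real g5k-checks code fetches both sides from Ohai and from the reference API:

 describe "Architecture" do
   it "should have the correct number of cores" do
     ohai_value = 16   # value reported by Ohai on the node (illustrative)
     api_value  = 16   # value from the Grid'5000 reference API (illustrative)
     # the failure message follows the "measured, expected, section, key" convention used by the g5k-checks specs
     expect(ohai_value).to eql(api_value), "#{ohai_value}, #{api_value}, architecture, nb_cores"
   end
 end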

OAR

  • In the oar-node flavour of the OAR installation, /etc/default/oar-node is started at boot time. It launches /usr/lib/oar/oarnodecheckrun, which runs the executable file /etc/oar/check.d/start_g5kchecks. The OAR server then periodically invokes /usr/bin/oarnodecheckquery remotely. This command returns status 1 if /var/lib/oar/check.d/ is not empty, and 0 otherwise. So if /etc/oar/check.d/start_g5kchecks finds something wrong, it simply has to create a log file in that directory.
  • If oarnodecheckquery fails, the node is not ready to start, and it keeps running those scripts until either oarnodecheckquery returns 0 or a timeout is reached. If the timeout is reached, the node does not attempt to declare itself as "Alive".

This summarizes when g5k-checks is run:

  • At service start, with /etc/default/oar-node
  • Between (non-deploy) jobs, with remote execution of oarnodecheckrun and oarnodecheckquery (in the case of deploy jobs, the first type of execution takes place)
  • Manually by a user (for now, this never happens)

G5k-checks is never run during user jobs.

Checks Overview

The following values are checked by g5k-checks:


# Generated by g5k-checks (g5k-checks -m api)
---
network_adapters:
  bmc:
    ip: 172.17.52.9
    mac: 18:66:da:7c:96:1a
    management: true
  eno1:
    name: eno1
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:16
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno2:
    name: eno2
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:17
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno3:
    name: eno3
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:18
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno4:
    name: eno4
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:19
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  enp5s0f0:
    name: enp5s0f0
    interface: Ethernet
    ip: 172.16.52.9
    driver: ixgbe
    mac: a0:36:9f:ce:e4:24
    rate: 10000000000
    firmware_version: '0x800007f5'
    model: Ethernet 10G 2P X520 Adapter
    vendor: Intel
    mounted: true
    management: false
  enp5s0f1:
    name: enp5s0f1
    interface: Ethernet
    driver: ixgbe
    mac: a0:36:9f:ce:e4:26
    rate: 0
    firmware_version: '0x800007f5'
    model: Ethernet 10G 2P X520 Adapter
    vendor: Intel
    mounted: false
    management: false
operating_system:
  ht_enabled: true
  pstate_driver: intel_pstate
  pstate_governor: performance
  turboboost_enabled: true
  cstate_driver: intel_idle
  cstate_governor: menu
architecture:
  platform_type: x86_64
  nb_procs: 2
  nb_cores: 16
  nb_threads: 32
chassis:
  serial: 7W26RG2
  manufacturer: Dell Inc.
  name: PowerEdge R430
main_memory:
  ram_size: 68719476736
supported_job_types:
  virtual: ivt
bios:
  vendor: Dell Inc.
  version: 2.2.5
  release_date: '09/08/2016'
processor:
  clock_speed: 2100000000
  instruction_set: x86-64
  model: Intel Xeon
  version: E5-2620 v4
  vendor: Intel
  other_description: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  cache_l1i: 32768
  cache_l1d: 32768
  cache_l2: 262144
  cache_l3: 20971520
  ht_capable: true
storage_devices:
  sda:
    device: sda
    by_id: "/dev/disk/by-id/wwn-0x6847beb0d535ed001fa67d1a12d0d135"
    by_path: "/dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0"
    size: 598879502336
    model: PERC H330 Mini
    firmware_version: 4.26
    vendor: DELL

This is an example of the output file produced in API mode (g5k-checks launched with the -m api option).

In addition, not all tests export data to this file. The following values are also checked (a sketch of what such a check could look like is shown after the list):

  • Grid5000 standard environment version
  • Grid5000 post-install scripts version
  • Usage of sudo-g5k (the check fails if it was used, as it could be destructive to other parts of the system)
  • Correct mode of /tmp/
  • Fstab partitions mounted and valid
  • All partitions have expected size, position, offset, mount options, ...
  • Correct KVM driver
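
As an example of this kind of additional check, here is a minimal, hypothetical RSpec sketch of a /tmp mode check; the expected mode 1777 and the spec layout are assumptions for illustration, not the actual g5k-checks code:

 describe "Filesystem" do
   it "/tmp should have the expected mode" do
     # keep only the permission bits; 1777 means world-writable with the sticky bit set (assumed expected value)
     tmp_mode = format("%o", File.stat('/tmp').mode & 07777)
     expect(tmp_mode).to eql("1777"), "#{tmp_mode}, 1777, filesystem, tmp_mode"
   end
 end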

Simple usage

Installation

G5k-checks is currently tested on Debian buster. It is available from the Grid5000 Debian repository: just add the following line to /etc/apt/sources.list

deb http://packages-ext.grid5000.fr/deb/g5k-checks/buster /
 node: apt-get update

Install it:

 node: apt-get install g5kchecks

Get sources

git clone https://github.com/grid5000/g5k-checks.git

Run g5k-checks

If you want to check your node, just run:

 node: g5k-checks -v

The output highlights failing tests in red. Also, if an error occurs, g5k-checks puts a file in /var/lib/g5kchecks/. For instance:

 root@adonis-3:~# g5k-checks
 root@adonis-3:~# ls /var/lib/oar/checklogs/
 OAR_Architecture_should_have_the_correct_number_of_thread

You can see the details of the checked values this way:

 root@adonis-3:~# cat /var/lib/oar/checklogs/OAR_Architecture_should_have_the_correct_number_of_thread
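
For instance, such a log file may contain something like the following (illustrative values; the exact format may differ between g5k-checks versions). Here the node reports 16 threads while the reference API expects 8:

 {"started_at":"2013-09-25 15:07:16 +0200","exception":"16, 8, architecture, nb_threads",
  "status":"failed","finished_at":"2013-09-25 15:07:16 +0200","run_time":0.000155442}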

Get node description

G5k-checks serves a double purpose: it can check a node description against our reference API and detect errors, but it can also generate the data used to populate this reference API.

If you want an exact node description, you can run:

 node: g5k-checks -m api

(If launched with the -v verbose option, you will see that almost all tests fail; this is normal, as empty values are checked instead of the real ones.)

g5k-checks then puts a JSON and a YAML file in /tmp/:

 root@adonis-3:~# g5k-checks -m api
 root@adonis-3:~# ls /tmp/
 adonis-3.grenoble.grid5000.fr.json  adonis-3.grenoble.grid5000.fr.yaml

Write your own checks/description

G5k-checks internals

G5k-checks is written in Ruby on top of the RSpec test framework. It gathers information from the Ohai program and compares it with Grid'5000 reference API data. RSpec is simple to read and write, so you can easily copy other checks and adapt them to your needs.

The file tree is:

 ├── ohai # Ohai plugins; this information is later used by g5k-checks
 ├── rspec # RSpec formatters (store information in different ways)
 ├── spec # Checks directory
 └── utils # some utility classes

Play with ohai

Ohai is a small program that retrieves information from various files and other programs on the host. It offers easy-to-parse JSON output. We can add information to this JSON simply by writing plugins. For instance, if we want to add the version of bash to the description, we can create a small file /usr/lib/ruby/vendor_ruby/g5kchecks/ohai/package_version.rb with:

Ohai.plugin(:Packages) do

  provides "packages"

  collect_data do
    packages Mash.new
    packages[:bash] = `dpkg -l | grep bash | awk '{print $3}'`
    packages
  end
end

Play with Rspec

RSpec is a framework for testing Ruby programs. G5k-checks uses RSpec not to test a Ruby program, but to test the host. RSpec is simple to read and write. For instance, if we want to ensure that the bash version is the expected one, we can create a file /usr/lib/ruby/vendor_ruby/g5kchecks/spec/packages/packages_spec.rb with:

 describe "Packages" do
                                                                                                                                           
   before(:all) do                                                                                                                         
     @system = RSpec.configuration.node.ohai_description
   end
   
   it "bash should should have the good version" do                                                                                        
     puts @system[:packages][:bash].to_yaml
     bash_version = @system[:packages][:bash].strip                                                                                        
     bash_version.should eql("4.2+dfsg-0.1"), "#{bash_version}, 4.2+dfsg-0.1, packages, bash"                                              
   end
       
 end

Add checks

Example: I want to check if flag "acpi" is available on the processor:

Add to /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:

 it "should have apci" do
   acpi_ohai = @system[:cpu][:'0'][:flags].include?('acpi')
   acpi_ohai.should_not be_false, "#{acpi_ohai}, is not acpi, processor, acpi"
 end

Add information to the description

Example: I want to add the BogoMIPS of the node:

First we add the information to the Ohai description. To do this, we add the following to the file ohai/cpu.rb, after line 80:

    if line =~ /^BogoMIPS/
      cpu[:Bogo] = line.chomp.split(": ").last.lstrip
    end

Then we can retrieve this information and add it to the description. To do this, we add the following to /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:

    it "should have BogoMIPS" do
      bogo_ohai = @system[:cpu][:Bogo]
      #First value is system, second is from API, thirs is the YAML path in the created '/tmp/' file for -m api mode.
      #Last argument is false to export value in API mode, true to skip
      Utils.test(bogo_ohai, nil, 'processor/bogoMIPS', false) do |v_ohai, v_api, error_msg|
          expect(v_ohai).to eql(v_api), error_msg
      end
    end

Now you have the information in /tmp/mynode.mysite.grid5000.fr.yaml:

   root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# g5k-checks -m api
   root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# grep -C 3 bogo /tmp/graphene-100.nancy.grid5000.fr.yaml 
     ram_size: 16860348416
   processor:
     clock_speed: 2530000000
     bogoMIPS: 5053.74
     instruction_set: x86-64
     model: Intel Xeon
     version: X3440

Releasing and testing

Tests and reference-repository update

Before creating a new standard environment, g5k-checks can be tested on target environments using the jenkins test: https://intranet.grid5000.fr/jenkins/job/test_g5kchecksdev

This test can reserve all nodes, or as many as possible (targets cluster-ALL and cluster-BEST), on each cluster of Grid5000.

It will check out a (configurable) branch of g5k-checks and test it against a (configurable) branch of the reference-api.

The test will fail if a mandatory check fails (i.e. there are entries in /var/lib/oar/checklogs).

Also, the YAML output of the "-m api" option of g5k-checks will be written to the $HOME/g5k-checks-output directory of the ajenkins user on the target site.

Note: it is possible to change the branches of both reference-repository and g5k-checks for the test by configuring the jenkins test:

  cd /srv/jenkins-scripts && ruby -Ilib -rg5kcheckstester -e "G5KChecksTester::new.test('$site_cluster', 'custom', 'dev_feature', 'dev_feature_refrepo')"

For example, this will take the 'dev_feature' branch of g5k-checks and test it against the data present in the 'dev_feature_refrepo' branch of the reference-api.

Updating the reference-repository

Once the tests are finished on the desired clusters, the generated YAML files must be imported manually.

  • In the reference repository, go to the generators/run-g5kchecks directory.
  • Now fetch the YAML files you want to include. For example:
rsync -va "rennes.adm:/home/ajenkins/g5k-checks-output/paravance*.yaml" ./output/

The output directory holds the temporary files that will be included as input in the reference-repository.

  • Then import the YAML files into the reference-repository with:
rake g5k-checks-import SOURCEDIR=<path to the output dir>

If the values seem correct, generate the JSON and commit:
  rake reference-api
  git diff data/ 
  git add data input
  git commit -m"[SITE] g5k-checks updates"

Release a new version

Once the modifications have been tested and found correct on as many clusters as possible, a new version can be released.

See the TechTeam:Git_Packaging_and_Deployment wiki page for general instructions about the release workflow.

Environment update

The version of g5k-checks included in the standard environment is defined in the following file:

steps/data/setup/puppet/modules/env/manifests/common/software_versions.pp

Once the environment is correct and its version updated, it can be generated with the automated jenkins job: https://intranet.grid5000.fr/jenkins/job/env_generate/

New environment release and reference-api update guidelines

The following procedure summarizes the steps taken to test and deploy a new environment with g5k-checks.

G5k-checks relies on the reference-api to check system data against it. Data from the reference-api must be up to date for tests to succeed, but most of this data is generated by g5k-checks itself, creating a sort of 'circular dependency'. To avoid dead nodes, g5k-checks data should be gathered from all nodes before pushing a new environment.

  • Do a reservation of all nodes of G5K, for example: oarsub -t placeholder=maintenance -l nodes=BEST,walltime=06:00 -q admin -n 'Maintenance' -r '2017-08-31 09:00:00'

The reservation should happen early enough to ensure that most (ideally all) of the resources will be available at that time.

  • Prepare and release a new Debian package of g5k-checks (see #Release a new version)

  • Prepare a new standard environment with this new g5k-checks version (see #Environment update)

  • Now g5k-checks should be run on all reserved nodes in 'api' mode in order to retrieve the YAML description that will be used to update the reference-api.

This step might be the most tedious one, but it can be done before the actual deployment. See #Tests and reference-repository update.

  • Commit and push these changes to the master branch of the reference-repository.

  • Soon after, push the new environment version to all sites using the automated jenkins job: https://intranet.grid5000.fr/jenkins/job/env_push/

The jenkins job does an OAR reservation of type 'destructive' that will force the deployment of the new environment.

  • If not all nodes were available at the time of the new g5k-checks data retrieval (which is often the case) or during the environment update, open a bug for all sites to let site administrators finish running g5k-checks on the remaining nodes.

Run G5k-checks on non-reservable nodes

It is common to have to update the reference-repository values of nodes whose state is 'Dead' in OAR.

An adaptation of the jenkins g5k-checks test has been made to allow running the same test without doing an OAR reservation.

The only difference is that instead of using OAR to reserve nodes and the Kadeploy API to deploy, the nodes are given directly as arguments and kadeploy is called directly from the sites' frontends.

This script must be run on the jenkins machine:

cd /srv/jenkins-scripts 
ruby -Ilib -rg5kcheckstester -e "G5KChecksTester::new.from_nodes_list()" grisou-{15,16,18}.nancy.grid5000.fr

Once done, the procedure is the same as described in #Updating the reference-repository.