Kernel development testing
This practice was designed for the old OAR1 infrastructure of Grid'5000 and is now deprecated.
To help kernel development, some tools allow the kernel to run at user level. They make it possible to trace specific parts of the code and to handle kernel crashes (aka kernel panics) more easily with a user-space debugger. For instance, the User Mode Linux (UML) project provides this feature for Linux.
However, final tests always need to be performed in the real environment. In the HPC context, Grid'5000 offers this possibility thanks to the Kadeploy software.
This practice proposes to deploy on the grid a Linux kernel containing a special 'made to crash' module, and shows how to handle a crashed kernel by fetching the information given at that time. It also shows how to get the crashed system rebooted, within the limitations of the platform (IPMI, EFI, etc).
The kaconsole and kareboot commands are mainly addressed.
Dedicated images have been built for the Sophia and Orsay clusters. Like most Grid'5000 sites, these two use IPMI management facilities.
Note: The icluster2 architecture (IA64, located at Grenoble) uses EFI management cards. If you want to use the kaconsole command, please contact the icluster staff (firstname.lastname@example.org), since your account must be activated in the management system.
|Cluster||orsay (oar.orsay)||sophia (oar.sophia)|
|Deployment partitions||Please refer to the cluster specifications|
- Go to the selected cluster:
Orsay: ssh email@example.com
Sophia: ssh firstname.lastname@example.org
From the Grenoble frontal: ssh frontal38.grid5000.fr, then ssh oar.orsay or ssh oar.sophia
- Book one node for two hours with the deploy option:
oarsub -q deploy -l walltime=2 -I
- Check your assigned node:
- Take a look at the environment description:
kaenvironments -e debian-tpkernel -l alebre
- Deploy the debian-tpkernel environment on the right partition (replace hdaX with the right one):
kadeploy -l alebre -e debian-tpkernel -p hdaX -m `uniq $OAR_NODEFILE`
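The node file written by OAR may list the same host several times, which is why the kadeploy command above passes it through uniq. As a minimal sketch, the helpers below extract the distinct host names and store the first one in MYNODE, the variable used in the rest of this practice. The function names distinct_nodes and first_node are illustrative and are not part of OAR or Kadeploy.

```shell
# Hypothetical helpers: neither function ships with OAR or Kadeploy.

# Print each distinct line of the node file given as $1, keeping the original order.
distinct_nodes() {
    awk '!seen[$0]++' "$1"
}

# Print only the first distinct node of the file given as $1.
first_node() {
    distinct_nodes "$1" | head -n 1
}

# Inside an OAR job, remember the node for the following steps.
if [ -n "${OAR_NODEFILE:-}" ] && [ -r "$OAR_NODEFILE" ]; then
    MYNODE=$(first_node "$OAR_NODEFILE")
    echo "Deployed node: $MYNODE"
fi
```

Unlike plain uniq, the awk filter also removes duplicates that are not adjacent, so the node file does not need to be sorted.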
In the following part, you should open 3 terminals:
- ssh connection to $MYNODE
- kaconsole connection to $MYNODE
- ssh connection to the frontal of the selected cluster.
- In the first terminal, connect to the deployed node (ssh, login: root / password: grid5000). This node is called $MYNODE in the following sections.
- In the second terminal, go to the frontal and launch kaconsole to monitor $MYNODE:
kaconsole -m $MYNODE
# telnet connection to the $MYNODE management card (IPMI interface for Orsay and Sophia)
# Press 'Enter' and log in as root (Linux console, root/grid5000)
- In the first terminal, write a message to the console:
echo "This message appears on the console" >> /dev/ttyS0
# All kernel events will also appear on the console
- Still in the first terminal, stop the network interfaces
- From the frontal (the third terminal), check that you can no longer access $MYNODE (ssh should hang)
- The first terminal should also stop responding
- Restart the network interfaces from the second terminal (kaconsole)
/etc/init.d/networking restart
# After that, all connections should work again and the three terminals are usable.
- A similar operation can be done by removing the network kernel module:
rmmod tg3 && /etc/init.d/networking restart
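The reachability checks above are done by hand from the third terminal. A minimal sketch of the same check as a polling loop follows. Both wait_for and node_answers are hypothetical helpers, and the probe assumes that the node answers ICMP echo requests once its network is back.

```shell
# wait_for TRIES DELAY COMMAND...
# Run COMMAND up to TRIES times, sleeping DELAY seconds between attempts.
# Return 0 as soon as COMMAND succeeds, 1 if it never does.
wait_for() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Do not sleep after the last failed attempt.
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Probe a node with a single ping (assumption: ICMP is not filtered).
node_answers() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# Example, from the frontal, after restarting the network through kaconsole:
#   wait_for 12 5 node_answers "$MYNODE" && echo "$MYNODE is back"
```

Passing the probe as a command keeps the loop independent of how reachability is tested, so an ssh-based probe could be substituted if ping is not suitable.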
In the second step of the practice, we use kaconsole to monitor a kernel module (a 'verbose' version of the aIOLi module).
- From terminal 1, stop the NFS server (use the stopnfs.sh script available in the /root directory) and load the aIOLi module (aioli_nfsd_verbose.ko):
~/stopnfs.sh && insmod -f /root/aioli_nfsd_verbose.ko && /etc/init.d/nfsaioli-kernel-server start
# Some output should appear on the console (terminal 2).
- Create a temporary directory and mount the /tmp partition exported by the aIOLi NFS server:
mkdir ./mytmp && mount $MYNODE:/tmp ./mytmp
ls > /tmp/fichiertest
cat ./mytmp/fichiertest
# The console should show a lot of output. These messages come from the 'debug' logs of the aIOLi module.
# In this way, you can debug and/or finalize specific kernel developments.
kaconsole and kareboot
In this part, we address the problem of a kernel 'crash'. The kernel 'OOPS' is triggered by loading two modules that provide the same functionality (the nfsd module and the nfsd-aioli module). When kaconsole is not enough, the kareboot command makes it possible to reboot the node. Note: remember to stop both the former NFS server and the aIOLi module ;)
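To fetch the information given at crash time, it helps to save the console output to a file and search it afterwards. The sketch below looks for the strings 'Oops' and 'Kernel panic', which Linux prints when these events occur. The file name console.log and the helper name crash_lines are illustrative, and how the console session gets saved (for example with the terminal's own logging feature) is left open.

```shell
# crash_lines FILE
# Print, with line numbers, the lines of FILE that mention a kernel
# Oops or a kernel panic. Returns non-zero when nothing is found.
crash_lines() {
    grep -n -E 'Oops|Kernel panic' "$1"
}

# Example, assuming the console session was saved to console.log:
#   crash_lines console.log || echo "no crash reported"
```

The line numbers make it easy to go back to the saved file and read the register dump and call trace that usually follow the first match.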
- On $MYNODE, generate the 'OOPS':
insmod -f aioli_nfsd_v0.31.ko && /etc/init.d/nfs-kernel-server start
# or: modprobe nfsd && modprobe exportfs && insmod -f aioli_nfsd_v0.31.ko && modprobe nfsd
# After this command, you should notice a first 'OOPS' and various error messages on the console.
# Launch this command while you still can ;)
- When both the node and the console are unreachable, you have to use the kareboot command from the cluster frontal:
kareboot -h -m $MYNODE -p hdaX
# The node restarts and the usual messages appear on the console.
In some particular cases, the management card does not respond (even with the -h option) and the node cannot be rebooted. This happens occasionally, and we have not found the reason. In such a case, the only way to recover the node is to ask the cluster staff to perform a manual reboot.
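The recovery procedure described above (try a hard reboot, then fall back to the cluster staff) can be sketched as a small script. This is only an illustration under stated assumptions: recover_node is a hypothetical helper, the probe and reboot actions are passed in as commands so the logic can be tried without touching a real node, the reboot command is assumed to report failure through its exit status, and the only kareboot options used are the ones shown in this section.

```shell
# recover_node PROBE REBOOT [TRIES] [DELAY]
#   PROBE  : command name that succeeds once the node answers
#   REBOOT : command name that requests the hard reboot
#   TRIES  : number of probes after the reboot (default 30)
#   DELAY  : seconds between probes (default 10)
# Prints "up", "rebooted" or "contact-staff"; returns 1 only in the last case.
recover_node() {
    probe=$1
    reboot=$2
    tries=${3:-30}
    delay=${4:-10}

    # Nothing to do if the node still answers.
    if "$probe"; then
        echo "up"
        return 0
    fi

    # Ask for the reboot, then poll until the node comes back.
    if "$reboot"; then
        n=0
        while [ "$n" -lt "$tries" ]; do
            if "$probe"; then
                echo "rebooted"
                return 0
            fi
            n=$((n + 1))
            sleep "$delay"
        done
    fi

    # The reboot failed or the node never came back: manual reboot needed.
    echo "contact-staff"
    return 1
}

# Example wiring on the frontal (hdaX must be replaced, as above):
#   probe()  { ping -c 1 "$MYNODE" >/dev/null 2>&1; }
#   reboot() { kareboot -h -m "$MYNODE" -p hdaX; }
#   recover_node probe reboot
```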
- The kaconsole and kareboot commands are enabled on your deployed nodes only. When you make a simple reservation (for instance, oarsub -I -c), you will get a message such as:
WARNING : "userX" does not have rights on node-X.XXXX.grid5000.fr
ERROR : you have no rights on node node-X.XXXX.grid5000.fr
This message can also appear when the host name is wrong.
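Since a mistyped host name produces the same error, a quick check against the job's node file before calling kaconsole or kareboot can save a round trip. A minimal sketch follows; node_in_job is a hypothetical helper, it assumes a shell where OAR_NODEFILE is set, and it only verifies the spelling of the name, not whether the node was actually deployed.

```shell
# node_in_job NODE FILE
# Succeed when NODE appears as a whole line of FILE (exact, fixed-string match).
node_in_job() {
    grep -q -x -F -- "$1" "$2"
}

# Example on the frontal:
#   node_in_job "$MYNODE" "$OAR_NODEFILE" && kaconsole -m "$MYNODE"
```

The -x and -F options make grep compare the full line literally, so a short name such as node-1 does not match node-1.orsay.grid5000.fr and the dots are not treated as wildcards.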