Non-Linux system deployment
Author: Lucas Nussbaum (Université Lyon 1 / LIP, équipe RESO).
Grid'5000 is not bound to GNU/Linux: deploying other operating systems, like FreeBSD or Solaris, is also possible. This allows Grid'5000 users to use software that is not available or compatible with Linux, such as network emulation tools Dummynet and Modelnet (FreeBSD) or DTrace (Solaris Dynamic Tracing Framework). In this tutorial, you will learn how to deploy FreeBSD on Grid'5000 (but this knowledge is transferable to other operating systems). You will also learn (or be reminded of) details about disk partitioning, the boot process, and the internals of Kadeploy.
This tutorial is split into three parts:
- a theoretical part about disk partitioning, the boot process, and Kadeploy internals.
- the deployment and customization of an environment using a dd image.
- a quick tour of two recent FreeBSD additions imported from OpenSolaris: DTrace, a dynamic tracing framework, and the ZFS filesystem.
During the second part of the tutorial, the transfer of the image to the nodes can take a lot of time (up to 10 minutes). It is recommended to start with the second part, and go back to reading the first part during the deployment.
Disk partitioning, the boot process, and Kadeploy
Hard disks, partitioning, and file systems
To understand how Kadeploy achieves the deployment of non-Linux systems, some details about disk partitions and the boot process need to be understood.
Hard disks and partitions
On the x86 and amd64 architectures, in the DOS world (that includes Linux), hard disks are divided into partitions, which can be of two types:
- primary partitions, described in the Master Boot Record (first 512 bytes of the hard disk)
- logical partitions (contained in one extended partition), described in Extended Master Boot Records (EMBR) at the beginning of each logical partition (think of it as a linked list).
Because of the structure of the MBR, there can be at most 4 primary partitions (3 if one uses an extended partition, because the location of the first EMBR is given instead of the 4th partition in the MBR).
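The layout of the MBR can be inspected with standard tools. The following is a minimal sketch, assuming the classic MBR layout (partition table at byte offset 446, four 16-byte entries, type byte at offset 4 of each entry). The function names `read_byte` and `mbr_types` are hypothetical, and the demo works on a fake 512-byte sector rather than on a real disk.

```shell
#!/bin/sh
# Sketch: list the type byte of the four primary partition entries in an MBR.

# read_byte FILE OFFSET: print one byte of FILE as two hex digits.
read_byte() {
    dd if="$1" bs=1 skip="$2" count=1 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# mbr_types FILE: FILE is a disk device or a copy of its first sector.
mbr_types() {
    for i in 0 1 2 3; do
        # Entry i starts at 446 + 16*i; the type byte is 4 bytes into the entry.
        ptype=$(read_byte "$1" $((446 + 16 * i + 4)))
        echo "partition $((i + 1)): type 0x$ptype"
    done
}

# Demo on a fake sector whose 4th entry has type 0xa5 (165, FreeBSD).
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=1 2>/dev/null
printf '\245' | dd of="$img" bs=1 seek=498 conv=notrunc 2>/dev/null
mbr_types "$img"
rm -f "$img"
```

On a real node, the same function could be pointed at the disk device (as root), since only the first sector is read.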
For more details, refer to:
- Wikipedia: Partition
- Wikipedia: Master Boot Record (MBR)
- Wikipedia: Extended Master Boot Record (EMBR)
Partitioning on BSD and Solaris
BSDs and Solaris use a different view of hard disks. They can only be installed in a primary partition (which they call a slice), and create inside it partitions that work similarly to logical partitions. (See this example in the FreeBSD documentation.)
Once partitions have been created on a disk, the operating system needs to organize the files and directories on it. This work is done by the file system. File systems are specific to each operating system, but some operating systems provide support for file systems from other operating systems. Linux, for example, supports many file systems, but often incompletely or experimentally (it might not be possible to create a file system of another type, or to modify or add data on the file system).
Kadeploy and file systems
When Kadeploy is used to deploy a Linux environment, it does the following operations on the partitions:
- It reboots the node on a deployment system (a very small Linux system), that is sent via the network.
- It optionally repartitions the hard disk.
- It creates the file system specified by the environment.
- It mounts this file system and uncompresses the .tar.gz file specified by the environment.
- It reboots the node on the deployed system.
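The mount-and-unpack step is the reason why the deployment system must understand the target file system. It can be imitated on any machine with a plain directory standing in for the mounted partition. This is only a sketch of the idea, not Kadeploy's actual code; the function name `unpack_env` and all paths are hypothetical.

```shell
#!/bin/sh
# unpack_env ARCHIVE TARGET: extract a .tar.gz environment into TARGET,
# which stands in for the mount point of the freshly created file system.
unpack_env() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Demo: build a tiny archive, then unpack it into a scratch directory.
work=$(mktemp -d)
mkdir -p "$work/src/etc"
echo "demo-node" > "$work/src/etc/hostname"
tar -czf "$work/env.tar.gz" -C "$work/src" .
unpack_env "$work/env.tar.gz" "$work/mnt"
cat "$work/mnt/etc/hostname"
rm -rf "$work"
```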
Deployment of other systems with Kadeploy: .dd.gz images
The process used to deploy Linux environments cannot be used for other systems, because Linux does not necessarily know how to create, mount or write to the specified file system. Instead, Kadeploy allows using .dd.gz images: since they are written directly to the target partition, Kadeploy does not need to understand the file system they contain. However:
- It is not possible to modify the configuration once deployed, because the target file system is never mounted during the deployment process.
- It is not possible to use the image if it is not of the same size as the target partition.
- The deployment is not very efficient, because the whole partition needs to be transferred and written: both the network's bandwidth and the disk's write speed can become bottlenecks.
Other deployment systems (Frisbee, PartImage) use the same principle but, unlike dd, they write only the blocks that are actually in use, instead of the whole partition. The implementation of a similar approach in Kadeploy is not currently planned.
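The principle of a .dd.gz image, and the size constraint mentioned above, can be tried out with regular files standing in for partitions. The sketch below is an illustration only: the function names `make_ddgz`, `ddgz_size` and `restore_ddgz` are hypothetical, and the size check is an assumption about how one might guard against a mismatch, not a description of what Kadeploy does.

```shell
#!/bin/sh
# make_ddgz SRC OUT: copy SRC (a file standing in for a partition)
# block by block and compress the stream.
make_ddgz() {
    dd if="$1" bs=64k 2>/dev/null | gzip -c > "$2"
}

# ddgz_size IMAGE: uncompressed size of the image, in bytes.
ddgz_size() {
    gzip -dc "$1" | wc -c | tr -d ' '
}

# restore_ddgz IMAGE TARGET: write the image over TARGET, but only if
# the image and the target have exactly the same size.
restore_ddgz() {
    want=$(wc -c < "$2" | tr -d ' ')
    [ "$(ddgz_size "$1")" = "$want" ] || { echo "size mismatch" >&2; return 1; }
    gzip -dc "$1" | dd of="$2" bs=64k conv=notrunc 2>/dev/null
}

# Demo: image a 256 KiB "partition", restore it, compare both copies.
work=$(mktemp -d)
dd if=/dev/urandom of="$work/part.img" bs=1024 count=256 2>/dev/null
dd if=/dev/zero of="$work/target.img" bs=1024 count=256 2>/dev/null
make_ddgz "$work/part.img" "$work/part.dd.gz"
restore_ddgz "$work/part.dd.gz" "$work/target.img" \
    && cmp "$work/part.img" "$work/target.img" && echo "identical"
rm -rf "$work"
```

Note that every block is read and written, whether it holds data or not, which is why this approach is slower than tools that skip unused blocks.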
The boot process
Since the operating system on Grid'5000 nodes can be remotely modified, booting a Grid'5000 node is much more complex than booting your laptop.
To be able to remotely choose how the system should be started, Kadeploy uses PXE (Preboot Execution Environment). For Linux environments, the kernel and init ram disk (initrd) are extracted from the .tar.gz archive during the deployment, and made available by TFTP. When booting, the node then fetches the pxelinux program, its configuration, and the kernel and initrd pointed to by the configuration file. (This leads to an interesting problem: if, after your node has been deployed, you modify your kernel (by recompiling it, for example), this change won't be visible until you redeploy your node. If you simply reboot it, the old kernel (the one in the .tar.gz archive) will be used.)
To boot FreeBSD and other systems, using pxelinux to load a kernel is obviously not an option. Instead, the idea is to "chainload" a bootloader that is installed on the partition (and stored inside the .dd.gz). To achieve that, a GRUB floppy image containing the following instructions is sent instead of the Linux kernel:
title $title
parttype (hd$letter,$part) 0x$hexfdisktype
rootnoverify (hd$letter,$part)
makeactive
chainloader +1
boot
This way of booting non-Linux systems is system-independent: the only requirement is that a boot loader is installed on the partition.
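To see what the template looks like once its variables are substituted, here is a small sketch that fills it in. The function name `grub_chainload_entry` and the example values are assumptions made for illustration: first disk, third primary partition (GRUB legacy counts partitions from 0, hence index 2), and partition type a5 (FreeBSD).

```shell
#!/bin/sh
# grub_chainload_entry TITLE DISK PART HEXTYPE
# Print the chainloading instructions with the four variables filled in.
grub_chainload_entry() {
    cat <<EOF
title $1
parttype (hd$2,$3) 0x$4
rootnoverify (hd$2,$3)
makeactive
chainloader +1
boot
EOF
}

# Example values (assumptions, see above).
grub_chainload_entry "FreeBSD" 0 2 a5
```

Nothing in the generated entry depends on the operating system itself: only the disk, the partition and its type change.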
Deployment and modification of a FreeBSD environment
In this part, we will deploy FreeBSD on nodes of the Grid Explorer platform (gdx cluster, in the Orsay site). It is currently not possible to deploy FreeBSD on all Grid'5000 clusters.
- Connect to the orsay site, and reserve 3 nodes for the duration of the practical session, in the deploy queue. To exclude netgdx nodes (which have 3 NICs, and might cause problems), specify that you want the gdx cluster. To only include nodes with a remote console, specify the rconsole property:
ssh orsay
oarsub -t deploy -l nodes=3,walltime=2 -p "cluster='gdx' and rconsole='YES'" 'sleep 86400'
- Get the list of nodes that were allocated:
oarstat -fj $YOURJOBID
- We will now deploy the freebsd7.1-basic environment, owned by the user lnussbaum. Look at its description:
kaenv3 -l -u lnussbaum
kaenv3 -p freebsd7.1-basic -u lnussbaum
- Deploy the environment on two of your nodes. First, start a console on one of the nodes you are going to deploy to monitor the deployment:
kaconsole3 -m NODE
To exit kaconsole, refer to the escape sequence for each site.
- Now, deploy the environment using Kadeploy. It is a very good idea to specify the --verbose-level 4 parameter, to increase the verbosity.
kadeploy3 --verbose-level 4 -u lnussbaum -e freebsd7.1-basic -m NODE1 -m NODE2
- During the deployment, it is a good idea to go back to the first part of this tutorial, and continue reading.
- On the console, during the deployment and the boot, try to find:
- when the transfer of the .dd.gz environment is taking place
- during the final boot, when grub is being used to chainload FreeBSD
- the nice logo of the FreeBSD bootloader
- Connect to one of the deployed nodes via SSH (login: root / password: grid5000).
- Copy your SSH key to the node:
Register the image for future deployments
After modifying your deployed system (for example, after adding your SSH key), you can register it for future use. To do that, the following process must be used for .dd.gz images:
- Make sure that all pending writes have been committed to disk on the node. On the node, run:
- Reboot the node on the deployment system (follow this process using kaconsole). kareboot's -d option specifies that the system should be rebooted on the deployment system. On the frontend, run:
kareboot3 -d -m NODE -r simple_reboot
- You need to connect to the deployment system using SSH. The key can be extracted from the deployment system archive (/var/lib/tftpboot/PXEClient/images_grub/deploy-vmlinuz-220.127.116.11), or found directly at ~lnussbaum/FreeBSD/id_deploy. Once the system is ready (on the console, look for the initialization of the disk controllers), verify that you can connect to it. First, you need to copy the SSH private key to your own home directory, because SSH will refuse to use it if it's not owned by you.
cp ~lnussbaum/FreeBSD/id_deploy ~/
chmod 600 ~/id_deploy
ssh -i ~/id_deploy root@NODE id
- To save the FreeBSD partition as a .dd.gz image, use:
ssh -i ~/id_deploy root@NODE "dd if=/dev/sda3 | gzip" | pv > my_freebsd_image.dd.gz
pv is a nice tool to display the current bandwidth of a pipe. The compressed image uses about 700 MB.
- Register your environment with Kadeploy. First, get a working configuration for a FreeBSD image, with:
kaenv3 -p freebsd7.1-basic -u lnussbaum > my_freebsd_image.dsc
- Modify the description file to change the filebase parameter.
- Record your environment using karecordenv:
karecordenv -fe my_freebsd_image.dsc
- Deploy your environment on the third node you reserved.
DTrace and ZFS
If you followed the instructions correctly up to here, you should have a spare node deployed with FreeBSD. In the following two parts of this tutorial, we will examine two recent features that were imported from OpenSolaris to FreeBSD: DTrace, a dynamic instrumentation framework, and ZFS, a new file system. The following two sections are independent: feel free to skip directly to the ZFS section if you are more interested in ZFS than in DTrace.
FreeBSD has a history of good support for performance monitoring and analysis. For example, a useful feature is that pressing CTRL+T while any application is running will activate a default SIGINFO handler that displays the current application's name, together with other useful information, like the current system call, the system load, the process's CPU and memory usage, etc.
- Run sleep 10 in a shell, and press CTRL and T. Observe the default SIGINFO handler.
- Run openssl speed in a shell, and press CTRL+T again.
- Run find / &>/dev/null in a shell, and press CTRL+T again. (Stop it with CTRL+C)
Recently, work has been done to port OpenSolaris' famous tracing framework, DTrace, to FreeBSD. DTrace provides a framework to dynamically instrument a running kernel, without any performance cost when the instrumentation is not activated. It allows its users to trace the execution of the kernel, export internal counters, and aggregate statistics, even if there's no predefined interface to access the data (like pseudo-files in /proc or /sys). DTrace provides its own programming language, D. (For more information about DTrace, please refer to Wikipedia or this page.) While this FreeBSD port is a work in progress, and doesn't provide all the features that the OpenSolaris version provides, it is already interesting to take a look at it. Also, it is worth noting that Linux provides a similar framework, using Linux's KProbes and a userland tool named SystemTap (refer to this LWN article for more information about SystemTap).
We will now load DTrace, and play with a few D scripts.
- Load the DTrace kernel module:
- Change to the ~/dtrace directory.
- First, the syscallbysysc.d script intercepts all system calls and, after it has been stopped (with CTRL+C), summarizes how many times each system call was called.
- Look at the content of syscallbysysc.d.
- Run ./syscallbysysc.d. Stop it after a few seconds.
- Run it again, run some syscall-intensive tasks in another terminal. Observe the results.
- The newproc.d script intercepts exec() syscalls, and displays some information about the created processes.
- Look at its content, then run ./newproc.d, and run some shell commands that will create processes.
- Now have a look at the content of execsnoop, which is a more complex version of the same script, and run it.
- Change to the ~/DTraceToolkit-0.99 directory. This directory contains a lot of example DTrace scripts. Unfortunately, they were written for OpenSolaris, and the FreeBSD kernel differs in a number of ways from the OpenSolaris one (different system call names, different structures, etc). Also, some of OpenSolaris DTrace's probes have not been ported to FreeBSD yet. As a result, most of the scripts there need some work before they can run on FreeBSD.
Another interesting new feature of FreeBSD, which was recently imported from OpenSolaris, is the ZFS file system. This file system provides a number of modern features, like built-in checksums, redundancy, compression, and snapshots. In the Linux world, it can be compared to a merge of software RAID, LVM, and ext4. (Useful information about ZFS: Wikipedia, and the ZFS Administration Guide.)
- First, we will change the disk partitioning on our Grid'5000 nodes to create a ZFS partition.
- Run fdisk -u
- Answer NO to the first question (BIOS view of the disk)
- Answer NO to questions about partitions 1, 2, 3
- Change Partition 4:
- Change sysid to 165 (enter 165, then press Enter)
- Do not change the start and the size (just press Enter)
- Do not explicitly specify beg/end address (just press Enter)
- Confirm (press Y)
- Do not change the active partition (press N)
- Write the new partition table (press Y)
- Now, we will create a storage pool using partition 4:
zpool create tank ad4s4
- Let's see if it was properly created, using various tools:
zpool status
mount
zfs mount
- We will now experiment with ZFS' compression. We will create two file systems (one normal, the other compressed), and copy a subset of FreeBSD's sources to each of them.
- Create the file systems, and enable compression:
zfs create tank/normal
zfs create tank/compressed
zfs set compression=on tank/compressed
- Copy a subset of FreeBSD sources on them (this takes a few seconds):
cp -r /usr/src/sys /tank/normal
cp -r /usr/src/sys /tank/compressed
- Compare the disk usage of both filesystems:
du -hs /tank/normal /tank/compressed
- ZFS allows increasing the number of on-disk copies of the content of a file system. This can be used to reduce the chances that important files will get lost because of a disk failure.
- Create another file system, and set the number of copies to 3.
zfs create tank/important
zfs set copies=3 tank/important
- Again, we will copy the FreeBSD sources to that file system, and examine the disk usage
cp -r /usr/src/sys /tank/important
du -hs /tank/important
- ZFS also provides built-in snapshots.
- Create another file system
zfs create tank/snap
- Go to that file system, and create a file on it
cd /tank/snap
echo foo > bar
- Snapshot the file system
zfs snapshot tank/snap@before_something_risky
- Modify the file again, and check the content of the file
echo baz >> bar
cat bar
- Change directory to /tank/snap/.zfs/snapshot/, and examine the snapshots and their contents
ls
cat before_something_risky/bar
- List all the created ZFS file systems (including snapshots)
- Roll back to the snapshot, which restores its state in place of the current state of the file system (you might need to change directory outside of the file system first)
cd /
zfs rollback tank/snap@before_something_risky
- Check the content of the file
cd /tank/snap
cat bar
We modified the partition table. On Grid'5000, the partition table is not rewritten at the end of deployment jobs, but it is rewritten when an environment is deployed. To restore the partition table on our nodes, we will now deploy a standard environment on our nodes.
kadeploy -e sid-x64-base-1.1 -m NODE1 -m NODE2 -m NODE3
During this tutorial:
- You learnt how to deploy .dd.gz Kadeploy images, allowing the deployment of other operating systems.
- You learnt how to register changes to such environments.
- You tried two new FreeBSD features: DTrace and ZFS.
Creating the initial image of a non-Linux system
Creating the initial image is the most complex task. The best way to proceed is to install directly on a node with a keyboard and a screen, until a satisfactory image is obtained. If that is not possible, one can use qemu, which supports both x86 and x86_64 platforms. Some commands:
- Start qemu with the local HD: (grub must be installed on the local HD)
qemu -hda /dev/sda
- Start qemu on amd64 with an ISO image, with the local HD attached:
qemu-system-x86_64 -hda /dev/sda -cdrom 6.0-RELEASE-i386-disc1.iso -boot d
- Boot on the installed system, activating serial consoles 0 to 3 (accessible via Ctrl+Alt+1-4):
qemu-system-x86_64 -hda /dev/sda -boot c -serial vc -serial vc -serial vc -serial vc
- To test the installed system without qemu, you have to reboot the system and press 'c' at the GRUB prompt.
- With this technique, the system boots with qemu devices. The first step is to get the serial console working. Then, you have to enable the disk controller, the network card, etc.