Training labs VMs
=================

Nesting virtual machines for fun and profit!

Things to worry about are marked below in *********************

The goal
========

We have a need for an easy-to-use environment in a security training lab
setup. The training is designed to educate engineers about various common
security problems and (more importantly) how to fix them. Some online
training courses from Azeria Labs [1] are the initial exercises, but the
work is expected to expand into other areas too.

The goal is simple (make it easy!), but the setup needed for this is less
so. We want to have tools to provision the following:

1. a ready-to-use VM environment that engineers can deploy on their
   existing x86-based computers (Windows/Linux/Mac), including all the
   toolchain etc. needed for the work (called the "toolchain VM" from
   here on)

2. a further emulated machine where Arm programs can be run, tested and
   debugged (called the "runtime VM" from here on)

For simplicity of deployment, it makes sense to have the runtime VM hosted
*within* the toolchain VM, with automatic configuration. That way, the
engineers using the training labs will have only one logical piece to
worry about: they will not need to deal with the details of setting up
and configuring emulation, for example.

More (obvious?) terminology: we'll call the engineer's machine the "host".

[1] https://azeria-labs.com/writing-arm-shellcode/

Planning and design
===================

Vagrant [2] seems to be a good fit for deploying and controlling the
toolchain VM. It can use a wide range of VM and container technologies as
a backend, but the best supported is Virtualbox [3], and that is
cross-platform - it can run on all three of our desired platforms. Once
Virtualbox is installed, Vagrant deals with all the details of downloading
a "box" (a configured VM image) and starting it. It uses a simple
"Vagrantfile" to describe how a box is chosen, configured and provisioned.
Out of the box, Vagrant sets up a shared directory from the host to the
box (useful for data sharing), and also ssh access from the host to the
box (useful for running command-line tools).

For our runtime VM, we will use qemu [4]. It's a powerful emulator
platform that supports all manner of different architectures. It can run
individual binaries in emulation ("qemu-user"), using libraries etc. as
normal for the emulated platform and translating instructions and system
calls as needed. But for our purposes it looks better to run in
"qemu-system" mode; this emulates a complete machine, then runs a kernel
and userland on top of that. While this takes a little more setup
initially, it's often more reliable (qemu-user is known to struggle with
more complicated binaries using threads, for example). We'll therefore
also need to provide an Arm VM image of the runtime VM for qemu to use.
This is a little more involved than just running Vagrant, but not too
difficult.

[2] https://www.vagrantup.com/
[3] https://www.virtualbox.org/
[4] https://www.qemu.org/

Expected workflow
=================

As Vagrant shares a directory between the host and the toolchain VM,
engineers will be able to use their normal native editor and other tools
on their host machine, and then use the cross-toolchains in the toolchain
VM to compile their code for the Arm target. We will *also* try to share
that same directory with the runtime VM. That will give us a simple view
of the whole project in terms of editing, building and running code.

Deployment should be as simple as possible.
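Once everything is deployed, the day-to-day loop should feel something
like the sketch below. The cross-compiler name matches the Ubuntu
cross-toolchain packages we expect to install, but the exact package set
and the vm_ssh invocation are still assumptions at this stage:

    # On the host: edit code in the shared project directory with your
    # usual tools, then build it inside the toolchain VM with the Arm
    # cross-compiler...
    vagrant ssh -c 'aarch64-linux-gnu-gcc -g -o /vagrant/hello /vagrant/hello.c'

    # ...and run/debug the result on the emulated Arm machine, via the
    # ssh wrapper described later in these notes (invocation illustrative):
    ./vm_ssh runtime /vagrant/hello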
The first run might be time-consuming due to the need to download two VM
images, but that's OK so long as they're not *too* big.

The Vagrantfile for the toolchain VM will be the only piece needed by the
end user, and it should be responsible for doing everything else from
there down:

1. Create and run a Virtualbox VM, using an existing stable Ubuntu image.
   This is trivial out-of-the-box Vagrant usage.

2. Configure that VM for data sharing and access to the toolchain VM.
   Again, this is common Vagrant usage.

3. Inside the toolchain VM, apply any updates that might be needed, then
   install the extra packages we need (qemu and cross-toolchains).

4. Download the runtime VM image from our own source (TBD, probably
   Sharepoint somewhere?).

5. Start the runtime VM image and *also* set up data sharing and access
   to it. This will need a little more configuration in Vagrant, but
   should be simple enough.

6. Download the desired training materials - source code, docs, etc.

7. Tell the user that they're good to go. Point to the start of the
   training material.

************************************
Step #5 is (by far) the hardest piece here. Either we need to preconfigure
the runtime VM in certain ways, *or* we'll need to download a generic
image and modify/configure it in the field before we start it. The first
option is easier to achieve for now, and much faster to deploy - we don't
end up installing packages at runtime onto an emulated system. BUT: it
also means that we'll have to maintain that runtime VM image separately
rather than using a generic image. If we deploy multiple different sets of
training material using this setup, we could end up having to maintain
multiple slightly-different versions of the runtime VM.
************************************

Technical details and (possible) troubles
=========================================

SSH access
----------

As it starts the toolchain VM, Vagrant generates a throwaway SSH key. It
stores the private key in .vagrant/machines/default/virtualbox/private_key
and injects the public key into the toolchain VM at startup, where it is
stored under /home/vagrant/.ssh/authorized_keys, as you'd normally expect
for ssh key access. Vagrant also sets up port forwarding between port 2222
of the host machine and port 22 on the toolchain VM. "vagrant ssh" is then
a simple wrapper around ssh that uses the right key, username and IP
address etc.

To enable consistent-ish access to both the toolchain VM and the runtime
VM, I first tried to set up extra forwarding at the Virtualbox layer:

    HOST            TOOLCHAIN VM            RUNTIME VM
    2222   <->      22
                    ---
    2223   <->      2222          <->       22

... but that did not work - connections to port 2223 on the host would
fail, with very little diagnostic information available. In the end, I
went with using the toolchain VM as a proxy or "jump host". This involves
adding some extra ssh configuration, but as I was already thinking about
adding an extra wrapper script to help with consistent access *anyway*,
this is not too difficult to set up.

To make authentication work in both VMs, we use the same SSH keypair. When
starting the runtime VM inside the toolchain VM, we simply copy the SSH
public key into the shared data directory /vagrant/runtime. The runtime VM
is configured to use that location for its authorized_keys file - see
below. The local ssh config we're using specifies the same private key for
both. Easy! The provided script "vm_ssh" does the right thing on Linux
(and MacOS).
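For reference, a minimal sketch of the kind of ssh configuration involved
is shown below. The host aliases and exact layout are illustrative rather
than what vm_ssh actually generates; the ports and key path follow the
forwarding described above (port 2222 inside the toolchain VM is forwarded
on to port 22 of the runtime VM):

    # Illustrative ssh config fragment - same throwaway key for both VMs
    Host toolchain
        HostName 127.0.0.1
        Port 2222
        User vagrant
        IdentityFile .vagrant/machines/default/virtualbox/private_key
        UserKnownHostsFile /dev/null
        StrictHostKeyChecking no

    Host runtime
        # Reached through the toolchain VM acting as a jump host
        ProxyJump toolchain
        HostName 127.0.0.1
        Port 2222
        User vagrant
        IdentityFile .vagrant/machines/default/virtualbox/private_key
        UserKnownHostsFile /dev/null
        StrictHostKeyChecking no

With something like this in the user's ssh config, "ssh runtime" from the
host hops through the toolchain VM transparently, which is essentially
what vm_ssh is there to wrap up.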
************************************
We may need a tweaked equivalent "vm_ssh.bat" for Windows to use the right
style of file names; let's see.
************************************

Data access
-----------

By default, Vagrant shares the "project" directory (i.e. the directory
where the Vagrantfile lives) into the toolchain VM as /vagrant [5]. This
is a really useful feature.

We can extend this feature ourselves, sharing the same directory from the
toolchain VM to the runtime VM. To do that, we use qemu's built-in support
for a "Plan 9" filesystem export (the "-virtfs" command-line option), then
mount that filesystem inside the runtime VM on /vagrant too (a sketch of
both sides of this is included at the end of these notes). This will give
the user a consistent, easy place to share their data, e.g. when compiling
and running test programs. We can also use it for our own internal
purposes, e.g. for sharing the SSH authorized_keys file.

We will also try to download and store the runtime VM image and associated
files here. That will save us having to make space for them inside the
small toolchain VM.

************************************
I'm *not* sure how well this will work in performance terms - how fast is
the Virtualbox shared filesystem? Testing needed...
************************************

[5] https://www.vagrantup.com/intro/getting-started/synced_folders.html

Setup of the toolchain VM
-------------------------

This *should* be trivial, given the idea behind Vagrant. It's just a case
of generating a Vagrantfile with some config in it. Put that in a git repo
and tell the user to:

* install vagrant, virtualbox and git for their OS
* git pull lab.git
* cd lab.git
* vagrant up

************************************
However, things did not work quite so smoothly during development. For the
directory-sharing feature that we're expecting to use, Virtualbox depends
on its guest (our toolchain VM) having a guest utilities package installed
("virtualbox-guest-utils" on Ubuntu). Initial testing with a Debian image
did not work: Virtualbox started up (with warnings about a version
mismatch), but sharing did not. Annoyingly, Vagrant apparently noticed the
failure and fell back to using a one-time rsync at VM startup. This gave
the appearance of working sharing, but did not stay in sync. I've added
extra config to the Vagrantfile to force *only* Virtualbox-style sharing
(a minimal Vagrantfile sketch showing this is included at the end of these
notes).

Testing with some other Ubuntu boxes also failed - I tried a few variants
of the (very new!) 20.04 release and they exhibited a range of problems
giving unreliable startup. I've switched to 18.04 (aka "ubuntu/bionic64")
and (so far!) that has worked flawlessly.
************************************

Setup of the runtime VM
-----------------------

This is a little more involved, as we don't have easy-to-use tools like
Vagrant here. It's easy enough to write scripts to drive qemu virtual
machines, but the images themselves need to be created or borrowed from
elsewhere. I've simply created a Debian 10.3 (Buster) arm64 image for now,
using qemu and kvm on an arm64 host. It's set up to boot via UEFI using
qemu's pflash interface. Inside the machine I've made the following
changes:

* Set up EFI boot via the removable media path (in case EFI boot variables
  get lost or corrupted)

* Added an fstab entry to mount the /vagrant filesystem from the host
  using the plan9 fs. This does *not* always work automatically due to
  startup timing.
  Made it "noauto" and added an extra @reboot cron job for it

* Added a "vagrant" user

* Symlinked that user's .ssh/authorized_keys to
  /vagrant/runtime/vagrant-pub-key, to allow passwordless ssh login

* Added passwordless sudo access for the vagrant user

* Added a startup script to mount /vagrant and do other startup stuff,
  then run any provisioning if needed

* Added an @reboot cron job to run that script:

    @reboot /usr/local/bin/runtime_vm_startup

************************************
We'll also need a similar 32-bit Arm image to support 32-bit labs. The
driver script "start_runtime" can easily support that, but the image does
not (yet!) exist. It's also possible to use other images here, but we'll
need to find and test them.
************************************
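For reference, here is a sketch of both sides of the /vagrant sharing
described under "Data access" and in the runtime VM changes above. The
mount tag, security model and option details are illustrative; the real
start_runtime and runtime_vm_startup scripts may differ:

    # Toolchain VM side: one extra option on the qemu command line exports
    # the Virtualbox-shared /vagrant directory onwards over virtio/9p
    # (mount tag name illustrative):
    qemu-system-aarch64 ... \
        -virtfs local,path=/vagrant,mount_tag=vagrant,security_model=mapped-xattr

    # Runtime VM side: fstab entry, "noauto" so that boot doesn't stall if
    # the export isn't ready yet:
    vagrant  /vagrant  9p  trans=virtio,version=9p2000.L,noauto  0  0

    # Runtime VM side: the cron-driven startup script then only needs to
    # do something like:
    mount /vagrant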
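Similarly, here is a minimal sketch of the toolchain VM Vagrantfile
described under "Setup of the toolchain VM". The box name and the forced
Virtualbox-style sharing are the settings discussed above; the
provisioning script name is illustrative, and the real file will carry
more configuration (runtime VM download and startup, etc.):

    # Vagrantfile (sketch)
    Vagrant.configure("2") do |config|
      config.vm.box = "ubuntu/bionic64"

      # Force real Virtualbox shared folders; don't let Vagrant silently
      # fall back to a one-time rsync
      config.vm.synced_folder ".", "/vagrant", type: "virtualbox"

      # Updates, qemu, cross-toolchains, runtime VM image download/startup
      # (script name illustrative)
      config.vm.provision "shell", path: "provision_toolchain.sh"
    end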