Making OSv Run on Firecracker

By: Waldek Kozaczuk

Firecracker

Firecracker is a new lightweight KVM-based hypervisor written in Rust, announced at AWS re:Invent 2018. Unlike QEMU, Firecracker is specialized to host Linux guests only and is able to boot micro VMs in ~125 ms. Firecracker itself can only run on Linux on bare-metal machines with Intel 64-bit CPUs (for example i3.metal or other Nitro-based bare-metal EC2 instances).

Firecracker implements a device model with the following I/O devices:

  • paravirtual VirtIO block and network devices over MMIO transport
  • serial console
  • partial keyboard controller
  • PICs (Programmable Interrupt Controllers)
  • IOAPIC (I/O Advanced Programmable Interrupt Controller)
  • PIT (Programmable Interval Timer)
  • KVM clock

Firecracker also exposes a REST API over a UNIX domain socket and can be confined with the so-called jailer to improve security. For more details, look at the design doc and the specification.

If you want to hear more about what it took to enhance OSv so that it boots in 5 ms on Firecracker (10 ms total including the host side), which is roughly 20 times faster than Linux on the same hardware (a 5-year-old MacBook Pro running Ubuntu 18.10), please read the remaining part of this article. In the next section I describe the implementation strategy I arrived at. The following three sections focus on what I had to change in the relevant areas - the boot process, VirtIO and ACPI. Finally, in the epilogue I describe the outcome of this exercise and possible improvements we can make and benefit from in the future.

If you want to try OSv on Firecracker before reading this article follow this wiki.

Implementation Strategy

OSv implements VirtIO drivers and is very well supported on QEMU/KVM. Given that Firecracker is based on KVM and exposes VirtIO devices, at first it seemed OSv might boot and run on it out of the box with some small modifications. As the first experiments and further research showed, the task in reality was not that trivial: the initial attempts to boot OSv on Firecracker caused a KVM exit and OSv did not even print its first boot message.

For starters I had to identify which OSv artifact to use as an argument to the Firecracker /boot-source API call. It could not be the plain usr.img (or its derivative) used with QEMU, as Firecracker expects a 64-bit ELF (Executable and Linkable Format) vmlinux kernel. The closest thing to it in OSv-land is loader.elf (enclosed inside usr.img) - a 64-bit ELF file with a 32-bit entry point, start32. Finally, given that it is not possible to connect to OSv running on Firecracker with gdb (as it is with QEMU), I could not use this technique to figure out where things break.

It became clear to me I should first focus on making OSv boot on Firecracker without block and network devices. Luckily OSv can be built with RAM-FS, where application code is placed in the bootfs part of loader.elf.

Then I would enhance the VirtIO layer to support block and network devices over the MMIO transport. Initially these changes seemed quite reasonable to implement, but they turned out to be much more involved in the end.

Finally I had to tweak some parts of OSv to make it work when ACPI (Advanced Configuration and Power Interface) is unavailable.

The next three sections describe each step of this plan in detail.

Booting

In order to make OSv boot on Firecracker, I first had to understand how the current OSv boot process works.

Originally OSv was designed to boot in 16-bit mode (aka real mode): it expects the hypervisor to load the MBR (Master Boot Record), which is the first 512 bytes of the OSv image, at address 0x7c00 and execute it by jumping to that address. At this point the OSv bootloader (the code in these 512 bytes) loads the command line found in the next 63.5 KB of the image using interrupt 0x13. Then it loads the remaining part of the image, which is lzloader.elf (loader.elf plus fastlz decompression logic), at address 0x100000, in 32KB chunks, using interrupt 0x13 and switching back and forth between real and protected mode. Next it reads the size of available RAM using interrupt 0x15 and jumps to the code at the beginning of the 1st MB, which decompresses lzloader.elf in 1MB chunks starting from the tail and going backwards. Eventually, after loader.elf is placed in memory at address 0x200000 (the 2nd MB), logic in boot16.S switches to protected mode and jumps to start32 to prepare the switch to long mode (64-bit). Please note that start32 is a 32-bit entry point of the otherwise 64-bit loader.elf. For more details please read this Wiki.

Firecracker, on the other hand, expects the image to be a vmlinux 64-bit ELF file and loads its LOAD segments into RAM at the addresses specified by the ELF program headers. Firecracker also puts the VM into long mode (aka 64-bit mode), sets the state of the relevant registers, and sets up paging tables that map virtual memory to physical memory the way Linux expects. Finally it passes memory information and the boot command line in the boot_params structure and jumps to the vmlinux entry point startup_64 to let the Linux kernel continue its boot process.

So the challenge is: how do we modify the boot logic to support booting OSv as a 64-bit vmlinux-format ELF and at the same time retain the ability to boot in real mode using the traditional usr.img image file? For sure we need to replace the current 32-bit entry point start32 of loader.elf with a 64-bit one - vmlinux_entry64 - that will be called by Firecracker (which will also load loader.elf into memory at 0x200000, as the ELF header demands). At the same time we also need to place start32 at some fixed offset so that boot16.S knows where to jump to.

So what exactly should the new vmlinux_entry64 do? Firecracker sets the VM up in a 64-bit state, and OSv already provides a 64-bit start64 function, so one could ask: why not simply jump to it and be done with it? Unfortunately this would not work (as I tested), because the memory paging tables and CPU setup that Linux expects (and Firecracker provides) differ slightly from what OSv expects. So possibly vmlinux_entry64 needs to reset the paging tables and the CPU the OSv way? Alternatively, vmlinux_entry64 could switch back to protected mode, jump to start32 and let it set up the VM the OSv way. I tried that as well, but for some reason it did not work either.

Luckily we do not need to worry about segmentation, which Firecracker sets up as a flat memory model - typical in long mode and exactly what OSv expects.

In the end, based on many trial-and-error attempts, I came to the conclusion that vmlinux_entry64 should do the following:

  1. Extract the command line and memory information from the Linux boot_params structure, whose address is passed in by Firecracker in the RSI register, and copy them to another place structured the same way as if OSv had booted through boot16.S (please see extract_linux_boot_params for details; a rough sketch follows this list).
  2. Reset the CR0 and CR4 control registers to set global CPU features the OSv way.
  3. Reset the CR3 register to point to OSv's PML4 table, which maps the first 1GB of memory one-to-one with 2MB pages (for more information about memory paging please read this article).
  4. Finally, jump to start64 to complete the boot process and start OSv.
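
For illustration, here is a rough sketch of what the first step, extract_linux_boot_params, has to do. This is not the actual OSv code - the two osv_* helpers are hypothetical - and it assumes the standard Linux x86 "zero page" layout of boot_params as documented in the kernel's boot protocol:

#include <cstdint>
#include <cstring>

struct e820_entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;          // 1 == usable RAM
} __attribute__((packed));

struct linux_boot_params {
    uint8_t    pad1[0x1e8];
    uint8_t    e820_entries;            // 0x1e8: number of e820 map entries
    uint8_t    pad2[0x228 - 0x1e9];
    uint32_t   cmd_line_ptr;            // 0x228: physical address of the command line
    uint8_t    pad3[0x2d0 - 0x22c];
    e820_entry e820_table[128];         // 0x2d0: the e820 memory map itself
} __attribute__((packed));

extern char* osv_cmdline;                                             // hypothetical
extern void osv_register_memory_range(uint64_t addr, uint64_t size);  // hypothetical

extern "C" void extract_linux_boot_params(linux_boot_params* bp)
{
    // Copy the kernel command line to wherever OSv expects it;
    // memory is still identity-mapped by Firecracker at this point.
    std::strcpy(osv_cmdline,
                reinterpret_cast<const char*>(uintptr_t(bp->cmd_line_ptr)));
    // Register every usable RAM range from the e820 map.
    for (int i = 0; i < bp->e820_entries; i++) {
        const auto& e = bp->e820_table[i];
        if (e.type == 1) {
            osv_register_memory_range(e.addr, e.size);
        }
    }
}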

The code below is a slightly modified version of vmlinux_entry64 from vmlinux-boot64.S, which implements the steps described above in GAS (GNU Assembler) syntax.

# Call extract_linux_boot_params with the address of
# boot_params struct passed in RSI register to 
# extract cmdline and memory information
mov %rsi, %rdi
call extract_linux_boot_params

# Reset paging tables and other CPU settings the way
# OSv expects them
mov $BOOT_CR4, %rax
mov %rax, %cr4

lea ident_pt_l4, %rax
mov %rax, %cr3

# Write the EFER MSR, setting the LME (Long Mode Enable)
# and NXE (No-Execute Enable) bits as OSv expects
mov $0xc0000080, %ecx
mov $0x00000900, %eax
xor %edx, %edx
wrmsr

mov $BOOT_CR0, %rax
mov %rax, %cr0

# Continue 64-bit boot logic by jumping to start64 label
mov $OSV_KERNEL_BASE, %rbp
mov $0x1000, %rbx
jmp start64


As you can see, making OSv boot on Firecracker was the trickiest part of the whole exercise.

Virtio

Unlike the boot process, enhancing the virtio layer in OSv was not as tricky or hard to debug, but it was the most labor-intensive part and required a lot of research, including reading the spec and the Linux code for comparison.

Before diving in, let us first get a glimpse of VirtIO and its purpose. The VirtIO specification defines standard virtual (sometimes called paravirtual) devices, including network, block, scsi and other device types. It effectively dictates how the hypervisor (host) should expose those devices, as well as how the guest should detect, configure and interact with them at runtime in the form of a driver. The objective is to define devices that can operate in the most efficient way and minimize the number of performance-costly exits from guest to host.

Firecracker implements virtio MMIO block and net devices. MMIO (Memory-Mapped IO) is one of three VirtIO transport layers (MMIO, PCI, CCW); it was modeled after PCI and differs mainly in how MMIO devices are configured and initialized. Unfortunately, to my despair, OSv only implemented the PCI transport and was missing an MMIO implementation. On top of that, to make things worse, it implemented the legacy (pre-1.0) version of virtio, from before the spec was finalized in 2016. So two things had to be done - refactor the OSv virtio layer to support both legacy and modern PCI devices, and implement virtio MMIO.

In order to design and implement the right changes, I first had to understand the existing implementation of the virtio layer. OSv has two orthogonal but related abstraction layers in this area - driver and device classes. The virtio::virtio_driver class serves as a base class with common driver logic and is extended by the virtio::blk, virtio::net, virtio::scsi and virtio::rng classes to provide implementations for the relevant device types. For better illustration please look at this ascii art:


 hw_device <|---
               | 
       pci::function <|--- 
                         |
                  pci::device
                         ^                 |-- virtio::net
                  (uses) |                 |
                         |                 |-- virtio::blk
 hw_driver <|--- virtio::virtio_driver <|--|
                                           |-- virtio::scsi
                                           |
                                           |-- virtio::rng


As you can tell from the graphics above, virtio_driver interacts directly with pci::device, so in order to add support for MMIO devices I had to refactor it to make it transport-agnostic. Of all the options I took into consideration, the least invasive and most flexible one involved creating a new abstraction to model a virtio device. To that end I ended up heavily refactoring the virtio_driver class and defining the following new virtual device classes: virtio::virtio_device (the new transport-agnostic base), virtio_pci_device with its virtio_legacy_pci_device and virtio_modern_pci_device subclasses, and virtio::mmio_device.

The method is_modern(), declared in the virtio_device class and overridden in its subclasses, is used in a few places in virtio_driver and its subclasses, mostly to drive the slightly different initialization logic of legacy and modern virtio devices.

For better illustration of the changes and relationship between new and old classes please see the ascii-art UML-like class diagram below:


               |-- pci::function <|--- pci::device
               |                              ^
               |               (delegates to) |
               |                              |        |-- virtio_legacy_pci_device
 hw_device <|--|             --- virtio_pci_device <|--|
               |             |                         |-- virtio_modern_pci_device
               |             |
               |             v
               |-- virtio::virtio_device <|--- virtio::mmio_device
                   ---------------------
                   | bool is_modern()  |
                   ---------------------
                             ^             |-- virtio::net
                      (uses) |             |
                             |             |-- virtio::blk
 hw_driver <|--- virtio::virtio_driver <|--|
                                           |-- virtio::scsi
                                           |
                                           |-- virtio::rng


To recap, most of the coding went into a major refactoring of the virtio_driver class to make it transport-agnostic and delegate to virtio_device, extracting the PCI logic out of virtio_driver into virtio_pci_device and virtio_legacy_pci_device, and finally implementing the new virtio_modern_pci_device and virtio::mmio_device classes. Thanks to this approach, the changes to the subclasses of virtio_driver (virtio::net, virtio::blk, etc.) were pretty minimal and one of the critical classes - virtio::vring - stayed pretty much intact.
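
To give a flavor of the new abstraction, here is a rough sketch of the transport-agnostic device interface that virtio_driver now programs against; the method names and bodies are illustrative and do not exactly match the OSv code:

#include <cstdint>

namespace virtio {

// Transport-agnostic view of a virtio device (sketch).
class virtio_device {
public:
    virtual ~virtio_device() = default;
    // true for virtio 1.0+ ("modern") devices, false for legacy PCI devices;
    // drives the slightly different feature negotiation and queue setup
    virtual bool is_modern() = 0;
    virtual void init() = 0;                              // reset and discover the device
    virtual uint64_t get_available_features() = 0;        // feature bits offered by the host
    virtual void set_enabled_features(uint64_t bits) = 0;
    virtual void setup_queue(int index, uint64_t descriptor_area) = 0;
    virtual void kick_queue(int index) = 0;               // notify the host of new buffers
};

// MMIO transport: all registers live in a memory-mapped window whose base
// address, size and IRQ come from the command line, not from PCI enumeration.
class mmio_device : public virtio_device {
public:
    mmio_device(uint64_t base, uint64_t size, unsigned irq)
        : _base(base), _size(size), _irq(irq) {}
    bool is_modern() override { return true; }            // MMIO devices are always "modern"
    void init() override { /* read magic/version/device-id registers at _base */ }
    uint64_t get_available_features() override { return 0; /* read DeviceFeatures */ }
    void set_enabled_features(uint64_t) override { /* write DriverFeatures */ }
    void setup_queue(int, uint64_t) override { /* program the Queue* registers */ }
    void kick_queue(int index) override { /* write index to QueueNotify */ (void)index; }
private:
    uint64_t _base;
    uint64_t _size;
    unsigned _irq;
};

}

virtio_pci_device and its legacy/modern subclasses implement the same interface on top of pci::device, which is what lets virtio_driver stay oblivious to the transport.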

A big motivation for implementing the modern virtio PCI device (as opposed to implementing only the legacy one) was to have a way to exercise and test a modern virtio device with QEMU. That way I could have extra confidence that the heaviest refactoring in virtio_driver was correct even before testing it with Firecracker, which exposes modern MMIO devices. There is also a good chance it will make it easier to enhance the virtio layer to support the new VirtIO 1.1 spec once it is finalized (for a good overview see here).

Lastly, given that MMIO devices cannot be detected in a similar fashion to PCI ones and instead are passed by Firecracker on the command line in the format the Linux kernel expects, I also had to enhance the OSv command line parsing logic to extract the relevant configuration bits. On top of that I added a boot parameter to skip PCI enumeration and that way save an extra 4-5 ms of boot time.
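
For reference, Firecracker describes each MMIO device with a Linux-style command line entry of roughly the form virtio_mmio.device=<size>@<base-address>:<irq> (for example virtio_mmio.device=4K@0xd0000000:5). A minimal parsing sketch - not the actual OSv parser - could look like this:

#include <cstdint>
#include <cstdio>
#include <string>

struct mmio_device_info {
    uint64_t base;   // physical address of the MMIO register window
    uint64_t size;   // size of the window (the "4K" part)
    unsigned irq;    // interrupt line assigned to the device
};

// Parse one "<size><K|M>@<hex base>:<irq>" option value, e.g. "4K@0xd0000000:5".
static bool parse_mmio_device(const std::string& opt, mmio_device_info& out)
{
    char unit = 0;
    unsigned long long size = 0, base = 0;
    unsigned irq = 0;
    if (std::sscanf(opt.c_str(), "%llu%c@%llx:%u", &size, &unit, &base, &irq) != 4) {
        return false;
    }
    out.size = size * (unit == 'K' ? 1024ULL : unit == 'M' ? 1024ULL * 1024 : 1);
    out.base = base;
    out.irq = irq;
    return true;
}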

ACPI

The last and simplest part of the exercise was to fill in the gaps in OSv to make it deal with situation when ACPI is unavailable.

Firecracker does not implement ACPI, which OSv uses for power handling and to discover CPUs. Instead, OSv had to be changed to boot without ACPI and read CPU information from the MP table (a rough sketch of locating it appears after the list below). For more information about the MP table read here or there. All in all I had to enhance OSv in the following ways:

  • modify the ACPI-related logic to detect whether ACPI is present
  • modify the relevant places (CPU detection, power off) that rely on ACPI to fall back to an alternative mechanism instead of aborting when ACPI is not present
  • modify the pvpanic probing logic to skip probing if ACPI is not available
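
As an illustration, here is a rough sketch (not the actual OSv code) of how the MP floating pointer structure can be located when ACPI is absent. Per the Intel MP specification, the 16-byte structure starts with the signature "_MP_" on a 16-byte boundary in the EBDA, the last KB of base memory, or the BIOS ROM area 0xF0000-0xFFFFF:

#include <cstdint>
#include <cstring>

struct mpf_structure {
    char     signature[4];   // "_MP_"
    uint32_t config_table;   // physical address of the MP configuration table
    uint8_t  length;         // structure length in 16-byte units (1)
    uint8_t  spec_rev;
    uint8_t  checksum;
    uint8_t  features[5];
} __attribute__((packed));

static mpf_structure* scan_range(uintptr_t start, uintptr_t end)
{
    for (uintptr_t addr = start; addr + sizeof(mpf_structure) <= end; addr += 16) {
        auto* mpf = reinterpret_cast<mpf_structure*>(addr);
        if (std::memcmp(mpf->signature, "_MP_", 4) == 0) {
            return mpf;   // a real implementation would also verify the checksum
        }
    }
    return nullptr;
}

static mpf_structure* find_mp_floating_pointer()
{
    // A real implementation would also scan the EBDA (its segment is at 0x40E).
    if (auto* mpf = scan_range(0x9fc00, 0xa0000)) {  // last KB of base memory
        return mpf;
    }
    return scan_range(0xf0000, 0x100000);            // BIOS ROM area
}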

Epilogue

With all the changes described above implemented, OSv can boot on Firecracker.

OSv v0.53.0-6-gc8395118
2019-04-17T22:28:29.467736397 [anonymous-instance:WARN:vmm/src/lib.rs:1080] Guest-boot-time =   9556 us 9 ms,  10161 CPU us 10 CPU ms
	disk read (real mode): 0.00ms, (+0.00ms)
	uncompress lzloader.elf: 0.00ms, (+0.00ms)
	TLS initialization: 1.13ms, (+1.13ms)
	.init functions: 2.08ms, (+0.94ms)
	SMP launched: 3.43ms, (+1.35ms)
	VFS initialized: 4.12ms, (+0.69ms)
	Network initialized: 4.45ms, (+0.33ms)
	pvpanic done: 5.07ms, (+0.62ms)
	drivers probe: 5.11ms, (+0.03ms)
	drivers loaded: 5.46ms, (+0.35ms)
	ROFS mounted: 5.62ms, (+0.17ms)
	Total time: 5.62ms, (+0.00ms)
Hello from C code


The console log above, with bootchart information from an example run of OSv with a Read-Only FS on Firecracker, shows it took slightly less than 6 ms to boot. As you can notice, OSv spent no time loading its image in real mode or decompressing it, which is expected because OSv is booted as an ELF and these two phases are completely bypassed.

Even though ~5 ms is already a very low number, the TLS initialization and ‘SMP launched’ phases could be looked at to see if we can optimize them further. Other areas of interest are memory utilization - OSv needs a minimum of 18MB to run on Firecracker - and network performance, which suffers a little compared to QEMU/KVM and might need to be optimized on the Firecracker side itself.

On the other hand, it is worth noting that the block device seems to work much faster - for example, mounting a ZFS filesystem is over 4 times faster on Firecracker: on average 60 ms on Firecracker vs 260 ms on QEMU.

Looking toward the future, the Firecracker team is working on ARM support, and given that OSv already unofficially supports this platform and used to boot on XEN/ARM at some point, it might not be that difficult to make OSv boot on a future Firecracker ARM version.

Finally, this work might make it easier to boot OSv on NEMU and QEMU 4.0 in Linux direct kernel boot mode. It might also make it easier to implement support for the new VirtIO 1.1 spec.

Serverless Computing With OSv

By: Nadav Har’El, Benoît Canet

Serverless computing, a.k.a. Function-as-a-Service

The traditional approach to implementing applications on the cloud is the IaaS (Infrastructure-as-a-Service) approach. In an IaaS cloud, application authors rent virtual machines and install their own software to run their application. However, when an application needs, for example, a database, the application writer often does not have the necessary expertise to choose the database, install it, configure and tweak it, and dynamically change the number of VMs running this database. This is where the “PaaS” (Platform-as-a-Service) cloud steps in: The PaaS cloud does not give application writers virtual machines, but rather a new platform with various services. One of these services can be a database service: The application makes database requests - which could be one each second or a million each second - and does not have to care or worry whether one machine, or 1000 machines, are actually needed to provide this service. The cloud provider charges the application owner for these requests, and the amount of work they actually do.

But it is not enough that the PaaS cloud provides building blocks such as databases, queue services, object stores, and so on. An application also needs glue code combining all these building blocks into the operation which the application needs to do. So even on PaaS, application writers start virtual machines to run this glue code. Yes, again VMs and all the problems associated with them (installation, scaling, etc.). But recently there has been a trend towards a serverless PaaS cloud, where the application developer does not need to rent VMs. Instead, the cloud provides Function-as-a-Service (FaaS). FaaS implementations (such as Amazon Lambda, Google Cloud Functions or Microsoft Azure Functions) run short functions which the application author writes in high-level languages like JavaScript or Java, in response to certain events. These functions in turn use the various PaaS services (such as database requests) to perform their job. The application author is freed from worrying about how or where these functions are run - it is up to the cloud implementation to ensure that whether one or a million of these functions need to run per second, they will get the necessary resources to do so.

Implementation, and why OSv is a winner

How could function-as-a-service be implemented by the cloud provider?

It is very inefficient to start a VM for every invocation of a function, which could last for a fraction of a second. A more reasonable approach is to start a VM running the runtime environment, e.g., Node.js or Java, and then send to it many different requests. But if we were to start a single instance of the runtime environment to run the functions of many different clients, this would carry significant security risks: An exploit found in the runtime implementation may lead to one application being able to view or modify the functions run by another application.

So instead of having one VM serve multiple applications of different clients, it is safer to start separate VMs for each application: A single VM will run multiple functions before shutting down, but all of these functions will be the same one, or at least belong to the same application. Having a VM dedicated to the application and its small set of functions also makes it more efficient to run these functions - this VM can load and compile the functions and relevant libraries once, before running the same function or functions many times. Having the VM dedicated to the client also makes it easier to charge the client by actual CPU usage and memory usage of the VMs started for him.

But the hard part of this implementation is scaling: When the number of functions being run by one application changes from second to second, we also need to change the number of VMs dedicated to running these functions. Leaving behind too many of these VMs as spares costs money, as resources (especially memory) are being wasted. Moreover, in the event of cloud bursting - a sudden, unexpected burst of requests - we may need to start many more VMs than we had previously. For these two reasons, it is very important that we are able to boot and shut down these function-running VMs as quickly as possible, preferably in a fraction of a second.

OSv, similar to other unikernels, boots and shuts down very quickly. But what makes OSv a better fit for this use case than any of the other unikernels is the fact that it can run unmodified Linux executables, and in particular the complex run-time environments and languages we wish FaaS to support, such as Node.js and Java, as well as user-provided native code.

A FaaS implementation using OSv might work as follows:

  1. When the FaaS needs to run a certain application’s function, if a VM belonging to this application is ready to accept more requests, we send it the request to run the function. Otherwise, when all the application’s VMs are busy, we start a new VM:
  2. Starting a new VM will take only a fraction of a second. Beyond OSv’s quick boot, another reason for this quickness is that the VM image will not have to be sent over the network: All these VMs, regardless of which application they work for, boot from the same identical image (containing OSv, Node.js or Java, and the FaaS glue), and the image is immutable - these VMs cannot write back to it. This immutable image also means that for this use case, OSv does not need the read-write ZFS file system, and that further reduces OSv’s boot time and memory overhead.
  3. To ensure that the end-user doesn’t experience even a fraction-of-a-second latency when a new VM is started, we may choose to preemptively start new VMs as soon as the existing VMs are about to get filled up, before they actually do get filled up. The fact we can start new VMs very quickly allows us to keep the number of spare VMs low.
  4. When the rate of function executions for a particular application diminishes, the FaaS system will stop sending new requests to some of the VMs, and very soon such VMs will become idle and can be shut down. OSv’s shutdown is very quick, but in this case we don’t even have to bother with a clean shutdown - we can stop an idle VM instantaneously because we know there is not even a disk needed to be flushed.

Existing FaaS implementations, like Amazon’s Lambda, charge the application for each function’s wall-clock run time (and in large 100ms ticks). Paying for idle time makes it very expensive to run functions which need to make a request, wait for its response, and do something with it. We’ve seen bloggers recommend working around this problem by tricks such as starting multiple unrelated requests in the same lambda and then waiting for all of them to respond. We believe, however, that FaaS needs to have more natural support for functions which block, which we believe will be the typical use of FaaS. This natural support could be done with Node.js’s futures and continuations (the application starts an asynchronous operation, and runs a non-blocking function when it completes. https://serverless.com does this on Amazon Lambda), or alternatively by the implementation transparently running multiple application functions in parallel on the same VM. In any case, the client should pay only for actual CPU time used by the function or VM bringup, as well as for the memory used by those VMs.

Note that although the FaaS implementation we propose is very scalable, at the low end of the scale - e.g., just one request each second - it is not cost-effective: It does not make sense to bring up the VM and the runtime environment each second, as the better part of that second will be wasted just on this bringup; the alternative is to leave the VM up but idle most of the time. In either case, the memory required by the runtime environment will be reserved for the application continuously, so the cost of this memory will put a lower limit on the price of a low-usage function. Note that if the function’s usage becomes even lower - say just once a minute - it again becomes a cost-effective option to bring the VMs up and down each time.

Epilogue

We believe that the difficulties of running code on VMs will drive more and more application developers to look for alternatives for running their code, alternatives such as Function-as-a-Service (FaaS). We already explored this and related directions in the past in this paper from 2013.

We showed in this post that it makes sense to implement FaaS on top of VMs, and that OSv is a better fit for running those VMs than either Linux or other unikernels. That is because OSv has the unique combination of allowing very fast boot and instantaneous shutdowns, at the same time as being able to run the complex runtime environments we wish to support (such as Node.js and Java).

An OSv-based implementation of FaaS will support “cloud bursting” - an unexpected, sudden, increase of load on a single application, thanks to our ability to boot many new OSv VMs very quickly. Cloud bursting is one of the important use cases being considered by the MIKELANGELO project, a European H2020 research project which the authors of this post contribute to, and which is based on OSv as we previously announced.

NFS on OSv or “How I Learned to Stop Worrying About Memory Allocations and Love the Unikernel”

By Benoît Canet and Don Marti

A new type of OSv workload

The MIKELANGELO project aims to bring High Performance Computing (HPC) to the cloud. HPC traditionally involves bleeding edge technologies, including lots of CPU cores, Infiniband interconnects between nodes, MPI libraries for message passing, and, surprise—NFS, a very old timer of the UNIX universe.

In an HPC context this networked filesystem is used to get the data inside the compute node before doing the raw computation, and then to extract the data from the compute node.

Some OSv NFS requirements

For HPC NFS is a must, so we worked to make it happen. We had some key requirements:

  • The NFS driver must go reasonably fast
  • The implementation of the NFS driver must be done very quickly to meet the schedule of the rest of the MIKELANGELO project
  • There is no FUSE (Filesystem in User Space) implementation in OSv
  • OSv is a C++ unikernel, so the implementation must make full use of its power
  • The implementation must use the OSv VFS (Virtual File System) layer, and so be transparent for the application

Considering alternatives

The first possibility that we can exclude right away is doing an NFS implementation from scratch. This subproject is simply too short on time.

The second possibility is to leverage an implementation from an existing mainstream kernel and simply port it to OSv. The pro would be code reuse, but this comes with a lot of cons.

  • Some implementation licenses do not match well with the unikernel concept where everything can be considered a derived work of the core kernel
  • Every operating system has its own flavor of VFS. The NFS subproject would be at risk of writing wrappers around another operating system’s VFS idiosyncrasies
  • Most mainstream kernel memory allocators are very specific and complex, which would lead to more insane wrappers.

The third possibility would be to use some userspace NFS implementation, as their code is usually straightforward POSIX and they provide a nice API designed to be embedded easily in another application. But wait! Didn’t we just say the implementation must be in the VFS, right in the middle of the OSv kernel? There is no FUSE on OSv.

Enter the Unikernel

Traditional UNIX-like operating system implementations are split in two:

  • Kernel space: a kernel doing the low level plumbing everyone else will use
  • User space: a bunch of applications using the facilities provided by the kernel in order to accomplish some tasks for the user

One twist of this split is that kernel-space and user-space memory addresses are totally separated using the MMU (Memory Management Unit) hardware of the processor. It also usually implies two totally different sets of programming APIs, one for kernel space and one for user space, and, needless to say, a lot of memory copies each time some data must cross the frontier from kernel space to user space.

A unikernel such as OSv is different. There is only one big address space and only one set of programming APIs. Therefore you can use POSIX and Linux userspace APIs right in an OSv driver. So there are no API wrappers to write and no memory copies.

Another straightforward consequence of this is that standard memory management functions including malloc(), posix_memalign(), free() and friends, will just work inside an OSv driver. There are no separate kernel-level functions for managing memory, so no memory allocator wrappers needed.

Meet libnfs

libnfs, by Ronnie Sahlberg, is a user space NFS implementation for Linux, designed to be embedded easily in an application.

It’s already used in successful programs like Fabrice Bellard’s QEMU, and the author is an established open source developer who will not disappear in a snap.

Last but not least, the libnfs license is LGPL. So far so good.

The implementation phase

The implementation phase went fast for a networked filesystem. Reviewing went smoothly thanks to Nadav Har’El’s help, and the final post on the osv-devel mailing list was the following:

OSV nfs client

Some extra days were spent fixing the occasional bugs and polishing the result, and now the MIKELANGELO HPC developers have a working NFS client.

Some code highlights

Almost 1:1 mapping

Given the unikernel nature of OSv, a useful system call like truncate(), used to adjust the size of a file, boils down to:

static int nfs_op_truncate(struct vnode *vp, off_t length)
{
   int err_no;
   auto nfs = get_nfs_context(vp, err_no);

   if (err_no) {
       return err_no;
   }

   int ret = nfs_truncate(nfs, get_node_name(vp), length);
   if (ret) {
       return -ret;
   }

   vp->v_size = length;

   return 0;
}

OSv allowed us to implement this syscall with a very thin shim without involving any additional memory allocation wrapper.

C++ empowers you to do powerful things in kernel code

One of the known limitations of libnfs is that it’s not thread-safe. See this mailing list posting on multithreading and performance. OSv is threaded—so heavily threaded that there is no concept of a process in OSv, just threads. Clearly this is a problem, but OSv is written in modern C++, which provides us with modern tools.

This single line allows us to work around the libnfs single-threaded limitation.

thread_local std::unordered_map<std::string,
                               std::unique_ptr<mount_context>> _map;

Here the code makes an associative map between the mount point (the place in the filesystem hierarchy where the remote filesystem appears) and the libnfs mount_context.

The one twist to notice here is thread_local: this single C++ keyword automatically makes a separate instance of this map per thread. The consequence is that every thread/mount point pair can have its own separate mount_context. Although an individual mount_context is not thread-safe, that is no longer an issue.
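
To make the mechanism concrete, here is a rough sketch (names and details are approximations, not the exact OSv code) of how such a per-thread map can hand each thread its own private libnfs context:

#include <memory>
#include <string>
#include <unordered_map>

struct nfs_context;   // opaque libnfs handle

// Hypothetical wrapper owning one libnfs context for one mounted share.
struct mount_context {
    explicit mount_context(const std::string& url) : nfs(nullptr) {
        // A real implementation would call nfs_init_context() and
        // nfs_mount() here to connect to the server behind `url`.
        (void)url;
    }
    nfs_context* nfs;
};

// One map per thread: thread_local gives every thread its own copy.
thread_local std::unordered_map<std::string,
                                std::unique_ptr<mount_context>> _map;

// Return this thread's private context for a mount point, creating it lazily
// on first use. Because no mount_context is ever shared between threads,
// libnfs' lack of thread safety never becomes a problem.
static mount_context* get_mount_context(const std::string& mount_point,
                                        const std::string& url)
{
    auto it = _map.find(mount_point);
    if (it == _map.end()) {
        it = _map.emplace(mount_point,
                          std::make_unique<mount_context>(url)).first;
    }
    return it->second.get();
}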

Conclusion

As we have seen here, the OSv unikernel is different in a lot of good ways, and allows you to write kernel code fast.

  • Standard POSIX functions just work in the kernel.
  • C++, which is rarely used in other kernels, comes with blessings.

Scylla will keep improving OSv with the various MIKELANGELO partners, and we should see exciting new hot technologies like vRDMA on OSv in the not so distant future.

The MIKELANGELO research project is a three-year research project sponsored by the European Commission’s Horizon 2020 program. The goal of MIKELANGELO is to make the cloud more useful for a wider range of applications, and in particular make it easier and faster to run high-performance computing (HPC) and I/O-intensive applications in the cloud. For project updates, visit the MIKELANGELO site, or subscribe to this blog’s RSS feed.

Project Mikelangelo Update

By Nadav Har’El

A year ago, we reported (see Researching the Future of the Cloud) that ScyllaDB and eight other industrial and academic partners started the MIKELANGELO research project. MIKELANGELO is a three-year research project sponsored by the European Commission’s Horizon 2020 program. The goal of MIKELANGELO is to make the cloud more useful for a wider range of applications, and in particular make it easier and faster to run high-performance computing (HPC) and I/O-intensive applications in the cloud.

company logos

Last week, representatives of all MIKELANGELO partners (see company logos above, and group photo below) met with the Horizon 2020 reviewers in Brussels to present the progress of the project during the last year. The reviewers were pleased with the project’s progress, and especially pointed out its technical innovations.

project participants group photo

Represented by Benoît Canet and yours truly, ScyllaDB presented Seastar, our new C++ framework for efficient yet complex server applications. We demonstrated the sort of amazing performance improvements which Seastar can bring, with ScyllaDB - our implementation of the familiar NoSQL database Apache Cassandra with the Seastar framework. In the specific use case we demonstrated, an equal mixture of reads and writes, ScyllaDB was 7 times faster (!) than Cassandra. And we didn’t even pick ScyllaDB’s best benchmark to demonstrate (we’d seen even better speedups in several other use cases). Seastar-based middleware applications such as ScyllaDB hold the promise of making it significantly easier and cheaper to deploy large-scale Web or Mobile applications in the cloud.

Another innovation that ScyllaDB brought to the MIKELANGELO project is OSv, our Linux-compatible kernel specially designed and optimized for running on cloud VMs. Several partners demonstrated running their applications on OSv. One of the cool use cases was aerodynamic simulations done by XLAB and Pipistrel. Pipistrel is a designer and manufacturer of innovative and award-winning light aircraft (like the one in the picture below), and running their CFD simulations on the cloud, using OSv VMs and various automation tools developed by XLAB, will significantly simplify their simulation workflow and make it easier for them to experiment with new aircraft designs.

Pipistrel aircraft photo

Other partners presented their own exciting developments: Huawei implemented RDMA virtualization for KVM, which allows an application spread across multiple VMs on multiple hosts to communicate using RDMA (remote direct-memory-access) hardware in the host. In a network-intensive benchmark, virtualized RDMA improved performance 5-fold. IBM presented improvements to their earlier ELVIS research, which allow varying the number of cores dedicated to servicing I/O, and achieve incredible amounts of I/O bandwidth in VMs. Ben-Gurion University security researchers implemented a scary “cache side-channel attack” where one VM can steal secret keys from another VM sharing the same host. Obviously their next research step will be stopping such attacks! Intel developed a telemetry framework called “snap” to collect and to analyse all sorts of measurements by all the different cloud components - VM operating systems, hypervisors, and individual applications. HLRS and GWDG, the super-computer centers of the universities of Stuttgart and Göttingen, respectively, built the clouds on which the other partners’ developments will be run, and brought in use cases of their own.

Like ScyllaDB, all partners in the MIKELANGELO project believe in openness, so all technologies mentioned above have already been released as open-source. We’re looking forward to the next year of the MIKELANGELO project, when all these exciting technologies will continue to improve separately, as well as be integrated together to form the better, faster, and more secure cloud of the future.

For more updates, follow the ScyllaDB blog.

Building OSv Images Using Docker

By David Jorm and Don Marti

Why build OSv images under Docker?

Building OSv from source has several advantages, including the ability to build images targeting different execution environments. The CloudRouter project is working on integrating the build script into its continuous integration system, automatically producing nightly rebuilds of all supported OSv application images.

The main problem with this approach is that it requires a system to be appropriately configured with all the necessary dependencies and source code to run builds. To build a scalable and reproducible continuous integration system, we really want to automate the provisioning of new build servers. An ideal way to achieve this goal is by creating a Docker image that includes all the necessary components to produce OSv image builds.

osv-builder: a Docker based build/development environment for OSv

The osv-builder Docker image provides a complete build and development environment for OSv, including OSv application images and appliances. It has been developed by Arun Babu Neelicattu from IIX. To download the image, run:

docker pull cloudrouter/osv-builder

OSv includes a range of helper scripts for building and running OSv images. The build script will compile OSv from the local source tree, then create a complete OSv image for a given application, based on the application’s Makefile. For more on scripts/build, see the recent blog post on the OSv build system.

Capstan

The image comes with Capstan pre-installed. Note that to use Capstan, you’ll have to run the container with the --privileged option, as it requires the KVM kernel module. For example, to build and run the iperf application:

$ sudo docker run -it \
  --privileged \
  cloudrouter/osv-builder
bash-4.3# cd apps/iperf
bash-4.3# capstan build
Building iperf...
Downloading cloudius/osv-base/index.yaml...
154 B / 154 B [=================================================================================================================] 100.00 % 0
Downloading cloudius/osv-base/osv-base.qemu.gz...
20.09 MB / 20.09 MB [=======================================================================================================] 100.00 % 1m27s
Uploading files...
1 / 1 [=========================================================================================================================] 100.00 %
bash-4.3# capstan run
Created instance: iperf
OSv v0.19
eth0: 192.168.122.15
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------

Launching an interactive session

HOST_BUILD_DIR=$(pwd)/build
docker run -it \
  --volume ${HOST_BUILD_DIR}:/osv/builder \
  cloudrouter/osv-builder

This will place you into the OSv source clone. You’ll see the prompt:

bash-4.3# 

Now, you can work with it as you normally would when working on OSv source. You can build apps, edit build scripts, and so on. For example, you can run the following commands, once the above docker run command has been executed, to build and run a tomcat appliance.

./scripts/build image=tomcat,httpserver
./scripts/run -V

The osv Command

Note that the commands you run can be prefixed with osv, the source for which is available at assets/osv. For example, you can build with:

docker run \
  --volume ${HOST_BUILD_DIR}:/osv/build \
  osv-builder \
  osv build image=opendaylight

The osv script, by default, provides the following convenience wrappers:

Command                                    Mapping
build args                                 scripts/build
run args                                   scripts/run.py
appliance name components description      scripts/build-vm-img
clean                                      make clean

If any other command is used, it is simply passed on as scripts/$CMD "$@" where $@ is the arguments following the command.

You could also run commands as:

docker run \
  --volume ${HOST_BUILD_DIR}:/osv/build \
  osv-builder \
  ./scripts/build image=opendaylight

Building appliance images

If using the pre-built version from Docker Hub, use cloudrouter/osv-builder instead of osv-builder.

HOST_BUILD_DIR=$(pwd)/build
docker run \
  --volume ${HOST_BUILD_DIR}:/osv/build \
  osv-builder \
  osv appliance zookeeper apache-zookeeper,cloud-init "Apache Zookeeper on OSv"

If everything goes well, the images should be available in ${HOST_BUILD_DIR}. This directory will contain appliance images for QEMU/KVM, Oracle VirtualBox, Google Compute Engine and VMware Virtual Machine Disk.

Note that we explicitly disable the build of VMware ESXi images since ovftool is not available.

Building locally

As an alternative, you can build locally with a docker build command, using the Dockerfile for osv-builder.

docker build -t osv-builder .

Then you can use a plain osv-builder image name instead of cloudrouter/osv-builder.

For more information regarding OSv Appliances and pre-built ones, check the OSv virtual appliances page.

Volume Mapping

Volume        Description
/osv          This directory contains the OSv repository.
/osv/apps     The OSv apps directory. Mount this if you are testing local applications.
/osv/build    The OSv build directory containing release and standalone directories.
/osv/images   The OSv image build configurations.

Sending OSv patches

If you’re following the Formatting and sending patches guide on the OSv web site, just copy your patches into the builder directory in the container, and they’ll show up under your $HOST_BUILD_DIR, ready to be sent to the mailing list.

Conclusion

With the osv-builder docker image, building your own OSv images is now easier than ever before. If you are looking for a high-performance operating system to run your applications, go ahead and give it a try!

Questions and comments welcome on the osv-dev mailing list.

About the authors

David is a product security engineer based in Brisbane, Australia. He currently leads product security efforts for IIX, a software-defined interconnection company. David has been involved in the security industry for the last 15 years. During this time he has found high-impact and novel flaws in dozens of major Java components. He has worked for Red Hat’s security team, led a Chinese startup that failed miserably, and wrote the core aviation meteorology system for the southern hemisphere. In his spare time he tries to stop his two Dachshunds from taking over the house.

Don is a technical marketing manager for Cloudius Systems, the OSv company. He has written for Linux Weekly News, Linux Journal, and other publications. He co-founded the Linux consulting firm Electric Lichen, which was acquired by VA Linux Systems. Don has served as president and vice president of the Silicon Valley Linux Users Group and on the program committees for Uselinux, Codecon, and LinuxWorld Conference and Expo.

Re-"make"-ing OSv

OSv 0.19 is out, with a rewrite of the build system. The old OSv build system was fairly complex, but the rewrite makes it simpler and faster.

Simpler Makefile

The old OSv build system had several makefiles including each other, playing tricks with the current directory and VPATH, dynamically rewriting makefiles, and running submakes.

The old Makefile was responsible not only for building the kernel: it also built tests, called various Python scripts to build modules for different applications, and carried out other tasks.

In the new build system, there is just one “Makefile” for building the entire OSv kernel. Everything is in one file, and also better commented.

Separate kernel building from application building

In the old build system, we used “make” to do everything: building the OSv kernel, building various applications, and building images containing OSv and a collection of applications. This complicated the Makefile and resulted in unexpected build requirements. For example, building OSv always built some Java tests and thus required Maven and a working Internet connection.

In the new system, make only builds the OSv kernel, and scripts/build builds applications and images. In the future, you could use Capstan instead of scripts/build to make an image that you would like to manage with Capstan.

Most make command lines that worked in the previous build system will continue to work unchanged with scripts/build. For example:

# build image with default OSv application
scripts/build

# build the rogue image
scripts/build image=rogue
# or
scripts/build modules=rogue

# clean kernel and all modules
scripts/build clean

# make parameters can also be given to build
scripts/build mode=debug

# build image with tests, and run them
scripts/build check

 

Additional benefits of this rewrite include:

  1. Faster rebuilds. For example “touch loader.cc; scripts/build image=rogue” takes just 6 seconds (14 seconds previously). make after “make clean” (with ccache) is just 10 seconds (30 seconds previously).

  2. It should be fairly easy to add additional build scripts which will build different types of images using the same OSv kernel. One popularly requested option is to have the ability to create a bootfs-only image, without ZFS.

  3. Some smaller improvements, like more accurate setting of the desired image size (covered in issue #595), and supporting setting CROSS_PREFIX without also needing to specify ARCH.

What happened to make test?

The OSv tests are a module like any other module - they won’t be compiled unless someone builds the tests module. They have a separate Makefile in tests/. The ant and mvn tools, both used in our makefile just for building tests, will no longer be run every time the kernel is compiled, but only when the “tests” module is built. To build and run the tests:

scripts/build check

Try it out

The good news is that now, running make is faster, and the image build process is simpler and easier to extend. Check it out – questions and comments welcome on the osv-dev mailing list.

Wiki Watch: CloudRouter Images Available

Interested in running the CloudRouter VMs that we covered in a blog post earlier this week? Details are available on the CloudRouter wiki. (Pre-built images are available on the CloudRouter site, or you can build one yourself with Capstan.)

Questions on these or any other images are welcome, on the osv-dev mailing list. For more info on working with the CloudRouter project, see the CloudRouter community page.

For future updates, please follow @CloudiusSystems on Twitter.

Software-defined Interconnection: The Future of Internet Peering, Powered by OpenDaylight and OSv

By David Jorm and Don Marti

Most users are aware of cloud computing as a general term behind such trends as “Software as a Service,” where sites such as Salesforce.com can replace software run by a company IT department, or “Infrastructure as a Service” where virtual machines rented by the hour can replace conventional servers. But today, the technologies behind the cloud are changing the way that we connect the Internet at the most fundamental level, through Software Defined Interconnection (SDI).

What is SDI? A lot of manual work goes into hooking up the Internet between providers. The routers that send Internet traffic from one place to another can be configured to use “paid transit”, where a single provider will route packets to any destination. But the more Internet traffic you’re responsible for, the more you can benefit from another arrangement, called “direct interconnection” where you set up your company’s routers to directly connect to another company’s. Most networks will always need to buy transit from somebody; the best you can hope for is that a portion of your traffic bypasses the transit provider and is directly delivered to the destination. Maximizing the amount of traffic that is directly peered leads to better performance, lower latency, lower packet loss, and greater security.

Today, direct interconnection is typically set up manually, with a physical fiber cable connecting one organization’s network to another. Agreements to interconnect and peer are also reached manually, typically via email or face-to-face at peering conferences. When agreement is reached, network admins must ssh in to routers in order to manually configure the peering. It’s not efficient or scalable, and depends on individuals or select groups.

Once an organization has agreed that they want to directly connect with another organization, how do you handle changes to router and switch configuration? Probably the same way you used to manage your httpd.conf back in the 1990s! Network managers ssh in, and update config manually. Some networks have sophisticated management tools, but for many, “the state of the router is the canonical state.”

Software-defined interconnection

Software-defined interconnection, under test in the IIX lab

SDI aims to improve all that. The OpenDaylight project is a common platform for network management that facilitates breaking traditional network devices such as switches and routers into separate “data plane” devices that handle high traffic volume and “control plane” devices that do management. Because the control plane device, or software-defined networking (SDN) controller, does not have the extreme throughput requirements of the data plane, it’s easy to virtualize.

IIX is currently prototyping this next generation of devices in the lab, while more traditional network gear runs in production. The prototype system uses an OpenFlow switch for the data plane and a separate OpenDaylight server for the control plane. The switches rely on the SDN controller: when they see an unknown packet, they forward it to the controller.

No configuration changes are needed on the data plane hardware, only on the SDN controller, which can be a virtual machine. OpenDaylight manages both layer 2, switching, and layer 3, routing, and the same OpenDaylight APIs can be used to change configuration at both levels.

OpenDaylight is a pure Java application. It only requires the ability to run a JVM on the virtual machine. For security and ease of management, it can be advantageous to run an individual controller per customer. This means a lightweight, easy-to-manage guest OS is a big advantage. With OSv, IIX can deploy identical simple VMs for each customer, and the OpenDaylight APIs can be used to configure each one appropriately.

OSv’s high performance and low overhead allows for high density of VMs on standard physical hardware. And any compromise or configuration error should only affect one customer, because strong isolation is provided by a standard hypervisor, without the complex security model of containerization.

Conclusion

While Internet applications have gained from cloud technologies, the fundamental lower layers are still coming up to speed. OpenDaylight and OSv are bringing cloud economics to the lower levels of the stack.

About the authors

David is a product security engineer based in Brisbane, Australia. He currently leads product security efforts for IIX, a software-defined interconnection company. David has been involved in the security industry for the last 15 years. During this time he has found high-impact and novel flaws in dozens of major Java components. He has worked for Red Hat’s security team, led a Chinese startup that failed miserably, and wrote the core aviation meteorology system for the southern hemisphere. In his spare time he tries to stop his two Dachshunds from taking over the house.

Don is a technical marketing manager for Cloudius Systems, the OSv company. He has written for Linux Weekly News, Linux Journal, and other publications. He co-founded the Linux consulting firm Electric Lichen, which was acquired by VA Linux Systems. Don has served as president and vice president of the Silicon Valley Linux Users Group and on the program committees for Uselinux, Codecon, and LinuxWorld Conference and Expo.

Seastar: New C++ Framework for Web-scale Workloads

Today, we are releasing Seastar, a new open-source C++ framework for extreme high-performance applications on OSv and Linux. Seastar brings a 5x throughput improvement to web-scale workloads, at millions of transactions per second on a single server, and is optimized for modern physical and virtual hardware.

seastar Memcache graph

Benchmark results are available from the new Seastar project site.

Today’s server hardware is substantially different from the machines for which today’s server software was written. Multi-core design and complex caching now require us to make new assumptions to get good performance. And today’s more complex workloads, where many microservices interact to fulfil a single user request, are driving down the latencies required at all layers of the stack. On new hardware, the performance of standard workloads depends more on locking and coordination across cores than on performance of an individual core. And the full-featured network stack of a conventional OS can also use a majority of a server’s CPU cycles.

Seastar reaches linear scalability, as a function of core count, by taking a shard-per-core approach. Seastar tasks do not depend on synchronous data exchange with other cores, which is usually implemented with compare-exchange and similar locking schemes. Instead, each core owns its resources (RAM, NIC queue, CPU) and exchanges async messages with remote cores. Seastar includes its own user-space network stack, which runs on top of the Data Plane Development Kit (DPDK). All network communications can take place without system calls, and no data copying ever occurs. Seastar is event-driven and supports writing non-blocking, asynchronous server code in a straightforward manner that facilitates debugging and reasoning about performance.
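
As a tiny illustration of the shard-per-core message-passing style (written against the current Seastar API, which may differ in details from the API at the time of this post), a task can be handed to another shard as an asynchronous message that yields a future:

#include <seastar/core/app-template.hh>
#include <seastar/core/smp.hh>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Ask shard (core) 1 to run a task; no locks or shared state are
        // involved - just an asynchronous message and a future for the result.
        // Assumes the application was started with at least two shards.
        return seastar::smp::submit_to(1, [] {
            std::cout << "hello from another shard\n";
        });
    });
}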

Seastar is currently focused on high-throughput, low-latency network applications. For example, it is useful for NoSQL servers, for data caches such as memcached, and for high-performance HTTP serving. Seastar is available today, under the Apache license version 2.0.

Please follow @CloudiusSystems on Twitter for updates.

Researching the Future of the Cloud

By Nadav Har’El

What will the IaaS cloud of the future look like? How can we improve the hypervisor to reduce the overhead it adds to virtual machines? How can we improve the operating system on each VM to make it faster, smaller, and more agile? How do we write applications that run more efficiently and conveniently on the modern cloud? How can we run on the cloud applications which traditionally required specialized hardware, such as supercomputers?

Project Mikelangelo

Cloudius Systems, together with eight leading industry and university partners, announced this month the Mikelangelo research project, which sets out to answer exactly these questions. Mikelangelo is funded by the European Union’s flagship research program, “Horizon 2020”.

Cloudius Systems brings to this project two significant technologies:

The first is OSv, our efficient and light-weight operating-system kernel optimized especially for VMs in the cloud. OSv can run existing Linux applications, but often with significantly improved performance and lower memory and disk footprint.

Our second contribution to the cloud of the future is Seastar, a new framework for writing complex asynchronous applications while achieving optimal performance on modern machines. Seastar could be used to write the building blocks of modern user-facing cloud applications, such as HTTP servers, object caches and NoSQL databases, with staggering performance: Our prototype implementations already showed a 4-fold increase in server throughput compared to the commonly used alternatives, and linear scalability of performance on machines with up to 32 cores.

The other companies which joined us in the Mikelangelo project are an exciting bunch, and include some ground-breaking European (and global) cloud researchers and practitioners:

 • Huawei

 • IBM

 • Intel

 • The University of Stuttgart’s supercomputing center (HLRS)

 • The University of Goettingen’s computing center (GWDG)

 • Ben-Gurion University

 • XLAB, the coordinator of the project

 • Pipistrel, a light aircraft manufacturer

Pipistrel’s intended use case, of moving HPC jobs to the cloud, is particularly interesting. Pipistrel is an innovative manufacturer of light aircraft that holds several cool world records, and won NASA’s 2011 “Green Flight Challenge” by building an all-electric airplane achieving the equivalent of 400 miles per gallon per passenger. The aircraft design process involves numerous heavy numerical simulations. If a typical run requires 100 machines for two hours, running it on the cloud means they would not need to own 100 machines, and rather just pay for the computer time they use. Moreover, on the cloud they could just as easily deploy 200 machines, and finish the job in half the time, for exactly the same price!

Last week, researchers from all these partners met to kick off the project, and also enjoyed a visit to Ljubljana which, as its name implies, is a lovely city. The project will span 3 years, but we expect to see some encouraging results from the project—and from the individual partners comprising it—very soon. The future of the cloud looks very bright!

Visit The Mikelangelo Project’s official site for updates.

Watch this space (feed), or follow @CloudiusSystems on Twitter, for more links to research in progress.