Design

Virtual Memory

OSv differs from traditional systems in two respects:

First, OSv doesn’t have multiple address spaces. In an ordinary system, virtual address 0x200000 may be mapped to different physical pages (or the same page, sometimes, or to no page), depending on the process. In OSv an address is mapped to just one page. This means we don’t have to switch mappings when the processor stops running one process and starts running another.

The second difference is that OSv doesn’t maintain different permissions for the application and kernel. In a typical multiuser system, the kernel does not allow the the application to access kernel pages. This means it has to switch privilege levels when running kernel code, then switch back. OSv does not do this (expensive) switch; instead it relies on the hypervisor to protect applications from each other.

All user threads share a memory address space (they are threads, not processes). This allows for simpler code, and better performance, as we can avoid a page table switch involving a TLB flush (since Intel does not have a tagged TLB) that could occur when switching between processes.
There is no protection between userspace and kernel - calling the kernel is just making a function call, and userspace and the kernel can read and write each other’s memory. This brings performance benefits, and a lot of flexibility. But it means that even if we did have separate address space for different processes, they would not have UNIX-style isolation, because each process could read and write kernel data belonging to the other process.

Performance & Tracing

Performance measurements are based on static tracepoints, inserted into the code at strategic points. There are tools available from the shell to capture statistics on tracepoint hits, gather call-stack histograms on selected tracepoints, or simply capture a trace for later analysis.

Instructions: https://github.com/cloudius-systems/osv/wiki/Debugging-OSv

Fairness

We have two scheduling fairness requirements:

global fairness
cheap to compute

Each task has a measure of runtime it receives from the scheduler. We could use the task’s total runtime, t, but we would like to consider only the recent history so that a task that has slept for a long time would not dominate the processor. So we use a decaying average:

R(t + t) =(1 - kt)R(t) + p(t)r(t)kt(1)

Where

R is a measure of the runtime the thread has received in the recent past
p is the thread’s normalized priority
r describes the runtime history; it has the value 1 when the thread is running, and 0 otherwise
k is a constant determining how soon the scheduler “forgets” a thread’s history, typical value is 50 1/s (equivalent to forgetting history in a few multiples of 20 ms).
t is a small period

Since we don’t want to have periodic scheduler ticks, we take the limit t0, and integrate. This yields the normal continuous decaying average function:

Rt0=0t0p(t)r(t)ek(t-t0)dt(2)

The scheduler picks the runnable thread with smallest R, and runs it until some other thread has a smaller R. To prevent excessive context switches, hysteresis is employed.

Computing R continuously is expensive, but not needed. A little algebra shows how to compute R(t2) given R(t1), provided p(t) and r(t) have not changed between t1 and t2. This means we can update R only when the scheduler is invoked, or when priorities change, since r(t) only changes as a result of scheduler execution.

R(t2) = ek(t1-t2)R(t1) + 1kp(t2)r(t2)(1-ek(t1-t2))(3)

It is still impractical to compute R for all threads on every scheduler invocation; but we note that, for a given processor, r(t2)has the value 1 for at most one process; the others are not running (queued or sleeping). The second term of the equation vanishes; and the first is a multiplication by a constant (across all threads, for a given t2).

Let us introduce an unnormalized runtime measure, R’. This measure is local to a processor; when moving a thread to a different processor, we must normalize it again:

R’(t) = c(t)R(t)(4)

If we define

c(t2) =ek(t1-t2)c(t1) (5)

Then the unnormalized runtime measure is given by

R’(t2) = R’(t1) + 1kp(t2)r(t2)(e-k(t1-t2)-1)(6)

For non-running threads, r(t2) = 0, so:

R’(t2) = R’(t1)(7)

So on each scheduler invocation, we only need to update c and R’ for the running thread.

The value of c starts at 1 and decays towards zero; to avoid underflow we need to renormalize R’ periodically, by multiplying it (for all threads) by the current value of c, and then setting c to 1. This is done rarely enough so that the cost is amortized.

Runnable threads are stored in a sorted container, with R’ as the key. When a thread is run, it is taken out of the container, and has its R’ updated when it is stopped. When a thread is migrated, R’ is normalized first (in the cpu it was running on) and then unnormalized again (in the cpu it is migrated to).

To achieve hysteresis, the scheduler reduces the running thread’s R value by a constant tgranwhen it starts running the thread, and increases it back by the same amount when it stops running. This prevents the thread from being preempted immediately by a thread with the same or similar R value, yet preserves fairness.

When a runnable (but not running) thread (denoted ‘q’) becomes the one with the lowest R value, the scheduler computes the tsin which it would have an R value lower than the running thread (denoted ‘r’):

Rq(ts) = ek(t0 - ts)Rq(t0)

Rr(ts) = ek(t0 - ts)Rr(t0) +1kpr(ts)(1-ek(t1-t2))

Setting Rq(ts) = Rr(ts) and solving for ts, we obtain

ts =t0 - 1kln1kpr(t0)1kpr(t0) + Rq(t0) - Rr(t0)

Adjusting for the unnormalized R value:

ts =t0 - 1kln1kpr(t0)1kpr(t0) +Rq(t0) - Rr(t0)c(t0)

The scheduler sets a timer at ts, which is the next preemption point whenever one of the parameters in the equation changes - the runnable thread, or the priority of the running thread. A small overshoot is allowable, so approximations can be used.

The exponential function is easier to approximate in the form of a power-of-two function, so (6) becomes

R’(t2) = R’(t1) + 1kp(t2)r(t2)(2-(log2e)k(t1-t2)-1)