<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
	<channel>
		<title>Christian Dietrich's Research Blog</title>
		<description></description>
		<link>https://www4.cs.fau.de/~stettberger/blog</link>
		
			<item>
				<title>A templated System-Call Interface for OO/MPStuBS</title>
				<description>&lt;p&gt;We use OOStuBS/MPStuBS in our
&lt;a href=&quot;/Lehre/WS15/V_BS&quot;&gt;operating system course&lt;/a&gt;. In the first part of our
two part lecture, the students implement basic IRQ handling, coroutine
switching, and synchronisation primitives. We have no spatial or
privilege isolation in place, since this is topic of the
&lt;a href=&quot;/Lehre/SS15/V_BST&quot;&gt;second lecture part&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Still, we want to differentiate between a user space and the kernel
space. On a technical level, the kernel space or system level is
defined by a big kernel lock; the so called guard. If a control flow
enters the guard, it transitions to the kernel and leaves the kernel,
when the guard is left. The guard concept is heavily coupled with our
idea of IRQ handling in epilogues (similar to bottom-halves or
deferred interrupt handlers).&lt;/p&gt;

&lt;p&gt;Our proposed implementation of the system call interface uses
&lt;a href=&quot;https://en.wikipedia.org/wiki/Facade_pattern&quot;&gt;facade pattern&lt;/a&gt; to
expose some of the system functionality as &quot;Guarded Services&quot;.&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;class Guarded_Scheduler {
public:
     static void resume() {
          Secure section; // does guard.enter() in constructor
          scheduler.resume();
          // guard.leave() is called on Secure destructor
     }
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;Secure&lt;/code&gt; class applies the &lt;em&gt;R&lt;/em&gt;esource &lt;em&gt;A&lt;/em&gt;cquisition &lt;em&gt;I&lt;/em&gt;s
&lt;em&gt;I&lt;/em&gt;nitialisation (RAII) pattern: it enters the guard on construction and
leaves it upon destruction of the &lt;code&gt;Secure&lt;/code&gt; object. But, as you see,
writing this pattern out by hand is cumbersome and involves a lot of
boilerplate. Nobody, especially not interested students, wants to write
boilerplate. Fortunately, our OS is implemented in C++, so we have
powerful language features at hand to build a usable abstraction. In the
following, I will explain how we can implement an easily extensible
system-call interface for a library operating system (everything is linked
together, and we have no spatial isolation).&lt;/p&gt;
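
&lt;p&gt;As a minimal sketch (assuming a global &lt;code&gt;guard&lt;/code&gt; object with
&lt;code&gt;enter()&lt;/code&gt; and &lt;code&gt;leave()&lt;/code&gt; methods, as used in our course), such a
&lt;code&gt;Secure&lt;/code&gt; class could look like this:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;class Secure {
public:
    Secure()  { guard.enter(); } // transition to the kernel level
    ~Secure() { guard.leave(); } // leave the kernel level on scope exit
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;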

&lt;h1&gt;A First Attempt&lt;/h1&gt;

&lt;p&gt;First, we start with a &quot;simple&quot; templated function that can wrap every
member function of an object and call it with the guard taken. The actual
API usage looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; syscall(&amp;amp;Scheduler::resume, scheduler);
 syscall(&amp;amp;Scheduler::kill,   scheduler, &amp;amp;other_thread);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first argument to &lt;code&gt;syscall()&lt;/code&gt; might surprise some readers, since
it uses a seldom-used C++ feature: a &quot;pointer to member&quot;, which
captures how we can access or call a member once we have the
corresponding object at hand. The datatype of &lt;code&gt;&amp;amp;Scheduler::resume&lt;/code&gt; is
&lt;code&gt;void (Scheduler::*)()&lt;/code&gt;, which is similar to a function pointer
returning nothing and taking no arguments. &lt;code&gt;&amp;amp;Scheduler::kill&lt;/code&gt; has the
datatype &lt;code&gt;void (Scheduler::*)(Thread *)&lt;/code&gt;; it is a pointer to a member
function, which returns nothing but takes a Thread pointer as
argument. Both pointers only make sense together with a Scheduler
object. Once we have one, we can use the rarely seen &lt;code&gt;.*&lt;/code&gt; operator:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; ((scheduler).*(&amp;amp;Scheduler::kill))(thread)
&lt;/code&gt;&lt;/pre&gt;
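
&lt;p&gt;As a small self-contained illustration (with a hypothetical &lt;code&gt;Thread&lt;/code&gt;
type), a pointer to member can also be stored in a variable before it is
invoked:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;// Declare a pointer-to-member-function variable and bind it
void (Scheduler::*op)(Thread *) = &amp;amp;Scheduler::kill;

// Invoke it on an object; with an object pointer, -&amp;gt;* is used instead
(scheduler.*op)(thread);&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;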

&lt;p&gt;We now can combine this concept with C++11 templates to get the
described syscall function:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename Func, typename Class, typename ...Args&amp;gt;
inline auto syscall(Func func,  Class &amp;amp;obj, Args&amp;amp;&amp;amp;... args) -&amp;gt; decltype((obj.*func)(args...)){
    Secure secure;
    return (obj.*func)(args...);
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Huh, what happens here? Let's take this monster apart to understand
how it works. It is a &lt;em&gt;function template&lt;/em&gt;: it generates functions
depending on the types it is specialized for. You can think of this
specialization process like this: the compiler has a &lt;em&gt;Schablone&lt;/em&gt;
(the German word for template, but with the notion of scissors and paper)
at hand. When it sees a function call to &lt;code&gt;syscall()&lt;/code&gt;, it fills the
missing parts of the Schablone with the argument types and compiles
the result as a new function.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;syscall(Func arg0, Class arg1, Args... args2_till_9001)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So, our syscall function takes at least two arguments, but can consume
arbitrarily many further arguments in its variadic part at the end (the
Args...). The type of the first argument is bound to the type &quot;Func&quot;,
the second argument type is bound to the type &quot;Class&quot;, and all others are
collected in the variadic type &quot;Args&quot;. The func argument, which has type
Func, is the pointer-to-member object; the obj argument is the actual
system object. So, now we can call the member function with the remaining
arguments:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; (obj.*func)(args...)
&lt;/code&gt;&lt;/pre&gt;
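
&lt;p&gt;To make the specialization process concrete: for the call
&lt;code&gt;syscall(&amp;amp;Scheduler::kill, scheduler, thread)&lt;/code&gt;, the compiler effectively
fills in the Schablone like this (a mental model, not actual compiler
output):&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;// Func = void (Scheduler::*)(Thread *), Class = Scheduler, Args = {Thread *}
inline void syscall(void (Scheduler::*func)(Thread *),
                    Scheduler &amp;amp;obj, Thread *arg) {
    Secure secure;
    return (obj.*func)(arg);
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;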

&lt;p&gt;But our function still has no return type. What to do? Here, C++
&lt;code&gt;auto&lt;/code&gt; and &lt;code&gt;decltype&lt;/code&gt; come to the rescue. When using &lt;code&gt;auto&lt;/code&gt; as a return
type, the compiler expects &lt;code&gt;-&amp;gt; Type&lt;/code&gt; after the closing parenthesis of
the function. The &lt;code&gt;decltype()&lt;/code&gt; built-in gives the type of the enclosed
expression. So &lt;code&gt;decltype((obj.*func)(args...))&lt;/code&gt; is exactly the return
type of the given pointer-to-member-function argument.&lt;/p&gt;
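
&lt;p&gt;The same auto/decltype idiom works outside of the syscall context, too;
a tiny stand-alone example (with hypothetical names):&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename A, typename B&amp;gt;
auto add(A a, B b) -&amp;gt; decltype(a + b) {
    return a + b;
}
// add(1, 2.5) has the deduced return type double&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;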

&lt;p&gt;Furthermore, we just have to allocate a &lt;code&gt;Secure&lt;/code&gt; object to make the
&lt;code&gt;guard.enter()&lt;/code&gt; and &lt;code&gt;guard.leave()&lt;/code&gt; calls. Voilà, a system call
interface. But it still has some problems. We can call every method on
every object in the whole system; we have no notion of &quot;allowed&quot;
and forbidden system calls. Of course, in a library operating
system with no protection this is OK. Furthermore, we always have to
pass the system object (e.g., scheduler) on each system call. I think
we can do better here. So let's revisit our implementation.&lt;/p&gt;

&lt;h1&gt;A Second Attempt&lt;/h1&gt;

&lt;p&gt;In our second attempt, we want to restrict the system-call interface
to certain classes. This gives coarse-grained control about the
methods that can be called via the &lt;code&gt;syscall&lt;/code&gt; interface. As a
side-effect, we can omit the actual system-object argument such that
we can write:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;syscall(&amp;amp;Scheduler::resume)
syscall(&amp;amp;Scheduler::kill, that)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;We implement a &lt;code&gt;system_object&lt;/code&gt; function that returns the system-object
singleton instance when called for a given type. We implement this
function only for those classes we want to make accessible via
&lt;code&gt;syscall&lt;/code&gt;. This gives us some control over the possible syscall targets.&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename Class&amp;gt;
Class&amp;amp; system_object();

// Get the scheduler singleton
Scheduler &amp;amp;scheduler = system_object&amp;lt;Scheduler&amp;gt;();&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The template specialization can be done in the source file and does
not have to be put into the header. This allows us to hide the actual
system-object symbol from the rest of the system. For example, this
could be located in the &lt;code&gt;thread/scheduler.cc&lt;/code&gt; file:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;static Scheduler instance;

template&amp;lt;&amp;gt;
Scheduler &amp;amp;system_object() {
    return instance;
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;We still have to call this function from our system-call
implementation. For this, we need to have the class type of the
underlying system object at hand. The only thing we have is the
pointer-to-member object that identifies the desired system call
(&lt;code&gt;&amp;amp;Scheduler::resume&lt;/code&gt;). But, as you remember, the class type is part
of the type of such a pointer to member (&lt;code&gt;Func&lt;/code&gt;). We only have to
extract that information from the given type.&lt;/p&gt;

&lt;p&gt;The concept of accessing information about types is called &lt;code&gt;type
traits&lt;/code&gt;. This is a grandiloquent term for &quot;a template that takes a type
and provides several types and constants&quot;. So let's look at our type
trait:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;// Fail for all cases...
template&amp;lt;typename&amp;gt; struct syscall_traits;

// ..., except for deconstructing a pointer to member type
template&amp;lt;typename ReturnType, typename ClassType, typename ...Args&amp;gt;
struct syscall_traits&amp;lt;ReturnType(ClassType::*)(Args...)&amp;gt; {
    typedef ReturnType result_type;
    typedef ClassType class_type;
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This &lt;code&gt;syscall_traits&lt;/code&gt; template is only specialized for
pointer-to-member types and deconstructs the type of our
&lt;code&gt;&amp;amp;Scheduler::resume&lt;/code&gt; argument (&lt;code&gt;void (Scheduler::*)()&lt;/code&gt;) with the
pattern &lt;code&gt;ReturnType (ClassType::*)(Args...)&lt;/code&gt;. As you see, the
template only does pattern matching on types and binds types to
template parameters. This can generally be said for templates: the
&lt;code&gt;&amp;lt;&amp;gt;&lt;/code&gt;-line after the template keyword defines type variables, which can
be bound later on or have to be supplied by the user. With our type
trait we can simply access the instance class of our pointer-to-member
argument and can call &lt;code&gt;system_object()&lt;/code&gt;:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename Func, typename ...Args&amp;gt;
inline auto syscall(Func func, Args&amp;amp;&amp;amp;... args) -&amp;gt; typename syscall_traits&amp;lt;Func&amp;gt;::result_type {
    // We do everything with a taken guard
    Secure secure;

    // Get traits of the system call
    typedef typename syscall_traits&amp;lt;Func&amp;gt;::class_type system_object_type;

    // Get the singleton instance for the given class type
    system_object_type &amp;amp;obj = system_object&amp;lt;system_object_type&amp;gt;();

    return (obj.*func)(args...);
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The first thing we see is that the deduced return type has changed. It
now has to use our type trait, since we have no system object at hand
that we could use with decltype (&lt;code&gt;-&amp;gt; decltype((obj.*func)(args...))&lt;/code&gt;).
Within the body of the syscall function, we use the trait to extract the
system-object's class type from the &lt;code&gt;Func&lt;/code&gt; type and call system_object
to gain access to the singleton instance.&lt;/p&gt;

&lt;p&gt;If we use syscall on a class that is not exposed via specializing
&lt;code&gt;system_object&amp;lt;&amp;gt;&lt;/code&gt;, we get a linker error, and the developer is
informed that the call is not allowed.&lt;/p&gt;
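
&lt;p&gt;For example, with a hypothetical &lt;code&gt;Keyboard&lt;/code&gt; class that is not exposed
via &lt;code&gt;system_object&amp;lt;&amp;gt;&lt;/code&gt;, the failure shows up at link time, roughly like
this:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;syscall(&amp;amp;Keyboard::plugin);
// linker: undefined reference to 'system_object&amp;lt;Keyboard&amp;gt;()'&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;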

&lt;p&gt;So, what have we achieved in the second attempt? We have a cleaner
system-call interface and do not have to supply the system object
directly, but it is deduced from the supplied arguments. Furthermore,
only annotated classes are suitable for being called via this
interface. Nevertheless, we can still call all functions on these
classes. In the third attempt we want to solve this as well.&lt;/p&gt;

&lt;h1&gt;The Third and Final Attempt&lt;/h1&gt;

&lt;p&gt;How can we annotate functions as being system calls? The only real
handle we have in static C++ land are types. So we have to
annotate the function type of our system call somehow. The type of a
method is defined by only a few pieces of information: the argument
types, the class type, and the return type. The one thing that is
always there, and that is not shared among several functions, is the
return type. We use the return type for our annotation by wrapping it
into a marker struct:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template &amp;lt;typename T=void&amp;gt; struct syscall_return;
template&amp;lt;&amp;gt; struct syscall_return&amp;lt;void&amp;gt; { void get() {}; };

template &amp;lt;typename T&amp;gt;
class syscall_return {
    T value;
public:
    syscall_return(T&amp;amp;&amp;amp; x) : value(x) {}
    operator T() { return value; }
    T get() { return value; }
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;syscall_return&lt;/code&gt; template wraps a type and contains a copy of
it. Furthermore, it implements a &lt;code&gt;get()&lt;/code&gt; method to access this inner
object, and it overloads the cast operator to T for easier
handling. The &lt;code&gt;void&lt;/code&gt; type is special here and has to be handled
specially, since it is a no-object type and cannot be instantiated.&lt;/p&gt;

&lt;p&gt;We can now annotate functions in our Scheduler class:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;struct Scheduler {
    syscall_return&amp;lt;void&amp;gt; resume() {
        printf(&amp;quot;resume\n&amp;quot;);
        return syscall_return&amp;lt;&amp;gt;();
    }

    virtual syscall_return&amp;lt;int&amp;gt; increment(int i) {
        return i+1;
    }
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;As you see, we have to special-case void again (&quot;Damn you void,
you and your voidness!&quot;). But the implicit conversion via the constructor
makes it easy to return all other types. We also have to adapt the
rest of our implementation. In the &lt;code&gt;syscall_traits&lt;/code&gt; template, the
matched pattern now strips the &lt;code&gt;syscall_return&lt;/code&gt; wrapper from the
type. This also causes all unwrapped return types to fail.&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;// ..., except for deconstructing a pointer to member type
template&amp;lt;typename ReturnType, typename ClassType, typename ...Args&amp;gt;
struct syscall_traits&amp;lt;syscall_return&amp;lt;ReturnType&amp;gt;(ClassType::*)(Args...)&amp;gt; {
    typedef ReturnType result_type;
    typedef ClassType class_type;
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;In the &lt;code&gt;syscall&lt;/code&gt; template, we only have to additionally call &lt;code&gt;.get()&lt;/code&gt;
on the result:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename Func, typename ...Args&amp;gt;
inline auto syscall(Func func, Args&amp;amp;&amp;amp;... args) -&amp;gt; typename syscall_traits&amp;lt;Func&amp;gt;::result_type {
      // We do everything with a taken guard
      Secure secure;

      // Get traits of systemcall
      typedef typename syscall_traits&amp;lt;Func&amp;gt;::class_type system_object_type;

      // Get a singleton instance for the given base type.
      system_object_type &amp;amp;obj = system_object&amp;lt;system_object_type&amp;gt;();

      return (obj.*func)(args...).get();
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;And voilà, we have a system-call interface with annotations that
prevents the user from calling unmarked functions via &lt;code&gt;syscall&lt;/code&gt;. All
abstractions from above come at zero run-time cost.&lt;/p&gt;
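
&lt;p&gt;Putting it all together, with the annotated Scheduler from above, the
interface is used like this:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;syscall(&amp;amp;Scheduler::resume);                // void system call
int i = syscall(&amp;amp;Scheduler::increment, 41); // result unwrapped by get()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;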

&lt;p&gt;The only downside is that the user is still able to call the functions
directly. But, this can never be solved in a library operating
system.&lt;/p&gt;

&lt;p&gt;I hope I could give you an impression of what is possible with C++
templates in the context of a bare-metal operating system.&lt;/p&gt;
</description>
				<published>2015-12-23 00:00:00 +0100</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/Template-Syscall/</link>
			</item>
		
			<item>
				<title>Testing Three-Valued Vectors for Compatibility</title>
				<description>&lt;p&gt;For a colleagues project, we encountered the problem to check vectors
of values for compatibility.  The values are either set or
undefined. An undefined value is compatible to everything; a set value
is compatible to the same value. An example instance of this problem
might look like this, when the possible values are 'a', 'b', and 'c';
undefined is indicated by 'U':&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt; Vector 1 &lt;/th&gt;
&lt;th align=&quot;center&quot;&gt; Vector 2 &lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; a        &lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; a        &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; b        &lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; U        &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; U        &lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; U        &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; a        &lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; c        &lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;


&lt;p&gt;These two vectors are compatible in their first three lines, since
undefined is compatible with everything; the last line is incompatible
('a' vs. 'c'). How can we test such vectors for compatibility in a fast
fashion? The first simple idea is to use a char array, encode undefined
as 0, and compare the vectors character by character:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;char A[] = {'a', 'b', 0, 'a'};
char B[] = {'a',  0,  0, 'c'};

bool compatible = true;
for (unsigned i = 0; i &amp;lt; 4; i++) {
    if (A[i] != 0 &amp;amp;&amp;amp; B[i] != 0 &amp;amp;&amp;amp; A[i] != B[i]) {
        compatible = false; break;
    }
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This would do the job and is already quite fast. Nevertheless, we can
do it faster. We can avoid checking three different conditions (&amp;amp;&amp;amp;) by
using multiplication and the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Semigroup#Identity_and_zero&quot;&gt;zero element&lt;/a&gt;
property of the number zero:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;if (A[i] * B[i] * (A[i] - B[i]) != 0) {
    compatible = false; break;
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This expression is non-zero exactly when both elements are
non-zero and their difference is non-zero; a non-zero difference is
just another way of expressing inequality. But, as we learned in our
processor-design lecture, multiplications are expensive. So let's
search for a way to do the same without multiplication. Our advantage
is that we are not interested in the result of the multiplication, but
only in its property of not being zero. Perhaps we can do something
with bit shifts. First, we encode our three elements in a denser way,
using at most two bits:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;char A[] = {1, 2, 0, 1};
char B[] = {1, 0, 0, 3};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;After fiddling around at the whiteboard, I came up with the following
solution for 3 possible values plus the undefined value, which works
quite well for our use case:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;if ((((A[i] &amp;lt;&amp;lt; 1) &amp;amp; B[i]) ^ ((B[i] &amp;lt;&amp;lt; 1) &amp;amp; A[i])) != 0) {
    compatible = false; break;
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This expression implements, although it is not easily visible, the
required behavior. We can see this easily by looking at the truth
table of the function &lt;code&gt;f(a, b)=(((a &amp;lt;&amp;lt; 1) &amp;amp; b) ^ ((b &amp;lt;&amp;lt; 1) &amp;amp; a)) == 0&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; | a | b | f(a, b)|         | a | b | f(a, b)|
 |:-:|:-:|:------:|         |:-:|:-:|:------:|
 | 0 | 0 | 1      |         | 2 | 0 | 1      |
 | 0 | 1 | 1      |         | 2 | 1 | 0      |
 | 0 | 2 | 1      |         | 2 | 2 | 1      |
 | 0 | 3 | 1      |         | 2 | 3 | 0      |
 | 1 | 0 | 1      |         | 3 | 0 | 1      |
 | 1 | 1 | 1      |         | 3 | 1 | 0      |
 | 1 | 2 | 0      |         | 3 | 2 | 0      |
 | 1 | 3 | 0      |         | 3 | 3 | 1      |
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So, this is a very limited Boolean function: it works only for 2-bit
wide values of a and b. But it is fast. And the best part is that it
consists only of bit operations. This means we can put many vector
values into a single machine word and compare all of them in one
step. Unfortunately, we have to insert a padding bit between the value
bits to have zeroes that can be shifted in and out. So when we encode
our example from above, we get the following bit vectors:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;         [0]  [1]  [2]  [3]  | int
      A  001  010  000  001  | 641
      B  001  000  000  011  | 515
 ---------------------------------
 f(A, B) 000  000  000  010  |   2
&lt;/code&gt;&lt;/pre&gt;
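
&lt;p&gt;As a sketch of the word-wide check (my own illustration, assuming 3-bit
fields packed into unsigned 64-bit words):&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;#include &amp;lt;stdint.h&amp;gt;

/* Compare up to 21 three-bit values packed into two 64-bit words.
   Returns non-zero iff the vectors are incompatible somewhere. */
static inline uint64_t incompatible(uint64_t a, uint64_t b) {
    return ((a &amp;lt;&amp;lt; 1) &amp;amp; b) ^ ((b &amp;lt;&amp;lt; 1) &amp;amp; a);
}

/* For the example above: incompatible(641, 515) yields 2 */&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;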

&lt;p&gt;As you see, the difference occurs in the &lt;code&gt;[3]&lt;/code&gt; column, where our
values are both set, but different. With this neat trick, we can put
21(!) values into a single 64-bit word and compare them all at
once. With this optimization, I could improve the runtime of our
problem from 5 minutes to 1 minute; just to give you a rough
idea of the improvement for our (unspecified) problem. The speedup stems
both from the denser encoding (transferring less memory) and from the
faster operations (bit operations are cheap).&lt;/p&gt;
</description>
				<published>2015-12-16 00:00:00 +0100</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/Vector-Compatibility/</link>
			</item>
		
			<item>
				<title>On Conference: PLDI and LCTES</title>
				<description>&lt;p&gt;Currently, I'm attending the &lt;a href=&quot;https://fcrc.acm.org/&quot;&gt;FCRC&lt;/a&gt;
Multi-Conference in Portland, Oregon. I want to write a few paragraphs
about contributions I found especially interesting, and this post is
more a journal for myself, than written for wider audience. But,
perhaps this is interesting to others as well.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://herbie.uwplse.org/pldi15-herbie.pdf&quot;&gt;Panchekha et al.&lt;/a&gt;
introduced Herbie, a heuristic optimizer for floating-point
expressions that increases precision. When computing a floating-point
expression, the ordering and selection of instructions is essential
for the precision of the calculation. Herbie takes an actual R&lt;sup&gt;n&lt;/sup&gt;-&gt;R
function and emits a partially defined function that minimizes the
imprecision introduced by the selected operations.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.cs.rutgers.edu/~santosh.nagarakatte/papers/alive-pldi15.pdf&quot;&gt;Lopes et
al&lt;/a&gt;
presented &lt;a href=&quot;http://blog.regehr.org/archives/1170&quot;&gt;Alive&lt;/a&gt;, a
verifier for peephole optimizations in compilers. A peephole
optimization looks at the intermediate representation or the machine
code and replaces templates of code with faster templates of
code. Alive uses the Z3 theorem prover to prove the correctness of
such optimizations in LLVM and found 8 bugs.&lt;/p&gt;

&lt;p&gt;Furthermore, I learned about the existence of &lt;a href=&quot;https://en.wikipedia.org/wiki/Vickrey_auction&quot;&gt;Vickrey
auctions&lt;/a&gt;, a form of
auction where the highest bid wins, but the winner pays
the price of the second-highest bid. In contrast to a normal auction,
this auction type maximizes the social welfare instead of the
revenue. Social welfare is defined in this setting as: the bidder
who values the item the most will win it.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.eecs.harvard.edu/~skanev/papers/isca15wsc.pdf&quot;&gt;Kanev et
al.&lt;/a&gt;
presented a hardware profiling of whole datacenters, and the results
are rather amazing. They profiled a bunch of Google servers for a few
weeks and examined the results. It is surprising that about 30 percent
of all instructions are spent on the &quot;datacenter tax&quot; (allocation,
memmove, RPC, protobuf, hashing, compression). This is really a huge
number. Furthermore, they could show that pipeline stalls due to
instruction-cache misses contribute largely to the waiting time in
those large datacenter applications. The i-cache working sets are
often larger than the L2 cache, so the instructions have to compete with
data cache lines. Perhaps we will see processors with a split L2 cache in
the future.&lt;/p&gt;

&lt;p&gt;In the DCC keynote, John Wilkes talked about cluster management in the
Google datacenters, and their approach is fascinating. The basic
assumption is: a machine that is not running has a speed
of 0. Therefore, they optimize for availability and assume failure to
be the normal operation mode. In an EuroSys'15 paper, &lt;a href=&quot;http://www.e-wilkes.com/john/papers/2015-EuroSys-Borg.pdf&quot;&gt;Verma et
al&lt;/a&gt; describe
Borg, the cluster-management software Google uses internally.&lt;/p&gt;

&lt;p&gt;During the LCTES conference,
&lt;a href=&quot;http://www.sjalander.com/research/pdf/sjalander-lctes2015-ida.pdf&quot;&gt;Bardizbanyan et al.&lt;/a&gt;
presented a processor modification that adapts the memory-fetch stage so
that it takes the needs of the current memory operation into account. Not
all memory operations need all features the addressing mode
provides. For example, &lt;code&gt;mov 4(%eax), %ebx&lt;/code&gt; doesn't need an offset from
a register with scaling (in contrast to &lt;code&gt;mov 4(%eax, %ecx,
4)&lt;/code&gt;). Therefore, they proposed to gate these addressing features within
the memory-fetch stage and to do speculative address calculation to
improve the energy consumption and latency of the stage.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.acm.org/citation.cfm?id=2754952&quot;&gt;Baird et al.&lt;/a&gt; presented a
method to optimize programs for static-pipeline processors. A
&lt;a href=&quot;http://www.cs.fsu.edu/~whalley/papers/cal11.pdf&quot;&gt;static pipeline&lt;/a&gt; is
similar to a horizontally microcoded CPU. For a static-pipeline
CPU, the compiler doesn't emit a stream of instructions where each
token is one instruction; instead, it splits the effects over several
commands. Each command describes what all stages of the pipeline
should do in the current instruction cycle. Statically pipelined
processors are hard to program, but exhibit a high energy
efficiency. Baird proposed methods to optimize transfer-of-control
instructions for these command packets.&lt;/p&gt;

&lt;p&gt;From
&lt;a href=&quot;http://liberty.princeton.edu/Publications/lctes15_alias.pdf&quot;&gt;Ghosh et al.&lt;/a&gt;,
I learned that processors that do dynamic binary translation (e.g.,
the Transmeta Crusoe) can do speculative alias analysis. For this, the
processor has some alias registers, and every instruction is marked to
either update or check a specific alias register. If two instructions
then turn out to have aliasing pointers, the CPU faults, and the program
is retranslated without the optimization that led to that fault.&lt;/p&gt;

&lt;p&gt;With Clover,
&lt;a href=&quot;http://www4.ncsu.edu/~dtiwari2/Papers/2015_LCTES_Compiler_Soft_Error.pdf&quot;&gt;Liu et al.&lt;/a&gt;
presented a hybrid approach to mitigate soft errors. As a hardware
platform, they used a processor with sonic micro-detectors that can
detect the impact of a cosmic particle. In software, they implemented
checkpointing for code regions. Since the detector has a delay due to
the physical limitation of a sonic detector, they proposed a
compiler-based approach that executes the last N instructions of each
code region twice in order to cover the worst-case detection
delay. Although they claimed to be free of silent data corruptions
(SDCs), they make strong assumptions about their fault model (faults
occur on chip and the memory is ECC protected) and about control-flow
errors (there is a working software-based control-flow error detection).&lt;/p&gt;
</description>
				<published>2015-06-15 00:00:00 +0200</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/On-Conference-PLDI-LCTES/</link>
			</item>
		
			<item>
				<title>dOSEK Version 1.1</title>
				<description>&lt;p&gt;The DanceOS team is proud to announce the release of dOSEK version
1.1. With the latest release, we added support for OSEK events and an
improved ARM support.&lt;/p&gt;

&lt;p&gt;OSEK events are a synchronization primitive provided by the
kernel. Events are system objects, which are declared in the OIL
configuration file. Each event belongs to exactly one task in the
system. Only the owning task can clear the event or wait for its
arrival, but events can be signaled by any other task in the
system. Additionally, an alarm can be configured to send a signal to a
specific task.&lt;/p&gt;

&lt;p&gt;The ARM support was improved, and dOSEK now runs on a real hardware
platform: the ZedBoard. This architecture port supports all
dependability features of dOSEK except the memory protection.&lt;/p&gt;

&lt;p&gt;The additional dependability features of dOSEK include: a concurrent
checker for data objects, replication of the OS state, and retry of
encoded scheduling operations.&lt;/p&gt;

&lt;p&gt;The source code can be obtained from &lt;a href=&quot;https://www.github.com/danceos/dosek&quot;&gt;github&lt;/a&gt;. For more details on
the changes, have a look at the &lt;a href=&quot;https://github.com/danceos/dosek/blob/master/Changelog.md&quot;&gt;Changelog&lt;/a&gt;.&lt;/p&gt;
</description>
				<published>2015-04-01 00:00:00 +0200</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/dOSEK-Release-v1.1/</link>
			</item>
		
			<item>
				<title>Waiting in dOSEK</title>
				<description>&lt;p&gt;In operating systems, waiting states are an essential feature to keep
up the illusion that every thread is alone on the machine. In
general-purpose operating systems, waiting states occur when data is
read from the hard drive or when data is written from a network
socket. A thread can also wait for the completion of work executed in
another thread. Here, one thread waits for a signal the other thread
provides. Waiting states are also part of real-time operating systems,
like the OSEK standard. We've now implemented this feature, which is
required for the OSEK ECC1 conformance class.&lt;/p&gt;

&lt;h1&gt;Scheduling in dOSEK&lt;/h1&gt;

&lt;p&gt;The core of dOSEK is the priority-driven scheduler. Since OSEK is a
static operating-system standard, we know, for a specific system,
exactly how many threads exist. This number never changes; it is
configured at compile time. The scheduler always selects the runnable
thread with the highest priority and executes it. In dOSEK,
a thread is runnable if its priority is larger than the priority of
the idle thread. In pseudo code, the scheduler/dispatcher looks like
this:&lt;/p&gt;

&lt;pre&gt;
schedule() {
   current_thread = idle_id;
   current_prio   = idle_prio;

   updateMax((current_thread, current_prio),
             (thread_1_id, thread_1_prio));

   updateMax((current_thread, current_prio),
             (thread_2_id, thread_2_prio));

   updateMax((current_thread, current_prio),
             (thread_3_id, thread_3_prio));

   switch_to_thread(current_thread);
}
&lt;/pre&gt;


&lt;p&gt;The scheduler is generated for the specific system (in this case, for a
system with 3 threads) and contains an &lt;code&gt;updateMax()&lt;/code&gt; cascade.
&lt;code&gt;updateMax()&lt;/code&gt; is a hardened operation that updates the first input
tuple with the second one, iff the priority of the second tuple (its
second item) is higher than the priority of the first tuple. In the
first cascade element, &lt;code&gt;current_thread&lt;/code&gt; is set to
&lt;code&gt;thread_1_id&lt;/code&gt;, if &lt;code&gt;current_prio&lt;/code&gt; &amp;lt; &lt;code&gt;thread_1_prio&lt;/code&gt;. In pseudo code:&lt;/p&gt;

&lt;pre&gt;
updateMax((a, b), (c, d)) {
  if (b &lt; d) {
    (a, b) = (c, d);
  }
}
&lt;/pre&gt;
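
&lt;p&gt;As a minimal, compilable sketch of this cascade (plain C with made-up
thread IDs and priorities; the real dOSEK generator emits a hardened
variant of &lt;code&gt;updateMax()&lt;/code&gt;, not this naive one):&lt;/p&gt;

&lt;pre&gt;&lt;code class='c'&gt;/* Illustrative thread IDs; the idle thread has ID 0 and priority 0. */
enum { idle_id, thread_1_id, thread_2_id, thread_3_id };

/* Replace (id, prio) by (other_id, other_prio) iff the other priority is higher. */
static void updateMax(int *id, int *prio, int other_id, int other_prio) {
    if (*prio &lt; other_prio) {
        *id   = other_id;
        *prio = other_prio;
    }
}

/* Generated cascade for a system with 3 threads. */
static int schedule(int prio_1, int prio_2, int prio_3) {
    int current_thread = idle_id;
    int current_prio   = 0; /* idle priority */
    updateMax(&amp;current_thread, &amp;current_prio, thread_1_id, prio_1);
    updateMax(&amp;current_thread, &amp;current_prio, thread_2_id, prio_2);
    updateMax(&amp;current_thread, &amp;current_prio, thread_3_id, prio_3);
    return current_thread; /* would be handed to switch_to_thread() */
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For instance, &lt;code&gt;schedule(5, 10, 3)&lt;/code&gt; returns &lt;code&gt;thread_2_id&lt;/code&gt;,
and &lt;code&gt;schedule(0, 0, 0)&lt;/code&gt; falls through to the idle thread.&lt;/p&gt;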


&lt;h1&gt;Events in OSEK&lt;/h1&gt;

&lt;p&gt;In OSEK, events are the only possibility for a thread to wait on
something. Each thread can receive a number of event signals. With the
system call &lt;code&gt;WaitEvent()&lt;/code&gt;, a thread can wait for one or more events to
happen. If any of the events from the list is signaled by another
thread with &lt;code&gt;SetEvent()&lt;/code&gt;, the waiting thread unblocks. Signals are not
cleared automatically, but must be cleared explicitly with &lt;code&gt;ClearEvent()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A version with branches can be implemented with two bit masks:&lt;/p&gt;

&lt;pre&gt;
struct Thread {
  ...
  event_mask_t events_waiting;
  event_mask_t events_set;
  ...
};

SetEvent(Thread t, event_mask_t m) {
   t.events_set |= m;
}
WaitEvent(Thread t, event_mask_t m) {
   t.events_waiting = m;
}
ClearEvent(Thread t, event_mask_t m) {
   // Remove the event mask bitwise
   t.events_waiting &amp;= ~m;
   t.events_set     &amp;= ~m;
}

Schedule() {
   ...
   // Include the thread only if it is not waiting at all,
   // or at least one awaited event is set
   if (thread_1.events_waiting == 0
       || (thread_1.events_waiting &amp; thread_1.events_set) != 0) {
      updateMax((current_thread, current_prio),
                (thread_1_id, thread_1_prio));
   }
   ...
}
&lt;/pre&gt;
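
&lt;p&gt;A compilable version of this branchy variant could look as follows
(plain C; &lt;code&gt;is_runnable()&lt;/code&gt; is a helper name introduced here for
illustration, not taken from the dOSEK sources):&lt;/p&gt;

&lt;pre&gt;&lt;code class='c'&gt;typedef unsigned int event_mask_t;

struct Thread {
    event_mask_t events_waiting; /* events the thread waits for */
    event_mask_t events_set;     /* events already signaled */
};

void SetEvent(struct Thread *t, event_mask_t m)   { t-&gt;events_set |= m; }
void WaitEvent(struct Thread *t, event_mask_t m)  { t-&gt;events_waiting = m; }
void ClearEvent(struct Thread *t, event_mask_t m) {
    t-&gt;events_waiting &amp;= ~m;
    t-&gt;events_set     &amp;= ~m;
}

/* The thread takes part in the updateMax() cascade iff it does not
   wait at all, or at least one awaited event is set. */
int is_runnable(const struct Thread *t) {
    return t-&gt;events_waiting == 0
        || (t-&gt;events_waiting &amp; t-&gt;events_set) != 0;
}&lt;/code&gt;&lt;/pre&gt;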


&lt;p&gt;In this simple variant, we maintain an &lt;code&gt;events_waiting&lt;/code&gt; mask
containing the events the thread is waiting for. The &lt;code&gt;events_set&lt;/code&gt;
mask holds the signaled events. If a thread is waiting,
and none of the awaited events is set, we exclude the thread from the
&lt;code&gt;updateMax()&lt;/code&gt; cascade. It is blocked.&lt;/p&gt;

&lt;p&gt;But there is one problem with dependability: we have branches.
Branches are evil; making them resilient against soft errors is hard.
Therefore, we want a branchless version.&lt;/p&gt;

&lt;h1&gt;Events in dOSEK&lt;/h1&gt;

&lt;p&gt;In short, in the branchless version we let the priority of a
thread drop below the idle priority while it is blocked. To this end,
we calculate a blocking term for every thread that is either zero or the
highest priority in the system. This blocking term is subtracted from
the thread's priority when calling &lt;code&gt;updateMax()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;
updateMax((current_thread, current_prio),
          (thread_1_id,    thread_1_prio - blocking_term));
&lt;/pre&gt;


&lt;p&gt;For each event a thread can receive, we have two integer variables: &lt;code&gt;W&lt;/code&gt;
(for waiting) and &lt;code&gt;S&lt;/code&gt; (for set). Both variables can take two values:
either 0 or &lt;code&gt;High&lt;/code&gt; (the highest priority in the system).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/~stettberger/blog/assets/posts/Waiting_in_dOSEK/state-diagram.png&quot;/&gt;&lt;/p&gt;

&lt;p&gt;In this diagram, we see all four states an event can have. An event is a
tuple &lt;code&gt;(W, S)&lt;/code&gt;. The &lt;code&gt;set()&lt;/code&gt; and &lt;code&gt;clear()&lt;/code&gt; operations override the
tuple. If we want to wait for an event mask, we set the &lt;code&gt;W&lt;/code&gt; flag
accordingly for all events a thread can wait for:&lt;/p&gt;

&lt;pre&gt;
struct Event {
   int W;
   int S;
};

Event thread_1_event_a;
Event thread_1_event_b;

...
WaitEvent(Thread t, event_mask_t m) {
   // t is always known at compile time, and this cascade is generated for the system.
   if (t == thread_1) {
      if (m &amp; 1)
         thread_1_event_a.W = High;
      else
         thread_1_event_a.W = 0;

      if (m &amp; 2)
         thread_1_event_b.W = High;
      else
         thread_1_event_b.W = 0;
   }
}
&lt;/pre&gt;


&lt;p&gt;But how can we deduce the &lt;code&gt;blocking_term&lt;/code&gt; from the event states?
First, we calculate the &lt;code&gt;blocking_term&lt;/code&gt; for a single event. We use a
matrix notation that captures all four states from the diagram shown
before.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/~stettberger/blog/assets/posts/Waiting_in_dOSEK/blocking_term.png&quot;/&gt;&lt;/p&gt;

&lt;p&gt;From the event state, we generate a term for each event that is
0 only if the event was used for blocking and is set. In all other
cases, the blocking term is &lt;code&gt;High&lt;/code&gt;. We achieve this using only
bitwise XOR and OR operations. We're still branchless! :-)&lt;/p&gt;
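
&lt;p&gt;One concrete encoding that is consistent with this matrix, using only
XOR and OR (the exact formula in the dOSEK sources may differ; the
&lt;code&gt;HIGH&lt;/code&gt; value is illustrative), is:&lt;/p&gt;

&lt;pre&gt;&lt;code class='c'&gt;#define HIGH 0x7F /* highest priority in the system; value illustrative */

struct Event { int W; int S; }; /* each field is either 0 or HIGH */

/* 0 only in state (W, S) == (HIGH, HIGH), i.e. waiting and set;
   HIGH in the other three states. */
int blocking_term(struct Event e) {
    return (e.W ^ e.S) | (e.W ^ HIGH);
}&lt;/code&gt;&lt;/pre&gt;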

&lt;p&gt;We now combine the blocking terms of all events a specific task can wait
for with AND. The result is zero only if at least one event
on the waiting list is set. Furthermore, we determine whether
we can block in the first place by combining all &lt;code&gt;W&lt;/code&gt; states with OR.
The &lt;code&gt;should_wait&lt;/code&gt; variable is either &lt;code&gt;High&lt;/code&gt; if we're waiting,
or 0 if we're not.&lt;/p&gt;

&lt;pre&gt;
does_block    = blocking_term(thread_1_event_a) &amp; blocking_term(thread_1_event_b);
should_wait   = thread_1_event_a.W | thread_1_event_b.W;
blocking_term = should_wait &amp; does_block;
&lt;/pre&gt;


&lt;p&gt;Combining both variables with AND yields our blocking term.
Branchless. And we can subtract it from the thread's priority before
we call &lt;code&gt;updateMax()&lt;/code&gt;.&lt;/p&gt;
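
&lt;p&gt;Putting the pieces together, a minimal end-to-end sketch for a single
thread with two events (the names, the &lt;code&gt;HIGH&lt;/code&gt; value, and the
blocking-term formula are illustrative assumptions, not the literal dOSEK
implementation):&lt;/p&gt;

&lt;pre&gt;&lt;code class='c'&gt;#define HIGH 0x7FFF /* larger than any thread priority */

struct Event { int W; int S; }; /* each field is either 0 or HIGH */

/* Illustrative blocking term: 0 iff waiting and set, HIGH otherwise. */
int term(struct Event e) { return (e.W ^ e.S) | (e.W ^ HIGH); }

/* Scheduler for one thread (ID 1) with two events; returns the chosen ID. */
int schedule(int thread_1_prio, struct Event a, struct Event b) {
    int does_block  = term(a) &amp; term(b);
    int should_wait = a.W | b.W;
    int blocking    = should_wait &amp; does_block;

    int current_thread = 0; /* idle */
    int current_prio   = 0; /* idle priority */
    /* updateMax() with the blocking term subtracted */
    if (current_prio &lt; thread_1_prio - blocking) {
        current_thread = 1;
        current_prio   = thread_1_prio - blocking;
    }
    return current_thread;
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this sketch, a thread that waits on unset events drops below the idle
priority and is skipped; once any awaited event is set, the blocking term
becomes 0 and the thread competes with its real priority again.&lt;/p&gt;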
</description>
				<published>2015-02-24 00:00:00 +0100</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/Waiting_in_dOSEK/</link>
			</item>
		
	</channel>
</rss>
