<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
	<channel>
		<title>Christian Dietrich's Research Blog</title>
		<description></description>
		<link>https://www4.cs.fau.de/~stettberger/blog</link>
		
			<item>
				<title>A templated System-Call Interface for OO/MPStuBS</title>
				<description>&lt;p&gt;We use OOStuBS/MPStuBS in our
&lt;a href=&quot;/Lehre/WS15/V_BS&quot;&gt;operating system course&lt;/a&gt;. In the first part of our
two part lecture, the students implement basic IRQ handling, coroutine
switching, and synchronisation primitives. We have no spatial or
privilege isolation in place, since this is topic of the
&lt;a href=&quot;/Lehre/SS15/V_BST&quot;&gt;second lecture part&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Still, we want to differentiate between a user space and the kernel
space. On a technical level, the kernel space or system level is
defined by a big kernel lock; the so called guard. If a control flow
enters the guard, it transitions to the kernel and leaves the kernel,
when the guard is left. The guard concept is heavily coupled with our
idea of IRQ handling in epilogues (similar to bottom-halves or
deferred interrupt handlers).&lt;/p&gt;

&lt;p&gt;Our proposed implementation of the system call interface uses
&lt;a href=&quot;https://en.wikipedia.org/wiki/Facade_pattern&quot;&gt;facade pattern&lt;/a&gt; to
expose some of the system functionality as &quot;Guarded Services&quot;.&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;class Guarded_Scheduler {
public:
     static void resume() {
          Secure section; // does guard.enter() in constructor
          scheduler.resume();
          // guard.leave() is called on Secure destructor
     }
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;Secure&lt;/code&gt; class applies the &lt;em&gt;R&lt;/em&gt;esource &lt;em&gt;A&lt;/em&gt;cquisition &lt;em&gt;I&lt;/em&gt;s
&lt;em&gt;I&lt;/em&gt;nitialisation (RAII) pattern: it enters the guard on construction and
leaves it upon destruction of the &lt;code&gt;Secure&lt;/code&gt; object. But, as you see,
writing this pattern out by hand is cumbersome and involves a lot of
boilerplate. Nobody, especially not interested students, wants to write
boilerplate. Fortunately, our OS is implemented in C++, so we have
powerful language features at hand to build a usable abstraction. In the
following, I will explain how we can implement an easily extensible
system-call interface for a library operating system (everything is linked
together, and we have no spatial isolation).&lt;/p&gt;
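
&lt;p&gt;As a minimal sketch (assuming a global &lt;code&gt;guard&lt;/code&gt; object with
&lt;code&gt;enter()&lt;/code&gt; and &lt;code&gt;leave()&lt;/code&gt; methods, as used in our course), such a
&lt;code&gt;Secure&lt;/code&gt; class could look like this:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;class Secure {
public:
    Secure()  { guard.enter(); } // transition to the kernel level
    ~Secure() { guard.leave(); } // leave the kernel level on scope exit
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;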

&lt;h1&gt;A First Attempt&lt;/h1&gt;

&lt;p&gt;First, we start with a &quot;simple&quot; templated function that can wrap every
member function of an object and call it with the guard taken. The actual
API usage looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; syscall(&amp;amp;Scheduler::resume, scheduler);
 syscall(&amp;amp;Scheduler::kill,   scheduler, &amp;amp;other_thread);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first argument to &lt;code&gt;syscall()&lt;/code&gt; might surprise some readers, since
it uses a seldom-used C++ feature: a &quot;pointer to member&quot;, which
captures how we can access or call a member once we have the
corresponding object at hand. The datatype of &lt;code&gt;&amp;amp;Scheduler::resume&lt;/code&gt; is
&lt;code&gt;void (Scheduler::*)()&lt;/code&gt;, which is similar to a function pointer
returning nothing and taking no arguments. &lt;code&gt;&amp;amp;Scheduler::kill&lt;/code&gt; has the
datatype &lt;code&gt;void (Scheduler::*)(Thread *)&lt;/code&gt;; it is a pointer to a member
function, which returns nothing but takes a Thread pointer as
argument. Both pointers only make sense together with a Scheduler
object. Once we have one, we can use the rarely seen &lt;code&gt;.*&lt;/code&gt; operator:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; ((scheduler).*(&amp;amp;Scheduler::kill))(thread)
&lt;/code&gt;&lt;/pre&gt;
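
&lt;p&gt;As a small self-contained illustration (with a hypothetical &lt;code&gt;Thread&lt;/code&gt;
type), a pointer to member can also be stored in a variable before it is
invoked:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;// Declare a pointer-to-member-function variable and bind it
void (Scheduler::*op)(Thread *) = &amp;amp;Scheduler::kill;

// Invoke it on an object; with an object pointer, -&amp;gt;* is used instead
(scheduler.*op)(thread);&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;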

&lt;p&gt;We now can combine this concept with C++11 templates to get the
described syscall function:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename Func, typename Class, typename ...Args&amp;gt;
inline auto syscall(Func func,  Class &amp;amp;obj, Args&amp;amp;&amp;amp;... args) -&amp;gt; decltype((obj.*func)(args...)){
    Secure secure;
    return (obj.*func)(args...);
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Huh, what happens here? Let's take this monster apart to understand
how it works. It is a &lt;em&gt;function template&lt;/em&gt;: it generates functions
depending on the types it is specialized for. You can think of this
specialization process like this: the compiler has a &lt;em&gt;Schablone&lt;/em&gt;
(the German word for template, but with the notion of scissors and paper)
at hand. When it sees a function call to &lt;code&gt;syscall()&lt;/code&gt;, it fills the
missing parts of the Schablone with the argument types and compiles
the result as a new function.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;syscall(Func arg0, Class arg1, Args... args2_till_9001)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So, our syscall function takes at least two arguments, but can consume
arbitrarily many further arguments in its variadic part at the end (the
Args...). The type of the first argument is bound to the type &quot;Func&quot;,
the second argument type is bound to the type &quot;Class&quot;, and all others are
collected in the variadic type &quot;Args&quot;. The func argument, which has type
Func, is the pointer-to-member object; the obj argument is the actual
system object. So, now we can call the member function with the remaining
arguments:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; (obj.*func)(args...)
&lt;/code&gt;&lt;/pre&gt;
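
&lt;p&gt;To make the specialization process concrete: for the call
&lt;code&gt;syscall(&amp;amp;Scheduler::kill, scheduler, thread)&lt;/code&gt;, the compiler effectively
fills in the Schablone like this (a mental model, not actual compiler
output):&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;// Func = void (Scheduler::*)(Thread *), Class = Scheduler, Args = {Thread *}
inline void syscall(void (Scheduler::*func)(Thread *),
                    Scheduler &amp;amp;obj, Thread *arg) {
    Secure secure;
    return (obj.*func)(arg);
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;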

&lt;p&gt;But our function still has no return type. What to do? Here, C++
&lt;code&gt;auto&lt;/code&gt; and &lt;code&gt;decltype&lt;/code&gt; come to the rescue. When using &lt;code&gt;auto&lt;/code&gt; as a return
type, the compiler expects &lt;code&gt;-&amp;gt; Type&lt;/code&gt; after the closing parenthesis of
the function. The &lt;code&gt;decltype()&lt;/code&gt; built-in gives the type of the enclosed
expression. So &lt;code&gt;decltype((obj.*func)(args...))&lt;/code&gt; is exactly the return
type of the given pointer-to-member-function argument.&lt;/p&gt;
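
&lt;p&gt;The same auto/decltype idiom works outside of the syscall context, too;
a tiny stand-alone example (with hypothetical names):&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename A, typename B&amp;gt;
auto add(A a, B b) -&amp;gt; decltype(a + b) {
    return a + b;
}
// add(1, 2.5) has the deduced return type double&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;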

&lt;p&gt;Furthermore, we just have to allocate a &lt;code&gt;Secure&lt;/code&gt; object to make the
&lt;code&gt;guard.enter()&lt;/code&gt; and &lt;code&gt;guard.leave()&lt;/code&gt; calls. Voilà, a system call
interface. But it still has some problems. We can call every method on
every object in the whole system; we have no notion of &quot;allowed&quot;
and forbidden system calls. Of course, in a library operating
system with no protection this is OK. Furthermore, we always have to
pass the system object (e.g., scheduler) on each system call. I think
we can do better here. So let's revisit our implementation.&lt;/p&gt;

&lt;h1&gt;A Second Attempt&lt;/h1&gt;

&lt;p&gt;In our second attempt, we want to restrict the system-call interface
to certain classes. This gives coarse-grained control about the
methods that can be called via the &lt;code&gt;syscall&lt;/code&gt; interface. As a
side-effect, we can omit the actual system-object argument such that
we can write:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;syscall(&amp;amp;Scheduler::resume)
syscall(&amp;amp;Scheduler::kill, that)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;We implement a &lt;code&gt;system_object&lt;/code&gt; function that returns the system-object
singleton instance when called for a given type. We implement this
function only for those classes we want to make accessible via
&lt;code&gt;syscall&lt;/code&gt;. This gives us some control over the possible syscall targets.&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename Class&amp;gt;
Class&amp;amp; system_object();

// Get the scheduler singleton
Scheduler &amp;amp;scheduler = system_object&amp;lt;Scheduler&amp;gt;();&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The template specialization can be done in the source file and does
not have to be put into the header. This allows us to hide the actual
system-object symbol from the rest of the system. For example, this
could be located in the &lt;code&gt;thread/scheduler.cc&lt;/code&gt; file:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;static Scheduler instance;

template&amp;lt;&amp;gt;
Scheduler &amp;amp;system_object() {
    return instance;
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;We still have to call this function from our system-call
implementation. For this, we need to have the class type of the
underlying system object at hand. The only thing we have is the
pointer-to-member object that identifies the desired system call
(&lt;code&gt;&amp;amp;Scheduler::resume&lt;/code&gt;). But, as you remember, the class type is part
of the type of such a pointer to member (&lt;code&gt;Func&lt;/code&gt;). We only have to
extract that information from the given type.&lt;/p&gt;

&lt;p&gt;The concept of accessing information about types is called &lt;code&gt;type
traits&lt;/code&gt;. This is a grandiloquent term for &quot;a template that takes a type
and provides several types and constants&quot;. So let's look at our type
trait:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;// Fail for all cases...
template&amp;lt;typename&amp;gt; struct syscall_traits;

// ..., except for deconstructing a pointer to member type
template&amp;lt;typename ReturnType, typename ClassType, typename ...Args&amp;gt;
struct syscall_traits&amp;lt;ReturnType(ClassType::*)(Args...)&amp;gt; {
    typedef ReturnType result_type;
    typedef ClassType class_type;
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This &lt;code&gt;syscall_traits&lt;/code&gt; template is only specialized for
pointer-to-member types and deconstructs the type of our
&lt;code&gt;&amp;amp;Scheduler::resume&lt;/code&gt; argument (&lt;code&gt;void (Scheduler::*)()&lt;/code&gt;) with the
pattern &lt;code&gt;ReturnType (ClassType::*)(Args...)&lt;/code&gt;. As you see, the
template only does pattern matching on types and binds types to
template parameters. This can generally be said for templates: the
&lt;code&gt;&amp;lt;&amp;gt;&lt;/code&gt;-line after the template keyword defines type variables, which can
be bound later on or have to be supplied by the user. With our type
trait we can simply access the instance class of our pointer-to-member
argument and can call &lt;code&gt;system_object()&lt;/code&gt;:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename Func, typename ...Args&amp;gt;
inline auto syscall(Func func, Args&amp;amp;&amp;amp;... args) -&amp;gt; typename syscall_traits&amp;lt;Func&amp;gt;::result_type {
    // We do everything with a taken guard
    Secure secure;

    // Get traits of the system call
    typedef typename syscall_traits&amp;lt;Func&amp;gt;::class_type system_object_type;

    // Get the singleton instance for the given class type
    system_object_type &amp;amp;obj = system_object&amp;lt;system_object_type&amp;gt;();

    return (obj.*func)(args...);
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The first thing we see is that the deduced return type has changed. It
now has to use our type trait, since we have no system object at hand
that we could use with decltype (&lt;code&gt;-&amp;gt; decltype((obj.*func)(args...))&lt;/code&gt;).
Within the body of the syscall function, we use the trait to extract the
system-object's class type from the &lt;code&gt;Func&lt;/code&gt; type and call system_object
to gain access to the singleton instance.&lt;/p&gt;

&lt;p&gt;If we use syscall on a class that is not exposed via specializing
&lt;code&gt;system_object&amp;lt;&amp;gt;&lt;/code&gt;, we get a linker error, and the developer is
informed that the call is not allowed.&lt;/p&gt;
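
&lt;p&gt;For example, with a hypothetical &lt;code&gt;Keyboard&lt;/code&gt; class that is not exposed
via &lt;code&gt;system_object&amp;lt;&amp;gt;&lt;/code&gt;, the failure shows up at link time, roughly like
this:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;syscall(&amp;amp;Keyboard::plugin);
// linker: undefined reference to 'system_object&amp;lt;Keyboard&amp;gt;()'&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;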

&lt;p&gt;So, what have we achieved in the second attempt? We have a cleaner
system-call interface and do not have to supply the system object
directly, but it is deduced from the supplied arguments. Furthermore,
only annotated classes are suitable for being called via this
interface. Nevertheless, we can still call all functions on these
classes. In the third attempt we want to solve this as well.&lt;/p&gt;

&lt;h1&gt;The Third and Final Attempt&lt;/h1&gt;

&lt;p&gt;How can we annotate functions as being system calls? The only real
handle we have in static C++ land are types. So we have to
annotate the function type of our system call somehow. The type of a
method is defined by only a few pieces of information: the argument
types, the class type, and the return type. The one thing that is
always there, and that is not shared among several functions, is the
return type. We use the return type for our annotation by wrapping it
into a marker struct:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template &amp;lt;typename T=void&amp;gt; struct syscall_return;
template&amp;lt;&amp;gt; struct syscall_return&amp;lt;void&amp;gt; { void get() {}; };

template &amp;lt;typename T&amp;gt;
class syscall_return {
    T value;
public:
    syscall_return(T&amp;amp;&amp;amp; x) : value(x) {}
    operator T() { return value; }
    T get() { return value; }
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;syscall_return&lt;/code&gt; template wraps a type and contains a copy of
it. Furthermore, it implements a &lt;code&gt;get()&lt;/code&gt; method to access this inner
object, and it overloads the cast operator to T for easier
handling. The &lt;code&gt;void&lt;/code&gt; type is special here and has to be handled
specially, since it is a no-object type and cannot be instantiated.&lt;/p&gt;

&lt;p&gt;We can now annotate functions in our Scheduler class:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;struct Scheduler {
    syscall_return&amp;lt;void&amp;gt; resume() {
        printf(&amp;quot;resume\n&amp;quot;);
        return syscall_return&amp;lt;&amp;gt;();
    }

    virtual syscall_return&amp;lt;int&amp;gt; increment(int i) {
        return i+1;
    }
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;As you see, we have to special-case void again (&quot;Damn you void,
you and your voidness!&quot;). But the implicit conversion via the constructor
makes it easy to return all other types. We also have to adapt the
rest of our implementation. In the &lt;code&gt;syscall_traits&lt;/code&gt; template, the
matched pattern now strips the &lt;code&gt;syscall_return&lt;/code&gt; wrapper from the
type. This also causes all unwrapped return types to fail.&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;// ..., except for deconstructing a pointer to member type
template&amp;lt;typename ReturnType, typename ClassType, typename ...Args&amp;gt;
struct syscall_traits&amp;lt;syscall_return&amp;lt;ReturnType&amp;gt;(ClassType::*)(Args...)&amp;gt; {
    typedef ReturnType result_type;
    typedef ClassType class_type;
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;In the &lt;code&gt;syscall&lt;/code&gt; template, we only have to additionally call &lt;code&gt;.get()&lt;/code&gt;
on the result:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;template&amp;lt;typename Func, typename ...Args&amp;gt;
inline auto syscall(Func func, Args&amp;amp;&amp;amp;... args) -&amp;gt; typename syscall_traits&amp;lt;Func&amp;gt;::result_type {
      // We do everything with a taken guard
      Secure secure;

      // Get traits of systemcall
      typedef typename syscall_traits&amp;lt;Func&amp;gt;::class_type system_object_type;

      // Get a singleton instance for the given base type.
      system_object_type &amp;amp;obj = system_object&amp;lt;system_object_type&amp;gt;();

      return (obj.*func)(args...).get();
};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;And voilà, we have a system-call interface with annotations that
prevents the user from calling unmarked functions via &lt;code&gt;syscall&lt;/code&gt;. All
abstractions from above come at zero run-time cost.&lt;/p&gt;
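
&lt;p&gt;Putting it all together, with the annotated Scheduler from above, the
interface is used like this:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;syscall(&amp;amp;Scheduler::resume);                // void system call
int i = syscall(&amp;amp;Scheduler::increment, 41); // result unwrapped by get()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;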

&lt;p&gt;The only downside is that the user is still able to call the functions
directly. But, this can never be solved in a library operating
system.&lt;/p&gt;

&lt;p&gt;I hope I could give you an impression of what is possible with C++
templates in the context of a bare-metal operating system.&lt;/p&gt;
</description>
				<published>2015-12-23 00:00:00 +0100</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/Template-Syscall/</link>
			</item>
		
			<item>
				<title>Testing Three-Valued Vectors for Compatibility</title>
				<description>&lt;p&gt;For a colleagues project, we encountered the problem to check vectors
of values for compatibility.  The values are either set or
undefined. An undefined value is compatible to everything; a set value
is compatible to the same value. An example instance of this problem
might look like this, when the possible values are 'a', 'b', and 'c';
undefined is indicated by 'U':&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt; Vector 1 &lt;/th&gt;
&lt;th align=&quot;center&quot;&gt; Vector 2 &lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; a        &lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; a        &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; b        &lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; U        &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; U        &lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; U        &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; a        &lt;/td&gt;
&lt;td align=&quot;center&quot;&gt; c        &lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;


&lt;p&gt;These two vectors are compatible in their first three lines, since
undefined is compatible with everything; the last line is incompatible
('a' vs. 'c'). How can we test such vectors for compatibility in a fast
fashion? The first simple idea is to use a char array, encode undefined
as 0, and compare the vectors character by character:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;char A[] = {'a', 'b', 0, 'a'};
char B[] = {'a',  0,  0, 'c'};

bool compatible = true;
for (unsigned i = 0; i &amp;lt; 4; i++) {
    if (A[i] != 0 &amp;amp;&amp;amp; B[i] != 0 &amp;amp;&amp;amp; A[i] != B[i]) {
        compatible = false; break;
    }
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This would do the job and is already quite fast. Nevertheless, we can
do it faster. We can avoid checking three different conditions (&amp;amp;&amp;amp;) by
using multiplication and the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Semigroup#Identity_and_zero&quot;&gt;zero element&lt;/a&gt;
property of the number zero:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;if (A[i] * B[i] * (A[i] - B[i]) != 0) {
    compatible = false; break;
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This expression is non-zero exactly when both elements are
non-zero and their difference is non-zero; a non-zero difference is
just another way of expressing inequality. But, as we learned in our
processor-design lecture, multiplications are expensive. So let's
search for a way to do the same without multiplication. Our advantage
is that we are not interested in the result of the multiplication, but
only in its property of not being zero. Perhaps we can do something
with bit shifts. First, we encode our three elements in a denser way,
using at most two bits:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;char A[] = {1, 2, 0, 1};
char B[] = {1, 0, 0, 3};&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;After fiddling around at the whiteboard, I came up with the following
solution for 3 possible values plus the undefined value, which works
quite well for our use case:&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;if ((((A[i] &amp;lt;&amp;lt; 1) &amp;amp; B[i]) ^ ((B[i] &amp;lt;&amp;lt; 1) &amp;amp; A[i])) != 0) {
    compatible = false; break;
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This expression implements, although it is not easily visible, the
required behavior. We can see this easily by looking at the truth
table of the function &lt;code&gt;f(a, b)=(((a &amp;lt;&amp;lt; 1) &amp;amp; b) ^ ((b &amp;lt;&amp;lt; 1) &amp;amp; a)) == 0&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; | a | b | f(a, b)|         | a | b | f(a, b)|
 |:-:|:-:|:------:|         |:-:|:-:|:------:|
 | 0 | 0 | 1      |         | 2 | 0 | 1      |
 | 0 | 1 | 1      |         | 2 | 1 | 0      |
 | 0 | 2 | 1      |         | 2 | 2 | 1      |
 | 0 | 3 | 1      |         | 2 | 3 | 0      |
 | 1 | 0 | 1      |         | 3 | 0 | 1      |
 | 1 | 1 | 1      |         | 3 | 1 | 0      |
 | 1 | 2 | 0      |         | 3 | 2 | 0      |
 | 1 | 3 | 0      |         | 3 | 3 | 1      |
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So, this is a very limited Boolean function: it works only for 2-bit
wide values of a and b. But it is fast. And the best part is that it
consists only of bit operations. This means we can put many vector
values into a single machine word and compare all of them in one
step. Unfortunately, we have to insert a padding bit between the value
bits to have zeroes that can be shifted in and out. So when we encode
our example from above, we get the following bit vectors:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;         [0]  [1]  [2]  [3]  | int
      A  001  010  000  001  | 641
      B  001  000  000  011  | 515
 ---------------------------------
 f(A, B) 000  000  000  010  |   2
&lt;/code&gt;&lt;/pre&gt;
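
&lt;p&gt;As a sketch of the word-wide check (my own illustration, assuming 3-bit
fields packed into unsigned 64-bit words):&lt;/p&gt;

&lt;div&gt;
  &lt;pre&gt;&lt;code class='c'&gt;#include &amp;lt;stdint.h&amp;gt;

/* Compare up to 21 three-bit values packed into two 64-bit words.
   Returns non-zero iff the vectors are incompatible somewhere. */
static inline uint64_t incompatible(uint64_t a, uint64_t b) {
    return ((a &amp;lt;&amp;lt; 1) &amp;amp; b) ^ ((b &amp;lt;&amp;lt; 1) &amp;amp; a);
}

/* For the example above: incompatible(641, 515) yields 2 */&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;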

&lt;p&gt;As you see, the difference occurs in the &lt;code&gt;[3]&lt;/code&gt; column, where our
values are both set, but different. With this neat trick, we can put
21(!) values into a single 64-bit word and compare them all at
once. With this optimization, I could improve the runtime of our
problem from 5 minutes to 1 minute; just to give you a rough
idea of the improvement for our (unspecified) problem. The speedup stems
both from the denser encoding (transferring less memory) and from the
faster operations (bit operations are cheap).&lt;/p&gt;
</description>
				<published>2015-12-16 00:00:00 +0100</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/Vector-Compatibility/</link>
			</item>
		
			<item>
				<title>On Conference: PLDI and LCTES</title>
				<description>&lt;p&gt;Currently, I'm attending the &lt;a href=&quot;https://fcrc.acm.org/&quot;&gt;FCRC&lt;/a&gt;
Multi-Conference in Portland, Oregon. I want to write a few paragraphs
about contributions I found especially interesting, and this post is
more a journal for myself, than written for wider audience. But,
perhaps this is interesting to others as well.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://herbie.uwplse.org/pldi15-herbie.pdf&quot;&gt;Panchekha et al.&lt;/a&gt;
introduced Herbie, a heuristic optimizer for floating-point
expressions that increases precision. When computing a floating-point
expression, the ordering and selection of instructions is essential
for the precision of the calculation. Herbie takes an actual R&lt;sup&gt;n&lt;/sup&gt;-&gt;R
function and emits a partially defined function that minimizes the
imprecision introduced by the selected operations.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.cs.rutgers.edu/~santosh.nagarakatte/papers/alive-pldi15.pdf&quot;&gt;Lopes et
al&lt;/a&gt;
presented &lt;a href=&quot;http://blog.regehr.org/archives/1170&quot;&gt;Alive&lt;/a&gt;, a
verifier for peephole optimizations in compilers. A peephole
optimization looks at the intermediate representation or the machine
code and replaces templates of code with faster templates of
code. Alive uses the Z3 theorem prover to prove the correctness of
such optimizations in LLVM and found 8 bugs.&lt;/p&gt;

&lt;p&gt;Furthermore, I learned about the existence of &lt;a href=&quot;https://en.wikipedia.org/wiki/Vickrey_auction&quot;&gt;Vickrey
auctions&lt;/a&gt;, a form of
auction where the highest bid wins, but the winner pays
the price of the second-highest bid. In contrast to a normal auction,
this auction type maximizes the social welfare instead of the
revenue. Social welfare is defined in this setting as: the bidder
who values the item the most will win it.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.eecs.harvard.edu/~skanev/papers/isca15wsc.pdf&quot;&gt;Kanev et
al.&lt;/a&gt;
presented a hardware profiling of whole datacenters, and the results
are rather amazing. They profiled a bunch of Google servers for a few
weeks and examined the results. It is surprising that about 30 percent
of all instructions are spent on the &quot;datacenter tax&quot; (allocation,
memmove, RPC, protobuf, hashing, compression). This is really a huge
number. Furthermore, they could show that pipeline stalls due to
instruction-cache misses contribute largely to the waiting time in
those large datacenter applications. The i-cache working sets are
often larger than the L2 cache, so the instructions have to compete with
data cache lines. Perhaps we will see processors with a split L2 cache in
the future.&lt;/p&gt;

&lt;p&gt;In the DCC keynote, John Wilkes talked about cluster management in the
Google datacenters, and their approach is fascinating. The basic
assumption is: a machine that is not running has a speed
of 0. Therefore, they optimize for availability and assume failure to
be the normal operation mode. In an EuroSys'15 paper, &lt;a href=&quot;http://www.e-wilkes.com/john/papers/2015-EuroSys-Borg.pdf&quot;&gt;Verma et
al&lt;/a&gt; describe
Borg, the cluster-management software Google uses internally.&lt;/p&gt;

&lt;p&gt;During the LCTES conference,
&lt;a href=&quot;http://www.sjalander.com/research/pdf/sjalander-lctes2015-ida.pdf&quot;&gt;Bardizbanyan et al.&lt;/a&gt;
presented a processor modification that adapts the memory-fetch stage so
that it takes the needs of the current memory operation into account. Not
all memory operations need all features the addressing mode
provides. For example, &lt;code&gt;mov 4(%eax), %ebx&lt;/code&gt; doesn't need an offset from
a register with scaling (in contrast to &lt;code&gt;mov 4(%eax, %ecx,
4)&lt;/code&gt;). Therefore, they proposed to gate these addressing features within
the memory-fetch stage and to do speculative address calculation to
improve the energy consumption and latency of the stage.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.acm.org/citation.cfm?id=2754952&quot;&gt;Baird et al.&lt;/a&gt; presented a
method to optimize programs for static-pipeline processors. A
&lt;a href=&quot;http://www.cs.fsu.edu/~whalley/papers/cal11.pdf&quot;&gt;static pipeline&lt;/a&gt; is
similar to a horizontally microcoded CPU. For a static-pipeline
CPU, the compiler doesn't emit a stream of instructions where each
token is one instruction; instead, it splits the effects over several
commands. Each command describes what all stages of the pipeline
should do in the current instruction cycle. Statically pipelined
processors are hard to program, but exhibit a high energy
efficiency. Baird proposed methods to optimize transfer-of-control
instructions for these command packets.&lt;/p&gt;

&lt;p&gt;From
&lt;a href=&quot;http://liberty.princeton.edu/Publications/lctes15_alias.pdf&quot;&gt;Ghosh et al.&lt;/a&gt;,
I learned that processors that do dynamic binary translation (e.g.,
the Transmeta Crusoe) can do speculative alias analysis. For this, the
processor has some alias registers, and every instruction is marked to
either update or check a specific alias register. If two instructions
then turn out to have aliasing pointers, the CPU faults, and the program
is retranslated without the optimization that led to that fault.&lt;/p&gt;

&lt;p&gt;With Clover,
&lt;a href=&quot;http://www4.ncsu.edu/~dtiwari2/Papers/2015_LCTES_Compiler_Soft_Error.pdf&quot;&gt;Liu et al.&lt;/a&gt;
presented a hybrid approach to mitigate soft errors. As a hardware
platform, they used a processor with sonic micro-detectors that can
detect the impact of a cosmic particle. In software, they implemented
checkpointing for code regions. Since the detector has a delay due to
the physical limitation of a sonic detector, they proposed a
compiler-based approach that executes the last N instructions of each
code region twice in order to cover the worst-case detection
delay. Although they claimed to be free of silent data corruptions
(SDCs), they make strong assumptions about their fault model (faults
occur on chip and the memory is ECC protected) and about control-flow
errors (there is a working software-based control-flow error detection).&lt;/p&gt;
</description>
				<published>2015-06-15 00:00:00 +0200</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/On-Conference-PLDI-LCTES/</link>
			</item>
		
			<item>
				<title>dOSEK Version 1.1</title>
				<description>&lt;p&gt;The DanceOS team is proud to announce the release of dOSEK version
1.1. With the latest release, we added support for OSEK events and an
improved ARM support.&lt;/p&gt;

&lt;p&gt;OSEK events are a synchronization primitive provided by the
kernel. Events are system objects, which are declared in the OIL
configuration file. Each event belongs to exactly one task in the
system. Only the owning task can clear the event or wait for its
arrival, but events can be signaled by any other task in the
system. Additionally, an alarm can be configured to send a signal to a
specific task.&lt;/p&gt;

&lt;p&gt;The ARM support was improved, and dOSEK now runs on a real hardware
platform: the ZedBoard. This architecture port supports all
dependability features of dOSEK except the memory protection.&lt;/p&gt;

&lt;p&gt;The additional dependability features of dOSEK include: a concurrent
checker for data objects, replication of the OS state, and retry of
encoded scheduling operations.&lt;/p&gt;

&lt;p&gt;The source code can be obtained from &lt;a href=&quot;https://www.github.com/danceos/dosek&quot;&gt;github&lt;/a&gt;. For more details on
the changes, have a look at the &lt;a href=&quot;https://github.com/danceos/dosek/blob/master/Changelog.md&quot;&gt;Changelog&lt;/a&gt;.&lt;/p&gt;
</description>
				<published>2015-04-01 00:00:00 +0200</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/dOSEK-Release-v1.1/</link>
			</item>
		
			<item>
				<title>Waiting in dOSEK</title>
				<description>&lt;p&gt;In operating systems, waiting states are an essential feature to keep
up the illusion that every thread is alone on the machine. In
general-purpose operating systems, waiting states occur when data is
read from the hard drive or when data is written from a network
socket. A thread can also wait for the completion of work executed in
another thread. Here, one thread waits for a signal the other thread
provides. Waiting states are also part of real-time operating systems,
like the OSEK standard. We've now implemented this feature, which is
required for the OSEK ECC1 conformance class.&lt;/p&gt;

&lt;h1&gt;Scheduling in dOSEK&lt;/h1&gt;

&lt;p&gt;The core of dOSEK is the priority-driven scheduler. Since OSEK is a
static operating-system standard, we know, for a specific system,
exactly how many threads exist. This number never changes; it is
configured at compile time. The scheduler always selects the runnable
thread with the highest priority and executes it. In dOSEK,
a thread is runnable if its priority is larger than the priority of
the idle thread. In pseudo code, the scheduler/dispatcher looks like
this:&lt;/p&gt;

&lt;pre&gt;
schedule() {
   current_thread = idle_id;
   current_prio   = idle_prio;

   updateMax((current_thread, current_prio),
             (thread_1_id, thread_1_prio));

   updateMax((current_thread, current_prio),
             (thread_2_id, thread_2_prio));

   updateMax((current_thread, current_prio),
             (thread_3_id, thread_3_prio));

   switch_to_thread(current_thread);
}
&lt;/pre&gt;


&lt;p&gt;The scheduler is generated for the specific system (in this case, for a
system with 3 threads) and contains an &lt;code&gt;updateMax()&lt;/code&gt; cascade.
&lt;code&gt;updateMax()&lt;/code&gt; is a hardened operation that updates the first input
tuple with the second one, iff the priority of the second tuple (its
second item) is higher than the priority of the first tuple. In the
first cascade element, &lt;code&gt;current_thread&lt;/code&gt; is set to
&lt;code&gt;thread_1_id&lt;/code&gt;, if &lt;code&gt;current_prio&lt;/code&gt; &amp;lt; &lt;code&gt;thread_1_prio&lt;/code&gt;. In pseudo code:&lt;/p&gt;

&lt;pre&gt;
updateMax((a, b), (c, d)) {
  if (b &lt; d) {
    (a, b) = (c, d);
  }
}
&lt;/pre&gt;
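
&lt;p&gt;As a minimal, compilable sketch of this cascade (plain C with made-up
thread IDs and priorities; the real dOSEK generator emits a hardened
variant of &lt;code&gt;updateMax()&lt;/code&gt;, not this naive one):&lt;/p&gt;

&lt;pre&gt;&lt;code class='c'&gt;/* Illustrative thread IDs; the idle thread has ID 0 and priority 0. */
enum { idle_id, thread_1_id, thread_2_id, thread_3_id };

/* Replace (id, prio) by (other_id, other_prio) iff the other priority is higher. */
static void updateMax(int *id, int *prio, int other_id, int other_prio) {
    if (*prio &lt; other_prio) {
        *id   = other_id;
        *prio = other_prio;
    }
}

/* Generated cascade for a system with 3 threads. */
static int schedule(int prio_1, int prio_2, int prio_3) {
    int current_thread = idle_id;
    int current_prio   = 0; /* idle priority */
    updateMax(&amp;current_thread, &amp;current_prio, thread_1_id, prio_1);
    updateMax(&amp;current_thread, &amp;current_prio, thread_2_id, prio_2);
    updateMax(&amp;current_thread, &amp;current_prio, thread_3_id, prio_3);
    return current_thread; /* would be handed to switch_to_thread() */
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For instance, &lt;code&gt;schedule(5, 10, 3)&lt;/code&gt; returns &lt;code&gt;thread_2_id&lt;/code&gt;,
and &lt;code&gt;schedule(0, 0, 0)&lt;/code&gt; falls through to the idle thread.&lt;/p&gt;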


&lt;h1&gt;Events in OSEK&lt;/h1&gt;

&lt;p&gt;In OSEK, events are the only possibility for a thread to wait on
something. Each thread can receive a number of event signals. With the
system call &lt;code&gt;WaitEvent()&lt;/code&gt;, a thread can wait for one or more events to
happen. If any of the events from the list is signaled by another
thread with &lt;code&gt;SetEvent()&lt;/code&gt;, the waiting thread unblocks. Signals are not
cleared automatically, but must be cleared explicitly with &lt;code&gt;ClearEvent()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A version with branches can be implemented with two bit masks:&lt;/p&gt;

&lt;pre&gt;
struct Thread {
  ...
  event_mask_t events_waiting;
  event_mask_t events_set;
  ...
};

SetEvent(Thread t, event_mask_t m) {
   t.events_set |= m;
}
WaitEvent(Thread t, event_mask_t m) {
   t.events_waiting = m;
}
ClearEvent(Thread t, event_mask_t m) {
   // Remove the event mask bitwise
   t.events_waiting &amp;= ~m;
   t.events_set     &amp;= ~m;
}

Schedule() {
   ...
   // Include the thread only if it is not waiting at all,
   // or at least one awaited event is set
   if (thread_1.events_waiting == 0
       || (thread_1.events_waiting &amp; thread_1.events_set) != 0) {
      updateMax((current_thread, current_prio),
                (thread_1_id, thread_1_prio));
   }
   ...
}
&lt;/pre&gt;
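
&lt;p&gt;A compilable version of this branchy variant could look as follows
(plain C; &lt;code&gt;is_runnable()&lt;/code&gt; is a helper name introduced here for
illustration, not taken from the dOSEK sources):&lt;/p&gt;

&lt;pre&gt;&lt;code class='c'&gt;typedef unsigned int event_mask_t;

struct Thread {
    event_mask_t events_waiting; /* events the thread waits for */
    event_mask_t events_set;     /* events already signaled */
};

void SetEvent(struct Thread *t, event_mask_t m)   { t-&gt;events_set |= m; }
void WaitEvent(struct Thread *t, event_mask_t m)  { t-&gt;events_waiting = m; }
void ClearEvent(struct Thread *t, event_mask_t m) {
    t-&gt;events_waiting &amp;= ~m;
    t-&gt;events_set     &amp;= ~m;
}

/* The thread takes part in the updateMax() cascade iff it does not
   wait at all, or at least one awaited event is set. */
int is_runnable(const struct Thread *t) {
    return t-&gt;events_waiting == 0
        || (t-&gt;events_waiting &amp; t-&gt;events_set) != 0;
}&lt;/code&gt;&lt;/pre&gt;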


&lt;p&gt;In this simple variant, we maintain an &lt;code&gt;events_waiting&lt;/code&gt; mask
containing the events the thread is waiting for. The &lt;code&gt;events_set&lt;/code&gt;
mask holds the signaled events. If a thread is waiting,
and none of the awaited events is set, we exclude the thread from the
&lt;code&gt;updateMax()&lt;/code&gt; cascade. It is blocked.&lt;/p&gt;

&lt;p&gt;But there is one problem with dependability: we have branches.
Branches are evil; making them resilient against soft errors is hard.
Therefore, we want a branchless version.&lt;/p&gt;

&lt;h1&gt;Events in dOSEK&lt;/h1&gt;

&lt;p&gt;In short, in the branchless version we let the priority of a
thread drop below the idle priority while it is blocked. To this end,
we calculate a blocking term for every thread that is either zero or the
highest priority in the system. This blocking term is subtracted from
the thread's priority when calling &lt;code&gt;updateMax()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;
updateMax((current_thread, current_prio),
          (thread_1_id,    thread_1_prio - blocking_term));
&lt;/pre&gt;


&lt;p&gt;For each event a thread can receive, we have two integer variables: &lt;code&gt;W&lt;/code&gt;
(for waiting) and &lt;code&gt;S&lt;/code&gt; (for set). Both variables can take two values:
either 0 or &lt;code&gt;High&lt;/code&gt; (the highest priority in the system).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/~stettberger/blog/assets/posts/Waiting_in_dOSEK/state-diagram.png&quot;/&gt;&lt;/p&gt;

&lt;p&gt;In this diagram, we see all four states an event can have. An event is a
tuple &lt;code&gt;(W, S)&lt;/code&gt;. The &lt;code&gt;set()&lt;/code&gt; and &lt;code&gt;clear()&lt;/code&gt; operations override the
tuple. If we want to wait for an event mask, we set the &lt;code&gt;W&lt;/code&gt; flag
accordingly for all events a thread can wait for:&lt;/p&gt;

&lt;pre&gt;
struct Event {
   int W;
   int S;
};

Event thread_1_event_a;
Event thread_1_event_b;

...
WaitEvent(Thread t, event_mask_t m) {
   // t is always known at compile time, and this cascade is generated for the system.
   if (t == thread_1) {
      if (m &amp; 1)
         thread_1_event_a.W = High;
      else
         thread_1_event_a.W = 0;

      if (m &amp; 2)
         thread_1_event_b.W = High;
      else
         thread_1_event_b.W = 0;
   }
}
&lt;/pre&gt;


&lt;p&gt;But how can we deduce the &lt;code&gt;blocking_term&lt;/code&gt; from the event states?
First, we calculate the &lt;code&gt;blocking_term&lt;/code&gt; for a single event. We use a
matrix notation that captures all four states from the diagram shown
before.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/~stettberger/blog/assets/posts/Waiting_in_dOSEK/blocking_term.png&quot;/&gt;&lt;/p&gt;

&lt;p&gt;From the event state, we generate a term for each event that is
0 only if the event was used for blocking and is set. In all other
cases, the blocking term is &lt;code&gt;High&lt;/code&gt;. We achieve this using only
bitwise XOR and OR operations. We're still branchless! :-)&lt;/p&gt;
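
&lt;p&gt;One concrete encoding that is consistent with this matrix, using only
XOR and OR (the exact formula in the dOSEK sources may differ; the
&lt;code&gt;HIGH&lt;/code&gt; value is illustrative), is:&lt;/p&gt;

&lt;pre&gt;&lt;code class='c'&gt;#define HIGH 0x7F /* highest priority in the system; value illustrative */

struct Event { int W; int S; }; /* each field is either 0 or HIGH */

/* 0 only in state (W, S) == (HIGH, HIGH), i.e. waiting and set;
   HIGH in the other three states. */
int blocking_term(struct Event e) {
    return (e.W ^ e.S) | (e.W ^ HIGH);
}&lt;/code&gt;&lt;/pre&gt;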

&lt;p&gt;We now combine the blocking terms of all events a specific task can wait
for with AND. The result is zero only if at least one event
on the waiting list is set. Furthermore, we determine whether
we can block in the first place by combining all &lt;code&gt;W&lt;/code&gt; states with OR.
The &lt;code&gt;should_wait&lt;/code&gt; variable is either &lt;code&gt;High&lt;/code&gt; if we're waiting,
or 0 if we're not.&lt;/p&gt;

&lt;pre&gt;
does_block    = blocking_term(thread_1_event_a) &amp; blocking_term(thread_1_event_b);
should_wait   = thread_1_event_a.W | thread_1_event_b.W;
blocking_term = should_wait &amp; does_block;
&lt;/pre&gt;


&lt;p&gt;Combining both variables with AND yields our blocking term.
Branchless. And we can subtract it from the thread's priority before
we call &lt;code&gt;updateMax()&lt;/code&gt;.&lt;/p&gt;
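
&lt;p&gt;Putting the pieces together, a minimal end-to-end sketch for a single
thread with two events (the names, the &lt;code&gt;HIGH&lt;/code&gt; value, and the
blocking-term formula are illustrative assumptions, not the literal dOSEK
implementation):&lt;/p&gt;

&lt;pre&gt;&lt;code class='c'&gt;#define HIGH 0x7FFF /* larger than any thread priority */

struct Event { int W; int S; }; /* each field is either 0 or HIGH */

/* Illustrative blocking term: 0 iff waiting and set, HIGH otherwise. */
int term(struct Event e) { return (e.W ^ e.S) | (e.W ^ HIGH); }

/* Scheduler for one thread (ID 1) with two events; returns the chosen ID. */
int schedule(int thread_1_prio, struct Event a, struct Event b) {
    int does_block  = term(a) &amp; term(b);
    int should_wait = a.W | b.W;
    int blocking    = should_wait &amp; does_block;

    int current_thread = 0; /* idle */
    int current_prio   = 0; /* idle priority */
    /* updateMax() with the blocking term subtracted */
    if (current_prio &lt; thread_1_prio - blocking) {
        current_thread = 1;
        current_prio   = thread_1_prio - blocking;
    }
    return current_thread;
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this sketch, a thread that waits on unset events drops below the idle
priority and is skipped; once any awaited event is set, the blocking term
becomes 0 and the thread competes with its real priority again.&lt;/p&gt;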
</description>
				<published>2015-02-24 00:00:00 +0100</published>
				<link>https://www4.cs.fau.de/~stettberger/blog/2015/Waiting_in_dOSEK/</link>
			</item>
		
	</channel>
</rss>
