uKernel DevLog #2

Avg. 13 minute(s) of reading

I'm currently enrolled in a class about embedded and real-time systems. For this class's final project, I'm developing a real-time kernel for Arduino UNO. I'll try to document the development in a series of posts. This is part #2.

On the last chapter, we've talked about basic concepts of real-time systems. Now, we'll talk about implementing multiple stacks, one for each task.

Why multiple stacks?

There are two reasons for us to want to implement a system where each task has its own stack:

  1. In the future, we'll implement mutexes which can lead to tasks being in a blocked state (priority inversion);
  2. We want our task functions to follow the prototypes of POSIX thread functions: void* task(void* arg).

On this DevLog, we'll focus on the second one.

The new task functions

Originally, a task for toggling an LED could look something like this:

// toggles an LED
void runTask() {
  digitalWrite(LEDPIN, !digitalRead(LEDPIN));
}

This function would be called for each activation of the task. Alongside this function, we would also benefit from having a setup function for each task, that would be called whenever the task is created:

// setup the LED pin as OUTPUT and turn the LED on
void setupTask() {
  pinMode(LEDPIN, OUTPUT);
  digitalWrite(LEDPIN, HIGH);
}

The plan is to join these methods (and the multiple calls to runTask) into a single method, that is called once:

void taskFunc(void *arg) {
  // set pin as output
  pinMode(LEDPIN, OUTPUT);
  digitalWrite(LEDPIN, HIGH); // turn on LED

  while (true) {
    // toggle LED
    digitalWrite(LEDPIN, !digitalRead(LEDPIN));

    Sched_Yield(); // finished execution for this activation
  }
}

In order for the function to only be called once, we wrap the contents that need to run on each activation into a while(true) loop.
Note: contrary to what POSIX threads, we return void from these functions, because they aren't meant to end.

Implementation

Now it's time for us to talk about how to implement this system. It should be noted that much of these ideas were taken from the documentation and code developed for FreeRTOS. The assembly code examples were also taken from their documentation (with some slight changes).

The stacks

As of now, each task has a task control block (TCB): a class containing the execution information of a task. The TCB of each task contains the stack of the given task. A task's stack is just an array of bytes (unsigned char). It is possible to calculate the maximum stack size needed for each task, but for now, we'll just assume it is always big enough for our needs.

typedef unsigned char byte;

class Task {
  void (* taskFunc)(void *);   // task's code
  const unsigned int period;  // task's period
  unsigned int delay;         // time left until next activation
  byte stack[BIG_ENOGH];      // task's stack

  /* ... */
};

Context switching

Whenever a task is blocked, preempted, or finishes the current activation, we need to suspend its execution. When we do this, we need to save its context, so we can restore it later.

Saving a task's context implies saving: its program counter (PC), the state of the CPU's general purpose registers (32 registers, from R0 to R31), the CPU's status register (SREG), the stack pointer (2 registers, SPH and SPL), and the contents of the stack.

Let's focus our discussion on the situation where a task is preempted, because of a timer interrupt. Interrupt handlers (ISR) are functions with some special properties:

  • Registers modified inside an ISR are restored to their original value (the value on ISR entry) on exit;
  • Interrupts are disabled on ISR entry;
  • ISRs end with a RETI (return from interrupt) instruction, instead of the usual RET (return) instruction. This instruction re-enables interrupts on ISR exit.

The fact that the registers modified inside the ISR are saved and restored automatically isn't enough for our use case, because during preemption other registers can be modified by new running task, without the compiler knowing. Still, this is interesting, because we can look at the code disassembly and learn how to save registers.

; ---------------------------------------
; CODE GENERATED BY THE COMPILER TO SAVE
; THE REGISTERS THAT GET ALTERED BY THE
; APPLICATION CODE DURING THE ISR.
PUSH    R1
PUSH    R0
IN      R0,0x3F
PUSH    R0
CLR     R1
PUSH    R18
PUSH    R19
PUSH    R20
PUSH    R21
PUSH    R22
PUSH    R23
PUSH    R24
PUSH    R25
PUSH    R26
PUSH    R27
PUSH    R30
PUSH    R31
; ---------------------------------------

CALL 0x0000029B ; Call a subroutine

; ---------------------------------------
; CODE GENERATED BY THE COMPILER TO
; RESTORE THE REGISTERS PREVIOUSLY
; SAVED.
POP     R31
POP     R30
POP     R27
POP     R26
POP     R25
POP     R24
POP     R23
POP     R22
POP     R21
POP     R20
POP     R19
POP     R18
POP     R0
OUT     0x3F,R0
POP     R0
POP     R1
RETI
; ---------------------------------------

As we can see, the code to save and the code to restore the registers is very similar: we start by pushing the registers, and then pop them in the opposite order.

Note: The behavior described above is caused by GCC's signal attribute. In Arduino, we can declare an ISR for timer1 as ISR(TIMER1_COMPA_vect);. This automatically includes the signal attribute. In some other programs you might need to do something like this: void SIG_TIMER1_COMPA_vect(void) __atribute__ (( signal ));.

GCC's naked attribute

If we add code to save all registers, this would result in some registers being saved twice: once by the compiler generated code, and again by our code. To avoid this, we use GCC's naked attribute. This attribute tells the compiler to not generate any function entry (prologue) or exit (epilogue) code for the given function. When using the naked attribute, the compiler generates the following code:

; ------------------------------------------------
; NO COMPILER GENERATED CODE TO SAVE THE REGISTERS
; ------------------------------------------------

CALL 0x0000029B ; Call a subroutine

; ------------------------------------------------
; NO COMPILER GENERATED CODE TO RESTOR THE
: REGISTERS OR RETURN FROM THE ISR
; ------------------------------------------------

As we can see, now the compiler generates neither the code to save/restore the registers, nor the instruction to return from the ISR. With this, we can safely add our code to save and restore the context (don't forget the return instruction):

ISR(TIMER1_COMPA_vect) __atribute__ (( naked ));

ISR(TIMER1_COMPA_vect) {
  // macro that explicitly saves the execution context
  SAVE_CONTEXT();

  handleTick(); // handle the timer tick

  // macro that explicitly restores the execution context
  RESTORE_CONTEXT();

  // return from interrupt
  asm volatile ( "reti" );
}

Saving the context

The following macro is responsible for saving the execution context. Note that the variable currStack contains the memory address of the current stack's stack pointer. We can think of it as a pointer to the stack pointer, which in turn points to the top of the stack. In the case of Arduino UNO, pointers are 16-bit wide, 1 byte for each Stack Pointer register (SPH, SPL).


#define SAVE_CONTEXT()                \
asm volatile (                        \
  "push  r0                     \n\t" \ (1)
  "in    r0, __SREG__           \n\t" \ (2)
  "cli                          \n\t" \ (3)
  "push  r0                     \n\t" \ (4)
  "push  r1                     \n\t" \ (5)
  "clr   r1                     \n\t" \ (6)
  "push  r2                     \n\t" \ (7)
  "push  r3                     \n\t" \
  "push  r4                     \n\t" \
  "push  r5                     \n\t" \

    :
    :
    :

  "push  r30                    \n\t" \
  "push  r31                    \n\t" \
  "lds   r26, currStack         \n\t" \ (8)
  "lds   r27, currStack + 1     \n\t" \ (9)
  "in    r0, __SP_L__           \n\t" \ (10)
  "st    x+, r0                 \n\t" \ (11)
  "in    r0, __SP_H__           \n\t" \ (12)
  "st    x+, r0                 \n\t" \ (13)
)
  • We save R0 (__tmp_reg__) first (1), because it is used when saving the status register (SREG);
  • We move SREG to R0 (2), so we can push it to the stack later (4);
  • After saving SREG, we disable interrupts. It is important that we save interrupts as soon as possible. We do this after saving SREG, because that's where the information about interrupts being enabled/disabled is stored. Although we're just analyzing the case where a task is preempted, because of a timer interrupt, there are other cases where we need to use this macro (task blocking, sleeping, yielding), so we need to disable interrupts explicitly;
  • The compiler assumes that R1 (__zero_reg__) to be zero. We save (5) its value before clearing (6) it;
  • Between (7) and (8), we save all general purpose registers in numerical order (from R2 to R31);
  • In (8) and (9) we take the address of where we store the stack pointer and store it in registers R26 and R27 (x+ register). This is done, so we can take the low byte (10 and 11) and the high nibble (12 and 13) of the stack pointer (SP) and store it there.

Restoring the context

The process to restore the context is similar to the one to save it: we mainly just pop the registers we pushed while saving the context in the reverse order.

#define RESTORE_CONTEXT()             \
asm volatile (                        \
  "lds  r26, currStack          \n\t" \ (1)
  "lds  r27, currStack + 1      \n\t" \ (2)
  "ld   r28, x+                 \n\t" \
  "out  __SP_L__, r28           \n\t" \ (3)
  "ld   r29, x+                 \n\t" \
  "out  __SP_H__, r29           \n\t" \ (4)
  "pop  r31                     \n\t" \
  "pop  r30                     \n\t" \

    :
    :
    :

  "pop  r1                      \n\t" \
  "pop  r0                      \n\t" \ (5)
  "out  __SREG__, r0            \n\t" \ (6)
  "pop  r0                      \n\t" \ (7)
)
  • As mentioned above, currStack contains the address from where the stack pointer can be retrieved. It is loaded into the x+ register (1 and 2);
  • The stack pointer is loaded into registers R28 and R29, and then loaded to the stack pointer registers (SPH, SPL);
  • All the general purpose registers are popped from the stack in the reverse order that they were push in;
  • While saving the context, the status register (SREG) was pushed between R0 and R1. We restore it before restoring R0.

Program counter

When an interrupt occurs, the program counter (PC) is automatically placed on the stack. This is the same thing that happens when we call a function. This leads to the following stack state upon context saving:

|      SPH       |
|----------------|
|      SPL       |
|----------------|
|    R1 - R31    |
|----------------|
|      SREG      |
|----------------|
|       R0       |
|----------------|
|       PC       | <-- Program Counter is pushed before the SAVE_CONTEXT macro
|----------------|
| Stack contents |
|----------------|

After the RESTORE_CONTEXT macro, the PC is at the top of the stack. When the code reaches the next RET/RETI instruction, it will pop the PC from the stack and continue executing the pointed-to instruction.

Initializing the stack

With the ideas from the previous chapter, we know how to save and restore the context of a task on its own stack: we just need to change the value of currStack to the one corresponding to the task that should execute right now. There are only two questions now:

  • Choosing the size of a task's stack;
  • What should be the initial state of the task;

Choosing the size of a task's stack

To do this, we can analyze the task's code and calculate the stack size for a given task, or determine the size experimentally.

Analyzing the code can be interesting. In fact, we did that for our first task and came close to the real value. The problem is that it is a somewhat time consuming process and small changes in a task's code can become cumbersome, because of compiler optimizations. Still, if we don't care about finding the exact size, and fine with having just an upper-bound, we can naively add the size of all local variables. With this, we will focus on the experimental analysis.

Some people suggest analyzing some of GCC's output, namely the output produced by -fdump-tree-all, and -fstack-usage. We decided to use a more brute-force approach.

Canaries

We add a few bytes of a compile-time known value to the beginning and the end of a stack. These are useful for run-time assertions that the stack hasn't been corrupted: we didn't try to write more than the maximum stack size.


We use the canaries described above to find the minimum size needed for a stack. We use a binary search where we start with upper-bound described previously (sum all needed local variables' sizes) and start lowering the stack size iteratively. When leave the task running for a while, checking if the canaries remain intact.

With this, we eventually reach the minimum stack size (with reasonable confidence). This method probably doesn't very well for tasks that have many branching paths, because it is hard to test all paths.

Note: the canaries and the context saving operation add a little more spacial overhead to the task's stack. The stack size that the user passes to the Task constructor doesn't need to consider this (the constructor adds this size automatically). As of writing, the canaries add 6 bytes, and the context switching adds 34 bytes (32 general purpose registers + the status register + the program counter).

What should be the initial state of the task?

When we create a task, we also need to initialize its stack. The first time the task starts executing, it will be because of a RESTORE_CONTEXT. This means that the initial state of the stack should match the state that would happen if we preempted the task just as it was about to start executing: context saved when the next instruction to execute would be the first instruction of the task.

To do this, we just need to mimic the behavior of SAVE_CONTEXT. Bellow is an adaptation of the code we use to do this:

void inline Task::push2stack(stack_t pushable) {
  *this->stackAddr = pushable;
  // the stack grows backwards in Arduino UNO
  this->stackAddr--;
}

void Task::initializeStack() {
  // this->stackAddr starts at the beginning of the stack

  // save task code (PC) at the bottom of the stack
  auto runAddr = (POINTER_SIZE_TYPE) this->run;
  this->push2stack((stack_t) (runAddr & (POINTER_SIZE_TYPE) 0x00ff));
  runAddr >>= 8;
  this->push2stack((stack_t) (runAddr & (POINTER_SIZE_TYPE) 0x00ff));

  this->push2stack((stack_t) 0x00); /* R0 */
  this->push2stack((stack_t) 0x80); /* SREG */
  this->push2stack((stack_t) 0x00); /* R1 - compiler expects it to be 0 */
  // debug values
  this->push2stack((stack_t) 0x02); /* R2 */
  this->push2stack((stack_t) 0x03); /* R3 */
  this->push2stack((stack_t) 0x04); /* R4 */
  this->push2stack((stack_t) 0x05); /* R5 */
  this->push2stack((stack_t) 0x06); /* R6 */
  this->push2stack((stack_t) 0x07); /* R7 */
  this->push2stack((stack_t) 0x08); /* R8 */
  this->push2stack((stack_t) 0x09); /* R9 */
  this->push2stack((stack_t) 0x10); /* R10 */
  this->push2stack((stack_t) 0x11); /* R11 */
  this->push2stack((stack_t) 0x12); /* R12 */
  this->push2stack((stack_t) 0x13); /* R13 */
  this->push2stack((stack_t) 0x14); /* R14 */
  this->push2stack((stack_t) 0x15); /* R15 */
  this->push2stack((stack_t) 0x16); /* R16 */
  this->push2stack((stack_t) 0x17); /* R17 */
  this->push2stack((stack_t) 0x18); /* R18 */
  this->push2stack((stack_t) 0x19); /* R19 */
  this->push2stack((stack_t) 0x20); /* R20 */
  this->push2stack((stack_t) 0x21); /* R21 */
  this->push2stack((stack_t) 0x22); /* R22 */
  this->push2stack((stack_t) 0x23); /* R23 */

  // place the parameter on the stack in the expected location
  auto paramAddr = (POINTER_SIZE_TYPE) this->params;
  this->push2stack((stack_t) (paramAddr & (POINTER_SIZE_TYPE) 0x00ff));
  paramAddr >>= 8;
  this->push2stack((stack_t) (paramAddr & (POINTER_SIZE_TYPE) 0x00ff));

  // more debug values on R26-R31
  this->push2stack((stack_t) 0x26); /* R26 */
  this->push2stack((stack_t) 0x27); /* R27 */
  this->push2stack((stack_t) 0x28); /* R28 */
  this->push2stack((stack_t) 0x29); /* R29 */
  this->push2stack((stack_t) 0x30); /* R30 */
  this->push2stack((stack_t) 0x31); /* R31 */
}
  • In Arduino UNO, the pointer size is 16-bit, so we define POINTER_SIZE_TYPE as uint16_t;
  • The type stack_t represents a byte using unsigned char;
  • I excluded the code that sets the canaries as it is optional;
  • Registers R24 and R25 contain the argument (void *) passed to the task;
  • Most registers don't have a meaningful value at the start of the task, so we fill them with debug values (values that are easily identifiable on a debugger);
  • In Arduino UNO, the stack grows backwards: from the end of the stack array (high addresses) to the beginning of the array (low addresses);
  • The PC is stored at the bottom of the stack and contains the address of the task's function: address of the first instruction.

Task yielding

When a task finishes an activation (executes all work for the current period), it needs to wait until the next activation to execute again. In the previous chapter, the tasks returned after each activation, thus signaling the kernel about the end of the activation. Since our task functions now contain a loop, we need a new way to signal the kernel to handle this. We saw 2 possible implementations:

  • The task would sleep for the duration until the beginning of the next period:
    • This would be difficult to do and would require more code from the programmer's side, so we decided to avoid it.
  • The task calls a method (yield) to signal the kernel that it is done executing.

We went with the second option, because it is simpler to implement, and also to use by the programmer.

Given the ideas on previous chapters, yielding is fairly straight forward: the task calls a method that:

  • Saves its context;
  • Sets it as not ready for execution;
  • Selects the next task to run: new highest priority task;
  • Restores the context to the new task.

Idle task

There may be CPU idle periods during execution. This is problematic, because a task that wants to yield, would have no task to yield to. In order to deal with this edge-case without implementing more code, we decided to implement a special task (idle task). This task is set up, so it is always ready (never yields), and the lowest priority task of all the ready tasks.

With this, whenever we reach a CPU idle period, this is the task that runs. This model resembles a background execution server, that is something that we might talk about if we ever discuss aperiodic tasks.

Conclusion

In this chapter, we saw how we can save and restore the execution context of our tasks. This allows us to have a single function for each task, that is only called once (similar to POSIX threads). In future chapters, we will use this to implement blocking, and sleeping of tasks.

In the next chapter, we will look into dynamic priorities for tasks, and implement the Earliest Deadline First (EDF) algorithm.

Stay safe :P