uKernel DevLog #2

Avg. 12 minute(s) of reading

I'm currently enrolled in a class about embedded and real-time systems. For this class's final project, I'm developing a real-time kernel for Arduino UNO. I'll try to document the development in a series of posts. This is part #2.

On the last chapter, we've talked about basic concepts of real-time systems. Now, we'll talk about implementing multiple stacks, one for each task.

Why multiple stacks?

There are two reasons for us to want to implement a system where each task has its own stack:

In the future, we'll implement mutexes which can lead to tasks being in a blocked state (priority inversion);
We want our task functions to follow the prototypes of POSIX thread functions: void* task(void* arg).

On this DevLog, we'll focus on the second one.

The new task functions

Originally, a task for toggling an LED could look something like this:

1// toggles an LED
2void runTask() {
3  digitalWrite(LEDPIN, !digitalRead(LEDPIN));
4}

This function would be called for each activation of the task. Alongside this function, we would also benefit from having a setup function for each task, that would be called whenever the task is created:

1// setup the LED pin as OUTPUT and turn the LED on
2void setupTask() {
3  pinMode(LEDPIN, OUTPUT);
4  digitalWrite(LEDPIN, HIGH);
5}

The plan is to join these methods (and the multiple calls to runTask) into a single method, that is called once:

1void taskFunc(void *arg) {
2  // set pin as output
3  pinMode(LEDPIN, OUTPUT);
4  digitalWrite(LEDPIN, HIGH); // turn on LED
5
6  while (true) {
7    // toggle LED
8    digitalWrite(LEDPIN, !digitalRead(LEDPIN));
9
10    Sched_Yield(); // finished execution for this activation
11  }
12}

In order for the function to only be called once, we wrap the contents that need to run on each activation into a while(true) loop.
Note: contrary to what POSIX threads, we return void from these functions, because they aren't meant to end.

Implementation

Now it's time for us to talk about how to implement this system. It should be noted that much of these ideas were taken from the documentation and code developed for FreeRTOS. The assembly code examples were also taken from their documentation (with some slight changes).

The stacks

As of now, each task has a task control block (TCB): a class containing the execution information of a task. The TCB of each task contains the stack of the given task. A task's stack is just an array of bytes (unsigned char). It is possible to calculate the maximum stack size needed for each task, but for now, we'll just assume it is always big enough for our needs.

1typedef unsigned char byte;
2
3class Task {
4  void (* taskFunc)(void *);   // task's code
5  const unsigned int period;  // task's period
6  unsigned int delay;         // time left until next activation
7  byte stack[BIG_ENOGH];      // task's stack
8
9  /* ... */
10};

Context switching

Whenever a task is blocked, preempted, or finishes the current activation, we need to suspend its execution. When we do this, we need to save its context, so we can restore it later.

Saving a task's context implies saving: its program counter (PC), the state of the CPU's general purpose registers (32 registers, from R0 to R31), the CPU's status register (SREG), the stack pointer (2 registers, SPH and SPL), and the contents of the stack.

Let's focus our discussion on the situation where a task is preempted, because of a timer interrupt. Interrupt handlers (ISR) are functions with some special properties:

Registers modified inside an ISR are restored to their original value (the value on ISR entry) on exit;
Interrupts are disabled on ISR entry;
ISRs end with a RETI (return from interrupt) instruction, instead of the usual RET (return) instruction. This instruction re-enables interrupts on ISR exit.

The fact that the registers modified inside the ISR are saved and restored automatically isn't enough for our use case, because during preemption other registers can be modified by new running task, without the compiler knowing. Still, this is interesting, because we can look at the code disassembly and learn how to save registers.

1; ---------------------------------------
2; CODE GENERATED BY THE COMPILER TO SAVE
3; THE REGISTERS THAT GET ALTERED BY THE
4; APPLICATION CODE DURING THE ISR.
5PUSH    R1
6PUSH    R0
7IN      R0,0x3F
8PUSH    R0
9CLR     R1
10PUSH    R18
11PUSH    R19
12PUSH    R20
13PUSH    R21
14PUSH    R22
15PUSH    R23
16PUSH    R24
17PUSH    R25
18PUSH    R26
19PUSH    R27
20PUSH    R30
21PUSH    R31
22; ---------------------------------------
23
24CALL 0x0000029B ; Call a subroutine
25
26; ---------------------------------------
27; CODE GENERATED BY THE COMPILER TO
28; RESTORE THE REGISTERS PREVIOUSLY
29; SAVED.
30POP     R31
31POP     R30
32POP     R27
33POP     R26
34POP     R25
35POP     R24
36POP     R23
37POP     R22
38POP     R21
39POP     R20
40POP     R19
41POP     R18
42POP     R0
43OUT     0x3F,R0
44POP     R0
45POP     R1
46RETI
47; ---------------------------------------

As we can see, the code to save and the code to restore the registers is very similar: we start by pushing the registers, and then pop them in the opposite order.

Note: The behavior described above is caused by GCC's signal attribute. In Arduino, we can declare an ISR for timer1 as ISR(TIMER1_COMPA_vect);. This automatically includes the signal attribute. In some other programs you might need to do something like this: void SIG_TIMER1_COMPA_vect(void) __atribute__ (( signal ));.

GCC's naked attribute

If we add code to save all registers, this would result in some registers being saved twice: once by the compiler generated code, and again by our code. To avoid this, we use GCC's naked attribute. This attribute tells the compiler to not generate any function entry (prologue) or exit (epilogue) code for the given function. When using the naked attribute, the compiler generates the following code:

1; ------------------------------------------------
2; NO COMPILER GENERATED CODE TO SAVE THE REGISTERS
3; ------------------------------------------------
4
5CALL 0x0000029B ; Call a subroutine
6
7; ------------------------------------------------
8; NO COMPILER GENERATED CODE TO RESTOR THE
9: REGISTERS OR RETURN FROM THE ISR
10; ------------------------------------------------

As we can see, now the compiler generates neither the code to save/restore the registers, nor the instruction to return from the ISR. With this, we can safely add our code to save and restore the context (don't forget the return instruction):

1ISR(TIMER1_COMPA_vect) __atribute__ (( naked ));
2
3ISR(TIMER1_COMPA_vect) {
4  // macro that explicitly saves the execution context
5  SAVE_CONTEXT();
6
7  handleTick(); // handle the timer tick
8
9  // macro that explicitly restores the execution context
10  RESTORE_CONTEXT();
11
12  // return from interrupt
13  asm volatile ( "reti" );
14}

Saving the context

The following macro is responsible for saving the execution context. Note that the variable currStack contains the memory address of the current stack's stack pointer. We can think of it as a pointer to the stack pointer, which in turn points to the top of the stack. In the case of Arduino UNO, pointers are 16-bit wide, 1 byte for each Stack Pointer register (SPH, SPL).

1
2#define SAVE_CONTEXT()                \
3asm volatile (                        \
4  "push  r0                     \n\t" \ (1)
5  "in    r0, __SREG__           \n\t" \ (2)
6  "cli                          \n\t" \ (3)
7  "push  r0                     \n\t" \ (4)
8  "push  r1                     \n\t" \ (5)
9  "clr   r1                     \n\t" \ (6)
10  "push  r2                     \n\t" \ (7)
11  "push  r3                     \n\t" \
12  "push  r4                     \n\t" \
13  "push  r5                     \n\t" \
14
15    :
16    :
17    :
18
19  "push  r30                    \n\t" \
20  "push  r31                    \n\t" \
21  "lds   r26, currStack         \n\t" \ (8)
22  "lds   r27, currStack + 1     \n\t" \ (9)
23  "in    r0, __SP_L__           \n\t" \ (10)
24  "st    x+, r0                 \n\t" \ (11)
25  "in    r0, __SP_H__           \n\t" \ (12)
26  "st    x+, r0                 \n\t" \ (13)
27)

We save R0 (__tmp_reg__) first (1), because it is used when saving the status register (SREG);
We move SREG to R0 (2), so we can push it to the stack later (4);
After saving SREG, we disable interrupts. It is important that we save interrupts as soon as possible. We do this after saving SREG, because that's where the information about interrupts being enabled/disabled is stored. Although we're just analyzing the case where a task is preempted, because of a timer interrupt, there are other cases where we need to use this macro (task blocking, sleeping, yielding), so we need to disable interrupts explicitly;
The compiler assumes that R1 (__zero_reg__) to be zero. We save (5) its value before clearing (6) it;
Between (7) and (8), we save all general purpose registers in numerical order (from R2 to R31);
In (8) and (9) we take the address of where we store the stack pointer and store it in registers R26 and R27 (x+ register). This is done, so we can take the low byte (10 and 11) and the high nibble (12 and 13) of the stack pointer (SP) and store it there.

Restoring the context

The process to restore the context is similar to the one to save it: we mainly just pop the registers we pushed while saving the context in the reverse order.

1#define RESTORE_CONTEXT()             \
2asm volatile (                        \
3  "lds  r26, currStack          \n\t" \ (1)
4  "lds  r27, currStack + 1      \n\t" \ (2)
5  "ld   r28, x+                 \n\t" \
6  "out  __SP_L__, r28           \n\t" \ (3)
7  "ld   r29, x+                 \n\t" \
8  "out  __SP_H__, r29           \n\t" \ (4)
9  "pop  r31                     \n\t" \
10  "pop  r30                     \n\t" \
11
12    :
13    :
14    :
15
16  "pop  r1                      \n\t" \
17  "pop  r0                      \n\t" \ (5)
18  "out  __SREG__, r0            \n\t" \ (6)
19  "pop  r0                      \n\t" \ (7)
20)

As mentioned above, currStack contains the address from where the stack pointer can be retrieved. It is loaded into the x+ register (1 and 2);
The stack pointer is loaded into registers R28 and R29, and then loaded to the stack pointer registers (SPH, SPL);
All the general purpose registers are popped from the stack in the reverse order that they were push in;
While saving the context, the status register (SREG) was pushed between R0 and R1. We restore it before restoring R0.

Program counter

When an interrupt occurs, the program counter (PC) is automatically placed on the stack. This is the same thing that happens when we call a function. This leads to the following stack state upon context saving:

1|      SPH       |
2|----------------|
3|      SPL       |
4|----------------|
5|    R1 - R31    |
6|----------------|
7|      SREG      |
8|----------------|
9|       R0       |
10|----------------|
11|       PC       | <-- Program Counter is pushed before the SAVE_CONTEXT macro
12|----------------|
13| Stack contents |
14|----------------|

After the RESTORE_CONTEXT macro, the PC is at the top of the stack. When the code reaches the next RET/RETI instruction, it will pop the PC from the stack and continue executing the pointed-to instruction.

Initializing the stack

With the ideas from the previous chapter, we know how to save and restore the context of a task on its own stack: we just need to change the value of currStack to the one corresponding to the task that should execute right now. There are only two questions now:

Choosing the size of a task's stack;
What should be the initial state of the task;

Choosing the size of a task's stack

To do this, we can analyze the task's code and calculate the stack size for a given task, or determine the size experimentally.

Analyzing the code can be interesting. In fact, we did that for our first task and came close to the real value. The problem is that it is a somewhat time consuming process and small changes in a task's code can become cumbersome, because of compiler optimizations. Still, if we don't care about finding the exact size, and fine with having just an upper-bound, we can naively add the size of all local variables. With this, we will focus on the experimental analysis.

Some people suggest analyzing some of GCC's output, namely the output produced by -fdump-tree-all, and -fstack-usage. We decided to use a more brute-force approach.

Canaries

We add a few bytes of a compile-time known value to the beginning and the end of a stack. These are useful for run-time assertions that the stack hasn't been corrupted: we didn't try to write more than the maximum stack size.

We use the canaries described above to find the minimum size needed for a stack. We use a binary search where we start with upper-bound described previously (sum all needed local variables' sizes) and start lowering the stack size iteratively. When leave the task running for a while, checking if the canaries remain intact.

With this, we eventually reach the minimum stack size (with reasonable confidence). This method probably doesn't very well for tasks that have many branching paths, because it is hard to test all paths.

Note: the canaries and the context saving operation add a little more spacial overhead to the task's stack. The stack size that the user passes to the Task constructor doesn't need to consider this (the constructor adds this size automatically). As of writing, the canaries add 6 bytes, and the context switching adds 34 bytes (32 general purpose registers + the status register + the program counter).

What should be the initial state of the task?

When we create a task, we also need to initialize its stack. The first time the task starts executing, it will be because of a RESTORE_CONTEXT. This means that the initial state of the stack should match the state that would happen if we preempted the task just as it was about to start executing: context saved when the next instruction to execute would be the first instruction of the task.

To do this, we just need to mimic the behavior of SAVE_CONTEXT. Bellow is an adaptation of the code we use to do this:

1void inline Task::push2stack(stack_t pushable) {
2  *this->stackAddr = pushable;
3  // the stack grows backwards in Arduino UNO
4  this->stackAddr--;
5}
6
7void Task::initializeStack() {
8  // this->stackAddr starts at the beginning of the stack
9
10  // save task code (PC) at the bottom of the stack
11  auto runAddr = (POINTER_SIZE_TYPE) this->run;
12  this->push2stack((stack_t) (runAddr & (POINTER_SIZE_TYPE) 0x00ff));
13  runAddr >>= 8;
14  this->push2stack((stack_t) (runAddr & (POINTER_SIZE_TYPE) 0x00ff));
15
16  this->push2stack((stack_t) 0x00); /* R0 */
17  this->push2stack((stack_t) 0x80); /* SREG */
18  this->push2stack((stack_t) 0x00); /* R1 - compiler expects it to be 0 */
19  // debug values
20  this->push2stack((stack_t) 0x02); /* R2 */
21  this->push2stack((stack_t) 0x03); /* R3 */
22  this->push2stack((stack_t) 0x04); /* R4 */
23  this->push2stack((stack_t) 0x05); /* R5 */
24  this->push2stack((stack_t) 0x06); /* R6 */
25  this->push2stack((stack_t) 0x07); /* R7 */
26  this->push2stack((stack_t) 0x08); /* R8 */
27  this->push2stack((stack_t) 0x09); /* R9 */
28  this->push2stack((stack_t) 0x10); /* R10 */
29  this->push2stack((stack_t) 0x11); /* R11 */
30  this->push2stack((stack_t) 0x12); /* R12 */
31  this->push2stack((stack_t) 0x13); /* R13 */
32  this->push2stack((stack_t) 0x14); /* R14 */
33  this->push2stack((stack_t) 0x15); /* R15 */
34  this->push2stack((stack_t) 0x16); /* R16 */
35  this->push2stack((stack_t) 0x17); /* R17 */
36  this->push2stack((stack_t) 0x18); /* R18 */
37  this->push2stack((stack_t) 0x19); /* R19 */
38  this->push2stack((stack_t) 0x20); /* R20 */
39  this->push2stack((stack_t) 0x21); /* R21 */
40  this->push2stack((stack_t) 0x22); /* R22 */
41  this->push2stack((stack_t) 0x23); /* R23 */
42
43  // place the parameter on the stack in the expected location
44  auto paramAddr = (POINTER_SIZE_TYPE) this->params;
45  this->push2stack((stack_t) (paramAddr & (POINTER_SIZE_TYPE) 0x00ff));
46  paramAddr >>= 8;
47  this->push2stack((stack_t) (paramAddr & (POINTER_SIZE_TYPE) 0x00ff));
48
49  // more debug values on R26-R31
50  this->push2stack((stack_t) 0x26); /* R26 */
51  this->push2stack((stack_t) 0x27); /* R27 */
52  this->push2stack((stack_t) 0x28); /* R28 */
53  this->push2stack((stack_t) 0x29); /* R29 */
54  this->push2stack((stack_t) 0x30); /* R30 */
55  this->push2stack((stack_t) 0x31); /* R31 */
56}

In Arduino UNO, the pointer size is 16-bit, so we define POINTER_SIZE_TYPE as uint16_t;
The type stack_t represents a byte using unsigned char;
I excluded the code that sets the canaries as it is optional;
Registers R24 and R25 contain the argument (void *) passed to the task;
Most registers don't have a meaningful value at the start of the task, so we fill them with debug values (values that are easily identifiable on a debugger);
In Arduino UNO, the stack grows backwards: from the end of the stack array (high addresses) to the beginning of the array (low addresses);
The PC is stored at the bottom of the stack and contains the address of the task's function: address of the first instruction.

Task yielding

When a task finishes an activation (executes all work for the current period), it needs to wait until the next activation to execute again. In the previous chapter, the tasks returned after each activation, thus signaling the kernel about the end of the activation. Since our task functions now contain a loop, we need a new way to signal the kernel to handle this. We saw 2 possible implementations:

The task would sleep for the duration until the beginning of the next period:
- This would be difficult to do and would require more code from the programmer's side, so we decided to avoid it.
The task calls a method (yield) to signal the kernel that it is done executing.

We went with the second option, because it is simpler to implement, and also to use by the programmer.

Given the ideas on previous chapters, yielding is fairly straight forward: the task calls a method that:

Saves its context;
Sets it as not ready for execution;
Selects the next task to run: new highest priority task;
Restores the context to the new task.

Idle task

There may be CPU idle periods during execution. This is problematic, because a task that wants to yield, would have no task to yield to. In order to deal with this edge-case without implementing more code, we decided to implement a special task (idle task). This task is set up, so it is always ready (never yields), and the lowest priority task of all the ready tasks.

With this, whenever we reach a CPU idle period, this is the task that runs. This model resembles a background execution server, that is something that we might talk about if we ever discuss aperiodic tasks.

Conclusion

In this chapter, we saw how we can save and restore the execution context of our tasks. This allows us to have a single function for each task, that is only called once (similar to POSIX threads). In future chapters, we will use this to implement blocking, and sleeping of tasks.

In the next chapter, we will look into dynamic priorities for tasks, and implement the Earliest Deadline First (EDF) algorithm.

Stay safe :P