Clicker Approach to Multithreading

Posted by Pype @ 09.12.01
Still cooking ...
Related cakes: hot-swapping scheduler, process manager


Introduction

A modern operating system must be able to run several programs in parallel (unlike the antique MS-DOS :), and provide protection between them so that a malicious (or buggy) program can't do harm to other programs. This is what we call multiprocessing in OS terms.
Each process is the execution environment of a running program (the program itself being just executable code - a static object). It mainly consists of an address space defining the memory areas available to the program, plus some means of communication with other running programs (files, etc.).

Multithreading means that a single process can perform many tasks simultaneously, which becomes very handy when you have a lot of conceptually independent jobs to do (like processing connections from several clients if you're a web server program), each of which might temporarily block (e.g. while waiting for a response from the current client). All those tasks run the same program and share some data (like the resources served by the Web server), but each also has its own separate state - thus a private copy of the CPU registers, a private program counter and a private stack.

In the Clicker32 architecture, each thread is made of 4 components:
  1. the task-state segment, which is a CPU-specific structure that holds the private copy of the registers. When a so-called task switch occurs, the CPU saves the current content of the registers into the outgoing task-state segment (also called TSS) and then loads the registers with the values held by the incoming TSS. This is a fast way to switch between two activities without needing any help from the programming language ...
  2. the thread descriptor (the kThread structure), which is the place to store all the information needed by the scheduler to manage task switches. It provides the thread status (running, killed, sleeping, etc.), allows us to put the thread into some wait queue, etc. Most kernel functions manipulate the thread descriptor rather than the task-state segment, which makes the kernel more portable.
  3. the private stack of the thread, holding the function call trace but also all local variables. There are two types of stacks in Clicker: the user-level stack (used for normal processing) and the system stack (used for system calls and interrupt processing).
  4. private glocal variables (which could be called thread-local variables). Glocal variables are something intermediate between global variables (which are stored in the .data section of the program) and local variables (which are bound to the function call instance, due to their position on the stack). These are variables put on the thread's private stack (so each thread can have its own copy of them ;) but still available with a global scope, because their position on the stack is absolute rather than relative to the top-of-stack pointer.
We made the design choice to give every thread the ability to enter kernel mode in response to events like hardware interrupts, exceptions or system calls. This leads to reduced overhead compared to the approach of having a single system thread, but it also means that we need some access control within the microkernel (protecting shared data structures with a semaphore or a more complex policy like readers/writer).
Note that we can have as many threads in kernel mode as we want. They do not have to be synchronized at the kernel-mode entry, and the microkernel is fully preemptive (except when processing an interrupt or when IRQs are disabled by a software cli).

Multithreading is the first form of parallel processing in Clicker, available since evolution 0.6.*; evolution 0.7.* is currently adding multiprocess support (90% complete) and the further evolution 0.8.* will add inter-process communication facilities like system-wide semaphores or mailboxes.


Structure of a thread

This section tries to clarify what a thread looks like within the system and how threads are built. The kThread structure - and every other piece of thread information - is stored in the microkernel memory area, so that every thread can be accessed from every process context.

What is kept in the kThread structure

Most of the members of the kThread structure are pointers to other parts of the thread/process context. The main pointers are shown in the next figure. The whole content of the kThread structure can be viewed on the Doxygen documentation page. We will focus here only on the "interesting" fields that are not redundant and that do not appear on the figure.

what a thread is made of.

quantum (int)
    Amount of clock ticks still granted to the thread. Once that count reaches 0, the thread loses the CPU.

stat (enum kThreadState)
    Tells what state the thread is in. By reading this field, we can know whether the thread is waiting for a CPU in a dispatcher's queue (the ready state), running on a CPU (the running state) or waiting in a synchronisation structure's queue (the sleeping state, mainly used when waiting for a semaphore). It also has other bits to command thread deletion, etc.

shield (int)
    In some situations, we don't want a thread to be killable - for instance while we're manipulating a system structure. In order to protect the thread from being killed, we can use the "shield" mechanism, which uses this field to know whether the thread accepts to die or not. If we attempt to kill a thread while its shield value is non-zero, the kill does not occur. Instead, we set the KT_KILLING flag in its state, which will force the thread to kill itself when it reaches shield level zero again.

dispatch (struct kDispatcher*)
    The dispatcher the thread is running on. On a multi-processor system, it helps finding out which processor the thread is currently using.

resume (ktResumer*)
    A pointer to a function that can be used to kill the thread while it's in the wait state. It performs the simple actions needed to remove the thread from its current wait queue.

waitItem (void*)
    This pointer is passed as argument to the resume method. It's intended to hold a reference to the object we're waiting on (a semaphore or anything else).
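
To fix ideas, here is a simplified C sketch of the fields discussed above; it is not the real kThread layout (which has more members - see the Doxygen page), it assumes the kernel's own declarations (enum kThreadState, struct kDispatcher, ...) are in scope, and the ktResumer signature is a guess:

    typedef void ktResumer(struct kThread *t, void *waitItem);  /* assumed signature */

    struct kThread_sketch {
        int                 quantum;   /* clock ticks left before preemption        */
        enum kThreadState   stat;      /* ready / running / sleeping + kill bits    */
        int                 shield;    /* non-zero: the thread refuses to be killed */
        struct kDispatcher *dispatch;  /* dispatcher (and thus CPU) we run on       */
        ktResumer          *resume;    /* how to pull the thread out of a waitqueue */
        void               *waitItem;  /* the object the thread is sleeping on      */
    };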

All the threads of a given process are stored in an AVL tree and identified by their task-state segment selector. The selector of the currently running thread can easily be read out of the task register with the kCurrentTask() macro. We expect such a mechanism to be available on every CPU architecture that offers multithreading facilities.
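
On IA32, reading the task register is a one-instruction affair; a sketch of what a kCurrentTask()-like macro can boil down to (GCC inline assembly assumed, the real macro may differ):

    /* 'str' stores the selector currently held in the task register,
       i.e. the TSS selector of the running thread. */
    static inline unsigned short kCurrentTask_sketch(void)
    {
        unsigned short sel;
        __asm__ __volatile__("str %0" : "=r"(sel));
        return sel;
    }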

The kThread structure mainly has two roles:
  • store all the information we need about thread scheduling, like the amount of time it can use the CPU (and which CPU it uses), pointers within the wait queues, etc. 
  • refer to all other thread-related structures (stacks, memory objects, etc.) so that we can easily free all those resources when the thread is killed.
The pointer to the currently running thread can be read with getSysGlocals()->thread . We can also modify that pointer, but it's highly dangerous and not really useful.

What are the possible states of a thread

state diagram for a thread

When a thread is created, it doesn't immediately start to run. Instead, we must call the threadActivate() function to make it enter the scheduler. It will then keep waiting in the KT_READY state until it's selected to run on a CPU and enters the KT_RUNNING state.

There are three ways to make a thread leave the running state. The first one is to wait long enough for a CPU preemption to occur (the quantum counter reaches zero), which returns the thread to the ready state at the end of the scheduler's queues. The second one is to force that switch cooperatively by means of the threadResched() function. Note that when rescheduling, the scheduler may use the remaining clock ticks to decide where the thread must wait in the queues (for instance, we might give it a larger quantum, or have it selected earlier with only its remaining time, etc ...)
The third way is to put the thread in an external wait queue (one not belonging to the scheduler) and then call threadSuspend(). This is used, for instance, when you want to wait on some semaphore: you put the thread in the semaphore's waiting queue and then call threadSuspend() ... Of course, no preemption may occur between those two operations or the semaphore would be left in an inconsistent state.
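
A hedged sketch of that "enqueue then suspend" pattern; waitqueue_push() and the interrupt on/off helpers are hypothetical names, and how Clicker really guards the window between the two steps is an assumption here:

    void sleep_on_sketch(struct waitqueue *wq)
    {
        disable_interrupts();                         /* keep preemption out of the window */
        waitqueue_push(wq, getSysGlocals()->thread);  /* 1. register as a waiter           */
        threadSuspend();                              /* 2. leave the CPU; we run again
                                                         only once someone calls
                                                         threadActivate() on us            */
        restore_interrupts();
    }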

Building new threads

This is one of the most complicated pieces of code in the microkernel, because there are lots of things to be done: we must build the stacks and the thread descriptor, and set up the glocal variables mechanism (that's the trickiest part). New threads are always built in an existing process context. If you want to start a new program, you have to make a new process first; then you'll be able to create threads in it.

  • Before starting to build the thread, the threadCreate() function tries to find out the privilege level needed for the thread (and thus for its stacks) by inspecting the privilege level of the given code segment.
  • Then, we allocate kernel memory for the kThread and TSS structures (with help from the _getTss() local function),
  • initialize the TSS with values coming from the process information (like the current local descriptor table, the page directory address, etc.) and the default code and data segments. All generic registers are filled with zeroes.
  • build the thread's stack and system stack. Note that some so-called system threads always run in kernel mode and therefore don't need a separate system stack (their default stack is already a system one). These operations are performed by _makeStack() and related functions. Most of the job is done by the memory class constructor. What we do here is mainly prepare the constructor's initialisation structure from the thread initialisation values or the process information, and retrieve the proper memory class. By default, we use the STAK and SSTK memory classes to create stacks, but those choices can be overridden by passing the stacks manually to threadCreate().
  • prepare the stacks for holding glocal variables and set up the glocal pointers in the code. This job is done by _stackPrepare().
The creation of a new thread always requires an entry point telling what code is to be executed by the thread, and a process context (the current process, getSysGlocals()->thread->pnfo, will be used if you don't provide one manually). All other parameters are optional and passed in the ktInit structure; we can select which ones are used and which are not by means of the present flags.
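
As an illustration, creating and starting a thread could look like the hedged sketch below; worker_entry, the zeroed ktFlags usage and the ktInit contents are assumptions - only the threadCreate()/threadActivate() names and their argument lists come from the SPI tables below.

    extern void worker_entry(void);            /* hypothetical entry point                */

    void spawn_worker(kProcess *proc)          /* proc may also come from getSysGlocals() */
    {
        ktInit   params;                       /* optional parameters, none filled here   */
        ktFlags  present = 0;                  /* assumed to be a plain flags word:
                                                  "no optional field is present"          */

        kThread *t = threadCreate(proc, (void *)worker_entry, &present, &params);
        if (t != NULL)
            threadActivate(t);                 /* a freshly built thread does not run
                                                  until it is handed to the scheduler     */
    }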

Thread System Programming Interface

Lifetime control SPI

This group of functions allows you to create/kill threads. They're direct calls to microkernel functions and can be found in the "ksyms.thread.*" naming set. You can apply the usual mapping to translate symbolic KDS names "ksyms.thread.xyz" into microkernel functions threadXyz.
kThread* create(kProcess *pnfo, void* entry, ktFlags *present, ktInit *params)
    Creates a new thread running in the same address space as the caller, executing the code starting at entry. If several processes exist in the same address space, you can select the one you want with the pnfo argument.

kStatus kill(word tss)
    Stops the execution of the thread identified by tss. Both the killer and the killed thread must belong to the same address space or the kill will fail. We can also use this call to kill the caller thread, but this requires a tss value of 0 (or you'll fool the kernel !)

void shield()
void unshield()
    Protects or unprotects the thread against remote-killing and self-killing by changing the shield level (incremented by shield(), decremented by unshield()). The thread can only be killed when its shield level is 0.

void terminate(kThread *me)
    Completely destroys the resources used by a zombie thread. It's the part of the kill procedure that must be handled by another thread. A killed thread can remain in the "zombie" state as long as needed by debug features. Once we're completely done - or if the process has specified that it does not want zombie threads - the thread is terminated.
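
For instance, a hedged sketch of shielding a critical manipulation, using the microkernel names obtained through the usual ksyms.thread.* mapping:

    void update_system_structure(void)
    {
        threadShield();     /* shield level 0 -> 1: kill requests are only recorded  */

        /* ... manipulate a structure that must never be left half-modified ...      */

        threadUnshield();   /* back to 0: a pending KT_KILLING now takes effect      */
    }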

Activity control SPI

Changing the activity of a thread is achieved by calling methods of a KDS server that implements "services.sys.scheduler:default". It might seem weird to use something as heavy as KDS services to suspend/resume threads, but this allows us to change the scheduler dynamically at run-time without losing any thread (the unique hot-swap scheduler feature of the Clicker microkernel, which opens the path to kernel customisability ;-)

For convenience, wrapping macros map the kdsInvoque calls.

void threadActivate(kThread *)
    Turns a thread (back) into the "ready" state and puts (returns) it in the wait queue of the scheduler. This should only be called when the thread is in the building or the sleeping state. Any thread can be activated by any other thread, regardless of their respective address spaces.

void threadSuspend()
    Removes the current thread from the CPU without letting it go back into the scheduler queue. The current thread should be placed in some external wait queue before calling this function or it will simply become a "lost thread". I strongly recommend not using this function directly, but rather using the existing synchronization classes of the koLib, like the semaphores and read/write sharers.

void threadSuspendAny(kThread *)
    Removes another thread from the scheduler. Only "ready" threads can be suspended this way, not threads running on other CPUs.

void threadResched()
    Makes the current thread nicely give the CPU back to the next thread in the queue. This is a voluntary task switch (but we can't decide which thread comes next).

If you plan to use this low-level programming interface directly rather than the higher-level koLib objects, I strongly recommend that you first inspect the koLib.share.semaphore implementation - or expect surprising effects...

koLib.share.kSemaphore

This is an implementation of Dijkstra's semaphore using a FIFO wait queue and the thread SPI. It allows clean recovery of threads killed while sleeping, by means of a custom "resume" function. Note that when the semaphore is used inside a more complex synchronisation object, we can override the semaphore's "resume" mechanism with a more specific one simply by assigning our own resume function before calling the wait method: the semaphore installs its own method only if it finds the waitItem field of the current thread NULL.
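
A hedged sketch of that override; my_resume(), the ktResumer argument list and the ksemWait() name (inferred by analogy with the ksemTry mapping mentioned below) are assumptions:

    static void my_resume(kThread *t, void *item)
    {
        /* pull t out of our own waitqueue and clean up whatever 'item' refers to */
    }

    void wait_with_custom_cleanup(kSemaphore *sem, void *my_object)
    {
        kThread *self = getSysGlocals()->thread;

        self->resume   = my_resume;    /* must be set before calling wait:            */
        self->waitItem = my_object;    /* a non-NULL waitItem tells the semaphore not
                                          to install its own resume handler           */
        ksemWait(sem);
    }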

I don't feel like explaining the semantics of semaphores again: just check your favourite course/book about parallel programming.

All the methods are available through koLib.share.kSemaphore.* and are documented in the Doxygen documentation. The usual name mapping applies between "koLib.share.kSemaphore.try" and the corresponding microkernel function name, ksemTry.

Readers and Writer

This is another high-level abstraction for thread synchronisation. It allows several threads to access a resource as long as they just need to "read" it, not change its values. Another kind of access (write) can be requested to modify the shared object; access is then granted exclusively to one thread (no other writer or reader at the same time).
The chosen implementation is a bit complicated, but it prevents starvation (you're sure you will eventually get access regardless of what other processes do - provided that no one keeps its access forever :)
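
A hedged usage sketch; the acquire/release function names below are hypothetical, check rwshare.h for the real interface:

    struct rwshare shared_cfg_lock;            /* assumes rwshare.h is included */

    void read_shared_config(void)
    {
        rwshareReadAcquire(&shared_cfg_lock);  /* many readers may hold this at once    */
        /* ... read the shared object, never modify it ...                              */
        rwshareReadRelease(&shared_cfg_lock);
    }

    void write_shared_config(void)
    {
        rwshareWriteAcquire(&shared_cfg_lock); /* exclusive: no reader, no other writer */
        /* ... modify the shared object ...                                             */
        rwshareWriteRelease(&shared_cfg_lock);
    }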

This kind of access sharing is the basis of KDS interfaces, where you can have as many interface clients as you want at any given moment, but where all client access is disabled while you're modifying the interface or adding/removing implementations.

See the rwshare structure, the rwshare.h header and src/kolib/rwshare.c for more documentation.

Message queues

These are more than simple synchronization items: they allow you to queue data that will be read by another thread. Both the emitters and the readers of messages (kMsg structures) should agree on the actual content of the messages. There's no convention about this, so if you want to send data through a message queue to a message reader, check the reader's code/documentation first.

The queue is implemented by a (usually small) array of message slots - stored in the microkernel area so that inter-process communication is possible - and two semaphores (so that you can wait either for data or for free slots).
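
That is the classic bounded-buffer construction; here is a hedged sketch of it (slot count, field names and the ksemWait()/ksemSignal() names are illustrative, and a real implementation also needs mutual exclusion on head/tail when several senders or readers run concurrently):

    #define QSLOTS 8                           /* illustrative size */

    struct kQueue_sketch {
        kMsg        slot[QSLOTS];
        int         head, tail;
        kSemaphore  freeSlots;                 /* initialised to QSLOTS */
        kSemaphore  pending;                   /* initialised to 0      */
    };

    void queue_send(struct kQueue_sketch *q, const kMsg *m)
    {
        ksemWait(&q->freeSlots);               /* block while the queue is full  */
        q->slot[q->tail] = *m;
        q->tail = (q->tail + 1) % QSLOTS;
        ksemSignal(&q->pending);               /* wake up a waiting reader       */
    }

    void queue_receive(struct kQueue_sketch *q, kMsg *m)
    {
        ksemWait(&q->pending);                 /* block while the queue is empty */
        *m = q->slot[q->head];
        q->head = (q->head + 1) % QSLOTS;
        ksemSignal(&q->freeSlots);             /* one more free slot             */
    }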

See the kQueue structure and the head/kolib/share.h documentation for more information.



What the hell are glocal variables

You can think of glocal variables as you would think of Java's thread-local variables, except they aren't in Java :). The concept is to give each thread a space where you can put globally-accessible variables (I mean usable by all functions, just as normal static or heap global variables would be) whose value is specific to the thread.
Obvious examples of glocal variables are the thread pointer or the errno variable.

In Clicker, you have two kinds of glocal variables: kernel glocal variables and user glocal variables. Kernel glocal variables are allocated on the kernel stack and are - therefore - only available in kernel mode. The following table gives a description of the kernel glocal variables in Clicker version 0.7.x (they're defined in head/glocals.h and form the sysGlocals structure).

thread (kThread*)
    Holds a pointer to the currently running thread. It's useful to avoid the need for a taskid-to-thread-pointer lookup table, but also to quickly access all the components of that thread's structure, mainly for scheduling purposes or to get the current process pointer.

space (pgSpace*)
    The pointer to the current address space structure - which you can get with pgCurrSpace() - used by virtually all the paged memory functions. It could be found through the address space directory (using the page base register value as a key), but it's really faster to get it through glocal variables.

dsp_buffer (char*)
    The buffer the thread uses to prepare its kprint messages. It would be possible (but really more complicated and slower) to use one single buffer shared by all threads, but we decided to "give a cat a home" (a thread a buffer :)... This makes it possible to keep the same design whether multithreading is enabled or not.

dsp_ptr (int)
    The offset of the currently reached position in the buffer. We need to make it glocal (rather than simply local) to avoid interference between normal code and interrupt handlers within a single thread's buffer.

_DbTop (int)
    Used to keep track of the depth in the DebugSystem's function-name stack. There is only one such stack for all threads (and only the thread that has been "elected" for debug display uses it), but every thread keeps track of where it would be if it were elected. This avoids finding the stack in total garbage after switching the debug display to another thread.

Well, now you clearly see that all those kernel glocal variables are things that make our coders' lives much easier. That was the main purpose of introducing them. They can all be accessed through the getSysGlocals() pointer-maker, which is in fact a macro that performs an efficient assembly operation to retrieve the position of the glocal pointers.
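
In practice, kernel code simply dereferences that pointer; a minimal sketch, assuming the sysGlocals type and the fields listed above:

    void where_am_i(void)
    {
        sysGlocals *g = getSysGlocals();   /* one cheap macro - see the IA32 notes below */

        kThread *self  = g->thread;        /* no taskid-to-pointer lookup needed         */
        pgSpace *space = g->space;         /* current address space, same price          */

        /* each thread also owns its private kprint buffer: */
        g->dsp_buffer[g->dsp_ptr++] = '!';

        (void)self; (void)space;           /* silence "unused" warnings in this sketch   */
    }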

The goal of user-level glocal variables is to solve some programming problems (because user programs do not have direct access to the thread ID or any other identification means), so it's a convenience for building more evolved things like POSIX keys.
For now, the only user-level glocal variable that has been defined is __errno, which is used by the kernel to store its failure reason. More space can be made available for user-level glocal variables by tuning the thread creation, but that's a language library issue, not a kernel issue!

About the IA32 implementation of glocal variables

Glocal variables must be stored in a thread-specific structure, but at a "standard" position. The way we solved that problem on Intel CPUs was to allocate a few bytes at the bottom of each stack for some so-called glocal pointers, which point to the sysGlocals and kGlocals structures of the thread that owns that stack.

The trick is to use the lsl instruction to find that place. Once that's done, you just have to move to the proper offset. For people who are not familiar with the Intel architecture, lsl stands for Load Segment Limit. It lets you find the highest address of a segment (here the stack segment).
The great part is that this instruction - when performed on an already loaded segment - requires just 1 CPU cycle. You can hardly beat that performance, can you??
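
A heavily hedged sketch of the idea in GCC inline assembly; it assumes a stack segment whose base matches the data segment's (so the limit can be used directly as a near address) and an illustrative layout where the sysGlocals pointer occupies the last dword below that limit - the real getSysGlocals() macro may differ:

    static inline sysGlocals *getSysGlocals_sketch(void)
    {
        unsigned int limit;

        __asm__ __volatile__(
            "xorl %%eax, %%eax \n\t"
            "movw %%ss,  %%ax  \n\t"   /* selector of this thread's stack segment */
            "lsl  %%eax, %0    \n\t"   /* load its limit: the top of the stack    */
            : "=r"(limit)
            :
            : "eax", "cc");

        /* the glocal pointer is assumed to sit in the last 4 bytes of the stack */
        return *(sysGlocals **)(limit - 3);
    }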
   
structure of a system stack

structure of a system-thread stack and glocal pointers
structure of a user-level thread stack and glocal pointers, plus references to the user glocal variables from the system-level stack