
A Dual-Level Thread Library Implementation in Linux

This article discusses the implementation of rThread, a thread library for Linux. rThread adopts an M:N mapping model between user-level threads and kernel-level threads. Since I already have a paper on it, this article will focus mainly on the coding and implementation.


Logic Overview

When it comes to implementing a thread library, it is hard to start from scratch. Fortunately, there are several useful Linux system calls and runtime library facilities we can build on. In previous articles, we learned the difference between user-level threads and kernel-level threads. Kernel-level threads are scheduled entirely by the Linux kernel; user space has no idea how they are scheduled, and Linux provides several system calls for creating them. User-level threads, on the other hand, are completely invisible to kernel space. We need to design the data structures for the user-level thread TCB and queue ourselves, and handle context switches carefully.

As I stated in my paper, rThread offers flexibility in the mapping relationship between kernel-level threads and user-level threads. The 1:1 model runs just as pthread does. In the 1:N model, one kernel-level thread runs a scheduler and the user-level threads form a run queue. In the M:N model, user-level threads are placed into a user-level thread queue, and each kernel-level thread runs in a loop, grabbing any runnable user-level thread from that queue.

[Figure: M:N mapping between kernel-level threads and user-level threads]

The figure above shows how M:N mapping works. We have M kernel-level threads and N user-level threads (typically N > M). The user-level threads form a run queue at the beginning, and each kernel-level thread grabs user-level threads to execute. To ensure kernel-level threads access the run queue correctly, we use a spin lock built on __sync_lock_test_and_set(), because run-queue accesses (dequeue and enqueue) are very short operations. In the figure, 1.1 denotes the first step of kernel-level thread "1". For kernel-level thread "2": (2.1) it first tries to grab user-level thread "1" but fails, because kernel-level thread "1" has already grabbed it, even though they arrived at almost the same time (this is why hardware atomic-operation support is needed here). (2.2) It then grabs user-level thread "2" successfully. (2.3) It bookkeeps the context of user-level thread "2" in its global hash table entry. (2.4) Next, it swaps its current context to that of user-level thread "2". (2.5) When the time quantum of user-level thread "2" is up, the scheduler swaps back to the context of kernel-level thread "2".

Each kernel-level thread runs the same logic, but on different user-level threads. Every time the timer interrupt signal is raised, a context restore must happen; that is, the user-level thread needs to go back to its previous scheduler context. When switching contexts between kernel-level threads and user-level threads (the context switching here is still in user space), we must bookkeep both the user-level thread context and the kernel-level thread context.
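
As a concrete illustration, the following is a minimal sketch of such a spin lock guarding the run queue, assuming a global flag named _spinlock (the same name the scheduler excerpt later in this article uses):

static volatile int _spinlock = 0;

/* Acquire: atomically set the flag to 1; the old value tells us whether
   another kernel-level thread already held the lock. */
static void spin_acquire(void) {
	while (__sync_lock_test_and_set(&_spinlock, 1))
		;	/* busy-wait: queue operations are short, so spinning is cheap */
}

/* Release: write 0 with release semantics so the next contender can enter. */
static void spin_release(void) {
	__sync_lock_release(&_spinlock);
}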

[Figure: global hash table mapping each kernel-level thread's tid to the context of its currently running user-level thread]

As shown in the figure above, we use a global hash table to bookkeep key-value pairs of kernel-level thread contexts and their current user-level thread contexts. The unique identifier of a kernel-level thread is the thread id assigned by the kernel, which can be retrieved with syscall(SYS_gettid). Thus, using the tid of a kernel-level thread as the key, we can build a global hash table that maps each kernel-level thread to the context of the user-level thread it is currently executing.
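
A minimal sketch of this bookkeeping, assuming a direct-mapped table indexed by (tid & K_CONTEXT_MASK) as the scheduler excerpt later in this article does (the mask value here is an assumption):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <ucontext.h>

#define K_CONTEXT_MASK 0xFF	/* assumed table size of 256 entries */

/* Per-kernel-thread scheduler contexts and the user-level thread
   context each kernel-level thread is currently running. */
static ucontext_t  context_main[K_CONTEXT_MASK + 1];
static ucontext_t *current_uthread_context[K_CONTEXT_MASK + 1];

static pid_t k_thread_id(void) {
	/* The kernel-assigned tid uniquely identifies a kernel-level thread. */
	return (pid_t)syscall(SYS_gettid);
}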


Detailed Design and Implementation

User-level threads run entirely in user space, which means that throughout their lifecycle, user code is responsible for their creation, synchronization, scheduling, and destruction. In this section, we discuss how we design user-level threads and several vital data structures that go along with them.


Thread Control Block and Thread Status

As we know, the TCB (Thread Control Block) is an important data structure that keeps a thread's status, identification, and execution control information. The following struct is the TCB for rThread user-level threads.

typedef struct threadControlBlock {
	rthread_t   tid;			/* Thread ID            */
	threadStatus status;			/* Thread Status        */
	ucontext_t context;			/* Thread Context       */
	void *stack;				/* Thread Stack pointer */
	...
} _tcb;

rThread has five states, and a finite state machine describes the transitions between them (the diagram is like the one I showed in my previous posts). The five states are listed below. We need the FINISHED state to indicate that a thread has finished but its resources have not yet been released; this is handled by a wrapper function, which will be discussed later.

typedef enum threadStatus {
	NOT_STARTED = 0,
	RUNNING,
	SUSPENDED,
	TERMINATED,
	FINISHED,
} threadStatus;

Queue ADT

A queue (first in, first out) is a very important data structure in our implementation. Since we will have different types of queues (TCB queue, mutex queue, condition variable queue), it is preferable to make it an abstract data type. The following is the data structure of the Queue ADT; each element is a void pointer that the programmer explicitly casts to the desired type.

typedef struct Queue {
	void **queue;
	int size;
	int head;
	int rear;
} Queue;
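
A minimal circular-buffer sketch of the queue operations, reading size as the current element count and assuming a fixed capacity (the capacity constant is an assumption; the 0-on-success return convention matches what the scheduler excerpt later in this article relies on):

#define QUEUE_CAPACITY 1024	/* an assumed fixed capacity for this sketch */

/* q->queue is assumed to point at QUEUE_CAPACITY slots allocated at init.
   Returns 0 on success, non-zero if the queue is full. */
int enQueue(Queue *q, void *item) {
	if (q->size == QUEUE_CAPACITY)
		return -1;
	q->queue[q->rear] = item;
	q->rear = (q->rear + 1) % QUEUE_CAPACITY;
	q->size++;
	return 0;
}

/* Returns 0 on success, non-zero if the queue is empty. */
int deQueue(Queue *q, void **item) {
	if (q->size == 0)
		return -1;
	*item = q->queue[q->head];
	q->head = (q->head + 1) % QUEUE_CAPACITY;
	q->size--;
	return 0;
}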

Mutex Lock

A mutex is a mechanism that allows only one process or thread to hold the lock; others wait until the current owner releases it. The following figure describes the finite state machine for how the mutex lock works.

[Figure 2: finite state machine of the mutex lock]

As Figure 2 shows, when several threads try to lock the mutex, only one succeeds and the others go to the blocked state. When the current holder releases the mutex, the first thread in the block queue is woken up and put into the run queue. To ensure that only one thread holds the lock, the hardware-supported atomic primitive __sync_lock_test_and_set() is needed. The reason rThread's mutex uses a blocking lock is that frequent test-and-set operations would be a big waste of CPU resources.

typedef struct rthread_mutex_t {
	_tcb *owner;		/* owner of mutex */
	uint lock;		/* lock */
	Queue wait_list;	/* block queue */
} rthread_mutex_t;

The code above is the data structure for the rThread mutex lock. One significant difference between a mutex and a binary semaphore is that a mutex must have an owner, and only the owner may release the lock. We have four basic operations for the mutex lock: rthread_mutex_init(), rthread_mutex_lock(), rthread_mutex_unlock(), and rthread_mutex_destory(). The pseudo-code is given below.

int rthread_mutex_init(rthread_mutex_t *mutex) {
	// Initialize owner to be null
	// Initialize lock to be zero
	// Initialize block list
	return 0;
}
int rthread_mutex_lock(rthread_mutex_t *mutex) {
	// Use the "test-and-set" atomic operation to acquire the mutex lock
	// If we fail to acquire the mutex lock, save the current context into the TCB and swap to the scheduler context
	// Otherwise, set the mutex lock owner to the current thread
	return 0;
}
int rthread_mutex_unlock(rthread_mutex_t *mutex) {
	// Dequeue the thread at the head of the mutex lock block list
	// Enqueue what we dequeued into the thread run queue
	// Set the mutex lock owner to null
	return 0;
}
int rthread_mutex_destory(rthread_mutex_t *mutex) {
	// Release the memory chunk on the heap
	return 0;
}
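
For illustration, here is a hypothetical usage of the API above, protecting a shared counter (the counter and worker names are illustrative, not part of rThread):

rthread_mutex_t m;	/* initialized once with rthread_mutex_init(&m) */
int counter = 0;	/* shared state guarded by m */

void worker(void *arg) {
	rthread_mutex_lock(&m);		/* only one thread enters at a time */
	counter++;			/* critical section */
	rthread_mutex_unlock(&m);	/* wakes the first blocked waiter, if any */
}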

Condition Variable

A condition variable is another synchronization mechanism, using a wait/signal scheme. Threads waiting on a condition variable go to sleep until the condition is signaled. A condition variable may look a little like a mutex lock, but they are mostly different. If several tasks are all waiting for a certain condition, they are put into the waiting queue. When the condition variable is signaled, only one task in the queue resumes. We can also broadcast to let every thread in the waiting queue resume in order.

typedef struct rthread_cond_t {
	Queue wait_list;		/* block list waiting on the condition variable */
	rthread_mutex_t list_mutex;	/* mutex on the condition variable */
} rthread_cond_t;

Above is the data structure of the rThread condition variable. There are currently five operations: rthread_cond_init(), rthread_cond_wait(), rthread_cond_signal(), rthread_cond_broadcast(), and rthread_cond_destory().

int rthread_cond_init(rthread_cond_t *condvar) {
	// Initialize wait list
	return 0;
}
int rthread_cond_wait(rthread_cond_t *condvar, rthread_mutex_t *mutex) {
	// Enqueue the thread at the end of the wait list (saving its state into its TCB)
	// Unlock the mutex
	// Swap to the scheduler context and wait for someone else to wake us up
	// Reacquire the lock when woken up
	return 0;
}
int rthread_cond_signal(rthread_cond_t *condvar) {
	// Dequeue the thread at the head of the condvar wait list
	// Enqueue what we dequeued into the thread run queue
	return 0;
}
int rthread_cond_broadcast(rthread_cond_t *condvar) {
	// Dequeue all of the threads in the condvar wait list
	// Enqueue whatever we dequeued into the thread run queue
	return 0;
}
int rthread_cond_destory(rthread_cond_t *condvar) {
	// Release the memory chunk on the heap
	return 0;
}
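
As a usage illustration, here is a hypothetical producer/consumer sketch built on the API above (the variable and function names are illustrative, not part of rThread):

rthread_mutex_t lock;		/* guards `available` */
rthread_cond_t  not_empty;	/* signaled when an item becomes available */
int available = 0;

void consumer(void *arg) {
	rthread_mutex_lock(&lock);
	while (available == 0)	/* recheck the condition after every wakeup */
		rthread_cond_wait(&not_empty, &lock);
	available--;		/* consume one item */
	rthread_mutex_unlock(&lock);
}

void producer(void *arg) {
	rthread_mutex_lock(&lock);
	available++;				/* produce one item */
	rthread_cond_signal(&not_empty);	/* wake one waiting consumer */
	rthread_mutex_unlock(&lock);
}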

User-level Thread Wrapper Function

As stated above, a user-level thread is a completely new design whose stack is allocated on the heap. Thus, when the function of a user-level thread finishes, its resources cannot be released automatically. To handle this, we use a wrapper function that wraps the real thread function; after the thread function finishes, the wrapper releases the resources.

void u_thread_exec_func(void (*thread_func)(void*), void *arg, _tcb *newThread) {
	_tcb *u_thread = newThread;
	u_thread->status = RUNNING;
	thread_func(arg);
	u_thread->status = FINISHED;

	// Keep track of which kernel-level thread is running this user-level thread
	// When this thread finishes, delete its TCB and yield CPU control
}
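
For context, here is a hedged sketch of how a user-level thread's context might be set up to enter this wrapper; the setup_uthread name, the stack size, and passing pointers through makecontext() are assumptions (POSIX only guarantees int arguments, though 64-bit glibc handles pointer-sized arguments in practice):

#include <ucontext.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)	/* assumed per-thread stack size */

void setup_uthread(_tcb *t, void (*fn)(void*), void *arg) {
	getcontext(&t->context);		/* initialize the ucontext */
	t->stack = malloc(STACK_SIZE);		/* the stack lives on the heap */
	t->context.uc_stack.ss_sp = t->stack;
	t->context.uc_stack.ss_size = STACK_SIZE;
	t->context.uc_link = NULL;		/* the wrapper yields explicitly */
	t->status = NOT_STARTED;
	/* Enter through the wrapper so cleanup runs after fn returns. */
	makecontext(&t->context, (void (*)(void))u_thread_exec_func, 3, fn, arg, t);
}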

GET_TCB() Macro

Using a macro in place of a function is a common and effective technique in systems programming. In rThread, each element of the run queue is a pointer to a user-level thread context; to get the corresponding TCB, we need this macro:

#define GET_TCB(uc_ptr, tcb, ctx) \
((tcb*)((char*)(uc_ptr) - (unsigned long long)(&((tcb*)0)->ctx)))
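
This is the classic container_of pattern: it subtracts the offset of the context member from the member's address to recover the enclosing TCB. An equivalent formulation uses the standard offsetof macro:

#include <stddef.h>

/* Recover the TCB that embeds the ucontext_t pointed to by uc_ptr. */
#define GET_TCB(uc_ptr, tcb, ctx) \
	((tcb*)((char*)(uc_ptr) - offsetof(tcb, ctx)))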

Scheduler Function

Each kernel-level thread in rThread runs a loop, grabbing available user-level threads to run from the run queue. In addition, the scheduler function is open to users: a user can choose their own scheduling mechanism for user-level threads. The experimental results in the paper show that a user-specific algorithm can significantly reduce the average response time of user-level threads.

void k_thread_exec_func(void *arg) {
	// Set a timer and signal handler to call the scheduler periodically

	// Grab and run user-level threads from the run queue until none are available
	while (1) {
		while(__sync_lock_test_and_set(&_spinlock, 1));

		if (0 != deQueue(&_thread_queue, (void**)&next_thread_uc)) {
			__sync_lock_release(&_spinlock);
			return;
		}
		__sync_lock_release(&_spinlock);

		current_tcb = GET_TCB((ucontext_t*)next_thread_uc, _tcb, context);
		
		/* current user thread is already terminated or finished by rthread_exit() */
		if (TERMINATED == current_tcb->status || FINISHED == current_tcb->status) {
			// do V() on the thread's semaphore to signal that this user-level thread is done
			// release the resources of this user-level thread
			continue;
		}

		current_tcb->status = RUNNING;
		current_uthread_context[k_tid & K_CONTEXT_MASK] = (ucontext_t*)next_thread_uc;
		swapcontext(&context_main[k_tid & K_CONTEXT_MASK], (ucontext_t*)next_thread_uc);
	}
}
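
As a companion to the first comment in the loop above, here is a hedged sketch of the periodic timer setup using setitimer(); the signal choice, the quantum, and the handler body are assumptions:

#include <signal.h>
#include <string.h>
#include <sys/time.h>

#define QUANTUM_USEC 10000	/* assumed 10 ms time slice */

static void timer_handler(int sig) {
	/* On each tick, swap from the running user-level thread back to this
	   kernel-level thread's scheduler context (looked up by tid as above). */
}

static void setup_timer(void) {
	struct sigaction sa;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = timer_handler;
	sigaction(SIGPROF, &sa, NULL);	/* install the periodic handler */

	struct itimerval it;
	it.it_interval.tv_sec  = 0;
	it.it_interval.tv_usec = QUANTUM_USEC;
	it.it_value = it.it_interval;	/* first tick after one quantum */
	setitimer(ITIMER_PROF, &it, NULL);
}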

Conclusion

rThread is a lightweight thread library that provides a flexible mapping relation between user-level threads and kernel-level threads. Within its functionality scope, rThread achieves at least equivalent performance to pthread and NPTL. The flexible mapping relation is the key point of rThread's design philosophy, since users can adjust their own scheduling algorithms to different kinds of workloads. The performance study shows that rThread achieves performance equivalent to pthread, and in some scenarios even better, because context switching between user-level threads is faster. Although rThread currently has several limitations, it can still serve many practical scenarios. Dynamic tuning of the number of kernel-level threads will be a good topic for our future research on improving performance and energy efficiency.