Spinlock for C++ kernel (with x86 ASM)

Question 1

This is some prototype code for a spin-lock for my toy operating system. Bear in mind this means there are no standard library routines to fall back on. It's also for C++11 compilation and will only ever be compiled by the GNU G++ compiler.

The primary use for this is in the scheduler and other code that accesses the process table.

I think there are lots of ways of getting this wrong, and some of them aren't obvious. So it'd be great to have a second pair of eyes on it.

Firstly the locking code itself in NASM x86 asm:

; *******************************
; void spin_lock(uint32 * p);
;
global __kspin_lock
__kspin_lock:
 push ebp
 mov ebp, esp ; we might as well the the stack frame right?
 mov eax, [ebp + 8] ; eax contains address of uint32 used for the lock
 mov ecx, 1
.tryacquire ; try and get the lock.
 xchg ecx, [eax]
 test ecx, ecx
 jz .acquired ; if the lock wasn't 0 (unlocked) repeat.
.pauseloop
 pause ; see 'About PAUSE' below
 test dword [eax], 1
 jne .pauseloop
 jmp .tryacquire
.acquired
 pop ebp
 ret

And secondly a class that wraps it a little. (I will eventually then wrap this with an RAII class to release the lock automatically when it goes out of scope).

#pragma once
#include <std_types.h>
extern "C" {
/** These routines probably shouldn't be used directly. */
void __kspin_lock(uint32 *p);
};
/** 
 * SpinLock provides a crude locking mechanism based around an atomic
 * exchange of a variable in memory. 
 *
 * The SpinLock loops, swapping the value 1 for the value in lock. 
 * Each time it checks to see if the value retrieved from lock is now 0, 
 * if it is it knows that lock now contains 1, and it holds the lock.
 *
 * To ensure all spinlocks reside in a unique cache line the spinlock
 * is aligned to 64 bytes and is 64 bytes in size. This avoids two 
 * processes competing for ownership of the same 64 byte cache line,
 * which could be bad for performance.
 */
class __attribute__((aligned(64))) SpinLock 
{
public:
 SpinLock()
 : lock_value(0)
 {}
 void lock()
 {
 __kspin_lock(&lock_value); 
 }
 void release()
 {
 __kspin_lock_release(&lock_value); 
 }
private:
 static void __kspin_lock_release(uint32 *p) 
 { 
 *p = 0; 
 }
 /* Difficult to know whether it makes sense to copy a spinlock. 
 * It might be practical to copy it but always 0 the lock for
 * the copy. */
 SpinLock(const SpinLock &) = delete; 
 SpinLock & operator=(const SpinLock &) = delete; 
 /* padded to ensure it's on a unique cache line. */
 uint32 lock_value;
 char reserved[64 - sizeof(uint32)];
};

I also think this code could use some through tests, especially on performance, but that might be rather a case of profiling in place in the kernel. Any suggestions welcome.

Incidentally - I'm also thinking that it's going to be useful to add some additional debugging and profiling features to the code. For example, and probably only while running a debug build:

How frequently was the lock contended.
Which thread owns the lock.
Is the lock held by this thread (this would allow functions to check that the lock is held by a calling function properly).

Question 2

This is largely a duplicate of https://stackoverflow.com/questions/6935442/x86-spinlock-using-cmpxchg, https://stackoverflow.com/questions/11959374/fastest-inline-assembly-spinlock, etc.

(1) Your code works only on 32-bit x86. If you're using 64-bit (x86-64), the calling convention is totally wrong (you need to save and restore %rbp rather than %ebp, and you're looking for the argument in the wrong place). If you're implementing your own OS, I suppose you know what you're doing... but in general I think it's weird to see 32-bit x86 code these days.

(2) Useless trivia of the day: The two instructions test ecx, ecx; jz .acquired could be replaced with the single instruction jecxz .acquired. I don't know whether this is an optimization or a pessimization on your particular hardware. It would be a size optimization, at least.

(2.5) Really, consider using cmpxchg instead of xchg; test. The cmpxchg mnemonic maps more directly onto the "Compare And Swap" idiom that you're using. It will also help you when it comes time to implement "is the lock held by this thread"; you can set the word to current-thread-id instead of just 1.

(3) See 'About PAUSE' below: Unfortunately there is no "About PAUSE" below. However, it seems like you're doing the right thing there. You could definitely tighten up the code; for example I don't know why you're wasting time with that test dword [eax], 1. You should just retry the xchg as soon as the pause is over.

(4) Consider removing the stack frame; it's not doing anything useful (unless it's necessary for your debugger of choice), just slowing things down.

Question 3

Thanks! That's great feedback. (1) By the time I'm finished I'm sure I'll know what I'm doing. Writing an OS is the most interesting education I've done for many years! (3) The 'About PAUSE' comment that's missing was a quote from here: stackoverflow.com/questions/12894078/pause-instruction-in-x86/… ... and yes, (4) the stack frame was there simply to help with debugging. I'll (think of the rainforests...) and remove it. It is a waste of good processor cycles. (2, 2.5) I'm about a tenth of the way through the intel manual! I'll check those bits out! Thanks again!

Question 4

I'm curious what is the point of using assembly if you are going to call it from C++11 anyway. In C++11 you could use atomics to implement just that. If the C++ compiler generates slightly worse code for the spin lock I think it would be negligible.

There are basically two cases:

The lock is contended -- in this case an extra instruction or two inside the spin loop costs nothing compared to cpu cache traffic.
The lock is not contended -- but a call to assembly function might force the C++ compiler to spill caller-saved registers on the stack thus the effect of slightly more optimal assembly would be cancelled.

It is possible to fix #2 by using gcc inline assembly where you could specify what registers are used so let gcc figure out if it required to spill them. Anyway, I doubt this will be much better than pure C++ code as shown below:

class SpinLock {
public:
 SpinLock() noexcept : lock_(ATOMIC_FLAG_INIT) {}
 SpinLock(const SpinLock &) = delete;
 SpinLock &operator=(const SpinLock &) = delete;
 void Lock() noexcept { Lock(NoBackoff{}); }
 template <typename Backoff> void Lock(Backoff backoff) noexcept {
 while (lock_.test_and_set(std::memory_order_acquire))
 backoff();
 }
 void Unlock() noexcept { lock_.clear(std::memory_order_release); }
private:
 std::atomic_flag lock_;
};

This is cut'n'paste from my larger project. It allows to specify backoff policy. Thus the benefit of using C++ is greater flexibility.

The downside is that it is a simple TAS lock while you have a TATAS lock. Unfortunately, it looks like there is no way to load the current std::atomic_flag value. This is a weird libc++ limitation, so it might be more suitable to use std::atomic<bool> if a TATAS lock is really needed.

Question 5

It's a great point and well made, but I don't have a standard library yet There is no std::atomic. At some point I'll need to port or write one. But there's no libc yet either ... !

Question 6

I see. I believe there is no dependency on libc for porting std::atomic. It could be done with gcc builtin functions alone: gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

Question 7

Aleksey - Thank you, I didn't even know those existed. That feels like it'd save a source file, a function call and a few push operations and generally be a bit tidier. When I get a moment I'll have a play and see what kind of assembly pops out of the compiler for those.

Quuxplusone Quuxplusone 19.7k2 gold badges44 silver badges91 bronze badges · Accepted Answer · 2015-03-17 05:52:25Z

This is largely a duplicate of https://stackoverflow.com/questions/6935442/x86-spinlock-using-cmpxchg, https://stackoverflow.com/questions/11959374/fastest-inline-assembly-spinlock, etc.

(1) Your code works only on 32-bit x86. If you're using 64-bit (x86-64), the calling convention is totally wrong (you need to save and restore %rbp rather than %ebp, and you're looking for the argument in the wrong place). If you're implementing your own OS, I suppose you know what you're doing... but in general I think it's weird to see 32-bit x86 code these days.

(2) Useless trivia of the day: The two instructions test ecx, ecx; jz .acquired could be replaced with the single instruction jecxz .acquired. I don't know whether this is an optimization or a pessimization on your particular hardware. It would be a size optimization, at least.

(2.5) Really, consider using cmpxchg instead of xchg; test. The cmpxchg mnemonic maps more directly onto the "Compare And Swap" idiom that you're using. It will also help you when it comes time to implement "is the lock held by this thread"; you can set the word to current-thread-id instead of just 1.

(3) See 'About PAUSE' below: Unfortunately there is no "About PAUSE" below. However, it seems like you're doing the right thing there. You could definitely tighten up the code; for example I don't know why you're wasting time with that test dword [eax], 1. You should just retry the xchg as soon as the pause is over.

(4) Consider removing the stack frame; it's not doing anything useful (unless it's necessary for your debugger of choice), just slowing things down.

Thanks! That's great feedback. (1) By the time I'm finished I'm sure I'll know what I'm doing. Writing an OS is the most interesting education I've done for many years! (3) The 'About PAUSE' comment that's missing was a quote from here: stackoverflow.com/questions/12894078/pause-instruction-in-x86/… ... and yes, (4) the stack frame was there simply to help with debugging. I'll (think of the rainforests...) and remove it. It is a waste of good processor cycles. (2, 2.5) I'm about a tenth of the way through the intel manual! I'll check those bits out! Thanks again!

Stack Exchange Network

Spinlock for C++ kernel (with x86 ASM)

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Spinlock for C++ kernel (with x86 ASM)

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions