Block Allocation in Fixed-Size Block Allocator

Question 1

The following routine is the block-allocation procedure in a fixed-size allocator being written for library use. It is designed to be accessed from C++ (the un-mangled symbol name is void * Superblock::alloc_block()). The performance of this routine is the critical to the library; and I, with almost no prior experience in assembly, am looking for suggestions about performance and also any holes in the logic. It seems to be working correctly right now, but I'm worried about edge cases.

Execution context: written for x86_64 System V abi, full allocator may be found here: https://github.com/cmura81/experimental-allocator.

alloc_sb_alloc_block.s:

.globl __ZN10Superblock11alloc_blockEv
// 'this is passed in %rdi
// Variables:
// this->beginning = 0(%rdi) +8
// this->last_privately_freed_block = 8(%rdi) +8
// this->last_publicly_freed_block = 16(%rdi) +8
// this->block_size = 24(%rdi) +2
// this->free_blocks = 26(%rdi) +2
// this->max_blocks = 28(%rdi) +2
__ZN10Superblock11alloc_blockEv:
 xorq %rcx, %rcx
 // this->beginning -> rsi
 movq 0(%rdi), %rsi
 // allocated_block = this->last_privately_freed_block
 movq 8(%rdi), %rax
 // if this->last_privately_freed_block == NULL
 testq %rax, %rax
 jz __ZN10Superblock11alloc_blockEv.i_pubchalloc
 // this->free_blocks--
 decw 26(%rdi)
 // Move 2 bytes from rax into edx
 movzwl (%rax), %edx
 notw %dx
 jz __ZN10Superblock11alloc_blockEv.i_nbi_ffff
 notw %dx
 movslq %edx, %rcx
 // this->last_privately_freed_block = this->beginning + next_block_index
 addq %rcx, %rsi
 movq %rsi, 8(%rdi)
 ret
__ZN10Superblock11alloc_blockEv.i_nbi_ffff:
 movq 0,ドル 8(%rdi)
 ret
__ZN10Superblock11alloc_blockEv.i_pubchalloc:
 movq 16(%rdi), %rax
 // if this->last_publicly_freed_block == NULL
 testq %rax, %rax
 jz __ZN10Superblock11alloc_blockEv.i_noalloc
 // this->free_blocks--
 decw 26(%rdi)
 // if *(uint16_t)(this->last_publicly_freed_block + next_block_index) == 0xFFFF
 movzwl (%rax), %edx
 notw %dx
 jz __ZN10Superblock11alloc_blockEv.i_pubnbi_ffff
 notw %dx
 movslq %edx, %rcx
 // this->last_privately_freed_block = this->beginning + next_block_index
 addq %rcx, %rsi
 movq %rsi, 16(%rdi)
 ret
__ZN10Superblock11alloc_blockEv.i_pubnbi_ffff:
 movq 0,ドル 8(%rdi)
 ret
__ZN10Superblock11alloc_blockEv.i_noalloc:
 movq 0,ドル %rax
 ret

Question 2

Don't you have some unit tests for the edge cases? Perhaps you should have posted them - it's often easier to find gaps in the testing than to spot the errors in the code.

Question 3

These comments all have to do with optimization, not "logic holes." Also, they are written without having tested them. Obviously you'll want to both ensure correct function, and profile to see if they actually improve anything.

1) As an obvious first step:

xorq %rcx, %rcx

can be done as

xorq %ecx, %ecx

It produces the same effect, but it's 1 byte shorter.

2) It looks like you are using notw %dx (in 2 different places) to see if dx is 0xffff? I might try cmp 0ドルxffff, %dx instead. It's 2 bytes longer, but you don't have to "put it back" if it's not (the extra notw %dx costs 3 bytes).

3) (Check me on this one) You are using movslq %edx, %rcx to move edx to rcx. However because of how you populate edx, I believe it already clears the upper bits. Such being the case, a simple movq %rdx, %rcx might suffice.

4) The __ZN10Superblock11alloc_blockEv.i_pubchalloc loop should probably have a pause inserted. Perhaps something like this:

__ZN10Superblock11alloc_blockEv.i_pubchalloc2:
 pause
__ZN10Superblock11alloc_blockEv.i_pubchalloc:
 movq 16(%rdi), %rax
 // if this->last_publicly_freed_block == NULL
 testq %rax, %rax
 jz __ZN10Superblock11alloc_blockEv.i_pubchalloc2

See the docs on pause for why this is a good idea.

These are all pretty nitpicky. Since there's only the 1 loop, there's just not that much to optimize.

Question 4

Actually, .i_pubchalloc isn't a loop - it was a typo on my part; should have been .i_noalloc, never got noticed because it's an extremely low-probability event

Question 5

Ahh. I just assumed it was a multi-threading thing.

David Wohlferd David Wohlferd 1,5181 gold badge8 silver badges17 bronze badges · Accepted Answer · 2018-05-29 02:26:49Z

These comments all have to do with optimization, not "logic holes." Also, they are written without having tested them. Obviously you'll want to both ensure correct function, and profile to see if they actually improve anything.

1) As an obvious first step:

xorq %rcx, %rcx

can be done as

xorq %ecx, %ecx

It produces the same effect, but it's 1 byte shorter.

2) It looks like you are using notw %dx (in 2 different places) to see if dx is 0xffff? I might try cmp 0ドルxffff, %dx instead. It's 2 bytes longer, but you don't have to "put it back" if it's not (the extra notw %dx costs 3 bytes).

3) (Check me on this one) You are using movslq %edx, %rcx to move edx to rcx. However because of how you populate edx, I believe it already clears the upper bits. Such being the case, a simple movq %rdx, %rcx might suffice.

4) The __ZN10Superblock11alloc_blockEv.i_pubchalloc loop should probably have a pause inserted. Perhaps something like this:

__ZN10Superblock11alloc_blockEv.i_pubchalloc2:
 pause
__ZN10Superblock11alloc_blockEv.i_pubchalloc:
 movq 16(%rdi), %rax
 // if this->last_publicly_freed_block == NULL
 testq %rax, %rax
 jz __ZN10Superblock11alloc_blockEv.i_pubchalloc2

See the docs on pause for why this is a good idea.

These are all pretty nitpicky. Since there's only the 1 loop, there's just not that much to optimize.

Actually, .i_pubchalloc isn't a loop - it was a typo on my part; should have been .i_noalloc, never got noticed because it's an extremely low-probability event

Stack Exchange Network

Block Allocation in Fixed-Size Block Allocator

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Block Allocation in Fixed-Size Block Allocator

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions