Find binary sequences with low peak sidelobe level of autocorrelation function

Question 1

I would like to find all binary sequences from the specified range, which has low peak sidelobe level of its autocorrelation function.

Here is the solution (this is simplified version of the real code) using GCC Inline Assembly with AT&T syntax. Binary sequences are represented as sequence of bits in the integer number. The example is a complete C program which finds and prints the 13-position Barker Code.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
// A helper function, which is performed very very rare.
void SaveCode (const uint8_t length, const uint64_t code);
int main (int argc, char * argv [])
{
 const uint8_t sideLobeLimit= 1;
 const uint8_t length = 13;
 const uint64_t beginCode = 1ull << (length - 1);
 const uint64_t endCode = (1ull << length) - 1;
 const uint64_t mask = (1ull << (length - 1) ) - 1ull;
 __asm__ __volatile__ (
 "INIT: \n\t" // Prepare for computation.
 " movb %[length], %%r8b \n\t" // Load length of sequences into CPU register.
 " movq %[code], %%r9 \n\t" // Load first sequence of the range into CPU register.
 " movq %[maxcode], %%r10 \n\t" // Load last sequence of the range into CPU register.
 " movb %[limit], %%r11b \n\t" // Load maximum allowed level of sidelobes into CPU register.
 " movq %[mask], %%r12 \n\t" // Load mask for extracting significant bits into CPU register.
 "CHECK_CODE: \n\t" // Body of loop through sequence (like the "do-while" loop).
 " movb 1,ドル %%cl \n\t" // Set the offset value.
 " movq %%r12, %%r13 \n\t" // Set mask into mutable variable.
 "NEXT_SHIFT: \n\t" // Beginning of loop through shift of sequence (like the "do-while" loop).
 " movq %%r9, %%rdi \n\t" // Shifting sequence.
 " shrq %%cl, %%rdi \n\t" // Shift.
 " xorq %%r9, %%rdi \n\t" // Counting level of sidelobes.
 " andq %%r13, %%rdi \n\t" // Remove extra bits.
 " popcntq %%rdi, %%rax \n\t" // al = n (number of the different bits).
 " shlb 1,ドル %%al \n\t" // al = 2 * n.
 " subb %%r8b, %%al \n\t" // al = 2 * n - l (l - length of the sequence).
 " addb %%cl, %%al \n\t" // al = o + 2 * n - l (o - current offset).
 " jge ABS \n\t" // al =|o + 2 * n - l| (now al contain the sidelobe level).
 " negb %%al \n\t" // .
 " ABS: \n\t" // .
 " cmpb %%r11b, %%al \n\t" // Check if the sidelobe level acceptable?
 " jg NEXT_CODE \n\t" // If it is not, then go to the next sequence.
 " incb %%cl \n\t" // Increment the offset for creating next shifted sequence.
 " cmpb %%cl, %%r8b \n\t" // Check if is it the lass offset.
 " jbe SAVE_CODE \n\t" // If it is, then save cureent sequence.
 " shrq 1,ドル %%r13 \n\t" // Shift mask for next shifted sequence.
 " jmp NEXT_SHIFT \n\t" // End of loop through shift of sequence.
 "NEXT_CODE: \n\t" // Control of loop through sequence.
 " incq %%r9 \n\t" // Set next sequence.
 " cmpq %%r10, %%r9 \n\t" // Check if the sequence inside the range.
 " jbe CHECK_CODE \n\t" // If it is, then go to the begining of the loops body.
 " jmp QUIT \n\t" // If it is not, then go to the end of procedure.
 "SAVE_CODE: \n\t" // Saving sequence with accepted level of sidelobes.
 " pushq %%r8 \n\t" // Store registers.
 " pushq %%r9 \n\t" // .
 " pushq %%r10 \n\t" // .
 " pushq %%r11 \n\t" // .
 " pushq %%r12 \n\t" // .
 " movl %%r8d, %%edi \n\t" // .
 " movq %%r9, %%rsi \n\t" // .
 " call SaveCode \n\t" // Calling external function for saving the sequence.
 " popq %%r12 \n\t" // Restore registers.
 " popq %%r11 \n\t" // .
 " popq %%r10 \n\t" // .
 " popq %%r9 \n\t" // .
 " popq %%r8 \n\t" // .
 " jmp NEXT_CODE \n\t" // Continue test sequences.
 "QUIT: \n\t" // Exit of procedure.
 " nop \n\t" // .
 :
 : [length ] "m" (length),
 [code ] "m" (beginCode),
 [maxcode] "m" (endCode),
 [limit ] "m" (sideLobeLimit),
 [mask ] "m" (mask)
 : "%rax", "%rcx", "%rdi", "%rsi",
 "%r8", "%r9", "%r10", "%r11", "%r12", "%r13"
 );
 return EXIT_SUCCESS;
}
void SaveCode (const uint8_t length, const uint64_t code)
{
 uint8_t i = 0;
 for (i = 0; i < length; ++i) {
 (code >> i) & 0x01 ? printf ("+") : printf ("-");
 }
 printf ("\n");
}

It can be built with:

gcc main.c -o main

I'd like to hear any suggestions about how to:

improve the algorithm;
improve performance (use some tricks or instruction reordering or something else);
improve comments (content and translation);
improve readability (set aliases for CPU registers if it's possible or something else);
any other suggestions

Question 2

Editing your question after a review creates problems when trying to correlate a review back to the question. See what yoou can an cannot do after receiving an answer. In this case, the review focuses on one part of the code only, and the edits to the rest of the code do not affect the review. Rolling back just the changes to SaveCode.

Question 3

I have some suggestions for improving the algorithm's performance:

Seriously consider writing it in C, with inline assembly for popcount (or use the gcc __builtin_popcount(var)). Compilers have register allocators, can inline the output function, and should be able to schedule this relatively simple code quite well. It would also make it easier to unroll the loop, either manually or by asking the compiler to do it.
Use register constraints rather than explicit registers. The constraints are aliases for the register (google "gcc inline assembly"). If the compiler decides which registers are used, it can reuse their contents decide what to clobber. It can also reduce clobbering at call sites.
Iterate o - l rather than o. It simplifies your test to a test against zero, which saves one instruction.
If possible, use 32-bit integers for arithmetic. Modern PCs are optimized for 32-bit arithmetic, and the instructions are shorter.
Try alternative methods for computing the absolute value. If there is a 50/50 distribution of positive and negative sidelobes, then other methods are likely faster.
Construct the output string in an array on the stack, and output the string with puts (remember the terminating null character). The repeated putc calls can be far more expensive then the actual computation. You can ignore this comment if you do not plan to output the result to stdout.
Try if add 1 is faster than inc. inc does not change the carry flag, so it causes a dependency between previous and following instructions.

I also have a few comments for the readability:

Use one name for variables instead of many.
Switch to C, or at least provide C pseudocode.
Provide a link for the wiki page

You should also not use a raw rdtsc for performance readings because it does not order memory accesses. Use rdtscp or rdtsc with cpuid before and after.

Question 4

Thanks for your answer! Suggestions for algorithm. 1. I already have had C version, but, frankly speaking, this is not the same what is assembly code, I will try rewrite it based on assembly code. 2. Thanks, I will try to do it. 3. I have two loops, and it looks like this suggestion is applicable only for shift loop, I am trying to use it.

Question 5

4. Is it mean that I should reduce when it possible 64-bit instruction up to 32-bit, or it's also mean that I should extend 8-bit and 16-bit instructions? 5. Thanks, I will try to do it. 6. Thanks for suggestion but, one more time, SaveCode is very and very rare function. For instance for sequence with length 48 (which need 2^47 extern loops!) SaveCode called only 16 times! 7. As far as I am remember, I have tried do it, but anyway I will try to do again. Suggestions for readability. 1. Thanks, you are right! 2. I have already answered that I will try to do it. 3. What page do you mean?

Question 6

Here is version of code on C. Based on result of simple benchmark (time ./main) C version requires twice as much time than Asm version.

Question 7

I've tried to use branchless abs and it has worst execution time than existed solution. The example can be launched here.

Question 8

It looks like replacing inc by add 1 allows to reduce execution time about 0.5 percent.

Question 9

Hmmm, after reviewing SaveCode() and getting deeper into it, I now see SaveCode() is a helper function and not the main attraction. FWIW, here is the review of SaveCode().

SaveCode() Minor: Suggest simply shifting code by 1 each loop. On 64-bit machines, likely will make little improvement, but with humble machines, a single shift is easier.
```
// (code >> i) & 0x01 ? printf ("+") : printf ("-");
(code & 1) ? printf ("+") : printf ("-");
code >>= 1;
```
SaveCode() Minor: Likley node performnece nor code space saving using uint8_t. unsigned will do.
```
 // uint8_t i = 0;
 unsigned i = 0;
```

fputc() or putc() is certainly quicker than printf ("+"). Note: not a fan of using ?: here.

 // ... ? printf ("+") : printf ("-");
 ... ? fputc('+', stdout) : fputc('-', stdout);

Putting that all together, and some other bits, suggest:

void SaveCode(unsigned length, uint64_t code) 
{
 while (length > 0) {
 length--;
 (code & 1) ? fputc('+', stdout) : fputc('-', stdout);
 code >>= 1;
 }
 fputc('\n', stdout);
}

Question 10

I now see SaveCode() is a helper function - yep it's my mistake, it's highlighted now. This is not exactly what I want, but any way thank you for your attention and time!

Question 11

1. - Agree. 2. - Can you explain why (link will be enough)? 3. - What about fputs (code & 1 ? '+' : '-', stdout);?

Question 12

2. In general, processors are optimized (speed/code space) for int/unsigned rather than (u)int8_t This answer and Jens comment gets into some of the issues. 3. fputc(code & 1 ... (not fputs()) would work as well.

Lars Bonnichsen Lars BonnichsenLars Bonnichsen 411 bronze badge · Answer 1 · 2014-12-08 01:12:12Z

I have some suggestions for improving the algorithm's performance:

Seriously consider writing it in C, with inline assembly for popcount (or use the gcc __builtin_popcount(var)). Compilers have register allocators, can inline the output function, and should be able to schedule this relatively simple code quite well. It would also make it easier to unroll the loop, either manually or by asking the compiler to do it.
Use register constraints rather than explicit registers. The constraints are aliases for the register (google "gcc inline assembly"). If the compiler decides which registers are used, it can reuse their contents decide what to clobber. It can also reduce clobbering at call sites.
Iterate o - l rather than o. It simplifies your test to a test against zero, which saves one instruction.
If possible, use 32-bit integers for arithmetic. Modern PCs are optimized for 32-bit arithmetic, and the instructions are shorter.
Try alternative methods for computing the absolute value. If there is a 50/50 distribution of positive and negative sidelobes, then other methods are likely faster.
Construct the output string in an array on the stack, and output the string with puts (remember the terminating null character). The repeated putc calls can be far more expensive then the actual computation. You can ignore this comment if you do not plan to output the result to stdout.
Try if add 1 is faster than inc. inc does not change the carry flag, so it causes a dependency between previous and following instructions.

I also have a few comments for the readability:

Use one name for variables instead of many.
Switch to C, or at least provide C pseudocode.
Provide a link for the wiki page

You should also not use a raw rdtsc for performance readings because it does not order memory accesses. Use rdtscp or rdtsc with cpuid before and after.

Thanks for your answer! Suggestions for algorithm. 1. I already have had C version, but, frankly speaking, this is not the same what is assembly code, I will try rewrite it based on assembly code. 2. Thanks, I will try to do it. 3. I have two loops, and it looks like this suggestion is applicable only for shift loop, I am trying to use it.
4. Is it mean that I should reduce when it possible 64-bit instruction up to 32-bit, or it's also mean that I should extend 8-bit and 16-bit instructions? 5. Thanks, I will try to do it. 6. Thanks for suggestion but, one more time, SaveCode is very and very rare function. For instance for sequence with length 48 (which need 2^47 extern loops!) SaveCode called only 16 times! 7. As far as I am remember, I have tried do it, but anyway I will try to do again. Suggestions for readability. 1. Thanks, you are right! 2. I have already answered that I will try to do it. 3. What page do you mean?
Here is version of code on C. Based on result of simple benchmark (time ./main) C version requires twice as much time than Asm version.
I've tried to use branchless abs and it has worst execution time than existed solution. The example can be launched here.
It looks like replacing inc by add 1 allows to reduce execution time about 0.5 percent.

chux chuxchux 36.4k2 gold badges43 silver badges96 bronze badges · Answer 2 · 2014-11-01 02:44:51Z

Hmmm, after reviewing SaveCode() and getting deeper into it, I now see SaveCode() is a helper function and not the main attraction. FWIW, here is the review of SaveCode().

SaveCode() Minor: Suggest simply shifting code by 1 each loop. On 64-bit machines, likely will make little improvement, but with humble machines, a single shift is easier.
```
// (code >> i) & 0x01 ? printf ("+") : printf ("-");
(code & 1) ? printf ("+") : printf ("-");
code >>= 1;
```
SaveCode() Minor: Likley node performnece nor code space saving using uint8_t. unsigned will do.
```
 // uint8_t i = 0;
 unsigned i = 0;
```

fputc() or putc() is certainly quicker than printf ("+"). Note: not a fan of using ?: here.

 // ... ? printf ("+") : printf ("-");
 ... ? fputc('+', stdout) : fputc('-', stdout);

Putting that all together, and some other bits, suggest:

void SaveCode(unsigned length, uint64_t code) 
{
 while (length > 0) {
 length--;
 (code & 1) ? fputc('+', stdout) : fputc('-', stdout);
 code >>= 1;
 }
 fputc('\n', stdout);
}

I now see SaveCode() is a helper function - yep it's my mistake, it's highlighted now. This is not exactly what I want, but any way thank you for your attention and time!
1. - Agree. 2. - Can you explain why (link will be enough)? 3. - What about fputs (code & 1 ? '+' : '-', stdout);?
2. In general, processors are optimized (speed/code space) for int/unsigned rather than (u)int8_t This answer and Jens comment gets into some of the issues. 3. fputc(code & 1 ... (not fputs()) would work as well.

Stack Exchange Network

Find binary sequences with low peak sidelobe level of autocorrelation function

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Find binary sequences with low peak sidelobe level of autocorrelation function

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions