I would like to find all binary sequences from the specified range, which has low peak sidelobe level of its autocorrelation function.
Here is the solution (this is simplified version of the real code) using GCC Inline Assembly with AT&T syntax. Binary sequences are represented as sequence of bits in the integer number. The example is a complete C program which finds and prints the 13-position Barker Code.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
// A helper function, which is performed very very rare.
void SaveCode (const uint8_t length, const uint64_t code);
int main (int argc, char * argv [])
{
const uint8_t sideLobeLimit= 1;
const uint8_t length = 13;
const uint64_t beginCode = 1ull << (length - 1);
const uint64_t endCode = (1ull << length) - 1;
const uint64_t mask = (1ull << (length - 1) ) - 1ull;
__asm__ __volatile__ (
"INIT: \n\t" // Prepare for computation.
" movb %[length], %%r8b \n\t" // Load length of sequences into CPU register.
" movq %[code], %%r9 \n\t" // Load first sequence of the range into CPU register.
" movq %[maxcode], %%r10 \n\t" // Load last sequence of the range into CPU register.
" movb %[limit], %%r11b \n\t" // Load maximum allowed level of sidelobes into CPU register.
" movq %[mask], %%r12 \n\t" // Load mask for extracting significant bits into CPU register.
"CHECK_CODE: \n\t" // Body of loop through sequence (like the "do-while" loop).
" movb 1,ドル %%cl \n\t" // Set the offset value.
" movq %%r12, %%r13 \n\t" // Set mask into mutable variable.
"NEXT_SHIFT: \n\t" // Beginning of loop through shift of sequence (like the "do-while" loop).
" movq %%r9, %%rdi \n\t" // Shifting sequence.
" shrq %%cl, %%rdi \n\t" // Shift.
" xorq %%r9, %%rdi \n\t" // Counting level of sidelobes.
" andq %%r13, %%rdi \n\t" // Remove extra bits.
" popcntq %%rdi, %%rax \n\t" // al = n (number of the different bits).
" shlb 1,ドル %%al \n\t" // al = 2 * n.
" subb %%r8b, %%al \n\t" // al = 2 * n - l (l - length of the sequence).
" addb %%cl, %%al \n\t" // al = o + 2 * n - l (o - current offset).
" jge ABS \n\t" // al =|o + 2 * n - l| (now al contain the sidelobe level).
" negb %%al \n\t" // .
" ABS: \n\t" // .
" cmpb %%r11b, %%al \n\t" // Check if the sidelobe level acceptable?
" jg NEXT_CODE \n\t" // If it is not, then go to the next sequence.
" incb %%cl \n\t" // Increment the offset for creating next shifted sequence.
" cmpb %%cl, %%r8b \n\t" // Check if is it the lass offset.
" jbe SAVE_CODE \n\t" // If it is, then save cureent sequence.
" shrq 1,ドル %%r13 \n\t" // Shift mask for next shifted sequence.
" jmp NEXT_SHIFT \n\t" // End of loop through shift of sequence.
"NEXT_CODE: \n\t" // Control of loop through sequence.
" incq %%r9 \n\t" // Set next sequence.
" cmpq %%r10, %%r9 \n\t" // Check if the sequence inside the range.
" jbe CHECK_CODE \n\t" // If it is, then go to the begining of the loops body.
" jmp QUIT \n\t" // If it is not, then go to the end of procedure.
"SAVE_CODE: \n\t" // Saving sequence with accepted level of sidelobes.
" pushq %%r8 \n\t" // Store registers.
" pushq %%r9 \n\t" // .
" pushq %%r10 \n\t" // .
" pushq %%r11 \n\t" // .
" pushq %%r12 \n\t" // .
" movl %%r8d, %%edi \n\t" // .
" movq %%r9, %%rsi \n\t" // .
" call SaveCode \n\t" // Calling external function for saving the sequence.
" popq %%r12 \n\t" // Restore registers.
" popq %%r11 \n\t" // .
" popq %%r10 \n\t" // .
" popq %%r9 \n\t" // .
" popq %%r8 \n\t" // .
" jmp NEXT_CODE \n\t" // Continue test sequences.
"QUIT: \n\t" // Exit of procedure.
" nop \n\t" // .
:
: [length ] "m" (length),
[code ] "m" (beginCode),
[maxcode] "m" (endCode),
[limit ] "m" (sideLobeLimit),
[mask ] "m" (mask)
: "%rax", "%rcx", "%rdi", "%rsi",
"%r8", "%r9", "%r10", "%r11", "%r12", "%r13"
);
return EXIT_SUCCESS;
}
void SaveCode (const uint8_t length, const uint64_t code)
{
uint8_t i = 0;
for (i = 0; i < length; ++i) {
(code >> i) & 0x01 ? printf ("+") : printf ("-");
}
printf ("\n");
}
It can be built with:
gcc main.c -o main
I'd like to hear any suggestions about how to:
- improve the algorithm;
- improve performance (use some tricks or instruction reordering or something else);
- improve comments (content and translation);
- improve readability (set aliases for CPU registers if it's possible or something else);
- any other suggestions
-
2\$\begingroup\$ Editing your question after a review creates problems when trying to correlate a review back to the question. See what yoou can an cannot do after receiving an answer. In this case, the review focuses on one part of the code only, and the edits to the rest of the code do not affect the review. Rolling back just the changes to SaveCode. \$\endgroup\$rolfl– rolfl2014年11月10日 11:42:48 +00:00Commented Nov 10, 2014 at 11:42
2 Answers 2
I have some suggestions for improving the algorithm's performance:
- Seriously consider writing it in C, with inline assembly for popcount (or use the
gcc __builtin_popcount(var)
). Compilers have register allocators, can inline the output function, and should be able to schedule this relatively simple code quite well. It would also make it easier to unroll the loop, either manually or by asking the compiler to do it. - Use register constraints rather than explicit registers. The constraints are aliases for the register (google "gcc inline assembly"). If the compiler decides which registers are used, it can reuse their contents decide what to clobber. It can also reduce clobbering at call sites.
- Iterate o - l rather than o. It simplifies your test to a test against zero, which saves one instruction.
- If possible, use 32-bit integers for arithmetic. Modern PCs are optimized for 32-bit arithmetic, and the instructions are shorter.
- Try alternative methods for computing the absolute value. If there is a 50/50 distribution of positive and negative sidelobes, then other methods are likely faster.
- Construct the output string in an array on the stack, and output the string with puts (remember the terminating null character). The repeated putc calls can be far more expensive then the actual computation. You can ignore this comment if you do not plan to output the result to stdout.
- Try if add 1 is faster than inc. inc does not change the carry flag, so it causes a dependency between previous and following instructions.
I also have a few comments for the readability:
- Use one name for variables instead of many.
- Switch to C, or at least provide C pseudocode.
- Provide a link for the wiki page
You should also not use a raw rdtsc for performance readings because it does not order memory accesses. Use rdtscp or rdtsc with cpuid before and after.
-
\$\begingroup\$ Thanks for your answer! Suggestions for algorithm. 1. I already have had C version, but, frankly speaking, this is not the same what is assembly code, I will try rewrite it based on assembly code. 2. Thanks, I will try to do it. 3. I have two loops, and it looks like this suggestion is applicable only for shift loop, I am trying to use it. \$\endgroup\$Gluttton– Gluttton2014年12月08日 20:48:18 +00:00Commented Dec 8, 2014 at 20:48
-
\$\begingroup\$ 4. Is it mean that I should reduce when it possible 64-bit instruction up to 32-bit, or it's also mean that I should extend 8-bit and 16-bit instructions? 5. Thanks, I will try to do it. 6. Thanks for suggestion but, one more time,
SaveCode
is very and very rare function. For instance for sequence with length 48 (which need 2^47 extern loops!)SaveCode
called only 16 times! 7. As far as I am remember, I have tried do it, but anyway I will try to do again. Suggestions for readability. 1. Thanks, you are right! 2. I have already answered that I will try to do it. 3. What page do you mean? \$\endgroup\$Gluttton– Gluttton2014年12月08日 20:49:03 +00:00Commented Dec 8, 2014 at 20:49 -
-
\$\begingroup\$ I've tried to use branchless abs and it has worst execution time than existed solution. The example can be launched here. \$\endgroup\$Gluttton– Gluttton2014年12月13日 14:30:27 +00:00Commented Dec 13, 2014 at 14:30
-
\$\begingroup\$ It looks like replacing
inc
byadd 1
allows to reduce execution time about 0.5 percent. \$\endgroup\$Gluttton– Gluttton2014年12月13日 15:39:17 +00:00Commented Dec 13, 2014 at 15:39
Hmmm, after reviewing SaveCode()
and getting deeper into it, I now see SaveCode()
is a helper function and not the main attraction. FWIW, here is the review of SaveCode()
.
SaveCode()
Minor: Suggest simply shiftingcode
by 1 each loop. On 64-bit machines, likely will make little improvement, but with humble machines, a single shift is easier.// (code >> i) & 0x01 ? printf ("+") : printf ("-"); (code & 1) ? printf ("+") : printf ("-"); code >>= 1;
SaveCode()
Minor: Likley node performnece nor code space saving usinguint8_t
.unsigned
will do.// uint8_t i = 0; unsigned i = 0;
fputc()
orputc()
is certainly quicker thanprintf ("+")
. Note: not a fan of using?:
here.// ... ? printf ("+") : printf ("-"); ... ? fputc('+', stdout) : fputc('-', stdout);
Putting that all together, and some other bits, suggest:
void SaveCode(unsigned length, uint64_t code)
{
while (length > 0) {
length--;
(code & 1) ? fputc('+', stdout) : fputc('-', stdout);
code >>= 1;
}
fputc('\n', stdout);
}
-
\$\begingroup\$
I now see SaveCode() is a helper function
- yep it's my mistake, it's highlighted now. This is not exactly what I want, but any way thank you for your attention and time! \$\endgroup\$Gluttton– Gluttton2014年11月01日 10:13:28 +00:00Commented Nov 1, 2014 at 10:13 -
2\$\begingroup\$
1.
- Agree.2.
- Can you explain why (link will be enough)?3.
- What aboutfputs (code & 1 ? '+' : '-', stdout);
? \$\endgroup\$Gluttton– Gluttton2014年11月01日 10:29:43 +00:00Commented Nov 1, 2014 at 10:29 -
\$\begingroup\$ 2. In general, processors are optimized (speed/code space) for
int/unsigned
rather than(u)int8_t
This answer and Jens comment gets into some of the issues. 3.fputc(code & 1 ...
(notfputs()
) would work as well. \$\endgroup\$chux– chux2014年11月01日 14:55:09 +00:00Commented Nov 1, 2014 at 14:55
Explore related questions
See similar questions with these tags.