I noticed that there's no such question, so here it is:
Do you have general tips for golfing in x86/x64 machine code? If the tip only applies to a certain environment or calling convention, please specify that in your answer.
Please only one tip per answer (see here).
mov-immediate is expensive for constants
This might be obvious, but I'll still put it here. In general it pays off to think about the bit-level representation of a number when you need to initialize a value.
Initializing eax with 0:
b8 00 00 00 00 mov $0x0,%eax
should be shortened (for performance as well as code-size) to
31 c0 xor %eax,%eax
Initializing eax with -1:
b8 ff ff ff ff mov $-1,%eax
can be shortened to (32-bit mode)
31 c0 xor %eax,%eax
48 dec %eax # 2 bytes in 64-bit mode
or (any mode)
83 c8 ff or $-1,%eax
Or more generally, any 8-bit sign-extended value can be created in 3 bytes with push -13 (2 bytes) / pop %rbp (1 byte). This even works for 64-bit registers with no extra REX prefix; in 64-bit mode, push/pop default to 64-bit operand-size.
6a f3 pushq $0xfffffffffffffff3
5d pop %rbp
Or given a known constant in a register, you can create another nearby constant using lea 123(%eax), %ecx (3 bytes). This is handy if you need a zeroed register and a constant; xor-zero (2 bytes) + lea-disp8 (3 bytes).
31 c0 xor %eax,%eax
8d 48 0c lea 0xc(%eax),%ecx
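As a rough sketch of the push imm8 / pop reg trick above, the following Python helper builds the three bytes by hand. The names fits_imm8 and push_pop_const are made up for this illustration, and it assumes one of the eight low registers (r8..r15 would need a REX prefix on the pop).

```python
def fits_imm8(value: int) -> bool:
    """True if value is representable as a sign-extended 8-bit immediate."""
    return -128 <= value <= 127


def push_pop_const(value: int, reg: int) -> bytes:
    """Encode `push imm8` / `pop reg` (3 bytes total).

    reg is the hardware register number (0 = eax/rax, 5 = ebp/rbp).
    Hypothetical helper for illustration only.
    """
    if not fits_imm8(value):
        raise ValueError("constant does not fit in a sign-extended imm8")
    if not 0 <= reg <= 7:
        raise ValueError("r8..r15 would need a REX prefix on the pop")
    # 6A ib = push imm8, 58+r = pop reg
    return bytes([0x6A, value & 0xFF, 0x58 + reg])


print(push_pop_const(-13, 5).hex(" "))  # 6a f3 5d
print(push_pop_const(10, 0).hex(" "))   # 6a 0a 58
```

A constant like 200 is rejected because it is outside -128..127, which is the case where a plain mov or an lea from a known register may be the better choice.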
- BTW to initialize a register to -1, use dec, e.g. xor eax, eax; dec eax – anatolyg, Jul 18, 2017 at 10:28
- @anatolyg: 200 is a poor example, it doesn't fit in a sign-extended-imm8. But yes, push imm8 / pop reg is 3 bytes, and is fantastic for 64-bit constants on x86-64, where dec/inc is 2 bytes. And push r64 / pop r64 (2 bytes) can even replace a 3 byte mov r64, r64 (3 bytes with REX). See also Set all bits in CPU register to 1 efficiently for stuff like lea eax, [rcx-1] given a known value in eax (e.g. if need a zeroed register and another constant, just use LEA instead of push/pop). – Peter Cordes, Mar 29, 2018 at 13:35
Choose your calling convention to put args where you want them.
The language of your answer is asm (actually machine code), so treat it as part of a program written in asm, not C-compiled-for-x86. Your function doesn't have to be easily callable from C with any standard calling convention. That's a nice bonus if it doesn't cost you any extra bytes, though.
In a pure asm program, it's normal for some helper functions to use a calling convention that's convenient for them and for their caller. Such functions document their calling convention (inputs/outputs/clobbers) with comments.
In real life, even asm programs do (I think) tend to use consistent calling conventions for most functions (especially across different source files), but any given important function could do something special. In code-golf, you're optimizing the crap out of one single function, so obviously it's important/special.
To test your function from a C program, you can write a wrapper that puts args in the right places, saves/restores any extra registers you clobber, and puts the return value into e/rax if it wasn't there already.
The limits of what's reasonable: anything that doesn't impose an unreasonable burden on the caller:
- esp/rsp must be call-preserved (see Note 1); other integer regs are fair game for being call-clobbered. (rbp and rbx are usually call-preserved in normal conventions, but you could clobber both.)
- Any arg in any register (except rsp) is reasonable, but asking the caller to copy the same arg to multiple registers is not.
- Requiring DF (string direction flag for lods/stos/etc.) to be clear (upward) on call/ret is normal. Letting it be undefined on call/ret would be ok. Requiring it to be cleared or set on entry but then leaving it modified when you return would be weird.
- Returning FP values in x87 st0 is reasonable, but returning in st3 with garbage in other x87 registers isn't. The caller would have to clean up the x87 stack. Even returning in st0 with non-empty higher stack registers would also be questionable (unless you're returning multiple values).
- Your function will be called with call, so [rsp] is your return address. You can avoid call/ret on x86 using a link register like lea rbx, [ret_addr] / jmp function and return with jmp rbx, but that's not "reasonable". That's not as efficient as call/ret, so it's not something you'd plausibly find in real code.
- Clobbering unlimited memory above rsp is not reasonable, but clobbering your function args on the stack is allowed in normal calling conventions. x64 Windows requires 32 bytes of shadow space above the return address, while x86-64 System V gives you a 128 byte red-zone below rsp, so either of those are reasonable. (Or even a much larger red-zone, especially in a stand-alone program rather than function.)
Note 1: or have some well-defined sensible rule for how RSP is modified: e.g. callee-pops stack args like with ret 8. (Although stack args and a larger ret imm16 encoding are usually not what you want for code-golf). Or even returning an array by value on the stack is an unconventional but usable calling-convention. e.g. pop the return address into a register, then push in a loop, then jmp reg to return. Probably only justifiable with a size in a register, else the caller would have to save the original RSP somewhere. RBP or some other reg would have to be call-preserved so a caller could use it as a frame pointer to easily clean up the result. Also probably not smaller than using stos to write an output pointer passed in RDI.
Borderline cases: write a function that produces a sequence in an array, given the first 2 elements as function args. I chose to have the caller store the start of the sequence into the array and just pass a pointer to the array. This is definitely bending the question's requirements. I considered taking the args packed into xmm0 for movlps [rdi], xmm0, which would also be a weird calling convention and even harder to justify / more of a stretch.
Return a boolean in FLAGS (condition codes)
OS X system calls do this (CF=0 means no error): Is it considered bad practice to use the flags register as a boolean return value?.
Any condition that can be checked with one jcc is perfectly reasonable, especially if you can pick one that has any semantic relevance to the problem. (e.g. a compare function might set flags so jne will be taken if they weren't equal).
Require narrow args (like a char) to be sign or zero extended to 32 or 64 bits.
This is not unreasonable; using movzx or movsx to avoid partial-register slowdowns is normal in modern x86 asm. In fact clang/LLVM already makes code that depends on an undocumented extension to the x86-64 System V calling convention: args narrower than 32 bits are sign or zero extended to 32 bits by the caller.
You can document/describe extension to 64 bits by writing uint64_t or int64_t in your prototype if you want, e.g. so you can use a loop instruction, which uses the whole 64 bits of rcx unless you use an address-size prefix to override the size down to 32 bit ecx (yes really, address-size not operand-size).
Note that long is only a 32-bit type in the Windows 64-bit ABI, and the Linux x32 ABI; uint64_t is unambiguous and shorter to type than unsigned long long.
Existing calling conventions:
- Windows 32-bit __fastcall, already suggested by another answer: integer args in ecx and edx.
- x86-64 System V: passes lots of args in registers, and has lots of call-clobbered registers you can use without REX prefixes. More importantly, it was actually chosen to allow compilers to inline (or implement in libc) memcpy or memset as rep movsb easily: the first 6 integer/pointer args are passed in rdi, rsi, rdx, rcx, r8, and r9.
  If your function uses lodsd / stosd inside a loop that runs rcx times (with the loop instruction), you can say "callable from C as int foo(int *rdi, const int *rsi, int dummy, uint64_t len) with the x86-64 System V calling convention". example: chromakey.
- 32-bit GCC regparm: Integer args in eax, ecx, and edx, return in EAX (or EDX:EAX). Having the first arg in the same register as the return value allows some optimizations, like this case with an example caller and a prototype with a function attribute. And of course AL/EAX is special for some instructions.
- The Linux x32 ABI uses 32-bit pointers in long mode, so you can save a REX prefix when modifying a pointer (example use-case). You can still use 64-bit address-size, unless you have a 32-bit negative integer zero-extended in a register (so it would be a large unsigned value if you did [rdi + rdx], going outside the low 32 bits of address space).
  Note that push rsp / pop rax is 2 bytes, and equivalent to mov rax, rsp, so you can still copy full 64-bit registers in 2 bytes.
- When challenges ask to return an array, do you think returning on the stack is reasonable? I think that's what compilers will do when returning a struct by value. – qwr, May 18, 2018 at 18:52
- @qwr: no, the mainstream calling conventions pass a hidden pointer to the return value. (Some conventions pass/return small structs in registers). C/C++ returning struct by value under the hood, and see the end of How do objects work in x86 at the assembly level?. Note that passing arrays (inside structs) does copy them onto the stack for x86-64 SysV: What kind of C11 data type is an array according to the AMD64 ABI, but Windows x64 passes a non-const pointer. – Peter Cordes, May 18, 2018 at 20:14
- so what do you think about reasonable or not? Do you count x86 under this rule codegolf.meta.stackexchange.com/a/8507/17360 – qwr, May 18, 2018 at 23:37
- @qwr: x86 isn't a "stack based language". x86 is a register machine with RAM, not a stack machine. A stack machine is like reverse-polish notation, like x87 registers. fld / fld / faddp. x86's call-stack doesn't fit that model: all normal calling conventions leave RSP unmodified, or pop the args with ret 16; they don't pop the return address, push an array, then push rcx / ret. The caller would have to know the array size or have saved RSP somewhere outside the stack to find itself. – Peter Cordes, May 18, 2018 at 23:56
- call pushes the address of the instruction after the call onto the stack and jumps to the called function; ret pops that address from the stack and jumps to it. – user58988, Mar 12, 2019 at 18:56
Use special-case short-form encodings for AL/AX/EAX, and other short forms and single-byte instructions
Examples assume 32 / 64-bit mode, where the default operand size is 32 bits. An operand-size prefix changes the instruction to AX instead of EAX (or the reverse in 16-bit mode).
- inc/dec a register (other than 8-bit): inc eax / dec ebp. (Not x86-64: the 0x4x opcode bytes were repurposed as REX prefixes, so inc r/m32 is the only encoding.)
  8-bit inc bl is 2 bytes, using the inc r/m8 opcode + ModR/M operand encoding. So use inc ebx to increment bl, if it's safe. (e.g. if you don't need the ZF result in cases where the upper bytes might be non-zero).
- scasd: e/rdi += 4, requires that the register points to readable memory. Sometimes useful even if you don't care about the FLAGS result (like cmp eax,[rdi] / rdi += 4). And in 64-bit mode, scasb can work as a 1-byte inc rdi, if lodsb or stosb aren't useful.
- xchg eax, r32: this is where 0x90 NOP came from: xchg eax,eax. Example: re-arrange 3 registers with two xchg instructions in a cdq / idiv loop for GCD in 8 bytes where most of the instructions are single-byte, including an abuse of inc ecx / loop instead of test ecx,ecx / jnz.
- cdq: sign-extend EAX into EDX:EAX, i.e. copying the high bit of EAX to all bits of EDX. To create a zero with known non-negative, or to get a 0/-1 to add/sub or mask with. x86 history lesson: cltq vs. movslq, and also AT&T vs. Intel mnemonics for this and the related cdqe.
- lodsb/d: like mov eax, [rsi] / rsi += 4 without clobbering flags. (Assuming DF is clear, which standard calling conventions require on function entry.) Also stosb/d, sometimes scas, and more rarely movs / cmps.
- push / pop reg. e.g. in 64-bit mode, push rsp / pop rdi is 2 bytes, but mov rdi, rsp needs a REX prefix and is 3 bytes.
xlatb exists, but is rarely useful. A large lookup table is something to avoid. I've also never found a use for AAA / DAA or other packed-BCD or 2-ASCII-digit instructions, except for a hacky use of DAS as part of converting a 4-bit integer to an ASCII hex digit, thanks to Peter Ferrie.
1-byte lahf / sahf are rarely useful. You could lahf / and ah, 1 as an alternative to setc ah, but it's typically not useful.
And for CF specifically, there's sbb eax,eax to get a 0/-1, or even un-documented but universally supported 1-byte salc (set AL from Carry) which effectively does sbb al,al without affecting flags. (Removed in x86-64). I used SALC in User Appreciation Challenge #1: Dennis ♦.
1-byte cmc / clc / stc (flip ("complement"), clear, or set CF) are rarely useful, although I did find a use for cmc in extended-precision addition with base 10^9 chunks. To unconditionally set/clear CF, usually arrange for that to happen as part of another instruction, e.g. xor eax,eax clears CF as well as EAX. There are no equivalent instructions for other condition flags, just DF (string direction) and IF (interrupts). The carry flag is special for a lot of instructions; shifts set it, adc al, 0 can add it to AL in 2 bytes, and I mentioned earlier the undocumented SALC.
std / cld rarely seem worth it. Especially in 32-bit code, it's better to just use dec on a pointer and a mov or memory source operand to an ALU instruction instead of setting DF so lodsb / stosb go downward instead of up. Usually if you need downward at all, you still have another pointer going up, so you'd need more than one std and cld in the whole function to use lods / stos for both. Instead, just use the string instructions for the upward direction. (The standard calling conventions guarantee DF=0 on function entry, so you can assume that for free without using cld.)
8086 history: why these encodings exist
In original 8086, AX was very special: instructions like lodsb / stosb, cbw, mul / div and others use it implicitly. That's still the case of course; current x86 hasn't dropped any of 8086's opcodes (at least not any of the officially documented ones, except in 64-bit mode). But later CPUs added new instructions that gave better / more efficient ways to do things without copying or swapping them to AX first. (Or to EAX in 32-bit mode.)
e.g. 8086 lacked later additions like movsx / movzx to load or move + sign-extend, or 2 and 3-operand imul cx, bx, 1234 that don't produce a high-half result and don't have any implicit operands.
Also, 8086's main bottleneck was instruction-fetch, so optimizing for code-size was important for performance back then. 8086's ISA designer (Stephen Morse) spent a lot of opcode coding space on special cases for AX / AL, including special (E)AX/AL-destination opcodes for all the basic immediate-src ALU instructions, just opcode + immediate with no ModR/M byte. 2-byte add/sub/and/or/xor/cmp/test/... AL,imm8, or 3-byte AX,imm16, or (in 32-bit mode) 5-byte EAX,imm32.
But there's no special case for EAX,imm8, so the regular ModR/M encoding of add eax,4 is shorter.
The assumption is that if you're going to work on some data, you'll want it in AX / AL, so swapping a register with AX was something you might want to do, maybe even more often than copying a register to AX with mov.
Everything about 8086 instruction encoding supports this paradigm, from instructions like lodsb/w to all the special-case encodings for immediates with EAX to its implicit use even for multiply/divide.
Don't get carried away; it's not automatically a win to swap everything to EAX, especially if you need to use immediates with 32-bit registers instead of 8-bit. Or if you need to interleave operations on multiple variables in registers at once. Or if you're using instructions with 2 registers, not immediates at all.
But always keep in mind: am I doing anything that would be shorter in EAX/AL? Can I rearrange so I have this in AL, or am I currently taking better advantage of AL with what I'm already using it for.
Mix 8-bit and 32-bit operations freely to take advantage whenever it's safe to do so (you don't need carry-out into the full register or whatever).
- cdq is useful for div which needs zeroed edx in many cases. – qwr, Apr 14, 2018 at 16:50
- @qwr: right, you can abuse cdq before unsigned div if you know your dividend is below 2^31 (i.e. non-negative when treated as signed), or if you use it before setting eax to a potentially-large value. Normally (outside code-golf) you'd use cdq as setup for idiv, and xor edx,edx before div – Peter Cordes, Apr 14, 2018 at 21:17
- "I've also never found a use for AAA / DAA or other packed-BCD or 2-ASCII-digit instructions." Here's an example of converting a number to an ASCII hex digit codepoint. I golfed this at some point and found several choices, none of which were shorter than this sequence. – ecm, Sep 7, 2021 at 15:03
- @ecm: There's a 5-byte / 3-instruction hack using DAS (which I didn't know about when I wrote this answer), suggested by @peter ferrie. I described how/why it works in Little Endian Number to String Conversion – Peter Cordes, Sep 7, 2021 at 18:47
- @ecm: that's correct; 64-bit mode cleaned up the opcode coding space a bit for future 64-bit-only extensions, which Intel has unfortunately been reluctant to take advantage of. Still wasting code-size cramming things like EVEX prefixes into patterns that aren't valid 32-bit encodings. What I said wasn't wrong, since modern CPUs are still required to support 16 and 32-bit modes, but it is useful to clarify, thanks. For actual codegolf.SE purposes, it's sufficient that 64-bit lahf / sahf are available on some implementations, and fuz's linked answer mentions that along with the BCD insns. – Peter Cordes, Feb 8, 2022 at 16:25
In a lot of cases, accumulator-based instructions (i.e. those that take (R|E)AX as the destination operand) are 1 byte shorter than general-case instructions; see this question on StackOverflow.
- Normally the most useful ones are the al, imm8 special cases, like or al, 0x20 / sub al, 'a' / cmp al, 'z'-'a' / ja .non_alphabetic being 2 bytes each, instead of 3. Using al for character data also allows lodsb and/or stosb. Or use al to test something about the low byte of EAX, like lodsd / test al, 1 / setnz cl makes cl=1 or 0 for odd/even. But in the rare case where you need a 32-bit immediate, then sure op eax, imm32, like in my chroma-key answer – Peter Cordes, Mar 29, 2018 at 13:31
Subtract -128 instead of add 128
0100 81C38000 ADD BX,0080
0104 83EB80 SUB BX,-80
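Likewise, add -128 instead of subtracting 128. The reason is that a sign-extended imm8 covers -128..127, so -128 fits in one immediate byte while +128 needs a full imm16 (or imm32).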
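To make the byte counts in the listing above concrete, here is a small Python sketch that encodes add bx, imm and sub bx, imm for 16-bit operand size, picking the sign-extended imm8 form (opcode 83) when the immediate fits and the imm16 form (opcode 81) otherwise. The function name encode_bx_imm is invented for this example, and only BX with add/sub is modelled.

```python
import struct

# ModR/M "reg" field (the /digit) that selects the operation for opcodes 81 and 83.
GROUP1_DIGIT = {"add": 0, "sub": 5}
BX = 3  # hardware register number of BX


def encode_bx_imm(op: str, imm: int) -> bytes:
    """Encode `op bx, imm` for 16-bit operand size, preferring the imm8 form.

    Hypothetical helper for illustration; handles only add/sub on BX.
    """
    modrm = 0xC0 | (GROUP1_DIGIT[op] << 3) | BX  # mod=11: register operand
    if -128 <= imm <= 127:
        # 83 /digit ib: 8-bit immediate, sign-extended by the CPU
        return bytes([0x83, modrm]) + struct.pack("<b", imm)
    # 81 /digit iw: full 16-bit immediate
    return bytes([0x81, modrm]) + struct.pack("<H", imm & 0xFFFF)


print(encode_bx_imm("add", 128).hex())   # 81c38000
print(encode_bx_imm("sub", -128).hex())  # 83eb80
```

Both instructions add 128 to BX, but the second is one byte shorter. The carry flag comes out differently between add and sub, so the swap is only safe when the code does not depend on CF afterwards.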
- This also works the other direction, of course: add -128 instead of sub 128. Fun fact: compilers know this optimization, and also do a related optimization of turning < 128 into <= 127 to reduce the magnitude of an immediate operand for cmp, or gcc always prefers rearranging compares to reduce the magnitude even if it's not -129 vs. -128. – Peter Cordes, May 18, 2018 at 20:22
Create 3 zeroes with mul (then inc/dec to get +1 / -1 as well as zero)
You can zero eax and edx by multiplying by zero in a third register.
xor ebx, ebx ; 2B ebx = 0
mul ebx ; 2B eax=edx = 0
inc ebx ; 1B ebx=1
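The xor and mul together leave EAX, EDX, and EBX all zero in just four bytes; the inc then turns EBX into 1 for one more byte (in 32-bit mode, where inc r32 is a single byte). You can zero EAX and EDX in three bytes: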
xor eax, eax
cdq
But from that starting point you can't get a 3rd zeroed register in one more byte, or a +1 or -1 register in another 2 bytes. Instead, use the mul technique.
Example use-case: concatenating the Fibonacci numbers in binary.
Note that after a LOOP loop finishes, ECX will be zero and can be used to zero EDX and EAX; you don't always have to create the first zero with xor.
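The reason one mul zeroes two registers is that one-operand mul writes the full 64-bit product to EDX:EAX. The following Python model is only an illustration of that arithmetic (mul32 is a made-up name, and flags are not modelled):

```python
MASK32 = 0xFFFFFFFF


def mul32(eax: int, src: int) -> tuple:
    """Model one-operand `mul src`: EDX:EAX = EAX * src (unsigned).

    Returns the new (eax, edx). Illustration only; flags are ignored.
    """
    product = (eax & MASK32) * (src & MASK32)
    return product & MASK32, product >> 32


ebx = 0                              # xor ebx, ebx
eax, edx = mul32(0xDEADBEEF, ebx)    # mul ebx, with arbitrary garbage in EAX
print(eax, edx, ebx)                 # 0 0 0
```

Whatever EAX held beforehand, multiplying by a zeroed register leaves both halves of the product zero, which is why any register already known to be zero (such as ECX after a loop) works as the source.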
- This is a bit confusing. Could you expand? – NoOneIsHere, Nov 14, 2017 at 1:29
- @NoOneIsHere I believe he wants to set three registers to 0, including EAX and EDX. – Maya, Nov 27, 2017 at 15:11
- Another use-case: 8086 Segment Address to Linear using 32-bit registers in 16-bit code, getting upper-16 zeroed in multiple registers is useful for later zero-extending, and this trick is more likely to be worth it since you're paying two 66h prefixes for three 32-bit registers. And if you don't need all those zeros right away, sometimes it can still allow other savings like mov cl, 8 instead of mov cx, 8 after zeroing ECX as part of this. – Peter Cordes, Dec 24, 2022 at 7:39
Skipping instructions
Skipping instructions are opcode fragments that combine with one or more subsequent opcodes. The subsequent opcodes can be used with a different entrypoint than the prepended skipping instruction. Using a skipping instruction instead of an unconditional short jump can save code space, be faster, and set up incidental state such as NC (No Carry).
My examples are all for 16-bit Real/Virtual 86 Mode, but a lot of these techniques can be used similarly in 16-bit Protected Mode, or 32- or 64-bit modes.
Quoting from my ACEGALS guide:
11: Skipping instructions
The constants __TEST_IMM8, __TEST_IMM16, and __TEST_OFS16_IMM8 are defined to the respective byte strings for these instructions. They can be used to skip subsequent instructions that fit into the following 1, 2, or 3 bytes. However, note that they modify the flags register, including always setting NC. The 16-bit offset plus 16-bit immediate test instruction is not included for these purposes because it might access a word at offset 0FFFFh in a segment. Also, the __TEST_OFS16_IMM8 as provided should only be used in 86M, to avoid accessing data beyond a segment limit. After the db instruction using one of these constants, a parenthetical remark should list which instructions are skipped.
The 86 Mode defines in lmacros1.mac 323cc150061e (2021-08-29 21:45:54 +0200):
%define __TEST_IMM8 0A8h ; changes flags, NC
%define __TEST_IMM16 0A9h ; changes flags, NC
; Longer NOPs require two bytes, like a short jump does.
; However they execute faster than unconditional jumps.
; This one reads random data in the stack segment.
; (Search for better ones.)
%define __TEST_OFS16_IMM8 0F6h,86h ; changes flags, NC
The 0F6h,86h opcode in 16-bit modes is a test byte [bp + disp16], imm8 instruction. I believe I am not using this one anywhere actually. (A stack memory access might actually be slower than an unconditional short jump, in fact.)
0A8h is the opcode for test al, imm8 in any mode. The 0A9h opcode changes to an instruction of the form test eax, imm32 in 32- and 64-bit modes.
Two use cases in ldosboot boot32.asm 07f4ba0ef8cd (2021-09-10 22:45:32 +0200):
First, chain two different entrypoints for a common function which both need to initialise a byte-sized register. The mov al, X instructions take 2 bytes each, so __TEST_IMM16 can be used to skip one such instruction. (This pattern can be repeated if there are more than two entrypoints.)
error_fsiboot:
mov al,'I'
db __TEST_IMM16 ; (skip mov)
read_sector.err:
mov al, 'R' ; Disk 'R'ead error
error:
Second, a certain entrypoint that needs two bytes worth of additional teardown but can otherwise be shared with the fallthrough case of a later code part.
mov bx, [VAR(para_per_sector)]
sub word [VAR(paras_left)], bx
jbe @F ; read enough -->
loop @BB
pop bx
pop cx
call clust_next
jnc next_load_cluster
inc ax
inc ax
test al, 8 ; set in 0FFF_FFF8h--0FFF_FFFFh,
; clear in 0, 1, and 0FFF_FFF7h
jz fsiboot_error_badchain
db __TEST_IMM16
@@:
pop bx
pop cx
call check_enough
jmp near word [VAR(fsiboot_table.success)]
Here's a use case in inicomp lz4.asm 4d568330924c (2021-09-03 16:59:42 +0200) where we depend on the test al, X instruction clearing the Carry Flag:
.success:
db __TEST_IMM8 ; (NC)
.error:
stc
retn
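To see how the same bytes decode differently depending on the entrypoint, here is a toy Python interpreter for the handful of 16-bit-mode opcodes used in the error_fsiboot / read_sector.err chain and in the .success / .error pair above. It is only a sketch: the run function is invented for this illustration, and the ret placed after the chain stands in for the real code at the error label.

```python
def run(code: bytes, entry: int, al: int = 0, cf: bool = False):
    """Tiny model of a few 16-bit-mode opcodes; returns (al, cf) at `ret`."""
    ip = entry
    while True:
        opcode = code[ip]
        if opcode == 0xB0:      # mov al, imm8
            al = code[ip + 1]
            ip += 2
        elif opcode == 0xA8:    # test al, imm8: swallows 1 byte, clears CF
            cf = False
            ip += 2
        elif opcode == 0xA9:    # test ax, imm16: swallows 2 bytes, clears CF
            cf = False
            ip += 3
        elif opcode == 0xF9:    # stc
            cf = True
            ip += 1
        elif opcode == 0xC3:    # ret (near)
            return al, cf
        else:
            raise ValueError(f"opcode {opcode:#04x} not modelled")


# mov al,'I' / db 0A9h / mov al,'R' / ret (the ret is a stand-in for `error:`)
chain = bytes([0xB0, ord("I"), 0xA9, 0xB0, ord("R"), 0xC3])
# db 0A8h / stc / retn
status = bytes([0xA8, 0xF9, 0xC3])

print(chr(run(chain, 0)[0]))     # I  (mov al,'R' was swallowed as an immediate)
print(chr(run(chain, 3)[0]))     # R  (entered at read_sector.err)
print(run(status, 0, cf=True))   # (0, False): .success returns NC
print(run(status, 1))            # (0, True):  .error returns CY
```

Entering at offset 0 of the chain executes mov al,'I', then the A9 byte consumes the next two bytes as its immediate, so the second mov never runs. Entering at offset 3 lands directly on that second mov.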
Further, here's a very similar use of a skipping instruction in DOSLFN Version 0.41c (11/2012). Instead of test ax, imm16 they're using mov cx, imm16 which has no effect on the status flags but clobbers the cx register instead. (Opcode 0B9h is mov ecx, imm32 in non-16-bit modes, and writes to the full ecx or rcx register.)
;THROW-Geschichten... [english: THROW stories...]
SetErr18:
mov al,18
db 0B9h ;mov cx,nnnn
SetErr5:
mov al,5
db 0B9h ;mov cx,nnnn
SetErr3:
mov al,3
db 0B9h ;mov cx,nnnn
SetErr2:
mov al,2
SetError:
Finally, the FAT12 boot loader released on 2002-11-26 as fatboot.zip/fat12.asm by Chris Giese (which I based my FAT12, FAT16, and FAT32 loaders on) uses cmp ax, imm16 as a skipping instruction in its error handler. This is similar to my lDOS boot error handlers but cmp leaves an indeterminate Carry Flag state rather than always setting up No Carry. Also note the comment referring to "Microsoft's Color Computer BASIC":
mov al,'F' ; file not found; display blinking 'F'
; 'hide' the next 2-byte instruction by converting it to CMP AX,NNNN
; I learned this trick from Microsoft's Color Computer BASIC :)
db 3Dh
disk_error:
mov al,'R' ; disk read error; display blinking 'R'
error:
- Welcome to Code Golf! This is a great tip! – Sep 20, 2021 at 13:27
- My answer on Golf a Custom Fibonacci Sequence is another example of the same idea. The first instruction I wanted for my loop happened to be opcode 01 add, so an EB byte before the loop made the first trip through decode as jmp rel8=+1, jumping to the end of the 2-byte add instruction. I first learned of the idea from Ira Baxter's answer on Can assembled ASM code result in more than a single possible way (except for offset values)? – Peter Cordes, Oct 27, 2021 at 17:32
CPU registers and flags are in known startup states
For a full/standalone program, we can assume that the CPU is in a known and documented default state based on platform and OS.
For example:
DOS http://www.fysnet.net/yourhelp.htm
Linux x86 ELF
http://asm.sourceforge.net/articles/startup.html - in _start in a static executable, most registers are zero other than the stack pointer, to avoid leaking info into a fresh process. pop will load argc which is a small non-negative integer, 1 if run normally from a shell with no args.
Same applies for x86-64 processes on Linux.
- Code Golf rules say your code has to work on at least one implementation. Linux chooses to zero all the regs (except RSP) and stack before entering a fresh user-space process, even though the i386 and x86-64 System V ABI docs say they're "undefined" on entry to _start. So yes it's fair game to take advantage of that if you're writing a program instead of a function. I did so in Extreme Fibonacci. (In a dynamically-linked executable, ld.so runs before jumping to your _start, and does leave garbage in registers, but static is just your code.) – Peter Cordes, May 3, 2019 at 1:50
- A couple others: eflags is set to 0x202, mxcsr is set to 0x1f80, though you can't directly access them. – General Grievance, Jan 10, 2020 at 5:04
- Adding on to DOS / .COM, we can assume also org 100h and IP=100h on start as well. – 640KB, Sep 2, 2025 at 16:49
mov small immediates into lower registers when applicable
If you already know the upper bits of a register are 0, you can use a shorter instruction to move an immediate into the lower registers.
b8 0a 00 00 00 mov $0xa,%eax
versus
b0 0a mov $0xa,%al
Use push/pop for imm8 to zero upper bits
Credit to Peter Cordes. xor/mov is 4 bytes, but push/pop is only 3!
6a 0a push $0xa
58 pop %eax
- mov al, 0xa is good if you don't need it zero-extended to the full reg. But if you do, xor/mov is 4 bytes vs. 3 for push imm8/pop or lea from another known constant. This could be useful in combination with mul to zero 3 registers in 4 bytes, or cdq, if you need a lot of constants, though. – Peter Cordes, Mar 29, 2018 at 18:13
- The other use-case would be for constants from [0x80..0xFF], which are not representable as a sign-extended imm8. Or if you already know the upper bytes, e.g. mov cl, 0x10 after a loop instruction, because the only way for loop to not jump is when it made rcx=0. (I guess you said this, but your example uses an xor). You can even use the low byte of a register for something else, as long as the something else puts it back to zero (or whatever) when you're done. e.g. my Fibonacci program keeps -1024 in ebx, and uses bl. – Peter Cordes, Mar 29, 2018 at 18:17
- @PeterCordes I've added your push/pop technique – qwr, Mar 29, 2018 at 18:30
- Should probably go into the existing answer about constants, where anatolyg already suggested it in a comment. I'll edit that answer. IMO you should rework this one to suggest using 8-bit operand-size for more stuff (except xchg eax, r32) e.g. mov bl, 10 / dec bl / jnz so your code doesn't care about the high bytes of RBX. – Peter Cordes, Mar 29, 2018 at 18:37
- Caveat is that PUSH immediate is not supported on the 8086/8088. – 640KB, Feb 18, 2019 at 19:16
Use do-while loops instead of while loops
This is not x86 specific but is a widely applicable beginner assembly tip. If you know a while loop will run at least once, rewriting the loop as a do-while loop, with loop condition checking at the end, often saves a 2 byte jump instruction. In a special case you might even be able to use loop.
- Related: Why are loops always compiled like this? explains why do{}while() is the natural looping idiom in assembly (especially for efficiency). Note also that 2-byte jecxz / jrcxz before a loop works very well with loop to handle the "needs to run zero times" case "efficiently" (on the rare CPUs where loop isn't slow). jecxz is also usable inside the loop to implement a while(ecx){}, with jmp at the bottom. – Peter Cordes, Apr 15, 2018 at 0:02
- @PeterCordes that is a very well written answer. I'd like to find a use for jumping into the middle of a loop in a code golf program. – qwr, Apr 15, 2018 at 2:56
- Use goto jmp and indentation... Loop follow – user58988, Mar 12, 2019 at 19:04
Combinations with CDQ for certain piecewise-linear functions
CDQ sign-extends EAX into EDX, making EDX 0 if EAX is nonnegative and -1 (all 1s) if EAX is negative. This can be combined with several other instructions to apply certain piecewise-linear functions to a value in EAX in 3 bytes:
CDQ + AND → \$ \min(x, 0) \$ (in either EAX or EDX). (I have used this here.)
CDQ + OR → \$ \max(x, -1) \$.
CDQ + XOR → \$ \max(x, -x-1) \$.
CDQ + MUL EDX → \$ \max(-x, 0) \$ in EAX and \$ \left\{ \begin{array}{ll} 0 & : x \ge 0 \\ x - 1 & : x < 0 \end{array} \right.\$ in EDX.
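The four identities above are easy to get wrong by one, so here is a small Python model of the 32-bit register arithmetic that can be used to check them. It is only a sketch: the helper names (`cdq`, `cdq_and`, and so on) are made up for illustration, and registers are modelled as unsigned 32-bit integers that are reinterpreted as signed at the end.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def cdq(eax):
    """Model CDQ: EDX becomes all ones if EAX is negative, else zero."""
    return MASK32 if eax & 0x80000000 else 0


def cdq_and(x):
    """CDQ + AND EAX, EDX."""
    eax = x & MASK32
    return to_signed(eax & cdq(eax))


def cdq_or(x):
    """CDQ + OR EAX, EDX."""
    eax = x & MASK32
    return to_signed(eax | cdq(eax))


def cdq_xor(x):
    """CDQ + XOR EAX, EDX."""
    eax = x & MASK32
    return to_signed(eax ^ cdq(eax))


def cdq_mul(x):
    """CDQ + MUL EDX: unsigned 32x32 -> 64-bit product, returned as (EAX, EDX)."""
    eax = x & MASK32
    product = eax * cdq(eax)
    return to_signed(product), to_signed(product >> 32)


if __name__ == "__main__":
    for x in (-5, -1, 0, 7):
        print(x, cdq_and(x), cdq_or(x), cdq_xor(x), cdq_mul(x))
```

The MUL case works because multiplying by 2^32 - 1 is the same as shifting left by 32 and subtracting the original value, which leaves -x in the low half and x - 1 in the high half.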
Use a good assembler
There are dozens of x86 assemblers out there, and they are not created equal.
Not only can a bad assembler be painful to use, it might not always output the shortest encoding.
Most x86 instructions have multiple valid encodings, some shorter than others.
For example, I saw one user with a 16-bit assembler that emitted different code depending on the order of xchg's operands. It is a commutative operation, so the order shouldn't make a difference.
87 D8 xchg ax, bx
93 xchg bx, ax
Life is too short for bad assemblers, and it should not be the thing getting in the way of golfing.
The three assemblers I would suggest are:
- nasm is the first one I would recommend to everyone (and I wish I had learned it first).
  - It fully supports 16-bit, 32-bit, and 64-bit code.
  - It can easily assemble all sorts of object formats, as well as raw binaries/.com files (a multi-step ritual with GAS).
  - It is officially supported on DOS as well as all modern OSes.
  - While it doesn't support C macros, it has a god-tier preprocessor that is much better than C.
  - The way it handles local labels is really nice.
  - Good error messages, mostly fairly beginner-friendly pointing you in the direction of why an instruction isn't allowed or what it doesn't like about a source line.
- GAS (GCC's assembler) is another fairly good assembler.
  - It is the assembler used by GCC and Clang. You might recognize it if you use Godbolt or gcc -S.
  - It supports AT&T syntax which some might prefer.
  - With .intel_syntax noprefix, you can switch to Intel syntax.
  - While its built-in preprocessor is pretty limited, it can be easily combined with a C preprocessor.
  - For x87, beware of the AT&T syntax design bug that interchanges fsubr with fsub in some cases, same for fdiv[r]. Older GAS versions applied the same swap in Intel-syntax mode, and so did older binutils objdump -d versions. (This is AT&T's fault, not GNU's, and current GAS versions do as well as possible, but it is an inherent downside in using AT&T syntax for x87.)
  - Error messages are less helpful than NASM about why an instruction is invalid.
  - Ambiguous instructions other than mov default to dword operand-size instead of being an error in AT&T syntax, such as add $123, (%rdi) assembling as addl. Clang, and GAS in Intel syntax mode, error on this.
- Clang is, in most cases, completely exchangeable for GAS.
  - It has much more helpful error messages and doesn't silently treat x86_64 registers as symbols in 32-bit mode.
  - While AT&T syntax is fully supported, Intel syntax currently has a few bugs.
    - It is the only reason I am recommending GAS over Clang. 😔
    - It still works as a solid linter.
  - It also supports C macros.
I haven't used enough of the other assemblers to give a good opinion, as I am more than satisfied with those three.
FASM is also well-regarded, using very nearly the same syntax as NASM, and is supported on https://TIO.run/ ; It's able to make 32-bit executables on TIO, unlike with other assemblers, using the directive format ELF executable 3 to emit a 32-bit ELF executable, not a .o object file that would need linking.
EuroAssembler is also open-source; its maintainer is active on Stack Overflow in the [assembly] and [x86] tags.
The FLAGS are set after many instructions
After many arithmetic instructions, the Carry Flag (unsigned) and Overflow Flag (signed) are set automatically (more info). The Sign Flag and Zero Flag are set after many arithmetic and logical operations. This can be used for conditional branching.
Example:
d1 f8 sar %eax
ZF is set by this instruction, so we can use it for conditional branching.
- When have you ever used the parity flag? You know it's the horizontal xor of the low 8 bits of the result, right? (Regardless of operand-size, PF is set only from the low 8 bits; see also). Not even-number / odd-number; for that check ZF after test al,1; you usually don't get that for free. (Or and al,1 to create an integer 0/1 depending on odd/even.) – Peter Cordes, Mar 29, 2018 at 18:46
- Anyway, if this answer said "use flags already set by other instructions to avoid test/cmp", then that would be pretty basic beginner x86, but still worth an upvote. – Peter Cordes, Mar 29, 2018 at 18:49
- @PeterCordes Huh, I seemed to have misunderstood the parity flag. I am still working on my other answer. I'll edit the answer. And as you can probably tell, I am a beginner so basic tips help. – qwr, Mar 29, 2018 at 18:55
The loop and string instructions are smaller than alternative instruction sequences. Most useful is loop <label>, which is smaller than the two-instruction sequence dec ecx and jnz <label>, and lodsb is smaller than mov al,[esi] and inc esi.
Use fastcall conventions
The x86 platform has many calling conventions. You should use those that pass parameters in registers. On x86_64, the first few parameters are passed in registers anyway, so no problem there. On 32-bit platforms, the default calling convention (cdecl) passes parameters on the stack, which is no good for golfing: accessing parameters on the stack requires long instructions.
When using fastcall on 32-bit platforms, the first 2 parameters are usually passed in ecx and edx. If your function has 3 parameters, you might consider implementing it on a 64-bit platform.
C function prototypes for fastcall convention (taken from this example answer):
extern int __fastcall SwapParity(int value); // MSVC
extern int __attribute__((fastcall)) SwapParity(int value); // GNU
Note: you can also use other calling conventions, including custom ones. I never use custom calling conventions; for any ideas related to these, see here.
- Or use a fully custom calling convention, because you're writing in pure asm, not necessarily writing code to be called from C. Returning booleans in FLAGS is often convenient. – Peter Cordes, Aug 4, 2018 at 21:32
lea for multiplications by particular small constants
A well-known trick is that lea can be used to do multiplication by 2, 3, 5, or 9 (and store in a new register) in 3 bytes, optionally adding an 8-bit constant displacement for 1 additional byte. All examples assume 32-bit mode.
For example, to calculate ebx = 9 * eax in 3 bytes:
8d 1c c0 lea ebx, [eax + 8*eax]
ebx = 9*eax + 3 in 4 bytes:
8d 5c c0 03 lea ebx, [eax + 8*eax + 3]
For multiplication by 2, it saves 1 byte compared to shl/mov:
8d 1c 00 lea ebx, [eax + eax]
89 c3 mov ebx, eax
d1 e3 shl ebx, 1
Of course, if you shift in-place, you only need shl.
Interestingly, lea multiplication by 4 or 8 isn't size efficient at all because it ends up using 0x00000000 as an absolute displacement. Another difference is that lea doesn't affect flags at all, which may or may not be a good thing depending on the flags use-case.
8d 1c 85 00 00 00 00 lea ebx, [4*eax]
89 c3 mov ebx, eax
c1 e3 02 shl ebx, 2
(GCC -Oz showed me if you are returning or starting in eax and are ok with clobbering the original register, you can save a byte over mov with xchg. That has nothing to do with shl.)
Alternatively, imul with a sign-extended 8-bit immediate multiplies by any constant from -128 to 127 in 3 bytes; the source and destination can be any registers, not just eax.
More info on x86's addressing modes: https://blog.yossarian.net/2020/06/13/How-x86_64-addresses-memory
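To see where the byte counts above come from, here is a rough Python sketch that builds the ModRM and SIB bytes for the lea forms used in this answer. The function name `encode_lea` and its parameters are invented for this illustration, and it deliberately covers only 32-bit mode with a SIB byte and a small register subset (no esp/ebp, which have special-case encodings).

```python
# Register numbers as used in ModRM/SIB fields (esp and ebp omitted on purpose).
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "esi": 6, "edi": 7}
SCALE_BITS = {1: 0, 2: 1, 4: 2, 8: 3}


def encode_lea(dst, index, scale, base=None, disp=None):
    """Encode lea dst, [base + scale*index + disp] in 32-bit mode via a SIB byte."""
    if base is None:
        # No base register: SIB base=101 with mod=00 forces a 4-byte displacement.
        mod = 0b00
        sib_base = 0b101
        disp_bytes = (disp or 0).to_bytes(4, "little", signed=True)
    elif disp is None:
        mod = 0b00
        sib_base = REG[base]
        disp_bytes = b""
    else:
        # mod=01 selects a 1-byte sign-extended displacement (-128..127).
        mod = 0b01
        sib_base = REG[base]
        disp_bytes = disp.to_bytes(1, "little", signed=True)
    modrm = (mod << 6) | (REG[dst] << 3) | 0b100  # r/m=100 means "SIB follows"
    sib = (SCALE_BITS[scale] << 6) | (REG[index] << 3) | sib_base
    return bytes([0x8D, modrm, sib]) + disp_bytes


if __name__ == "__main__":
    print(encode_lea("ebx", "eax", 8, base="eax").hex(" "))          # 9 * eax
    print(encode_lea("ebx", "eax", 8, base="eax", disp=3).hex(" "))  # 9 * eax + 3
    print(encode_lea("ebx", "eax", 4).hex(" "))                      # 4 * eax
```

The last case is the expensive one: without a base register there is no way to omit the displacement, so the scaled-index-only form always carries four extra bytes.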
- Maybe worth mentioning that lea eax, [rcx + 13] is the no-extra-prefixes version for 64-bit mode. 32-bit operand-size (for the result) and 64-bit address size (for the inputs). – Peter Cordes, Mar 30, 2018 at 17:39
To copy a 64-bit register, use push rcx ; pop rdx instead of a 3-byte mov.
The default operand-size of push/pop is 64-bit without needing a REX prefix.
51 push rcx
5a pop rdx
vs.
48 89 ca mov rdx,rcx
(An operand-size prefix can override the push/pop size to 16-bit, but 32-bit push/pop operand-size is not encodeable in 64-bit mode even with REX.W=0.)
If either or both registers are r8..r15, use mov because push and/or pop will need a REX prefix. Worst case this actually loses if both need REX prefixes. Obviously you should usually avoid r8..r15 anyway in code golf.
You can keep your source more readable while developing with this NASM macro. Just remember that it steps on the 8 bytes below RSP. (In the red-zone in x86-64 System V). But under normal conditions it's a drop-in replacement for 64-bit mov r64,r64 or mov r64, -128..127
; mov %1, %2 ; use this macro to copy 64-bit registers in 2 bytes (no REX prefix)
%macro MOVE 2
push %2
pop %1
%endmacro
Examples:
MOVE rax, rsi ; 2 bytes (push + pop)
MOVE rbp, rdx ; 2 bytes (push + pop)
mov ecx, edi ; 2 bytes. 32-bit operand size doesn't need REX prefixes
MOVE r8, r10 ; 4 bytes, don't use
mov r8, r10 ; 3 bytes, REX prefix has W=1 and the bits for reg and r/m being high
xchg eax, edi ; 1 byte (special xchg-with-accumulator opcodes)
xchg rax, rdi ; 2 bytes (REX.W + that)
xchg ecx, edx ; 2 bytes (normal xchg + modrm)
xchg rcx, rdx ; 3 bytes (normal REX + xchg + modrm)
The xchg part of the example is because sometimes you need to get a value into EAX or RAX and don't care about preserving the old copy. push/pop doesn't help you actually exchange, though.
Take advantage of the x86_64 code model
Linux's default code model will put all of your code and globals in the low 31 bits of memory, so 32-bit pointer arithmetic here is perfectly safe. The stack, libraries, and any dynamically allocated pointers are not, though. Try it online!
Make sure to still use the entire 64 bits in memory operands (including lea), because using [eax] requires a 67 prefix byte.
- The Linux x32 ABI (ILP32 in 64-bit mode) puts all addresses, including stack, in the low 32 bits of virtual address space. You can say you're targeting that if you want to add edi, 4 instead of add rdi,4. (Although in that specific example, scasd will increment RDI by 4, assuming you haven't changed DF and it doesn't segfault). You can copy a 64-bit pointer in 2 bytes with push/pop, and maybe get away with only comparing the low 32 bits (of pointers to the same array), so there's a lot you can do while still being mostly 64-bit clean. – Peter Cordes, Oct 27, 2021 at 17:27
Entry point doesn't necessarily have to be first byte of submission
I came across this answer, and didn't understand it at first until I realized that the intention is:
; ---- example calling code starts here -------------
MOV ECX, 1
CALL entry
RET
; ---- code golf answer code starts here (5 bytes) --
41 INC ECX
entry: E3 FD JECXZ SHORT $-1
91 XCHG EAX,ECX
C3 RETN
; ---- code golf answer code ends here -------------
Does not seem to conflict with any of the conditions of "Choose your calling convention" and is otherwise valid assembly language.
- I see nothing wrong with that. The reason you don't see it in other languages is because they don't allow you to. Assembly is just bytes. – EasyasPi, Jan 27, 2021 at 23:50
- You can do this in real programs using real asm / C toolchains, so it seems perfectly fine to me. From the PoV of anything else, you could look at this as the function jumping to a helper function next to it, if you're using tools that insist on treating the bytes of the function proper as the ones following the label. Does a function with instructions before the entry-point cause problems for anything? - AFAIK it's fine even in Linux shared or static libraries (maybe requiring labels and separate .size metadata for the part before the function entry) – Peter Cordes, Apr 24, 2021 at 6:58
To add or subtract 1, use the one byte inc or dec instructions which are smaller than the multibyte add and sub instructions.
- Note that 32-bit mode has 1-byte inc/dec r32 with the register number encoded in the opcode. So inc ebx is 1 byte, but inc bl is 2. Still smaller than add bl, 1 of course, for registers other than al. Also note that inc/dec leave CF unmodified, but update the other flags. – Peter Cordes, Mar 29, 2018 at 18:56
- 2 for +2 & -2 in x86 – l4m2, Mar 31, 2018 at 4:39
Use whatever calling conventions are convenient
System V x86 uses the stack and System V x86-64 uses rdi, rsi, rdx, rcx, etc. for input parameters, and rax as the return value, but it is perfectly reasonable to use your own calling convention. __fastcall uses ecx and edx as input parameters, and other compilers/OSes use their own conventions. Use the stack and whatever registers as input/output when convenient.
Example: The repetitive byte counter, using a clever calling convention for a 1 byte solution.
Meta: Writing input to registers, Writing output to registers
Other resources: Agner Fog's notes on calling conventions
- I finally got around to posting my own answer on this question about making up calling conventions, and what's reasonable vs unreasonable. – Peter Cordes, May 18, 2018 at 5:05
- @PeterCordes unrelated, what is the best way to print in x86? So far I've avoided challenges that require printing. DOS looks like it has useful interrupts for I/O but I am only planning on writing 32/64 bit answers. The only way I know of is int 0x80 which requires a bunch of setup. – qwr, May 18, 2018 at 19:23
- Yeah, int 0x80 in 32-bit code, or syscall in 64-bit code, to invoke sys_write, is the only good way. It's what I used for Extreme Fibonacci. In 64-bit code, __NR_write = 1 = STDOUT_FILENO, so you can mov eax, edi. Or if the upper bytes of EAX are zero, mov al, 4 in 32-bit code. You could also call printf or puts, I guess, and write a "x86 asm for Linux+glibc" answer. I think it's reasonable to not count the PLT or GOT entry space, or the library code itself. – Peter Cordes, May 18, 2018 at 20:02
- I'd be more inclined to have the caller pass a char *buf and produce the string in that, with manual formatting. e.g. like this (awkwardly optimized for speed) asm FizzBuzz, where I got string data into register and then stored it with mov, because the strings were short and fixed-length. – Peter Cordes, May 18, 2018 at 20:04
Try XLAT for byte memory access
XLAT is a one byte instruction that is equivalent to AL = [BX+AL]. Yes, that's right, it lets you use AL as an index register for memory access.
- Has this ever been used for golfing? I remember that I wanted to use it several times, but each time I found a shorter way to do the same. – anatolyg, May 20, 2020 at 13:38
- @anatolyg Combining with other short opcodes that operate on AL like LODSB, AAM, AAD, INT 10/21H, it could be golfy. For example, this code char bx[] = "qwertyuiop"; al = bx[ al % 10 ]; ("get an arbitrary character in a string using the rightmost digit in accumulator as index") would be 3 bytes AAM / XLAT. Used here. – 640KB, May 20, 2020 at 14:09
Use interrupts and syscalls wisely
In general, unlike the C calling conventions, most syscalls and interrupts will preserve your registers and flags unless noted otherwise, except for a return value usually in AL/EAX/RAX depending on the OS. (e.g. the x86-64 syscall instruction itself destroys RCX and R11)
Linux specific:
- If exiting with a crash is okay, int3, int1, or into can usually do the job in one byte. Don't try this on DOS though, it will lock up.
  - Note that the error messages like Segmentation fault (core dumped) or Trace/breakpoint trap are actually printed by your shell, not the program/kernel. Don't believe me? Try running set +m before your program or redirecting stderr to a file.
- You can use int 0x80 in 64-bit mode. It will use the 32-bit ABI though (eax, ebx, ecx, edx), so make sure all pointers are in the low 32 bits. On the small code model, this is true for all code stored in your binary. Keep this in mind for restricted-source.
  - Additionally, sysenter and call dword gs:0x10 can also do syscalls in 32-bit mode, although the calling convention is quite....weird for the former.
DOS/BIOS specific:
- Use int 29h instead of int 21h:02h for printing single bytes to the screen. int 29h doesn't need ah to be set and very conveniently uses al instead of dl. It writes directly to the screen, so you can't just redirect to a file, though.
- DOS also has strlen and strcmp interrupts (see this helpful page for this and other undocumented goodies)
- Unless you modified cs, don't use int 20h or int 21h:4Ch for exiting, just ret from your .com file. Alternatively, if you happen to have 0x0000 on the top of your stack, you can also ret to that.
- In the rare case that you need to call helper functions more than 4 times, consider registering them to the int1, int3, or into interrupts. Make sure to use iret instead of ret, though.
; AH: 0x25 (set interrupt)
; AL: 0x03 (INT3)
; Use 0x2504 for INTO, or 0x2501 for INT1
mov ax, 0x2503
; DX: address to function (make sure to use IRET instead of RET)
mov dx, my_interrupt_func
int 0x21
; Now instead of this 3 byte opcode...
call my_func
; ...just do this 1 byte opcode.
int3 ; or int1, into
- restricted-source tip: Your interrupt vector table is a table of far pointers at 0000:0000 (so for example, int 21h is at 0000h:0084h).
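The "more than 4 times" threshold for the interrupt trick can be checked with a little arithmetic. The Python sketch below just tabulates total bytes for n call sites in 16-bit mode, using the instruction sizes from the snippet above (3-byte call, 3+3+2 bytes of handler setup, 1-byte int3) plus the mov reg, func / call reg alternative (3 bytes of setup, 2 bytes per call) raised in the comments on this answer. The function names are only illustrative.

```python
def direct_calls(n):
    """n plain near calls: 3 bytes each."""
    return 3 * n


def call_via_register(n):
    """mov reg, func (3 bytes) once, then 2-byte call reg per site."""
    return 3 + 2 * n


def int3_handler(n):
    """mov ax (3) + mov dx (3) + int 0x21 (2) once, then 1-byte int3 per site."""
    return 3 + 3 + 2 + n


def cheapest(n):
    """Return the name of the smallest strategy for n call sites and all costs."""
    costs = {
        "call": direct_calls(n),
        "call reg": call_via_register(n),
        "int3": int3_handler(n),
    }
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, cheapest(n))
```

Under these assumptions the interrupt handler only ties with plain calls at 4 sites and needs 6 or more to beat call reg outright, although it does not tie up a register.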
- re: repeated calls: Shorter x86 call instruction - getting a function-pointer into a register only takes 3 or 5 bytes (16 vs. 32/64-bit modes), and allows 2-byte call reg. In 32-bit mode, this comes out ahead for just two call-sites. (5 + 2x2 vs. 5x2). That does cost a register, unlike int3. – Peter Cordes, Aug 24, 2021 at 16:59
- Spending 8 bytes to set up an int3 handler in real mode breaks even if you amortize over 4 calls. (3*n = 3+3+2 + 1*n solves to n = 4.) mov reg,func / call reg in 16-bit mode amortizes to break-even + costing a register over 3 calls (3*n = 3 + 2*n). It loses to int3 at 6 or more calls (assuming you can spare a register), break even at 5 (3 + 2*n = 3+3+2 + 1*n) – Peter Cordes, Aug 24, 2021 at 17:00
Use conditional moves CMOVcc and sets SETcc
This is more a reminder to myself, but conditional set instructions exist and conditional move instructions exist on processors P6 (Pentium Pro) or newer. There are many instructions that are based on one or more of the flags set in EFLAGS.
- I've found branching is usually smaller. There are some cases where it's a natural fit, but cmov has a 2-byte opcode (0F 4x + ModR/M) so it's 3 bytes minimum. But the source is r/m32, so you can conditionally load in 3 bytes. Other than branching, setcc is useful in more cases than cmovcc. Still, consider the entire instruction set, not just baseline 386 instructions. (Although SSE2 and BMI/BMI2 instructions are so large that they're rarely useful. rorx eax, ecx, 32 is 6 bytes, longer than mov + ror. Nice for performance, not golf unless POPCNT or PDEP saves many insns) – Peter Cordes, Mar 29, 2018 at 18:33
- @PeterCordes thanks, I've added setcc. – qwr, Mar 29, 2018 at 18:36
- (BTW, "conditionally load" isn't quite accurate in my last comment. Memory-source cmov is an unconditional load, which will fault on a bad address. And then an ALU select. Not like an ARM predicated ldreq or whatever.) – Peter Cordes, Dec 24, 2022 at 7:32
Save on jmp bytes by arranging into if/then rather than if/then/else
This is certainly very basic, just thought I would post this as something to think about when golfing. As an example, consider the following straightforward code to decode a hexadecimal digit character:
cmp $'A', %al
jae .Lletter
sub $'0', %al
jmp .Lprocess
.Lletter:
sub $('A'-10), %al
.Lprocess:
movzbl %al, %eax
...
This can be shortened by two bytes by letting a "then" case fall into an "else" case:
cmp $'A', %al
jb .digit
sub $('A'-'0'-10), %eax
.digit:
sub $'0', %eax
movzbl %al, %eax
...
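As a sanity check that the two layouts compute the same thing, here is a small Python model of both. It is only a sketch of the control flow, with the 8-bit register modelled as an integer masked to one byte; the function names are invented for the illustration, and the input is assumed to be a valid uppercase hex digit, as in the assembly.

```python
def decode_if_else(ch):
    """First version: separate 'letter' and 'digit' arms joined by a jmp."""
    al = ord(ch)
    if al >= ord("A"):            # jae .Lletter
        al -= ord("A") - 10
    else:
        al -= ord("0")            # followed by jmp .Lprocess in the asm
    return al & 0xFF


def decode_fall_through(ch):
    """Second version: the letter case applies a correction, then falls into the digit case."""
    al = ord(ch)
    if al >= ord("A"):            # jb .digit not taken
        al -= ord("A") - ord("0") - 10
    al -= ord("0")                # .digit: executed on both paths
    return al & 0xFF


if __name__ == "__main__":
    for ch in "09AF":
        print(ch, decode_if_else(ch), decode_fall_through(ch))
```

The saving comes from the fact that the second subtraction is shared, so the unconditional jmp over the other arm disappears.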
- You'd often do this normally when optimizing for performance, especially when the extra sub latency on the critical path for one case isn't part of a loop-carried dependency chain (like here where each input digit is independent until merging 4-bit chunks). But I guess +1 anyway. BTW, your example has a separate missed optimization: if you're going to need a movzx at the end anyway then use sub $imm, %al not EAX to take advantage of the no-modrm 2-byte encoding of op $imm, %al. – Peter Cordes, Aug 24, 2019 at 5:05
- Also, you can eliminate the cmp by doing sub $'A'-10, %al; jae .was_alpha; add $('A'-10)-'0'. (I think I got the logic right). Note that 'A'-10 > '9' so there's no ambiguity. Subtracting the correction for a letter will wrap a decimal digit. So this is safe if we're assuming our input is valid hex, just like yours does. – Peter Cordes, Aug 24, 2019 at 5:10
(way too many) ways of zeroing a register
I was taught some of these by someone whose name I no longer remember, and I "invented" a few myself. These are the most interesting ones; possible use cases include restricted-source challenges or other bizarre stuff.
=> Zero mov:
mov reg, 0
; mov eax, 0: B800000000
=> push+pop:
push [something equal to zero]
pop reg
; push 0 / pop eax: 6A0058
; note: if you have a register equal to zero, it will be
; shorter but also equal to a mov.
=> sub from itself:
sub reg, reg
; sub eax, eax: 29C0
=> mul by zero:
imul reg, 0
; imul eax, 0: 6BC000
=> and by zero:
and reg, 0
; and eax, 0: 83E000
=> xor by itself:
xor reg, reg
; xor eax, eax: 31C0
; possibly the best way to zero an arbitrary register,
; I remembered this opcode (among others).
=> or and inc / not:
or reg, -1
inc reg ; or not reg
; or eax, -1 / inc eax: 83C8FF40
=> reset ECX:
loop $
; loop $: E2FE
=> flush EDX:
shr eax, 1
cdq
; D1E899
=> zero AL (AH = AL, AL = 0)
aam 1
; D401
=> reset AH:
aad 0
; D500
=> Read 0 from the port
mov dx, 81h
in al, dx
; 66BA8100EC
=> Reset AL
stc
setnc al
; F90F93C0
=> Use the zero descriptor from gdt:
sgdt [esp-6]
mov reg, [esp-4]
mov reg, [reg]
; with eax: 0F014424FA8B4424FC8B00
=> Read zero from the fs segment (PE exe only)
mov reg, fs:[10h]
; with eax: 64A110000000
=> The brainfuck way
inc reg
jnz $-1
; with eax: 4075FD
=> Utilize the coprocessor
fldz
fistp dword ptr [esp-4]
mov eax, [esp-4]
; D9EEDB5C24FC8B4424FC
Other possible options:
- Read zero using the builtin random number generator.
- Calculate sine from pi * n (use fmul).
There are far cooler and potentially more useful ways to do this, but I didn't come up with them, so I'm not posting them.
- If you really just want a zero, most of these are uselessly large >.<. This answer might be a better fit for this SO question: How many ways to set a register to zero?. My answer there has quite a lot of 1-instruction ways: if we allow 2 or more instructions, the possibilities become nearly endless. Some of these are interesting, though, like AAD 0 or AAM 1. (Your loop $ should probably go next to your inc / jnz loop; they're nearly the same thing.) And BTW, yes xor-zeroing is the most efficient choice. – Peter Cordes, Jan 24, 2021 at 19:17
- The moral of the story seems to be: it takes minimum two bytes to zero a register alone, so xor is as good as any. But sometimes you can combine it with other operations, like CDQ. – qwr, Sep 8, 2024 at 16:26
Use multiplication for hashing
IMUL, multiplication by an immediate signed number, is a powerful instruction which can be used for hashing.
The regular multiplication instruction hard-codes one of the input operands and the output operand to be in eax (or ax or al). This is inconvenient; it requires instructions for setup and sometimes also to save and restore eax and edx. But if one of the operands is a constant, the instruction becomes much more versatile:
- No need to load the constant into a register
- The other operand can be in any register, not only eax
- The result can be in any register, not necessarily overwriting the input!
- The result is 32-bit, not a pair of registers
- If the constant is between -128 and 127, it can be encoded by only one byte
I used this many times (I hope I can be excused for these shameless plugs: 1 2 3 ...)
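One way to find a usable multiplier is plain brute force. The Python sketch below is an illustration only, not taken from any of the linked answers: it assumes a hypothetical three-step hash of imul by an 8-bit immediate, a right shift, and a mask, and searches for a multiplier and shift that map a small key set to distinct values. The names `imul_hash` and `find_imm8_hash` are made up.

```python
MASK32 = 0xFFFFFFFF


def imul_hash(x, multiplier, shift, bits):
    """Model imul r32, r32, imm8 followed by shr and and: a small perfect-hash candidate."""
    product = (x * multiplier) & MASK32   # imul keeps only the low 32 bits
    return (product >> shift) & ((1 << bits) - 1)


def find_imm8_hash(keys, bits):
    """Search sign-extended 8-bit multipliers and shifts for a collision-free mapping."""
    for multiplier in range(-128, 128):
        for shift in range(0, 33 - bits):
            hashes = {imul_hash(k, multiplier, shift, bits) for k in keys}
            if len(hashes) == len(keys):
                return multiplier, shift
    return None  # no imm8 multiplier works for this key set


if __name__ == "__main__":
    keys = [ord(c) for c in "ACGT"]
    print(find_imm8_hash(keys, bits=2))
```

If the search fails for a given key set, widening the multiplier to a 32-bit immediate costs three more bytes in the instruction but opens up a much larger search space.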
Avoid registers which need prefixes
Quite a simple tip I haven't seen mentioned before.
Avoid r8-r15 (as well as dil, sil, bpl, and spl) on x86_64 like the plague. Even just thinking about these registers requires an extra REX prefix. The only exception is if you are using them exclusively for 64-bit arithmetic (which also needs REX prefixes). Even still, you are usually better off using a low register since some operations can be done using the implicit zero extension.
Note that this tip also applies to ARM Thumb-2.
Additionally, be careful when using 16-bit registers in 32/64-bit mode (as well as 32-bit registers in 16-bit mode, but this is rare), as these need a prefix byte as well.
However, unlike the extra x86_64 registers, 16-bit instructions can be useful: Many instructions which would otherwise need a full 32-bit immediate argument will only use a 16-bit argument. So, if you were to bitwise and eax by 0xfffff00f, and ax, 0xf00f would be smaller.
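The reason the 16-bit form is a safe substitute in that example is that the upper half of the mask is all ones, and a 16-bit operation leaves bits 16..31 of the register untouched. A quick Python model (helper names are made up for illustration) shows the equivalence; in 32-bit mode the encodings are 25 + imm32 (5 bytes) versus 66 25 + imm16 (4 bytes).

```python
MASK32 = 0xFFFFFFFF


def and_eax_imm32(eax, imm32):
    """Model and eax, imm32."""
    return eax & imm32 & MASK32


def and_ax_imm16(eax, imm16):
    """Model and ax, imm16: only the low 16 bits change, the upper 16 are preserved."""
    return (eax & 0xFFFF0000) | (eax & imm16 & 0xFFFF)


if __name__ == "__main__":
    for eax in (0x12345678, 0xFFFFFFFF, 0x0000ABCD):
        print(hex(and_eax_imm32(eax, 0xFFFFF00F)), hex(and_ax_imm16(eax, 0xF00F)))
```

One caveat: the register contents match, but the flags are computed from the 16-bit result, so SF and ZF can differ from the 32-bit version if the code branches on them afterwards.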
- Related: my silly answer "Use x86 instead of x86-64 if you can" – qwr, Sep 8, 2024 at 16:18
"Free bypass": If you already have an instruction with an immediate on a register that you only care about the low part of, making the immediate longer than necessary could allow you to insert into its high part other instructions that can be jumped to (but don't execute when coming from before). This works because of little-endianness. Example; another example.
Use 32-bit x86 instead of 64-bit x86-64, if you can
This is a bit silly, but many code golf challenges only require 32-bit inputs, and 32-bit programs execute just fine on x86-64 processors. Almost all instructions work; you just lose byte access to SIL, DIL, SPL, and BPL. Assemble with nasm -f elf32 and link with ld -m elf_i386. You avoid extra bytes from REX prefixes without even thinking about it. MUL can multiply two 32-bit numbers and puts the full result in EDX:EAX, and DIV can divide a 64-bit dividend by r/m32.
...or 16-bit DOS, if you can
NASM can assemble raw COM files with -f bin. You'll need an emulator like emu2 or DOSBox to run these, so it won't run natively on a modern system, which is half the fun of x86.