I'm writing a little toy program to try to help myself better understand this language (AT&T syntax, x86_64 assembly language). Consider this code, if you'll be so kind:

.section .data
mystring: .ascii "This is my string.\0"
mystringptr: .quad mystring

When it comes time to try to access a given character from mystringptr, it takes an extra step. Whereas, if I were trying to access mystring directly, it'd be easy. Supposing I want, for some odd reason, the fifth character of the string:

.section .text
movq $mystring, %rbx
movq 5(%rbx), %rdi

I would dereference %rbx with a 5-byte offset using displacement/base pointer mode (if I'm using the right terminology -- there's a good chance I'm not), and the %rdi register will then return the ASCII table value for the character 'i' to me. I'm fine so far. But trying to get the same result from mystringptr is more difficult. I'm finding I have to do this:

.section .text
movq $mystringptr, %rbx
movq (%rbx), %rax
movq 5(%rax), %rdi

Which gets me the same result. I suppose it makes sense that given the extra abstraction layer with the pointer, that there is likewise another dereference involved in accessing the string through that pointer, but I'm just not intuitively grasping it. Can anyone walk me through what's happening here?

With movq $mystringptr, %rbx (or, alternatively, leaq mystringptr, %rbx), I'm moving a pointer to mystringptr into %rbx, right? And then with the next statement I'm dereferencing %rbx (which would normally retrieve the contents at that address, right?) and moving those contents into %rax. This is where I'm getting lost. I guess I just want to understand what's really happening, beneath the hood so to speak, with this code. It's great that it works, but I want to understand it.


Editor's footnotes about code details, leaving just the conceptual part of the question in need of answering:

.asciz "This is my string." is the usual way to write a 0-terminated C-style string in Unix assemblers like GAS, but \0 does work, since this style of assembler does process C-style escapes inside quotes.

lea mystring(%rip), %rbx (general case) or mov $mystring, %ebx (only non-PIE or largeaddressaware:no) are the standard ways to put a pointer to a label into a register in x86-64.

Also related: How to load a single byte from address in assembly; movq mem, reg loads 8 bytes, whereas movzbl mem, reg loads 1 byte and zero-extends it.

Peter Cordes
asked Oct 18 at 15:16
  • If you know C (or C++) then mystring is similar to char *mystring = "This is my string.";, and mystringptr is char **mystringptr = &mystring;. In assembly mystring is a pointer to the first character in the string, and mystringptr is a pointer to mystring. Commented Oct 18 at 16:19
  • @Someprogrammerdude: No, mystring is static char mystring[] = "This is my string."; not a pointer to anonymous rodata. Or would be if it was .ascii "..." or .asciz "..." instead of just a bare string literal after the label, which is a syntax error since it's not a valid instruction mnemonic. Commented Oct 18 at 18:24
  • I fixed the missing .ascii, and edited in a footnote about other improvements to the details of the code. That's not what the question was asking about at all, which is why I went for this unusual approach instead of writing an answer. I would normally just have commented, but since I was editing anyway... I ended up putting my additions at the end, rather than mixed in with the original paragraphs, since not all of them fit well mixed in. Anyway, I don't love the result, not something I'm going to do every time. Commented Oct 18 at 18:46

2 Answers


mystring: .ascii "This is my string.\0"

When this is assembled, two things happen: 1) the assembler assigns a location for the string, let's say 01000h, and stores the byte sequence there in the image; 2) it creates a constant named mystring with the value 01000h in its symbol table.

movq $mystring, %rbx

This generates a movq instruction, which puts the value 01000h (an immediate constant, as the $ indicates) into rbx.

movq 5(%rbx), %rdi

This generates a movq instruction, which loads the content (indicated by the parentheses) of memory at the address rbx contains (01000h) plus 5, i.e. address 01005h, into rdi. (Note that this loads a qword, so it is not actually what you want for characters.)

Anyway.

mystringptr: .quad mystring

This creates a variable in the image, which contains the address of mystring, so 01000h again. The address of this variable could be 01050h, for example.

movq $mystringptr, %rbx

This then loads the address of the mystringptr variable (01050h) into rbx.

movq (%rbx), %rax

This now loads the address of mystring, obtained from the content of the mystringptr variable, into rax, and then the element is loaded in the same way as before.
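The walk-through above can be mimicked in C with a byte array standing in for the image. This is only a sketch: the array `image`, the helpers `build_image`, `load_byte` and `load_quad`, and the two function names are made up for illustration, and 0x1000 and 0x1050 are the example addresses from this answer rather than anything a real linker would pick. It uses a byte load for the character, as the note about qword loads suggests.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flat "image". The two addresses are the made-up values
   from the explanation above, not real link addresses. */
enum { MYSTRING = 0x1000, MYSTRINGPTR = 0x1050, IMAGE_SIZE = 0x2000 };

static uint8_t image[IMAGE_SIZE];

/* Roughly what the two data lines ask the assembler/linker to do. */
void build_image(void)
{
    static const char text[] = "This is my string.";  /* sizeof includes the \0 */
    uint64_t addr = MYSTRING;

    memcpy(&image[MYSTRING], text, sizeof text);       /* mystring: .ascii "..." */
    memcpy(&image[MYSTRINGPTR], &addr, sizeof addr);   /* mystringptr: .quad mystring */
}

/* Load one byte from a simulated address. */
uint8_t load_byte(uint64_t address)
{
    return image[address];
}

/* Load eight bytes (a quad word) from a simulated address. */
uint64_t load_quad(uint64_t address)
{
    uint64_t value;
    memcpy(&value, &image[address], sizeof value);
    return value;
}

/* movq $mystring, %rbx, then a byte load from 5(%rbx) */
uint8_t char_via_label(void)
{
    uint64_t rbx = MYSTRING;        /* immediate: the symbol's value, no data load */
    return load_byte(rbx + 5);      /* the only data load */
}

/* movq $mystringptr, %rbx ; movq (%rbx), %rax, then a byte load from 5(%rax) */
uint8_t char_via_pointer(void)
{
    uint64_t rbx = MYSTRINGPTR;     /* immediate: the symbol's value, no data load */
    uint64_t rax = load_quad(rbx);  /* first load: fetches the stored 0x1000 */
    return load_byte(rax + 5);      /* second load: fetches the character */
}
```

After calling `build_image()`, both functions return the same byte. The pointer version simply needs one more load, because the string's address is itself stored in memory rather than being a constant known to the symbol table.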

Peter Cordes
answered Oct 18 at 16:23

6 Comments

Just for the record, this explanation is a simplification. The assembler just creates a relocatable reference to a certain offset in the .data section. It's only the linker that chooses an absolute address like 0x402000 (the default in a Linux non-PIE executable, like if you link this with bare ld or with gcc -no-pie). In a PIE, 01000h might be the image base address, but when executed the kernel would offset it to a random base address, and the dynamic linker would fix up any 64-bit absolute addresses. This code uses movq $symbol, %reg, so the address is a 32-bit sign-extended immediate and it can't be linked into a PIE.
Anyway, objdump -drwC foo.o on the assembler output will show the relocations with their placeholder addresses. objdump -drwC foo after ld -o foo foo.o will show the final absolute addresses.
BTW, x86-64 supports addressing modes like mov mystringptr(%rip), %rbx to load 8 bytes from memory at that address, without first putting the address into a register with mov-immediate. I guess that's an extra complication that would confuse the issue vs. using x86-64 like a RISC where you construct addresses in registers first like the OP is doing, only using addressing modes which involve a register and small offset.
Thanks for this. Good explanation. Though, inevitably, I still have a question. So, the value in mystring is the character array and the value in mystringptr is the address of that character array? This is the explanation that works for me, and yet... the author of this book keeps saying that when I define a label with an array of values using something like .quad or .ascii, that the label is not really holding the array itself, but an address that marks the first element in that array. But, if this were the case, then wouldn't mystring and mystringptr both hold that same address?
Scratch that. Maybe I understand. So when I define a label like this, the assembler or linker is storing the value I specify in memory while also placing the address of the value in its symbol table (that address being able to be referenced by the label name). Thinking of these two components is helping me, I think. So when I define mystringptr, the same two components are in play, only now the value placed in memory is the address of mystring and a second, different address is placed in the symbol table to be referenced with mystringptr. Is this the idea, in a basic sense?
Yes, you got it.

That construction creates a simple data structure in memory: a string and a pointer to it; both are stored in memory.

The advantage of such a data structure is that you can write code to change where the pointer points, perhaps to another byte within the same string, or perhaps to another string entirely. This can be useful, for example, to allow some code to work with different strings, while allowing some other code to change which string is worked with.
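As a rough C illustration of that split, assuming a second string `otherstring` and the helper names `char_at` and `select_string` (none of which are in the question), one function reads through the pointer while another decides where it points:

```c
#include <stddef.h>

static char mystring[] = "This is my string.";
static char otherstring[] = "Another string.";  /* hypothetical second string */
static char *mystringptr = mystring;            /* like mystringptr: .quad mystring */

/* Consumer: only ever goes through the pointer, never names a string. */
char char_at(size_t index)
{
    return mystringptr[index];
}

/* Producer: changes which string (or which part of one) the consumer sees. */
void select_string(char *s)
{
    mystringptr = s;
}
```

For example, `select_string(otherstring)` makes `char_at(0)` read from the other string, and `select_string(mystring + 8)` makes it start in the middle of the original one, all without touching `char_at`.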

As with most in-memory data structures, some extra dereferencing is required to get from what you start with to what you're interested in.

So, if you want to start with the pointer, and reach a byte from the string,

  1. you have to fetch the value held within the in-memory pointer

    • this is accessing the value held by a global variable, and will require a read from memory
    • hence it is technically a dereference, but not one that would be seen in C — in C we would simply mention the global variable in an expression, and when executed, that mention is replaced with the global variable's value.
  2. next, using that pointer value, dereference that value to get bytes from the string

    • this dereference would be seen in C using the * indirection/dereference operator (or else the [] indexing operator)
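The two steps above might look like this in C. This is a sketch using the declarations suggested in the comments on the question, and the function names are invented for illustration. The comments mark which step is an implicit load and which one is visible in the source.

```c
static char mystring[] = "This is my string.";
static char *mystringptr = mystring;

/* Start from the pointer: two memory reads. */
char char_via_pointer(void)
{
    char *p = mystringptr;  /* step 1: read the global's value; a load, but no * in the source */
    return *(p + 5);        /* step 2: explicit dereference; p[5] means the same thing */
}

/* Start from the array: one memory read, since the address of mystring
   is a constant the toolchain already knows. */
char char_via_array(void)
{
    return mystring[5];
}
```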

How to accomplish this in assembly depends, of course, on the CPU ISA and the memory model for the program (where global variables are stored and how they are accessed). First you need a way to access the global variable, then a way to dereference the pointer value obtained from it.

On some other (older, CISC) processors, this might all be accomplished in a single machine code instruction using a complex addressing mode, whereas on a RISC machine like RISC-V or MIPS, this might take several instructions: first to construct the address of the global variable, then to dereference that to get the value of the global variable, and finally the true dereference to get a byte of interest from the string.


Note the sizes of the data types involved: we need a quad-word load to fetch a 64-bit pointer, whereas fetching a byte from a string should use a byte load of some kind rather than a quad-word load.
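To make the size difference concrete, here is a small C sketch (function names are hypothetical) contrasting an 8-byte read at offset 5 with a 1-byte read. The first is what a quad-word load such as movq 5(%rbx), %rdi does; the second corresponds to a zero-extending byte load such as movzbl.

```c
#include <stdint.h>
#include <string.h>

static const char mystring[] = "This is my string.";

/* Quad-word load at offset 5: copies the 8 bytes "is my st", not one character. */
uint64_t load_qword_at5(void)
{
    uint64_t v;
    memcpy(&v, mystring + 5, sizeof v);
    return v;
}

/* Byte load at offset 5, zero-extended to 32 bits: just the character 'i'. */
uint32_t load_byte_at5(void)
{
    return (uint8_t)mystring[5];
}
```

On a little-endian machine such as x86-64, the low byte of the quad-word result happens to be 'i', which is why the question's code can appear to work if only the low byte of the register is examined.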

answered Oct 18 at 17:11
