Building upon my previous question, I have come up with a method that'll search through a buffer and return a pointer to the end of the line. The primary purpose of the function to extract rows from a csv file using the logic specified in RFC4180; the third parameter isQuotedSequence
can be used in cases where parsing is parallelized or started in the middle of a string that is known to be quoted.
Notes:
- the excessive amount of comments are for people who may not know the language but still want to try and understand what the code is doing
- registers are renamed primarily to aid in refactoring but also enable some rudimentary static analysis to be performed
- the
isQuotedSequence
variable is placed in therdx
register in order to mimic operations likemul
that return a 128-bit result; I'm not sure if this is actually usable by anything other than ASM or supported by theWindows x64
orSystem V
ABIs though...
Anything info that could be used to make the function safer or more performant would be most useful; thanks for your review!
**Code:**
COMMENT @
C Interface:
extern char* ReadLine(const char* bufferOffset, const char* bufferTail, long long isQuotedSequence);
Reference:
https://www.rfc-editor.org/rfc/rfc4180
@
;-----------------------------; (CONSTANTS)
CARRIAGE_RETURN = 00000000Dh
DOUBLE_QUOTE = 000000022h
LINE_FEED = 00000000Ah
TRUE = 000000001h
;-----------------------------; (ARGUMENTS)
arg0 textequ <rcx>
arg1 textequ <rdx>
arg2 textequ <r8>
;-----------------------------; (LOCALS)
bufferOffset textequ <rax>
bufferTail textequ <r9>
currentCharacter textequ <ecx>
isQuotedSequence textequ <rdx>
nextCharacter textequ <r8d>
.code
ReadLine proc
mov bufferOffset, arg0 ; initialize [bufferOffset]
mov bufferTail, arg1 ; initialize [bufferTail]
mov isQuotedSequence, arg2 ; initialize [isQuotedSequence]
cmp bufferOffset, bufferTail ; validate that there are more characters to read
jge ReadLine@Return ; if end of file reached, jump to Return
movzx nextCharacter, byte ptr[bufferOffset] ; extract [nextCharacter] from [bufferOffset]
ReadLine@NextChar:
mov currentCharacter, nextCharacter ; shift [nextCharacter] into [currentCharacter]
add bufferOffset, 1h ; increment [bufferOffset]
cmp bufferOffset, bufferTail ; validate that there are more characters to read
jge ReadLine@Return ; if end of file reached, jump to Return
movzx nextCharacter, byte ptr[bufferOffset] ; extract [nextCharacter] from [bufferOffset]
cmp currentCharacter, DOUBLE_QUOTE ; compare [currentCharacter] to QUOTE_DOUBLE
jz ReadLine@HitDoubleQuote ; if equal, jump to HitDoubleQuote
cmp currentCharacter, CARRIAGE_RETURN ; compare [currentCharacter] to CARRIAGE_RETURN
jz ReadLine@HitCarriageReturn ; if equal, jump to HitCarriageReturn
cmp currentCharacter, LINE_FEED ; compare [currentCharacter] to LINE_FEED
jz ReadLine@HitLineFeed ; if equal, jump to HitLineFeed
jmp ReadLine@NextChar ; jump to NextChar
ReadLine@HitDoubleQuote:
xor isQuotedSequence, TRUE ; invert [isQuotedSequence] indicator
jmp ReadLine@NextChar ; jump to NextChar
ReadLine@HitCarriageReturn:
cmp nextCharacter, LINE_FEED ; compare [nextCharacter] to LINE_FEED
jz ReadLine@NextChar ; if equal, jump to NextChar
ReadLine@HitLineFeed:
cmp isQuotedSequence, TRUE ; compare [isQuotedSequence] to TRUE
jz ReadLine@NextChar ; if equal, jump to NextChar
ReadLine@Return:
ret ; return to caller
ReadLine endp
end
2 Answers 2
Ok, old question. But no accepted answer and I've got some thoughts, so...
My first thought on reading this code was to suggest you re-arrange your compares, so that you can "drop through" rather than doing an additional jump. If you move this to the end of your compare block:
cmp currentCharacter, DOUBLE_QUOTE ; compare [currentCharacter] to QUOTE_DOUBLE
jnz ReadLine@NextChar ; <-- Note: jnz
You can get rid of that extra jmp
, dropping straight into HitDoubleQuote.
But upon reflection, it might be better to leave it where it is, and add an additional jmp.
Consider the string Goodbye cruel world.\n
How many compares and how many jumps instructions does this take with your current code? Excluding the bufferTail, there's 3 cmp, 3 jz, and 1 jmp for every character. That's 63 cmp, 63 jz, and 62 jmp (including the end of line, which doesn't jmp).
Now, what if right after cmp currentCharacter, DOUBLE_QUOTE
you add jg ReadLine@NextChar
? Since jg doesn't affect the flags, there would be no need to do an extra cmp, so you're just adding one instruction. And since the "largest" value you're interested in parsing is the DOUBLE_QUOTE (0x22), any value greater than that and you already know what to do.
And look what happens to the number of cmp/jz/jmp.
cmp:21+3*2 = 27, jz: 21+3*3 = 30, jmp: 2
To clarify: We'd only go past the jg 3 times: twice for the spaces (0x20), and once for eol (0xa). So, while tabs, spaces, lf, cr, double quotes and exclamation marks (0x21) would all become 1 instruction more expensive due to the extra jg, all the letters and numbers become 5 instructions cheaper.
Worth doing? Probably not now that you likely haven't looked at the code in the last 5 months. But something to think about for next time.
Note: You could still take advantage of the drop thru. Something like this:
cmp currentCharacter, DOUBLE_QUOTE ; compare [currentCharacter] to QUOTE_DOUBLE
jg ReadLine@NextChar ; not DOUBLE_QUOTE or control code
jz ReadLine@HitDoubleQuote ; if equal, jump to HitDoubleQuote
cmp currentCharacter, LINE_FEED ; compare [currentCharacter] to LINE_FEED
jz ReadLine@HitLineFeed ; if equal, jump to HitLineFeed
cmp currentCharacter, CARRIAGE_RETURN ; compare [currentCharacter] to CARRIAGE_RETURN
jnz ReadLine@NextChar ; if not CR, get next character
ReadLine@HitCarriageReturn:
cmp nextCharacter, LINE_FEED ; compare [nextCharacter] to LINE_FEED
jz ReadLine@NextChar ; if equal, jump to NextChar
ReadLine@HitLineFeed:
cmp isQuotedSequence, TRUE ; compare [isQuotedSequence] to TRUE
jz ReadLine@NextChar ; if equal, jump to NextChar
ReadLine@Return:
ret ; return to caller
ReadLine@HitDoubleQuote:
xor isQuotedSequence, TRUE ; invert [isQuotedSequence] indicator
jmp ReadLine@NextChar ; jump to NextChar
ReadLine endp
Notice how the jmp
is gone after the compares?
What else? 1201ProgramAlarm already mentioned using inc
vs add
. That's one byte shorter. However you could also use lea rax, [rax + 1]
. While it's not shorter than add, it doesn't use the flags register (which both add and inc do). This might ease the CPU's pipelining.
Similarly, xor edx
is only 3 bytes, compared to xor rdx
which is 4. Since you're only using 1 bit, I don't see any need to use rdx. Similarly, why use r8d/movzx instead of just 'mov r8b, [rax]`? Again, it's a byte shorter.
I note that there is no facility provided for error returns. What if you are inside a quoted string when you hit bufferTail? You want to be able to start "mid string," but you don't return a value indicating that you need to do so? Maybe return NULL in this case? And I guess the caller can check if the return value is bufferTail to detect a missing cr/lf.
I get what you are saying about 128 bit returns. But I don't see a practical way to do that here. Such being the case, you're incurring a penalty (of swapping registers around) for a benefit you cannot (or at least do not) achieve.
From an ease of use point of view, I might be tempted to change the call to:
extern char* ReadLine(const char* bufferOffset, size_t length, bool sQuotedSequence);
Obviously I haven't seen the code for the caller, but keeping track of lengths might be a bit easier for people to grasp than "a pointer to 1 byte after your string." Even better might be to indicate that NUL (0x0) is a buffer terminator and may not be embedded in the string. Then you don't need either bufferTail or length, you can just check for currentCharacter being 0.
You apologize for:
the excessive amount of comments
I'm not sure there is such a thing in asm. They add nothing at all to the execution time of the code, and greatly reduce the maintenance time. In fact, adding comments about how you check for missing cr/lf or how mismatched quoted strings are handled might be a good idea. On the other hand, if the function ever gets changed, the maintainer has to go thru them all and make sure they reflect the new logic or you end up with comments that are WRONG, which can be worse than no comments.
A few stylistic nits: I might add more blank lines to make things easier to read. And I'm pretty sure using "proc" means that labels are local to the function. So the "ReadLine@" (to avoid conflict with names in other functions?) may be redundant. And as the guy said last time, I'd probably go with je
rather than jz
. Functionally they're identical, but conceptually you're (J)umping depending on whether currentCharacter is (E)qual to DOUBLE_QUOTE. And you have several comments that read "compare [nextCharacter]". Putting the brackets means you're going to read the value at the address pointed to by nextCharacter. But nextCharacter isn't a pointer, it's the actual value. They're comments, but still.
And one last thought: Unless this was a homework project for an asm class, why write this in asm? Unless you know what "instruction fusing" and "pipelining" mean and what causes "stalling" and a dozen other esoteric concepts, squeezing the max perf out of assembler is really hard. The people who write c/c++ compilers are all completely bonkers, but they do understand this stuff and have decades worth of "tricks" they can apply. As a result, well-structured c code can actually result in smaller code and faster execution times than asm written by us mere mortals.
Edit: Rolling my (non-stylistic) comments in:
;-----------------------------; (CONSTANTS)
CARRIAGE_RETURN = 00Dh
DOUBLE_QUOTE = 022h
LINE_FEED = 00Ah
arg0 textequ <rcx>
bufferOffset textequ <rax>
bufferTail textequ <rdx>
currentCharacter textequ <cl>
isQuotedSequence textequ <r8d>
nextCharacter textequ <r9b>
.code
Rfc4180_ReadRow proc
mov bufferOffset, arg0 ; initialize bufferOffset
cmp bufferOffset, bufferTail ; validate that there are more characters to read
jae Rfc4180_ReadRow@Return ; if end of file reached, jump to Return
mov nextchar, byte ptr[bufferOffset] ; extract nextCharacter from [bufferOffset]
; todo try adding: .align xx
Rfc4180_ReadRow@NextChar:
mov currentCharacter, nextchar ; shift nextCharacter into currentCharacter
inc bufferOffset ; increment bufferOffset
; todo maybe replace inc: lea bufferOffset, [bufferOffset + 1]
cmp bufferOffset, bufferTail ; validate that there are more characters to read
jae Rfc4180_ReadRow@Return ; if end of file reached, jump to Return
mov nextchar, byte ptr[bufferOffset] ; extract nextCharacter from [bufferOffset]
cmp currentCharacter, DOUBLE_QUOTE ; compare currentCharacter to QUOTE_DOUBLE
; todo maybe add: jg Rfc4180_ReadRow@NextChar
jz Rfc4180_ReadRow@HitDoubleQuote ; if equal, jump to HitDoubleQuote
cmp currentCharacter, LINE_FEED ; compare currentCharacter to LINE_FEED
jz Rfc4180_ReadRow@HitLineFeed ; if equal, jump to HitLineFeed
cmp currentCharacter, CARRIAGE_RETURN ; compare currentCharacter to CARRIAGE_RETURN
jnz Rfc4180_ReadRow@NextChar ; if not CARRIAGE_RETURN, NextChar
Rfc4180_ReadRow@HitCarriageReturn:
cmp nextchar, LINE_FEED ; compare nextCharacter to LINE_FEED
jz Rfc4180_ReadRow@NextChar ; if equal, jump to NextChar
Rfc4180_ReadRow@HitLineFeed:
test isQuotedSequence, isQuotedSequence ; see if isQuotedSequence is set
jnz Rfc4180_ReadRow@NextChar ; if set, jump to NextChar
Rfc4180_ReadRow@Return:
ret ; return to caller
Rfc4180_ReadRow@HitDoubleQuote:
xor isQuotedSequence, 1 ; invert isQuotedSequence indicator
jmp Rfc4180_ReadRow@NextChar ; jump to NextChar
Rfc4180_ReadRow endp
end
Note: I haven't even assembled this let alone validated or perf tested it. Still, shows you what I'm thinking. While I think it's likely to be better, that can't be known until someone times it against real data. See the 3 todo
s for more things to try.
-
\$\begingroup\$
Worth doing? Probably not now that you likely haven't looked at the code in the last 5 months. But something to think about for next time.
One would then probably be surprised at how important it is to me to get this 'right'; pipelining, micro-op fusion, port pressure, instruction latency, etc are all metrics that I care deeply about and have been heavily researching throughout the development of this code. Why? Another project of mine emits and later directly executes x64 bytecode. The belated advice is much appreciated! \$\endgroup\$Kittoes0124– Kittoes01242019年08月21日 02:02:53 +00:00Commented Aug 21, 2019 at 2:02 -
\$\begingroup\$ If you're at all interested: the latest revision can be found here. I'm fairly busy for the next couple of weeks but will definitely find time to test and incorporate many of the changes you've suggested (including stuff related to comments, I found them very sensible). Other stuff like
jz
vsje
is unlikely to change because this code is primarily just for myself and I find it way easier to collapse the two concepts into a single instruction than mentally tracking the meaning of two instructions; bonkers, I know. \$\endgroup\$Kittoes0124– Kittoes01242019年08月21日 02:19:27 +00:00Commented Aug 21, 2019 at 2:19 -
1\$\begingroup\$ Ahh. Then you've probably already bumped in to everything I mentioned that was useful. If you are into pipeline analysis, the two best things I know to point you at are IACA and agner fog. If you want to compare the latency of
movzx
vsmov
, agner's guide has all the numbers, and Intel's tool can (statically) analyze your code and show the computed latency numbers and fusions. \$\endgroup\$David Wohlferd– David Wohlferd2019年08月21日 02:21:06 +00:00Commented Aug 21, 2019 at 2:21 -
\$\begingroup\$ I literally have Agner bookmarked as "The Old Testament"; though one doesn't pretend to have mastered the teachings contained within... IACA also guided me to the current layout of the code. I don't at all remember if the instruction/structure changes you suggested were already tested but am more than happy to try again when I get the chance. \$\endgroup\$Kittoes0124– Kittoes01242019年08月21日 02:29:32 +00:00Commented Aug 21, 2019 at 2:29
-
\$\begingroup\$ Looking at your current code, some of my (non-stylistic) suggestions still apply. If this really is performance critical, you should give it another look. \$\endgroup\$David Wohlferd– David Wohlferd2019年08月21日 02:36:33 +00:00Commented Aug 21, 2019 at 2:36
You're treating comparison of addresses as signed values, instead of unsigned. For example, after you execute cmp bufferOffset, bufferTail
, you use the jge
condition, which is jump greater or equal. This is intended for signed values. You should use jae
jump above or equal instruction instead. This condition needs to be changed in several places.
Rather than add bufferOffset, 1h
, just use inc bufferOffset
.
From a readability perspective, having all those leading zeros on you constants makes the actual value a little harder to read, and implies that they could potentially have some value that would be that large. Since most of those values are character codes, two digits should suffice.
-
\$\begingroup\$ Regarding consistency with carriage return, this is to handle all possible cases of valid line endings: \r\n, \r, \n. In other words, the line feed after a carriage return is optional (thanks to varying OS behavior). \$\endgroup\$Kittoes0124– Kittoes01242019年03月11日 19:05:37 +00:00Commented Mar 11, 2019 at 19:05
-
\$\begingroup\$ @Kittoes0124 I got confused by the
nextCharacter
check. I've removed that paragraph from the review. \$\endgroup\$1201ProgramAlarm– 1201ProgramAlarm2019年03月11日 19:24:38 +00:00Commented Mar 11, 2019 at 19:24