As my applications rely on VT-100 emulation, I found that I needed quite a few workarounds to determine the number of visible characters in a string with embedded VT-100 control codes. That led to the design of this 73-byte snippet.
The function would be called as follows:
xor edx, edx ; ARG1 - DX = 0 Search for first NULL.
mov ecx, edx ; ARG2 - Not more than ECX chars (0 = no limit).
mov esi, SignOn ; ARG0 - Pointer to ASCII string.
call FindChr ; Length of string returned in RAX.
with the text at 0x40017c:
SignOn db 27, '[2J', 27, '[1;37m', 27, '[2;000H'
db 'Horizons Buisness ', 27, '[42m& Personal Finance'
db 27, '[40;39m', 0
which returns
RAX = 0x24 = 36
RCX = 0xffffffbc = negated (0x44 = 68)
RSI = 0x4001c0
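As a cross-check of the figures above, the arithmetic can be sketched in C. The helper names `neg32` and `bytes_consumed` are made up for this illustration and are not part of the routine: negating the returned ECX gives 0x44, which is also the distance between the returned RSI and the start of the text.

```c
#include <stdint.h>

/* Two's-complement negation of a 32-bit register value, as NEG ECX would do.
 * Hypothetical helper, for illustration only. */
uint32_t neg32(uint32_t ecx)
{
    return (uint32_t)0 - ecx;
}

/* Distance between the pointer returned in RSI and the pointer passed in.
 * Hypothetical helper, for illustration only. */
uint64_t bytes_consumed(uint64_t rsi_out, uint64_t rsi_in)
{
    return rsi_out - rsi_in;
}
```

Calling `neg32(0xffffffbc)` and `bytes_consumed(0x4001c0, 0x40017c)` should both give 68 for the example shown.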
Assembled using NASM 2.13.02
; =============================================================================
; Search for the character defined in DL, but not past the one in DH, for a
; maximum count of ECX. VT-100 control codes are ignored, but they still count
; toward the total buffer size specified in ECX.
; ENTER: RSI = Pointer to ASCII string.
; ECX = Maximum iterations (if 0 then 4,294,967,295).
; DL = unsigned char to match.
; DH = not beyond this character (terminator).
; LEAVE: EAX = Total printable characters.
; RSI = Points to match char in DL or RSI+1 if DH.
; = If SF (overflow) end of buffer + 1.
; ECX = Bytes remaining
; = If SF -1 (0xffffffff)
; DX = Unchanged.
; Total length of string can be derived from either
; negating ECX if ECX was zero on entry
; or ECX - original ECX - 1
; FLAGS: ZF = Match found (DL)
; NZ = Match found (DH)
; JS = ECX exhausted (overflow)
ESC equ 27 ; 0x1b o033 11011b
; -------------------------------------------------------------------------
FindChr:
push rbx
xor ebx, ebx ; Set initial character count.
; As this is essentially a do -> while (maxCount >= 0), a non-zero value
; in ECX must be zero-indexed so that the proper number of characters
; is evaluated.
test ecx, ecx
jz $ + 4
dec ecx ; Zero index max buffer count.
.L0: mov al, [rsi]
cmp al, ESC ; Test for beginning of VT-100 code first.
jnz .J0
; ----------
; If the next character is left bracket, then it is a VT-100 code.
cmp byte [rsi+1], '['
jnz .J0 ; NZ = Consider ESC a printable char
.l0: dec ecx
jz .done - 2
lodsb
cmp al, 'A'
jae .j0
; There are two codes that are not terminated with an alphabetic char
; that are used to select fonts.
cmp al, '('
jz .L0
cmp al, ')'
jnz .l0 ; Must be still inside control code
; This works as we are only concerned with alphabetic chars.
.j0: and al, 0x5f ; Convert to upper case
cmp al, 'Z'
jbe .L0 ; CY or ZF we are at end of VT-100 code.
jmp .l0 ; Continue scanning inside VT-100 code
; ----------
.J0: cmp al, dl ; Is char we are looking for?
jz .done ; ZF = Stay at char just read and exit
inc rsi ; Bump pointer to next character
cmp al, dh ; Is it delimiter
jnz .J1
test esi, esi ; So NZ is set.
jmp .done ; Return with SF.
.J1: inc ebx ; Bump count of printable characters
dec ecx ; and maximum buffer count
jnz .L0
dec ecx ; Sets sign flag
.done:
mov eax, ebx ; Return char count in RAX
pop rbx
ret
NOTE: The need for a pointer to the first printable character arose, so I've made this amendment.
.J1: inc ebx ; Bump count of printable characters
dec ecx ; and maximum buffer count
jz .done - 2
; This will set a pointer to first printable character in string
test rdi, rdi
jnz .L0
; Update pointer to address just before current RSI
mov rdi, rsi
dec rdi
jmp .L0
- Why did you decide to write this code in assembler, instead of the simpler C? – Roland Illig, Nov 10, 2019 at 17:36
- @RolandIllig Simpler is a subjective term insofar as, if you understand C, then it may very well be, but even then C will never come in as tight as assembly, probably in no small part due to the System V ABI. In this case, two of the values would need to have been passed by reference. Secondly, because the calculation or comparison has already been done, many of my functions assert flags that the caller can use, eliminating redundant comparisons. My primary objective is algorithmic flow that packs as much computing into the least amount of code. In assembly I need only understand the instruction set. – Shift_Left, Nov 10, 2019 at 19:48
1 Answer
Your code is a nightmare.
This is mainly because all the jump labels have very similar names: L0, l0, j0, J0, J1. What the hell do they mean? Please use more descriptive names for them.
The assembly code you are using does not make use of any advanced features of machine language. Instead, only the variable names get worse. It's far easier for a human reader to guess what `str` means than to guess what `esi` means. As another example, `needle` and `terminator` are much more intuitive than `dl` and `dh`.
You happily mix 32-bit registers (`ecx`) with 64-bit registers (`rsi`). This is confusing.
A single example string is not enough to make an exhaustive test suite. You need at least enough examples to cover each path of the code.
There is no `JS` flag in the x86 instruction set.
There is no guarantee near `Return with SF` that `test esi, esi` really sets the sign flag.
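To make the last point concrete, here is a small C model of what `test esi, esi` does to the two flags in question. The function names are invented for this sketch. SF is simply a copy of bit 31 of the low half of the pointer, so for an address such as the 0x4001c0 from the question it ends up clear, not set.

```c
#include <stdbool.h>
#include <stdint.h>

/* SF after `test esi, esi`: a copy of bit 31 of the low 32 bits of RSI.
 * Hypothetical helper, for illustration only. */
bool sf_after_test_esi(uint64_t rsi)
{
    uint32_t esi = (uint32_t)rsi;
    return (esi >> 31) != 0;
}

/* ZF after `test esi, esi`: set only when the low 32 bits of RSI are zero.
 * Hypothetical helper, for illustration only. */
bool zf_after_test_esi(uint64_t rsi)
{
    return (uint32_t)rsi == 0;
}
```

So the instruction only yields "NZ" as long as the low 32 bits of the pointer are non-zero, and it yields SF only if the string happens to live at an address with bit 31 set.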
I rewrote your code in C, just for fun, and it became half the size. It's possible to cut that code down by another 50% by using higher-level control structures instead of only `goto`. Most of the saving comes from just deleting comments that describe on the assembler level what the C code can express directly.
Basically what you wrote is just a little state machine with a couple of special cases. Nothing you would really need assembler for. Especially not since the VT100 escape sequences are so closely coupled with I/O that it's usually not worth squeezing the last bit of performance out of this code.
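To illustrate the state-machine view, here is a minimal sketch in C. It is not a translation of the routine above. `visible_width` and the state names are invented for this example, it assumes a NUL-terminated string, it treats only `ESC [ ... <letter>` as a control sequence, and it leaves out the needle, terminator, maximum count and the font-selection codes.

```c
#include <stddef.h>

enum vt_state { TEXT, SAW_ESC, IN_CSI };

/* Count the bytes of a NUL-terminated string that are not part of an
 * ESC '[' ... <letter> sequence. Hypothetical helper, not the original API. */
size_t visible_width(const char *s)
{
    enum vt_state state = TEXT;
    size_t width = 0;

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;

        switch (state) {
        case TEXT:
            if (c == 0x1B)
                state = SAW_ESC;    /* decide on the next byte */
            else
                width++;
            break;
        case SAW_ESC:
            if (c == '[') {
                state = IN_CSI;     /* ESC '[' starts a control sequence */
            } else if (c == 0x1B) {
                width++;            /* previous ESC was printable, this one is pending */
            } else {
                width += 2;         /* lone ESC is printable, and so is this byte */
                state = TEXT;
            }
            break;
        case IN_CSI:
            /* The first ASCII letter terminates the sequence. */
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
                state = TEXT;
            break;
        }
    }
    if (state == SAW_ESC)
        width++;                    /* trailing lone ESC */
    return width;
}
```

For the `SignOn` string from the question this counts 36 visible characters, the same value the assembly returns in RAX. An unterminated sequence such as `ESC [ 1 2` contributes nothing to the width in this sketch.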
Debugging a C program is easier as well, since you have named variables and you cannot get confused by the `edi` register, which is always visible in the debug view but not relevant for this piece of code.
Here's my try of transforming the code to C, just to give you something to play with. It's still far from good code, but at least any good IDE can rename variables to make the helpful names guide the human reader.
#include <stdbool.h>
#include <stddef.h>

struct str_width_result {
    size_t width;
    const unsigned char *end;
    size_t remaining;
    bool found;
    bool eof;
};

static struct str_width_result
str_width(const char *str, size_t str_size, char needle, char terminator)
{
    const unsigned char *rsi = (const unsigned char *)str;
    size_t ecx = str_size;
    unsigned char dl = (unsigned char)needle;
    unsigned char dh = (unsigned char)terminator;
    unsigned char ESC = 0x1B;
    size_t ebx = 0;
    unsigned char al;           // declared up front so no goto jumps past it

    if (ecx == 0) ecx--;
L0:
    al = rsi[0];                // mov al, [rsi]
    if (al != ESC) goto J0;
    if (rsi[1] != '[') goto J0;
l0:
    ecx--;
    if (ecx == 0) goto done_minus_2;
    al = *rsi++;                // lodsb
    if (al >= 'A') goto j0;
    if (al == '(') goto L0;
    if (al != ')') goto l0;     // jnz .l0: still inside the control code
j0:
    al &= 0x5f;
    if (al <= 'Z') goto L0;
    goto l0;
J0:
    if (al == dl) goto done;
    rsi++;
    if (al != dh) goto J1;      // jnz .J1
    goto done;                  // terminator reached
J1:
    ebx++;
    ecx--;
    if (ecx != 0) goto L0;
done_minus_2:
    ecx--;
done:;
    struct str_width_result result = {
        .width = ebx,
        .end = rsi,
        .remaining = ecx,
        .found = false, // TODO
        .eof = false // TODO
    };
    return result;
}
#define ESC "\033"
#include <stdio.h>

int main(void)
{
    const char str[] =
        ESC "[2J" ESC "[1;37m" ESC "[2;000H"
        "Horizon Business "
        ESC "[42m" "& Personal Finance"
        ESC "[40;39m";
    struct str_width_result result = str_width(str, sizeof str, '\0', '\0');
    printf("%zu %p\n", result.width, (void *)result.end);
}
- I definitely agree about the labels, but as the editor I'm using doesn't auto-complete, I try to stay away from more typing than I have to. On that note, I do agree: if it isn't intuitive for the person who hasn't written it, then it might just as well be written in hieroglyphs. Yes, `JS` should have been `SF`. The use of 64- vs 32-bit registers is not random. I make use of sign extension and where I know pointers are in the lower half. There is no point using `XOR RAX,RAX` when `XOR EAX,EAX` does exactly the same thing and saves 1 byte of opcode. – Shift_Left, Nov 10, 2019 at 23:53
- I am curious to see how you compiled your code to bring the object size of `str_width` to 44 bytes, as my version with the amendment added is 88. Even more interesting would be how you'd make it 50% less than that again. I'm not trying to be facetious; I've always been curious about this, as my HLL skills have not yielded similar results. – Shift_Left, Nov 11, 2019 at 0:01