Checking substring in 8086 ASM

Question 1

I have tried like this to check a substring in a mainstring in 8086. Is there any shorter way of doing this? My implementation seems lengthy.

DATA SEGMENT
STR1 DB 'MADAM'
LEN1 DW ($-STR1); storing the length of STR1
STR2 DB 'MADAA'
LEN2 DW ($-STR2); stroing the length of STR2
DATA ENDS
CODE SEGMENT
LEA SI, STR1
LEA DI, STR2
MOV DX, LEN1
MOV CX, LEN2
CMP CX, DX; comparing main & substring length
JA EXIT; if substring size is bigger than there is no chance to be found it in main string
JE SAMELENGTH; if main & sub string both have same length the we can compare them directly
JB FIND; general case (substring length < mainstring length): we can apply our main process 
SAMELENGTH:
 CLD
 REPE CMPSB
 JNE RED
 JMP GREEN
FIND: 
 MOV AL, [SI]; storing the ascii value of current character of mainstring 
 MOV AH, [DI]; storing the ascii value of current character of substring
 CMP AL,AH; comparing both character
 JE CHECK; 
 JNE NOTEQL
NOTEQL:
 INC SI; if both character don't match then we would point to the next char of main string
 DEC DX; DX keeps track of how many character of mainstring is left to process
 CMP DX, 0000H; checking if there are any character left in the main string for further comparison 
 JE RED; if no character is left in main string then obviously the substring doesn't exists in main string
 JMP FIND
CHECK:
 MOV CX, LEN2; CX is used internally for REPE CMPSB. So storing length of the substring in CX would limit the number of characters for comparison to exact length of substring.
 ; For example to compare between "madam" & "ada" we need to compare *ada* portion of main string with substring ada, no more, no less 
 MOV SP, SI; storing the index of current character of main string so if the following REPE CMPSB find mismatch then the process can be started over from the next character of main string (SEE line 1 of TEMPRED) by going to TEMPRED > FIND
 ADD SP, 0001H
 CLD
 REPE CMPSB
 JNE TEMPRED
 JMP GREEN
TEMPRED:; substring not found starting from the current character of main string, but it is possible to find match if we start from next character in main string
 MOV SI,SP; going to the next character of main string (after REPE CMPSB of CHECK segment)
 DEC DX
 LEA DI, STR2; reloading substring index in DI (after REPE CMPSB of CHECK segment)
 JMP FIND; if a character matches but the following substring mismatches in main string then we start over the same process from the next character of main string by going to FIND segment 
GREEN: 
 MOV BX, 0001H; substring found
 JMP EXIT
RED: 
 MOV BX, 0000H; substring not found
 JMP EXIT
EXIT: 
 CODE ENDS
 END
 RET

Question 2

specify the assembler name/version. Is it nasm,masm,fasm,as.. ?

Question 3

There are a number of things that could be improved with this code. I hope you find these suggestions helpful.

Specify which assembler

Unlike C or Python, there are a great many variations in assembler syntax, even for the same architecture, such as the x86 of this code. Generally, it's useful to note which assembler, which target processor and which OS (if any) in the comments at the top of the file. In this case, it looked most like 16-bit TASM, so that's the compiler I used to test this code.

Use an `ASSUME` directive

The code would not assemble for me until I added an ASSUME directive. The ASSUME directive doesn't actually generate any code. It simply specifies which assumptions the assembler should make when generating the output. It also helps human readers of your code understand the intended context. In this particular case, I added this line just after the CODE SEGMENT declaration:

ASSUME CS:CODE, DS:DATA, ES:DATA

The CS and DS assumptions are obvious, but the ES assumption is less so. However, the code uses the CMPSB instruction and based on the context, this means an implicit assumption that ES also points to the DATA segment. In my case, (emulated 16-bit DOS), I had to add a few statements to the start of the code to actually load the DS and ES segment registers appropriately.

Avoid instructions outside any segment

The EXIT code currently looks like this:

EXIT: 
 CODE ENDS
 END
 RET

The problem is that the CODE ENDS closes the CODE segment and the END directive tells the assembler that there is no more code and thus the RET instruction may or may not be assembled, and may or may not actually be placed in the CODE segment. You probably meant instead to do this:

EXIT: 
 RET
 CODE ENDS
 END

Eliminate convoluted branching

Avoid needless branching. They make your code harder to read and slower to execute. For example, the code currently has this:

 JA EXIT
 JE SAMELENGTH
 JB FIND
SAMELENGTH:
 CLD
 REPE CMPSB
 JNE RED
 JMP GREEN
 ; ... code elided
GREEN: 
 MOV BX, 0001H; substring found
 JMP EXIT
RED: 
 MOV BX, 0000H; substring not found
 JMP EXIT
EXIT:

This could be very much simplified:

 JA EXIT
 JB FIND
 ; fall through to same length
SAMELENGTH:
 XOR BX,BX ; assume string not found
 CLD
 REPE CMPSB
 JNE EXIT
 INC BX ; indicate that string was found
EXIT:

There are a number of such simplifications possible with little effort.

Know your instruction set

The code currently has this set of instructions

 DEC DX
 CMP DX, 0000H
 JE RED

However, the DEC instruction already sets the Z flag, so the CMP instruction is not needed.

Use `REPNE SCASB` as appropriate

The code at the location FIND is largely the same as would have been done by using REPNE SCASB. The only difference is in which registers are used. The code you have isn't necessarily wrong, but it could probably be shorter.

Avoid using `SP` as a general register

Just after CHECK, the code saves a copy of the pointer (not an index as the comment falsely claims) to the SP register. However, SP is a stack pointer, so this code can only be used in an environment in which the stack is not used. That could be the case, but it makes the code much less portable to code it that way, especially because the AX or BX registers could just as easily have been used here.

Consider using standard length lines

The comments in the code are very long and the semicolon is right after the instruction. Neither of these things are necessarily wrong, but they are different from the usual convention which is to align the semicolon character in some column and making sure that lines are no more than 72 characters long (some use 78).

Question 4

We are determining if a substring appears in a string. This algorithm/code snippet is written for x86_64 Intel on Linux with NASM. It's quite shorter but because of the repe cmpsb, it's certainly not the fastest.

section .data
 ; the string
 string: db "this is the string we are searching in"
 stringlength: equ $-string
 ; the substring
 substring: db "string we are searching"
 substringlength: equ $-substring 
 mov rsi, string ; pointer to string in RSI
 mov rdx, stringlength ; length of string in RDX
 ; Subtract substring length to prevent looking beyond the string length,
 ; We can also check here if the substring fits in the string.
 ; If not we never can find the substring in the string
 sub rdx, substringlength
 cmp rdx, 0
 jl .@@notfound
 ; enter the compare loop
.@@repeat: 
 mov rdi, substring ; pointer to substring in RDI
 mov rcx, substringlength ; length substring in RCX (loop counter)
 cld
 repe cmpsb ; compare string at rdi with length rcx with string in rsi
 jz .@@found ; if zero flag then substring is found within string, exit loop
 ; substring is not found yet, put substring pointer at begin of substring
 dec rdx ; decrement length of string
 and rdx, rdx ; check remaining length to search in
 jnz .@@repeat ; remaining length non-zero, repeat
.@@notfound:
 ; else, substring wasn't found, exit loop
 ; substring not found actions
.@@found:
 ; substring found actions
 ; rsi has address to start of substring+the length of the substring
 ; subtracting the start of the string we can calculate the offset (or index) in the string where substring starts
 sub rsi, substringlength

Question 5

As it stands, the algorithm posted by @Agguro would not match ABC against AABC, as it always "eats up" at least substring length characters from the main string. This can be corrected as below.

section .data
 ; the string
 string: db "this is the string we are searching in"
 stringlength: equ $-string
 ; the substring
 substring: db "string we are searching"
 substringlength: equ $-substring 
 mov rsi, string ; pointer to string in RSI
 mov rdx, stringlength ; length of string in RDX
 ; Subtract substring length to prevent looking beyond the string length,
 ; We can also check here if the substring fits in the string.
 ; If not we never can find the substring in the string
 sub rdx, substringlength
 cmp rdx, 0
 jl .@@notfound
 ; enter the compare loop
.@@repeat: 
 mov rdi, substring ; pointer to substring in RDI
 mov rcx, substringlength ; length substring in RCX (loop counter)
 cld
 mov rax,rsi ; save rsi
 repe cmpsb ; compare string at rdi with length rcx with string in rsi
 jz .@@found ; if zero flag then substring is found within string, exit loop
 ; substring is not found yet, put substring pointer at begin of substring
 inc rax ; increment rax to proceed one character in string
 mov rsi,rax ; restore the modified rsi value from rax
 dec rdx ; decrement length of string
 and rdx, rdx ; check remaining length to search in
 jnz .@@repeat ; remaining length non-zero, repeat
.@@notfound:
 ; else, substring wasn't found, exit loop
 ; substring not found actions
.@@found:
 ; substring found actions
 ; rsi has address to start of substring+the length of the substring
 ; subtracting the start of the string we can calculate the offset (or index) in the string where substring starts
 sub rsi, substringlength

Not directly related: also, depending on the use cases for the function, comparing rdx (the remaining string length) with substringlength might be worth it as an optimization just above line and rdx, rdx. This is only worth it if both the string and substring are very long, but close in length. It would on average slow down the match if the string is very long, but the substring - very short.

Edward Edward 67.2k4 gold badges120 silver badges284 bronze badges · Accepted Answer · 2014-09-28 16:51:05Z

There are a number of things that could be improved with this code. I hope you find these suggestions helpful.

Specify which assembler

Unlike C or Python, there are a great many variations in assembler syntax, even for the same architecture, such as the x86 of this code. Generally, it's useful to note which assembler, which target processor and which OS (if any) in the comments at the top of the file. In this case, it looked most like 16-bit TASM, so that's the compiler I used to test this code.

Use an `ASSUME` directive

The code would not assemble for me until I added an ASSUME directive. The ASSUME directive doesn't actually generate any code. It simply specifies which assumptions the assembler should make when generating the output. It also helps human readers of your code understand the intended context. In this particular case, I added this line just after the CODE SEGMENT declaration:

ASSUME CS:CODE, DS:DATA, ES:DATA

The CS and DS assumptions are obvious, but the ES assumption is less so. However, the code uses the CMPSB instruction and based on the context, this means an implicit assumption that ES also points to the DATA segment. In my case, (emulated 16-bit DOS), I had to add a few statements to the start of the code to actually load the DS and ES segment registers appropriately.

Avoid instructions outside any segment

The EXIT code currently looks like this:

EXIT: 
 CODE ENDS
 END
 RET

The problem is that the CODE ENDS closes the CODE segment and the END directive tells the assembler that there is no more code and thus the RET instruction may or may not be assembled, and may or may not actually be placed in the CODE segment. You probably meant instead to do this:

EXIT: 
 RET
 CODE ENDS
 END

Eliminate convoluted branching

Avoid needless branching. They make your code harder to read and slower to execute. For example, the code currently has this:

 JA EXIT
 JE SAMELENGTH
 JB FIND
SAMELENGTH:
 CLD
 REPE CMPSB
 JNE RED
 JMP GREEN
 ; ... code elided
GREEN: 
 MOV BX, 0001H; substring found
 JMP EXIT
RED: 
 MOV BX, 0000H; substring not found
 JMP EXIT
EXIT:

This could be very much simplified:

 JA EXIT
 JB FIND
 ; fall through to same length
SAMELENGTH:
 XOR BX,BX ; assume string not found
 CLD
 REPE CMPSB
 JNE EXIT
 INC BX ; indicate that string was found
EXIT:

There are a number of such simplifications possible with little effort.

Know your instruction set

The code currently has this set of instructions

 DEC DX
 CMP DX, 0000H
 JE RED

However, the DEC instruction already sets the Z flag, so the CMP instruction is not needed.

Use `REPNE SCASB` as appropriate

The code at the location FIND is largely the same as would have been done by using REPNE SCASB. The only difference is in which registers are used. The code you have isn't necessarily wrong, but it could probably be shorter.

Avoid using `SP` as a general register

Just after CHECK, the code saves a copy of the pointer (not an index as the comment falsely claims) to the SP register. However, SP is a stack pointer, so this code can only be used in an environment in which the stack is not used. That could be the case, but it makes the code much less portable to code it that way, especially because the AX or BX registers could just as easily have been used here.

Consider using standard length lines

The comments in the code are very long and the semicolon is right after the instruction. Neither of these things are necessarily wrong, but they are different from the usual convention which is to align the semicolon character in some column and making sure that lines are no more than 72 characters long (some use 78).

Stack Exchange Network

Checking substring in 8086 ASM

3 Answers 3

Specify which assembler

Use an `ASSUME` directive

Avoid instructions outside any segment

Eliminate convoluted branching

Know your instruction set

Use `REPNE SCASB` as appropriate

Avoid using `SP` as a general register

Consider using standard length lines

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Checking substring in 8086 ASM

3 Answers 3

Specify which assembler

Use an ASSUME directive

Avoid instructions outside any segment

Eliminate convoluted branching

Know your instruction set

Use REPNE SCASB as appropriate

Avoid using SP as a general register

Consider using standard length lines

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Use an `ASSUME` directive

Use `REPNE SCASB` as appropriate

Avoid using `SP` as a general register