Assembler for CPU

Question 1

I recently put together an assembler for a CPU I designed. I'm looking for feedback on my program structure, formatting, or anything else. I'm self-taught on all of this, so I don't have opportunities for it to be reviewed. I'm also new to Python programming, so if anything else doesn't look right, please set me on the right track.

Assembler.py

import sys
from tables import *

symbols = {}
# Memory map
symbols["IN0"] = 0xfff8
symbols["IN1"] = 0xfff9
symbols["OUT0"] = 0xfffa
symbols["OUT1"] = 0xfffb

def to_bin(n, bits):
    n = bin(n & 2**bits-1)[2:]
    return "{:0>{}}".format(n, bits)

def reg(s):
    n = int(s[1:])
    return to_bin(n, 3)

def value(s):
    n = 0
    if s[0].isdigit():
        if s[:2] == "0b":
            n = int(s[2:], 2)
        elif s[:2] == "0x":
            n = int(s[2:], 16)
        else:
            n = int(s)
    else:
        if s in symbols:
            n = symbols[s] # get address
        else:
            print("Error: undefined symbol \"{}\"".format(s))
            quit()
    return n

with open("Programs/"+sys.argv[1], "r") as fileIn, \
        open("Programs/"+sys.argv[1]+".asm", "w+") as fileOut:
    print("### First Pass - Mapping Symbols to Addresses ###")
    address = 0
    for lineNum, line in enumerate(fileIn, start = 1):
        tokens = line.split("#")[0].split()
        if not tokens:
            continue # skip empty lines
        if tokens[0][-1] == ":": # found symbol
            if tokens[0][:-1] in symbols:
                print("Error: duplicate symbol \"{}\" on line {}".format(tokens[0], lineNum))
                quit()
            else:
                symbols[tokens[0][:-1]] = address # add symbol to dictionary
            del tokens[0]
        if tokens:
            address += 2 if tokens[0] == "movi" else 1
    fileIn.seek(0)
    print("...Done\n")
    print("### Second Pass - Translating into machine code ###")
    address = 0
    for lineNum, line in enumerate(fileIn, start = 1):
        tokens = line.split("#")[0].split(":")[-1].split() # remove comments and symbols
        if not tokens:
            continue # skip empty lines
        asm = ""
        if tokens[0] in RRR and len(tokens) == 4:
            asm = "000" + reg(tokens[1]) + reg(tokens[2]) + reg(tokens[3]) + RRR[tokens[0]]
        elif tokens[0] in RRI and len(tokens) == 4:
            asm = RRI[tokens[0]] + reg(tokens[1]) + reg(tokens[2]) + to_bin(value(tokens[3]), 7)
        elif tokens[0] in RI and len(tokens) == 3:
            asm = RI[tokens[0]] + reg(tokens[1]) + to_bin(value(tokens[2]), 10)
        elif tokens[0] in JMP and len(tokens) == 3:
            asm = JMP[tokens[0]] + reg(tokens[1]) + to_bin(value(tokens[2]) - address, 10)
        elif tokens[0] == "nop" and len(tokens) == 1:
            asm = "0"*16
        elif tokens[0] == "halt" and len(tokens) == 1:
            asm = JMP["brfl"] + "0"*13
        elif tokens[0] == "movi" and len(tokens) == 2:
            asm = RI["lui"] + reg(tokens[1]) + to_bin(value(tokens[2]) >> 6 & 0x3ff, 10)
            fileOut.write("{:04x}".format(int(asm, 2)) + "\n")
            address += 1
            asm = RRI["addi"] + reg(tokens[1]) + reg(tokens[1]) + to_bin(value(tokens[2]) & 0x3f, 7)
        elif tokens[0] == ".fill" and len(tokens) == 2:
            asm = to_bin(value(tokens[1]), 16)
        else:
            print("Error: invalid instruction on line {}".format(lineNum))
            quit()
        address += 1
        fileOut.write("{:04x}".format(int(asm, 2)) + "\n")
    print("...Done\n")
print("Assembling finished")

Tables.py

# Categorized opcodes based on number of arguments and type
# RRR, RRI, RI, JMP
RRR = {
 "add" : "0000",
 "sub" : "0001",
 "nor" : "0010",
 "and" : "0011",
 "ior" : "0100",
 "eor" : "0101",
 "shl" : "0110",
 "shr" : "0111",
 "eql" : "1000",
 "neq" : "1001",
 "gtr" : "1010",
 "lss" : "1011",
 "mul" : "1100",
 "mulu" : "1101",
 "div" : "1110",
 "mod" : "1111"
}
RRI = {
 "addi" : "001",
 "jalr" : "010",
 "lwm" : "011",
 "swm" : "100"
}
RI = {
 "lui" : "101"
}
JMP = {
 "brtr" : "110",
 "brfl" : "111"
}

Example multiplication program

start: lwm r1 r0 numA
 lwm r1 r1 0
 lwm r2 r0 numB
 lwm r2 r2 0
 addi r3 r0 0 # r3 = 0
 addi r7 r0 1 # r7 = 1
 brfl r0 enter # enter loop
doAdd: add r3 r3 r1 # r3 += A
loop: shl r1 r1 r7 # r1 << 1
enter: and r6 r2 r7 # r6 = r2 & 1
 shr r2 r2 r7 # r2 >> 1
 brtr r6 doAdd # was B odd?
 brtr r2 loop
 lwm r1 r0 prod
 swm r3 r1 0
 brfl r0 start
numA: .fill IN0
numB: .fill IN1
prod: .fill OUT0

Question 2

It looks like you are writing a text file, not a binary one. Was this intentional? Usually the output of an assembler is executable machine code, which would not contain newlines and readable characters (like a .bin file, if it had to have an extension).

Question 3

@RonBeyer It's for a CPU that I designed in a simulator. I wanted the output to be formatted as a hex text file so I can copy and paste programs into the simulator.

Question 4

Some simplifications

You can initialize the symbols dictionary in a single statement:

# Memory map
symbols = {
 "IN0": 0xfff8,
 "IN1": 0xfff9,
 "OUT0": 0xfffa,
 "OUT1": 0xfffb,
}

The format template string, when applied to a number, can take a base specifier. So '{:b}'.format(x) returns essentially the same thing as bin(x), minus the '0b' prefix. You can thus turn to_bin into:
```
def to_bin(n, bits):
    return "{:0>{}b}".format(n & 2**bits-1, bits)
```
As for applying the bitmask to limit the length of the output, you can also cut the string afterwards:
```
def to_bin(n, bits):
    return "{:0>{}b}".format(n, bits)[-bits:]
```
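I find it somewhat clearer about what is going on, but it might be slower. You’ll need to time it if it ever turns out to be an issue. Be aware that the two versions only agree for non-negative n: a negative value (such as a backward branch offset) is formatted with a minus sign, so keep the bitmask wherever two's complement output is expected.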
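The branch instructions in the question encode `value(...) - address`, which is negative for a backward jump, so it is worth checking how each variant behaves there. The following is a small sketch; the names `to_bin_mask` and `to_bin_slice` are made up here only to tell the two versions apart:

```python
def to_bin_mask(n, bits):
    """Mask first, then format: negative numbers wrap to two's complement."""
    return "{:0>{}b}".format(n & 2**bits - 1, bits)


def to_bin_slice(n, bits):
    """Format first, then keep the last `bits` characters."""
    return "{:0>{}b}".format(n, bits)[-bits:]


# Both agree on non-negative values, including ones too wide for the field.
assert to_bin_mask(5, 3) == to_bin_slice(5, 3) == "101"
assert to_bin_mask(9, 3) == to_bin_slice(9, 3) == "001"

# A backward branch offset of -3 in a 10-bit field.
print(to_bin_mask(-3, 10))   # 1111111101
print(to_bin_slice(-3, 10))  # 0000000-11
```

The sliced variant keeps the minus sign inside the string, which `int(asm, 2)` would later reject, so it is only a drop-in replacement for fields that never go negative.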
When formatting with a template like '{:<xxx>}', if <xxx> does not contain any other parameter, it might be clearer to use the format function directly. Combined with the fact that the print function can write to files, you can turn:
```
fileOut.write("{:04x}".format(int(asm, 2)) + "\n")
```
into
```
print(format(int(asm, 2), '04x'), file=fileOut)
```
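To see that the two forms produce the same text, here is a short sketch that writes to an in-memory buffer instead of a real file. The `io.StringIO` buffer and the two 16-bit strings are illustrative stand-ins, not values taken from the question's program:

```python
import io

# An in-memory text buffer stands in for the real output file.
file_out = io.StringIO()

for asm in ("0000000000001111", "1010010000000001"):
    # format(number, "04x") pads to four lowercase hex digits;
    # print() appends the newline itself.
    print(format(int(asm, 2), "04x"), file=file_out)

hex_text = file_out.getvalue()
print(hex_text, end="")
```

Each call emits one four-digit hex word per line, which matches the copy-and-paste format the simulator expects.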
You can use the "magic" base 0 of the int function to let Python automatically "guess" the base of your number:
```
>>> int('0b101', 0)
5
>>> int('0x1f', 0)
31
>>> int('42', 0)
42
```
Note, however, that Python can't disambiguate between octal and decimal if the string contains only digits but starts with a '0':
```
>>> int('0644', 0)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 0: '0644'
>>> int('0o644', 0)
420
>>> int('644', 0)
644
```
That limitation may not apply to you, so you could simplify value to:
```
def value(s):
    try:
        return int(s, 0)
    except ValueError:
        try:
            return symbols[s] # get address
        except KeyError:
            sys.exit("Error: undefined symbol \"{}\"".format(s))
```
A few other improvements here: EAFP makes the intent more direct (let's convert this value into an integer; it doesn't work? let's look up its address; still doesn't work? then give up). Also, use sys.exit instead of quit, which should only be used within an interactive interpreter. sys.exit has the advantage that, if passed a string as parameter, it prints it to stderr and exits with a non-zero status code. The same improvement can be made to the "invalid instruction" error near the end.
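The behaviour of `sys.exit` with a string can be checked without killing the interpreter, because it works by raising `SystemExit`. In this sketch, passing `symbols` as a parameter is an assumption made to keep the example self-contained; the reviewed code uses a module-level dictionary instead:

```python
import sys


def value(s, symbols):
    """Return the integer for a literal, or the address of a known symbol."""
    try:
        return int(s, 0)              # handles 42, 0x2a, 0b101010
    except ValueError:
        try:
            return symbols[s]         # not a number: treat it as a label
        except KeyError:
            # sys.exit raises SystemExit; if nothing catches it, the string
            # is printed to stderr and the exit status becomes 1.
            sys.exit('Error: undefined symbol "{}"'.format(s))


symbols = {"IN0": 0xfff8}
assert value("0x1f", symbols) == 31
assert value("IN0", symbols) == 0xfff8

try:
    value("missing", symbols)
except SystemExit as exc:
    caught = exc.code                 # the message passed to sys.exit
    print(caught)
```

Catching `SystemExit` like this is mainly useful in tests; in the assembler itself the exception would simply be left to propagate.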

You appear to have duplicated code to extract out comments and empty lines from your input file. Why not extract this behaviour into a function instead? This will allow you to avoid the call to seek too. And to avoid filling up the memory with the whole file at once, let's write a generator instead:

def filter_out_comments(filename):
    with open(filename) as f:
        for line_num, line in enumerate(f, start=1):
            tokens = line.split("#")[0].split()
            if tokens:
                yield tokens, line_num

And use it like:

with open("Programs/"+sys.argv[1]+".asm", "w+") as fileOut:
    print("### First Pass - Mapping Symbols to Addresses ###")
    address = 0
    for tokens, line_num in filter_out_comments("Programs/" + sys.argv[1]):
        if tokens[0][-1] == ":": # found symbol
            ...
    print("### Second Pass - Translating into machine code ###")
    address = 0
    for tokens, line_num in filter_out_comments("Programs/" + sys.argv[1]):
        asm = ""
        if tokens[0] in RRR and len(tokens) == 4:
            ...
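One possible variation, offered as an assumption rather than as part of the suggestion above: the original second pass also strips a leading label, so the generator could split the label off once and hand both passes the same shape. The name `iter_statements` and the choice to accept any iterable of lines (which makes it testable without a file) are hypothetical:

```python
import io


def iter_statements(lines):
    """Yield (line_num, label, tokens) for each non-empty source line.

    `label` is None when the line does not start with `name:`.
    """
    for line_num, line in enumerate(lines, start=1):
        tokens = line.split("#")[0].split()   # drop comment, split on blanks
        if not tokens:
            continue                          # blank or comment-only line
        label = None
        if tokens[0].endswith(":"):
            label = tokens.pop(0)[:-1]        # remove the trailing colon
        yield line_num, label, tokens


source = io.StringIO(
    "# multiply two inputs\n"
    "start: lwm r1 r0 numA\n"
    "\n"
    "       addi r3 r0 0   # r3 = 0\n"
    "numA:  .fill IN0\n"
)
statements = list(iter_statements(source))
for statement in statements:
    print(statement)
```

An open file object can be passed in place of the `StringIO` buffer, since both iterate line by line.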

Some improvements

Instead of leaving code at the top level of the file, you should wrap it in a function. It lets you test and reuse it more easily. You should also make use of the if __name__ == '__main__': idiom:

def compile_asm(filename):
    with open(filename + ".asm", "w+") as fileOut:
        print("### First Pass - Mapping Symbols to Addresses ###")
        address = 0
        for tokens, line_num in filter_out_comments(filename):
            if tokens[0][-1] == ":": # found symbol
                ...
        print("### Second Pass - Translating into machine code ###")
        address = 0
        for tokens, line_num in filter_out_comments(filename):
            asm = ""
            if tokens[0] in RRR and len(tokens) == 4:
                ...

if __name__ == '__main__':
    compile_asm("Programs/" + sys.argv[1])

Second, you should document your code a bit more, especially when sharing it like this, as it is sometimes unclear why you do things the way you do. It makes sense eventually, but it would be easier to understand with a few comments and some docstrings.

And, lastly, follow PEP 8, the official coding style, if you want your code to look like Python code.

One pass algorithm

There might not be a real need to perform two passes over the input file. Whenever a symbol cannot be resolved, store it in a dictionary as a key, with a list of every line where this symbol was encountered as the associated value. Use dict.setdefault(symbol, []) for that. This may require modifying value so that unresolved symbols don't terminate the program but you can still tell they don't exist. dict.get(key) might help here, as it returns the value associated with the key if it exists in the dictionary, or None if it doesn't.

Whenever a new symbol is discovered, check whether it exists in this dictionary and patch each line accordingly. Then delete it from the dictionary. If the dictionary is not empty at the end, you had unresolved symbols...

For it to work, though, you may need to store at least the incomplete lines and all that follows in memory. Depending on your needs, it is a tradeoff that may or may not be acceptable.
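The backpatching idea can be sketched on a deliberately reduced language. Everything here is an assumption made for illustration: the function name `assemble_words`, the restriction to `[label:] .fill operand` statements, and the use of whole words as patch targets. A real version would have to patch a bit field inside an instruction, and this sketch does not check for duplicate labels:

```python
def assemble_words(lines, symbols=None):
    """One-pass sketch: every statement is `[label:] .fill operand`."""
    symbols = dict(symbols or {})
    words = []        # output, one integer per statement
    pending = {}      # symbol -> indexes in `words` waiting for its address

    for line in lines:
        tokens = line.split("#")[0].split()
        if not tokens:
            continue
        if tokens[0].endswith(":"):
            name = tokens.pop(0)[:-1]
            symbols[name] = len(words)
            # Patch every earlier word that referred to this label,
            # then forget about it.
            for index in pending.pop(name, []):
                words[index] = symbols[name]
        if not tokens:
            continue                          # label on a line of its own
        operand = tokens[1]
        try:
            words.append(int(operand, 0))
        except ValueError:
            address = symbols.get(operand)    # None if not seen yet
            if address is None:
                pending.setdefault(operand, []).append(len(words))
                address = 0                   # placeholder until patched
            words.append(address)

    if pending:
        raise ValueError("undefined symbols: " + ", ".join(sorted(pending)))
    return words


program = [
    "first: .fill last    # forward reference, patched later",
    "       .fill 0x10",
    "last:  .fill first   # backward reference, resolved at once",
]
print(assemble_words(program))  # [2, 16, 0]
```

The `words` list is the in-memory copy of the output that makes patching possible, which is the tradeoff described above.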

score 7 · Accepted Answer · 2016-11-10 20:29:22Z

