I recently put together an assembler for a CPU I designed. I'm looking for feedback on my program structure, formatting, or anything else. I'm self-taught on all of this, so I don't have opportunities to get my code reviewed. I'm also new to Python programming, so if anything else doesn't look right, please set me on the right track.
Assembler.py
import sys
from tables import *
symbols = {}
# Memory map
symbols["IN0"] = 0xfff8
symbols["IN1"] = 0xfff9
symbols["OUT0"] = 0xfffa
symbols["OUT1"] = 0xfffb
def to_bin(n, bits):
    n = bin(n & 2**bits-1)[2:]
    return "{:0>{}}".format(n, bits)

def reg(s):
    n = int(s[1:])
    return to_bin(n, 3)

def value(s):
    n = 0
    if s[0].isdigit():
        if s[:2] == "0b":
            n = int(s[2:], 2)
        elif s[:2] == "0x":
            n = int(s[2:], 16)
        else:
            n = int(s)
    else:
        if s in symbols:
            n = symbols[s] # get address
        else:
            print("Error: undefined symbol \"{}\"".format(s))
            quit()
    return n
with open("Programs/"+sys.argv[1], "r") as fileIn, \
     open("Programs/"+sys.argv[1]+".asm", "w+") as fileOut:
    print("### First Pass - Mapping Symbols to Addresses ###")
    address = 0
    for lineNum, line in enumerate(fileIn, start = 1):
        tokens = line.split("#")[0].split()
        if not tokens:
            continue # skip empty lines
        if tokens[0][-1] == ":": # found symbol
            if tokens[0][:-1] in symbols:
                print("Error: duplicate symbol \"{}\" on line {}".format(tokens[0], lineNum))
                quit()
            else:
                symbols[tokens[0][:-1]] = address # add symbol to dictionary
                del tokens[0]
        if tokens:
            address += 2 if tokens[0] == "movi" else 1
    fileIn.seek(0)
    print("...Done\n")
    print("### Second Pass - Translating into machine code ###")
    address = 0
    for lineNum, line in enumerate(fileIn, start = 1):
        tokens = line.split("#")[0].split(":")[-1].split() # remove comments and symbols
        if not tokens:
            continue # skip empty lines
        asm = ""
        if tokens[0] in RRR and len(tokens) == 4:
            asm = "000" + reg(tokens[1]) + reg(tokens[2]) + reg(tokens[3]) + RRR[tokens[0]]
        elif tokens[0] in RRI and len(tokens) == 4:
            asm = RRI[tokens[0]] + reg(tokens[1]) + reg(tokens[2]) + to_bin(value(tokens[3]), 7)
        elif tokens[0] in RI and len(tokens) == 3:
            asm = RI[tokens[0]] + reg(tokens[1]) + to_bin(value(tokens[2]), 10)
        elif tokens[0] in JMP and len(tokens) == 3:
            asm = JMP[tokens[0]] + reg(tokens[1]) + to_bin(value(tokens[2]) - address, 10)
        elif tokens[0] == "nop" and len(tokens) == 1:
            asm = "0"*16
        elif tokens[0] == "halt" and len(tokens) == 1:
            asm = JMP["brfl"] + "0"*13
        elif tokens[0] == "movi" and len(tokens) == 2:
            asm = RI["lui"] + reg(tokens[1]) + to_bin(value(tokens[2]) >> 6 & 0x3ff, 10)
            fileOut.write("{:04x}".format(int(asm, 2)) + "\n")
            address += 1
            asm = RRI["addi"] + reg(tokens[1]) + reg(tokens[1]) + to_bin(value(tokens[2]) & 0x3f, 7)
        elif tokens[0] == ".fill" and len(tokens) == 2:
            asm = to_bin(value(tokens[1]), 16)
        else:
            print("Error: invalid instruction on line {}".format(lineNum))
            quit()
        address += 1
        fileOut.write("{:04x}".format(int(asm, 2)) + "\n")
    print("...Done\n")
print("Assembling finished")
Tables.py
# Categorized opcodes based on number of arguments and type
# RRR, RRI, RI, JMP
RRR = {
    "add" : "0000",
    "sub" : "0001",
    "nor" : "0010",
    "and" : "0011",
    "ior" : "0100",
    "eor" : "0101",
    "shl" : "0110",
    "shr" : "0111",
    "eql" : "1000",
    "neq" : "1001",
    "gtr" : "1010",
    "lss" : "1011",
    "mul" : "1100",
    "mulu" : "1101",
    "div" : "1110",
    "mod" : "1111"
}
RRI = {
    "addi" : "001",
    "jalr" : "010",
    "lwm" : "011",
    "swm" : "100"
}
RI = {
    "lui" : "101"
}
JMP = {
    "brtr" : "110",
    "brfl" : "111"
}
Example multiplication program
start: lwm r1 r0 numA
lwm r1 r1 0
lwm r2 r0 numB
lwm r2 r2 0
addi r3 r0 0 # r3 = 0
addi r7 r0 1 # r7 = 1
brfl r0 enter # enter loop
doAdd: add r3 r3 r1 # r3 += A
loop: shl r1 r1 r7 # r1 << 1
enter: and r6 r2 r7 # r6 = r2 & 1
shr r2 r2 r7 # r2 >> 1
brtr r6 doAdd # was B odd?
brtr r2 loop
lwm r1 r0 prod
swm r3 r1 0
brfl r0 start
numA: .fill IN0
numB: .fill IN1
prod: .fill OUT0
- It looks like you are writing a text file, not a binary one; was this intentional? Usually the output of an assembler is executable machine code, which would not have newlines and readable characters in it (like a .bin file, if it had to have an extension). – Ron Beyer, Nov 10, 2016 at 18:11
- @RonBeyer It's for a CPU that I designed in a simulator. I wanted the output to be formatted as a hex text file so I can copy and paste programs into the simulator. – user121955, Nov 10, 2016 at 18:14
1 Answer
Some simplifications
You can initialize the symbols dictionary in one instruction:

    # Memory map
    symbols = {
        "IN0": 0xfff8,
        "IN1": 0xfff9,
        "OUT0": 0xfffa,
        "OUT1": 0xfffb,
    }
The format template string, when applied to a number, can take a base specifier. So '{:b}'.format(x) returns pretty much the same thing as bin(x), except without the '0b' prefix. You can thus turn to_bin into:

    def to_bin(n, bits):
        return "{:0>{}b}".format(n & 2**bits-1, bits)

As regards applying the bitmask to limit the length of the output, you can also cut the string afterwards:

    def to_bin(n, bits):
        return "{:0>{}b}".format(n, bits)[-bits:]

I find it somewhat clearer about what is going on, but it might be slower; you'll need to time it if it ever turns out to be an issue. Be aware that the two versions only agree for non-negative n: a negative value, such as a backward branch offset, is formatted with a minus sign, so the bitmask version is the one to keep for those.
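To see where the two to_bin variants agree and where they differ, here is a small sketch. The names to_bin_masked and to_bin_sliced are made up for this comparison and do not appear in the original code.

```python
def to_bin_masked(n, bits):
    """Mask first, so negative values wrap around to two's complement."""
    return "{:0>{}b}".format(n & (2**bits - 1), bits)


def to_bin_sliced(n, bits):
    """Format first, then keep only the last `bits` characters."""
    return "{:0>{}b}".format(n, bits)[-bits:]


# Non-negative inputs: both variants keep the low `bits` bits.
print(to_bin_masked(5, 3), to_bin_sliced(5, 3))
print(to_bin_masked(300, 7), to_bin_sliced(300, 7))

# Negative input (e.g. a backward branch offset): only the masked
# variant yields a valid bit string.
print(to_bin_masked(-3, 10))
print(to_bin_sliced(-3, 10))  # contains a '-' character
```

Since int(asm, 2) is later applied to the concatenated bit string, a stray minus sign in the middle of it would raise a ValueError in the second pass.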
When formatting with a template like '{:<xxx>}', if <xxx> does not contain any other parameter, it might be clearer to use the format function directly. Combine that with the fact that the print function can write to files, and you can turn:

    fileOut.write("{:04x}".format(int(asm, 2)) + "\n")

into

    print(format(int(asm, 2), '04x'), file=fileOut)
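A quick way to convince yourself that both spellings write the same bytes is to send them to in-memory files. This is only a sketch: the bit string below is an arbitrary made-up instruction, and io.StringIO stands in for the real output file.

```python
import io

asm = "0000010100111000"  # arbitrary 16-bit instruction, for illustration only

# Original spelling: build the string, then append the newline by hand.
out_write = io.StringIO()
out_write.write("{:04x}".format(int(asm, 2)) + "\n")

# Suggested spelling: format() plus print(), which adds the newline itself.
out_print = io.StringIO()
print(format(int(asm, 2), "04x"), file=out_print)

print(repr(out_write.getvalue()), repr(out_print.getvalue()))
```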
You can use the "magic" base 0 of the int function to let Python automatically "guess" the base of your number:

    >>> int('0b101', 0)
    5
    >>> int('0x1f', 0)
    31
    >>> int('42', 0)
    42

Note, however, that Python can't disambiguate between octal and decimal if the string contains only digits but starts with a '0':

    >>> int('0644', 0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for int() with base 0: '0644'
    >>> int('0o644', 0)
    420
    >>> int('644', 0)
    644

This limitation may not apply to you, so you could simplify value to:

    def value(s):
        try:
            return int(s, 0)
        except ValueError:
            try:
                return symbols[s] # get address
            except KeyError:
                sys.exit("Error: undefined symbol \"{}\"".format(s))
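The simplified value can be exercised on its own, as in the sketch below. The symbols dictionary here is a cut-down stand-in for the one in the question, and "nowhere" is a deliberately undefined name. Because sys.exit raises SystemExit, the error path can be caught and inspected instead of ending the interpreter.

```python
import sys

# Cut-down stand-in for the assembler's symbol table.
symbols = {"IN0": 0xfff8, "OUT0": 0xfffa}


def value(s):
    """Return s as an integer literal or, failing that, as a symbol address."""
    try:
        return int(s, 0)
    except ValueError:
        try:
            return symbols[s]  # get address
        except KeyError:
            sys.exit('Error: undefined symbol "{}"'.format(s))


print(value("0b101"), value("0x1f"), value("42"), value("IN0"))

# sys.exit() raises SystemExit carrying the message as its code.
try:
    value("nowhere")
except SystemExit as exc:
    print("would exit with:", exc.code)
```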
A few other improvements here: use of EAFP to make the intent more direct (let's convert this value into an integer; it doesn't work? let's pick its address; still doesn't work? then give up). And use of sys.exit instead of quit, which should only be used within an interactive interpreter. sys.exit has the advantage, if passed a string as parameter, of printing it to stderr and exiting with a non-zero status code. The same improvement can be made to the "invalid instruction" error near the end.

You appear to have duplicated code to strip comments and empty lines from your input file. Why not extract this behaviour into a function instead? This will allow you to avoid the call to seek too. And to avoid filling up the memory with the whole file at once, let's write a generator instead:

    def filter_out_comments(filename):
        with open(filename) as f:
            for line_num, line in enumerate(f, start=1):
                tokens = line.split("#")[0].split()
                if tokens:
                    yield tokens, line_num

And use it like:

    with open("Programs/"+sys.argv[1]+".asm", "w+") as fileOut:
        print("### First Pass - Mapping Symbols to Addresses ###")
        address = 0
        for tokens, line_num in filter_out_comments("Programs/" + sys.argv[1]):
            if tokens[0][-1] == ":": # found symbol
                ...
        print("### Second Pass - Translating into machine code ###")
        address = 0
        for tokens, line_num in filter_out_comments("Programs/" + sys.argv[1]):
            asm = ""
            if tokens[0] in RRR and len(tokens) == 4:
                ...

Since the generator only strips comments, the second pass still has to drop a leading label token before looking up the instruction.
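The generator can be tried on a throwaway file, as in the following sketch. The temporary file and its contents are made up for the demonstration; only filter_out_comments comes from the suggestion above.

```python
import os
import tempfile


def filter_out_comments(filename):
    """Yield (tokens, line_num) for each line that holds more than a comment."""
    with open(filename) as f:
        for line_num, line in enumerate(f, start=1):
            tokens = line.split("#")[0].split()
            if tokens:
                yield tokens, line_num


source = (
    "start: addi r3 r0 0  # r3 = 0\n"
    "\n"
    "# only a comment\n"
    "       brfl r0 start\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(source)

try:
    # Each call reopens the file, so no seek(0) is needed between passes.
    first_pass = list(filter_out_comments(tmp.name))
    second_pass = list(filter_out_comments(tmp.name))
finally:
    os.remove(tmp.name)

for tokens, line_num in first_pass:
    print(line_num, tokens)  # the label 'start:' is still the first token
```

Line numbers stay those of the source file, so error messages keep pointing at the right place even though blank and comment-only lines are skipped.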
Some improvements
Instead of leaving some code at the top-level of the file, you should wrap it into a function. It lets you test and re-use it more easily. You should also make use of the if __name__ == '__main__': idiom:

    def compile_asm(filename):
        with open(filename + ".asm", "w+") as fileOut:
            print("### First Pass - Mapping Symbols to Addresses ###")
            address = 0
            for tokens, line_num in filter_out_comments(filename):
                if tokens[0][-1] == ":": # found symbol
                    ...
            print("### Second Pass - Translating into machine code ###")
            address = 0
            for tokens, line_num in filter_out_comments(filename):
                asm = ""
                if tokens[0] in RRR and len(tokens) == 4:
                    ...

    if __name__ == '__main__':
        compile_asm("Programs/" + sys.argv[1])
Second, you should document your code a bit more, especially when sharing it like that, as it is sometimes unclear why you are doing things the way you do. It makes sense eventually, but it would be easier to understand with a few comments and some docstrings.
And, lastly, follow PEP8, the official coding style, if you want your code to look like Python code.
One pass algorithm
There might not be a real need to perform 2 passes over the input file. Whenever a symbol cannot be resolved, store it in a dictionary as a key, with a list of every line where this symbol was encountered as its value. Use dict.setdefault(symbol, []) for that. This may require that you modify value so that an unresolved symbol doesn't terminate the program but is reported to the caller instead. dict.get(key) might help here, as it returns the value associated with the key if it exists in the dictionary, or None if it doesn't.
Whenever a new symbol is discovered, check whether it exists in this dictionary and patch each line accordingly. Then delete it from the dictionary. If the dictionary is not empty at the end, you had unresolved symbols...
For it to work, though, you may need to store at least the incomplete lines and all that follows in memory. Depending on your needs, it is a tradeoff that may or may not be acceptable.
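The back-patching idea can be sketched on a deliberately reduced input. The function below is hypothetical and is not part of the reviewed assembler. It assumes every statement is a .fill directive, so that a patch is simply the symbol's address. It also keeps all output words in memory and does not check for duplicate labels.

```python
def assemble_one_pass(lines):
    """One-pass assembly of '.fill' lines, back-patching forward references."""
    symbols = {}   # label -> address
    pending = {}   # unresolved label -> indexes of words waiting for it
    words = []     # assembled words; None marks a hole to patch later

    for line in lines:
        tokens = line.split("#")[0].split()
        if not tokens:
            continue
        if tokens[0].endswith(":"):
            label = tokens.pop(0)[:-1]
            symbols[label] = len(words)
            # Patch every earlier word that was waiting for this label.
            for index in pending.pop(label, []):
                words[index] = symbols[label]
        if not tokens:
            continue  # the line held only a label
        operand = tokens[1]  # tokens[0] is assumed to be ".fill"
        try:
            words.append(int(operand, 0) & 0xFFFF)
        except ValueError:
            address = symbols.get(operand)  # None if not defined yet
            if address is None:
                pending.setdefault(operand, []).append(len(words))
            words.append(address)

    if pending:
        raise ValueError("undefined symbols: " + ", ".join(sorted(pending)))
    return words


program = [
    "a: .fill b      # forward reference, patched once b is seen",
    "   .fill 0x10",
    "b: .fill a      # backward reference, resolved immediately",
    "   .fill b",
]
print(assemble_one_pass(program))
```

With real instructions, the patch step would have to re-encode the immediate field rather than overwrite the whole word, and for branches it would need the address of the patched instruction to compute the relative offset. Storing that information next to each pending index is one way to do it.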