I am currently taking a compilers course where we are designing a compiler for C- (a subset of C). Our first step was the lexer. I have written it, but I believe it is not very "pythonic", and I was hoping someone could help me make it more so, as I believe that would make the later parts of this assignment far simpler and more manageable.
I will first discuss the rules of the language and then give my program.
The accepted keywords are as follows:
else if int return void while float
The special symbols are:
+ - * / < <= > >= == != = ; , ( ) [ ] { } /* */ //
Other tokens are ID, NUM (for ints) or FLOAT defined by the following regular expressions:
FLOAT = (\d+(\.\d+)?([E][+|-]?\d+)?)
ID = letter letter*
NUM = digit digit*
letter = a|...|z|A|...|Z
digit = 0|...|9
Lowercase and uppercase are distinct.
Whitespace consists of blanks, newlines, and tabs. Whitespace is ignored except that it must separate IDs, NUMs, FLOATs, and keywords.
Comments are surrounded by the C notations /* ... */ and //, and CAN (don't know why) be nested.
The program will read in a C- file and output each line followed by every ID, keyword, NUM, and FLOAT in the order that they appear, as well as outputting every special symbol. (Comments are ignored and so is whitespace. Anything that is invalid is to be displayed as an error, and the program then resumes as normal.) The program does not determine whether the input is valid; it simply breaks it up into tokens.
Sample input:
/**/ /*/* */ */
/*/*/****This**********/*/ */
/**************/
/*************************
i = 333; ******************/ */
iiii = 3@33;
int g 4 cd (int u, int v) {
Sample output:
INPUT: /**/ /*/* */ */
INPUT: /*/*/****This**********/*/ */
INPUT: /**************/
INPUT: /*************************
INPUT: i = 333; ******************/ */
*
/
INPUT: iiii = 3@33;
ID: iiii
=
NUM: 3
Error: @33
;
INPUT: int g 4 cd (int u, int v) {
keyword: int
ID: g
NUM: 4
ID: cd
(
keyword: int
ID: u
,
keyword: int
ID: v
)
{
I am currently running through line-by-line and then character by character and building up the tokens but I feel like there is a much more straight forward way of doing it.
I would like to be able to just read the line in, break it up and then check each item to see what it is.
from sys import argv
import re
keyword = ['else', 'if', 'int', 'while', 'return', 'void', 'float']
oper = ['+', '-', '*', '/', '=', '<', '>', '<=', '>=', '==', '!=']
delim = ['\t','\n',',',';','(',')','{','}','[',']', ' ']
num = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
p = re.compile(r'(\d+(\.\d+)?([E][+|-]?\d+)?)')
scripty, filename = argv
#file = open(filename)
comment_count = 0
line_comment = 0
is_comment = False
i = 0
iden = "" #null string for identifiers to be built up
print_list = []
end_comment = False #This is a bool value for a block comment
float_str = ""
def is_keyword(kw):
    if kw in keyword:
        return True
    return False

def is_delim(char):
    if char in delim:
        return True
    return False

def which_delim(char):
    if char in delim:
        if char != '\t' and char != '\n' and char != ' ':
            print char

def is_digit(char):
    if char in num:
        return True
    return False

def is_char(char):
    c = 0
    c = ord(char)
    if c >= 97 and c <= 122:
        return True
    return False

def is_oper(char):
    if char in oper:
        return True
    return False

def is_num(str):
    try:
        int(str)
        return True
    except:
        return False

def is_float(str):
    m = p.match(str)
    length = len(str)
    if m and length == len(m.group(0)):
        print "FLOAT: %s" %m.group(0)
        return True
    else:
        return False
for line in open(filename):
    if line != '\n':
        print "Input: %s" % (line),
        while line[i] != '\n': #i and enumerate allows to iterate through line
            if line[i] is '/':
                if line[i + 1] is '/' and comment_count is 0: # it's a line comment print it out
                    line_comment += 1
                elif line[i + 1] is '*':
                    i += 1
                    comment_count += 1
            elif (line[i] is '*') and (line[i+1] is '/') and comment_count > 0:
                comment_count -= 1
                i += 1
                if comment_count == 0:
                    end_comment = True
            if comment_count is 0 and line_comment is 0 and end_comment == False:
                if is_digit(line[i]): #check for float
                    j = i
                    while not is_delim(line[j]):
                        float_str += line[j]
                        j += 1
                    if is_float(float_str):
                        if(j < len(line)):
                            i = j
                        iden = ''
                    float_str = '' #reset string at end use
                if is_char(line[i]) or is_digit(line[i]) and not is_oper(line[i]):
                    iden += line[i]
                if is_delim(line[i]) and iden == '': #for delims w/ blank space
                    which_delim(line[i])
                if is_oper(line[i]) and iden is '':
                    temp = line[i] + line[i + 1]
                    if(is_oper(temp)):
                        print temp
                        i += 1
                    else:
                        print line[i]
                if not is_char(line[i]) and not is_digit(line[i]) and not is_oper(line[i]) and iden is not '' and not is_delim(line[i]):
                    if is_keyword(iden):
                        print "keyword: %s" % iden
                        print "ERROR: %s" % line[i]
                    elif is_oper(iden):
                        print iden
                        print "Error: %s" % line[i]
                    elif is_num(iden):
                        print "NUM: %s" % iden
                        print "Error: %s" % line[i]
                    else:
                        print "ID: %s" % iden
                        print "Error: %s" % line[i]
                    iden = ''
                elif not is_char(line[i]) and not is_digit(line[i]) and not is_oper(line[i]) and not is_delim(line[i]):
                    print "Error: %s" % line[i]
                if (is_delim(line[i]) or is_oper(line[i])) and iden != '':
                    if is_keyword(iden):
                        print "keyword: %s" % iden
                    elif is_oper(line[i]):
                        temp = line[i] + line[i + 1]
                        if is_oper(temp):
                            if is_keyword(iden):
                                print "keyword: %s" % iden
                            print temp
                            i += 1
                        else:
                            print "ID: %s" % iden
                            print line[i]
                    elif is_num(iden):
                        print "NUM: %s" % iden
                    elif is_oper(iden):
                        temp = iden + line[i + 1]
                        if is_oper(temp):
                            print temp
                            i += 1
                        else:
                            print iden
                    else:
                        print "ID: %s" % iden
                    which_delim(line[i])
                    iden = ''
            i += 1 #increment i
            end_comment = False
        if line[i] == '\n' and iden != '':
            if is_keyword(iden):
                print "keyword: %s" % iden
            elif is_oper(iden):
                print iden
            else:
                print "ID: %s" % iden
            iden = ''
    line_comment = 0 # reset line commment number
    i = 0 #reset i
3 Answers
Proper string formatting
Since Python 2.6, the string method str.format has been available and is generally preferred over the % formatting operator for new code (% still works and has not been formally deprecated). Here's an example of its usage at the Python command line:
>>> print "hello {}".format("world")
hello world
You can also refer to the arguments by position or by name:
>>> print "{1} {0}".format("world", "hello")
hello world
>>> print "{hello} {world}".format(hello="hello", world="world")
hello world
Using except properly
Never ever do something like this:
try:
    int(str)
    return True
except:
    return False
While doing something like this in a minuscule codebase probably won't affect much, doing this in general can cause real problems:
- You catch an error that wasn't supposed to be caught, like a SystemError, a RuntimeError, or even a KeyboardInterrupt.
- You get incorrect output because, again, an error that wasn't supposed to be caught was caught.
In general, you should never do something like this. In the case of this example, you should be catching a ValueError, like this:
try:
    int(str)
    return True
except ValueError:
    return False
Properly opening files
Just using open and assigning its return value to a variable like this is not something you should get into the habit of doing:
f = open( ... )
If you try to open a file using the above method, and your program unexpectedly quits before it's fully complete, resources used up by the file aren't freed.
In order to make sure that the resources are properly freed, you should be using a context manager to open the file, like this:
with open( ... ) as f:
    ...
Once you're using the context manager, it's guaranteed that the resources taken up by the open file will be properly freed, even if the program unexpectedly exits.
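As a rough sketch of what the context manager buys you, the two helpers below read a file with `with` and with an explicit try/finally. The names read_lines, read_lines_longhand, and the temporary file are hypothetical and only for illustration; the longhand version approximates what `with` does for file objects rather than reproducing it exactly.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of `path`; the file is closed even if reading raises."""
    with open(path) as source:
        return source.readlines()


def read_lines_longhand(path):
    """Approximately the same behaviour, spelled out with try/finally."""
    source = open(path)
    try:
        return source.readlines()
    finally:
        # Runs whether readlines() succeeded or raised.
        source.close()


# Usage: write a throwaway file, then read it back both ways.
handle, demo_path = tempfile.mkstemp(suffix=".c")
with os.fdopen(handle, "w") as demo:
    demo.write("int x;\nx = 3;\n")
lines = read_lines(demo_path)
same_lines = read_lines_longhand(demo_path)
os.remove(demo_path)
```

Both helpers return the same list of lines; the `with` form is simply shorter and harder to get wrong.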
Properly matching blank lines
In addition, you also have a bug, right here in the top-level for loop at the end of your code:
for line in open(filename):
    if line != "\n": # Bug here
        ...
While in theory this works if the user writes perfect code and doesn't leave extra spaces on a blank line, it fails to treat a line as blank if the user accidentally includes extra spaces on it. Here's an example of valid input that wouldn't be properly matched, where each s is a space and each n is a newline character:
ssn
sn
ssssn
A good alternative might be to strip the line before testing it, although pattern-matching the line to make sure it doesn't contain illegal characters might be better:
for line in open(filename):
    if line.strip(): # skip lines that contain only whitespace
        ...
Style/nitpicks
You don't have many style violations, but a few things are worth mentioning:
- There should be two blank lines between top-level code/functions/classes.
- You should have a space after each comma in lists/dictionaries/tuples, like this:
spam = [1, 2, 3, 4, 5]
Not like this:
spam = [1,2,3,4,5]
- Could you please expand on the bug related to "Properly matching blank lines"? I think I see it, but the text/example is not very clear. – holroy, Sep 29, 2015 at 11:07
- @holroy I've added an example. – Ethan Bierlein, Sep 29, 2015 at 12:01
- "If [...] your program unexpectedly quits before it's fully complete, resources used up by the file aren't freed." – that's generally incorrect, since the operating system will close unreferenced file descriptors after a process terminates. For a simple program like this, managing resources is irrelevant. However, it would be a subtle leak in a long-running server program. Many systems only allow one thousand files to be opened by a process at any time. Python's with has way cooler applications than closing files, and should be seen as a properly encapsulated try ... finally. – amon, Sep 29, 2015 at 19:08
First, your is_something functions. You don't need to use an if test; you could just return the condition itself. Also, I'd name the parameter something other than kw. kw isn't the clearest shortened form of keyword, and that name also implies that you already think it's a keyword. @Mast points out that WORD could lead to confusion, so it might be better to use something like test_string.
def is_keyword(test_string):
    return test_string in keyword
I'd do the same with is_delim, but I also noticed that you don't call is_delim in which_delim, which seems silly. Also, instead of having multiple != tests, you can use not in with a collection of values, like this:
def which_delim(char):
    if is_delim(char) and char not in ('\t', '\n', ' '):
        print char
I'm also confused about why which_delim prints its result, considering I'd have assumed the output would be built up as a string rather than printed piecemeal. A comment or docstring, even a high-level one, would clear that up.
For is_char, you don't need to initialise c to 0 first. In fact, you could just put the ord call directly into the expression.
def is_char(char):
    return ord(char) >= 97 and ord(char) <= 122
Yes, this currently involves calling ord twice, but in Python you can actually put both conditions together into one a < b < c expression. While we're at it, I'd call ord('a') and ord('z') rather than have 97 and 122, which don't indicate why you chose them at all.
return ord('a') <= ord(char) <= ord('z')
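Note that the question's grammar defines letter as a|...|z|A|...|Z, while the range above only covers lowercase letters. A minimal sketch of a check that accepts both cases follows; the name is_letter is hypothetical, and it compares the one-character strings directly instead of going through ord.

```python
def is_letter(char):
    """Return True if `char` is a single ASCII letter, lower- or uppercase."""
    # Single-character strings compare by code point, so chained
    # comparisons work on them without calling ord().
    return len(char) == 1 and ('a' <= char <= 'z' or 'A' <= char <= 'Z')
```

With this version, identifiers such as Count would be collected as letters instead of falling through to the error branch.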
You can actually check if a string is a number using str.isdigit(). It won't work if there is any whitespace around the number, though, so I'd call strip() too, as that removes whitespace at the start and end of the string, i.e. " 12 ".strip() => "12".
def is_num(string):
    return string.strip().isdigit()
Also, I changed the parameter name from str. str is a builtin type, and you're shadowing it by using the name. You should avoid doing that.
p is a confusing name here, since you defined the regex pattern so far back. Why not call it pattern? Again, don't use str as a name. Also, instead of using % for formatting, use "FLOAT: {}".format(m.group(0)). str.format is the newer method of formatting and it has many more useful features than the old way.
Now onto the huge for loop. There's too much that's hard to read for me to critique the overall logic, but I can make Pythonic style notes.
First, don't nest the entire block inside an if statement. Instead, reverse the condition and use the continue keyword. It tells Python to go to the next iteration of the loop, meaning it won't run the rest of your block. This way you don't need to indent as deeply.
for line in open(filename):
    if line == '\n':
        continue
The is operator tests identity; don't use it to compare strings. Just use ==, which works perfectly fine for equality in Python. Likewise, although is 0 happens to work in CPython because small integers are cached, use == for numbers too.
Also you have some pretty unnecessary comments. It's clear what these lines of code all do:
i += 1 #increment i
line_comment = 0 # reset line commment number
i = 0 #reset i
Instead, you should be including comments on what variables are for, what the more confusing syntax in your code does, and some context on the more abstract intent of the code. A lot of this is unclear in your code because it's just one big block of ifs, whiles and fors. If those could be broken up more, you and others could read them more easily and find ways to improve the code.
I am currently running through line-by-line and then character by character and building up the tokens but I feel like there is a much more straight forward way of doing it.
I would like to be able to just read the line in, break it up and then check each item to see what it is.
Yes. To deal with the "break it up" part, you should take better advantage of regular expressions. Currently you only use one regex, to validate a float you have already extracted. You could also use regexes to extract tokens for you. Write a regex for each type of token and try matching them at the current position in a loop. (Note that the match method of a compiled regex takes a pos argument, which is handy for this.) Take care to try the matches in the right order to avoid incorrectly identifying a FLOAT as a NUM, for example.
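One way this could look is sketched below. It is not a full solution: it assumes the line contains no comments, the names KEYWORDS, TOKEN_PATTERN, and tokenize are hypothetical, and two choices are my own assumptions. First, the FLOAT pattern is tightened so that it requires a fraction or an exponent (the question's pattern also matches plain integers). Second, an error is taken to run up to the next whitespace or delimiter, which is how I read the sample output.

```python
import re

KEYWORDS = {'else', 'if', 'int', 'return', 'void', 'while', 'float'}

# Alternatives are tried left to right, so FLOAT comes before NUM and the
# two-character operators come before their one-character prefixes.
TOKEN_PATTERN = re.compile(r'''
    (?P<FLOAT>\d+\.\d+(?:E[+-]?\d+)?|\d+E[+-]?\d+)   # fraction and/or exponent
  | (?P<NUM>\d+)
  | (?P<ID>[A-Za-z]+)
  | (?P<SYMBOL><=|>=|==|!=|[-+*/<>=;,()\[\]{}])
  | (?P<SPACE>\s+)
  | (?P<ERROR>[^\s;,()\[\]{}]+)                      # assumption: error runs to next delimiter
''', re.VERBOSE)


def tokenize(line):
    """Yield (kind, text) pairs for one comment-free line of C- source."""
    pos = 0
    while pos < len(line):
        # Some alternative always matches at least one character, because
        # SPACE, SYMBOL and ERROR together cover every possible character.
        match = TOKEN_PATTERN.match(line, pos)
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == 'SPACE':
            continue
        if kind == 'ID' and text in KEYWORDS:
            kind = 'keyword'
        yield kind, text


# Usage on one of the sample lines from the question.
sample_tokens = list(tokenize("iiii = 3@33;"))
```

This sketch does not enforce the rule that whitespace must separate IDs, NUMs, and keywords (abc123 would come out as an ID followed by a NUM), and nested comments would still need a separate depth counter carried from line to line.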
Don't use is to compare values:
if line[i + 1] is '/' and comment_count is 0:
Use == instead. The is operator tests object identity. The fact that it happens to work here is due to details of the implementation.
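To illustrate the difference, here is a small sketch with hypothetical variable names. It builds the same two-character operator text in two ways; == compares the contents, while the result of is depends on whether the interpreter happens to reuse one object, which it typically does not for a string assembled at run time.

```python
first = "<="
second = "".join(["<", "="])    # same text, assembled at run time

same_value = first == second    # compares contents: always True here
same_object = first is second   # compares identity: interpreter-dependent

print("equal by value: {}".format(same_value))
print("same object:    {}".format(same_object))
```

Single characters such as '/' tend to be shared objects in CPython, which would explain why the comparisons in the question appear to work, but that is not something the language guarantees.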