Lexer for C- in Python

Question 1

I am currently taking a compilers course where we are designing a compiler for C- (a subset of C). Our first step was the lexer. I have written it, but I believe it is not very "pythonic", and I was hoping someone could help me make it more so, as I believe that would make the later parts of this assignment far simpler and more manageable.

I will first discuss the rules of the language and then give my program.

The accepted keywords are as follows:
```
else if int return void while float
```

The special symbols are:

```
+ - * / < <= > >= == != = ; , ( ) [ ] { } /* */ //
```

Other tokens are ID, NUM (for ints) or FLOAT defined by the following regular expressions:
```
FLOAT = (\d+(\.\d+)?([E][+|-]?\d+)?)
ID = letter letter*
NUM = digit digit*
letter = a|...|z|A|...|Z
digit = 0|...|9
```
Lowercase and uppercase are distinct.
Whitespace consists of blanks, newlines, and tabs. Whitespace is ignored, except that it must separate IDs, NUMs, FLOATs, and keywords.
Comments use the usual C notations, `/* ... */` and `//`, and block comments CAN (I don't know why) be nested.

The program will read in a C- file and output each line, followed by every ID, keyword, NUM, and FLOAT in the order in which they appear, as well as every special symbol. (Comments and whitespace are ignored. Anything invalid is to be displayed as an error, after which the program resumes as normal.) The program does not determine whether the input is a valid program; it simply breaks it up into tokens.

Sample input:

```
/**/ /*/* */ */
/*/*/****This**********/*/ */
/**************/
/*************************
i = 333; ******************/ */
iiii = 3@33;
int g 4 cd (int u, int v) {
```

Sample output:

```
INPUT: /**/ /*/* */ */
INPUT: /*/*/****This**********/*/ */
INPUT: /**************/
INPUT: /*************************
INPUT: i = 333; ******************/ */
*
/
INPUT: iiii = 3@33;
ID: iiii
=
NUM: 3
Error: @33
;
INPUT: int g 4 cd (int u, int v) {
keyword: int
ID: g
NUM: 4
ID: cd
(
keyword: int
ID: u
,
keyword: int
ID: v
)
{
```

I am currently running through the file line by line and then character by character, building up the tokens as I go, but I feel like there is a much more straightforward way of doing it.

I would like to be able to just read the line in, break it up and then check each item to see what it is.

```
from sys import argv
import re
keyword = ['else', 'if', 'int', 'while', 'return', 'void', 'float']
oper = ['+', '-', '*', '/', '=', '<', '>', '<=', '>=', '==', '!=']
delim = ['\t','\n',',',';','(',')','{','}','[',']', ' ']
num = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
p = re.compile(r'(\d+(\.\d+)?([E][+|-]?\d+)?)')
scripty, filename = argv
#file = open(filename)
comment_count = 0
line_comment = 0
is_comment = False
i = 0
iden = "" #null string for identifiers to be built up
print_list = []
end_comment = False #This is a bool value for a block comment
float_str = ""
def is_keyword(kw):
    if kw in keyword:
        return True
    return False
def is_delim(char):
    if char in delim:
        return True
    return False
def which_delim(char):
    if char in delim:
        if char != '\t' and char != '\n' and char != ' ':
            print char
def is_digit(char):
    if char in num:
        return True
    return False
def is_char(char):
    c = 0
    c = ord(char)
    if c >= 97 and c <= 122:
        return True
    return False
def is_oper(char):
    if char in oper:
        return True
    return False
def is_num(str):
    try:
        int(str)
        return True
    except:
        return False
def is_float(str):
    m = p.match(str)
    length = len(str)
    if m and length == len(m.group(0)):
        print "FLOAT: %s" %m.group(0)
        return True
    else:
        return False
for line in open(filename):
    if line != '\n':
        print "Input: %s" % (line),
        while line[i] != '\n': #i and enumerate allows to iterate through line
            if line[i] is '/':
                if line[i + 1] is '/' and comment_count is 0: # it's a line comment print it out
                    line_comment += 1
                elif line[i + 1] is '*':
                    i += 1
                    comment_count += 1
            elif (line[i] is '*') and (line[i+1] is '/') and comment_count > 0:
                comment_count -= 1
                i += 1
                if comment_count == 0:
                    end_comment = True
            if comment_count is 0 and line_comment is 0 and end_comment == False:
                if is_digit(line[i]): #check for float
                    j = i
                    while not is_delim(line[j]):
                        float_str += line[j]
                        j += 1
                    if is_float(float_str):
                        if(j < len(line)):
                            i = j
                        iden = ''
                    float_str = '' #reset string at end use
                if is_char(line[i]) or is_digit(line[i]) and not is_oper(line[i]):
                    iden += line[i]
                if is_delim(line[i]) and iden == '': #for delims w/ blank space
                    which_delim(line[i])
                if is_oper(line[i]) and iden is '':
                    temp = line[i] + line[i + 1]
                    if(is_oper(temp)):
                        print temp
                        i += 1
                    else:
                        print line[i]
                if not is_char(line[i]) and not is_digit(line[i]) and not is_oper(line[i]) and iden is not '' and not is_delim(line[i]):
                    if is_keyword(iden):
                        print "keyword: %s" % iden
                        print "ERROR: %s" % line[i]
                    elif is_oper(iden):
                        print iden
                        print "Error: %s" % line[i]
                    elif is_num(iden):
                        print "NUM: %s" % iden
                        print "Error: %s" % line[i]
                    else:
                        print "ID: %s" % iden
                        print "Error: %s" % line[i]
                    iden = ''
                elif not is_char(line[i]) and not is_digit(line[i]) and not is_oper(line[i]) and not is_delim(line[i]):
                    print "Error: %s" % line[i]
                if (is_delim(line[i]) or is_oper(line[i])) and iden != '':
                    if is_keyword(iden):
                        print "keyword: %s" % iden
                    elif is_oper(line[i]):
                        temp = line[i] + line[i + 1]
                        if is_oper(temp):
                            if is_keyword(iden):
                                print "keyword: %s" % iden
                            print temp
                            i += 1
                        else:
                            print "ID: %s" % iden
                            print line[i]
                    elif is_num(iden):
                        print "NUM: %s" % iden
                    elif is_oper(iden):
                        temp = iden + line[i + 1]
                        if is_oper(temp):
                            print temp
                            i += 1
                        else:
                            print iden
                    else:
                        print "ID: %s" % iden
                    which_delim(line[i])
                    iden = ''
            i += 1 #increment i
            end_comment = False
        if line[i] == '\n' and iden != '':
            if is_keyword(iden):
                print "keyword: %s" % iden
            elif is_oper(iden):
                print iden
            else:
                print "ID: %s" % iden
            iden = ''
        line_comment = 0 # reset line commment number
        i = 0 #reset i
```

score 6 · Answer 1 · 2015-09-28 22:05:08Z

Proper string formatting

Since Python 2.6, the string method str.format has been available, and it is generally preferred over the % formatting operator in new code (% still works and has not been removed; the bare {} placeholders used below need Python 2.7 or later). Here's an example of its usage at the Python command line:

```
>>> print "hello {}".format("world")
hello world
```

You can also specify positional or named parameters, like this:

```
>>> print "{1} {0}".format("world", "hello")
hello world
>>> print "{hello} {world}".format(hello="hello", world="world")
hello world
```

`except`ing properly

Never ever do something like this:

```
try:
    int(str)
    return True
except:
    return False
```

Doing this in a tiny codebase probably won't hurt much, but in general a bare except can cause real problems:

You catch errors that were never supposed to be caught, such as a SystemError, a RuntimeError, or even a KeyboardInterrupt.
You get incorrect output, because an error that should have surfaced was silently swallowed.

Always name the exception you expect. In this example, you should be catching a ValueError, like this:

```
try:
    int(str)
    return True
except ValueError:
    return False
```
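As a minimal sketch of that advice, here is the same check wrapped in a complete function. It is written for Python 3, and the parameter name `text` is an illustrative choice (it avoids shadowing the built-in `str`), not something taken from the original code.

```python
def is_num(text):
    """Return True if text can be parsed as a base-10 integer."""
    try:
        int(text)
        return True
    except ValueError:
        # int() raises ValueError for strings such as "3@33" or "".
        # Anything else (for example KeyboardInterrupt) still propagates.
        return False
```

With this version, `is_num("333")` is true while `is_num("3@33")` is false, and an unrelated failure is no longer silently turned into `False`.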

Properly opening files

Just calling open and assigning its return value to a variable, like this, is not a habit you should get into:

```
f = open( ... )
```

If you try to open a file using the above method, and your program unexpectedly quits before it's fully complete, resources used up by the file aren't freed.

To make sure that the resources are properly freed, you should use a context manager to open the file, like this:

```
with open( ... ) as f:
    ...
```

Once you're using the context manager, the file is guaranteed to be closed as soon as the block is left, even if an exception is raised inside it.
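To make the guarantee concrete, the sketch below reads the same file twice: once with a context manager and once with the explicit try ... finally that it roughly corresponds to. It targets Python 3, and the temporary file and its contents are made up purely so the example is self-contained.

```python
import os
import tempfile

# Create a throwaway input file (name and contents are illustrative only).
fd, path = tempfile.mkstemp(suffix=".c")
with os.fdopen(fd, "w") as f:
    f.write("int x;\n")

# Context-manager form: the file is closed when the block exits,
# whether it exits normally or through an exception.
with open(path) as source:
    lines_with = source.readlines()

# Roughly equivalent explicit form.
source = open(path)
try:
    lines_finally = source.readlines()
finally:
    source.close()

os.remove(path)
```

Both forms read the same lines and leave the file closed; the `with` version simply cannot forget the `close()` call.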

Properly matching blank lines

You also have a bug in the top-level for loop at the end of your code:

```
for line in open(filename):
    if line != "\n": # Bug here
        ...
```

In theory this works if the user writes perfect code with no stray spaces on a blank line, but it fails as soon as a blank line contains extra whitespace. Here's an example of valid input that wouldn't be matched properly, where each s stands for a space and each n for a newline character:

```
ssn
sn
ssssn
```

A good alternative is to strip the whitespace from the line and test whether anything is left, although pattern-matching the line to make sure it doesn't contain illegal characters might be better:

```
for line in open(filename):
    if line.strip():  # false for lines that contain only whitespace
        ...
```
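One way to skip lines that contain only whitespace is to rely on str.strip(), which returns an empty (and therefore false) string for them. The following is a small Python 3 sketch; the helper name `non_blank_lines` and the sample data are hypothetical.

```python
def non_blank_lines(lines):
    """Yield only the lines that contain something other than whitespace."""
    for line in lines:
        if not line.strip():  # "\n", "  \n" and "\t\n" all strip to ""
            continue
        yield line


sample = ["int x;\n", "\n", "  \n", "\t\n", "x = 1;\n"]
kept = list(non_blank_lines(sample))
```

Here `kept` contains only the two lines that hold code, so lines made of spaces or tabs are treated the same way as truly empty ones.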

Style/nitpicks

You don't have many style violations, but a few things are worth mentioning:

There should be two blank lines between top-level functions and classes.
You should put a space after each comma in lists/dictionaries/tuples, like this:
```
spam = [1, 2, 3, 4, 5]
```
Not like this:
```
spam = [1,2,3,4,5]
```

Could you please expand on the bug related to "Properly matching blank lines"? I think I see it, but the text/example is not very clear.

@holroy I've added an example.

"If [...] your program unexpectedly quits before it's fully complete, resources used up by the file aren't freed." – that's generally incorrect, since the operating system closes any file descriptors that are still open when a process terminates. For a simple program like this, managing resources is irrelevant. However, it would be a subtle leak in a long-running server program: many systems only allow about one thousand files to be open in a process at any time. Python's with has far cooler applications than closing files, and should be seen as a properly encapsulated try ... finally.

score 5 · Answer 2 · 2015-09-28 22:15:46Z

First, your is_something functions. You don't need an if test; you can just return the condition itself. Also, I'd name the parameter something other than kw. kw isn't the clearest shortened form of keyword, and the name implies that you already think it's a keyword. @Mast points out that word could lead to confusion, so it might be better to use something like test_string.

```
def is_keyword(test_string):
    return test_string in keyword
```

I'd do the same with is_delim, but I also noticed that you don't call is_delim in which_delim, which seems silly. Also, instead of chaining multiple != tests, you can use not in with a tuple of values, like this:

```
def which_delim(char):
    if is_delim(char) and char not in ('\t', '\n', ' '):
        print char
```

I'm also confused about why which_delim prints its result: I'd have assumed the output would be built up as a string rather than printed piecemeal. A comment or docstring, even a high-level one, would clear that up.

For is_char, you don't need to initialise c to 0 first. In fact, you could put the ord call directly into the expression.

```
def is_char(char):
    return ord(char) >= 97 and ord(char) <= 122
```

Yes, this currently calls ord twice, but in Python you can combine both conditions into a single a <= b <= c expression. While we're at it, I'd write ord('a') and ord('z') rather than 97 and 122, which don't indicate why you chose them at all.

```
def is_char(char):
    return ord('a') <= ord(char) <= ord('z')
```
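The chained comparison extends naturally to the full grammar from the question, which defines letter = a|...|z|A|...|Z, whereas the range 97 to 122 covers lowercase letters only. The Python 3 sketch below is one possible way to cover both cases; the name `is_letter` is hypothetical and chosen here only to keep it apart from is_char.

```python
def is_letter(char):
    """Return True for a single ASCII letter, upper or lower case."""
    code = ord(char)  # ord() raises TypeError unless char is one character
    return ord('a') <= code <= ord('z') or ord('A') <= code <= ord('Z')
```

Naming the boundaries with `ord('a')` and `ord('Z')` keeps the intent visible, and digits or symbols such as `@` fall outside both ranges.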

You can check whether a string is a number using str.isdigit(). It won't work if there is any whitespace around the digits, though, so I'd call strip() too, as that removes whitespace at the start and end of the string, i.e. " 12 ".strip() gives "12".

```
def is_num(string):
    return string.strip().isdigit()
```

I also changed the parameter name from str. str is a built-in type, and you're shadowing it by using that name. You should avoid doing that.

p is a confusing name here, since you defined the regex pattern so far back. Why not call it pattern? Again, don't use str as a name. Also, instead of using % for formatting, use "FLOAT: {}".format(m.group(0)). str.format is the newer way of formatting, and it has many more useful features than the old one.

Now on to the huge for loop. There's too much that's hard to read for me to critique the overall logic, but I can make some Pythonic style notes.

First, don't nest the entire block inside an if statement. Instead, reverse the condition and use the continue keyword. It tells Python to go on to the next iteration of the loop, meaning it won't run the rest of your block. This way you don't need to indent as deeply.

```
for line in open(filename):
    if line == '\n':
        continue
```

is is the identity operator; don't use it to test strings. Just use ==, which works perfectly well for equality in Python. Likewise, although is 0 happens to work more reliably (CPython caches small integers), == is the accepted way to compare numbers.

Also, you have some pretty unnecessary comments. It's clear what all of these lines do:

```
i += 1 #increment i
line_comment = 0 # reset line commment number
i = 0 #reset i
```

Instead, you should include comments on what the variables are for, on what the more confusing syntax in your code does, and on the more abstract intent of the code. A lot of this is unclear in your code because it's just one big block of ifs, whiles and fors. If those could be broken up more, you and others could read them more easily and find ways to improve the code.

"Also I'd name the parameter word rather than kw." Please don't. word is already reserved in many contexts.

Janne Karila · Answer 3 · 2015-09-29 15:00:09Z

> I am currently running through line-by-line and then character by character and building up the tokens but I feel like there is a much more straight forward way of doing it.
>
> I would like to be able to just read the line in, break it up and then check each item to see what it is.

Yes. To deal with the "break it up" part, you should take better advantage of regular expressions. Currently you only use one regex, to validate a float you have already extracted. You could also use regexes to extract the tokens for you. Write a regex for each type of token and try matching them at the current position in a loop. (Note that the match method of a compiled regex takes a pos argument, which is handy for this.) Take care to try the matches in the right order to avoid incorrectly identifying a FLOAT as a NUM, for example.
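A minimal sketch of that loop might look like the following, written for Python 3. Several details are assumptions rather than part of the answer or the assignment: the names `TOKEN_PATTERNS` and `tokenize`, the FLOAT pattern (tightened so that it requires a fraction or an exponent and so never swallows a plain NUM, with `[+-]` in place of `[+|-]`), and the ERROR rule, which consumes an invalid character plus any letters or digits that follow so that `3@33` gives `NUM 3` and `Error @33` as in the sample output. Comments are not handled here.

```python
import re

KEYWORDS = {"else", "if", "int", "return", "void", "while", "float"}

# Order matters: FLOAT is tried before NUM, and two-character symbols
# are listed before the one-character ones inside SYMBOL.
TOKEN_PATTERNS = [
    (kind, re.compile(regex))
    for kind, regex in [
        ("FLOAT", r"\d+(?:\.\d+(?:E[+-]?\d+)?|E[+-]?\d+)"),
        ("NUM", r"\d+"),
        ("ID", r"[A-Za-z]+"),
        ("SYMBOL", r"<=|>=|==|!=|[-+*/<>=;,()\[\]{}]"),
        ("SKIP", r"[ \t\n]+"),
        # Assumed error rule: one invalid character plus trailing letters/digits.
        ("ERROR", r"[^ \t\n][A-Za-z0-9]*"),
    ]
]


def tokenize(line):
    """Yield (kind, text) pairs for one line of C- source."""
    pos = 0
    while pos < len(line):
        # SKIP and ERROR together cover every character, so one pattern
        # always matches and the loop below always ends in a break.
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(line, pos)  # match exactly at pos
            if match:
                break
        pos = match.end()
        text = match.group(0)
        if kind == "SKIP":
            continue
        if kind == "ID" and text in KEYWORDS:
            kind = "keyword"
        yield kind, text


for kind, text in tokenize("iiii = 3@33;\n"):
    print(kind, text)
```

For the line `iiii = 3@33;` this yields an ID, the `=` symbol, `NUM 3`, an error for `@33`, and the `;` symbol. Nested comments could be layered on top with a depth counter that is kept between lines.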

Don't use is to compare values:

```
if line[i + 1] is '/' and comment_count is 0:
```

Use == instead. is tests object identity. The fact that it happens to work here is due to details of the implementation.
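The difference between equality and identity is easiest to see with objects that are never shared. The short Python 3 sketch below uses lists for that reason; with one-character strings and small integers CPython often reuses the same object, which is why the original comparisons appear to work. Python 3.8 and later also emit a SyntaxWarning when is is used with a literal.

```python
first = [1, 2, 3]
second = [1, 2, 3]   # equal contents, but a separate object
alias = first        # a second name for the same object

print(first == second)  # True: the values are equal
print(first is second)  # False: two distinct list objects
print(first is alias)   # True: both names refer to one object
```

A lexer compares values, not objects, so == (and !=) is the right operator for every character and counter test in the loop.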
