I wrote a simple C++ lexer in Python. Although it already works, I kinda feel I didn't follow the Pythonic way. Any suggestions for improvements?
Here's my code:
mysrc.py:
def keywords():
    keywords = [
        "auto",
        "bool",
        "break",
        "case",
        "catch",
        "char",
        "word",
        "class",
        "const",
        "continue",
        "delete",
        "do",
        "double",
        "else",
        "enum",
        "false",
        "float",
        "for",
        "goto",
        "if",
        "#include",
        "int",
        "long",
        "namespace",
        "not",
        "or",
        "private",
        "protected",
        "public",
        "return",
        "short",
        "signed",
        "sizeof",
        "static",
        "struct",
        "switch",
        "true",
        "try",
        "unsigned",
        "void",
        "while",
    ]
    return keywords
def operators():
    operators = {
        "+": "PLUS",
        "-": "MINUS",
        "*": "MUL",
        "/": "DIV",
        "%": "MOD",
        "+=": "PLUSEQ",
        "-=": "MINUSEQ",
        "*=": "MULEQ",
        "/=": "DIVEQ",
        "++": "INC",
        "--": "DEC",
        "|": "OR",
        "&&": "AND",
    }
    return operators
def delimiters():
    delimiters = {
        "\t": "TAB",
        "\n": "NEWLINE",
        "(": "LPAR",
        ")": "RPAR",
        "[": "LBRACE",
        "]": "RBRACE",
        "{": "LCBRACE",
        "}": "RCBRACE",
        "=": "ASSIGN",
        ":": "COLON",
        ",": "COMMA",
        ";": "SEMICOL",
        "<<": "OUT",
        ">>": "IN",
    }
    return delimiters
main file:
import re
import mysrc
def basicCheck(token):
    varPtrn = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]")  # variables
    headerPtrn = re.compile(r"\w[a-zA-Z]+[.]h")  # header files
    digitPtrn = re.compile(r'\d')
    floatPtrn = re.compile(r'\d+[.]\d+')
    if token in mysrc.keywords():
        print(token + " KEYWORD")
    elif token in mysrc.operators().keys():
        print(token + " ", mysrc.operators()[token])
    elif token in mysrc.delimiters():
        description = mysrc.delimiters()[token]
        if description == 'TAB' or description == 'NEWLINE':
            print(description)
        else:
            print(token + " ", description)
    elif re.search(headerPtrn, token):
        print(token + " HEADER")
    elif re.match(varPtrn, token) or "'" in token or '"' in token:
        print(token + ' IDENTIFIER')
    elif re.match(digitPtrn, token):
        if re.match(floatPtrn, token):
            print(token + ' FLOAT')
        else:
            print(token + ' INT')
    return True
def delimiterCorrection(line):
    tokens = line.split(" ")
    for delimiter in mysrc.delimiters().keys():
        for token in tokens:
            if token == delimiter:
                pass
            elif delimiter in token:
                pos = token.find(delimiter)
                tokens.remove(token)
                token = token.replace(delimiter, " ")
                extra = token[:pos]
                token = token[pos + 1:]
                tokens.append(delimiter)
                tokens.append(extra)
                tokens.append(token)
            else:
                pass
    for token in tokens:
        if isWhiteSpace(token):
            tokens.remove(token)
        elif ' ' in token:
            tokens.remove(token)
            token = token.split(' ')
            for d in token:
                tokens.append(d)
    return tokens
def isWhiteSpace(word):
    ptrn = [" ", "\t", "\n"]
    for item in ptrn:
        if word == item:
            return True
        else:
            return False
# Disabled helper, commented out so the module still runs:
#def hasWhiteSpace(token):
#    ptrn = ['\t', '\n']
#    if isWhiteSpace(token) == False:
#        for item in ptrn:
#            if item in token:
#                result = "'" + item + "'"
#                return result
#            else:
#                pass
#        return False
def tokenize(path):
    try:
        f = open(path).read()
        lines = f.split("\n")
        count = 0
        for line in lines:
            count = count + 1
            tokens = delimiterCorrection(line)
            print("\n#LINE ", count)
            print("Tokens: ", tokens)
            for token in tokens:
                basicCheck(token)
        return True
    except FileNotFoundError:
        print("\nInvald Path. Retry")
        run()
def run():
    path = input("Enter Source Code's Path: ")
    tokenize(path)
    again = int(input("""\n1. Retry\n2. Quit\n"""))
    if again == 1:
        run()
    elif again == 2:
        print("Quitting...")
    else:
        print('Invalid Request.')
        run()

run()
4 Answers
You'll run out of stack memory if you recursively call run() in so many places. You can remove those calls.
You should consider opening the file with a context manager to avoid keeping the file handle open.
Consider restricting the try block to only the part of the code you expect to fail.
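A rough sketch of what those three points could look like together, reusing the helper functions from the question (the while loop is only one way of dropping the recursion):

def tokenize(path):
    # Keep the try block narrow: only opening the file is expected to fail.
    try:
        with open(path) as source:  # the context manager closes the file for us
            lines = source.read().split("\n")
    except FileNotFoundError:
        print("\nInvalid Path. Retry")
        return False
    for count, line in enumerate(lines, start=1):
        tokens = delimiterCorrection(line)
        print("\n#LINE ", count)
        print("Tokens: ", tokens)
        for token in tokens:
            basicCheck(token)
    return True

def run():
    # Loop instead of calling run() recursively, so the stack cannot overflow.
    while True:
        path = input("Enter Source Code's Path: ")
        tokenize(path)
        again = input("\n1. Retry\n2. Quit\n")
        if again == "2":
            print("Quitting...")
            break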
Comment from nitwit (May 17, 2019 at 4:08): thx. I edited that part.
A lot of this could be written in a more functional style. That means functions return the result of their execution instead of printing it. It makes your code more readable, more composable, and generally easier to reason about. Also, docstrings at the beginning of all the functions would help immensely in determining their purpose.
For details like naming and formatting, PEP 8 defines the Pythonic way.
Some functions that could be written in a more Pythonic way:
def isWhiteSpace(word):
    return word in [" ", "\t", "\n"]
I partially rewrote the following functions and did a bit of obvious stuff, but the starting paragraph still applies.
def delimiterCorrection(line):
    tokens = line.split(" ")
    for delimiter in mysrc.delimiters().keys():
        for token in tokens:
            if token != delimiter and delimiter in token:
                pos = token.find(delimiter)
                tokens.remove(token)
                token = token.replace(delimiter, " ")
                extra = token[:pos]
                token = token[pos + 1:]
                tokens.append(delimiter)
                tokens.append(extra)
                tokens.append(token)
    for token in tokens:
        if ' ' in token:
            tokens.remove(token)
            token = token.split(' ')
            tokens += token
    return [t for t in tokens if not isWhiteSpace(t)]  # Remove any tokens that are whitespace
from os.path import isfile

def tokenize(path):
    """Return a list of (line_number, [token]) pairs.

    Raise an exception on error."""
    if not isfile(path):
        raise ValueError("File \"" + path + "\" doesn't exist!")
    res = []
    with open(path) as f:
        for line_count, line in enumerate(f):
            tokens = delimiterCorrection(line)
            res.append((line_count, tokens))
            for token in tokens:
                # This has a side effect which makes it hard to rewrite.
                # Also, what does basicCheck do?
                basicCheck(token)
    return res
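To illustrate the "return instead of print" idea for basicCheck as well, here is a sketch that returns a (token, description) pair. The categories and patterns are adapted from the question (the variable pattern gains a trailing * so single-character names also match), and the snake_case name just follows the PEP 8 point above:

def basic_check(token):
    """Return a (token, description) pair instead of printing it."""
    header_ptrn = re.compile(r"\w[a-zA-Z]+[.]h")
    var_ptrn = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
    float_ptrn = re.compile(r"\d+[.]\d+")
    digit_ptrn = re.compile(r"\d")
    if token in mysrc.keywords():
        return token, "KEYWORD"
    if token in mysrc.operators():
        return token, mysrc.operators()[token]
    if token in mysrc.delimiters():
        return token, mysrc.delimiters()[token]
    if header_ptrn.search(token):
        return token, "HEADER"
    if var_ptrn.match(token) or "'" in token or '"' in token:
        return token, "IDENTIFIER"
    if float_ptrn.match(token):  # check FLOAT before INT so 1.5 is not reported as INT
        return token, "FLOAT"
    if digit_ptrn.match(token):
        return token, "INT"
    return token, None

The caller (for example tokenize above) can then decide whether to print, collect, or filter the results.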
DISCLAIMER: I'm not a Python expert in any way and my programming style is heavily inspired by functional programming. Feel free to disagree with everything I said. The fact that you're making an effort to improve your programming style already sets you apart. I hope I provided maybe one or two insights!
I have one answer - along with your questions about how to write the code, you should also consider writing unit tests for it.
That is an answer to your question "any suggestions for improvements?"
For instance, you could write a module that imports unittest, and in there import your module and create a test class per the unittest documentation.
Then you'd write a function called test_isWhiteSpace_truepositive that calls <your module>.isWhiteSpace(" ") and then checks if the result is true. That test would tell you that your function correctly identifies a space as a whitespace character. You could then either add to that function a check for "\t" and "\n", or you could write separate tests for those characters (I'd do them all in one).
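A minimal sketch of what such a test module could look like (lexer is just a placeholder name for the module that contains isWhiteSpace):

import unittest

import lexer  # hypothetical name for the module under test


class TestIsWhiteSpace(unittest.TestCase):
    def test_isWhiteSpace_truepositive(self):
        # The three characters the function is meant to recognise.
        for ch in (" ", "\t", "\n"):
            self.assertTrue(lexer.isWhiteSpace(ch))

    def test_isWhiteSpace_truenegative(self):
        self.assertFalse(lexer.isWhiteSpace("x"))


if __name__ == "__main__":
    unittest.main()

Incidentally, running this against the original isWhiteSpace would flag that the early return False means "\t" and "\n" are never recognised, which is exactly the kind of thing such a test catches.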
The tests would not need much maintenance unless you change your code - you'd write one for each function you create, they serve as documentation for the expected behavior of the function and they could be easily run (python mytestsuite.py) whenever you update your code, and they would tell you if you've broken anything. You could even automate them as part of your source code management (git, etc.) process.
These lower level functions don't benefit very much from this but eventually your program is going to get large and as you make more changes, you might need to add different kinds of whitespace and it would be helpful to have this test to run against your updated function to make sure it behaves as expected.
For instance, you might change "isWhiteSpace" to take an entire string rather than one character, and then you'd want to really exercise it with isWhiteSpace(" \t \n \t\t \n\n\n\n\n"), etc.
Just a toy example, but it would be a suggestion for improving your code.
def run():
    path = input("Enter Source Code's Path: ")
    tokenize(path)
    again = int(input("""\n1. Retry\n2. Quit\n"""))
    if again == 1:
        run()
    elif again == 2:
        print("Quitting...")
    else:
        print('Invalid Request.')
        run()
Instead of reinventing argument parsing, you should rely on the existing tools. The go-to module for dealing with user-provided arguments is argparse; so you could write something like:
import argparse

def run():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'files', metavar='FILE', nargs='+', type=argparse.FileType('r'),
        help='path to the files you want to tokenize')
    args = parser.parse_args()
    for f in args.files:
        with f:
            tokenize(f)
Note that, due to argparse.FileType, the f values are already opened file objects, so you don't need to handle opening yourself in tokenize. argparse will also handle invalid or unreadable files itself before returning from parse_args.
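tokenize would then accept the already-open file object instead of a path; a rough sketch, reusing the helpers from the question:

def tokenize(f):
    """Tokenize an already-open file object handed over by argparse.FileType."""
    for count, line in enumerate(f, start=1):
        tokens = delimiterCorrection(line)
        print("\n#LINE ", count)
        print("Tokens: ", tokens)
        for token in tokens:
            basicCheck(token)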
Now, if your only arguments on the command line are files that you want to read, the fileinput module is even more specific for the task and you can get rid of run altogether:
import fileinput

def tokenize():
    # fileinput reads the files named on the command line, so no path parameter is needed.
    for line in fileinput.input():
        count = fileinput.filelineno()
        tokens = delimiterCorrection(line)
        print("\n#LINE ", count)
        print("Tokens: ", tokens)
        for token in tokens:
            if basicCheck(token) != None:  # empty char -> ignore
                print(basicCheck(token))
Lastly, avoid calling code (such as run() here) at the top level of the file; guard it with an if __name__ == '__main__' clause.
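For instance (with run as the entry point; with the fileinput version the guarded call would be tokenize() instead):

if __name__ == '__main__':
    run()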