
I wrote a simple C++ lexer in Python. Although it already works, I feel I didn't quite follow the Pythonic way. Any suggestions for improvements?

Sample C++ Code

Sample output

Here's my code:

mysrc.py:

def keywords():
    keywords = [
        "auto",
        "bool",
        "break",
        "case",
        "catch",
        "char",
        "word",
        "class",
        "const",
        "continue",
        "delete",
        "do",
        "double",
        "else",
        "enum",
        "false",
        "float",
        "for",
        "goto",
        "if",
        "#include",
        "int",
        "long",
        "namespace",
        "not",
        "or",
        "private",
        "protected",
        "public",
        "return",
        "short",
        "signed",
        "sizeof",
        "static",
        "struct",
        "switch",
        "true",
        "try",
        "unsigned",
        "void",
        "while",
    ]
    return keywords

def operators():
    operators = {
        "+": "PLUS",
        "-": "MINUS",
        "*": "MUL",
        "/": "DIV",
        "%": "MOD",
        "+=": "PLUSEQ",
        "-=": "MINUSEQ",
        "*=": "MULEQ",
        "/=": "DIVEQ",
        "++": "INC",
        "--": "DEC",
        "|": "OR",
        "&&": "AND",
    }
    return operators

def delimiters():
    delimiters = {
        "\t": "TAB",
        "\n": "NEWLINE",
        "(": "LPAR",
        ")": "RPAR",
        "[": "LBRACE",
        "]": "RBRACE",
        "{": "LCBRACE",
        "}": "RCBRACE",
        "=": "ASSIGN",
        ":": "COLON",
        ",": "COMMA",
        ";": "SEMICOL",
        "<<": "OUT",
        ">>": "IN",
    }
    return delimiters

main file:

import re
import mysrc
def basicCheck(token):
    varPtrn = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]")  # variables
    headerPtrn = re.compile(r"\w[a-zA-Z]+[.]h")  # header files
    digitPtrn = re.compile(r'\d')
    floatPtrn = re.compile(r'\d+[.]\d+')
    if token in mysrc.keywords():
        print(token + " KEYWORD")
    elif token in mysrc.operators().keys():
        print(token + " ", mysrc.operators()[token])
    elif token in mysrc.delimiters():
        description = mysrc.delimiters()[token]
        if description == 'TAB' or description == 'NEWLINE':
            print(description)
        else:
            print(token + " ", description)
    elif re.search(headerPtrn, token):
        print(token + " HEADER")
    elif re.match(varPtrn, token) or "'" in token or '"' in token:
        print(token + ' IDENTIFIER')
    elif re.match(digitPtrn, token):
        if re.match(floatPtrn, token):
            print(token + ' FLOAT')
        else:
            print(token + ' INT')
    return True
def delimiterCorrection(line):
    tokens = line.split(" ")
    for delimiter in mysrc.delimiters().keys():
        for token in tokens:
            if token == delimiter:
                pass
            elif delimiter in token:
                pos = token.find(delimiter)
                tokens.remove(token)
                token = token.replace(delimiter, " ")
                extra = token[:pos]
                token = token[pos + 1:]
                tokens.append(delimiter)
                tokens.append(extra)
                tokens.append(token)
            else:
                pass
    for token in tokens:
        if isWhiteSpace(token):
            tokens.remove(token)
        elif ' ' in token:
            tokens.remove(token)
            token = token.split(' ')
            for d in token:
                tokens.append(d)
    return tokens
def isWhiteSpace(word):
    ptrn = [" ", "\t", "\n"]
    for item in ptrn:
        if word == item:
            return True
        else:
            return False

#def hasWhiteSpace(token):
    ptrn = ['\t', '\n']
    if isWhiteSpace(token) == False:
        for item in ptrn:
            if item in token:
                result = "'" + item + "'"
                return result
            else:
                pass
        return False
def tokenize(path):
    try:
        f = open(path).read()
        lines = f.split("\n")
        count = 0
        for line in lines:
            count = count + 1
            tokens = delimiterCorrection(line)
            print("\n#LINE ", count)
            print("Tokens: ", tokens)
            for token in tokens:
                basicCheck(token)
        return True
    except FileNotFoundError:
        print("\nInvald Path. Retry")
        run()
def run():
    path = input("Enter Source Code's Path: ")
    tokenize(path)
    again = int(input("""\n1. Retry\n2. Quit\n"""))
    if again == 1:
        run()
    elif again == 2:
        print("Quitting...")
    else:
        print('Invalid Request.')
        run()

run()
301_Moved_Permanently
asked May 13, 2019 at 22:20

4 Answers


You'll run out of stack memory if you recursively call run() in so many places. You can remove those calls.

You should consider opening the file with a context manager to avoid keeping the file handle open.

Consider restricting the try block to only the part of code you expect to fail.
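
A minimal sketch of how those three points could fit together, reusing the question's delimiterCorrection and basicCheck helpers:

def tokenize(path):
    # keep the try block around the single call that can actually fail
    try:
        f = open(path)
    except FileNotFoundError:
        print("\nInvalid path.")
        return False
    # the context manager closes the file even if something below raises
    with f:
        for count, line in enumerate(f, start=1):
            tokens = delimiterCorrection(line)
            print("\n#LINE ", count)
            print("Tokens: ", tokens)
            for token in tokens:
                basicCheck(token)
    return True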

answered May 14, 2019 at 10:28
  • thx. I edited that part. – Commented May 17, 2019 at 4:08

A lot of this could be written in a more functional style. That means functions return the result of their execution instead of printing it. It makes your code more readable, more composable, and generally easier to reason about. Also, docstrings at the beginning of all the functions would help immensely in determining their purpose.
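
For example, basicCheck could return its classification as a (token, kind) pair and leave the printing to the caller. A minimal sketch, reusing the mysrc helpers from the question (the regex branches would follow the same pattern):

def basic_check(token):
    """Classify token, returning a (token, kind) pair instead of printing."""
    if token in mysrc.keywords():
        return token, "KEYWORD"
    if token in mysrc.operators():
        return token, mysrc.operators()[token]
    if token in mysrc.delimiters():
        return token, mysrc.delimiters()[token]
    return token, "UNKNOWN"

The caller can then print(*basic_check(token)), collect the pairs into a list, or feed them to a parser, and the function itself becomes trivial to test.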

For details like naming and formatting, PEP 8 defines the Pythonic way.

Some functions that could be written more pythonically:

def isWhiteSpace(word):
    return word in [" ", "\t", "\n"]

I partially rewrote the following functions and did a bit of obvious stuff, but the starting paragraph still applies.

def delimiterCorrection(line):
    tokens = line.split(" ")
    for delimiter in mysrc.delimiters().keys():
        for token in tokens:
            if token != delimiter and delimiter in token:
                pos = token.find(delimiter)
                tokens.remove(token)
                token = token.replace(delimiter, " ")
                extra = token[:pos]
                token = token[pos + 1:]
                tokens.append(delimiter)
                tokens.append(extra)
                tokens.append(token)
    for token in tokens:
        if ' ' in token:
            tokens.remove(token)
            token = token.split(' ')
            tokens += token
    return [t for t in tokens if not isWhiteSpace(t)]  # remove any tokens that are whitespace
from os.path import isfile

def tokenize(path):
    """Return a list of (line_number, [token]) pairs.

    Raise an exception on error."""
    if not isfile(path):
        raise ValueError("File \"" + path + "\" doesn't exist!")
    res = []
    with open(path) as f:
        for line_count, line in enumerate(f, start=1):
            tokens = delimiterCorrection(line)
            res.append((line_count, tokens))
            for token in tokens:
                # This has a side effect which makes it hard to rewrite.
                # Also, what does basicCheck do?
                basicCheck(token)
    return res

DISCLAIMER: I'm not a Python expert in any way, and my programming style is heavily inspired by functional programming. Feel free to disagree with everything I said. The fact that you're making an effort to improve your programming style already sets you apart. I hope I provided maybe one or two insights!

answered May 14, 2019 at 17:15

I have one answer: along with your questions about how to write the code, you should also consider writing unit tests for it.

That is an answer to your question "any suggestions for improvements?"

For instance, you could write a module that imports unittest, imports your module, and creates a test class per the unittest documentation.

Then you'd write a function called test_isWhiteSpace_truepositive that calls <your module>.isWhiteSpace(" ") and checks that the result is true. That test would tell you that your function correctly identifies a space as a whitespace character. You could then either add checks for "\t" and "\n" to that function, or write separate tests for those characters (I'd do them all in one), as sketched below.
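
A toy version of that test module might look like this; the module name lexer is hypothetical, so substitute whatever your file is called:

import unittest

import lexer  # hypothetical name for the module under review

class TestIsWhiteSpace(unittest.TestCase):
    def test_isWhiteSpace_truepositive(self):
        # all three whitespace characters should be recognised
        for ch in (" ", "\t", "\n"):
            self.assertTrue(lexer.isWhiteSpace(ch))

    def test_isWhiteSpace_truenegative(self):
        self.assertFalse(lexer.isWhiteSpace("x"))

if __name__ == '__main__':
    unittest.main()

Incidentally, run against the posted isWhiteSpace, the "\t" and "\n" cases would fail (the loop returns False on the first mismatch), which is exactly the kind of bug such tests are meant to catch.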

The tests would not need much maintenance unless you change your code. You'd write one for each function you create; they serve as documentation for the expected behavior of each function, they can be run easily (python mytestsuite.py) whenever you update your code, and they will tell you if you've broken anything. You could even automate them as part of your source code management (git, etc.) process.

These lower-level functions don't benefit very much from this, but eventually your program is going to get large. As you make more changes, you might need to add different kinds of whitespace, and it would be helpful to have this test to run against your updated function to make sure it behaves as expected.

For instance you might change "isWhiteSpace" to take an entire string rather than one character, and then you'd want to really exercise it with isWhiteSpace(" \t \n \t\t \n\n\n\n\n"), etc.

Just a toy example, but it would be a suggestion for improving your code.

Mast
answered Jul 23, 2019 at 11:15
def run():
    path = input("Enter Source Code's Path: ")
    tokenize(path)
    again = int(input("""\n1. Retry\n2. Quit\n"""))
    if again == 1:
        run()
    elif again == 2:
        print("Quitting...")
    else:
        print('Invalid Request.')
        run()

Instead of reinventing argument parsing, you should rely on the existing tools. The go-to module when dealing with user-provided arguments is argparse; so you could write something like:

import argparse

def run():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'files', metavar='FILE', nargs='+', type=argparse.FileType('r'),
        help='path to the files you want to tokenize')
    args = parser.parse_args()
    for f in args.files:
        with f:
            tokenize(f)

Note that, due to argparse.FileType, each f is an already-opened file, so you don't need to handle opening it yourself in tokenize. argparse will also handle invalid or unreadable files itself before returning from parse_args.
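
That does mean tokenize should accept an open file object rather than a path; a sketch, again reusing the question's helpers:

def tokenize(f):
    # f is an already-opened file object, courtesy of argparse.FileType
    for count, line in enumerate(f, start=1):
        tokens = delimiterCorrection(line)
        print("\n#LINE ", count)
        print("Tokens: ", tokens)
        for token in tokens:
            basicCheck(token)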


Now, if the only arguments on your command line are files that you want to read, the fileinput module is even more specific to the task, and you can get rid of run altogether:

import fileinput

def tokenize():
    # fileinput reads the files named on the command line, line by line
    for line in fileinput.input():
        count = fileinput.filelineno()
        tokens = delimiterCorrection(line)
        print("\n#LINE ", count)
        print("Tokens: ", tokens)
        for token in tokens:
            if basicCheck(token) is not None:  # empty char -> ignore
                print(basicCheck(token))

Lastly, avoid calling code (such as run() here) from the top-level of the file and guard it with an if __name__ == '__main__' clause.
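
In its simplest form:

if __name__ == '__main__':
    # runs only when the file is executed directly, not when it is imported
    run()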

answered May 17, 2019 at 8:29
