So I came across this question on SO and I felt like it would be a cool thing to try and write a parser for since I always wanted to try it. So I present to you:
My first time writing a parser.
It converts strings like this:
"a,s,[c,f],[f,t], [[a,c],[d3,32]]"
into list objects
['a', 's', ['c', 'f'], ['f', 't'], [['a', 'c'], ['d3', '32']]]
Here is my code for now
def parseToList(string, cont=0):
result = list()
temp = ''
i = cont
while i < len(string):
if string[i] == ',':
if len(temp) and temp != ' ':
result.append(temp)
temp = ''
elif string[i] == '[':
res = parseToList(string, i+1)
i = res[1]
result.append(res[0])
elif string[i] == ']':
if len(temp) and temp != ' ':
result.append(temp)
return (result,i)
else:
temp += string[i]
i += 1
if len(temp) and temp != ' ':
result.append(temp)
return (result, i)
def listParse(string):
return parseToList(string)[0]
s = 'a,s,[c,f],[f,t], [[a,c],[d3,32]]'
print(s)
print(listParse(s))
Is there anything I am doing wrong? Something I should change?
-
1\$\begingroup\$ Do you want a review of design or only of functionality? \$\endgroup\$Hawk– Hawk2020年08月12日 10:30:59 +00:00Commented Aug 12, 2020 at 10:30
-
\$\begingroup\$ @Hawk I would prefer the design since I know there really isn't much functionality. It only handles strings \$\endgroup\$Marko Borković– Marko Borković2020年08月12日 10:37:30 +00:00Commented Aug 12, 2020 at 10:37
2 Answers 2
Here are a few things that came to my mind:
Bug
if temp != ' '
will not work when there are more than 1 consecutive spaces.
To fix this, useif not temp.isspace()
instead of comparing with a hard-coded string.
For example,s = 'a, [b]'
will output['a', ['b'], ' ']
for your current code.Your code outputs
['a', ' b']
fora, b
. I'll assume that including the space is a feature and not a bug.
Design
Wrap the test code inside
if __name__ == '__main__'
. This will prevent the code from being called when being imported from another module.Function names should preferably be lowercase. Change the CamelCase names to snake_case.
In return statements, you need not enclose the items in a parenthesis if you are returning a tuple
result = list()
can be replaced with justresult = []
if len(temp)
can be replaced with justif temp
. The bool of empty values areFalse
in python.
res = parse_to_list(string, i + 1)
i = res[1]
result.append(res[0])
The above can be a bit simplified and can be made a bit more understandable.
nested_list, i = parse_to_list(string, i + 1)
result.append(nested_list)
Instead of using
string[i]
, you can declare a new elementchar
which is equal tostring[i]
(This is just my personal preference)You can declare
parse_to_list
to insidelist_parse
. This will remove the need to passstring
inside a recursion repeatedly, and will also make the inner function "private".
(But this is also just my personal preference)
The final code should look something like this after applying the above:
def list_parse(string):
def parse_to_list(cont=0):
result = []
temp = ''
i = cont
while i < len(string):
char = string[i]
if char == ',':
if temp and not temp.isspace():
result.append(temp)
temp = ''
elif char == '[':
nested_list, i = parse_to_list(i + 1)
result.append(nested_list)
elif char == ']':
if temp and not temp.isspace():
result.append(temp)
return result, i
else:
temp += char
i += 1
if temp and not temp.isspace():
result.append(temp)
return result, i
return parse_to_list()[0]
if __name__ == '__main__':
s = 'a,s,[c,f],[f,t], [[a,c],[d3,32]]'
print(list_parse(s))
-
1\$\begingroup\$ Thanks for the review! I didn't know you can declare a function inside another. This is going to be super helpful! And yeah outputting
['a', ' b']
fora, b
is intended. \$\endgroup\$Marko Borković– Marko Borković2020年08月12日 11:05:23 +00:00Commented Aug 12, 2020 at 11:05
Disclaimer
I am more of a Java dev, so please excuse my non-pythonesque ideas.
Style review
Write code for someone else, not yourself (i.e. readable & understandable).
You have non-descriptive variable names.
i
: usually there is a better name for it, I would consideri
viable in something likefor i in range
temp
: what does temp represent? Already processed characters, so maybe call itprocessed_chars
or somethingresult
,res
- almost identical, very confusing. A single variable namedresult
could be OK in a function, Martin Fowler uses it, although Uncle Bob despises it. You are doing parsing, so a probable alternative could beparsed
or the like.res
: why do you have this variable in the first place? Just use a tuple deconstruction into something more meaningful:
parsed_list, new_i = parseToList(string, i+1)
I am not sure how python work, but maybe you could even replace new_i
directly with i
.
Functionality review
You never fail. Weird. Are you sure you can always parse everything successfully? Even though this is a very simple and permissive language, probably not. Edge cases:
[
[a,]
[,a]
Design review
First of all I will create a grammar. It will ease my review and it should have simplified you your implementation:
list = "[" values "]"
# maybe values could be modified to accept dangling commas if you want
values = value { "," value }
value = list | string
string = <anything except "[" "]" "," trimmed (i.e. no leadind or trailing whitespace)>
Now we have a (context-free) grammar given by pseudo-EBNF. Usually lexer and parser are separate, but we don't really need special tokens, we could just use single characters as tokens. Usually a parser accepts a stream of tokens and outputs an AST. We don't need an AST, it could be directly interpreted as python values.
An alternative to using your whole string
and i
as a cursor is to use string
as a stream of tokens, from which you take how many you want and return the rest (substring).
Now to implement a grammar, I would create a function for each non-terminal symbol (rule), f.e. parse_list() -> []
, parse_values() -> []
, parse_value()
, parse_string() -> str
. parse()
would just call parse_values()
. If you wrap these in a class. If you fail to match a symbol, you should raise an exception or let it known in your return value.
So I would suggest signatures either:
class Parser:
def parse(input: string) -> []:
self.input = input
parsed, unprocessed = self.parse_values(input)
if unprocessed:
# handle exception, maybe print
return parsed
def parse_list(cursor: int) -> []
# Parameter: cursor index in `input`
# raises exception on error
# the whole input is stored in class field
def parse_list(unprocessed: str) -> []
# Parameter: the unprocessed input
# raises exception on error
def parse_list(unprocessed: str) -> ([], str)
# Parameter: the unprocessed input
# Returns: (parsedList, new_unprocessed) on success
# (None, unprocessed) on error
# takes from unprocessed[0]
Example implementation draft:
def parse_list(unprocessed: str) -> ([], str):
matched, unprocessed = match(unprocessed, '[')
if not matched:
return None, unprocessed
values, unprocessed = parse_values()
if values == None:
return None, unprocessed
matched, unprocessed = match(unprocessed, ']')
if not matched:
return None, unprocessed
return values
def match(unprocessed: str, to_match: str) -> (bool, str):
stripped = unprocessed.lstrip()
if stripped.startswith(to_match):
return True, stripped[to_match.len:]
else:
return False, unprocessed
If you keep a note of the remaining unprocessed input or the current cursor, you could report it when finding an error (f.e. in the raised exception)