Simple python string to list parser

Question 1

So I came across this question on SO and I felt like it would be a cool thing to try and write a parser for since I always wanted to try it. So I present to you:

My first time writing a parser.

It converts strings like this:

"a,s,[c,f],[f,t], [[a,c],[d3,32]]"

into list objects

['a', 's', ['c', 'f'], ['f', 't'], [['a', 'c'], ['d3', '32']]]

Here is my code for now

def parseToList(string, cont=0):
 result = list()
 temp = ''
 i = cont
 while i < len(string):
 if string[i] == ',':
 if len(temp) and temp != ' ':
 result.append(temp)
 temp = ''
 elif string[i] == '[':
 res = parseToList(string, i+1)
 i = res[1]
 result.append(res[0])
 elif string[i] == ']':
 if len(temp) and temp != ' ':
 result.append(temp)
 return (result,i)
 else:
 temp += string[i] 
 i += 1
 if len(temp) and temp != ' ':
 result.append(temp)
 return (result, i)
def listParse(string):
 return parseToList(string)[0]
s = 'a,s,[c,f],[f,t], [[a,c],[d3,32]]'
print(s)
print(listParse(s))

Is there anything I am doing wrong? Something I should change?

Question 2

Do you want a review of design or only of functionality?

Question 3

@Hawk I would prefer the design since I know there really isn't much functionality. It only handles strings

Question 4

Here are a few things that came to my mind:

Bug

if temp != ' ' will not work when there are more than 1 consecutive spaces.
To fix this, use if not temp.isspace() instead of comparing with a hard-coded string.
For example, s = 'a, [b]' will output ['a', ['b'], ' '] for your current code.
Your code outputs ['a', ' b'] for a, b. I'll assume that including the space is a feature and not a bug.

Design

Wrap the test code inside if __name__ == '__main__'. This will prevent the code from being called when being imported from another module.
Function names should preferably be lowercase. Change the CamelCase names to snake_case.
In return statements, you need not enclose the items in a parenthesis if you are returning a tuple
result = list() can be replaced with just result = []
if len(temp) can be replaced with just if temp. The bool of empty values are False in python.

res = parse_to_list(string, i + 1)
i = res[1]
result.append(res[0])

The above can be a bit simplified and can be made a bit more understandable.

nested_list, i = parse_to_list(string, i + 1)
result.append(nested_list)

Instead of using string[i], you can declare a new element char which is equal to string[i]
(This is just my personal preference)
You can declare parse_to_list to inside list_parse. This will remove the need to pass string inside a recursion repeatedly, and will also make the inner function "private".
(But this is also just my personal preference)

The final code should look something like this after applying the above:

def list_parse(string):
 def parse_to_list(cont=0):
 result = []
 temp = ''
 i = cont
 while i < len(string):
 char = string[i]
 if char == ',':
 if temp and not temp.isspace():
 result.append(temp)
 temp = ''
 elif char == '[':
 nested_list, i = parse_to_list(i + 1)
 result.append(nested_list)
 elif char == ']':
 if temp and not temp.isspace():
 result.append(temp)
 return result, i
 else:
 temp += char
 i += 1
 if temp and not temp.isspace():
 result.append(temp)
 return result, i
 return parse_to_list()[0]
if __name__ == '__main__':
 s = 'a,s,[c,f],[f,t], [[a,c],[d3,32]]'
 print(list_parse(s))

Question 5

Thanks for the review! I didn't know you can declare a function inside another. This is going to be super helpful! And yeah outputting ['a', ' b'] for a, b is intended.

Question 6

Disclaimer

I am more of a Java dev, so please excuse my non-pythonesque ideas.

Style review

Write code for someone else, not yourself (i.e. readable & understandable).

You have non-descriptive variable names.

i: usually there is a better name for it, I would consider i viable in something like for i in range
temp: what does temp represent? Already processed characters, so maybe call it processed_chars or something
result, res - almost identical, very confusing. A single variable named result could be OK in a function, Martin Fowler uses it, although Uncle Bob despises it. You are doing parsing, so a probable alternative could be parsed or the like.
res: why do you have this variable in the first place? Just use a tuple deconstruction into something more meaningful:

parsed_list, new_i = parseToList(string, i+1)

I am not sure how python work, but maybe you could even replace new_i directly with i.

Functionality review

You never fail. Weird. Are you sure you can always parse everything successfully? Even though this is a very simple and permissive language, probably not. Edge cases:

[
[a,]
[,a]

Design review

First of all I will create a grammar. It will ease my review and it should have simplified you your implementation:

list = "[" values "]"
# maybe values could be modified to accept dangling commas if you want
values = value { "," value }
value = list | string
string = <anything except "[" "]" "," trimmed (i.e. no leadind or trailing whitespace)>

Now we have a (context-free) grammar given by pseudo-EBNF. Usually lexer and parser are separate, but we don't really need special tokens, we could just use single characters as tokens. Usually a parser accepts a stream of tokens and outputs an AST. We don't need an AST, it could be directly interpreted as python values. An alternative to using your whole string and i as a cursor is to use string as a stream of tokens, from which you take how many you want and return the rest (substring).

Now to implement a grammar, I would create a function for each non-terminal symbol (rule), f.e. parse_list() -> [], parse_values() -> [], parse_value(), parse_string() -> str. parse() would just call parse_values(). If you wrap these in a class. If you fail to match a symbol, you should raise an exception or let it known in your return value.

So I would suggest signatures either:

class Parser:
 def parse(input: string) -> []:
 self.input = input
 parsed, unprocessed = self.parse_values(input)
 if unprocessed:
 # handle exception, maybe print
 return parsed
 def parse_list(cursor: int) -> []
 # Parameter: cursor index in `input`
 # raises exception on error
 # the whole input is stored in class field
 def parse_list(unprocessed: str) -> []
 # Parameter: the unprocessed input
 # raises exception on error
 def parse_list(unprocessed: str) -> ([], str)
 # Parameter: the unprocessed input
 # Returns: (parsedList, new_unprocessed) on success
 # (None, unprocessed) on error
 # takes from unprocessed[0]

Example implementation draft:

def parse_list(unprocessed: str) -> ([], str):
 matched, unprocessed = match(unprocessed, '[')
 if not matched:
 return None, unprocessed
 values, unprocessed = parse_values()
 if values == None:
 return None, unprocessed
 matched, unprocessed = match(unprocessed, ']')
 if not matched:
 return None, unprocessed
 return values
def match(unprocessed: str, to_match: str) -> (bool, str):
 stripped = unprocessed.lstrip()
 if stripped.startswith(to_match):
 return True, stripped[to_match.len:]
 else:
 return False, unprocessed

If you keep a note of the remaining unprocessed input or the current cursor, you could report it when finding an error (f.e. in the raised exception)

Sriv 2,8002 gold badges14 silver badges37 bronze badges · Accepted Answer · 2020-08-12 10:59:09Z

Here are a few things that came to my mind:

Bug

if temp != ' ' will not work when there are more than 1 consecutive spaces.
To fix this, use if not temp.isspace() instead of comparing with a hard-coded string.
For example, s = 'a, [b]' will output ['a', ['b'], ' '] for your current code.
Your code outputs ['a', ' b'] for a, b. I'll assume that including the space is a feature and not a bug.

Design

Wrap the test code inside if __name__ == '__main__'. This will prevent the code from being called when being imported from another module.
Function names should preferably be lowercase. Change the CamelCase names to snake_case.
In return statements, you need not enclose the items in a parenthesis if you are returning a tuple
result = list() can be replaced with just result = []
if len(temp) can be replaced with just if temp. The bool of empty values are False in python.

res = parse_to_list(string, i + 1)
i = res[1]
result.append(res[0])

The above can be a bit simplified and can be made a bit more understandable.

nested_list, i = parse_to_list(string, i + 1)
result.append(nested_list)

Instead of using string[i], you can declare a new element char which is equal to string[i]
(This is just my personal preference)
You can declare parse_to_list to inside list_parse. This will remove the need to pass string inside a recursion repeatedly, and will also make the inner function "private".
(But this is also just my personal preference)

The final code should look something like this after applying the above:

def list_parse(string):
 def parse_to_list(cont=0):
 result = []
 temp = ''
 i = cont
 while i < len(string):
 char = string[i]
 if char == ',':
 if temp and not temp.isspace():
 result.append(temp)
 temp = ''
 elif char == '[':
 nested_list, i = parse_to_list(i + 1)
 result.append(nested_list)
 elif char == ']':
 if temp and not temp.isspace():
 result.append(temp)
 return result, i
 else:
 temp += char
 i += 1
 if temp and not temp.isspace():
 result.append(temp)
 return result, i
 return parse_to_list()[0]
if __name__ == '__main__':
 s = 'a,s,[c,f],[f,t], [[a,c],[d3,32]]'
 print(list_parse(s))

Thanks for the review! I didn't know you can declare a function inside another. This is going to be super helpful! And yeah outputting ['a', ' b'] for a, b is intended.

Stack Exchange Network

Simple python string to list parser

2 Answers 2

Bug

Design

Disclaimer

Style review

Functionality review

Design review

You must log in to answer this question.

Hot Network Questions

Simple python string to list parser

2 Answers 2

Bug

Design

Disclaimer

Style review

Functionality review

Design review

You must log in to answer this question.

Related

Hot Network Questions