Python - basis of a parser library

Question 1

This is a followup of sorts to this question: link. 'Of sorts' because, as pointed out in the comments, this question should have been asked first.

I'm writing a parser library in Python, mostly for fun, but I do plan to use it in some of my projects. I would like to know if the foundation of this library is sound: whether it is Pythonic, performant and provides a decent API.

The entire code can be seen in this repo: link, in the frozen branch review-04-02-2018, file core.py. I'll post relevant portions below.

The library considers a parser any callable that takes a single State object and returns a new one. A State object's purpose is to keep track of three things: input that's left to parse, the portion of input that was consumed by a parser and the value of a parser chain (the idea is that a parser chain would construct some object as it goes along). Here's the State class:

class State():
 """ An object representing current parser state. """
 def __init__(self, left, value=None, parsed=""):
 self.value = value
 self.left = left
 self.parsed = parsed
 def __eq__(self, other):
 return (self.value == other.value and
 self.left == other.left and
 self.parsed == other.parsed)
 def __repr__(self):
 return f"State({repr(self.value)}, {repr(self.left[0:40])}, {repr(self.parsed)})"
 def copy(self):
 """ Shallow copy the State object. """
 return State(self.left, self.value, self.parsed)
 def deepcopy(self):
 """ Deep copy the State object. """
 return deepcopy(self)
 def set(self, **kwargs):
 """ Return a new State object with given attributes. """
 left = kwargs.get("left", self.left)
 value = kwargs.get("value", self.value)
 parsed = kwargs.get("parsed", self.parsed)
 return State(left, value, parsed)
 def consume(self, how_many):
 """ Return a new State object with 'how_many' characters consumed. """
 return State(self.left[how_many:], self.value, self.left[0:how_many])
 def blank(self):
 """
 Return a new State object with the same 'left' but with None 'parsed'
 value.
 """
 return State(self.left)
 def split(self, at):
 """
 Split the State object in two. Return a tuple with two State objects,
 the first will have 'left' up to, but not including, index 'at', the
 second - starting with 'at' and until the end.
 """
 first = self.copy()
 first.left = self.left[:at]
 second = self.copy()
 second.left = self.left[at:]
 return first, second

Most are, hopefully, self-explanatory. I have doubts about set, though: the purpose is to be able to do things like lambda s: s.set(parsed="foo, left="baz", value=None) to compensate for inability to assign things in lambdas. Could it be written cleaner?

There are also two exception classes, to signal when parsing has failed or is stopped prematurely, respectively. Here are they:

class ParsingFailure(Exception):
 """ An exception of this type should be thrown if parsing fails. """
 pass
class ParsingEnd(Exception):
 """
 An exception of this type should be thrown if parsing ends successfully,
 but early.
 """
 def __init__(self, state):
 super().__init__()
 self.state = state

All of the above things are used in the main function of the library, parse:

def parse(state_or_string, parser, verbose=False):
 """
 Run a given parser on a given state object or a string.
 Return parser's return value on success, or None on failure.
 If 'verbose' is truthy, return terminating ParsingFailure exception on
 failure instead of None.
 """
 if isinstance(state_or_string, str):
 state = State(state_or_string)
 else:
 state = state_or_string
 try:
 return parser(state)
 except ParsingFailure as failure:
 if verbose:
 return failure
 return None
 except ParsingEnd as end:
 return end.state

There is a problem with this, namely that it's impossible to tell apart None returned from a failure and None legitimately produced by a parser, but I feel in most cases None is a pretty good indicator of a failure. This raises the question: should I somehow separate these two cases?

Question 2

I'm just going to review the State class. You'll see that there's a lot to write about here, more than enough for one review.

Understanding the State class is crucial to understanding the operation of the parser, so its docstring needs to be more detailed than "An object representing current parser state." In particular, what attributes does it have, what do they mean, and what are the relationships between them? The explanation from your question would make a good starting point:
```
class State:
 """State of a parser, with attributes:
 left: str -- substring of input remaining to be parsed
 value -- arbitrary value constructed by parser so far
 parsed: str -- substring of input that was parsed to produce value
 """
```
Do I have this documentation right? For example, if we have:
```
parse(State(a, value1, b)) → State(c, value2, d)
```
is it supposed to be the case that b + a == d + c? That was my first guess, but it looks as if this is not the case: I can see methods in various places that set parsed to the empty string after parsing some text. Clearly I have misunderstood what the parsed field is supposed to contain.

I think it's vital for you to figure out what the relationship between value and parsed is supposed to be, and document the corresponding invariant on the State class. At the moment it's impossible for me to reliably write a parser function, because I don't know what I am supposed to put in the parsed attribute of the returned state object.
There are two possible designs for the use of state objects. In the first, we have a single object representing the current state, and as each piece of the input is parsed, the state is updated. In the second, we treat the state objects as immutable, and make a new one each time we parse a piece of the input.

The first design saves memory (we only need one state object) but alternation is tricky (after exploring one alternative we have to reverse the updates to the state object before exploring the next alternative). The second design makes many state objects, but alternation is easy (we make separate objects for each alternative).

It looks to me as though you are using the second design. For example, in the branch parser you make a copy of the state for each alternative. However, the code does not enforce the design. The State objects are mutable, and indeed we can see this mutability being used in a couple of places, for example in noconsume where output.left gets modified, or in chain where state.parsed gets modified. So the reader is going to be confused about the design: are the state objects supposed to be immutable? (If so, the mutations in noconsume and chain must be oversights.) Or are state objects supposed to be mutable? (In which case, why all the calls to copy? Why not just modify the original?)

I would recommend enforcing the design principle at the code level so that you (and your users) can't accidentally break the design. In this case I would make the State object immutable, using collections.namedtuple, like this:
```
from collections import namedtuple
class State(namedtuple('State', 'left value parsed')):
 """State of a parser, with attributes:
 left: str -- substring of input remaining to be parsed
 value -- arbitrary value constructed by parser
 parsed: str -- substring of input that was parsed to produce value
 """
 __slots__ = ()
 def __new__(cls, left, value=None, parsed=''):
 return super().__new__(cls, left, value, parsed)
```
(The use of __slots__ = () is recommended by the Python documentation to "keep memory requirements low by preventing the creation of instance dictionaries.")

Now that State objects are immutable, we have to call set in all the places where attributes were previously assigned. There are not too many of these. In chain, we need:
```
return state.set(parsed=''.join(pieces))
```
In noconsume we need:
```
return parser(state).set(left=state.left)
```
And there are a few instances in parsers.py.

But the big benefit is that we can now drop the copy method and all calls to it.
Using collections.namedtuple has a couple of extra benefits. First, we don't need to implement our own __eq__ method, as namedtuple already implements equality. Second, we can use namedtuple's _replace method instead of writing our own set method.
Is the deepcopy method necessary, and does it even make sense? Surely it depends on what the caller puts in the value field — you can imagine cases in which the value field is immutable (for example, it's a parse tree) and so deepcopy would be pointless.

If we look for uses of the deepcopy method, there's only one use use, in one of the test cases, where value is a dictionary. But I think even in that case a deep copy is not necessary — we only need a shallow copy of the dictionary itself, not copies of its keys and values too.

Only the caller knows what's in the value field, so it would be better, I think, not to offer the deepcopy method, and let the caller use:
```
state._replace(value=state.value.copy())
```
or:
```
state._replace(value=deepcopy(state.value))
```
depending on what kind of copying they need (if any).
The docstring for blank neglects to mention that the value field is copied as well as the left field.
The blank method is only used in one place, so is it really needed? It is only a little longer to write state._replace(parsed='') instead of state.blank(), and the former is more explicit.

Also, I'm not convinced that blanking parsed is even the right thing to do — it doesn't seem to respect any invariant of the State class, as discussed in §1 above. The only use of blank is in absorb, and the only use of absorb is in a test case, so could there be a different approach to implementing the test that didn't rely on blanking parsed?
The consume method could be simplified:
```
left = self.left
return self._replace(left=left[how_many:], parsed=left[0:how_many])
```
and the split method too (though I don't know how to rewrite split because I don't know what invariant it needs to respect; see §1 above).
Python doesn't have memory-efficient representation of slices of strings: that is, when you write left[how_many:], Python copies out the remainder of the string to make a new string object.

This has the consequence that the State class ends up copying the input over and over again, leading to quadratic runtime performance.

To avoid this, its necessary to redesign the state class so that it does not used slices of strings. One possibility would be to store the original string, and three indexes into the string:
```
class State(namedtuple('State', 'input value start cur stop')):
 """State of a parser, with attributes:
 input: str -- the input being parsed
 value -- arbitrary value constructed by parser
 start: int -- index in input where parsing started
 cur: int -- index in input from where parsing should continue
 stop: int -- index in input where parsing must stop
 It must be the case that 0 <= start <= cur <= stop <= len(input).
 This means that parsing input[start:cur] produced value, and next
 the parser must go on to parse input[cur:stop].
 """
```
The invariant can be enforced by the __new__ method:
```
def __new__(cls, input, value=None, start=0, cur=0, stop=None):
 if stop is None:
 stop = len(input)
 assert 0 <= start <= cur <= stop <= len(input)
 return super().__new__(cls, input, value, start, cur, stop)
```
In this version of the class, the consume method would become something like:
```
return self._replace(cur=self.cur + how_many)
```
which runs in constant time. Of course, the rest of the code would have to be updated to use the new representation, but I suspect there will turn out to be simplifications — for example in chain you won't need to join the pieces, you can just copy the stop field from the last state in the chain.

Question 3

Thank you very much for the in-depth review! It really means a lot. I will wait a day to see if somebody else has some more advice, just in case, but I doubt there will be anything equally comprehensive.

Gareth Rees 50.1k3 gold badges130 silver badges211 bronze badges · Accepted Answer · 2018-02-04 16:49:12Z

I'm just going to review the State class. You'll see that there's a lot to write about here, more than enough for one review.

Understanding the State class is crucial to understanding the operation of the parser, so its docstring needs to be more detailed than "An object representing current parser state." In particular, what attributes does it have, what do they mean, and what are the relationships between them? The explanation from your question would make a good starting point:
```
class State:
 """State of a parser, with attributes:
 left: str -- substring of input remaining to be parsed
 value -- arbitrary value constructed by parser so far
 parsed: str -- substring of input that was parsed to produce value
 """
```
Do I have this documentation right? For example, if we have:
```
parse(State(a, value1, b)) → State(c, value2, d)
```
is it supposed to be the case that b + a == d + c? That was my first guess, but it looks as if this is not the case: I can see methods in various places that set parsed to the empty string after parsing some text. Clearly I have misunderstood what the parsed field is supposed to contain.

I think it's vital for you to figure out what the relationship between value and parsed is supposed to be, and document the corresponding invariant on the State class. At the moment it's impossible for me to reliably write a parser function, because I don't know what I am supposed to put in the parsed attribute of the returned state object.
There are two possible designs for the use of state objects. In the first, we have a single object representing the current state, and as each piece of the input is parsed, the state is updated. In the second, we treat the state objects as immutable, and make a new one each time we parse a piece of the input.

The first design saves memory (we only need one state object) but alternation is tricky (after exploring one alternative we have to reverse the updates to the state object before exploring the next alternative). The second design makes many state objects, but alternation is easy (we make separate objects for each alternative).

It looks to me as though you are using the second design. For example, in the branch parser you make a copy of the state for each alternative. However, the code does not enforce the design. The State objects are mutable, and indeed we can see this mutability being used in a couple of places, for example in noconsume where output.left gets modified, or in chain where state.parsed gets modified. So the reader is going to be confused about the design: are the state objects supposed to be immutable? (If so, the mutations in noconsume and chain must be oversights.) Or are state objects supposed to be mutable? (In which case, why all the calls to copy? Why not just modify the original?)

I would recommend enforcing the design principle at the code level so that you (and your users) can't accidentally break the design. In this case I would make the State object immutable, using collections.namedtuple, like this:
```
from collections import namedtuple
class State(namedtuple('State', 'left value parsed')):
 """State of a parser, with attributes:
 left: str -- substring of input remaining to be parsed
 value -- arbitrary value constructed by parser
 parsed: str -- substring of input that was parsed to produce value
 """
 __slots__ = ()
 def __new__(cls, left, value=None, parsed=''):
 return super().__new__(cls, left, value, parsed)
```
(The use of __slots__ = () is recommended by the Python documentation to "keep memory requirements low by preventing the creation of instance dictionaries.")

Now that State objects are immutable, we have to call set in all the places where attributes were previously assigned. There are not too many of these. In chain, we need:
```
return state.set(parsed=''.join(pieces))
```
In noconsume we need:
```
return parser(state).set(left=state.left)
```
And there are a few instances in parsers.py.

But the big benefit is that we can now drop the copy method and all calls to it.
Using collections.namedtuple has a couple of extra benefits. First, we don't need to implement our own __eq__ method, as namedtuple already implements equality. Second, we can use namedtuple's _replace method instead of writing our own set method.
Is the deepcopy method necessary, and does it even make sense? Surely it depends on what the caller puts in the value field — you can imagine cases in which the value field is immutable (for example, it's a parse tree) and so deepcopy would be pointless.

If we look for uses of the deepcopy method, there's only one use use, in one of the test cases, where value is a dictionary. But I think even in that case a deep copy is not necessary — we only need a shallow copy of the dictionary itself, not copies of its keys and values too.

Only the caller knows what's in the value field, so it would be better, I think, not to offer the deepcopy method, and let the caller use:
```
state._replace(value=state.value.copy())
```
or:
```
state._replace(value=deepcopy(state.value))
```
depending on what kind of copying they need (if any).
The docstring for blank neglects to mention that the value field is copied as well as the left field.
The blank method is only used in one place, so is it really needed? It is only a little longer to write state._replace(parsed='') instead of state.blank(), and the former is more explicit.

Also, I'm not convinced that blanking parsed is even the right thing to do — it doesn't seem to respect any invariant of the State class, as discussed in §1 above. The only use of blank is in absorb, and the only use of absorb is in a test case, so could there be a different approach to implementing the test that didn't rely on blanking parsed?
The consume method could be simplified:
```
left = self.left
return self._replace(left=left[how_many:], parsed=left[0:how_many])
```
and the split method too (though I don't know how to rewrite split because I don't know what invariant it needs to respect; see §1 above).
Python doesn't have memory-efficient representation of slices of strings: that is, when you write left[how_many:], Python copies out the remainder of the string to make a new string object.

This has the consequence that the State class ends up copying the input over and over again, leading to quadratic runtime performance.

To avoid this, its necessary to redesign the state class so that it does not used slices of strings. One possibility would be to store the original string, and three indexes into the string:
```
class State(namedtuple('State', 'input value start cur stop')):
 """State of a parser, with attributes:
 input: str -- the input being parsed
 value -- arbitrary value constructed by parser
 start: int -- index in input where parsing started
 cur: int -- index in input from where parsing should continue
 stop: int -- index in input where parsing must stop
 It must be the case that 0 <= start <= cur <= stop <= len(input).
 This means that parsing input[start:cur] produced value, and next
 the parser must go on to parse input[cur:stop].
 """
```
The invariant can be enforced by the __new__ method:
```
def __new__(cls, input, value=None, start=0, cur=0, stop=None):
 if stop is None:
 stop = len(input)
 assert 0 <= start <= cur <= stop <= len(input)
 return super().__new__(cls, input, value, start, cur, stop)
```
In this version of the class, the consume method would become something like:
```
return self._replace(cur=self.cur + how_many)
```
which runs in constant time. Of course, the rest of the code would have to be updated to use the new representation, but I suspect there will turn out to be simplifications — for example in chain you won't need to join the pieces, you can just copy the stop field from the last state in the chain.

Thank you very much for the in-depth review! It really means a lot. I will wait a day to see if somebody else has some more advice, just in case, but I doubt there will be anything equally comprehensive.

Stack Exchange Network

Python - basis of a parser library

1 Answer 1

You must log in to answer this question.

Linked

Hot Network Questions

Python - basis of a parser library

1 Answer 1

You must log in to answer this question.

Linked

Related

Hot Network Questions