This is a followup of sorts to this question: link. 'Of sorts' because, as pointed out in the comments, this question should have been asked first.
I'm writing a parser library in Python, mostly for fun, but I do plan to use it in some of my projects. I would like to know if the foundation of this library is sound: whether it is Pythonic, performant and provides a decent API.
The entire code can be seen in this repo: link, in the frozen branch review-04-02-2018
, file core.py
. I'll post relevant portions below.
The library considers a parser any callable that takes a single State
object and returns a new one. A State
object's purpose is to keep track of three things: input that's left to parse, the portion of input that was consumed by a parser and the value of a parser chain (the idea is that a parser chain would construct some object as it goes along). Here's the State
class:
class State():
""" An object representing current parser state. """
def __init__(self, left, value=None, parsed=""):
self.value = value
self.left = left
self.parsed = parsed
def __eq__(self, other):
return (self.value == other.value and
self.left == other.left and
self.parsed == other.parsed)
def __repr__(self):
return f"State({repr(self.value)}, {repr(self.left[0:40])}, {repr(self.parsed)})"
def copy(self):
""" Shallow copy the State object. """
return State(self.left, self.value, self.parsed)
def deepcopy(self):
""" Deep copy the State object. """
return deepcopy(self)
def set(self, **kwargs):
""" Return a new State object with given attributes. """
left = kwargs.get("left", self.left)
value = kwargs.get("value", self.value)
parsed = kwargs.get("parsed", self.parsed)
return State(left, value, parsed)
def consume(self, how_many):
""" Return a new State object with 'how_many' characters consumed. """
return State(self.left[how_many:], self.value, self.left[0:how_many])
def blank(self):
"""
Return a new State object with the same 'left' but with None 'parsed'
value.
"""
return State(self.left)
def split(self, at):
"""
Split the State object in two. Return a tuple with two State objects,
the first will have 'left' up to, but not including, index 'at', the
second - starting with 'at' and until the end.
"""
first = self.copy()
first.left = self.left[:at]
second = self.copy()
second.left = self.left[at:]
return first, second
Most are, hopefully, self-explanatory. I have doubts about set
, though: the purpose is to be able to do things like lambda s: s.set(parsed="foo, left="baz", value=None)
to compensate for inability to assign things in lambdas. Could it be written cleaner?
There are also two exception classes, to signal when parsing has failed or is stopped prematurely, respectively. Here are they:
class ParsingFailure(Exception):
""" An exception of this type should be thrown if parsing fails. """
pass
class ParsingEnd(Exception):
"""
An exception of this type should be thrown if parsing ends successfully,
but early.
"""
def __init__(self, state):
super().__init__()
self.state = state
All of the above things are used in the main function of the library, parse
:
def parse(state_or_string, parser, verbose=False):
"""
Run a given parser on a given state object or a string.
Return parser's return value on success, or None on failure.
If 'verbose' is truthy, return terminating ParsingFailure exception on
failure instead of None.
"""
if isinstance(state_or_string, str):
state = State(state_or_string)
else:
state = state_or_string
try:
return parser(state)
except ParsingFailure as failure:
if verbose:
return failure
return None
except ParsingEnd as end:
return end.state
There is a problem with this, namely that it's impossible to tell apart None
returned from a failure and None
legitimately produced by a parser, but I feel in most cases None
is a pretty good indicator of a failure. This raises the question: should I somehow separate these two cases?
1 Answer 1
I'm just going to review the State
class. You'll see that there's a lot to write about here, more than enough for one review.
Understanding the
State
class is crucial to understanding the operation of the parser, so its docstring needs to be more detailed than "An object representing current parser state." In particular, what attributes does it have, what do they mean, and what are the relationships between them? The explanation from your question would make a good starting point:class State: """State of a parser, with attributes: left: str -- substring of input remaining to be parsed value -- arbitrary value constructed by parser so far parsed: str -- substring of input that was parsed to produce value """
Do I have this documentation right? For example, if we have:
parse(State(a, value1, b)) → State(c, value2, d)
is it supposed to be the case that
b + a == d + c
? That was my first guess, but it looks as if this is not the case: I can see methods in various places that setparsed
to the empty string after parsing some text. Clearly I have misunderstood what theparsed
field is supposed to contain.I think it's vital for you to figure out what the relationship between
value
andparsed
is supposed to be, and document the corresponding invariant on theState
class. At the moment it's impossible for me to reliably write a parser function, because I don't know what I am supposed to put in theparsed
attribute of the returned state object.There are two possible designs for the use of state objects. In the first, we have a single object representing the current state, and as each piece of the input is parsed, the state is updated. In the second, we treat the state objects as immutable, and make a new one each time we parse a piece of the input.
The first design saves memory (we only need one state object) but alternation is tricky (after exploring one alternative we have to reverse the updates to the state object before exploring the next alternative). The second design makes many state objects, but alternation is easy (we make separate objects for each alternative).
It looks to me as though you are using the second design. For example, in the
branch
parser you make a copy of the state for each alternative. However, the code does not enforce the design. TheState
objects are mutable, and indeed we can see this mutability being used in a couple of places, for example innoconsume
whereoutput.left
gets modified, or inchain
wherestate.parsed
gets modified. So the reader is going to be confused about the design: are the state objects supposed to be immutable? (If so, the mutations innoconsume
andchain
must be oversights.) Or are state objects supposed to be mutable? (In which case, why all the calls tocopy
? Why not just modify the original?)I would recommend enforcing the design principle at the code level so that you (and your users) can't accidentally break the design. In this case I would make the
State
object immutable, usingcollections.namedtuple
, like this:from collections import namedtuple class State(namedtuple('State', 'left value parsed')): """State of a parser, with attributes: left: str -- substring of input remaining to be parsed value -- arbitrary value constructed by parser parsed: str -- substring of input that was parsed to produce value """ __slots__ = () def __new__(cls, left, value=None, parsed=''): return super().__new__(cls, left, value, parsed)
(The use of
__slots__ = ()
is recommended by the Python documentation to "keep memory requirements low by preventing the creation of instance dictionaries.")Now that
State
objects are immutable, we have to callset
in all the places where attributes were previously assigned. There are not too many of these. Inchain
, we need:return state.set(parsed=''.join(pieces))
In
noconsume
we need:return parser(state).set(left=state.left)
And there are a few instances in
parsers.py
.But the big benefit is that we can now drop the
copy
method and all calls to it.Using
collections.namedtuple
has a couple of extra benefits. First, we don't need to implement our own__eq__
method, asnamedtuple
already implements equality. Second, we can usenamedtuple
's_replace
method instead of writing our ownset
method.Is the
deepcopy
method necessary, and does it even make sense? Surely it depends on what the caller puts in thevalue
field — you can imagine cases in which thevalue
field is immutable (for example, it's a parse tree) and sodeepcopy
would be pointless.If we look for uses of the
deepcopy
method, there's only one use use, in one of the test cases, wherevalue
is a dictionary. But I think even in that case a deep copy is not necessary — we only need a shallow copy of the dictionary itself, not copies of its keys and values too.Only the caller knows what's in the
value
field, so it would be better, I think, not to offer thedeepcopy
method, and let the caller use:state._replace(value=state.value.copy())
or:
state._replace(value=deepcopy(state.value))
depending on what kind of copying they need (if any).
The docstring for
blank
neglects to mention that thevalue
field is copied as well as theleft
field.The
blank
method is only used in one place, so is it really needed? It is only a little longer to writestate._replace(parsed='')
instead ofstate.blank()
, and the former is more explicit.Also, I'm not convinced that blanking
parsed
is even the right thing to do — it doesn't seem to respect any invariant of theState
class, as discussed in §1 above. The only use ofblank
is inabsorb
, and the only use ofabsorb
is in a test case, so could there be a different approach to implementing the test that didn't rely on blankingparsed
?The
consume
method could be simplified:left = self.left return self._replace(left=left[how_many:], parsed=left[0:how_many])
and the
split
method too (though I don't know how to rewritesplit
because I don't know what invariant it needs to respect; see §1 above).Python doesn't have memory-efficient representation of slices of strings: that is, when you write
left[how_many:]
, Python copies out the remainder of the string to make a new string object.This has the consequence that the
State
class ends up copying the input over and over again, leading to quadratic runtime performance.To avoid this, its necessary to redesign the state class so that it does not used slices of strings. One possibility would be to store the original string, and three indexes into the string:
class State(namedtuple('State', 'input value start cur stop')): """State of a parser, with attributes: input: str -- the input being parsed value -- arbitrary value constructed by parser start: int -- index in input where parsing started cur: int -- index in input from where parsing should continue stop: int -- index in input where parsing must stop It must be the case that 0 <= start <= cur <= stop <= len(input). This means that parsing input[start:cur] produced value, and next the parser must go on to parse input[cur:stop]. """
The invariant can be enforced by the
__new__
method:def __new__(cls, input, value=None, start=0, cur=0, stop=None): if stop is None: stop = len(input) assert 0 <= start <= cur <= stop <= len(input) return super().__new__(cls, input, value, start, cur, stop)
In this version of the class, the
consume
method would become something like:return self._replace(cur=self.cur + how_many)
which runs in constant time. Of course, the rest of the code would have to be updated to use the new representation, but I suspect there will turn out to be simplifications — for example in
chain
you won't need to join the pieces, you can just copy thestop
field from the last state in the chain.
-
2\$\begingroup\$ Thank you very much for the in-depth review! It really means a lot. I will wait a day to see if somebody else has some more advice, just in case, but I doubt there will be anything equally comprehensive. \$\endgroup\$Michail– Michail2018年02月04日 17:18:36 +00:00Commented Feb 4, 2018 at 17:18