
This is a retroactive follow-up (that is, logically it is a follow-up, chronologically it isn't) to this question: link.

I'm writing a parser combinator library in Python. An important feature the library must have is the ability for a parser chain to perform some lookahead, so that analogs of the regex (.*)foo are 1) useful and 2) work as expected, instead of failing because (.*) has gobbled up all the input.

I've concocted a solution for this, which seems to work based on my testing, but I want to see if it can be improved in any way. My main concerns are

  1. Correctness - are weird edge cases accommodated by my algorithm?
  2. Performance - could it be faster?

There are also a couple of minor points I would like to know about:

  1. Is simply attaching an extra field to a function, as greedy and reluctant do, acceptable, or should I wrap the modified functions in a class instead?
  2. I'm doing except ParsingEnd as end: raise end in a couple of places as a visual cue that I haven't forgotten about those exceptions. Should I drop it?

The algorithm is as follows (a worked example comes right after the list):

  1. Traverse the chain of parsers, feeding the output of one parser to the next.
  2. If a parser that should perform lookahead is encountered, start adding all parsers to a 'retry chain', whether they perform lookahead or not.
  3. If some parser fails, fall back to the last parser with lookahead and change the portion of input it's allowed to operate on.
  4. Reset input restrictions for all parsers to the right of this parser, and retry parsing from this point.
  5. If all combinations of input restrictions have been tried unsuccessfully, fail.
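
To make this concrete, here is roughly what the algorithm does for an analog of (.*)foo run against "abcfoo". The parser names take_any and literal are made up for this illustration and are not part of the library; only chain and greedy appear in the actual code, and the delta counter is introduced with _RestrictedParser below.

# Hypothetical usage; 'take_any' and 'literal' are invented names.
parser = chain([greedy(take_any()), literal("foo")])

# How the retry loop walks through input restrictions on "abcfoo":
#   attempt 1: the greedy part may see "abcfoo" (delta = 0) -> literal("foo") fails
#   attempt 2: the greedy part may see "abcfo"  (delta = 1) -> fails again
#   attempt 3: the greedy part may see "abcf"   (delta = 2) -> fails again
#   attempt 4: the greedy part may see "abc"    (delta = 3) -> literal("foo") succeeds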

Now for the code. The entire project can be seen in this repo: link, in the frozen branch review-03022018 (core file). I'll post the critical parts below.

Here's the chain function, where the lookahead driver is located. It expects an iterable of parsers as its first argument, where a parser is just a callable that takes one State object and returns a new one. If a parser fails, it's expected to raise a ParsingFailure exception. A parser is marked as having lookahead capabilities by the greedy and reluctant functions.
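
For illustration, a simplified parser obeying this contract might look like the sketch below. This is not actual library code: the way a new State is produced here (a shallow copy with parsed and left updated) is an assumption, and the real State handling may differ.

import copy

def literal(string):
    """ Simplified example parser: match an exact string at the start of input. """
    def parser(state):
        if not state.left.startswith(string):
            raise ParsingFailure("Expected {!r}".format(string))
        new_state = copy.copy(state)   # assumption: a shallow copy is enough
        new_state.parsed = string
        new_state.left = state.left[len(string):]
        return new_state
    return parser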

def chain(funcs, combine=True, stop_on_failure=False):
    """
    Create a parser that chains a given iterable of parsers together, using
    output of one parser as input for another.
    If 'combine' is truthy, combine 'parsed's of the parsers in the chain,
    otherwise use the last one.
    If 'stop_on_failure' is truthy, stop parsing instead of failing it when a
    parser in the chain raises a ParsingFailure exception. Note that this also
    includes parsers with lookahead, effectively disabling it.
    A note on using iterators in chains: if supplied 'funcs' is an iterator,
    there is a possibility of 'funcs' being exhausted on the first attempt to
    parse, with subsequent attempts silently succeeding because that's the
    default behaviour for empty 'funcs'. If you want to run your chain several
    times - be it because of lookahead or for different reasons - make sure to
    wrap the iterator in a list or some other *reusable* iterable (or call
    'reuse_iter' on it, if it comes from a function).
    """
    def res(state):
        """ A chain of parsers. """
        pieces = _CachedAppender() if combine else None
        def maybe_combine(state):
            """ Concatenate 'parsed's if 'combine' is truthy. """
            if combine:
                state.parsed = "".join(pieces)
            return state
        lookahead_chain = None
        for parser in funcs:
            if lookahead_chain is None and has_lookahead(parser):
                lookahead_chain = _CachedAppender()
            if lookahead_chain is not None or has_lookahead(parser):
                parser = _restrict(parser)
                lookahead_chain.append(parser)
            try:
                state = parser(state)
                if combine:
                    pieces.append(state.parsed)
            except ParsingEnd as end:
                raise end
            except ParsingFailure as failure:
                if stop_on_failure:
                    return maybe_combine(state)
                if lookahead_chain is None:
                    raise failure
                pos = len(lookahead_chain) - 1
                while True:
                    try_shift = _shift(lookahead_chain, pos)
                    if try_shift is None:
                        raise ParsingFailure(
                            "Failed to find a combination of inputs that allows "
                            "successful parsing")
                    _reset_chain(lookahead_chain, try_shift)
                    state, failed = _try_chain(lookahead_chain, try_shift, pieces)
                    if state is None:
                        pos = failed
                        continue
                    break
        return maybe_combine(state)
    return res

There are a couple of helpers that chain uses (directly or indirectly). The most important is _RestrictedParser, a wrapper over a parser that remembers two things: the state it received as an argument the last time it was called, and which portion of the input it is allowed to operate on:

class _RestrictedParser():
    """ A parser that only operates on a restricted portion of input. """

    def __init__(self, parser):
        self.parser = parser
        self.lookahead = get_lookahead(parser)
        self.delta = 0
        self.state_before = None

    def __call__(self, state):
        self.state_before = state
        if self.lookahead is None:
            return self.parser(state)
        if self.lookahead is Lookahead.GREEDY:
            return _subparse(state, self.parser, len(state.left) - self.delta)
        # is reluctant
        return _subparse(state, self.parser, self.delta)

    def overrestricted(self):
        """
        Return True if restrictions have reached their maximum - that is, if
        the allowed input portion has either shrunk to an empty string or
        extended beyond the bounds of the leftover input.
        """
        return self.delta > len(self.state_before.left)

    def reset(self):
        """ Reset restrictions. """
        self.delta = 0
        self.state_before = None

    def restrict_more(self):
        """ Increase restriction level on the input. """
        self.delta += 1

_subparse simply splits a State object in two at the specified position, uses the given parser on the first part, and then joins the leftovers with the second part.
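
Conceptually it behaves like the sketch below. The shallow-copy trick and the assumption that only the left attribute needs adjusting are simplifications on my part; the real implementation lives in the repository and may differ.

import copy

def _subparse(state, parser, at):
    """ Conceptual sketch: let 'parser' see only the first 'at' characters of
    the remaining input, then re-attach whatever was withheld. """
    withheld = state.left[at:]          # the part the parser must not see
    window = copy.copy(state)           # simplified; real State handling may differ
    window.left = state.left[:at]
    after = parser(window)
    after.left = after.left + withheld  # join the leftovers with the second part
    return after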

The second important thing is _shift, which attempts to restrict parsers starting at a given position and going to the left. If a parser cannot be restricted further, the next parser is also restricted (and if it also cannot be, the next one is tried, and so on). If all restriction configurations have been tried, the function returns None; otherwise it returns the index of the last restricted parser.
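
The idea is roughly the following sketch (the real implementation may differ in details, such as how lookahead-capable parsers are detected):

def _shift(parsers, from_pos):
    """ Sketch: tighten restrictions right-to-left, resetting parsers that
    cannot be restricted any further. Return the index of the parser that was
    successfully restricted, or None if every configuration is exhausted. """
    pos = from_pos
    while pos >= 0:
        parser = parsers[pos]
        if isinstance(parser, _RestrictedParser):
            parser.restrict_more()
            if not parser.overrestricted():
                return pos
            parser.reset()
        pos -= 1
    return None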

The _reset_chain helper used by chain simply resets restrictions on the parsers to the right of a given position (step 4 of the algorithm).
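
Conceptually it is as simple as the sketch below; resetting the parser at the given position itself would undo the restriction _shift just applied, so this sketch starts one position to the right (an assumption on my part about the exact boundary).

def _reset_chain(parsers, from_pos):
    """ Sketch: drop restrictions on every parser to the right of 'from_pos'. """
    for i in range(from_pos + 1, len(parsers)):
        if isinstance(parsers[i], _RestrictedParser):
            parsers[i].reset()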

Finally, _try_chain attempts to parse the state remembered by the first parser in the retry chain, using the current configuration of restrictions:

def _try_chain(parsers, from_pos, pieces):
    """
    Try to parse the state the first parser in the chain remembers.
    Return a tuple (state, index of the first parser to fail).
    In case of failure, 'state' will be None.
    Also, if 'pieces' is not None, append every parsed chunk to it, having
    first dropped every invalidated piece. If an attempt to parse fails,
    'pieces' will not be affected.
    """
    state = parsers[from_pos].state_before
    i = from_pos
    new_pieces = None if pieces is None else deque()
    for i in range(from_pos, len(parsers)):
        try:
            state = parsers[i](state)
            if pieces is not None:
                new_pieces.append(state.parsed)
        except ParsingFailure:
            return (None, i)
        except ParsingEnd as end:
            raise end
    if pieces is not None:
        # '-1' is here because the last parser does not contribute a piece, as
        # it has failed
        pieces.drop(len(parsers) - from_pos - 1)
        pieces.extend(new_pieces)
    return state, i

The rest of the helpers used are:

  1. _CachedAppender - a simple wrapper over a deque that allows faster indexing as long as the deque doesn't change very often.
  2. _restrict - wraps a parser with lookahead in _RestrictedParser and does nothing to a parser without lookahead (a sketch follows the list).
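
The latter is essentially the following (a sketch based on its description; the repository version may differ slightly):

def _restrict(parser):
    """ Sketch: wrap lookahead-capable parsers in _RestrictedParser,
    pass everything else through untouched. """
    if get_lookahead(parser) is None:
        return parser
    return _RestrictedParser(parser)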

The greedy and reluctant functions mentioned above are simple; here is greedy:

def greedy(parser):
    """ Return a greedy version of 'parser'. """
    try:
        if parser.lookahead is Lookahead.GREEDY:
            return parser
    except AttributeError:
        pass
    res = lambda state: parser(state)
    res.lookahead = Lookahead.GREEDY
    return res
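
reluctant is analogous, differing only in the Lookahead value it checks and sets; sketched here for completeness (the repository version may differ slightly):

def reluctant(parser):
    """ Return a reluctant version of 'parser' (assumed to mirror 'greedy'). """
    try:
        if parser.lookahead is Lookahead.RELUCTANT:
            return parser
    except AttributeError:
        pass
    res = lambda state: parser(state)
    res.lookahead = Lookahead.RELUCTANT
    return res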

I do not want to pollute the post with extraneous code, so I haven't posted the code of everything I mention; if anything needs to be added, please tell me.

asked Feb 3, 2018 at 18:10
  • This question asks us to review the most complicated part of your parsing system. This approach isn't likely to get the best results out of Code Review. First, in order to review this code, we would need a thorough understanding of the foundation of your parsing system, but no-one here is likely to have such knowledge or expertise. Second, until we've reviewed the foundation, we can't be sure that the superstructure is on a firm footing. (cont...) Commented Feb 4, 2018 at 12:37
  • For better results, start by asking us to review the foundation of the parser: the State class, the parse function, and so on. (There's much that could be improved in these parts of the system.) Once we've reviewed the foundations (and you've fixed the review points) then it would make sense to present the more complicated code for review. (cont...) Commented Feb 4, 2018 at 12:42
  • I hope this advice is not too discouraging — it looks like an interesting system, but in order to do a decent review, we have to start with the foundations, not the penthouse. Commented Feb 4, 2018 at 12:43
  • @GarethReess This makes a lot of sense, and advice seems quite solid. I'll write up a question about the foundations shortly. What would be the best thing to do with this question? Leave it as it is until the new question is answered (and code is fixed), or delete it and then write a new one? Commented Feb 4, 2018 at 12:51
  • It's up to you, but there's nothing wrong with leaving this question open for now — it is a reasonable question, and maybe I'm too pessimistic and some expert in parser combinators will show up. Commented Feb 4, 2018 at 12:57