
Any suggestions to make this code clearer, more Pythonic, or otherwise better? I'm open to changes to the design as well as the code (but I probably won't drop features or error checking, since everything still in it has proven its worth to me).

"""
Parsing with PEGs, or a minimal usable subset thereof.
Background at http://bford.info/packrat/
"""
import re
def _memo(f):
    """Return a function like f but caching its results. Its arguments
    must be hashable."""
    memos = {}
    def memoized(*args):
        try: return memos[args]
        except KeyError:
            result = memos[args] = f(*args)
            return result
    return memoized
_identifier = r'[A-Za-z_]\w*'
def Parser(grammar, **actions):
    r"""Make a parsing function from a PEG grammar. You supply the
    grammar as a string of rules like "a = b c | d". All the tokens
    making up the rules must be whitespace-separated. Each token
    (besides '=' and '|') is a regex, a rule name, or an action
    name. (Possibly preceded by '!' for negation: !foo successfully
    parses when foo *fails* to parse.)

    A regex token is either /<chars>/ or any non-identifier; an
    identifier that's not a defined rule or action name is an
    error. (So, an incomplete grammar gets you a BadGrammar exception
    instead of a wrong parse.)

    Results get added by regex captures and transformed by actions.
    (Use keyword arguments to bind the action names to functions.)

    The parsing function maps a string to a results tuple or raises
    Unparsable. (It can optionally take a rule name to start from, by
    default the first in the grammar.) It doesn't necessarily match
    the whole input, just a prefix.

    >>> parse_s_expression = Parser(r'''
    ... one_expr = _ expr !.
    ... _ = \s*
    ... expr = \( _ exprs \) _ hug
    ...      | ([^()\s]+) _
    ... exprs = expr exprs
    ...       | ''', hug = lambda *vals: vals)
    >>> parse_s_expression(' (hi (john mccarthy) (()))')
    (('hi', ('john', 'mccarthy'), ((),)),)
    >>> parse_s_expression('(too) (many) (exprs)')
    Traceback (most recent call last):
    Unparsable: ('one_expr', '(too) ', '(many) (exprs)')
    """
    parts = re.split(' ('+_identifier+') += ', ' '+re.sub(r'\s', ' ', grammar))
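    # parts is ['', lhs1, rhs1, lhs2, rhs2, ...]: whatever preceded the
    # first rule name, then alternating rule names and right-hand sides
    # (the names are kept because of the capturing group in re.split).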
    if not parts: raise BadGrammar("No grammar")
    if parts[0].strip(): raise BadGrammar("Missing left hand side", parts[0])
    if len(set(parts[1::2])) != len(parts[1::2]):
        raise BadGrammar("Multiply-defined rule(s)", grammar)
    rules = dict((lhs, [alt.split() for alt in (' '+rhs+' ').split(' | ')])
                 for lhs, rhs in zip(parts[1::2], parts[2::2]))
    return lambda text, rule=parts[1]: _parse(rules, actions, rule, text)
class BadGrammar(Exception): pass
class Unparsable(Exception): pass
def attempt(parse, *args, **kwargs):
    "Call a parser, but return None on failure instead of raising Unparsable."
    try: return parse(*args, **kwargs)
    except Unparsable: return None
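# For example, with parse_s_expression from the doctest above,
# attempt(parse_s_expression, '(((') returns None where a direct call
# would raise Unparsable.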
def _parse(rules, actions, rule, text):
    # Each function takes a position pos (and maybe a values tuple
    # vals) and returns either (far, pos1, vals1) on success or (far,
    # None, ignore) on failure (where far is the rightmost position
    # reached in the attempt).
    @_memo
    def parse_rule(name, pos):
        farthest = pos
        for alternative in rules[name]:
            pos1, vals1 = pos, ()
            for token in alternative:
                far, pos1, vals1 = parse_token(token, pos1, vals1)
                farthest = max(farthest, far)
                if pos1 is None: break
            else: return farthest, pos1, vals1
        return farthest, None, ()
    def parse_token(token, pos, vals):
        if re.match(r'!.', token):
            _, pos1, _ = parse_token(token[1:], pos, vals)
            return pos, pos if pos1 is None else None, vals
        elif token in rules:
            far, pos1, vals1 = parse_rule(token, pos)
            return far, pos1, pos1 is not None and vals + vals1
        elif token in actions:
            f = actions[token]
            if hasattr(f, 'is_peg'): return f(text, pos, vals)
            else: return pos, pos, (f(*vals),)
        else:
            if re.match(_identifier+'$', token):
                raise BadGrammar("Missing rule", token)
            if re.match(r'/.+/$', token): token = token[1:-1]
            m = re.match(token, text[pos:])
            if m: return pos + m.end(), pos + m.end(), vals + m.groups()
            else: return pos, None, ()
    far, pos, vals = parse_rule(rule, 0)
    if pos is None: raise Unparsable(rule, text[:far], text[far:])
    else: return vals
# Some often-used actions:
def hug(*xs): return xs
def join(*strs): return ''.join(strs)
def position(text, pos, vals):
    "A peglet action: always succeed, producing the current position."
    return pos, pos, vals + (pos,)
position.is_peg = True
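
# For illustration, position can be bound like any other action; this
# sketch (the parse_words grammar is illustrative only, not part of the
# library above) records the input position reached after each word:
parse_words = Parser(r'''
words = word words
      |
word  = _ (\w+) position
_     = \s*
''', position=position)

# parse_words('hi there') == ('hi', 2, 'there', 8): each captured word
# followed by the position reached after matching it.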

One particular point I'm not sure about: maybe it should use the built-in SyntaxError exception? That seems to be meant for errors in Python syntax, but I'm not sure what's handiest.

A weakness: some functions use pos, vals, pos1, and vals1; while I think this is reasonable for locals that are related the way these are, maybe there are better names?

From https://github.com/darius/peglet

asked Nov 26, 2012 at 20:49

1 Answer


Just last week, I was wondering what you were up to lately. :-)

I've been working on nice PEG parsing for Python lately. I've written an adaptation of OMeta, named Parsley. So far I haven't implemented regex tokens.

I wrote a custom ParseError exception class. I definitely think that the builtin SyntaxError should only be used for actual-Python syntax errors.

Parser reads a bit densely to me, both because of the regexes and because of the lack of newlines after colons.

Otherwise it looks reasonable for your goals. I ended up wrapping memoization and location tracking into an InputStream object rather than passing indices around directly; this was partly because I wanted to be able to apply Parsley rules to iterables other than strings.
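
In rough outline, the idea looks something like this (a sketch of the approach only, not Parsley's actual code):

class InputStream(object):
    """Bundle the input, a position, and a memo table into one object, so
    parse functions hand streams around instead of raw indices. (A rough
    sketch of the idea, not Parsley's real implementation.)"""
    def __init__(self, data, position=0):
        self.data = data            # any indexable sequence, not just a string
        self.position = position
        self.memo = {}              # rule name -> cached (value, rest of input)
        self._tail = None
    def head(self):
        "The element at this position (IndexError at the end of the input)."
        return self.data[self.position]
    def tail(self):
        "The stream one element further on, cached so its memo table persists."
        if self._tail is None:
            self._tail = InputStream(self.data, self.position + 1)
        return self._tail

A rule application can then consult stream.memo before reparsing, and error reporting can read the position straight off whichever stream got furthest.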

answered Nov 26, 2012 at 22:41
  • Hi Allen! Thanks, I've reformatted Parser a bit (at github). Commented Nov 27, 2012 at 0:32
  • I came across Parsley (and was reminded of PyMeta) after starting this -- it's on the to-read list now. The InputStream sounds like a good idea. At the moment I'm trying to figure out if a fancier library ought to be redesigned around tree parsing (which my fancier version of this can do, but as kind of an afterthought). Anyway, it tends to be more educational (if more time-consuming) to try to reinvent the wheel before studying others' and then seeing how they made it better. Commented Nov 27, 2012 at 0:37
  • I've got some prototype code for tree parsing using Parsley in development now. I'll trot it out in a few days. :) Commented Nov 27, 2012 at 5:45
