Which Python tool can you recommend to parse programming languages? It should allow for a readable representation of the language grammar inside the source. It should also be able to scale to complicated languages (something with a grammar as complex as e.g. Python itself).
Bonus points for good error reporting and source code locations attached to syntax tree elements.
-
I think you will find the problem of defining the programming languages to be the hard part of your task. (If you want to parse Python, I'm sure you can get that off the shelf in Python). Parsing Java >=1.5 will be harder. Parsing C++ will be very difficult; wait till you get to C++11x. And you can't do much unless you do name and type resolution ("build symbol tables") after your parse. There's a lot more work here than you might guess. If your task is manipulating programming languages, you might consider a tool that can do this already, rather than trying to roll your own. – Ira Baxter, Jul 6, 2011 at 17:37
-
Similar to: stackoverflow.com/questions/2945357/… and: stackoverflow.com/questions/1547782/mini-languages-in-python – 0 _, Jul 24, 2013 at 16:58
9 Answers
I really like pyPEG. Its error reporting isn't very friendly, but it can add source code locations to the AST.
pyPEG doesn't have a separate lexer, which would make parsing Python itself hard (I think CPython recognises indent and dedent in the lexer), but I've used pyPEG to build a parser for a subset of C# with surprisingly little work.
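To illustrate the point about indentation being handled at the lexer level, here is a small sketch using the standard library's tokenize module (not pyPEG). It only shows that Python's own tokenizer emits INDENT and DEDENT tokens; the sample source string is made up for the demonstration.

```python
import io
import tokenize

# A made-up snippet with one indented block.
source = "if x:\n    y = 1\nz = 2\n"

# generate_tokens takes a readline callable and yields TokenInfo tuples.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Keep only the layout tokens, together with the (line, column) where each starts.
layout = [(tokenize.tok_name[tok.type], tok.start)
          for tok in tokens
          if tok.type in (tokenize.INDENT, tokenize.DEDENT)]
print(layout)
```

A scannerless PEG parser has to reproduce this bookkeeping inside the grammar or in a preprocessing pass, which is why indentation-sensitive languages are awkward for it.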
An example adapted from fdik.org/pyPEG/: A simple language like this:
function fak(n) {
    if (n==0) { // 0! is 1 by definition
        return 1;
    } else {
        return n * fak(n - 1);
    };
}
A pyPEG parser for that language:
import re
from pyPEG import keyword  # assumes the original pyPEG 1.x module, not pypeg2

def comment(): return [re.compile(r"//.*"),
                       re.compile(r"/\*.*?\*/", re.S)]
def literal(): return re.compile(r'\d*\.\d*|\d+|".*?"')
def symbol(): return re.compile(r"\w+")
def operator(): return re.compile(r"\+|\-|\*|\/|\=\=")
def operation(): return symbol, operator, [literal, functioncall]
def expression(): return [literal, operation, functioncall]
def expressionlist(): return expression, -1, (",", expression)
def returnstatement(): return keyword("return"), expression
def ifstatement(): return (keyword("if"), "(", expression, ")", block,
                           keyword("else"), block)
def statement(): return [ifstatement, returnstatement], ";"
def block(): return "{", -2, statement, "}"
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"
def functioncall(): return symbol, "(", expressionlist, ")"
def function(): return keyword("function"), symbol, parameterlist, block
def simpleLanguage(): return function
3 Comments
from __future__ import unicode_literals, print_function; from pypeg2 import *; f = parse(example_string, simpleLanguage), provided that you load the above example as example_string. But that doesn't work. Also, the syntax is very different from the (current) original example on the pyPEG website. Any suggestions how to run the same code?

I would recommend that you check out my library: https://github.com/erezsh/lark
It can parse ALL context-free grammars, automatically builds an AST (with line & column numbers), and accepts the grammar in EBNF format, which is considered the standard.
It can easily parse a language like Python, and it can do so faster than any other parsing library written in Python.
3 Comments
pyPEG (a tool I authored) has a tracing facility for error reporting.
Just set pyPEG.print_trace = True and pyPEG will give you a full trace of what's happening inside.
2 Comments
For a more complicated parser I would use pyparsing.
Here is the parsing example from their home page:
from pyparsing import Word, alphas
greet = Word(alphas) + "," + Word(alphas) + "!"  # <-- grammar defined here
hello = "Hello, World!"
print(hello, "->", greet.parseString(hello))
3 Comments
ANTLR is what you should look at: http://www.antlr.org
In particular, take a look at its Python target: http://www.antlr.org/wiki/display/ANTLR3/Antlr3PythonTarget
8 Comments
Ned Batchelder did a survey of Python parsing tools, which he apparently keeps updated (last updated July 2010):
http://nedbatchelder.com/text/python-parsers.html
If I was going to need a parser today, I would either roll my own recursive descent parser, or possibly use PLY or LEPL -- depending on my needs and whether or not I was willing to introduce an external dependency. I wouldn't personally use PyParsing for anything very complicated.
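For a sense of what rolling your own involves, here is a minimal sketch of a hand-written recursive descent parser that attaches line and column numbers to every tree node and reports errors with a location. The toy grammar (expr := NUMBER | NAME | "(" expr* ")") and all names (Token, Node, ParseError, tokenize, parse) are invented for this illustration and do not come from any of the libraries mentioned here.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str   # "num", "name", "punct" or "eof"
    text: str
    line: int
    col: int

class Node(NamedTuple):
    kind: str      # "num", "name" or "list"
    value: object
    line: int      # 1-based source location of the node's first token
    col: int

class ParseError(Exception):
    pass

TOKEN_RE = re.compile(
    r"(?P<num>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<punct>[()])|(?P<ws>\s+)|(?P<bad>.)")

def tokenize(text):
    tokens, line, line_start = [], 1, 0
    for m in TOKEN_RE.finditer(text):
        kind, value = m.lastgroup, m.group()
        col = m.start() - line_start + 1
        if kind == "bad":
            raise ParseError(f"{line}:{col}: unexpected character {value!r}")
        if kind != "ws":
            tokens.append(Token(kind, value, line, col))
        # Only whitespace can contain newlines; update the line bookkeeping.
        if "\n" in value:
            line += value.count("\n")
            line_start = m.start() + value.rfind("\n") + 1
    tokens.append(Token("eof", "", line, len(text) - line_start + 1))
    return tokens

def parse(text):
    tokens = tokenize(text)
    pos = 0

    def expr():
        nonlocal pos
        tok = tokens[pos]
        if tok.kind == "num":
            pos += 1
            return Node("num", int(tok.text), tok.line, tok.col)
        if tok.kind == "name":
            pos += 1
            return Node("name", tok.text, tok.line, tok.col)
        if tok.text == "(":
            pos += 1
            items = []
            while tokens[pos].text != ")":
                if tokens[pos].kind == "eof":
                    raise ParseError(f"{tok.line}:{tok.col}: '(' is never closed")
                items.append(expr())
            pos += 1  # consume ")"
            return Node("list", items, tok.line, tok.col)
        what = tok.text or "end of input"
        raise ParseError(f"{tok.line}:{tok.col}: unexpected {what!r}")

    tree = expr()
    if tokens[pos].kind != "eof":
        extra = tokens[pos]
        raise ParseError(f"{extra.line}:{extra.col}: unexpected {extra.text!r}")
    return tree

tree = parse("(add 1\n  (mul x 2))")
print(tree.kind, tree.line, tree.col)
```

Each grammar rule becomes one function, so the grammar stays readable in the source, but precedence, error recovery and performance all become your own problem as the language grows.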
2 Comments
PLY is the preferred solution after possibly prototyping with a lighter and higher level approach like PyParsing. The documentation of PLY is really good, error reporting is good, and it has a robust plain structure of defining the grammar. Even with packrat enabled, PyParsing can be orders of magnitude slower than PLY.

For simple tasks I tend to use the shlex module.
See http://wiki.python.org/moin/LanguageParsing for an evaluation of language parsing tools in Python.
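As a rough idea of what shlex covers, the sketch below splits a shell-like command line and then uses the shlex class as a tiny lexer. The input strings are made up; shlex only tokenizes, so any grammar on top of the tokens is still up to you.

```python
import shlex

# A made-up command line with a quoted argument and a trailing comment.
command = 'copy "my file.txt" backup/ --mode=fast  # trailing comment'

# shlex.split applies POSIX shell quoting rules; comments=True drops "#..." text.
args = shlex.split(command, comments=True)
print(args)

# The shlex class can also serve as a simple lexer: characters that are not
# word characters (such as "=" and ";") come back as single-character tokens.
lexer = shlex.shlex("width = 80; title = 'hello world'", posix=True)
tokens = list(lexer)
print(tokens)
```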
2 Comments
If you're evaluating PyParsing, I think you should look at funcparserlib: http://pypi.python.org/pypi/funcparserlib
It's a bit similar, but in my experience the resulting code is much cleaner.
Comments
Antlr generates LL(*) parsers. That can be good, but sometimes removing all left recursion can be cumbersome.
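To make the left recursion issue concrete: a rule such as expr : expr '-' NUMBER | NUMBER would make a top-down parser call itself forever, so it is usually rewritten as expr : NUMBER ('-' NUMBER)*. The sketch below is plain hand-written Python, not ANTLR output, and the names parse_expr and evaluate are invented for the illustration. It shows the rewritten rule as a loop that still builds a left-associative tree.

```python
import re

def parse_expr(text):
    """Parse  expr : NUMBER ('-' NUMBER)*  into a left-nested tuple tree."""
    # Simplification: characters other than digits and '-' are silently skipped.
    tokens = re.findall(r"\d+|-", text)
    pos = 0

    def number():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"expected a number at token {pos}")
        value = int(tokens[pos])
        pos += 1
        return value

    tree = number()
    # The loop replaces the left-recursive alternative; folding the result into
    # the left operand each time keeps subtraction left-associative.
    while pos < len(tokens) and tokens[pos] == "-":
        pos += 1
        tree = ("-", tree, number())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at token {pos}")
    return tree

def evaluate(tree):
    if isinstance(tree, int):
        return tree
    _, left, right = tree
    return evaluate(left) - evaluate(right)

tree = parse_expr("10 - 3 - 2")
print(tree, evaluate(tree))
```

With one operator this is painless; the cumbersome part is doing the same rewrite consistently across a large grammar with many precedence levels.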
If you are LALR(1)-savvy, you can use PyBison. Its syntax is similar to Yacc's, and plenty of people out there already know how Yacc works.