I'm working on writing a scripting language with ANTLR and C++. This is my first actual move from ANTLR grammars into the C++ API, so I'd like to know if this would be a good way to structure the grammar (later I will be adding a tree parser or tree rewriting rules though).
grammar dyst;
options
{
language = C;
output = AST;
ASTLabelType=pANTLR3_BASE_TREE;
}
program : statement*;
statement : stopUsingNamespaceStm|usingNamespaceStm|namespaceDefineStm|functionStm|defineStm|assignStm|funcDefineStm|ifStm|whileStm|returnStm|breakStm|eventDefStm|eventCallStm|linkStm|classDefStm|exitStm|importStm|importOnceStm|directive;
namespaceDefineStm : 'namespace' ident '{' statement* '}';
usingNamespaceStm : 'using' 'namespace' ident (',' ident)* ';';
stopUsingNamespaceStm : 'stop' 'using' 'namespace' ident (',' ident)* ';';
directive : '@' directiveId argList? ';';
directiveId : ID (':' ID)*;
importOnceStm : 'import_once' expression ';';
importStm : 'import' expression ';';
exitStm : 'exit' expression? ';';
classDefStm : 'class' ident ('extends' ident (',' ident)*)? '{' (classSection|funcDefineStm|defineStm|eventDefStm)* '}';
classSection : ('public'|'private'|'protected') ':';
linkStm : 'link' ident 'to' ident (',' ident)* ';';
eventCallStm : 'call' ident (',' argList)? ';';
eventDefStm : 'event' ident '(' paramList? ')' ';';
returnStm : 'return' expression ';';
breakStm : 'break' int ';';
ifStm : 'if' '(' expression ')' '{' statement* '}';
whileStm : 'while' '(' expression ')' '{' statement* '}';
defineStm : 'global'? 'def' ident ('=' expression)? ';';
assignStm : ident '=' expression ';';
funcDefineStm : 'function' ident '(' paramList? ')' ('handles' ident (',' ident)*)? '{' statement* '}';
paramList : param (',' param)?;
param : ident ('=' expression)?;
functionStm : functionCall ';';
functionCall : ident '(' argList? ')';
argList : expression (',' expression)*;
//Expressions!
term : functionCall|value|'(' expression ')';
logic_not : ('!')* term;
bit_not : ('~')* logic_not;
urnary : '-'* bit_not;
mult : urnary (('*'|'/'|'%') urnary)*;
add : mult ('+' mult)*;
relation : add (('<='|'>='|'<'|'>') add)*;
equality : relation (('=='|'!=') relation)*;
bit_and : equality ('&' equality)*;
bit_xor : bit_and ('^' bit_and)*;
bit_or : bit_xor ('|' bit_xor)*;
logic_and : bit_or ('&&' bit_or)*;
logic_or : logic_and ('||' logic_and)*;
expression : logic_or;
value : ident|float|int|string|boolean|newObject|anonFunc|null_val;
anonFunc : 'function' '(' paramList? ')' '{' statement* '}';
newObject : 'new' ident ('(' argList ')')?;
ident : ID (('.'|'::') ID)*;
float : FLOAT;
int : INTEGER;
string : STRING_DOUBLE|STRING_SINGLE;
boolean : BOOL;
null_val : NULL_VAL;
FLOAT : INTEGER '.' INTEGER;
INTEGER : DIGIT+;
BOOL : 'true'|'false';
NULL_VAL : 'null'|'NULL';
STRING_DOUBLE : '"' .* '"';
STRING_SINGLE : '\'' .* '\'';
ID : (LETTER|'_') (LETTER|DIGIT|'_')*;
fragment DIGIT : '0'..'9';
fragment LETTER : 'a'..'z'|'A'..'Z';
NEWLINE : ('\n'|'\r'|'\t'|' ')+ {$channel = HIDDEN;};
COMMENT : '#' .* '\r'? '\n' {$channel = HIDDEN;};
MULTI_COMMENT : '/-' .* '-/' {$channel = HIDDEN;};
If you are wondering about exactly what it is I'm using this for, you can take a look here.
1 Answer
The grammar itself is pretty unreadable "as is". A rule like:
statement : stopUsingNamespaceStm|usingNamespaceStm|namespaceDefineStm|functionStm|defineStm|assignStm|funcDefineStm|ifStm|whileStm|returnStm|breakStm|eventDefStm|eventCallStm|linkStm|classDefStm|exitStm|importStm|importOnceStm|directive;
would be far more readable when declared like this:
statement
  :  stopUsingNamespaceStm
  |  usingNamespaceStm
  |  namespaceDefineStm
  |  functionStm
  |  defineStm
  |  assignStm
  |  funcDefineStm
  |  ifStm
  |  whileStm
  |  returnStm
  |  breakStm
  |  eventDefStm
  |  eventCallStm
  |  linkStm
  |  classDefStm
  |  exitStm
  |  importStm
  |  importOnceStm
  |  directive
  ;
You'll want to explicitly end the entry point of your parser, the rule program, with the end-of-file token; otherwise your parser might stop parsing prematurely. With EOF, you force the parser to read the entire token stream.
program : statement* EOF ;
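To illustrate why the EOF anchor matters, here is a minimal hand-written C++ sketch. It is not ANTLR-generated code: the names `Tok`, `parseStatements`, `parseProgramWithoutEof` and `parseProgramWithEof` are hypothetical, and a statement is reduced to `Id Semi` purely for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical token kinds for this sketch; not ANTLR's generated constants.
enum class Tok { Id, Semi, Junk, Eof };

// Mimics `statement*`, with a statement simplified to `Id Semi`.
// Consumes as many statements as it can and returns the position it stopped at.
std::size_t parseStatements(const std::vector<Tok>& toks, std::size_t pos) {
    while (pos + 1 < toks.size() &&
           toks[pos] == Tok::Id && toks[pos + 1] == Tok::Semi) {
        pos += 2;
    }
    return pos;
}

// Like `program : statement*;` - succeeds as soon as the loop stops,
// even if unparsed tokens are left over.
bool parseProgramWithoutEof(const std::vector<Tok>& toks) {
    parseStatements(toks, 0);
    return true;
}

// Like `program : statement* EOF;` - additionally requires that the
// token following the last statement is end-of-file.
bool parseProgramWithEof(const std::vector<Tok>& toks) {
    const std::size_t pos = parseStatements(toks, 0);
    return pos < toks.size() && toks[pos] == Tok::Eof;
}
```

For the token sequence `Id Semi Junk Id Semi Eof`, the first variant reports success after one statement and silently ignores the rest, while the second rejects the input because `Junk` is not end-of-file.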
Make explicit tokens for keywords; don't mix literal keywords into your parser rules.
Instead of:
importStm : 'import' expression ';' ;
it's better to do:
importStm : Import expression ';' ;
Import : 'import' ;
This will make your life easier at a later (tree walking) stage. Without explicit lexer tokens, it is unclear when debugging which tokens are actually in your tree.
Your lexer rules:
STRING_DOUBLE : '"' .* '"';
STRING_SINGLE : '\'' .* '\'';
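cannot contain their own delimiter: a double-quoted string can never contain a double quote, and a single-quoted string can never contain a single quote. So it's impossible to write a string literal that has both a double and a single quote in it.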
Better to do something like this:
STRING_DOUBLE : '"' ('\\' ('\\' | '"') | ~('\\' | '"'))* '"' ;
which will allow a double-quoted string to contain escaped double quotes (\") as well.
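To check which inputs the escaped rule accepts, here is a small hand-written C++ matcher that mirrors it. This is only a sketch: `matchesStringDouble` is a hypothetical helper, not part of the ANTLR runtime, and it tests a whole string rather than scanning a token stream.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` as a whole matches the lexer rule
//   '"' ( '\\' ('\\' | '"') | ~('\\' | '"') )* '"'
bool matchesStringDouble(const std::string& s) {
    if (s.size() < 2 || s.front() != '"') return false;
    std::size_t i = 1;
    while (i < s.size()) {
        const char c = s[i];
        if (c == '"') {
            // An unescaped quote is only valid as the closing delimiter.
            return i == s.size() - 1;
        }
        if (c == '\\') {
            // A backslash must be followed by a backslash or a double quote.
            if (i + 1 >= s.size()) return false;
            const char next = s[i + 1];
            if (next != '\\' && next != '"') return false;
            i += 2;
        } else {
            ++i;  // any other character is part of the string body
        }
    }
    return false;  // input ended before a closing quote was found
}
```

Under this rule `"a\"b"` is accepted and `"a"b"` is rejected. As written, the rule only recognizes the escapes `\\` and `\"`, so a backslash followed by any other character (for example `\n` spelled out as two characters) is rejected too.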
That's all I saw at first glance. I didn't look very closely, so there might be more that can be improved.
Thanks a lot, especially with the quote thing. I was having trouble with that. – Sam Bloomberg, Mar 31, 2011 at 18:04
@Sam, you're welcome. Note that the string literals now also accept line breaks. If you don't want that, do something like this:
STRING_DOUBLE : '"' ('\\' ('\\' | '"') | ~('\\' | '"' | '\r' | '\n'))* '"' ;
– user3008, Mar 31, 2011 at 18:08