Parsing a basic scripting language

Question 1

I'm writing a scripting language with ANTLR and C++. This is my first real move from ANTLR grammars into the C++ API, so I'd like to know whether this is a good way to structure the grammar. (Later I will be adding a tree parser or tree rewriting rules, though.)

```
grammar dyst;
options
{
 language = C;
 output = AST;
 ASTLabelType=pANTLR3_BASE_TREE;
}
program : statement*;
statement : stopUsingNamespaceStm|usingNamespaceStm|namespaceDefineStm|functionStm|defineStm|assignStm|funcDefineStm|ifStm|whileStm|returnStm|breakStm|eventDefStm|eventCallStm|linkStm|classDefStm|exitStm|importStm|importOnceStm|directive;
namespaceDefineStm : 'namespace' ident '{' statement* '}';
usingNamespaceStm : 'using' 'namespace' ident (',' ident)* ';';
stopUsingNamespaceStm : 'stop' 'using' 'namespace' ident (',' ident)* ';';
directive : '@' directiveId argList? ';';
directiveId : ID (':' ID)*;
importOnceStm : 'import_once' expression ';';
importStm : 'import' expression ';';
exitStm : 'exit' expression? ';';
classDefStm : 'class' ident ('extends' ident (',' ident)*)? '{' (classSection|funcDefineStm|defineStm|eventDefStm)* '}';
classSection : ('public'|'private'|'protected') ':';
linkStm : 'link' ident 'to' ident (',' ident)* ';';
eventCallStm : 'call' ident (',' argList)? ';';
eventDefStm : 'event' ident '(' paramList? ')' ';';
returnStm : 'return' expression ';';
breakStm : 'break' int ';';
ifStm : 'if' '(' expression ')' '{' statement* '}';
whileStm : 'while' '(' expression ')' '{' statement* '}';
defineStm : 'global'? 'def' ident ('=' expression)? ';';
assignStm : ident '=' expression ';';
funcDefineStm : 'function' ident '(' paramList? ')' ('handles' ident (',' ident)*)? '{' statement* '}';
paramList : param (',' param)?;
param : ident ('=' expression)?;
functionStm : functionCall ';';
functionCall : ident '(' argList? ')';
argList : expression (',' expression)*;
//Expressions!
term : functionCall|value|'(' expression ')';
logic_not : ('!')* term;
bit_not : ('~')* logic_not;
urnary : '-'* bit_not;
mult : urnary (('*'|'/'|'%') urnary)*;
add : mult ('+' mult)*;
relation : add (('<='|'>='|'<'|'>') add)*;
equality : relation (('=='|'!=') relation)*;
bit_and : equality ('&' equality)*;
bit_xor : bit_and ('^' bit_and)*;
bit_or : bit_xor ('|' bit_xor)*;
logic_and : bit_or ('&&' bit_or)*;
logic_or : logic_and ('||' logic_and)*;
expression : logic_or;
value : ident|float|int|string|boolean|newObject|anonFunc|null_val;
anonFunc : 'function' '(' paramList? ')' '{' statement* '}';
newObject : 'new' ident ('(' argList ')')?;
ident : ID (('.'|'::') ID)*;
float : FLOAT;
int : INTEGER;
string : STRING_DOUBLE|STRING_SINGLE;
boolean : BOOL;
null_val : NULL_VAL;
FLOAT : INTEGER '.' INTEGER;
INTEGER : DIGIT+;
BOOL : 'true'|'false';
NULL_VAL : 'null'|'NULL';
STRING_DOUBLE : '"' .* '"';
STRING_SINGLE : '\'' .* '\'';
ID : (LETTER|'_') (LETTER|DIGIT|'_')*;
fragment DIGIT : '0'..'9';
fragment LETTER : 'a'..'z'|'A'..'Z';
NEWLINE : ('\n'|'\r'|'\t'|' ')+ {$channel = HIDDEN;};
COMMENT : '#' .* '\r'? '\n' {$channel = HIDDEN;};
MULTI_COMMENT : '/-' .* '-/' {$channel = HIDDEN;};
```

If you are wondering about exactly what it is I'm using this for, you can take a look here.

Answer

The grammar itself is pretty unreadable "as is". A rule like:
```
statement : stopUsingNamespaceStm|usingNamespaceStm|namespaceDefineStm|functionStm|defineStm|assignStm|funcDefineStm|ifStm|whileStm|returnStm|breakStm|eventDefStm|eventCallStm|linkStm|classDefStm|exitStm|importStm|importOnceStm|directive;
```
would be far more readable when declared like this:
```
statement 
 : stopUsingNamespaceStm
 | usingNamespaceStm
 | namespaceDefineStm
 | functionStm
 | defineStm
 | assignStm
 | funcDefineStm
 | ifStm
 | whileStm
 | returnStm
 | breakStm
 | eventDefStm
 | eventCallStm
 | linkStm
 | classDefStm
 | exitStm
 | importStm
 | importOnceStm
 | directive
 ;
```

You'll want to explicitly end the entry point of your parser, the rule program, with the end-of-file token; otherwise your parser might stop parsing prematurely. With EOF, you force the parser to read the entire token stream.
```
program 
 : statement* EOF
 ;
```
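
To see why the missing EOF matters, here is a minimal sketch in Python of a toy hand-written parser. It is not ANTLR-generated code, and the one-rule grammar (statement : ID ';') as well as the names tokenize and parse_program are assumptions made up for illustration. Without the EOF check, the parser quietly returns after the last statement it can match and ignores whatever follows.

```python
import re

# One token per match: an identifier, a semicolon, or any other single character.
TOKEN_RE = re.compile(r"\s*(?:(?P<ID>[A-Za-z_]\w*)|(?P<SEMI>;)|(?P<OTHER>\S))")


def tokenize(text):
    """Split text into (type, value) pairs and append an explicit EOF token."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens


def parse_program(tokens, require_eof):
    """Toy version of: program : statement* EOF? ;  statement : ID ';' ;

    Returns the number of statements matched. With require_eof=False the
    loop simply stops at the first token that cannot start a statement.
    """
    pos = 0
    count = 0
    while tokens[pos][0] == "ID" and tokens[pos + 1][0] == "SEMI":
        pos += 2
        count += 1
    if require_eof and tokens[pos][0] != "EOF":
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return count


tokens = tokenize("a; b; ??? c;")
# Without EOF: two statements are reported and the rest is silently dropped.
matched_without_eof = parse_program(tokens, require_eof=False)
```

Calling parse_program(tokens, require_eof=True) on the same input raises a SyntaxError at the first "?", which is the behaviour the trailing EOF is meant to give you.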
Make explicit tokens for keywords; don't mix them inside your parser rules.

Instead of:
```
importStm 
 : 'import' expression ';'
 ;
```
it's better to do:
```
importStm 
 : Import expression ';'
 ;
Import
 : 'import'
 ;
```
This will make your life easier at a later (tree walking) stage. Without explicit lexer tokens, it is hard to tell when debugging which tokens are actually in your tree.

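
As a rough illustration of the debugging benefit, the Python sketch below tags each keyword with a named token type, so a dump of the token stream describes itself. The function tokenize_keywords and the type names ImportOnce and Exit are hypothetical; only Import comes from the example above.

```python
import re

# Named token types for keywords, mirroring explicit lexer rules such as Import.
# ImportOnce and Exit are made-up names for this sketch.
KEYWORDS = {"import": "Import", "import_once": "ImportOnce", "exit": "Exit"}

# A word (identifier or keyword), or any other single non-space character.
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\S")


def tokenize_keywords(text):
    """Return (type, value) pairs where every keyword carries its own type name."""
    tokens = []
    for word in WORD_RE.findall(text):
        if word in KEYWORDS:
            tokens.append((KEYWORDS[word], word))
        elif word[0].isalpha() or word[0] == "_":
            tokens.append(("ID", word))
        else:
            tokens.append(("PUNCT", word))
    return tokens


dump = tokenize_keywords("import foo;")
# dump == [("Import", "import"), ("ID", "foo"), ("PUNCT", ";")]
```

A dump that says Import is easier to follow in a tree walker than an automatically numbered token type for the literal 'import'.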
Your lexer rules:
```
STRING_DOUBLE : '"' .* '"';
STRING_SINGLE : '\'' .* '\'';
```
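can never contain the quote character that delimits them, because the .* stops at the first closing quote it finds. So it's impossible to write a double-quoted string literal that contains a double quote, or a single-quoted one that contains a single quote.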

Better to do something like this:
```
STRING_DOUBLE 
 : '"' ('\\' ('\\' | '"') | ~('\\' | '"'))* '"'
 ;
```
which will allow a double-quoted string to contain escaped double quotes (and escaped backslashes) as well.
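
The difference between the two rules can be checked with regular expressions. The Python sketch below uses regex counterparts of the lexer rules, not ANTLR itself, and it assumes the .* in the original rule behaves non-greedily, as it does by default for a wildcard loop in ANTLR 3. The names NAIVE and ESCAPED are made up for the example.

```python
import re

# Counterpart of the original rule:  '"' .* '"'  (non-greedy wildcard)
NAIVE = re.compile(r'".*?"', re.DOTALL)

# Counterpart of the suggested rule:
#   '"' ('\\' ('\\' | '"') | ~('\\' | '"'))* '"'
# Either an escape sequence (\\ or \") or any character except \ and ".
ESCAPED = re.compile(r'"(?:\\[\\"]|[^\\"])*"')

# Script text containing a string literal with two escaped quotes.
source = r'"say \"hi\" now" rest'

naive_match = NAIVE.match(source).group(0)      # "say \"
escaped_match = ESCAPED.match(source).group(0)  # "say \"hi\" now"
```

The naive pattern ends the literal at the first escaped quote, while the escape-aware pattern consumes the whole literal and stops at the real closing quote.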

That's all I saw at first glance. I didn't look very closely, so there might be more that can be improved.

Comment

Thanks a lot, especially with the quote thing. I was having trouble with that.

Comment

@Sam, you're welcome. Note that the string literal now also accepts line breaks. If you don't want that, do something like this: STRING_DOUBLE : '"' ('\\' ('\\' | '"') | ~('\\' | '"' | '\r' | '\n'))* '"' ;
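
The effect of excluding '\r' and '\n' can be sketched with the same kind of regex counterparts in Python. Again, this only mirrors the lexer rules and is not ANTLR code; MULTILINE_OK and SINGLE_LINE are hypothetical names.

```python
import re

# Counterpart of the rule without the line-break exclusion.
MULTILINE_OK = re.compile(r'"(?:\\[\\"]|[^\\"])*"')

# Counterpart of the rule that also excludes '\r' and '\n' from the body.
SINGLE_LINE = re.compile(r'"(?:\\[\\"]|[^\\"\r\n])*"')

# A string literal whose body spans two lines.
text = '"first line\nsecond line"'

accepts_line_break = MULTILINE_OK.fullmatch(text) is not None  # True
rejects_line_break = SINGLE_LINE.fullmatch(text) is None       # True
```

Literals that stay on one line are accepted by both patterns, so the stricter rule only changes how multi-line strings are treated.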

user3008 · Accepted Answer · 2011-03-31 13:51:27Z
