6.1Tokens
6.3Parsing
8.18
top
← prev up next →

amb-parser: Parser generator for ambiguous grammarsπŸ”— i

Ruslan Popov

1IntroductionπŸ”— i

This is a very simple parser generator that mimics brag (and actually its built on top of it). It is designed for parsing English text.

Its main features are:
  • Ambiguous grammar support. Parser generates all possible derivation trees.

  • Support for tokens that have multiple types attached. For example, the word cook in English may be either a verb or a noun.

I don’t know whether the implementation works correctly.

2Parser languageπŸ”— i

The language for amb-parser is similar to one in brag, but with several differences:
  • Semicolon is required at the end of a production.

  • Terminals are written in lowercase, non-terminals are in uppercase.

Note: the parser does not support left-recursion. (Yet Another Parser that Doesn’t Support Left Recursion, yapdslr).

Example:
S:NPVP;
NP:determiner?adjective*noun;
VP:verbNP;

The parser supports ONLY:
  • Rule combinations:

    S : NP VP;

  • Optional rule:

    NP : determiner? noun;

  • Kleene star rule:

    NP : adjective* noun;

  • Alteration on the top level of an expansion:

    A : b | c;

Note: the parser does not support grouping.

Note: you are allowed to combine question and star operators only in the order: rule*?.

2.1Formal parser language definitionπŸ”— i

 
parser-syntax:rule+
 
rule:NON-TERMINALCOLONalterationSEMICOLON
 
alteration:expansion(ALTERATIONexpansion)*
expansion:(non-terminal|terminal)*
 
non-terminal:NON-TERMINAL[STAR][QUESTION]
terminal:TERMINAL[STAR][QUESTION]

3Parser interfaceπŸ”— i

  • Firstly, import amb-parser in your Racket module.

  • Then create a file with

    #lang amb-parser

    line and with your grammar and require it in your Racket module.

You will be provided with the parse function. You can use this function to parse your text. But before, your text should be separated into tokens.

For more details, refer to the parse documentation.

4Usage exampleπŸ”— i

Let’s take the grammar in Parser interface:
S:NPVP;
NP:determiner?adjective*noun;
VP:verbNP;

For sentence "Thebigcatcatchedasmallgreymouse.", which is transformed in the list of tokens:
(list
(token "the"' (determiner))
(token "big"' (adjective))
(token "cat"' (noun))
(token "catched"' (verb))
(token "a"' (determiner))
(token "small"' (adjective))
(token "grey"' (adjective))
(token "mouse"' (noun)))

The parser will generate such a result:
(list
' (S
(NP(determiner"the")
(adjective*(adjective"big")(adjective*))
(noun"cat"))
(VP
(verb"catched")
(NP
(determiner"a")
(adjective*(adjective"small")(adjective*(adjective"grey")(adjective*)))
(noun"mouse"))))
' ()))

Note how the star rule works.

5Parse tree structureπŸ”— i

In this section we will describe the structure of the parse tree without specifying that the parser returns multiple results and every parser-result contains the rest of tokens.

  • Every rule name : expansion; will be parsed as (name expansion).

  • Every expansion rule1 rule2 will be parsed as an association list, where keys and values are organized as list with two elements, keys are rule names and values are the result of parsing the rules.

  • Every terminal terminal will be parsed as (list terminal token-str) where token-str is the string of the matched token.

If some parsing is failed then the parse returns empty (meaning there is no derivation trees and no parser results).

6Types and functionsπŸ”— i

6.1TokensπŸ”— i

struct

(struct token (strpos)
#:transparent)
str:string?
pos:(listof symbol? )
Represents a token.

pos is a list of all possible parts of speech for this token.

When parser tries to parse a terminal rule it checks whether the terminal is a member of pos list.

Note: in fact, the implementation does not really utilizes the info that the str is a string? .

procedure

( token?x)boolean?

x:any/c
Returns whether the x is a token .

procedure

( token-str token)string?

token:token?
Returns the string of the token.

procedure

( token-pos token)(listof symbol? )

token:token?
Returns all possible parts of speech for the token.

6.2Parser resultπŸ”— i

struct

(struct parser-result (datarest)
#:transparent)
data:any/c
rest:(listof token?)
Represents a single result of parsing. It consists of parsed data and the rest of tokens (uparsed part).

procedure

( parser-result? x)boolean?

x:any/c
Returns whether the x is a parser-result .

procedure

( parser-result-data res)any/c

Returns the parsed data of the res.

procedure

( parser-result-rest res)(listof token?)

Returns the uparsed tokens of the res.

6.3ParsingπŸ”— i

procedure

( parse non-terminaltokens)(listof parser-result? )

non-terminal:symbol?
tokens:(listof token?)
Parses non-terminal from the tokens list.

Parser returns a list of all possible derivation trees. The results are organized into parser-result structures.

WARNING: There is a bad design desicion. The parse function requires a list of tokens, instead of a lambda function that generates tokens on demand.

top
← prev up next →

AltStyle γ«γ‚ˆγ£γ¦ε€‰ζ›γ•γ‚ŒγŸγƒšγƒΌγ‚Έ (->γ‚ͺγƒͺγ‚ΈγƒŠγƒ«) /