I started learning compiler development, and as a pet project I decided to implement a C compiler. The first step is implementing the simplest possible lexer. I also decided to try a new programming language and chose Go. This is my first Go program, and I would be grateful for any advice. Thanks.
package lexer

import (
	"fmt"
	"unicode"
)

type tokenType int

const (
	EOF tokenType = iota
	Identifier
	NumericConstant
	IntType
	ReturnKeyword
	OpenRoundBracket
	CloseRoundBracket
	OpenCurlyBracket
	CloseCurlyBracket
	Semicolon
)

type token struct {
	tokenType tokenType
	value     string
}

type parser struct {
	code         []rune
	currentIndex int
}

// nextToken skips leading whitespace and returns the next token,
// or an EOF token when the input is exhausted.
func (p *parser) nextToken() token {
	p.skipSpaces()
	if p.isEOF() {
		return token{tokenType: EOF}
	}
	var newToken token
	r := p.currentRune()
	if unicode.IsLetter(r) {
		newToken = p.parseLetterStartToken()
	} else if unicode.IsNumber(r) {
		newToken = p.parseNumberStartToken()
	} else if r == '(' {
		newToken = token{tokenType: OpenRoundBracket, value: string(r)}
	} else if r == ')' {
		newToken = token{tokenType: CloseRoundBracket, value: string(r)}
	} else if r == '{' {
		newToken = token{tokenType: OpenCurlyBracket, value: string(r)}
	} else if r == '}' {
		newToken = token{tokenType: CloseCurlyBracket, value: string(r)}
	} else if r == ';' {
		newToken = token{tokenType: Semicolon, value: string(r)}
	} else {
		panic(fmt.Sprintf("unexpected character: %c", r))
	}
	p.currentIndex++
	return newToken
}

func (p *parser) skipSpaces() {
	for !p.isEOF() && unicode.IsSpace(p.currentRune()) {
		p.currentIndex++
	}
}

func (p *parser) parseLetterStartToken() token {
	startIndex := p.currentIndex
	// The EOF check keeps the loop from indexing past the end of p.code
	// when an identifier is the last lexeme of the input.
	for !p.isEOF() && (unicode.IsLetter(p.currentRune()) || unicode.IsNumber(p.currentRune())) {
		p.currentIndex++
	}
	tokenValue := string(p.code[startIndex:p.currentIndex])
	// Step back one rune, since nextToken unconditionally advances afterwards.
	p.currentIndex--
	return letterStartToken(tokenValue)
}

func letterStartToken(tokenValue string) token {
	var tokenType tokenType
	switch tokenValue {
	case "int":
		tokenType = IntType
	case "return":
		tokenType = ReturnKeyword
	default:
		tokenType = Identifier
	}
	return token{tokenType: tokenType, value: tokenValue}
}

func (p *parser) parseNumberStartToken() token {
	startIndex := p.currentIndex
	for !p.isEOF() && unicode.IsNumber(p.currentRune()) {
		p.currentIndex++
	}
	tokenValue := string(p.code[startIndex:p.currentIndex])
	// Step back one rune, since nextToken unconditionally advances afterwards.
	p.currentIndex--
	return token{tokenType: NumericConstant, value: tokenValue}
}

func (p parser) isEOF() bool {
	return p.currentIndex >= len(p.code)
}

func (p parser) currentRune() rune {
	return p.code[p.currentIndex]
}

func Tokenize(code []rune) []token {
	p := parser{code: code, currentIndex: 0}
	tokens := []token{}
	for newToken := p.nextToken(); newToken.tokenType != EOF; newToken = p.nextToken() {
		tokens = append(tokens, newToken)
	}
	return tokens
}
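For context, the lexer might be driven from a separate main package like this (a hypothetical file, not part of the original post; the import path is an assumption):

	package main

	import (
		"fmt"

		"example.com/ccompiler/lexer" // hypothetical import path for the package above
	)

	func main() {
		// Tokenize a tiny C program; each token prints as a {type value} pair.
		tokens := lexer.Tokenize([]rune("int main() { return 2; }"))
		for _, tok := range tokens {
			fmt.Println(tok)
		}
	}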
1 Answer
Your code is a good start, but it is not finished yet. You still need to handle:
- character literals
- string literals
- floating-point literals
- hexadecimal int literals
- int literals containing underscores
I suggest you pick up a copy of the current C standard and read through chapter 6, which contains the definitions of all these lexemes. Along with the reading, start to write unit tests for each of the lexeme rules, especially for the edge cases.
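To make the unit-test suggestion concrete, a minimal table-driven test for the Tokenize function above might look like this (the file name and test cases are illustrative, not from the answer):

	package lexer

	import "testing"

	func TestTokenize(t *testing.T) {
		tests := []struct {
			input string
			want  []token
		}{
			{"return 42;", []token{
				{ReturnKeyword, "return"},
				{NumericConstant, "42"},
				{Semicolon, ";"},
			}},
			{"int x", []token{
				{IntType, "int"},
				{Identifier, "x"},
			}},
		}
		for _, tt := range tests {
			got := Tokenize([]rune(tt.input))
			if len(got) != len(tt.want) {
				t.Fatalf("Tokenize(%q) = %v, want %v", tt.input, got, tt.want)
				continue
			}
			for i := range got {
				if got[i] != tt.want[i] {
					t.Errorf("Tokenize(%q)[%d] = %v, want %v", tt.input, i, got[i], tt.want[i])
				}
			}
		}
	}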
Your current code is well-structured, and you will be able to write the full lexer while keeping the same structure. I've written a very similar lexer in my pkglint project, with the corresponding unit tests in lexer_test.go in the same directory. Reading this code will bring you up to speed with your own lexer. To see how I use the lexer, look for NewLexer in the code.
One aspect where my lexer differs from yours is that my lexer only remembers the remaining input string, whereas your lexer remembers the complete input string plus the current index. I chose my data structure since it uses the least amount of memory possible. Taking a subslice is an efficient operation in Go, therefore I haven't been bitten by the extra memory access that l.rest = l.rest[1:] needs (it updates both the start and the length of l.rest). Your code is more efficient in that it only increments the index.
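For illustration, the remaining-input design looks roughly like this (a sketch of the idea only, not pkglint's actual code):

	type Lexer struct {
		rest string // the part of the input that has not been consumed yet
	}

	// SkipByte consumes b if it is the next byte of the remaining input.
	func (l *Lexer) SkipByte(b byte) bool {
		if len(l.rest) > 0 && l.rest[0] == b {
			l.rest = l.rest[1:] // updates both the start and the length of l.rest
			return true
		}
		return false
	}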
- Thank you for your answer! I will definitely look at your project code. The only reason my lexer has so little functionality is that I am going through a series of articles (norasandler.com/2017/11/29/Write-a-Compiler.html) and the first practice task is to implement only these lexemes. I am not sure about substrings and memory management. As I understand it, mystring = mystring[10:] will keep the original mystring in memory, because the substring holds a pointer to the original value and the GC won't clear it (stackoverflow.com/questions/16909917/…). Am I right? – Greg Elledge, Oct 24, 2019 at 6:24
- Yes, you're right about the GC. The reason I did it this way is that when I step through the parsing code, I can immediately see the remaining string. That's more comfortable than searching for the character by counting characters on the screen. – Roland Illig, Oct 24, 2019 at 6:41
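(Editor's note on the retention issue raised in the comments: converting through []byte forces a copy, which detaches the substring from the original backing array. A minimal illustration, assuming the strings package is imported:

	s := strings.Repeat("x", 1<<20)    // a large original string
	shared := s[10:]                   // shares s's backing array, so all of s stays reachable
	detached := string([]byte(s[10:])) // copies the bytes, so the original can be collected

)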