I started learning compiler development, and as a pet project I decided to implement a C compiler. The first step is implementing the simplest possible lexer. I also decided to try a new programming language and chose Go. This is my first Go program, and I would be grateful for any advice. Thanks.
package lexer

import (
	"fmt"
	"unicode"
)

type tokenType int

const (
	EOF tokenType = iota
	Identifier
	NumericConstant
	IntType
	ReturnKeyword
	OpenRoundBracket
	CloseRoundBracket
	OpenCurlyBracket
	CloseCurlyBracket
	Semicolon
)

type token struct {
	tokenType tokenType
	value     string
}

type parser struct {
	code         []rune
	currentIndex int
}

// nextToken skips leading whitespace and returns the next token,
// or an EOF token when the input is exhausted.
func (p *parser) nextToken() token {
	p.skipSpaces()
	if p.isEOF() {
		return token{tokenType: EOF}
	}
	var newToken token
	r := p.currentRune()
	if unicode.IsLetter(r) {
		newToken = p.parseLetterStartToken()
	} else if unicode.IsNumber(r) {
		newToken = p.parseNumberStartToken()
	} else if r == '(' {
		newToken = token{tokenType: OpenRoundBracket, value: string(r)}
	} else if r == ')' {
		newToken = token{tokenType: CloseRoundBracket, value: string(r)}
	} else if r == '{' {
		newToken = token{tokenType: OpenCurlyBracket, value: string(r)}
	} else if r == '}' {
		newToken = token{tokenType: CloseCurlyBracket, value: string(r)}
	} else if r == ';' {
		newToken = token{tokenType: Semicolon, value: string(r)}
	} else {
		panic(fmt.Sprintf("unexpected character: %c", r))
	}
	p.currentIndex++
	return newToken
}

func (p *parser) skipSpaces() {
	for !p.isEOF() && unicode.IsSpace(p.currentRune()) {
		p.currentIndex++
	}
}

func (p *parser) parseLetterStartToken() token {
	startIndex := p.currentIndex
	// The EOF check keeps the loop from indexing past the end of p.code
	// when an identifier is the last lexeme of the input.
	for !p.isEOF() && (unicode.IsLetter(p.currentRune()) || unicode.IsNumber(p.currentRune())) {
		p.currentIndex++
	}
	tokenValue := string(p.code[startIndex:p.currentIndex])
	// Step back one rune, since nextToken unconditionally advances afterwards.
	p.currentIndex--
	return letterStartToken(tokenValue)
}

func letterStartToken(tokenValue string) token {
	var tokenType tokenType
	switch tokenValue {
	case "int":
		tokenType = IntType
	case "return":
		tokenType = ReturnKeyword
	default:
		tokenType = Identifier
	}
	return token{tokenType: tokenType, value: tokenValue}
}

func (p *parser) parseNumberStartToken() token {
	startIndex := p.currentIndex
	for !p.isEOF() && unicode.IsNumber(p.currentRune()) {
		p.currentIndex++
	}
	tokenValue := string(p.code[startIndex:p.currentIndex])
	// Step back one rune, since nextToken unconditionally advances afterwards.
	p.currentIndex--
	return token{tokenType: NumericConstant, value: tokenValue}
}

func (p parser) isEOF() bool {
	return p.currentIndex >= len(p.code)
}

func (p parser) currentRune() rune {
	return p.code[p.currentIndex]
}

func Tokenize(code []rune) []token {
	p := parser{code: code, currentIndex: 0}
	tokens := []token{}
	for newToken := p.nextToken(); newToken.tokenType != EOF; newToken = p.nextToken() {
		tokens = append(tokens, newToken)
	}
	return tokens
}
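For context, the lexer might be driven from a separate main package like this (a hypothetical file, not part of the original post; the import path is an assumption):

	package main

	import (
		"fmt"

		"example.com/ccompiler/lexer" // hypothetical import path for the package above
	)

	func main() {
		// Tokenize a tiny C program; each token prints as a {type value} pair.
		tokens := lexer.Tokenize([]rune("int main() { return 2; }"))
		for _, tok := range tokens {
			fmt.Println(tok)
		}
	}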
1 Answer
Your code is a good start, but it is not finished yet. You still need to handle:
- character literals
- string literals
- floating-point literals
- hexadecimal int literals
- int literals containing underscores
I suggest you pick up a copy of the current C standard and read through chapter 6, which contains the definitions of all these lexemes. Along with the reading, start to write unit tests for each of the lexeme rules, especially for the edge cases.
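To make the unit-test suggestion concrete, a minimal table-driven test for the Tokenize function above might look like this (the file name and test cases are illustrative, not from the answer):

	package lexer

	import "testing"

	func TestTokenize(t *testing.T) {
		tests := []struct {
			input string
			want  []token
		}{
			{"return 42;", []token{
				{ReturnKeyword, "return"},
				{NumericConstant, "42"},
				{Semicolon, ";"},
			}},
			{"int x", []token{
				{IntType, "int"},
				{Identifier, "x"},
			}},
		}
		for _, tt := range tests {
			got := Tokenize([]rune(tt.input))
			if len(got) != len(tt.want) {
				t.Fatalf("Tokenize(%q) = %v, want %v", tt.input, got, tt.want)
				continue
			}
			for i := range got {
				if got[i] != tt.want[i] {
					t.Errorf("Tokenize(%q)[%d] = %v, want %v", tt.input, i, got[i], tt.want[i])
				}
			}
		}
	}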
Your current code is well-structured, and you will be able to write the full lexer while keeping the same structure. I've written a very similar lexer in my pkglint project, with the corresponding unit tests in lexer_test.go in the same directory. Reading this code will bring you up to speed with your own lexer. To see how I use the lexer, look for NewLexer in the code.
One aspect where my lexer differs from yours is that my lexer only remembers the remaining input string, whereas your lexer remembers the complete input string plus the current index. I chose my data structure since it uses the least amount of memory possible. Taking a subslice is an efficient operation in Go, therefore I haven't been bitten by the extra memory access that l.rest = l.rest[1:] needs (it updates both the start and the length of l.rest). Your code is more efficient in that it only increments the index.
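For illustration, the remaining-input design looks roughly like this (a sketch of the idea only, not pkglint's actual code):

	type Lexer struct {
		rest string // the part of the input that has not been consumed yet
	}

	// SkipByte consumes b if it is the next byte of the remaining input.
	func (l *Lexer) SkipByte(b byte) bool {
		if len(l.rest) > 0 && l.rest[0] == b {
			l.rest = l.rest[1:] // updates both the start and the length of l.rest
			return true
		}
		return false
	}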
- Thank you for your answer! I will definitely look at your project code. The only reason my lexer has so little functionality is that I am going through a series of articles (norasandler.com/2017/11/29/Write-a-Compiler.html) and the first practice task is to implement only these lexemes. I am not sure about substrings and memory management. As I understand it, mystring = mystring[10:] will keep the original mystring in memory, because the substring holds a pointer to the original value and the GC won't clear it (stackoverflow.com/questions/16909917/…). Am I right? – Greg Elledge, Oct 24, 2019 at 6:24
- Yes, you're right about the GC. The reason I did it this way is that when I step through the parsing code, I can immediately see the remaining string. That's more comfortable than searching for the character by counting characters on the screen. – Roland Illig, Oct 24, 2019 at 6:41
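(Editor's note on the retention issue raised in the comments: converting through []byte forces a copy, which detaches the substring from the original backing array. A minimal illustration, assuming the strings package is imported:

	s := strings.Repeat("x", 1<<20)    // a large original string
	shared := s[10:]                   // shares s's backing array, so all of s stays reachable
	detached := string([]byte(s[10:])) // copies the bytes, so the original can be collected

)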