C# Language Lexer

Question 1

Here is a Lexer for a programming language I am working on. Any feedback would be appreciated. I only started learning C# a couple of days ago, so please excuse my newbie code :)

namespace Sen
{
    enum TokenType
    {
        IDENTIFIER,
        NUMBER,
        STRING,
        SEMICOLON,
        PLUS,
        MINUS,
        STAR,
        SLASH,
    }
    class Token
    {
        public TokenType type;
        public string value;
        public Token(TokenType type, string value = "")
        {
            this.type = type;
            this.value = value;
        }
    }
    class Lexer
    {
        public readonly List<Token> tokens;
        private int charIdx;
        private readonly string sourceRaw;
        char CurrentChar
        {
            get { return sourceRaw[charIdx]; }
        }
        public Lexer(string sourceRaw)
        {
            this.sourceRaw = sourceRaw;
            tokens = new List<Token>();
        }
        bool IsEnd
        {
            get { return charIdx >= sourceRaw.Length; }
        }
        char? NextChar()
        {
            try {
                return sourceRaw[charIdx++];
            } catch (IndexOutOfRangeException) {
                return null;
            }
        }
        public void Lex()
        {
            while (!IsEnd) {
                switch (CurrentChar) {
                    case ';': AddToken(TokenType.SEMICOLON); break;
                    case ' ': break;
                    case '\'':
                    case '"':
                        LexString();
                        break;
                    case '+': AddToken(TokenType.PLUS); break;
                    case '-': AddToken(TokenType.MINUS); break;
                    case '*': AddToken(TokenType.STAR); break;
                    case '/': AddToken(TokenType.SLASH); break;
                    default:
                        if (char.IsLetter(CurrentChar)) {
                            LexIdentifier();
                            continue;
                        } else if (char.IsNumber(CurrentChar)) {
                            LexNumber();
                            continue;
                        }
                        throw new UnexpectedCharacterException(CurrentChar);
                }
                NextChar();
            }
        }
        void AddToken(TokenType type, string value = "")
        {
            tokens.Add(new Token(type, value));
        }
        void LexIdentifier()
        {
            int startIdx = charIdx;
            int endIdx = startIdx;
            while (!IsEnd && CurrentChar != ' ' && CurrentChar != ';') {
                if (!char.IsLetterOrDigit(CurrentChar) && CurrentChar != '_')
                    throw new UnexpectedCharacterException(CurrentChar);
                NextChar();
                endIdx++;
            }
            string value = sourceRaw[startIdx..endIdx];
            AddToken(TokenType.IDENTIFIER, value);
        }

        void LexNumber()
        {
            int startIdx = charIdx;
            int endIdx = startIdx;
            while (!IsEnd && CurrentChar != ' ' && CurrentChar != ';' && char.IsNumber(CurrentChar)) {
                NextChar();
                endIdx++;
            }
            string value = sourceRaw[startIdx..endIdx];
            AddToken(TokenType.NUMBER, value);
        }
        void LexString()
        {
            char opening = CurrentChar;
            int startIdx = charIdx + 1;
            int endIdx = startIdx - 1;
            NextChar();
            while (CurrentChar != opening) {
                if (IsEnd)
                    throw new ExpectedCharacterException(opening);
                NextChar();
                endIdx++;
            }
            string value = sourceRaw[startIdx..endIdx];
            AddToken(TokenType.STRING, value);
        }
    }
}

Answer

Welcome to CR and to C#. First things first: you should become familiar with the C# Naming Conventions. A few that I choose to emphasize with regard to your post:

In the Token class, the fields type and value should become properties named Type and Value. In general, fields are private unless they are constant or static. If you wish to expose a field as public, then it should be a property instead. Also, properties and methods should be named with Pascal casing.
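As a rough sketch of that advice, the Token class might look like the following. Making the properties get-only is an assumption here, since the point above only asks for properties; add setters if tokens need to change after construction. The enum is copied from the question so the snippet compiles on its own.

```csharp
namespace Sen
{
    enum TokenType { IDENTIFIER, NUMBER, STRING, SEMICOLON, PLUS, MINUS, STAR, SLASH }

    class Token
    {
        // Pascal-cased auto-properties replace the public fields.
        // Get-only is an assumption: the values can only be set in the constructor.
        public TokenType Type { get; }
        public string Value { get; }

        public Token(TokenType type, string value = "")
        {
            Type = type;
            Value = value;
        }
    }
}
```

Callers then read token.Type and token.Value instead of token.type and token.value.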

Though not required, I personally prefer to decorate all properties, fields, and methods with their access modifiers, even when the modifier is private. Granted, private is the default, but I want to make sure that a beginner has given it thought and explicitly marked it so.

Regarding braces, there are two areas for improvement. One, the current thinking with C# is that the opening and closing braces each go on their own line. And two, one-liners without braces are frowned upon and should incorporate braces.

Taking that into consideration, this would be a rewrite of one method:

private char? NextChar()
{
    try
    {
        return sourceRaw[charIdx++];
    }
    catch (IndexOutOfRangeException)
    {
        return null;
    }
}

Except that the entire method can use a less expensive conditional check rather than a try-catch block.

private char? NextChar() => (charIdx >= 0 && !IsEnd)
    ? sourceRaw[charIdx++]
    : null;

Why bother to catch an exception if all you do is ignore it? Especially when there is simple code that can easily work around it.
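One caveat on the one-liner: a conditional expression whose branches are a char and null relies on target-typed conditional expressions, which arrived in C# 9. On an older compiler, either cast the null branch to char? or fall back to a plain if statement. The sketch below shows the if form inside a hypothetical CharReader class, which exists only to make the example self-contained and is not part of the posted lexer.

```csharp
// Hypothetical stand-in for the relevant parts of the Lexer class.
class CharReader
{
    private readonly string sourceRaw;
    private int charIdx;

    public CharReader(string sourceRaw)
    {
        this.sourceRaw = sourceRaw;
    }

    private bool IsEnd => charIdx >= sourceRaw.Length;

    // Returns the current character and advances the index,
    // or returns null once the input is exhausted. No exception is involved.
    public char? NextChar()
    {
        if (charIdx >= 0 && !IsEnd)
        {
            return sourceRaw[charIdx++];
        }

        return null;
    }
}
```

Unlike the try-catch version, this form does not advance charIdx past the end of the input, so repeated calls at the end keep returning null without moving the index.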

Back to braces, lines such as:

if (IsEnd)
    throw new ExpectedCharacterException(opening);

should be converted to:

if (IsEnd)
{
    throw new ExpectedCharacterException(opening);
}

There are a few properties or methods where you may consider using => (an expression-bodied member). Example:

private bool IsEnd => charIdx >= sourceRaw.Length;

You seem to use CurrentChar != ' ' && CurrentChar != ';' frequently. Apparently, these are delimiters between tokens and values. The DRY Principle (Don't Repeat Yourself) suggests this could become its own property:

private bool IsDelimiter => CurrentChar == ' ' || CurrentChar == ';';

Elsewhere in the code you would replace CurrentChar != ' ' && CurrentChar != ';' with !IsDelimiter. The advantage here, besides readability, is that if you were ever to add a third delimiter in the future, you would only have to change it in one spot.
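To illustrate, here is a minimal sketch of a scanning loop that uses !IsDelimiter. The DelimiterDemo class and its ReadWord method are hypothetical names invented for this example; in the posted lexer the same loop condition would sit inside LexIdentifier and LexNumber.

```csharp
// Hypothetical stand-in for the relevant parts of the Lexer class.
class DelimiterDemo
{
    private readonly string sourceRaw;
    private int charIdx;

    public DelimiterDemo(string sourceRaw)
    {
        this.sourceRaw = sourceRaw;
    }

    private bool IsEnd => charIdx >= sourceRaw.Length;

    private char CurrentChar => sourceRaw[charIdx];

    // The only place that knows which characters separate tokens.
    private bool IsDelimiter => CurrentChar == ' ' || CurrentChar == ';';

    // Reads characters up to the next delimiter or the end of the input.
    // IsEnd must be tested first, because CurrentChar throws at the end.
    public string ReadWord()
    {
        int startIdx = charIdx;
        while (!IsEnd && !IsDelimiter)
        {
            charIdx++;
        }

        return sourceRaw.Substring(startIdx, charIdx - startIdx);
    }
}
```

For the input "foo;bar", ReadWord stops at the semicolon and returns "foo". Adding a newline as a delimiter later would only require touching IsDelimiter.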

Rick Davin · Accepted Answer · 2022-05-03 16:00:05Z
