Here is a Lexer for a programming language I am working on. Any feedback would be appreciated. I only started learning C# a couple of days ago, so please excuse my newbie code :)
namespace Sen
{
    enum TokenType
    {
        IDENTIFIER,
        NUMBER,
        STRING,
        SEMICOLON,
        PLUS,
        MINUS,
        STAR,
        SLASH,
    }

    class Token
    {
        public TokenType type;
        public string value;

        public Token(TokenType type, string value = "")
        {
            this.type = type;
            this.value = value;
        }
    }

    class Lexer
    {
        public readonly List<Token> tokens;
        private int charIdx;
        private readonly string sourceRaw;

        char CurrentChar
        {
            get { return sourceRaw[charIdx]; }
        }

        public Lexer(string sourceRaw)
        {
            this.sourceRaw = sourceRaw;
            tokens = new List<Token>();
        }

        bool IsEnd
        {
            get { return charIdx >= sourceRaw.Length; }
        }

        char? NextChar()
        {
            try {
                return sourceRaw[charIdx++];
            } catch (IndexOutOfRangeException) {
                return null;
            }
        }

        public void Lex()
        {
            while (!IsEnd) {
                switch (CurrentChar) {
                    case ';': AddToken(TokenType.SEMICOLON); break;
                    case ' ': break;
                    case '\'':
                    case '"':
                        LexString();
                        break;
                    case '+': AddToken(TokenType.PLUS); break;
                    case '-': AddToken(TokenType.MINUS); break;
                    case '*': AddToken(TokenType.STAR); break;
                    case '/': AddToken(TokenType.SLASH); break;
                    default:
                        if (char.IsLetter(CurrentChar)) {
                            LexIdentifier();
                            continue;
                        } else if (char.IsNumber(CurrentChar)) {
                            LexNumber();
                            continue;
                        }
                        throw new UnexpectedCharacterException(CurrentChar);
                }
                NextChar();
            }
        }

        void AddToken(TokenType type, string value = "")
        {
            tokens.Add(new Token(type, value));
        }

        void LexIdentifier()
        {
            int startIdx = charIdx;
            int endIdx = startIdx;
            while (!IsEnd && CurrentChar != ' ' && CurrentChar != ';') {
                if (!char.IsLetterOrDigit(CurrentChar) && CurrentChar != '_')
                    throw new UnexpectedCharacterException(CurrentChar);
                NextChar();
                endIdx++;
            }
            string value = sourceRaw[startIdx..endIdx];
            AddToken(TokenType.IDENTIFIER, value);
        }

        void LexNumber()
        {
            int startIdx = charIdx;
            int endIdx = startIdx;
            while (!IsEnd && CurrentChar != ' ' && CurrentChar != ';' && char.IsNumber(CurrentChar)) {
                NextChar();
                endIdx++;
            }
            string value = sourceRaw[startIdx..endIdx];
            AddToken(TokenType.NUMBER, value);
        }

        void LexString()
        {
            char opening = CurrentChar;
            int startIdx = charIdx + 1;
            int endIdx = startIdx - 1;
            NextChar();
            while (CurrentChar != opening) {
                if (IsEnd)
                    throw new ExpectedCharacterException(opening);
                NextChar();
                endIdx++;
            }
            string value = sourceRaw[startIdx..endIdx];
            AddToken(TokenType.STRING, value);
        }
    }
}
1 Answer
Welcome to CR and to C#. First things first, you should become familiar with the C# Naming Conventions. A few that I choose to emphasize with regard to your post:
In the Token class, the fields type and value should become properties named Type and Value. In general, fields are private unless they are constant or static. If you wish to expose a field as public, then it should be a property instead. Also, properties and methods should be named with Pascal casing.
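As a minimal sketch of that advice, the Token class could look like the following. Making the properties read-only (get-only) is an assumption on my part, since a token rarely needs to change after it is created; use { get; set; } if your code has to mutate it.

```csharp
namespace Sen
{
    enum TokenType { IDENTIFIER, NUMBER, STRING, SEMICOLON, PLUS, MINUS, STAR, SLASH }

    class Token
    {
        // Pascal-cased auto-properties replace the public fields.
        // Get-only is an assumption: tokens are treated as immutable here.
        public TokenType Type { get; }
        public string Value { get; }

        public Token(TokenType type, string value = "")
        {
            Type = type;
            Value = value;
        }
    }
}
```

Callers would then read token.Type and token.Value instead of token.type and token.value.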
Though not required, I personally prefer to decorate all properties, fields, and methods with their access modifiers, even if they are private. Granted, private is the default, but I want to make sure that a beginner has given it thought and explicitly marked it so.
Regarding braces, there are two areas for improvement. One, the current thinking with C# is that the opening and closing braces each go on their own line. Two, one-liners without braces are frowned upon and should incorporate braces.
Taking that into consideration, this would be a rewrite of one method:
private char? NextChar()
{
    try
    {
        return sourceRaw[charIdx++];
    }
    catch (IndexOutOfRangeException)
    {
        return null;
    }
}
Except that the entire method can use a less expensive if rather than a try-catch block.
private char? NextChar() => (charIdx >= 0 && !IsEnd)
    ? sourceRaw[charIdx++]
    : null;
Why bother to catch an exception if all you do is ignore it? Especially when there is simple code that can easily work around it.
Back to braces, lines such as:
if (IsEnd)
    throw new ExpectedCharacterException(opening);
should be converted to:
if (IsEnd)
{
    throw new ExpectedCharacterException(opening);
}
There are a few properties or methods where you may consider using =>. Example:
private bool IsEnd => charIdx >= sourceRaw.Length;
You seem to use CurrentChar != ' ' && CurrentChar != ';' frequently. Apparently, these are delimiters between tokens and values. The DRY Principle (Don't Repeat Yourself) suggests this could become its own property:
private bool IsDelimiter => CurrentChar == ' ' || CurrentChar == ';';
Elsewhere in code you would replace CurrentChar != ' ' && CurrentChar != ';' with !IsDelimiter. The advantage here, besides readability, is that if you were ever to add a 3rd delimiter in the future, you would only have to change it in one spot.
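To show how that reads in practice, here is a rough, stripped-down sketch of LexNumber using IsDelimiter. The NumberLexerSketch class and its Numbers list are hypothetical names made up for this illustration and are not part of your Lexer; Substring stands in for the range syntax only so the snippet compiles on older compilers too.

```csharp
using System.Collections.Generic;

// Hypothetical, trimmed-down lexer that only knows how to read one number.
class NumberLexerSketch
{
    private readonly string sourceRaw;
    private int charIdx;

    // Collected number lexemes (hypothetical stand-in for the token list).
    public List<string> Numbers { get; } = new List<string>();

    public NumberLexerSketch(string sourceRaw)
    {
        this.sourceRaw = sourceRaw;
    }

    private bool IsEnd => charIdx >= sourceRaw.Length;
    private char CurrentChar => sourceRaw[charIdx];

    // The single place to edit if another delimiter is ever added.
    private bool IsDelimiter => CurrentChar == ' ' || CurrentChar == ';';

    public void LexNumber()
    {
        int startIdx = charIdx;

        // !IsEnd must come first: && short-circuits, so CurrentChar
        // is never read past the end of the source.
        while (!IsEnd && !IsDelimiter && char.IsNumber(CurrentChar))
        {
            charIdx++;
        }

        Numbers.Add(sourceRaw.Substring(startIdx, charIdx - startIdx));
    }
}
```

Note that IsDelimiter reads CurrentChar, so it is only safe to evaluate after the IsEnd check, which is why the order of the conditions in the loop matters.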