So, purely out of a desire to learn, I decided to take on a project of making my own very basic programming language from scratch. The first piece I've written for it is a lexer.
I've tested the lexer (though only a little) and it seems to work, but I tend to doubt myself a lot (which might actually be a good thing), so I'd like people to review my code.
Just remember that I'm doing this in order to learn, so the code is most likely far from flawless, but I tried my best. Also, as I stated in comments inside the code, the error-reporting part is still a work in progress, so I've used placeholders instead; please ignore everything that uses the show_error function for now.
I also didn't provide the declaration or definition of the reader class, but the reader I used is quite basic, so I don't think I need to post it here; it should be possible to review the lexer without seeing it.
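For context, the only thing the lexer relies on is a value() member function that hands back the whole source text as a string, so roughly this (a simplified sketch of the interface, not my actual reader class):

#include <string>
#include <utility>

// Simplified sketch: the real reader also handles opening and reading the file.
class reader
{
public:
    explicit reader(std::string source) : contents{std::move(source)} {}
    // The lexer only calls this to get the full source text
    const std::string& value() const { return contents; }
private:
    std::string contents;
};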
Also, the language itself is very C-like for now. I gave it the codename "Phi", but you can just treat this as if it were a C lexer.
So, here is the code of the lexer:
[declaration] lexer.hpp:
// lexer.hpp
// Contains declaration of the Phi Lexer class
#ifndef PHI_LEXER
#define PHI_LEXER
#include "reader.hpp"
#include <vector>
#include <unordered_set>
#include <string>
#include <utility>
class token
{
public:
enum class kinds : char{kw, iden, op, symbol, int_lit, flt_lit, str_lit, null};
// Enum class to represent every possible "kind" of token
token();
token(kinds kind, std::string token_val, std::size_t pos_line, std::size_t pos_col);
std::string value() const;
kinds kind() const;
std::pair<std::size_t, std::size_t> pos() const;
bool isNull() const;
bool operator==(const token& other) const;
bool operator!=(const token& other) const;
private:
kinds knd; // Token kind
std::string val;
std::size_t line, col;
};
class tokenstream
{
public:
token get();
token peek(std::size_t step = 0) const;
std::size_t size() const;
void unget(std::size_t amount = 1);
void push(token tok);
void pop();
void clear();
bool contains(token tok) const;
bool empty() const;
private:
std::vector<token> tokens;
std::size_t index{};
};
namespace lexer_dict
{
static const std::unordered_set<std::string> keywords {"If", "Else", "While", "For"};
static const std::unordered_set<char> operators {'+', '-', '*', '/', '%',
'>', '<', '=', '!'};
static const std::unordered_set<std::string> double_operators
{">=", "<=", "!=", "+=", "-=", "*=", "/=", "%=", "++", "--"};
static const std::unordered_set<char> symbols {',', ';', ':', '(', ')', '{', '}', '[', ']'};
static const std::unordered_set<char> whitespace {' ', '\n', '\t', '\v', '\r', '\f'};
} // Lexer dictionary to identify type of a character or string
class lexer
{
public:
lexer();
bool tokenize(reader read);
const tokenstream &value() const;
private:
tokenstream val; // The output value
};
#endif
[definition] lexer.cpp:
// lexer.cpp
// Contains implementation of the Phi Lexer class
// WARNING: Errors are Work In Progress. Use it at your own risk
#include "lexer.hpp"
#include "errhandler.hpp"
#include <stdexcept>
#include <algorithm>
#include <cctype>
using kinds = token::kinds;
token::token() : knd{token::kinds::null}, val{}, line{}, col{}
{
}
token::token(kinds kind, std::string token_val, std::size_t pos_line, std::size_t pos_col)
: knd{kind}, val{token_val}, line{pos_line}, col{pos_col}
{
}
std::string token::value() const
{
return val;
}
token::kinds token::kind() const
{
return knd;
}
std::pair<std::size_t, std::size_t> token::pos() const
{
return std::make_pair(line, col);
}
bool token::isNull() const
{
return knd == token::kinds::null;
}
bool token::operator==(const token& other) const
{
return knd == other.knd && val == other.val;
}
bool token::operator!=(const token& other) const
{
return !(*this == other);
}
token tokenstream::get()
{
if(tokens.size() > index)
return tokens[index++];
else
throw std::out_of_range("Trying to read past token stream");
}
token tokenstream::peek(std::size_t step /*= 0*/) const
{
if(tokens.size() > index + step)
return tokens[index + step];
else
throw std::out_of_range("Trying to read past token stream");
}
std::size_t tokenstream::size() const
{
return tokens.size() - index;
}
void tokenstream::unget(std::size_t amount /*= 1*/)
{
if(index >= amount) // compare before subtracting to avoid unsigned wrap-around
index -= amount;
else
throw std::out_of_range("Trying to read before token stream");
}
void tokenstream::push(token tok)
{
tokens.push_back(tok);
}
void tokenstream::pop()
{
tokens.pop_back();
}
void tokenstream::clear()
{
index = 0;
tokens.clear();
}
bool tokenstream::contains(token tok) const
{
return std::find(tokens.begin(), tokens.end(), tok) != tokens.end();
}
bool tokenstream::empty() const
{
return index >= tokens.size();
}
lexer::lexer() : val{}
{
}
bool lexer::tokenize(reader read)
{
auto is_oper = [](char c) -> bool
{
return lexer_dict::operators.find(c) != lexer_dict::operators.end();
};
auto is_double_oper = [](std::string &s) -> bool
{
return lexer_dict::double_operators.find(s) != lexer_dict::double_operators.end();
};
auto is_sym = [](char c) -> bool
{
return lexer_dict::symbols.find(c) != lexer_dict::symbols.end();
};
auto is_space = [](char c) -> bool
{
return lexer_dict::whitespace.find(c) != lexer_dict::whitespace.end();
};
auto is_kw = [](const std::string &s) -> bool
{
return lexer_dict::keywords.find(s) != lexer_dict::keywords.end();
};
std::size_t i{}, line{1}, col{1}, begin_col{};
// i : index
// line : current line
// col : current column
// begin_col : beginning column of the current token being processed
const std::string& str = read.value();
std::string temp_str;
bool success = true;
while(i < str.length())
{
if(is_space(str[i]))
{
if(str[i] == '\n' || str[i] == '\v')
{
col = 1;
++line;
++i;
}
else
{
++col;
++i;
}
}
else if(is_sym(str[i]))
{
val.push(token{kinds::symbol, str.substr(i, 1), line, col});
++col;
++i;
}
else if (is_oper(str[i]))
{
// If str[i] is the last character (which is invalid syntax but it's for the parser to deal with)
// or if the next character isn't an operator,
// then just insert it normally into the token stream
if(i == str.length() - 1 || !is_oper(str[i+1]))
{
val.push(token{kinds::op, str.substr(i, 1), line, col});
++col;
++i;
}
// If str[i] is not the last character and the next character is also an operator,
// which means it's a double operator (eg: <=, >=, !=),
// ensure it's not an invalid operator combination (eg: not +- or *+)
// and then push it in the token stream
else
{
temp_str = str.substr(i, 2); // take both characters (str[i] + str[i + 1] would add them as ints)
if(is_double_oper(temp_str))
val.push(token{kinds::op, temp_str, line, col});
else
{
show_error("stuff"); // Not finished yet
success = false;
break;
}
col += 2;
i += 2;
temp_str.clear();
}
}
else if (str[i] == '"')
{
std::size_t next_quote = str.find('"', i + 1);
// Find the next quote from starting from the next position
if(next_quote == std::string::npos)
{
// If no closing quote is found, show an error and set success flag to false
show_error("stuff"); // Not finished yet
success = false;
break;
}
val.push(token(kinds::str_lit, str.substr(i + 1, next_quote - i - 1), line, col));
// Make a substring containing everything inside the quotes, turn it into a token
// And push it into the tokenstream
col += next_quote - i + 1;
i = next_quote + 1;
}
else if(isdigit(str[i]))
{
begin_col = col;
// Store beginning column of token
// Since a numeric literal can't span multiple lines,
// We don't need to store the line
bool allow_sign = false;
bool expect_num = false;
bool dot_found = false;
bool e_found = false;
// allow_sign : Flag to allow a plus or minus sign if 'e' is encountered while lexing a numeric literal
// expect_num : Flag to check if the current char is '.' or 'e' so the next char must be a number
// dot_found : Flag to determine if '.' has been encountered
// e_found : Flag to determine if 'e' has been encountered
while (i < str.length() && !(is_space(str[i]) || is_sym(str[i]))) // bounds check so a literal at the end of input doesn't read past the string
{
if(isdigit(str[i]))
{
allow_sign = false;
expect_num = false;
}
else if(str[i] == '.')
{
if(expect_num || dot_found || e_found)
{
show_error("stuff"); // Not finished yet
success = false;
temp_str.clear();
break;
}
dot_found = true;
expect_num = true;
}
else if(str[i] == 'e')
{
if (expect_num || e_found)
{
show_error("stuff"); // Not finished yet
success = false;
temp_str.clear();
break;
}
allow_sign = true;
expect_num = true;
e_found = true;
}
else if(str[i] == '+' || str[i] == '-')
{
if (expect_num || !allow_sign)
{
show_error("stuff"); // Not finished yet
success = false;
temp_str.clear();
break;
}
allow_sign = false;
expect_num = false;
}
temp_str += str[i];
++col;
++i;
}
val.push(
token((dot_found || e_found) ? kinds::flt_lit : kinds::int_lit,
temp_str, line, begin_col));
temp_str.clear();
}
// Identifiers must start with an alphabet or an underscore '_'
// All keywords start with an alphabet
else if(isalpha(str[i]) || str[i] == '_')
{
begin_col = col;
// Store beginning column of token
// Since a keyword or identifier can't span multiple lines,
// We don't need to store the line
while (i < str.length() && !(is_space(str[i]) || is_sym(str[i]) || is_oper(str[i]))) // also stop at operators so "i--" lexes as "i" then "--"
{
temp_str += str[i];
++col;
++i;
}
if(is_kw(temp_str))
val.push(token(kinds::kw, temp_str, line, begin_col));
else
val.push(token(kinds::iden, temp_str, line, begin_col));
temp_str.clear();
}
// If a token begins with an invalid character, issue an error
else
{
show_error("stuff"); // Not finished yet
success = false;
break;
}
}
return success;
}
const tokenstream &lexer::value() const
{
return val;
}
I don't mind criticism. In fact, I would love advice on how to improve this. With that said, it'd be very helpful if you could take some of your valuable time to review it.
Test case source [NOTE: the program written in the language is completely random and has no purpose]:
Int i = 0;

If(i < 0)
{
    i = 2 + 3.1;
    testStr = "Test";
    print("Test");
}

While(i >= 2)
{
    i--;
}
Lexer output (using a print function which prints the tokens stored in the lexer):
Lexer:
Token: KW, Val: "Int", Pos: (1, 1)
Token: Iden, Val: "i", Pos: (1, 5)
Token: Op, Val: "=", Pos: (1, 7)
Token: IntLit, Val: "0", Pos: (1, 9)
Token: Sym, Val: ";", Pos: (1, 10)
Token: KW, Val: "If", Pos: (3, 1)
Token: Sym, Val: "(", Pos: (3, 3)
Token: Iden, Val: "i", Pos: (3, 4)
Token: Op, Val: "<", Pos: (3, 6)
Token: IntLit, Val: "0", Pos: (3, 8)
Token: Sym, Val: ")", Pos: (3, 9)
Token: Sym, Val: "{", Pos: (4, 1)
Token: Iden, Val: "i", Pos: (5, 5)
Token: Op, Val: "=", Pos: (5, 7)
Token: IntLit, Val: "2", Pos: (5, 9)
Token: Op, Val: "+", Pos: (5, 11)
Token: FltLit, Val: "3.1", Pos: (5, 13)
Token: Sym, Val: ";", Pos: (5, 16)
Token: Iden, Val: "testStr", Pos: (6, 5)
Token: Op, Val: "=", Pos: (6, 13)
Token: StrLit, Val: "Test", Pos: (6, 15)
Token: Sym, Val: ";", Pos: (6, 21)
Token: Iden, Val: "print", Pos: (7, 5)
Token: Sym, Val: "(", Pos: (7, 10)
Token: StrLit, Val: "Test", Pos: (7, 11)
Token: Sym, Val: ")", Pos: (7, 17)
Token: Sym, Val: ";", Pos: (7, 18)
Token: Sym, Val: "}", Pos: (8, 1)
Token: KW, Val: "While", Pos: (10, 1)
Token: Sym, Val: "(", Pos: (10, 6)
Token: Iden, Val: "i", Pos: (10, 7)
Token: Op, Val: ">=", Pos: (10, 9)
Token: IntLit, Val: "2", Pos: (10, 12)
Token: Sym, Val: ")", Pos: (10, 13)
Token: Sym, Val: "{", Pos: (11, 1)
Token: Iden, Val: "i", Pos: (12, 5)
Token: Op, Val: "--", Pos: (12, 6)
Token: Sym, Val: ";", Pos: (12, 8)
Token: Sym, Val: "}", Pos: (13, 1)
- Welcome to Code Review! Since this is a lexer for your own programming language, it might help to post an example in that language just so that the reviewers can get a feeling what kind of syntax the code has to parse. From a quick look at your code, it seems to be C-like? – AlexV, Oct 9, 2019 at 11:10
- @AlexV Yes it is C-like. I thought it would be obvious enough so I didn't specifically mention it – Famiu, Oct 9, 2019 at 11:24
- Can you list the supported language constructs so far? – L. F., Oct 11, 2019 at 10:19
- @L.F. I guess for now I'll stick to simple mathematic operations, simple things like If Statements and While Loops. That's all, for now – Famiu, Oct 11, 2019 at 11:05
- OK. Can you provide some test cases to provide more context regarding how this is supposed to work? – L. F., Oct 11, 2019 at 11:13
1 Answer
Welcome to Code Review!
lexer.hpp
It is common practice to put standard headers before custom headers and sort them in alphabetical order:
#include <unordered_set>
#include <vector>
#include "reader.hpp"
Usually, class names are capitalized.
In general, try to avoid abbreviations. The token kinds are more readable with their full names:
enum class kinds {
keyword,
identifier,
operator_, // "operator" on its own is a reserved word
symbol,
int_literal,
float_literal,
string_literal,
null,
};
(The last comma is intentional.)
Strings should be passed around by value. Also, storing numbers as strings probably isn't a very good idea.
Make a dedicated class for line-column number pairs.
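For example, something along these lines (the name and members are just a suggestion):

#include <cstddef>

// A named line/column pair is more self-documenting than std::pair:
// callers can write tok.pos().line instead of tok.pos().first.
struct Source_position
{
    std::size_t line = 1;
    std::size_t column = 1;
};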
std::unordered_set generally performs worse than std::set without a carefully crafted hash function. Use std::set.
Currently, you are maintaining the class variant yourself. Things will become much easier if you use std::variant.
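A rough sketch of what that could look like (all of the names here are only illustrative):

#include <cstddef>
#include <string>
#include <variant>

// One small tag type per token kind, each carrying only the data it needs
struct Null {};
struct Keyword { std::string name; };
struct Identifier { std::string name; };
struct Operator { std::string symbol; };
struct Symbol { char symbol; };
struct Int_literal { int value; };
struct Float_literal { double value; };
struct String_literal { std::string value; };

using Token_data = std::variant<Null, Keyword, Identifier, Operator, Symbol,
                                Int_literal, Float_literal, String_literal>;

struct Token
{
    Token_data data;
    std::size_t line = 1;
    std::size_t column = 1;
};

std::holds_alternative and std::visit then replace the manual checks on the kind field.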
lexer.cpp
Instead of defining the default constructor yourself, use in-class member initializers and = default the default constructor. The other constructor should move from token_val, not copy.
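Roughly like this (a sketch only, keeping your current members):

#include <cstddef>
#include <string>
#include <utility>

class token
{
public:
    enum class kinds : char { kw, iden, op, symbol, int_lit, flt_lit, str_lit, null };

    token() = default;  // the in-class initializers below provide the "null" token
    token(kinds kind, std::string token_val, std::size_t pos_line, std::size_t pos_col)
        : knd{kind}, val{std::move(token_val)}, line{pos_line}, col{pos_col}
    {
    }
    // ... rest of the interface unchanged ...
private:
    kinds knd = kinds::null;
    std::string val;        // empty by default, no explicit initializer needed
    std::size_t line = 0;
    std::size_t col = 0;
};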
pos can just return {line, col}. (Also see above for making a dedicated position class.)
The lambdas don't need to explicitly specify -> bool. is_double_oper should accept by const reference.
Consistently put a space after a control keyword like while.
The tokenize function is becoming very, very long. It should be cut down into several functions.
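For example, one helper per token category, where the body of each helper is essentially the corresponding branch of the current while loop (all of the names below are only suggestions):

#include <cctype>
#include <cstddef>
#include <string>

class lexer_sketch
{
public:
    bool tokenize(const std::string& src);
private:
    // Character classification, same role as the lambdas in the original
    static bool is_space_char(char c);
    static bool is_symbol_char(char c);
    static bool is_operator_char(char c);
    // One helper per token category; each consumes one token starting at i,
    // updates i / line / col itself, and returns false on a lexing error.
    void skip_whitespace(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_symbol(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_operator(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_string_literal(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_number(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_identifier_or_keyword(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
};

// The top-level loop then only dispatches on the first character of each token.
bool lexer_sketch::tokenize(const std::string& src)
{
    std::size_t i = 0, line = 1, col = 1;
    while (i < src.length())
    {
        const char c = src[i];
        bool ok = true;
        if (is_space_char(c))
            skip_whitespace(src, i, line, col);
        else if (is_symbol_char(c))
            ok = lex_symbol(src, i, line, col);
        else if (is_operator_char(c))
            ok = lex_operator(src, i, line, col);
        else if (c == '"')
            ok = lex_string_literal(src, i, line, col);
        else if (std::isdigit(static_cast<unsigned char>(c)))
            ok = lex_number(src, i, line, col);
        else if (std::isalpha(static_cast<unsigned char>(c)) || c == '_')
            ok = lex_identifier_or_keyword(src, i, line, col);
        else
            ok = false;  // invalid starting character
        if (!ok)
            return false;
    }
    return true;
}

Each helper is then short enough to read and test on its own.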
- Thanks for your advice, I'll try to do all the stuff you suggested. I don't get one thing though, what exactly do you mean by "class variant"? I don't see anything related to variants in my code.. – Famiu, Oct 11, 2019 at 11:57
- @Arnil Basically, I mean the way you interpret the value (which is represented by a std::string) depends on the kind field. – L. F., Oct 11, 2019 at 11:59
- How can I use std::variant for it? I mean, the kinds are not separate types but just what kind of token the string represents – Famiu, Oct 11, 2019 at 12:01
- @Arnil Define a tag type for each token kind. For null, there is no data (struct Null {};). For integer literals, for example, you can store them as an int rather than a string (struct Int_literal { int value; };). Similarly for the rest. (Also, don't accept so fast, give others some time.) – L. F., Oct 11, 2019 at 12:05
- Also any advice on how I can shorten the tokenize function? Like, what will I make smaller functions for? – Famiu, Oct 11, 2019 at 12:05