So, purely out of a desire to learn, I decided to take on a project of making my own very basic programming language from scratch. The first piece I've written for it is a lexer.
I've tested the lexer (though only a little) and it seems to work, but I tend to doubt myself a lot (which might actually be a good thing), so I'd like people to review my code.
Just remember that I'm doing this in order to learn, so the code is most likely far from flawless, but I tried my best. Also, as I stated in comments inside the code, the error-reporting part is still a work in progress, so I've used placeholders instead; please ignore everything that uses the show_error function for now.
I also didn't provide the declaration or definition of the reader class, but the reader I used is quite basic, so I don't think I need to post it here; it should be possible to review the lexer without seeing it.
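For context, the only thing the lexer relies on is a value() member function that hands back the whole source text as a string, so roughly this (a simplified sketch of the interface, not my actual reader class):

#include <string>
#include <utility>

// Simplified sketch: the real reader also handles opening and reading the file.
class reader
{
public:
    explicit reader(std::string source) : contents{std::move(source)} {}
    // The lexer only calls this to get the full source text
    const std::string& value() const { return contents; }
private:
    std::string contents;
};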
Also, the language itself is very C-like for now. I gave it the codename "Phi", but you can just treat this as if it were a C lexer.
So, here is the code of the lexer:
[declaration] lexer.hpp:
// lexer.hpp
// Contains declaration of the Phi Lexer class
#ifndef PHI_LEXER
#define PHI_LEXER
#include "reader.hpp"
#include <vector>
#include <unordered_set>
#include <string>
#include <utility>
class token
{
public:
enum class kinds : char{kw, iden, op, symbol, int_lit, flt_lit, str_lit, null};
// Enum class to represent every possible "kind" of token
token();
token(kinds kind, std::string token_val, std::size_t pos_line, std::size_t pos_col);
std::string value() const;
kinds kind() const;
std::pair<std::size_t, std::size_t> pos() const;
bool isNull() const;
bool operator==(const token& other) const;
bool operator!=(const token& other) const;
private:
kinds knd; // Token kind
std::string val;
std::size_t line, col;
};
class tokenstream
{
public:
token get();
token peek(std::size_t step = 0) const;
std::size_t size() const;
void unget(std::size_t amount = 1);
void push(token tok);
void pop();
void clear();
bool contains(token tok) const;
bool empty() const;
private:
std::vector<token> tokens;
std::size_t index{};
};
namespace lexer_dict
{
static const std::unordered_set<std::string> keywords {"If", "Else", "While", "For"};
static const std::unordered_set<char> operators {'+', '-', '*', '/', '%',
'>', '<', '=', '!'};
static const std::unordered_set<std::string> double_operators
{">=", "<=", "!=", "+=", "-=", "*=", "/=", "%=", "++", "--"};
static const std::unordered_set<char> symbols {',', ';', ':', '(', ')', '{', '}', '[', ']'};
static const std::unordered_set<char> whitespace {' ', '\n', '\t', '\v', '\r', '\f'};
} // Lexer dictionary to identify type of a character or string
class lexer
{
public:
lexer();
bool tokenize(reader read);
const tokenstream &value() const;
private:
tokenstream val; // The output value
};
#endif
[definition] lexer.cpp:
// lexer.cpp
// Contains implementation of the Phi Lexer class
// WARNING: Errors are Work In Progress. Use it at your own risk
#include "lexer.hpp"
#include "errhandler.hpp"
#include <stdexcept>
#include <algorithm>
#include <cctype>
using kinds = token::kinds;
token::token() : knd{token::kinds::null}, val{}, line{}, col{}
{
}
token::token(kinds kind, std::string token_val, std::size_t pos_line, std::size_t pos_col)
: knd{kind}, val{token_val}, line{pos_line}, col{pos_col}
{
}
std::string token::value() const
{
return val;
}
token::kinds token::kind() const
{
return knd;
}
std::pair<std::size_t, std::size_t> token::pos() const
{
return std::make_pair(line, col);
}
bool token::isNull() const
{
return knd == token::kinds::null;
}
bool token::operator==(const token& other) const
{
return knd == other.knd && val == other.val;
}
bool token::operator!=(const token& other) const
{
return !(*this == other);
}
token tokenstream::get()
{
if(tokens.size() > index)
return tokens[index++];
else
throw std::out_of_range("Trying to read past token stream");
}
token tokenstream::peek(std::size_t step /*= 0*/) const
{
if(tokens.size() > index + step)
return tokens[index + step];
else
throw std::out_of_range("Trying to read past token stream");
}
std::size_t tokenstream::size() const
{
return tokens.size() - index;
}
void tokenstream::unget(std::size_t amount /*= 1*/)
{
if(index >= amount) // compare before subtracting to avoid unsigned wrap-around
index -= amount;
else
throw std::out_of_range("Trying to read before token stream");
}
void tokenstream::push(token tok)
{
tokens.push_back(tok);
}
void tokenstream::pop()
{
tokens.pop_back();
}
void tokenstream::clear()
{
index = 0;
tokens.clear();
}
bool tokenstream::contains(token tok) const
{
return std::find(tokens.begin(), tokens.end(), tok) != tokens.end();
}
bool tokenstream::empty() const
{
return index >= tokens.size();
}
lexer::lexer() : val{}
{
}
bool lexer::tokenize(reader read)
{
auto is_oper = [](char c) -> bool
{
return lexer_dict::operators.find(c) != lexer_dict::operators.end();
};
auto is_double_oper = [](std::string &s) -> bool
{
return lexer_dict::double_operators.find(s) != lexer_dict::double_operators.end();
};
auto is_sym = [](char c) -> bool
{
return lexer_dict::symbols.find(c) != lexer_dict::symbols.end();
};
auto is_space = [](char c) -> bool
{
return lexer_dict::whitespace.find(c) != lexer_dict::whitespace.end();
};
auto is_kw = [](const std::string &s) -> bool
{
return lexer_dict::keywords.find(s) != lexer_dict::keywords.end();
};
std::size_t i{}, line{1}, col{1}, begin_col{};
// i : index
// line : current line
// col : current column
// begin_col : beginning column of the current token being processed
const std::string& str = read.value();
std::string temp_str;
bool success = true;
while(i < str.length())
{
if(is_space(str[i]))
{
if(str[i] == '\n' || str[i] == '\v')
{
col = 1;
++line;
++i;
}
else
{
++col;
++i;
}
}
else if(is_sym(str[i]))
{
val.push(token{kinds::symbol, str.substr(i, 1), line, col});
++col;
++i;
}
else if (is_oper(str[i]))
{
// If str[i] is the last character (which is invalid syntax but it's for the parser to deal with)
// or if the next character isn't an operator,
// then just insert it normally into the token stream
if(i == str.length() - 1 || !is_oper(str[i+1]))
{
val.push(token{kinds::op, str.substr(i, 1), line, col});
++col;
++i;
}
// If str[i] is not the last character and the next character is also an operator,
// which means it's a double operator (eg: <=, >=, !=),
// ensure it's not an invalid operator combination (eg: not +- or *+)
// and then push it in the token stream
else
{
temp_str = str.substr(i, 2); // take both characters (str[i] + str[i + 1] would add them as ints)
if(is_double_oper(temp_str))
val.push(token{kinds::op, temp_str, line, col});
else
{
show_error("stuff"); // Not finished yet
success = false;
break;
}
col += 2;
i += 2;
temp_str.clear();
}
}
else if (str[i] == '"')
{
std::size_t next_quote = str.find('"', i + 1);
// Find the next quote from starting from the next position
if(next_quote == std::string::npos)
{
// If no closing quote is found, show an error and set success flag to false
show_error("stuff"); // Not finished yet
success = false;
break;
}
val.push(token(kinds::str_lit, str.substr(i + 1, next_quote - i - 1), line, col));
// Make a substring containing everything inside the quotes, turn it into a token
// And push it into the tokenstream
col += next_quote - i + 1;
i = next_quote + 1;
}
else if(isdigit(str[i]))
{
begin_col = col;
// Store beginning column of token
// Since a numeric literal can't span multiple lines,
// We don't need to store the line
bool allow_sign = false;
bool expect_num = false;
bool dot_found = false;
bool e_found = false;
// allow_sign : Flag to allow a plus or minus sign if 'e' is encountered while lexing a numeric literal
// expect_num : Flag to check if the current char is '.' or 'e' so the next char must be a number
// dot_found : Flag to determine if '.' has been encountered
// e_found : Flag to determine if 'e' has been encountered
while (i < str.length() && !(is_space(str[i]) || is_sym(str[i]))) // bounds check so a literal at the end of input doesn't read past the string
{
if(isdigit(str[i]))
{
allow_sign = false;
expect_num = false;
}
else if(str[i] == '.')
{
if(expect_num || dot_found || e_found)
{
show_error("stuff"); // Not finished yet
success = false;
temp_str.clear();
break;
}
dot_found = true;
expect_num = true;
}
else if(str[i] == 'e')
{
if (expect_num || e_found)
{
show_error("stuff"); // Not finished yet
success = false;
temp_str.clear();
break;
}
allow_sign = true;
expect_num = true;
e_found = true;
}
else if(str[i] == '+' || str[i] == '-')
{
if (expect_num || !allow_sign)
{
show_error("stuff"); // Not finished yet
success = false;
temp_str.clear();
break;
}
allow_sign = false;
expect_num = false;
}
temp_str += str[i];
++col;
++i;
}
val.push(
token((dot_found || e_found) ? kinds::flt_lit : kinds::int_lit,
temp_str, line, begin_col));
temp_str.clear();
}
// Identifiers must start with an alphabet or an underscore '_'
// All keywords start with an alphabet
else if(isalpha(str[i]) || str[i] == '_')
{
begin_col = col;
// Store beginning column of token
// Since a keyword or identifier can't span multiple lines,
// We don't need to store the line
while (i < str.length() && !(is_space(str[i]) || is_sym(str[i]) || is_oper(str[i]))) // also stop at operators so "i--" lexes as "i" then "--"
{
temp_str += str[i];
++col;
++i;
}
if(is_kw(temp_str))
val.push(token(kinds::kw, temp_str, line, begin_col));
else
val.push(token(kinds::iden, temp_str, line, begin_col));
temp_str.clear();
}
// If a token begins with an invalid character, issue an error
else
{
show_error("stuff"); // Not finished yet
success = false;
break;
}
}
return success;
}
const tokenstream &lexer::value() const
{
return val;
}
I don't mind criticism. In fact, I would love advice on how to improve this. With that said, it'd be very helpful if you could take some of your valuable time to review it.
Test case source [NOTE: the program written in the language is completely random and has no purpose]:
Int i = 0;

If(i < 0)
{
    i = 2 + 3.1;
    testStr = "Test";
    print("Test");
}

While(i >= 2)
{
    i--;
}
Lexer output (using a print function which prints the tokens stored in the lexer):
Lexer:
Token: KW, Val: "Int", Pos: (1, 1)
Token: Iden, Val: "i", Pos: (1, 5)
Token: Op, Val: "=", Pos: (1, 7)
Token: IntLit, Val: "0", Pos: (1, 9)
Token: Sym, Val: ";", Pos: (1, 10)
Token: KW, Val: "If", Pos: (3, 1)
Token: Sym, Val: "(", Pos: (3, 3)
Token: Iden, Val: "i", Pos: (3, 4)
Token: Op, Val: "<", Pos: (3, 6)
Token: IntLit, Val: "0", Pos: (3, 8)
Token: Sym, Val: ")", Pos: (3, 9)
Token: Sym, Val: "{", Pos: (4, 1)
Token: Iden, Val: "i", Pos: (5, 5)
Token: Op, Val: "=", Pos: (5, 7)
Token: IntLit, Val: "2", Pos: (5, 9)
Token: Op, Val: "+", Pos: (5, 11)
Token: FltLit, Val: "3.1", Pos: (5, 13)
Token: Sym, Val: ";", Pos: (5, 16)
Token: Iden, Val: "testStr", Pos: (6, 5)
Token: Op, Val: "=", Pos: (6, 13)
Token: StrLit, Val: "Test", Pos: (6, 15)
Token: Sym, Val: ";", Pos: (6, 21)
Token: Iden, Val: "print", Pos: (7, 5)
Token: Sym, Val: "(", Pos: (7, 10)
Token: StrLit, Val: "Test", Pos: (7, 11)
Token: Sym, Val: ")", Pos: (7, 17)
Token: Sym, Val: ";", Pos: (7, 18)
Token: Sym, Val: "}", Pos: (8, 1)
Token: KW, Val: "While", Pos: (10, 1)
Token: Sym, Val: "(", Pos: (10, 6)
Token: Iden, Val: "i", Pos: (10, 7)
Token: Op, Val: ">=", Pos: (10, 9)
Token: IntLit, Val: "2", Pos: (10, 12)
Token: Sym, Val: ")", Pos: (10, 13)
Token: Sym, Val: "{", Pos: (11, 1)
Token: Iden, Val: "i", Pos: (12, 5)
Token: Op, Val: "--", Pos: (12, 6)
Token: Sym, Val: ";", Pos: (12, 8)
Token: Sym, Val: "}", Pos: (13, 1)
- Welcome to Code Review! Since this is a lexer for your own programming language, it might help to post an example in that language just so that the reviewers can get a feeling what kind of syntax the code has to parse. From a quick look at your code, it seems to be C-like? – AlexV, Oct 9, 2019 at 11:10
- @AlexV Yes it is C-like. I thought it would be obvious enough so I didn't specifically mention it – Famiu, Oct 9, 2019 at 11:24
- Can you list the supported language constructs so far? – L. F., Oct 11, 2019 at 10:19
- @L.F. I guess for now I'll stick to simple mathematic operations, simple things like If Statements and While Loops. That's all, for now – Famiu, Oct 11, 2019 at 11:05
- OK. Can you provide some test cases to provide more context regarding how this is supposed to work? – L. F., Oct 11, 2019 at 11:13
1 Answer
Welcome to Code Review!
lexer.hpp
It is common practice to put standard headers before custom headers and sort them in alphabetical order:
#include <unordered_set>
#include <vector>
#include "reader.hpp"
Usually, class names are capitalized.
In general, try to avoid abbreviations. The token kinds are more readable with their full names:
enum class kinds {
keyword,
identifier,
operator_, // "operator" on its own is a reserved word
symbol,
int_literal,
float_literal,
string_literal,
null,
};
(The last comma is intentional.)
Strings should be passed around by value. Also, storing numbers as strings probably isn't a very good idea.
Make a dedicated class for line-column number pairs.
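For example, something along these lines (the name and members are just a suggestion):

#include <cstddef>

// A named line/column pair is more self-documenting than std::pair:
// callers can write tok.pos().line instead of tok.pos().first.
struct Source_position
{
    std::size_t line = 1;
    std::size_t column = 1;
};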
std::unordered_set generally performs worse than std::set without a carefully crafted hash function. Use std::set.
Currently, you are maintaining the class variant yourself. Things will become much easier if you use std::variant.
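A rough sketch of what that could look like (all of the names here are only illustrative):

#include <cstddef>
#include <string>
#include <variant>

// One small tag type per token kind, each carrying only the data it needs
struct Null {};
struct Keyword { std::string name; };
struct Identifier { std::string name; };
struct Operator { std::string symbol; };
struct Symbol { char symbol; };
struct Int_literal { int value; };
struct Float_literal { double value; };
struct String_literal { std::string value; };

using Token_data = std::variant<Null, Keyword, Identifier, Operator, Symbol,
                                Int_literal, Float_literal, String_literal>;

struct Token
{
    Token_data data;
    std::size_t line = 1;
    std::size_t column = 1;
};

std::holds_alternative and std::visit then replace the manual checks on the kind field.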
lexer.cpp
Instead of defining the default constructor yourself, use in-class member initializers and = default the default constructor. The other constructor should move from token_val, not copy.
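Roughly like this (a sketch only, keeping your current members):

#include <cstddef>
#include <string>
#include <utility>

class token
{
public:
    enum class kinds : char { kw, iden, op, symbol, int_lit, flt_lit, str_lit, null };

    token() = default;  // the in-class initializers below provide the "null" token
    token(kinds kind, std::string token_val, std::size_t pos_line, std::size_t pos_col)
        : knd{kind}, val{std::move(token_val)}, line{pos_line}, col{pos_col}
    {
    }
    // ... rest of the interface unchanged ...
private:
    kinds knd = kinds::null;
    std::string val;        // empty by default, no explicit initializer needed
    std::size_t line = 0;
    std::size_t col = 0;
};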
pos can just return {line, col}. (Also see above for making a dedicated position class.)
The lambdas don't need to explicitly specify -> bool. is_double_oper should accept by const reference.
Consistently put a space after a control keyword like while.
The tokenize function is becoming very, very long. It should be cut down into several functions.
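For example, one helper per token category, where the body of each helper is essentially the corresponding branch of the current while loop (all of the names below are only suggestions):

#include <cctype>
#include <cstddef>
#include <string>

class lexer_sketch
{
public:
    bool tokenize(const std::string& src);
private:
    // Character classification, same role as the lambdas in the original
    static bool is_space_char(char c);
    static bool is_symbol_char(char c);
    static bool is_operator_char(char c);
    // One helper per token category; each consumes one token starting at i,
    // updates i / line / col itself, and returns false on a lexing error.
    void skip_whitespace(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_symbol(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_operator(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_string_literal(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_number(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
    bool lex_identifier_or_keyword(const std::string& src, std::size_t& i, std::size_t& line, std::size_t& col);
};

// The top-level loop then only dispatches on the first character of each token.
bool lexer_sketch::tokenize(const std::string& src)
{
    std::size_t i = 0, line = 1, col = 1;
    while (i < src.length())
    {
        const char c = src[i];
        bool ok = true;
        if (is_space_char(c))
            skip_whitespace(src, i, line, col);
        else if (is_symbol_char(c))
            ok = lex_symbol(src, i, line, col);
        else if (is_operator_char(c))
            ok = lex_operator(src, i, line, col);
        else if (c == '"')
            ok = lex_string_literal(src, i, line, col);
        else if (std::isdigit(static_cast<unsigned char>(c)))
            ok = lex_number(src, i, line, col);
        else if (std::isalpha(static_cast<unsigned char>(c)) || c == '_')
            ok = lex_identifier_or_keyword(src, i, line, col);
        else
            ok = false;  // invalid starting character
        if (!ok)
            return false;
    }
    return true;
}

Each helper is then short enough to read and test on its own.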
- Thanks for your advice, I'll try to do all the stuff you suggested. I don't get one thing though, what exactly do you mean by "class variant"? I don't see anything related to variants in my code.. – Famiu, Oct 11, 2019 at 11:57
- @Arnil Basically, I mean the way you interpret the value (which is represented by a std::string) depends on the kind field. – L. F., Oct 11, 2019 at 11:59
- How can I use std::variant for it? I mean, the kinds are not separate types but just what kind of token the string represents – Famiu, Oct 11, 2019 at 12:01
- @Arnil Define a tag type for each token kind. For null, there is no data (struct Null {};). For integer literals, for example, you can store them as an int rather than a string (struct Int_literal { int value; };). Similarly for the rest. (Also, don't accept so fast, give others some time.) – L. F., Oct 11, 2019 at 12:05
- Also any advice on how I can shorten the tokenize function? Like, what will I make smaller functions for? – Famiu, Oct 11, 2019 at 12:05