I'm trying to create a very basic compiler, and I'm working on the lexer. I decided to store keywords and operators in an unordered_set, and use regex to match identifiers and literals (no comments yet).
Tokens.h
#ifndef TOKENS_H
#define TOKENS_H

#include <unordered_set>
#include <string>

namespace AVSL
{
    enum class TokenType
    {
        Identifier,
        Keyword,
        String,
        Literal,
        Comment,
        Operator,
        Delim
    };

    static const std::unordered_set<std::string> keywords = { "var", "for", "while", "print", "constant" };
    static const std::unordered_set<std::string> operators = { "+", "-", "*", "/", "=", "==", "+=", "-=", "*=", "/=" };
    static const std::unordered_set<char> delims = { '(', ')', ';' };
    static const std::unordered_set<char> whitespace = { '\n', '\r', '\t', ' ' };

    struct Token
    {
        std::string tokenValue;
        TokenType tokenType;

        Token() = delete;
        Token(const std::string& tokenValue_, TokenType tokenType_) : tokenValue(tokenValue_), tokenType(tokenType_) { /* empty */ }
        Token(const Token& token) : tokenValue(token.tokenValue), tokenType(token.tokenType) { /* empty */ }
    };
}
#endif
Lexer.h
#ifndef LEXER_H
#define LEXER_H

#include "Tokens.h"

using LineNo = unsigned int;

namespace AVSL
{
    std::vector<Token> Tokenize(std::string filename);
    Token getToken(std::string buffer, LineNo line);
}
#endif
Lexer.cpp
#include "Lexer.h"
#include<fstream>
#include<iostream>
#include<regex>
namespace AVSL
{
Token getToken(std::string buffer, LineNo line)
{
if (operators.find(buffer) != operators.end())
{
return Token(buffer, TokenType::Operator);
}
else if (keywords.find(buffer) != keywords.end())
{
return Token(buffer, TokenType::Keyword);
}
std::regex iden("[a-zA-Z][a-zA-Z0-9_]*");
std::regex str("\".*\"");
std::regex lit("^[0-9]+$");
//std::regex;
if (std::regex_match(buffer, iden))
{
return Token(buffer, TokenType::Identifier);
}
if (std::regex_match(buffer, str))
{
return Token(buffer, TokenType::String);
}
if (std::regex_match(buffer, lit))
{
return Token(buffer, TokenType::Literal);
}
/* No support for comments yet*/
/* Will only reach here if all else fails, invalid string */
std::cout << "Invalid expression at line number " << line << " : " << buffer;
std::cin.get();
exit(0);
}
std::vector<Token> Tokenize(std::string filename)
{
std::ifstream file(filename, std::ios::in);
if (file.fail())
{
std::cout << "Unable to open file!\n";
std::cin.get();
exit(0);
}
LineNo line = 1;
std::string buffer = "";
char ch;
std::vector<Token> tokens;
while (file >> std::noskipws >> ch)
{
if (ch == '\n' || ch == '\r')
line += 1;
if (delims.find(ch) != delims.end())
{
if (buffer != "")
{
tokens.push_back(getToken(buffer, line));
buffer = "";
}
tokens.push_back(Token(std::string(1, ch), TokenType::Delim));
continue;
}
else if (whitespace.find(ch) != whitespace.end())
{
if (buffer != "")
{
tokens.push_back(getToken(buffer, line));
buffer = "";
}
continue;
}
else
{
buffer += ch;
}
}
//std::cout << line << "\n";
return tokens;
}
}
main.cpp
#include "Tokens.h"
#include "Lexer.h"
#include<iostream>
#include<unordered_map>
#include<fstream>
int main(int argc, char** argv)
{
std::unordered_map<AVSL::TokenType, std::string> tokenMap = { {AVSL::TokenType::Keyword, "Keyword"}, {AVSL::TokenType::Identifier, "Identifier"}, {AVSL::TokenType::Operator, "Operator"}, {AVSL::TokenType::Literal, "Literal"}, {AVSL::TokenType::Delim, "Delim" } };
auto vec = AVSL::Tokenize("dummy.txt");
for (auto x : vec)
{
std::cout << x.tokenValue << "->" << tokenMap.find(x.tokenType)->second << "\n";
}
std::cin.get();
return 0;
}
The lexer works fine so far. On a dummy file:
var i = 42;
It prints out:
var->Keyword
i->Identifier
=->Operator
42->Literal
;->Delim
Also, on invalid inputs, it will print out the line number and invalid string:
var i = 42;
i += 2a;
Invalid expression at line number 5 : 2a
Any advice/comments would be appreciated. I'm mostly concerned about error handling. Right now, all it does is print to stdout and exit the program.
- I'm afraid the regex approach will get back at you once you start using escape sequences, literal escapes, quoting, etc. => str = "my \"quoted\" string" or \u0022my string started with an escape sequence" – dfhwze, Jun 3, 2019
- Welcome to Code Review! Great first question! Not only is the code clear, and the title good, but you've also done a good job of showing the context with an example. I look forward to seeing more good questions from you! – Edward, Jun 3, 2019
- Please see What to do when someone answers. I have rolled back Rev 2 → 1. – Sᴀᴍ Onᴇᴌᴀ, Jun 3, 2019
- I would look into how to use Lex/Bison; most of this can be done automatically with the correct tools. – Loki Astari, Jun 4, 2019
1 Answer
Here are some things that may help you improve your program.
Use all required #includes

The Lexer.h file uses std::string and std::vector as part of its interface, but does not include <vector> at all, and only indirectly includes <string> because that's part of Tokens.h. I'd recommend explicitly putting both <vector> and <string> in Lexer.h so that if, for example, one wanted to alter the interface for Token to not use std::string, this file would still remain the same.
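Concretely, the top of the header could carry its own includes (a minimal sketch of the same interface; the only additions are the two includes):

#ifndef LEXER_H
#define LEXER_H

#include "Tokens.h"

#include <string>   // std::string appears in both function signatures
#include <vector>   // std::vector<Token> is the return type of Tokenize

using LineNo = unsigned int;

namespace AVSL
{
    std::vector<Token> Tokenize(std::string filename);
    Token getToken(std::string buffer, LineNo line);
}
#endif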
Eliminate unused variables

In this program's main(), argc and argv are unused and should be eliminated from the program. Or perhaps use the next suggestion instead.
Allow the user to specify the input file

The file names are currently hardcoded, which greatly restricts the usefulness of the program. Consider using argc and argv to allow the user to specify file names on the command line.
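A minimal sketch of what that could look like (the usage message and exit code are my own choices):

#include "Lexer.h"
#include <iostream>

int main(int argc, char** argv)
{
    if (argc != 2)
    {
        std::cerr << "Usage: " << argv[0] << " <inputfile>\n";
        return 1;
    }
    // Hand the user-supplied file name straight to the tokenizer.
    auto tokens = AVSL::Tokenize(argv[1]);
    // ... print the tokens as before ...
    return 0;
}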
Consider allowing a std::istream parameter for input

As it stands, the Tokenize() function is only capable of reading its input from a file with the passed name, but not, say, std::cin or any other stream. Instead of handling file I/O there, change it to take a std::istream&. This makes the function much more flexible and also somewhat smaller and simpler.
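A sketch of the reshaped interface, with the file handling moved to the caller (this assumes the character-reading loop inside Tokenize stays as it is):

// Lexer.h: the declaration now takes a stream, not a file name.
namespace AVSL
{
    std::vector<Token> Tokenize(std::istream& in);
}

// Caller side, e.g. in main(), with <fstream> and <iostream> included:
std::ifstream file(filename);
if (!file)
{
    std::cerr << "Unable to open file!\n";
    return 1;
}
auto tokens = AVSL::Tokenize(file);   // ...or AVSL::Tokenize(std::cin)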
Keep associated things closer together

Right now there is an AVSL::TokenType::Keyword, an unordered_map that contains the string "Keyword", and a list of keywords, and then part of the parser also looks for those. They are literally spread out over all the files. Keeping associated things together is exactly a job for an object. First, I'd rename TokenType to Type and put it inside struct Token, so now we have Token::Type everywhere that TokenType is currently used. Next, we could create a more generic Classifier class:
class Classifier {
public:
    Classifier(Token::Type ttype, std::string name, std::unordered_set<std::string> tokens) :
        my_type{ttype},
        my_name{name},
        my_tokens{tokens}
    {}

    bool among(const std::string& word) const { return my_tokens.find(word) != my_tokens.end(); }
    Token::Type type() const { return my_type; }

private:
    Token::Type my_type;
    std::string my_name;
    std::unordered_set<std::string> my_tokens;
};
static const std::vector<Classifier> classifiers {
    { Token::Type::Keyword,  "Keyword",  { "var", "for", "while", "print", "constant" } },
    { Token::Type::Operator, "Operator", { "+", "-", "*", "/", "=", "==", "+=", "-=", "*=", "/=" } },
};
Now the first part of getToken() could look like this:
for (const auto &thing : classifiers) {
if (thing.among(buffer)) {
return Token(buffer, thing.type());
}
}
Alternatively, one could include a function in the Token class that would emit the type name for printing. If you have C++17, I'd suggest returning a std::string_view for that purpose.
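A sketch of how that member might look (the name typeName and the switch-based mapping are my own invention, and it assumes the Token::Type rename from the previous section):

#include <string>
#include <string_view>

struct Token
{
    enum class Type { Identifier, Keyword, String, Literal, Comment, Operator, Delim };

    std::string tokenValue;
    Type tokenType;

    // Maps the type tag to a printable name; string_view avoids a copy (C++17).
    std::string_view typeName() const
    {
        switch (tokenType)
        {
            case Type::Identifier: return "Identifier";
            case Type::Keyword:    return "Keyword";
            case Type::String:     return "String";
            case Type::Literal:    return "Literal";
            case Type::Comment:    return "Comment";
            case Type::Operator:   return "Operator";
            case Type::Delim:      return "Delim";
        }
        return "Unknown";
    }
};

main() could then print x.tokenValue << "->" << x.typeName() and drop the tokenMap entirely.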
Wrap the regex operations into classes

It's not too difficult to imagine how, like the basic static Classifier object above, one could use a slightly more sophisticated version to employ a std::regex. If you have both kinds of classifiers derive from the same base object, then even more goodness can happen, because now your code is data-driven, which makes things easier to maintain and to understand.
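A rough sketch of that shape (all class and member names here are invented for illustration, and it again assumes the Token::Type rename from above):

#include <regex>
#include <string>

// Common base: anything that can classify a buffer into a token type.
class ClassifierBase
{
public:
    explicit ClassifierBase(Token::Type ttype) : my_type{ttype} {}
    virtual ~ClassifierBase() = default;
    virtual bool matches(const std::string& word) const = 0;
    Token::Type type() const { return my_type; }
private:
    Token::Type my_type;
};

// Regex-driven variant, for identifiers, literals, and strings.
class RegexClassifier : public ClassifierBase
{
public:
    RegexClassifier(Token::Type ttype, const std::string& pattern)
        : ClassifierBase{ttype}, my_regex{pattern} {}
    bool matches(const std::string& word) const override
    {
        return std::regex_match(word, my_regex);
    }
private:
    std::regex my_regex;
};

// getToken() then reduces to one data-driven loop over a single table
// holding both word-set and regex classifiers:
//
//   for (const auto& c : classifiers)
//       if (c->matches(buffer))
//           return Token(buffer, c->type());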
Use a standard tool instead

There are a lot of inefficiencies in the current code. For one thing, the same buffer is scanned many times in trying to find a match. This is very inefficient and will make a big difference with larger files or more complex lexers. I'd recommend using flex++ instead to create a lexer, which is both very easy to maintain and read, and also very efficient and fast.
- Thanks! I've thought a lot about utilizing Flex or some other parser generator, but this project is purely for my own academic interests, and for such a lightweight "language" it seems overkill. Maybe if I do decide to implement more features, I might use something like that. Also, I really liked the idea of using a Classifier class, and I might do something like that. :) – Rish, Jun 3, 2019
- In my experience, the time invested in learning how to use flex and bison far outweighed the time spent creating my own lexers and parsers, in terms of value received for time spent. The latest releases of both have much improved C++ interfaces and facilities, so if you'd looked at them before but didn't spend time on them because they seemed very C-centric, it may well be worth another look. – Edward, Jun 3, 2019