I'm trying to create a very basic compiler, and I'm working on the lexer. I decided to store keywords and operators in an unordered_set, and use regex to match identifiers and literals (no comments yet).
Tokens.h
#ifndef TOKENS_H
#define TOKENS_H

#include <unordered_set>
#include <string>

namespace AVSL
{
    enum class TokenType
    {
        Identifier,
        Keyword,
        String,
        Literal,
        Comment,
        Operator,
        Delim
    };

    static const std::unordered_set<std::string> keywords = { "var", "for", "while", "print", "constant" };
    static const std::unordered_set<std::string> operators = { "+", "-", "*", "/", "=", "==", "+=", "-=", "*=", "/=" };
    static const std::unordered_set<char> delims = { '(', ')', ';' };
    static const std::unordered_set<char> whitespace = { '\n', '\r', '\t', ' ' };

    struct Token
    {
        std::string tokenValue;
        TokenType tokenType;

        Token() = delete;
        Token(const std::string& tokenValue_, TokenType tokenType_) : tokenValue(tokenValue_), tokenType(tokenType_) { /* empty */ }
        Token(const Token& token) : tokenValue(token.tokenValue), tokenType(token.tokenType) { /* empty */ }
    };
}
#endif
Lexer.h
#ifndef LEXER_H
#define LEXER_H

#include "Tokens.h"

using LineNo = unsigned int;

namespace AVSL
{
    std::vector<Token> Tokenize(std::string filename);
    Token getToken(std::string buffer, LineNo line);
}
#endif
Lexer.cpp
#include "Lexer.h"
#include<fstream>
#include<iostream>
#include<regex>
namespace AVSL
{
Token getToken(std::string buffer, LineNo line)
{
if (operators.find(buffer) != operators.end())
{
return Token(buffer, TokenType::Operator);
}
else if (keywords.find(buffer) != keywords.end())
{
return Token(buffer, TokenType::Keyword);
}
std::regex iden("[a-zA-Z][a-zA-Z0-9_]*");
std::regex str("\".*\"");
std::regex lit("^[0-9]+$");
//std::regex;
if (std::regex_match(buffer, iden))
{
return Token(buffer, TokenType::Identifier);
}
if (std::regex_match(buffer, str))
{
return Token(buffer, TokenType::String);
}
if (std::regex_match(buffer, lit))
{
return Token(buffer, TokenType::Literal);
}
/* No support for comments yet*/
/* Will only reach here if all else fails, invalid string */
std::cout << "Invalid expression at line number " << line << " : " << buffer;
std::cin.get();
exit(0);
}
std::vector<Token> Tokenize(std::string filename)
{
std::ifstream file(filename, std::ios::in);
if (file.fail())
{
std::cout << "Unable to open file!\n";
std::cin.get();
exit(0);
}
LineNo line = 1;
std::string buffer = "";
char ch;
std::vector<Token> tokens;
while (file >> std::noskipws >> ch)
{
if (ch == '\n' || ch == '\r')
line += 1;
if (delims.find(ch) != delims.end())
{
if (buffer != "")
{
tokens.push_back(getToken(buffer, line));
buffer = "";
}
tokens.push_back(Token(std::string(1, ch), TokenType::Delim));
continue;
}
else if (whitespace.find(ch) != whitespace.end())
{
if (buffer != "")
{
tokens.push_back(getToken(buffer, line));
buffer = "";
}
continue;
}
else
{
buffer += ch;
}
}
//std::cout << line << "\n";
return tokens;
}
}
main.cpp
#include "Tokens.h"
#include "Lexer.h"
#include<iostream>
#include<unordered_map>
#include<fstream>
int main(int argc, char** argv)
{
std::unordered_map<AVSL::TokenType, std::string> tokenMap = { {AVSL::TokenType::Keyword, "Keyword"}, {AVSL::TokenType::Identifier, "Identifier"}, {AVSL::TokenType::Operator, "Operator"}, {AVSL::TokenType::Literal, "Literal"}, {AVSL::TokenType::Delim, "Delim" } };
auto vec = AVSL::Tokenize("dummy.txt");
for (auto x : vec)
{
std::cout << x.tokenValue << "->" << tokenMap.find(x.tokenType)->second << "\n";
}
std::cin.get();
return 0;
}
The lexer works fine so far. On a dummy file:
var i = 42;
It prints out:
var->Keyword
i->Identifier
=->Operator
42->Literal
;->Delim
Also, on invalid inputs, it will print out the line number and invalid string:
var i = 42;
i += 2a;
Invalid expression at line number 5 : 2a
Any advice/comments would be appreciated. I'm mostly concerned about error handling. Right now, all it does is print to stdout and exit the program.
- I'm afraid the regex approach will get back at you once you start using escape sequences, literal escapes, quoting, etc. => str = "my \"quoted\" string" or \u0022my string started with an escape sequence" – dfhwze, Jun 3, 2019
- Welcome to Code Review! Great first question! Not only is the code clear, and the title good, but you've also done a good job of showing the context with an example. I look forward to seeing more good questions from you! – Edward, Jun 3, 2019
- Please see What to do when someone answers. I have rolled back Rev 2 → 1. – Sᴀᴍ Onᴇᴌᴀ, Jun 3, 2019
- I would look into how to use Lex/Bison; most of this can be done automatically with the correct tools. – Loki Astari, Jun 4, 2019
1 Answer
Here are some things that may help you improve your program.
Use all required #includes

The Lexer.h file uses std::string and std::vector as part of its interface, but does not include <vector> at all, and only indirectly includes <string> because that's part of Tokens.h. I'd recommend explicitly putting both <vector> and <string> in Lexer.h so that if, for example, one wanted to alter the interface for Token to not use std::string, this file would still remain the same.
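Concretely, the top of the header could carry its own includes (a minimal sketch of the same interface; the only additions are the two includes):

#ifndef LEXER_H
#define LEXER_H

#include "Tokens.h"

#include <string>   // std::string appears in both function signatures
#include <vector>   // std::vector<Token> is the return type of Tokenize

using LineNo = unsigned int;

namespace AVSL
{
    std::vector<Token> Tokenize(std::string filename);
    Token getToken(std::string buffer, LineNo line);
}
#endif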
Eliminate unused variables

In this program's main(), argc and argv are unused and should be eliminated from the program. Or perhaps use the next suggestion instead.
Allow the user to specify the input file

The file names are currently hardcoded, which greatly restricts the usefulness of the program. Consider using argc and argv to allow the user to specify file names on the command line.
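A minimal sketch of what that could look like (the usage message and exit code are my own choices):

#include "Lexer.h"
#include <iostream>

int main(int argc, char** argv)
{
    if (argc != 2)
    {
        std::cerr << "Usage: " << argv[0] << " <inputfile>\n";
        return 1;
    }
    // Hand the user-supplied file name straight to the tokenizer.
    auto tokens = AVSL::Tokenize(argv[1]);
    // ... print the tokens as before ...
    return 0;
}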
Consider allowing a std::istream parameter for input

As it stands, the Tokenize() function is only capable of reading its input from a file with the passed name, but not, say, std::cin or any other stream. Instead of handling file I/O there, change it to take a std::istream&. This makes the function much more flexible and also somewhat smaller and simpler.
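A sketch of the reshaped interface, with the file handling moved to the caller (this assumes the character-reading loop inside Tokenize stays as it is):

// Lexer.h: the declaration now takes a stream, not a file name.
namespace AVSL
{
    std::vector<Token> Tokenize(std::istream& in);
}

// Caller side, e.g. in main(), with <fstream> and <iostream> included:
std::ifstream file(filename);
if (!file)
{
    std::cerr << "Unable to open file!\n";
    return 1;
}
auto tokens = AVSL::Tokenize(file);   // ...or AVSL::Tokenize(std::cin)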
Keep associated things closer together

Right now there is an AVSL::TokenType::Keyword, an unordered_map that contains the string "Keyword", and a list of keywords, and then part of the parser also looks for those. They are literally spread out over all the files. Keeping associated things together is exactly a job for an object. First, I'd rename TokenType to Type and put it inside struct Token, so now we have Token::Type everywhere that TokenType is currently used. Next, we could create a more generic Classifier class:
class Classifier {
public:
    Classifier(Token::Type ttype, std::string name, std::unordered_set<std::string> tokens) :
        my_type{ttype},
        my_name{name},
        my_tokens{tokens}
    {}

    bool among(const std::string& word) const { return my_tokens.find(word) != my_tokens.end(); }
    Token::Type type() const { return my_type; }

private:
    Token::Type my_type;
    std::string my_name;
    std::unordered_set<std::string> my_tokens;
};
static const std::vector<Classifier> classifiers {
    { Token::Type::Keyword,  "Keyword",  { "var", "for", "while", "print", "constant" } },
    { Token::Type::Operator, "Operator", { "+", "-", "*", "/", "=", "==", "+=", "-=", "*=", "/=" } },
};
Now the first part of getToken() could look like this:
for (const auto &thing : classifiers) {
if (thing.among(buffer)) {
return Token(buffer, thing.type());
}
}
Alternatively, one could include a function in the Token class that would emit the type name for printing. If you have C++17, I'd suggest returning a std::string_view for that purpose.
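A sketch of how that member might look (the name typeName and the switch-based mapping are my own invention, and it assumes the Token::Type rename from the previous section):

#include <string>
#include <string_view>

struct Token
{
    enum class Type { Identifier, Keyword, String, Literal, Comment, Operator, Delim };

    std::string tokenValue;
    Type tokenType;

    // Maps the type tag to a printable name; string_view avoids a copy (C++17).
    std::string_view typeName() const
    {
        switch (tokenType)
        {
            case Type::Identifier: return "Identifier";
            case Type::Keyword:    return "Keyword";
            case Type::String:     return "String";
            case Type::Literal:    return "Literal";
            case Type::Comment:    return "Comment";
            case Type::Operator:   return "Operator";
            case Type::Delim:      return "Delim";
        }
        return "Unknown";
    }
};

main() could then print x.tokenValue << "->" << x.typeName() and drop the tokenMap entirely.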
Wrap the regex operations into classes

It's not too difficult to imagine how, like the basic static Classifier object above, one could use a slightly more sophisticated version to employ a std::regex. If you have both kinds of classifiers derive from the same base object, then even more goodness can happen, because now your code is data-driven, which makes things easier to maintain and to understand.
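A rough sketch of that shape (all class and member names here are invented for illustration, and it again assumes the Token::Type rename from above):

#include <regex>
#include <string>

// Common base: anything that can classify a buffer into a token type.
class ClassifierBase
{
public:
    explicit ClassifierBase(Token::Type ttype) : my_type{ttype} {}
    virtual ~ClassifierBase() = default;
    virtual bool matches(const std::string& word) const = 0;
    Token::Type type() const { return my_type; }
private:
    Token::Type my_type;
};

// Regex-driven variant, for identifiers, literals, and strings.
class RegexClassifier : public ClassifierBase
{
public:
    RegexClassifier(Token::Type ttype, const std::string& pattern)
        : ClassifierBase{ttype}, my_regex{pattern} {}
    bool matches(const std::string& word) const override
    {
        return std::regex_match(word, my_regex);
    }
private:
    std::regex my_regex;
};

// getToken() then reduces to one data-driven loop over a single table
// holding both word-set and regex classifiers:
//
//   for (const auto& c : classifiers)
//       if (c->matches(buffer))
//           return Token(buffer, c->type());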
Use a standard tool instead

There are a lot of inefficiencies in the current code. For one thing, the same buffer is scanned many times in trying to find a match. This is very inefficient and will make a big difference with larger files or more complex lexers. I'd recommend using flex++ instead to create a lexer, which is both very easy to maintain and read, and also very efficient and fast.
- Thanks! I've thought a lot about utilizing Flex or some other parser generator, but this project is purely for my own academic interests, and for such a lightweight "language" it seems overkill. Maybe if I do decide to implement more features, I might use something like that. Also, I really liked the idea of using a Classifier class, and I might do something like that. :) – Rish, Jun 3, 2019
- In my experience, the time invested in learning how to use flex and bison far outweighed the time spent creating my own lexers and parsers, in terms of value received for time spent. The latest releases of both have much improved C++ interfaces and facilities, so if you'd looked at them before but didn't spend time on them because they seemed very C-centric, it may well be worth another look. – Edward, Jun 3, 2019