I've got a lexer written in C++ (Visual Studio 2010, so including lambdas and a few other C++0x tricks). This is my first lexing experience, the only other source code interpretation I ever did was trivial, didn't separate parsing and lexing, that sort of thing. The grammar closely resembles C++ without templates and the lexer should output appropriate tokens.
Concerns:
- Unicode support- I absolutely want to support foreign-languages fully in this lexer.
- I also don't transmit any other content than the token type, and I'd also like to know how extensively I'd have to re-write to include them, for example, if I want to actually output an AST, I'll need to keep the identifiers instead of just deciding that they are and junking them.
- It's also been a long time since I wrote non-trivial C++ and I'd like to know how my comments are.
- In addition, I've used tools like ANTLR and flex and they generated massive lexers, and I'm concerned that I'm missing a trick here.
Things that I'm definitely not missing:
- No string or character literals- I haven't implemented them yet.
- No const, volatile, or cast keywords
That's pretty much it, I think.
class LexedFile {
public:
enum Token {
// RESERVED WORDS (and identifier)
// Taken care of by next() lambda.
// They're inserted in order (except identifier) so shouldn't be too hard to verify
Namespace,
Identifier,
For,
While,
Do,
Switch,
Case,
Default,
Try,
Catch,
Auto,
Type,
Break,
Continue,
Return,
Static,
Sizeof,
Decltype,
If,
Else,
// LOGICAL OPERATORS
LogicalAnd,
LogicalOr,
LogicalNot,
GreaterThan,
GreaterThanOrEqual,
LessThan,
LessThanOrEqual,
EqualComparison,
NotEqualComparison,
// BINARY OPERATORS
AND, // Re-used for address-of, the lexer doesn't care
ANDAssign,
XOR,
XORAssign,
OR,
ORAssign,
NOT,
NOTAssign,
LeftShift,
LeftShiftAssign,
RightShift,
RightShiftAssign,
// ARITHMETIC OPERATORS
Mul,
MulAssign,
Div,
DivAssign,
Mod,
ModAssign,
Plus,
PlusPlus,
PlusAssign,
Sub,
MinusMinus, // I dunno, SubSub just does't seem right.
SubAssign,
// LITERALS
String,
Float,
Integral,
Character,
// SYNTACTIC ELEMENTS
OpenParen,
CloseParen,
OpenSquare,
CloseSquare,
OpenCurlyParen,
CloseCurlyParen,
MemberAccess,
Comma,
Assign,
Semicolon,
Ellipses,
PointerMemberAccess,
Colon
};
std::vector<Token> tokens;
};
#pragma warning(disable : 4482)
class InputFile {
class StreamInput {
public:
std::vector<wchar_t>* ptr;
int index;
bool fail;
StreamInput(std::vector<wchar_t>* ref) {
ptr = ref;
index = 0;
fail = false;
}
operator bool() {
return !fail;
}
void putback(wchar_t back) {
index--;
}
StreamInput& operator>>(wchar_t& ref) {
if (index == ptr->size()) {
fail = true;
return *this;
}
ref = (*ptr)[index];
index++;
return *this;
}
};
std::wstring filename;
public:
InputFile(const std::wstring& argfilename)
: filename(argfilename) {}
LexedFile Lex() {
LexedFile l;
auto FileHandle = CreateFile(
filename.c_str(),
GENERIC_READ,
FILE_SHARE_READ,
nullptr,
OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
0
);
std::vector<wchar_t> buffer;
LARGE_INTEGER size;
GetFileSizeEx(FileHandle, &size);
buffer.resize(size.QuadPart);
DWORD read;
ReadFile(
FileHandle,
&buffer[0],
size.QuadPart,
&read,
nullptr
);
CloseHandle(FileHandle);
StreamInput input(&buffer);
//std::wifstream input(filename, std::ios::in);
std::vector<wchar_t> stack;
wchar_t current = 0;
static const std::unordered_map<std::wstring, LexedFile::Token> reserved_words(
[]() -> std::unordered_map<std::wstring, LexedFile::Token> {
std::unordered_map<std::wstring, LexedFile::Token> retval;
retval[L"namespace"] = LexedFile::Token::Namespace;
retval[L"for"] = LexedFile::Token::For;
retval[L"while"] = LexedFile::Token::While;
retval[L"do"] = LexedFile::Token::Do;
retval[L"switch"] = LexedFile::Token::Switch;
retval[L"case"] = LexedFile::Token::Case;
retval[L"default"] = LexedFile::Token::Default;
retval[L"try"] = LexedFile::Token::Try;
retval[L"catch"] = LexedFile::Token::Catch;
retval[L"auto"] = LexedFile::Token::Auto;
retval[L"type"] = LexedFile::Token::Type;
retval[L"break"] = LexedFile::Token::Break;
retval[L"continue"] = LexedFile::Token::Continue;
retval[L"return"] = LexedFile::Token::Return;
retval[L"static"] = LexedFile::Token::Static;
retval[L"sizeof"] = LexedFile::Token::Sizeof;
retval[L"decltype"] = LexedFile::Token::Decltype;
retval[L"if"] = LexedFile::Token::If;
retval[L"else"] = LexedFile::Token::Else;
return retval;
}()
);
// Dumps the stack, looking for reserved words or identifiers.
auto next_no_token = [&] {
if (stack.empty()) {
return;
}
std::wstring token(stack.begin(), stack.end());
stack.clear();
if (reserved_words.find(token) != reserved_words.end()) {
l.tokens.push_back(reserved_words.find(token)->second);
return;
}
l.tokens.push_back(LexedFile::Identifier);
};
auto next = [&](Wide::LexedFile::Token t) {
if (stack.empty()) {
l.tokens.push_back(t);
return;
}
std::wstring token(stack.begin(), stack.end());
stack.clear();
if (reserved_words.find(token) != reserved_words.end()) {
l.tokens.push_back(reserved_words.find(token)->second);
return;
}
l.tokens.push_back(LexedFile::Identifier);
l.tokens.push_back(t);
};
input >> current;
if (current == 0xFEFF || current == 0xFFFE) {
// BOM
} else {
input.putback(current);
}
while(input >> current) {
// First we eliminate whitespace, when possible- including comments.
// Then we define predicates to recognize tokens taking various forms.
// Then they're called.
// Then if none of them find anything, we put the character down as "identifier" in the stack and move on.
// Whitespace
if (current == L' ' || current == L'\n' || current == L'\t' || current == L'\r' ) {
next_no_token();
continue;
}
// Comment - also div and div-equals
if (current == L'/') {
// Check for comments first- we'll look ahead
input >> current;
if (current == L'/') {
// single-line comment
// explicitly empty while loop
while(input >> current && current != L'\n');
next_no_token();
continue;
}
if (current == L'*') {
// multi-line comment
while(true) {
// Wait until we find a terminator.
while(input >> current && current != L'*');
input >> current;
if (current == L'/') {
break;
}
// If we found a * for some other reason, like commented out code, then
// just keep going again looking for one.
}
next_no_token();
continue;
}
// If we weren't a comment, check for /=
if (current == L'=') {
next(LexedFile::Token::DivAssign);
continue;
}
// We don't have any more dual-character tokens to check for
next(LexedFile::Token::Div);
input.putback(current); // Put back the look-ahead token
continue;
}
// Define predicates that will look for operators that occur in various groups.
// Check for the operators that only exist on their own, and as assignment versions
// "double_check"
// ! !=
// ~ ~=
// % %=
// * *=
auto check = [&](wchar_t token, LexedFile::Token original, LexedFile::Token original_assign) -> bool {
if (current == token) {
// Grab a look-ahead for *=
input >> current;
if (current == L'=') {
next(original_assign);
return true;
}
input.putback(current);
next(original);
return true;
}
return false;
};
// For tokens that take the form (for example)
// + ++ +=
// - -- -=
// | || |=
// & && &=
auto triple_check = [&](
wchar_t token,
LexedFile::Token original,
LexedFile::Token original_original,
LexedFile::Token original_equals) -> bool
{
if (current == token) {
input >> current;
if (current == token) {
next(original_original);
return true;
}
if (current == L'=') {
next(original_equals);
return true;
}
input.putback(current);
next(original);
return true;
}
return false;
};
// Written to handle <, <<, <=, <<= and >, >>, >=, >>=
auto quadruple_check = [&](
wchar_t token,
LexedFile::Token original,
LexedFile::Token original_original,
LexedFile::Token original_equals,
LexedFile::Token original_original_equals) -> bool
{
if (current == token) {
input >> current;
if (current == L'=') {
next(original_equals);
return true;
}
if (current == token) {
// We can put this back, and then just "check" for bitwise, which we know will succeed.
check(token, original_original, original_original_equals);
return true;
}
input.putback(current);
next(original);
return true;
}
return false;
};
// Grab *, *=, %, %=, =, ==, !, !=, ~, ~=
if (check(L'*', LexedFile::Token::Mul, LexedFile::Token::MulAssign)) continue;
if (check(L'%', LexedFile::Token::Mod, LexedFile::Token::ModAssign)) continue;
if (check(L'=', LexedFile::Token::Assign, LexedFile::Token::EqualComparison)) continue;
if (check(L'!', LexedFile::Token::LogicalNot, LexedFile::Token::NotEqualComparison)) continue;
if (check(L'~', LexedFile::Token::NOT, LexedFile::Token::NOTAssign)) continue;
if (check(L'^', LexedFile::Token::XOR, LexedFile::Token::XORAssign)) continue;
// Triple group: +, ++, +=, -, --, -=, |, ||, |=, &, &&, &=
if (triple_check(
L'+',
LexedFile::Token::Plus,
LexedFile::Token::PlusPlus,
LexedFile::Token::PlusAssign
)) continue;
// Pointer member access operator
// Handle the special case, then just move on
if (current == L'-') {
input >> current;
if (current == L'>') {
next(LexedFile::Token::PointerMemberAccess);
continue;
}
input.putback(current);
current = L'-';
}
if (triple_check(
L'-',
LexedFile::Token::Sub,
LexedFile::Token::MinusMinus,
LexedFile::Token::SubAssign
)) continue;
if (triple_check(
L'|',
LexedFile::Token::OR,
LexedFile::Token::LogicalOr,
LexedFile::Token::ORAssign
)) continue;
if (triple_check(
L'&',
LexedFile::Token::AND,
LexedFile::Token::LogicalAnd,
LexedFile::Token::ANDAssign
)) continue;
// Pretty much only <<=, <<, <=, < and >>=, >>, >=, > fit into the quadruple group.
if (quadruple_check(
L'<',
LexedFile::Token::LessThan,
LexedFile::Token::LeftShift,
LexedFile::Token::LessThanOrEqual,
LexedFile::Token::LeftShiftAssign
)) continue;
if (quadruple_check(
L'>',
LexedFile::Token::GreaterThan,
LexedFile::Token::RightShift,
LexedFile::Token::GreaterThanOrEqual,
LexedFile::Token::RightShiftAssign
)) continue;
// That's everything. Now on to the other syntactic elements
// This lambda is of dubious value, mostly here just to be consistent
// Elements of just one character
auto syntactic = [&](wchar_t token, LexedFile::Token original) -> bool {
if (current == token) {
next(original);
return true;
}
return false;
};
if (syntactic(L'(', LexedFile::Token::OpenParen)) continue;
if (syntactic(L')', LexedFile::Token::CloseParen)) continue;
if (syntactic(L'[', LexedFile::Token::OpenSquare)) continue;
if (syntactic(L']', LexedFile::Token::CloseSquare)) continue;
if (syntactic(L':', LexedFile::Token::Colon)) continue;
if (syntactic(L',', LexedFile::Token::Comma)) continue;
if (syntactic(L';', LexedFile::Token::Semicolon)) continue;
if (syntactic(L'}', LexedFile::Token::CloseCurlyParen)) continue;
if (syntactic(L'{', LexedFile::Token::OpenCurlyParen)) continue;
// Just dot, ellipses left, and then literal integers, floats, etc
if (current == L'.') {
input >> current;
if (current == L'.') {
// The only possible thing that can be here is another dot anyways
input >> current;
if (current == L'.') {
next(LexedFile::Token::Ellipses);
continue;
}
// Double dot illegal!
throw std::runtime_error("Lexer failure: encountered '..'");
}
input.putback(current);
next(LexedFile::Token::MemberAccess);
continue;
}
// Literals
// If the stack is empty and we are a numeric character, then we begin a numeric literal.
if (stack.empty() && current >= L'0' && current <= L'9') {
// Consume until no more decimals
while(input >> current && current >= L'0' && current <= L'9');
// If the next character is a dot, we're a floating-point literal
if (current == L'.') {
// Consume all the characters that make up the float
while(input >> current && current >= L'0' && current <= L'9');
next(LexedFile::Token::Float);
// Make "current" available for the next read as it's not part of the literal
input.putback(current);
continue;
}
next(LexedFile::Token::Integral);
input.putback(current);
continue;
}
// Skip string and character literals, and FP exponents for now
// Not recognized, push and go again, and we don't want next() in this case.
stack.push_back(current);
}
return l;
}
};
Edit: I ran it through some test cases and fixed some bugs.
2 Answers 2
well....duplicated code is the first order
current >= L'0' && current <= L'9'
might want to extract it out to isDigit or something. same with isWhitespace.....etc even if used one spot, it will help it be a lot more readable.
In fact you might want to take various parts and extract into functions.
I'm not a fan of while(true) loops. I think they could maybe be written better.
I think maybe having more primitive functions to help you navigate the text would be useful skipTill, etc.
Some things you can improve:
The initialization of
reserved_words
is far too complex. You can use an initializer it to simplify it:static const std::unordered_map<std::wstring, LexedFile::Token> reserved_words = { { L"namespace", LexedFile::Token::Namespace }, { L"for", LexedFile::Token::For }, { L"while", LexedFile::Token::While }, { L"do", LexedFile::Token::Do }, { L"switch", LexedFile::Token::Switch }, { L"case", LexedFile::Token::Case }, { L"default", LexedFile::Token::Default }, { L"try", LexedFile::Token::Try }, { L"catch", LexedFile::Token::Catch }, { L"auto", LexedFile::Token::Auto }, { L"type", LexedFile::Token::Type }, { L"break", LexedFile::Token::Break }, { L"continue", LexedFile::Token::Continue }, { L"return", LexedFile::Token::Return }, { L"static", LexedFile::Token::Static }, { L"sizeof", LexedFile::Token::Sizeof }, { L"decltype", LexedFile::Token::Decltype }, { L"if", LexedFile::Token::If }, { L"else", LexedFile::Token::Else } };
In
InputFile::StreamInput
, you probably want to have anexplicit operator bool()
instead of a bare one. It was prevent errors likeStreamInput stri = false;
which feels quite wrong. Also, it lacks aconst
qualification, it should be:explicit operator bool() const { return !fail; }
The constructor of
StreamInput
should initialize the members via a constructor initialization list:StreamInput(std::vector<wchar_t>* ref): ptr{ref}, index{0}, fail{false} {}
The parameter in
InputFile::StreamInput::putback
is unused. You can simply delete its name to avoid warnings if you really don't need it.
I did not read the entire lexer (well, trying to find errors in a full lexer is quite hard), so I won't comment about its internal logic. I will leave this part to somebody else :)
-
\$\begingroup\$ VS2010 doesn't have initializer lists or explicit operators. Also, you shoulda checked the date on the posting- I don't have anything like this anymore, because this is from three years ago. \$\endgroup\$DeadMG– DeadMG2014年04月18日 09:31:09 +00:00Commented Apr 18, 2014 at 9:31
-
\$\begingroup\$ @DeadMG Yeah, I realized the data after having posted this answer (for some reason, the quesiton was back on the main page). I realized that it was actually one of your questions when I came accross
Wide::
. But whatever. \$\endgroup\$Morwenn– Morwenn2014年04月18日 09:55:40 +00:00Commented Apr 18, 2014 at 9:55