7
\$\begingroup\$

I've got a lexer written in C++ (Visual Studio 2010, so including lambdas and a few other C++0x tricks). This is my first lexing experience, the only other source code interpretation I ever did was trivial, didn't separate parsing and lexing, that sort of thing. The grammar closely resembles C++ without templates and the lexer should output appropriate tokens.

Concerns:

  • Unicode support- I absolutely want to support foreign-languages fully in this lexer.
  • I also don't transmit any other content than the token type, and I'd also like to know how extensively I'd have to re-write to include them, for example, if I want to actually output an AST, I'll need to keep the identifiers instead of just deciding that they are and junking them.
  • It's also been a long time since I wrote non-trivial C++ and I'd like to know how my comments are.
  • In addition, I've used tools like ANTLR and flex and they generated massive lexers, and I'm concerned that I'm missing a trick here.

Things that I'm definitely not missing:

  • No string or character literals- I haven't implemented them yet.
  • No const, volatile, or cast keywords

That's pretty much it, I think.

class LexedFile {
public:
 enum Token {
 // RESERVED WORDS (and identifier)
 // Taken care of by next() lambda.
 // They're inserted in order (except identifier) so shouldn't be too hard to verify
 Namespace,
 Identifier,
 For,
 While,
 Do,
 Switch,
 Case,
 Default,
 Try,
 Catch,
 Auto,
 Type,
 Break,
 Continue,
 Return,
 Static,
 Sizeof,
 Decltype,
 If,
 Else,
 // LOGICAL OPERATORS
 LogicalAnd,
 LogicalOr,
 LogicalNot,
 GreaterThan,
 GreaterThanOrEqual,
 LessThan,
 LessThanOrEqual,
 EqualComparison,
 NotEqualComparison,
 // BINARY OPERATORS
 AND, // Re-used for address-of, the lexer doesn't care
 ANDAssign,
 XOR,
 XORAssign,
 OR,
 ORAssign,
 NOT,
 NOTAssign,
 LeftShift,
 LeftShiftAssign,
 RightShift,
 RightShiftAssign,
 // ARITHMETIC OPERATORS
 Mul,
 MulAssign,
 Div,
 DivAssign,
 Mod,
 ModAssign,
 Plus,
 PlusPlus,
 PlusAssign,
 Sub,
 MinusMinus, // I dunno, SubSub just does't seem right.
 SubAssign,
 // LITERALS
 String,
 Float,
 Integral,
 Character,
 // SYNTACTIC ELEMENTS
 OpenParen,
 CloseParen,
 OpenSquare,
 CloseSquare,
 OpenCurlyParen,
 CloseCurlyParen,
 MemberAccess,
 Comma,
 Assign,
 Semicolon,
 Ellipses,
 PointerMemberAccess,
 Colon
 };
 std::vector<Token> tokens;
};
#pragma warning(disable : 4482)
class InputFile {
 class StreamInput {
 public:
 std::vector<wchar_t>* ptr;
 int index;
 bool fail;
 StreamInput(std::vector<wchar_t>* ref) {
 ptr = ref;
 index = 0;
 fail = false;
 }
 operator bool() {
 return !fail;
 }
 void putback(wchar_t back) {
 index--;
 }
 StreamInput& operator>>(wchar_t& ref) {
 if (index == ptr->size()) {
 fail = true;
 return *this;
 }
 ref = (*ptr)[index];
 index++;
 return *this;
 }
 };
 std::wstring filename;
public:
 InputFile(const std::wstring& argfilename) 
 : filename(argfilename) {}
 LexedFile Lex() { 
 LexedFile l;
 auto FileHandle = CreateFile(
 filename.c_str(),
 GENERIC_READ,
 FILE_SHARE_READ,
 nullptr,
 OPEN_EXISTING,
 FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
 0
 );
 std::vector<wchar_t> buffer;
 LARGE_INTEGER size;
 GetFileSizeEx(FileHandle, &size);
 buffer.resize(size.QuadPart);
 DWORD read;
 ReadFile(
 FileHandle,
 &buffer[0],
 size.QuadPart,
 &read,
 nullptr
 );
 CloseHandle(FileHandle);
 StreamInput input(&buffer);
 //std::wifstream input(filename, std::ios::in);
 std::vector<wchar_t> stack;
 wchar_t current = 0;
 static const std::unordered_map<std::wstring, LexedFile::Token> reserved_words( 
 []() -> std::unordered_map<std::wstring, LexedFile::Token> {
 std::unordered_map<std::wstring, LexedFile::Token> retval;
 retval[L"namespace"] = LexedFile::Token::Namespace;
 retval[L"for"] = LexedFile::Token::For;
 retval[L"while"] = LexedFile::Token::While;
 retval[L"do"] = LexedFile::Token::Do;
 retval[L"switch"] = LexedFile::Token::Switch;
 retval[L"case"] = LexedFile::Token::Case;
 retval[L"default"] = LexedFile::Token::Default;
 retval[L"try"] = LexedFile::Token::Try;
 retval[L"catch"] = LexedFile::Token::Catch;
 retval[L"auto"] = LexedFile::Token::Auto;
 retval[L"type"] = LexedFile::Token::Type;
 retval[L"break"] = LexedFile::Token::Break;
 retval[L"continue"] = LexedFile::Token::Continue;
 retval[L"return"] = LexedFile::Token::Return;
 retval[L"static"] = LexedFile::Token::Static;
 retval[L"sizeof"] = LexedFile::Token::Sizeof;
 retval[L"decltype"] = LexedFile::Token::Decltype;
 retval[L"if"] = LexedFile::Token::If;
 retval[L"else"] = LexedFile::Token::Else;
 return retval;
 }()
 );
 // Dumps the stack, looking for reserved words or identifiers.
 auto next_no_token = [&] {
 if (stack.empty()) {
 return;
 }
 std::wstring token(stack.begin(), stack.end());
 stack.clear();
 if (reserved_words.find(token) != reserved_words.end()) {
 l.tokens.push_back(reserved_words.find(token)->second);
 return;
 }
 l.tokens.push_back(LexedFile::Identifier);
 };
 auto next = [&](Wide::LexedFile::Token t) {
 if (stack.empty()) {
 l.tokens.push_back(t);
 return;
 }
 std::wstring token(stack.begin(), stack.end());
 stack.clear();
 if (reserved_words.find(token) != reserved_words.end()) {
 l.tokens.push_back(reserved_words.find(token)->second);
 return;
 }
 l.tokens.push_back(LexedFile::Identifier);
 l.tokens.push_back(t);
 };
 input >> current;
 if (current == 0xFEFF || current == 0xFFFE) {
 // BOM
 } else {
 input.putback(current);
 }
 while(input >> current) {
 // First we eliminate whitespace, when possible- including comments.
 // Then we define predicates to recognize tokens taking various forms.
 // Then they're called.
 // Then if none of them find anything, we put the character down as "identifier" in the stack and move on.
 // Whitespace
 if (current == L' ' || current == L'\n' || current == L'\t' || current == L'\r' ) {
 next_no_token();
 continue;
 }
 // Comment - also div and div-equals
 if (current == L'/') {
 // Check for comments first- we'll look ahead
 input >> current;
 if (current == L'/') {
 // single-line comment
 // explicitly empty while loop
 while(input >> current && current != L'\n');
 next_no_token();
 continue;
 }
 if (current == L'*') {
 // multi-line comment
 while(true) {
 // Wait until we find a terminator.
 while(input >> current && current != L'*');
 input >> current;
 if (current == L'/') {
 break;
 }
 // If we found a * for some other reason, like commented out code, then
 // just keep going again looking for one.
 }
 next_no_token();
 continue;
 }
 // If we weren't a comment, check for /=
 if (current == L'=') {
 next(LexedFile::Token::DivAssign);
 continue;
 }
 // We don't have any more dual-character tokens to check for
 next(LexedFile::Token::Div);
 input.putback(current); // Put back the look-ahead token
 continue;
 }
 // Define predicates that will look for operators that occur in various groups.
 // Check for the operators that only exist on their own, and as assignment versions
 // "double_check"
 // ! !=
 // ~ ~=
 // % %=
 // * *=
 auto check = [&](wchar_t token, LexedFile::Token original, LexedFile::Token original_assign) -> bool {
 if (current == token) {
 // Grab a look-ahead for *=
 input >> current;
 if (current == L'=') {
 next(original_assign);
 return true;
 }
 input.putback(current);
 next(original);
 return true;
 }
 return false;
 };
 // For tokens that take the form (for example)
 // + ++ +=
 // - -- -=
 // | || |=
 // & && &=
 auto triple_check = [&](
 wchar_t token,
 LexedFile::Token original,
 LexedFile::Token original_original,
 LexedFile::Token original_equals) -> bool
 {
 if (current == token) {
 input >> current;
 if (current == token) {
 next(original_original);
 return true;
 }
 if (current == L'=') {
 next(original_equals);
 return true;
 }
 input.putback(current);
 next(original);
 return true;
 }
 return false;
 };
 // Written to handle <, <<, <=, <<= and >, >>, >=, >>=
 auto quadruple_check = [&](
 wchar_t token, 
 LexedFile::Token original,
 LexedFile::Token original_original,
 LexedFile::Token original_equals,
 LexedFile::Token original_original_equals) -> bool
 {
 if (current == token) {
 input >> current;
 if (current == L'=') {
 next(original_equals);
 return true;
 }
 if (current == token) {
 // We can put this back, and then just "check" for bitwise, which we know will succeed.
 check(token, original_original, original_original_equals);
 return true;
 }
 input.putback(current);
 next(original);
 return true;
 }
 return false;
 }; 
 // Grab *, *=, %, %=, =, ==, !, !=, ~, ~=
 if (check(L'*', LexedFile::Token::Mul, LexedFile::Token::MulAssign)) continue;
 if (check(L'%', LexedFile::Token::Mod, LexedFile::Token::ModAssign)) continue;
 if (check(L'=', LexedFile::Token::Assign, LexedFile::Token::EqualComparison)) continue;
 if (check(L'!', LexedFile::Token::LogicalNot, LexedFile::Token::NotEqualComparison)) continue;
 if (check(L'~', LexedFile::Token::NOT, LexedFile::Token::NOTAssign)) continue;
 if (check(L'^', LexedFile::Token::XOR, LexedFile::Token::XORAssign)) continue;
 // Triple group: +, ++, +=, -, --, -=, |, ||, |=, &, &&, &=
 if (triple_check(
 L'+', 
 LexedFile::Token::Plus, 
 LexedFile::Token::PlusPlus, 
 LexedFile::Token::PlusAssign
 )) continue;
 // Pointer member access operator
 // Handle the special case, then just move on
 if (current == L'-') {
 input >> current;
 if (current == L'>') {
 next(LexedFile::Token::PointerMemberAccess);
 continue;
 }
 input.putback(current);
 current = L'-';
 }
 if (triple_check(
 L'-', 
 LexedFile::Token::Sub, 
 LexedFile::Token::MinusMinus, 
 LexedFile::Token::SubAssign
 )) continue;
 if (triple_check(
 L'|', 
 LexedFile::Token::OR, 
 LexedFile::Token::LogicalOr, 
 LexedFile::Token::ORAssign
 )) continue;
 if (triple_check(
 L'&', 
 LexedFile::Token::AND, 
 LexedFile::Token::LogicalAnd, 
 LexedFile::Token::ANDAssign
 )) continue;
 // Pretty much only <<=, <<, <=, < and >>=, >>, >=, > fit into the quadruple group.
 if (quadruple_check(
 L'<', 
 LexedFile::Token::LessThan, 
 LexedFile::Token::LeftShift, 
 LexedFile::Token::LessThanOrEqual, 
 LexedFile::Token::LeftShiftAssign
 )) continue;
 if (quadruple_check(
 L'>', 
 LexedFile::Token::GreaterThan, 
 LexedFile::Token::RightShift, 
 LexedFile::Token::GreaterThanOrEqual,
 LexedFile::Token::RightShiftAssign
 )) continue;
 // That's everything. Now on to the other syntactic elements
 // This lambda is of dubious value, mostly here just to be consistent
 // Elements of just one character
 auto syntactic = [&](wchar_t token, LexedFile::Token original) -> bool {
 if (current == token) {
 next(original);
 return true;
 }
 return false;
 };
 if (syntactic(L'(', LexedFile::Token::OpenParen)) continue;
 if (syntactic(L')', LexedFile::Token::CloseParen)) continue;
 if (syntactic(L'[', LexedFile::Token::OpenSquare)) continue;
 if (syntactic(L']', LexedFile::Token::CloseSquare)) continue;
 if (syntactic(L':', LexedFile::Token::Colon)) continue;
 if (syntactic(L',', LexedFile::Token::Comma)) continue;
 if (syntactic(L';', LexedFile::Token::Semicolon)) continue;
 if (syntactic(L'}', LexedFile::Token::CloseCurlyParen)) continue;
 if (syntactic(L'{', LexedFile::Token::OpenCurlyParen)) continue;
 // Just dot, ellipses left, and then literal integers, floats, etc
 if (current == L'.') {
 input >> current;
 if (current == L'.') {
 // The only possible thing that can be here is another dot anyways
 input >> current;
 if (current == L'.') {
 next(LexedFile::Token::Ellipses);
 continue;
 }
 // Double dot illegal!
 throw std::runtime_error("Lexer failure: encountered '..'");
 }
 input.putback(current);
 next(LexedFile::Token::MemberAccess);
 continue;
 }
 // Literals
 // If the stack is empty and we are a numeric character, then we begin a numeric literal.
 if (stack.empty() && current >= L'0' && current <= L'9') {
 // Consume until no more decimals
 while(input >> current && current >= L'0' && current <= L'9');
 // If the next character is a dot, we're a floating-point literal
 if (current == L'.') {
 // Consume all the characters that make up the float
 while(input >> current && current >= L'0' && current <= L'9');
 next(LexedFile::Token::Float);
 // Make "current" available for the next read as it's not part of the literal
 input.putback(current);
 continue;
 }
 next(LexedFile::Token::Integral);
 input.putback(current);
 continue;
 } 
 // Skip string and character literals, and FP exponents for now
 // Not recognized, push and go again, and we don't want next() in this case.
 stack.push_back(current);
 }
 return l;
 }
};

Edit: I ran it through some test cases and fixed some bugs.

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Jul 20, 2011 at 23:54
\$\endgroup\$

2 Answers 2

4
\$\begingroup\$

well....duplicated code is the first order

 current >= L'0' && current <= L'9'

might want to extract it out to isDigit or something. same with isWhitespace.....etc even if used one spot, it will help it be a lot more readable.

In fact you might want to take various parts and extract into functions.

I'm not a fan of while(true) loops. I think they could maybe be written better.

I think maybe having more primitive functions to help you navigate the text would be useful skipTill, etc.

answered Jul 21, 2011 at 3:46
\$\endgroup\$
4
\$\begingroup\$

Some things you can improve:

  • The initialization of reserved_words is far too complex. You can use an initializer it to simplify it:

    static const std::unordered_map<std::wstring, LexedFile::Token> reserved_words = { 
     { L"namespace", LexedFile::Token::Namespace },
     { L"for", LexedFile::Token::For },
     { L"while", LexedFile::Token::While },
     { L"do", LexedFile::Token::Do },
     { L"switch", LexedFile::Token::Switch },
     { L"case", LexedFile::Token::Case },
     { L"default", LexedFile::Token::Default },
     { L"try", LexedFile::Token::Try },
     { L"catch", LexedFile::Token::Catch },
     { L"auto", LexedFile::Token::Auto },
     { L"type", LexedFile::Token::Type },
     { L"break", LexedFile::Token::Break },
     { L"continue", LexedFile::Token::Continue },
     { L"return", LexedFile::Token::Return },
     { L"static", LexedFile::Token::Static },
     { L"sizeof", LexedFile::Token::Sizeof },
     { L"decltype", LexedFile::Token::Decltype },
     { L"if", LexedFile::Token::If },
     { L"else", LexedFile::Token::Else }
    };
    
  • In InputFile::StreamInput, you probably want to have an explicit operator bool() instead of a bare one. It was prevent errors like StreamInput stri = false; which feels quite wrong. Also, it lacks a const qualification, it should be:

    explicit operator bool() const {
     return !fail;
    }
    
  • The constructor of StreamInput should initialize the members via a constructor initialization list:

    StreamInput(std::vector<wchar_t>* ref):
     ptr{ref},
     index{0},
     fail{false}
    {}
    
  • The parameter in InputFile::StreamInput::putback is unused. You can simply delete its name to avoid warnings if you really don't need it.

I did not read the entire lexer (well, trying to find errors in a full lexer is quite hard), so I won't comment about its internal logic. I will leave this part to somebody else :)

answered Apr 18, 2014 at 8:44
\$\endgroup\$
2
  • \$\begingroup\$ VS2010 doesn't have initializer lists or explicit operators. Also, you shoulda checked the date on the posting- I don't have anything like this anymore, because this is from three years ago. \$\endgroup\$ Commented Apr 18, 2014 at 9:31
  • \$\begingroup\$ @DeadMG Yeah, I realized the data after having posted this answer (for some reason, the quesiton was back on the main page). I realized that it was actually one of your questions when I came accross Wide::. But whatever. \$\endgroup\$ Commented Apr 18, 2014 at 9:55

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.