static std::set<std::string> keywords = {
 "auto", "break", "case", "char", "const", "continue", "default", ⋯

Two things. First, the set is rather slow for lookup: a sorted vector would be faster! Boost.Container has a flat_set and friends for exactly this.

Second, you are copying the statically allocated string literals into string objects. Do you really need string here? I would (did, actually) just use a plain (pre-sorted) array of const char*, and make the whole thing constexpr.

If std::string were usable at compile time you could at least save the redundant run-time copying, but pondering why it isn’t constexpr will show you why I’m mentioning this: it needs to allocate memory, and do run-time work to set up the set.

Keeping the entire table in one contiguous lump of memory will not only save memory for all those pointers and nodes, but will be much faster.
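
For instance, a hedged sketch of what I mean (the isKeyword helper and its name are mine, not the original code): one contiguous, pre-sorted, constexpr array of const char*, searched with no allocation and no run-time setup.

#include <algorithm>
#include <cstring>
#include <iterator>

constexpr const char* keywords[] = {
    "auto", "break", "case", "char", "const", "continue", "default",
    // ... the rest, kept in sorted order ...
    "while",
};

bool isKeyword(const char* word)
{
    // binary search over the sorted table, comparing C strings
    return std::binary_search(std::begin(keywords), std::end(keywords), word,
        [](const char* a, const char* b) { return std::strcmp(a, b) < 0; });
}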


Lexer::Lexer(std::istream& stream):
 mStream(stream)
{
}

Use uniform initialization. So now you write curly braces instead of parens:

: mStream{stream}

and this short one-liner could go inline in the header.
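
Something like this, assuming the class shape from the post (the explicit is my addition, as good practice):

#include <istream>

class Lexer
{
public:
    explicit Lexer(std::istream& stream) : mStream{stream} {}
private:
    std::istream& mStream;
    // ...
};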


char Lexer::nextChar()
{
 char c;
 mStream.read(&c, 1);
 return c;
}

What if no character was read? The stream could be bad, you know. Maybe you hit the end of the file, or the network glitched.
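
One way to handle it, as a sketch (shown as a free function for brevity; std::optional is my choice, and a sentinel or an exception would also work):

#include <istream>
#include <optional>

std::optional<char> nextChar(std::istream& stream)
{
    char c;
    if (!stream.read(&c, 1))   // fails at end of file or on a stream error
        return std::nullopt;
    return c;
}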


char c;
do
 c = nextChar();
while (isWhitespace(c));
return c;

You can write this in a way that avoids defining c outside the loop. More importantly, deal with errors from nextChar.

It would be cleaner if it only skipped whitespace and did not also read (and return) the next char beyond that! (Hint: see the next function in your file)
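
A sketch of the skip-only version (the free-function form and the use of std::isspace are my choices): it consumes whitespace and nothing else, leaving the stream positioned at the first interesting character. peek() returns EOF at end of input, and std::isspace(EOF) is false, so the loop also ends cleanly at end of file.

#include <cctype>
#include <istream>

void skipWhitespace(std::istream& stream)
{
    while (std::isspace(stream.peek()))
        stream.get();   // consume the whitespace character
}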

// We've already read the first character, so set that in advance
lexedWord.resize(word.size());
lexedWord[0] = word[0];

Exactly! Don’t read one ahead. Skipws should do its one job only.

It is especially confusing since peekWord does not do the skipws call. It seems that it is called when you already know that the first character of the word does match the input? This is going to lead to maintenance problems, believe me.

(later: I see a lot of the functions take the read-ahead character as another parameter and have special code to deal with that first. Get rid of all that. The function should be called with the input set to the first position of the thing it wants to read. You already know to peek and to rewind, so there is no reason to have this one-off getting in the way.)

Did you know that a stream can give you an input iterator? So rather than allocating a string and reading into it and then comparing (oh, and you didn't check to see if it read as many bytes as you asked for), you can use something like std::equal directly between word and the input stream.
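
A sketch of that (the function name is mine): the four-argument std::equal also handles the stream running out early. One caveat, my observation: the matched characters are consumed even on a partial match, so the rewind logic you already have is still needed on failure.

#include <algorithm>
#include <istream>
#include <iterator>
#include <string>

bool matchesWord(std::istream& stream, const std::string& word)
{
    // compare word directly against the stream, no intermediate buffer
    return std::equal(word.begin(), word.end(),
                      std::istreambuf_iterator<char>(stream),
                      std::istreambuf_iterator<char>());
}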

If/when you do need to access the contents of a std::string as a buffer for writing into, use .data().


std::string Lexer::lexUniversalCharacter(char character, bool& isValid)

Two return values, no problem. Don’t use an "out" parameter for the second one! (⧺F.21)

However, I think you don’t have two values here, but an optional result.
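
That is, a sketch of the signature change (free-function form for brevity): the bool& disappears into the return type.

#include <optional>
#include <string>

std::optional<std::string> lexUniversalCharacter(char character);

// At the call site:
//     if (auto uc = lexUniversalCharacter(c)) { /* use *uc */ }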


return LexResult{Token::IDENTIFIER, identifier};

You don’t need to name the type, as it is automatically picked up from the function’s return type.

return {Token::IDENTIFIER, identifier};

this can be much nicer.


if (peekChar() == 'L' || peekChar() == 'l') {
 suffix += readLongSuffix();
 if (peekChar() == 'u' || peekChar() == 'U')
 suffix += nextChar();
}
else if (peekChar() == 'u' || peekChar() == 'U') {
 suffix += nextChar();
 if (peekChar() == 'l' || peekChar() == 'L')

Avoid calling peekChar twice on the same character. You can use a new feature of if statements here (the C++17 init-statement):

if (auto ch = peekChar(); ch == 'L' || ch == 'l') {

or (since you are checking a lot of cap/lower pairs) define a helper function:

if (mAtCh(peekChar(), 'L')) {

but later you need something more general:

if (match_any_of(peekChar(), "+-")) {
if (match_any_of(peekChar(), "fFlL")) {

and the same check appears again inside lexEscapeSequence. To write match_any_of, use std::any_of in a one-line wrapper.
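
For example (the std::string_view parameter is my choice):

#include <algorithm>
#include <string_view>

bool match_any_of(char c, std::string_view candidates)
{
    return std::any_of(candidates.begin(), candidates.end(),
                       [c](char x) { return x == c; });
}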


 if (peekWord("<<="))
 return LexResult{Token::PUNCTUATOR, "<<="};
 if (peekWord("<:"))
 return LexResult{Token::PUNCTUATOR, "<:"};
 if (peekWord("<%"))
 return LexResult{Token::PUNCTUATOR, "<%"};
 if (peekWord("<<"))
 return LexResult{Token::PUNCTUATOR, "<<"};
 if (peekWord("<="))
 return LexResult{Token::PUNCTUATOR, "<="};

There are a lot of these. Notice how the thing you peek always matches the parameter to the return value? What is the if even doing?

You need a function that returns the peeked word.

In fact, all of this reads one char, switches on it, then peeks on the rest of the word. Why does it need to do it in two parts like that? You are just classifying the "word" as a PUNCTUATOR.

Make a list (static array, as discussed with keywords) and treat it in the same way. Just as there are keywords spelled with letters, these are a list of legal words spelled with non-letter/non-digit characters.

The trick is you have to take the longest sequence that is a valid token. But use a table, not dozens of duplicated snippets of code.

peek a character and tentatively append it to the token
is the result on the list?
 no: drop that character and return with what you had.
 yes: advance the input and repeat.
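
In code, a hedged sketch of that loop (isPunctuator is assumed to do the same sorted-table lookup discussed for keywords; a real C lexer needs one extra wrinkle for "...", whose prefix ".." is not itself a token):

#include <istream>
#include <string>

bool isPunctuator(const std::string& candidate);   // sorted-table lookup, assumed

std::string lexPunctuator(std::istream& stream)
{
    std::string token;
    while (isPunctuator(token + static_cast<char>(stream.peek())))
        token += static_cast<char>(stream.get());   // accept and advance
    return token;
}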

Using std::istream for reading the text: most people just use a string.

enum class Token
{
 PUNCTUATOR,
 KEYWORD,
 IDENTIFIER,
 INT_CONST,
 FLOAT_CONST,
 CHAR_CONST,
 ERROR
};

See ⧺ES.9: Avoid ALL_CAPS names.

I see you are using the new enum class language feature, so your code might not be as out-of-date as you feared.

You might consider reserving a value of 0 for ERROR or empty or something like that.


std::string lexeme = "";

Don’t, in general, assign "" to clear out a std::string. In this case, there is no need to say anything at all, because string has a default constructor. So just leave the default initializer off of the member.

Token token = Token::ERROR;

Since that is what you want for the default, I would make ERROR the first thing in the enumeration list so it has the value zero. It doesn’t matter ideally, but it’s nice and might be helpful later.
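
Putting those together, a sketch (the non-ALL_CAPS spellings and the struct shape are my guesses at your code, not a prescription):

#include <string>

enum class Token
{
    Error,        // first, so it gets the value zero
    Punctuator,
    Keyword,
    Identifier,
    IntConst,
    FloatConst,
    CharConst,
};

struct LexResult
{
    Token token = Token::Error;
    std::string lexeme;   // default-constructed; no = "" needed
};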


return std::isalpha(character) || character == '_';

See cppreference for the kosher way to call these ancient functions.

⋯ isalpha(static_cast<unsigned char>(character)) ⋯

Since you already wrapped these calls in your own helpers, it will only appear within those helpers, once.
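
Like this (the helper name is mine):

#include <cctype>

bool isIdentifierStart(char character)
{
    // cast once, here, so callers never have to think about it
    return std::isalpha(static_cast<unsigned char>(character)) || character == '_';
}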

bool isWhitespace(char character)

not using std::isspace? You are not covering as many characters as it does.


Good!

I see you used an anonymous namespace for the helpers in the CPP file.


The C99 standard just breaks things down into punctuators, keywords, identifiers, and constants. But the compilers I've looked at tend to be more granular. Is there a better approach to choosing tokens? Why?

You can get a hint from my comments above. Consider: you can chop up the source into individual "words" only, but when you feed this to the next stage, the only thing it saves you is not having to skip whitespace and deal with comments getting in the middle of grammatical productions.

If the tokens have a rough type (number, identifier, punctuation), it saves the parser some work. The parser grammar uses these, so starting with that dovetails nicely.

If the tokens have a type and a value, it can save the next stage from having to figure that out. But someone has to figure it out, and if you do it in the lexer you’re guessing at what the parser will find useful. If you need to figure out something to do the lexical analysis, then preserve that information and pass it along.
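
As an illustration only (the fields and the variant are mine, not the post's code): a token that carries its rough type plus whatever the lexer already had to work out, such as the numeric value of an integer constant, so the parser need not redo it.

#include <string>
#include <variant>

struct RichToken
{
    Token kind;            // the rough type, using the enum from above
    std::string lexeme;    // the raw spelling
    std::variant<std::monostate, long long, double> value;   // parsed value, if any
};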

If you write the lexer and parser grammars together, the lexer is just the bottom end of the whole grammar and it is clear what you need, because you are choosing which terminals to handle (or partially handle) in the lexing step. Why two phases? Well, the designer of Perl 6 patterns says "why indeed?". The tutorial and overview of Boost.Spirit goes into this, and lets you use a separate lexer or not.

One good reason is to get rid of comments. Eating whitespace after each nonterminal in a monolithic parser isn’t that bad, but consider comments, backslash continuation lines, and other stuff that lives in the text: that is hard to deal with when mixed in with the parser grammar.

Other reasons historically are memory usage and machine capacity; that matters less today. And there is the adage that "If you put a team of three people onto a compiler-writing project, you’ll get a three-pass compiler."

