I'm working on implementing a syntax highlighter for a simple text editor I've been working on. To do this, I need a simple lexer for various languages (I don't need a full one - I'm only interested in highlighting things like comments, strings, numbers and keywords). The way my editor component works is it highlights the entire document, and then re-highlights lines of the file as they are changed. So I have some requirements for this lexer:
- It needs to be "fast," in that it should be more or less real-time while typing.
- I need to be able to re-lex a given line of text, given the state from the previous blocks (this should be trivial for a lexer implemented as a state machine).
- It should be written in C or C++ (I am using C++ for my project), and it should be fairly easily maintainable (I realize lexers are usually somewhat "messy", but I still want it as "clean" as possible).
I am aware of and have used in the past various tools for creating lexers like (f)lex, Antlr, Quex, boost::spirit/qi, etc, but as far as I have been able to find none of them satisfy requirement (2) (if you have an example of how to achieve (2) using one of these tools, I would love to see it!).
As such, I have implemented a very simple lexer for C-style languages by hand. Note that it is very obviously incomplete as far as language elements, and some more work needs to be done to turn lexer states into actual tokens. Here is my source code so far:
CLexer.h
#ifndef INCLUDE_C_LEXER_H
#define INCLUDE_C_LEXER_H
#include <map>
#include <vector>
#include <string>
class CLexer;
typedef int (CLexer::*C_LEXER_TRANSITION)(char);
struct CLexerToken;
typedef struct CLexerToken CLexerToken;
class CLexer
{
public:
static const int INITIAL_STATE;
CLexer();
virtual ~CLexer();
std::vector<CLexerToken> lexBlock(int &s, const std::string &t, int p);
private:
static const std::map<int, C_LEXER_TRANSITION> transitions;
int rootTransition(char c);
int stringTransition(char c);
int stringEndTransition(char c);
int escapeTransition(char c);
};
struct CLexerToken
{
int start;
int count;
int state;
};
#endif
CLexer.cpp
#include "CLexer.h"
#include <cassert>
#define ST_ROOT 1
#define ST_STRING 2
#define ST_STRING_END 3
#define ST_ESCAPE 4
// This is a list of functions which are used to transition between states.
const std::map<int, C_LEXER_TRANSITION> CLexer::transitions = {
{ST_ROOT, &CLexer::rootTransition},
{ST_STRING, &CLexer::stringTransition},
{ST_STRING_END, &CLexer::stringEndTransition},
{ST_ESCAPE, &CLexer::escapeTransition}
};
const int CLexer::INITIAL_STATE = ST_ROOT;
CLexer::CLexer()
{
}
CLexer::~CLexer()
{
}
/*
* \param s This will receive the state at the end of this block.
* \param t The text this block contains.
* \param p The state at the end of the previous block.
*/
std::vector<CLexerToken> CLexer::lexBlock(int &s, const std::string &t, int p)
{
std::vector<CLexerToken> tokens;
int start = 0;
int state = p;
for(size_t idx = 0; idx < t.length(); ++idx)
{
std::map<int, C_LEXER_TRANSITION>::const_iterator transit =
CLexer::transitions.find(state);
assert(transit != CLexer::transitions.cend());
int newstate = (this->*(transit->second))(t.at(idx));
if(newstate != state)
{
CLexerToken tok;
tok.start = start;
tok.count = idx - start;
tok.state = state;
if(tok.count > 0)
{
tokens.push_back(tok);
}
start = idx;
}
state = newstate;
}
CLexerToken tok;
tok.start = start;
tok.count = t.length() - start;
tok.state = state;
if(tok.count > 0)
{
tokens.push_back(tok);
}
s = state;
return tokens;
}
int CLexer::rootTransition(char c)
{
switch(c)
{
case '"': return ST_STRING;
default: return ST_ROOT;
}
}
int CLexer::stringTransition(char c)
{
switch(c)
{
case '\\': return ST_ESCAPE;
case '"': return ST_STRING_END;
default: return ST_STRING;
}
}
int CLexer::stringEndTransition(char c __attribute__((unused)))
{
return ST_ROOT;
}
int CLexer::escapeTransition(char c __attribute__((unused)))
{
return ST_STRING;
}
main.cpp
#include <iostream>
#include <vector>
#include <string>
#include "CLexer.h"
int main(void)
{
// A REALLY simple test program to lex.
std::vector<std::string> lines {
"#include <iostream>",
"",
"int main(void)",
"{",
"\tstd::cout << \"Hello, world!\\n\";",
"",
"\treturn 0;",
"}",
""
};
CLexer lexer;
int state = CLexer::INITIAL_STATE;
for(size_t idx = 0; idx < lines.size(); ++idx)
{
std::vector<CLexerToken> tokens = lexer.lexBlock(state, lines.at(idx), state);
for(std::vector<CLexerToken>::iterator it = tokens.begin(); it != tokens.end(); ++it)
{
std::cout << "Lexeme on line " << (idx + 1) << " from " << (*it).start <<
" for " << (*it).count << ", type " << (*it).state << "\n";
}
}
return 0;
}
I'm hoping for feedback on my design - can I simplify this code to make it more maintainable? Is there a way I can use an existing tool to achieve the same thing?
Note: I have been working on this code using gcc 4.7.3, with the flags -Wall -Wextra -ansi -pedantic -Wshadow -Wpointer-arith -Wcast-qual -pipe -fomit-frame-pointer -Wall -W -O2 -std=c++0x
2 Answers 2
There are still some things to add to complete Jamal's answer:
You are using the GCC extension
__attribute__((unused))
. That's useful for C code, but since you are using C++, you can simply drop the parameter name, that's allowed an encouraged:int CLexer::stringEndTransition(char) { return ST_ROOT; }
You are using the syntax
(*it).foo
all over the place. You can simply replace that with the shorthand syntaxit->foo
.Concerning your
#define
s, I would go even further than Jamal and use anenum class
(we are using C++11 after all):enum class state_t { ROOT, STRING, STRING_END, ESCAPE };
Then, wherever you use
int
to represent the state, you should replace it bystate_t
. Also, you will have to changeST_ROOT
bystate_t::ROOT
and so on... However, in order to do so, you will have to refactor some amount of code. But it is worth it, strong typing is a good practice.C++11 guidelines again: replace
static const
variables bystatic constexpr
variables whenever possible (I don't think it is possible forstd::map
though). Also, you can initialize variables (includingstatic
ones) directly at their point of declaration:static constexpr state_t INITIAL_STATE = state_t::ROOT;
And C++11 once again: you should consider replacing the meaningful
typedef
s (see Jamal's answer for the useless ones) byusing
declarations. It may be a mere matter of taste, but I find them easier to read:using C_LEXER_TRANSITION = state_t (CLexer::*)(char);
You should use meaningful names for your functions parameters, especially the ones that are meant to be used by everone (aka the
public
ones). For example, in this definition:lexBlock(state_t &s, const std::string &t, state_t p);
I don't really know what
s
,t
andp
mean. And it was even worse before I replacedint
parameters bystate_t
ones.
-
\$\begingroup\$ This was super helpful. I haven't spent as much time as I should learning new C++ style. Thanks! \$\endgroup\$CmdrMoozy– CmdrMoozy2014年03月10日 00:31:51 +00:00Commented Mar 10, 2014 at 0:31
-
2\$\begingroup\$ @CmdrMoozy New C++ style will probably save your time once you know it. Also, it may reduce code length and improve readability, which is great :) \$\endgroup\$Morwenn– Morwenn2014年03月10日 00:40:34 +00:00Commented Mar 10, 2014 at 0:40
-
\$\begingroup\$ @CmdrMoozy: I do give Morwenn credit for confirming that C++11 was to be used, while I barely noticed the initializer list at first. Moreover, you're welcome to post a follow-up question if you get more of the new style implemented, while also following all this advice. \$\endgroup\$Jamal– Jamal2014年03月10日 00:50:59 +00:00Commented Mar 10, 2014 at 0:50
-
\$\begingroup\$ @Jamal The flag
-std=c++0x
in the command line was also a good hint. \$\endgroup\$Morwenn– Morwenn2014年03月10日 01:47:03 +00:00Commented Mar 10, 2014 at 1:47
You're using single-character variables in various places. With the exception of typical loop counters (such as
i
), variables should be descriptive and not need comments to help explain their meaning. This will also help you remember what they're for at a later point in the program.There are some remnants of C code that are not required in C++:
- You have a
typedef struct
, but C++ does the same thing with juststruct
. - Functions with no parameters don't need
void
. return 0
is already applied by the compiler at the end ofmain()
.
- You have a
You can now use C++11's
default
constructor and destructor instead of providing your own.Instead of having these in the .cpp file:
CLexer::CLexer() { } CLexer::~CLexer() { }
you'll just have these in the .h file:
CLexer() = default; ~CLexer() = default;
In pre-C++11, you would just leave them out and let the compiler make them for you.
#define
s aren't too common in C++ as opposed to C:#define ST_ROOT 1 #define ST_STRING 2 #define ST_STRING_END 3 #define ST_ESCAPE 4
This should instead be an
enum
:enum State { ROOT=1, STRING, STRING_END, ESCAPE };
The
1
is needed for the initial value (it normally starts at0
), while all the following values will be one higher than the previous. This is generally howenum
s work.For this loop statement:
for(size_t idx = 0; idx < t.length(); ++idx)
Since
t
is of typestd::string
, usestd::string::size_type
as the loop counter type. This is the specific return value oft.size()
. In general,size()
returns anstd::size_type
.This function:
int CLexer::rootTransition(char c) { switch(c) { case '"': return ST_STRING; default: return ST_ROOT; } }
doesn't need a
switch
statement as there are only two possible outcomes. It can be done much simpler with a single-line ternary statement.Using the aforementioned
enum
, this function should now return aState
:State CLexer::rootTransition(char c) { return (c == '"') ? STRING : ROOT; }
This will return either the value on the left (if true) or on the right (if false).
-
\$\begingroup\$ This answer has a lot of great feedback. Thanks for taking the time to review this! \$\endgroup\$CmdrMoozy– CmdrMoozy2014年03月10日 00:35:36 +00:00Commented Mar 10, 2014 at 0:35
-
\$\begingroup\$ @CmdrMoozy: You're welcome! \$\endgroup\$Jamal– Jamal2014年03月10日 00:42:06 +00:00Commented Mar 10, 2014 at 0:42
flex
\$\endgroup\$/* ... */
) or preprocessor defines (#define ... \
). If the user alters a line, and the previous line opened a comment block, my lexer needs to know this so it knows whether to mark the line as being a comment, or containing keywords (for example). \$\endgroup\$