I tried implementing a lexer in Rust that peeks ahead at the next character and makes a decision based on it.
However, I am told that this is bad practice, and that I should be using a finite-state machine instead.
However, my implementation is slower than the original lexer, going from 1 MB per 20 ms to somewhere between 1 MB per 50 ms and 1 MB per 80 ms, running in debug mode.
Here is the code:
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum State {
    Start,
    Whitespace,
    Comment,
    Number,
    Ident,
    Slash,
    LeftArrow,
    RightArrow,
    End,
    NumberEnd,
    IdentEnd,
    Divide,
    Less,
    Greater,
    LeftShift,
    RightShift,
    Equal,
    LessOrEqual,
    GreaterOrEqual,
}
const ADVANCE: &[bool] = &[
    true,  // Start
    false, // Whitespace
    false, // Comment
    true,  // Number
    true,  // Ident
    true,  // Slash
    true,  // LeftArrow
    true,  // RightArrow
    false, // End
    false, // NumberEnd
    false, // IdentEnd
    false, // Divide
    false, // Less
    false, // Greater
    true,  // LeftShift
    true,  // RightShift
    true,  // Equal
    true,  // LessOrEqual
    true,  // GreaterOrEqual
];
const TRANSITION: &[State] = &[
    // Start           Whitespace    Comment          Number            Ident            Slash           LeftArrow            RightArrow
    State::Ident,      State::Start, State::Comment,  State::NumberEnd, State::Ident,    State::Divide,  State::Less,         State::Greater,        // Letter
    State::Number,     State::Start, State::Comment,  State::Number,    State::Ident,    State::Divide,  State::Less,         State::Greater,        // Number
    State::Whitespace, State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Divide,  State::Less,         State::Greater,        // Whitespace
    State::Whitespace, State::Start, State::Whitespace, State::NumberEnd, State::IdentEnd, State::Divide, State::Less,        State::Greater,        // Newline
    State::Slash,      State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Comment, State::Less,         State::Greater,        // Slash
    State::LeftArrow,  State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Divide,  State::LeftShift,    State::Greater,        // LeftArrow
    State::RightArrow, State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Divide,  State::Less,         State::RightShift,     // RightArrow
    State::Equal,      State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Divide,  State::LessOrEqual,  State::GreaterOrEqual, // Equal
    State::End,        State::End,   State::End,      State::NumberEnd, State::IdentEnd, State::Divide,  State::Less,         State::Greater,        // Eof
];
pub struct Cursor<'a> {
    bytes: &'a [u8],
    pub start_pos: usize,
    pub pos: usize,
    state: State,
}

macro_rules! test {
    ($name:tt, $($code:tt)*) => {{
        let now = std::time::Instant::now();
        $($code)*
        println!("{}: {:?}", stringify!($name), now.elapsed());
    }};
}
impl<'a> Cursor<'a> {
    pub fn new(string: &'a str) -> Self {
        Self {
            bytes: string.as_bytes(),
            start_pos: 0,
            pos: 0,
            state: State::Start,
        }
    }

    #[inline(always)]
    fn advance(&mut self) -> usize {
        // Classify the byte at `pos` (8 = EOF), then scale by the row width
        // of TRANSITION (State::End as usize == 8) to get the row offset.
        let amount = if self.pos >= self.bytes.len() {
            8
        } else {
            match self.bytes[self.pos] {
                b'a'..=b'z' | b'A'..=b'Z' | 127.. => 0,
                b'0'..=b'9' => 1,
                b' ' | b'\t' | b'\r' => 2,
                b'\n' => 3,
                b'/' => 4,
                b'<' => 5,
                b'>' => 6,
                b'=' => 7,
                _ => todo!(),
            }
        };
        State::End as usize * amount
    }
}
impl<'a> Iterator for Cursor<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        self.state = State::Start;
        while self.state < State::End {
            if self.state == State::Start {
                self.start_pos = self.pos;
            }
            self.state = TRANSITION[self.advance() + self.state as usize];
            if ADVANCE[self.state as usize] {
                // Step over one UTF-8 code point; leading_ones gives its byte length.
                self.pos += self.bytes[self.pos].leading_ones().max(1) as usize;
            }
        }
        if self.state == State::End {
            None
        } else {
            Some(unsafe { std::str::from_utf8_unchecked(&self.bytes[self.start_pos..self.pos]) })
        }
    }
}
fn main() {
    let code = "1 ".repeat(1_000_000);
    let mut cursor: Cursor<'_> = Cursor::new(&code);
    //test! { Next, cursor.advance(); };
    test! {
        Cursor,
        while let Some(n) = cursor.next() {
            //println!("{n:?} length {:?}", n.len());
        }
    };
}
1 Answer
As I understand it (really fast parsing isn't my area), the idea behind lookup tables is to avoid branches and thus branch misprediction. So to make it really fast, you need to eliminate as many branches as possible from the parsing code.
You have a branch to decide whether to reset start_pos, a branch to decide whether to ADVANCE, and a match statement to classify the character, which possibly compiles to several branches.
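The classification step in particular can be done with a flat 256-entry table, so the match becomes a single unconditional load. A minimal sketch (the CLASS table, the class_at helper, and the choice to fold unknown bytes into the letter class are my own, mirroring the classes in the question's advance):

```rust
// Byte -> character class, built once at compile time. Classes mirror the
// question's encoding: 0 = letter, 1 = digit, 2 = whitespace, 3 = newline,
// 4 = '/', 5 = '<', 6 = '>', 7 = '='.
const CLASS: [u8; 256] = {
    let mut t = [0u8; 256]; // default: fold unknown bytes into the letter class
    let mut b: usize = 0;
    while b < 256 {
        t[b] = match b as u8 {
            b'a'..=b'z' | b'A'..=b'Z' => 0,
            b'0'..=b'9' => 1,
            b' ' | b'\t' | b'\r' => 2,
            b'\n' => 3,
            b'/' => 4,
            b'<' => 5,
            b'>' => 6,
            b'=' => 7,
            _ => 0, // covers 127.. and everything else, like the original match
        };
        b += 1;
    }
    t
};

// EOF is class 8: `get` returns None past the end and `map_or` picks the
// sentinel. Classification itself is a single table load with no match.
fn class_at(bytes: &[u8], pos: usize) -> usize {
    bytes.get(pos).map_or(8, |&b| CLASS[b as usize] as usize)
}

fn main() {
    assert_eq!(class_at(b"a1 \n/<>=", 0), 0); // letter
    assert_eq!(class_at(b"a1 \n/<>=", 1), 1); // digit
    assert_eq!(class_at(b"a1 \n/<>=", 8), 8); // past the end -> EOF class
    println!("ok");
}
```

The bounds check inside `get` still exists, but the eight-way match is gone; whether that wins in practice is something you would have to benchmark.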
It is probably worse than your previous attempt because branch prediction worked better when different branches were being tested in different parts of the lexer. Because this design funnels everything through the same few branches, the branch predictor hasn't much of a hope.
But a more general issue is that if you want a really fast lexer built on a table like this, you probably don't want to build it by hand. You want to generate it, either with a crate like logos that builds the lexer for you, or by rolling your own generator. That way you can easily tweak the lexed language or the implementation strategy without rebuilding everything from scratch.
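For illustration, a declarative spec with logos might look roughly like this. This is a sketch, not tested here: it assumes the external logos crate (0.13+ derive API), and the token names are mine, chosen to mirror the question's states:

    use logos::Logos;

    // logos compiles these patterns into a DFA table at macro-expansion time,
    // so the transition table is generated rather than hand-written.
    #[derive(Logos, Debug, PartialEq)]
    #[logos(skip r"[ \t\r\n]+")] // whitespace handled by the generator
    #[logos(skip r"//[^\n]*")]   // line comments, as in the question's Slash -> Comment path
    enum Token {
        #[regex("[a-zA-Z][a-zA-Z0-9]*")]
        Ident,
        #[regex("[0-9]+")]
        Number,
        #[token("/")]  Slash,
        #[token("<")]  Less,
        #[token(">")]  Greater,
        #[token("<<")] LeftShift,
        #[token(">>")] RightShift,
        #[token("<=")] LessOrEqual,
        #[token(">=")] GreaterOrEqual,
        #[token("=")]  Equal,
    }

    fn main() {
        let mut lex = Token::lexer("x >> 2");
        while let Some(tok) = lex.next() {
            println!("{:?} {:?}", tok, lex.slice());
        }
    }

Changing the language then means editing a pattern, not re-deriving a 9x8 table by hand.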