I tried implementing a lexer in Rust that peeks ahead at the next character and makes a decision based on it.
However, I am told that this is bad practice, and that I should be using a finite-state machine instead.
However, my implementation is slower than the original lexer, going from 1 MB per 20 ms to somewhere between 1 MB per 50 ms and 1 MB per 80 ms, running in debug mode.
Here is the code:
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum State {
    Start,
    Whitespace,
    Comment,
    Number,
    Ident,
    Slash,
    LeftArrow,
    RightArrow,
    End,
    NumberEnd,
    IdentEnd,
    Divide,
    Less,
    Greater,
    LeftShift,
    RightShift,
    Equal,
    LessOrEqual,
    GreaterOrEqual,
}
const ADVANCE: &[bool] = &[
    true,  // Start
    false, // Whitespace
    false, // Comment
    true,  // Number
    true,  // Ident
    true,  // Slash
    true,  // LeftArrow
    true,  // RightArrow
    false, // End
    false, // NumberEnd
    false, // IdentEnd
    false, // Divide
    false, // Less
    false, // Greater
    true,  // LeftShift
    true,  // RightShift
    true,  // Equal
    true,  // LessOrEqual
    true,  // GreaterOrEqual
];
const TRANSITION: &[State] = &[
    // Start           Whitespace    Comment          Number            Ident            Slash           LeftArrow            RightArrow
    State::Ident,      State::Start, State::Comment,  State::NumberEnd, State::Ident,    State::Divide,  State::Less,         State::Greater,        // Letter
    State::Number,     State::Start, State::Comment,  State::Number,    State::Ident,    State::Divide,  State::Less,         State::Greater,        // Number
    State::Whitespace, State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Divide,  State::Less,         State::Greater,        // Whitespace
    State::Whitespace, State::Start, State::Whitespace, State::NumberEnd, State::IdentEnd, State::Divide, State::Less,        State::Greater,        // Newline
    State::Slash,      State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Comment, State::Less,         State::Greater,        // Slash
    State::LeftArrow,  State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Divide,  State::LeftShift,    State::Greater,        // LeftArrow
    State::RightArrow, State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Divide,  State::Less,         State::RightShift,     // RightArrow
    State::Equal,      State::Start, State::Comment,  State::NumberEnd, State::IdentEnd, State::Divide,  State::LessOrEqual,  State::GreaterOrEqual, // Equal
    State::End,        State::End,   State::End,      State::NumberEnd, State::IdentEnd, State::Divide,  State::Less,         State::Greater,        // Eof
];
pub struct Cursor<'a> {
    bytes: &'a [u8],
    pub start_pos: usize,
    pub pos: usize,
    state: State,
}

macro_rules! test {
    ($name:tt, $($code:tt)*) => {{
        let now = std::time::Instant::now();
        $($code)*
        println!("{}: {:?}", stringify!($name), now.elapsed());
    }};
}
impl<'a> Cursor<'a> {
    pub fn new(string: &'a str) -> Self {
        Self {
            bytes: string.as_bytes(),
            start_pos: 0,
            pos: 0,
            state: State::Start,
        }
    }

    #[inline(always)]
    fn advance(&mut self) -> usize {
        // Classify the byte at `pos` (8 = EOF), then scale by the row width
        // of TRANSITION (State::End as usize == 8) to get the row offset.
        let amount = if self.pos >= self.bytes.len() {
            8
        } else {
            match self.bytes[self.pos] {
                b'a'..=b'z' | b'A'..=b'Z' | 127.. => 0,
                b'0'..=b'9' => 1,
                b' ' | b'\t' | b'\r' => 2,
                b'\n' => 3,
                b'/' => 4,
                b'<' => 5,
                b'>' => 6,
                b'=' => 7,
                _ => todo!(),
            }
        };
        State::End as usize * amount
    }
}
impl<'a> Iterator for Cursor<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        self.state = State::Start;
        while self.state < State::End {
            if self.state == State::Start {
                self.start_pos = self.pos;
            }
            self.state = TRANSITION[self.advance() + self.state as usize];
            if ADVANCE[self.state as usize] {
                // Step over one UTF-8 code point; leading_ones gives its byte length.
                self.pos += self.bytes[self.pos].leading_ones().max(1) as usize;
            }
        }
        if self.state == State::End {
            None
        } else {
            Some(unsafe { std::str::from_utf8_unchecked(&self.bytes[self.start_pos..self.pos]) })
        }
    }
}
fn main() {
    let code = "1 ".repeat(1_000_000);
    let mut cursor: Cursor<'_> = Cursor::new(&code);
    //test! { Next, cursor.advance(); };
    test! {
        Cursor,
        while let Some(n) = cursor.next() {
            //println!("{n:?} length {:?}", n.len());
        }
    };
}
1 Answer
As I understand it (really fast parsing isn't my area), the idea behind lookup tables is to avoid branches and thus branch misprediction. So to make it really fast, you need to eliminate as many branches as possible from the parsing code.
You have a branch to decide whether to reset start_pos, a branch to decide whether to ADVANCE, and a match statement to classify the character, which possibly compiles to several branches.
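The classification step in particular can be done with a flat 256-entry table, so the match becomes a single unconditional load. A minimal sketch (the CLASS table, the class_at helper, and the choice to fold unknown bytes into the letter class are my own, mirroring the classes in the question's advance):

```rust
// Byte -> character class, built once at compile time. Classes mirror the
// question's encoding: 0 = letter, 1 = digit, 2 = whitespace, 3 = newline,
// 4 = '/', 5 = '<', 6 = '>', 7 = '='.
const CLASS: [u8; 256] = {
    let mut t = [0u8; 256]; // default: fold unknown bytes into the letter class
    let mut b: usize = 0;
    while b < 256 {
        t[b] = match b as u8 {
            b'a'..=b'z' | b'A'..=b'Z' => 0,
            b'0'..=b'9' => 1,
            b' ' | b'\t' | b'\r' => 2,
            b'\n' => 3,
            b'/' => 4,
            b'<' => 5,
            b'>' => 6,
            b'=' => 7,
            _ => 0, // covers 127.. and everything else, like the original match
        };
        b += 1;
    }
    t
};

// EOF is class 8: `get` returns None past the end and `map_or` picks the
// sentinel. Classification itself is a single table load with no match.
fn class_at(bytes: &[u8], pos: usize) -> usize {
    bytes.get(pos).map_or(8, |&b| CLASS[b as usize] as usize)
}

fn main() {
    assert_eq!(class_at(b"a1 \n/<>=", 0), 0); // letter
    assert_eq!(class_at(b"a1 \n/<>=", 1), 1); // digit
    assert_eq!(class_at(b"a1 \n/<>=", 8), 8); // past the end -> EOF class
    println!("ok");
}
```

The bounds check inside `get` still exists, but the eight-way match is gone; whether that wins in practice is something you would have to benchmark.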
It is probably worse than your previous attempt because branch prediction worked better when different branches were being tested in different parts of the lexer. Because this design funnels everything through the same few branches, the branch predictor hasn't much of a hope.
But a more general issue is that if you want a really fast lexer built on a table like this, you probably don't want to build it by hand. You want to generate it, either with a crate like logos that builds the lexer for you, or by rolling your own generator. That way you can easily tweak the lexed language or the implementation strategy without rebuilding everything from scratch.
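For illustration, a declarative spec with logos might look roughly like this. This is a sketch, not tested here: it assumes the external logos crate (0.13+ derive API), and the token names are mine, chosen to mirror the question's states:

    use logos::Logos;

    // logos compiles these patterns into a DFA table at macro-expansion time,
    // so the transition table is generated rather than hand-written.
    #[derive(Logos, Debug, PartialEq)]
    #[logos(skip r"[ \t\r\n]+")] // whitespace handled by the generator
    #[logos(skip r"//[^\n]*")]   // line comments, as in the question's Slash -> Comment path
    enum Token {
        #[regex("[a-zA-Z][a-zA-Z0-9]*")]
        Ident,
        #[regex("[0-9]+")]
        Number,
        #[token("/")]  Slash,
        #[token("<")]  Less,
        #[token(">")]  Greater,
        #[token("<<")] LeftShift,
        #[token(">>")] RightShift,
        #[token("<=")] LessOrEqual,
        #[token(">=")] GreaterOrEqual,
        #[token("=")]  Equal,
    }

    fn main() {
        let mut lex = Token::lexer("x >> 2");
        while let Some(tok) = lex.next() {
            println!("{:?} {:?}", tok, lex.slice());
        }
    }

Changing the language then means editing a pattern, not re-deriving a 9x8 table by hand.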