
I tried implementing a lexer in Rust that peeks ahead at the next character and decides what to do based on it.
However, I was told that this is bad practice, and that I should be using a finite-state machine instead.
My FSM implementation is slower than the original lexer, though: throughput dropped from 1 MB per 20 ms to somewhere between 1 MB per 50 ms and 1 MB per 80 ms, running in debug mode.
Here is the code:

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum State {
    Start,
    Whitespace,
    Comment,
    Number,
    Ident,
    Slash,
    LeftArrow,
    RightArrow,
    End,
    NumberEnd,
    IdentEnd,
    Divide,
    Less,
    Greater,
    LeftShift,
    RightShift,
    Equal,
    LessOrEqual,
    GreaterOrEqual,
}
const ADVANCE: &[bool] = &[
    true,  // Start
    false, // Whitespace
    false, // Comment
    true,  // Number
    true,  // Ident
    true,  // Slash
    true,  // LeftArrow
    true,  // RightArrow
    false, // End
    false, // NumberEnd
    false, // IdentEnd
    false, // Divide
    false, // Less
    false, // Greater
    true,  // LeftShift
    true,  // RightShift
    true,  // Equal
    true,  // LessOrEqual
    true,  // GreaterOrEqual
];
const TRANSITION: &[State] = &[
    // Start Whitespace Comment Number Ident Slash LeftArrow RightArrow
    State::Ident, State::Start, State::Comment, State::NumberEnd, State::Ident, State::Divide, State::Less, State::Greater, // Letter
    State::Number, State::Start, State::Comment, State::Number, State::Ident, State::Divide, State::Less, State::Greater, // Number
    State::Whitespace, State::Start, State::Comment, State::NumberEnd, State::IdentEnd, State::Divide, State::Less, State::Greater, // Whitespace
    State::Whitespace, State::Start, State::Whitespace, State::NumberEnd, State::IdentEnd, State::Divide, State::Less, State::Greater, // Newline
    State::Slash, State::Start, State::Comment, State::NumberEnd, State::IdentEnd, State::Comment, State::Less, State::Greater, // Slash
    State::LeftArrow, State::Start, State::Comment, State::NumberEnd, State::IdentEnd, State::Divide, State::LeftShift, State::Greater, // LeftArrow
    State::RightArrow, State::Start, State::Comment, State::NumberEnd, State::IdentEnd, State::Divide, State::Less, State::RightShift, // RightArrow
    State::Equal, State::Start, State::Comment, State::NumberEnd, State::IdentEnd, State::Divide, State::LessOrEqual, State::GreaterOrEqual, // Equal
    State::End, State::End, State::End, State::NumberEnd, State::IdentEnd, State::Divide, State::Less, State::Greater, // Eof
];
pub struct Cursor<'a> {
    bytes: &'a [u8],
    pub start_pos: usize,
    pub pos: usize,
    state: State,
}
macro_rules! test {
    ($name:tt, $($code:tt)*) => {
        {
            let now = std::time::Instant::now();
            $($code)*
            println!("{}: {:?}", stringify!($name), now.elapsed());
        }
    }
}
impl<'a> Cursor<'a> {
    pub fn new(string: &'a str) -> Self {
        Self {
            bytes: string.as_bytes(),
            start_pos: 0,
            pos: 0,
            state: State::Start,
        }
    }
    #[inline(always)]
    fn advance(&mut self) -> usize {
        let amount = if self.pos >= self.bytes.len() {
            8
        } else {
            match self.bytes[self.pos] {
                b'a'..=b'z' | b'A'..=b'Z' | 127.. => 0,
                b'0'..=b'9' => 1,
                b' ' | b'\t' | b'\r' => 2,
                b'\n' => 3,
                b'/' => 4,
                b'<' => 5,
                b'>' => 6,
                b'=' => 7,
                _ => todo!()
            }
        };
        State::End as usize * amount
    }
}
impl<'a> Iterator for Cursor<'a> {
    type Item = &'a str;
    fn next(&mut self) -> Option<&'a str> {
        self.state = State::Start;
        while self.state < State::End {
            if self.state == State::Start { self.start_pos = self.pos; }
            self.state = TRANSITION[self.advance() + self.state as usize];
            if ADVANCE[self.state as usize] {
                self.pos += self.bytes[self.pos].leading_ones().max(1) as usize;
            }
        }
        if self.state == State::End {
            None
        } else {
            Some(unsafe { std::str::from_utf8_unchecked(&self.bytes[self.start_pos..self.pos]) })
        }
    }
}
fn main() {
    let code = "1 ".repeat(1_000_000);
    let mut cursor: Cursor<'_> = Cursor::new(&code);
    //test! { Next, cursor.advance(); };
    test! {
        Cursor,
        while let Some(n) = cursor.next() {
            //println!("{n:?} length {:?}", n.len());
        }
    };
}
asked Jan 14, 2023 at 22:18

1 Answer

As I understand it (really fast parsing isn't my area), the idea behind lookup tables is to avoid branches and thus branch misprediction. So to make this really fast, you need to eliminate as many branches as possible from the parsing code.

You have a branch to decide whether to reset start_pos, a branch to decide whether to ADVANCE, and a match statement to classify the character, which likely compiles to several more branches.
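To make the last point concrete, here is one way the classification branches can be removed (a sketch, not the poster's code): precompute a 256-entry table at compile time so classifying a byte becomes a single indexed load instead of a chain of compares. The class numbers mirror the `match` in `advance`; the `_ => 0` arm is a placeholder for the bytes the original code leaves as `todo!()`.

```rust
// Sketch: a compile-time-built character-class table replacing the
// branchy `match` classifier. Classes follow the original `advance`:
// 0 = Letter, 1 = Number, 2 = Whitespace, 3 = Newline, 4 = Slash,
// 5 = Less, 6 = Greater, 7 = Equal.
const fn build_class_table() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        let b = i as u8;
        table[i] = match b {
            b'a'..=b'z' | b'A'..=b'Z' | 127.. => 0, // Letter (incl. UTF-8 bytes)
            b'0'..=b'9' => 1,
            b' ' | b'\t' | b'\r' => 2,
            b'\n' => 3,
            b'/' => 4,
            b'<' => 5,
            b'>' => 6,
            b'=' => 7,
            _ => 0, // placeholder: the original `match` hits todo!() here
        };
        i += 1;
    }
    table
}
const CLASS: [u8; 256] = build_class_table();

fn main() {
    // `advance` would then be a branch-free lookup:
    // let amount = CLASS[self.bytes[self.pos] as usize] as usize;
    assert_eq!(CLASS[b'a' as usize], 0);
    assert_eq!(CLASS[b'7' as usize], 1);
    assert_eq!(CLASS[b'\n' as usize], 3);
    assert_eq!(CLASS[b'=' as usize], 7);
    println!("class table ok");
}
```

The EOF check (`pos >= bytes.len()`) still needs a branch or a sentinel byte, but the per-character classification itself no longer branches.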

It is probably slower than your previous attempt because branch prediction worked better when different branches were being tested in different parts of the lexer. Because this version funnels everything through the same few branches, the branch predictor doesn't have much to work with.

But a more general issue is that if you want a really fast lexer built on a table like this, you probably don't want to build the table by hand. Generate the lexer instead, either with a crate like logos that builds it for you, or by rolling your own generator. That way you can tweak the lexed language or the implementation strategy without rebuilding everything from scratch.

answered Jan 16, 2023 at 5:39
