Hi,
At the moment, whitespace tokens are stored in the parser and are then filtered out at several distinct points in the parser logic, such as:
datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4032 to 4049 in 67684c8

```rust
pub fn peek_tokens_with_location<const N: usize>(&self) -> [TokenWithSpan; N] {
    let mut index = self.index;
    core::array::from_fn(|_| loop {
        let token = self.tokens.get(index);
        index += 1;
        if let Some(TokenWithSpan {
            token: Token::Whitespace(_),
            span: _,
        }) = token
        {
            continue;
        }
        break token.cloned().unwrap_or(TokenWithSpan {
            token: Token::EOF,
            span: Span::empty(),
        });
    })
}
```

datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4055 to 4069 in 67684c8

```rust
pub fn peek_tokens_ref<const N: usize>(&self) -> [&TokenWithSpan; N] {
    let mut index = self.index;
    core::array::from_fn(|_| loop {
        let token = self.tokens.get(index);
        index += 1;
        if let Some(TokenWithSpan {
            token: Token::Whitespace(_),
            span: _,
        }) = token
        {
            continue;
        }
        break token.unwrap_or(&EOF_TOKEN);
    })
}
```

datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4077 to 4094 in 67684c8

```rust
pub fn peek_nth_token_ref(&self, mut n: usize) -> &TokenWithSpan {
    let mut index = self.index;
    loop {
        index += 1;
        match self.tokens.get(index - 1) {
            Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) => continue,
            non_whitespace => {
                if n == 0 {
                    return non_whitespace.unwrap_or(&EOF_TOKEN);
                }
                n -= 1;
            }
        }
    }
}
```

datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4149 to 4160 in 67684c8

```rust
pub fn advance_token(&mut self) {
    loop {
        self.index += 1;
        match self.tokens.get(self.index - 1) {
            Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) => continue,
            _ => break,
        }
    }
}
```

datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4183 to 4202 in 67684c8

```rust
/// Seek back the last one non-whitespace token.
///
/// Must be called after `next_token()`, otherwise might panic. OK to call
/// after `next_token()` indicates an EOF.
///
// TODO rename to backup_token and deprecate prev_token?
pub fn prev_token(&mut self) {
    loop {
        assert!(self.index > 0);
        self.index -= 1;
        if let Some(TokenWithSpan {
            token: Token::Whitespace(_),
            span: _,
        }) = self.tokens.get(self.index)
        {
            continue;
        }
        return;
    }
}
```
and many more.
SQL, as far as I know, is not a language that cares about whitespace the way Python does, so it should be safe to drop every notion of whitespace once tokenization is complete. This should:
- Reduce memory requirements, as whitespace tokens would not be stored anymore
- Significantly simplify parser logic by removing all of the whitespace-skipping code from the parser (see the sketch after this list)
- Move the parser closer to a streaming design, though that will require many more PRs
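
To make this concrete, here is a minimal, self-contained sketch of the idea. The `Token`, `TokenWithSpan`, and `Parser` below are simplified stand-ins rather than the crate's real types (spans are omitted entirely): whitespace is filtered exactly once when the token stream is handed to the parser, after which the peek/advance/backtrack helpers quoted above collapse to plain index arithmetic.

```rust
// Hypothetical, simplified sketch of the proposed refactoring; these are
// stand-in types, not the crate's real Token/TokenWithSpan/Parser.
#[derive(Clone, Debug, PartialEq)]
enum Token {
    Whitespace(String),
    Word(String),
    Eof,
}

#[derive(Clone, Debug, PartialEq)]
struct TokenWithSpan {
    token: Token,
    // span field omitted in this sketch
}

/// Shared EOF sentinel, mirroring the existing `EOF_TOKEN` idea.
static EOF_TOKEN: TokenWithSpan = TokenWithSpan { token: Token::Eof };

struct Parser {
    tokens: Vec<TokenWithSpan>,
    index: usize,
}

impl Parser {
    /// The only place that still knows whitespace ever existed.
    fn with_tokens(tokens: Vec<TokenWithSpan>) -> Self {
        let tokens = tokens
            .into_iter()
            .filter(|t| !matches!(t.token, Token::Whitespace(_)))
            .collect();
        Self { tokens, index: 0 }
    }

    /// What `peek_nth_token_ref` could become: a bounds-checked lookup.
    fn peek_nth_token_ref(&self, n: usize) -> &TokenWithSpan {
        self.tokens.get(self.index + n).unwrap_or(&EOF_TOKEN)
    }

    /// What `advance_token` could become: a plain increment.
    fn advance_token(&mut self) {
        self.index += 1;
    }

    /// What `prev_token` could become: a plain decrement.
    fn prev_token(&mut self) {
        assert!(self.index > 0);
        self.index -= 1;
    }
}

fn main() {
    let raw = vec![
        TokenWithSpan { token: Token::Word("SELECT".into()) },
        TokenWithSpan { token: Token::Whitespace(" ".into()) },
        TokenWithSpan { token: Token::Word("1".into()) },
    ];
    let mut parser = Parser::with_tokens(raw);

    // Whitespace is already gone: the token after `SELECT` is `1`.
    assert_eq!(parser.peek_nth_token_ref(1).token, Token::Word("1".into()));

    parser.advance_token();
    assert_eq!(parser.peek_nth_token_ref(0).token, Token::Word("1".into()));

    parser.prev_token();
    assert_eq!(parser.peek_nth_token_ref(0).token, Token::Word("SELECT".into()));

    // Peeking past the end yields the EOF sentinel instead of panicking.
    assert_eq!(parser.peek_nth_token_ref(10).token, Token::Eof);
}
```

Spans would of course still have to be carried through, but the skipping loops themselves would disappear, and a single filtering point also seems like a reasonable stepping stone toward an iterator-based, streaming token source later on.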
Since such a PR would require quite a bit of effort on my part, I would appreciate some feedback before moving forward with it.
@iffyio do you happen to have any opinion regarding such a refactoring?
Ciao,
Luca