Hi,
At the moment, whitespace tokens are stored in the parser and are then filtered out at several distinct points in the parser logic, such as:
datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4032 to 4049 in 67684c8

```rust
pub fn peek_tokens_with_location<const N: usize>(&self) -> [TokenWithSpan; N] {
    let mut index = self.index;
    core::array::from_fn(|_| loop {
        let token = self.tokens.get(index);
        index += 1;
        if let Some(TokenWithSpan {
            token: Token::Whitespace(_),
            span: _,
        }) = token
        {
            continue;
        }
        break token.cloned().unwrap_or(TokenWithSpan {
            token: Token::EOF,
            span: Span::empty(),
        });
    })
}
```

datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4055 to 4069 in 67684c8

```rust
pub fn peek_tokens_ref<const N: usize>(&self) -> [&TokenWithSpan; N] {
    let mut index = self.index;
    core::array::from_fn(|_| loop {
        let token = self.tokens.get(index);
        index += 1;
        if let Some(TokenWithSpan {
            token: Token::Whitespace(_),
            span: _,
        }) = token
        {
            continue;
        }
        break token.unwrap_or(&EOF_TOKEN);
    })
}
```

datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4077 to 4094 in 67684c8

```rust
pub fn peek_nth_token_ref(&self, mut n: usize) -> &TokenWithSpan {
    let mut index = self.index;
    loop {
        index += 1;
        match self.tokens.get(index - 1) {
            Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) => continue,
            non_whitespace => {
                if n == 0 {
                    return non_whitespace.unwrap_or(&EOF_TOKEN);
                }
                n -= 1;
            }
        }
    }
}
```

datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4149 to 4160 in 67684c8

```rust
pub fn advance_token(&mut self) {
    loop {
        self.index += 1;
        match self.tokens.get(self.index - 1) {
            Some(TokenWithSpan {
                token: Token::Whitespace(_),
                span: _,
            }) => continue,
            _ => break,
        }
    }
}
```

datafusion-sqlparser-rs/src/parser/mod.rs
Lines 4183 to 4202 in 67684c8

```rust
/// Seek back the last one non-whitespace token.
///
/// Must be called after `next_token()`, otherwise might panic. OK to call
/// after `next_token()` indicates an EOF.
///
// TODO rename to backup_token and deprecate prev_token?
pub fn prev_token(&mut self) {
    loop {
        assert!(self.index > 0);
        self.index -= 1;
        if let Some(TokenWithSpan {
            token: Token::Whitespace(_),
            span: _,
        }) = self.tokens.get(self.index)
        {
            continue;
        }
        return;
    }
}
```
and many more.
SQL, as far as I know, is not a language that cares about whitespace the way Python does, so it should be safe to drop every notion of whitespace once tokenization is complete. This should:
- Reduce memory requirements, as whitespace tokens would not be stored anymore
- Significantly simplify parser logic by removing all of the whitespace-skipping code from the parser (see the sketch after this list)
- Move the parser closer to a streaming design, though that will require many more PRs
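
To make this concrete, here is a minimal, self-contained sketch of the idea. The `Token`, `TokenWithSpan`, and `Parser` below are simplified stand-ins rather than the crate's real types (spans are omitted entirely): whitespace is filtered exactly once when the token stream is handed to the parser, after which the peek/advance/backtrack helpers quoted above collapse to plain index arithmetic.

```rust
// Hypothetical, simplified sketch of the proposed refactoring; these are
// stand-in types, not the crate's real Token/TokenWithSpan/Parser.
#[derive(Clone, Debug, PartialEq)]
enum Token {
    Whitespace(String),
    Word(String),
    Eof,
}

#[derive(Clone, Debug, PartialEq)]
struct TokenWithSpan {
    token: Token,
    // span field omitted in this sketch
}

/// Shared EOF sentinel, mirroring the existing `EOF_TOKEN` idea.
static EOF_TOKEN: TokenWithSpan = TokenWithSpan { token: Token::Eof };

struct Parser {
    tokens: Vec<TokenWithSpan>,
    index: usize,
}

impl Parser {
    /// The only place that still knows whitespace ever existed.
    fn with_tokens(tokens: Vec<TokenWithSpan>) -> Self {
        let tokens = tokens
            .into_iter()
            .filter(|t| !matches!(t.token, Token::Whitespace(_)))
            .collect();
        Self { tokens, index: 0 }
    }

    /// What `peek_nth_token_ref` could become: a bounds-checked lookup.
    fn peek_nth_token_ref(&self, n: usize) -> &TokenWithSpan {
        self.tokens.get(self.index + n).unwrap_or(&EOF_TOKEN)
    }

    /// What `advance_token` could become: a plain increment.
    fn advance_token(&mut self) {
        self.index += 1;
    }

    /// What `prev_token` could become: a plain decrement.
    fn prev_token(&mut self) {
        assert!(self.index > 0);
        self.index -= 1;
    }
}

fn main() {
    let raw = vec![
        TokenWithSpan { token: Token::Word("SELECT".into()) },
        TokenWithSpan { token: Token::Whitespace(" ".into()) },
        TokenWithSpan { token: Token::Word("1".into()) },
    ];
    let mut parser = Parser::with_tokens(raw);

    // Whitespace is already gone: the token after `SELECT` is `1`.
    assert_eq!(parser.peek_nth_token_ref(1).token, Token::Word("1".into()));

    parser.advance_token();
    assert_eq!(parser.peek_nth_token_ref(0).token, Token::Word("1".into()));

    parser.prev_token();
    assert_eq!(parser.peek_nth_token_ref(0).token, Token::Word("SELECT".into()));

    // Peeking past the end yields the EOF sentinel instead of panicking.
    assert_eq!(parser.peek_nth_token_ref(10).token, Token::Eof);
}
```

Spans would of course still have to be carried through, but the skipping loops themselves would disappear, and a single filtering point also seems like a reasonable stepping stone toward an iterator-based, streaming token source later on.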
Since such a PR would require quite a bit of effort on my part, I would appreciate some feedback before moving forward with it.
@iffyio do you happen to have any opinion regarding such a refactoring?
Ciao,
Luca