-
Notifications
You must be signed in to change notification settings - Fork 104
-
The parser right now emits both an abstract and a concrete syntax tree. The abstract syntax tree contains pretty much what is emitted from libpg_query: a list of statements, each with their respective text range.
use cstree::text::TextRange; use pg_query::NodeEnum; #[derive(Debug)] pub struct RawStmt { pub stmt: NodeEnum, pub range: TextRange, }
The concrete syntax tree right now is just a flat list of tokens beneath the root node of each statement. This should be good enough for most features of a lsp, but at some point we need a deeply nested one. A deeply nested tree contains not only the tokens, but also the nested nodes.
For example, the statement
CREATE TABLE weather ( city varchar(80) references cities(name), temp_lo int, temp_hi int, prcp real, date date );
parsed into a deely nested cst is
CreateStmt@0..174
Create@0..6 "CREATE"
Whitespace@6..7 " "
Table@7..12 "TABLE"
Whitespace@12..13 " "
RangeVar@13..20
Ident@13..20 "weather"
Whitespace@20..21 " "
Ascii40@21..22 "("
Newline@22..23 "\n"
Whitespace@23..31 " "
ColumnDef@31..76
Ident@31..35 "city"
Whitespace@35..41 " "
Varchar@41..48 "varchar"
Ascii40@48..49 "("
Iconst@49..51 "80"
Ascii41@51..52 ")"
Whitespace@52..53 " "
References@53..63 "references"
Whitespace@63..64 " "
Ident@64..70 "cities"
Ascii40@70..71 "("
NameP@71..75 "name"
Ascii41@75..76 ")"
Ascii44@76..77 ","
Newline@77..78 "\n"
Whitespace@78..86 " "
ColumnDef@86..99
Ident@86..93 "temp_lo"
Whitespace@93..96 " "
IntP@96..99 "int"
Ascii44@99..100 ","
Newline@100..101 "\n"
Whitespace@101..109 " "
ColumnDef@109..122
Ident@109..116 "temp_hi"
Whitespace@116..119 " "
IntP@119..122 "int"
Ascii44@122..123 ","
Newline@123..124 "\n"
Whitespace@124..132 " "
ColumnDef@132..146
Ident@132..136 "prcp"
Whitespace@136..142 " "
Real@142..146 "real"
Ascii44@146..147 ","
Newline@147..148 "\n"
Whitespace@148..156 " "
ColumnDef@156..172
Ident@156..160 "date"
Whitespace@160..166 " "
Ident@166..170 "date"
Newline@170..171 "\n"
Ascii41@171..172 ")"
Ascii59@172..173 ";"
Right now, we only parse it into
CreateStmt@0..174
Create@0..6 "CREATE"
Whitespace@6..7 " "
Table@7..12 "TABLE"
Whitespace@12..13 " "
Ident@13..20 "weather"
Whitespace@20..21 " "
Ascii40@21..22 "("
Newline@22..23 "\n"
Whitespace@23..31 " "
Ident@31..35 "city"
Whitespace@35..41 " "
Varchar@41..48 "varchar"
Ascii40@48..49 "("
Iconst@49..51 "80"
Ascii41@51..52 ")"
Whitespace@52..53 " "
References@53..63 "references"
Whitespace@63..64 " "
Ident@64..70 "cities"
Ascii40@70..71 "("
NameP@71..75 "name"
Ascii41@75..76 ")"
Ascii44@76..77 ","
Newline@77..78 "\n"
Whitespace@78..86 " "
Ident@86..93 "temp_lo"
Whitespace@93..96 " "
IntP@96..99 "int"
Ascii44@99..100 ","
Newline@100..101 "\n"
Whitespace@101..109 " "
Ident@109..116 "temp_hi"
Whitespace@116..119 " "
IntP@119..122 "int"
Ascii44@122..123 ","
Newline@123..124 "\n"
Whitespace@124..132 " "
Ident@132..136 "prcp"
Whitespace@136..142 " "
Real@142..146 "real"
Ascii44@146..147 ","
Newline@147..148 "\n"
Whitespace@148..156 " "
Ident@156..160 "date"
Whitespace@160..166 " "
Ident@166..170 "date"
Newline@170..171 "\n"
Ascii41@171..172 ")"
Ascii59@172..173 ";"
libg_query
returns the ast, and a list of all tokens with their text span. A cst is a tree structure containing both: the ast nodes, and the tokens. Unfortunately, reverse engineering into a cst is not straightforward because of a few limitations of libg_query:
- For some reason, not all ast nodes have a location. For example, a SelectStmt does not have one. This is especially a problem when the SelectStmt Is not the root node, but nested within a CTE.
- The ast nodes have neither a length nor an end position.
- The top-level
nodes()
method does not return all nodes in a statement, but just the ones that are not list children of the root node. For example, a CreateTableStmt does not return the individual column definition nodes, just theRangeVar
node that contains the name of the table.
Attempt 1 (on hold)
My first attempt for 1 and 2 is pushed to branch feat/deep-cst
. It can be called an "whatever necessary" approach because I tried to merge the tokens and the nodes using
- As many control mechanisms as possible to figure out what token(s) should be children of what nodes
SiblingToken
are tokens that always belong together, such as(
and)
. The closing token must be applied on the same depth as its opening tokenTokenType
defines how a token should be applied. You can find details in the inline docs.
- Derive the location of every node that does not have one using regular expression (e.g. find
select
for aSelectStmt
)
While I could not find a conflicting case yet, this already took a significant effort and it feels like a rabbit hole. It feels very fuzzy to decide whether or not a token should be beneath a node solely based on a manually defined TokenType
.
Attempt 2 (todo)
Another idea I am currently pondering on is to start from the ast nodes, and try to figure out what tokens belong beneath each node from the ast itself. It is less fuzzy than the previous approach, because we use more information to determine the children of each node. Also, it will require "just" one implementation for every node, as opposed to guessing a type based on experience. On a high level, this is how it could work:
- Recursively walk ast of every statement, similar to
get_children
- For every node, try to find what tokens from the list of tokens for the statement are its direct children based on its properties. For example, a
RangeVar
node has arelname
property of typestring
. We can search for the string token by the name of the relation and setRangeVar
as the parent node of the respective token. - Some nodes will have static keywords tokens as their children, e.g.
SelectStmt
will have theSelect
token. - In most cases, there will be just a single token that can be a child of the node. But not always. For example a cte with multiple select statements. The correct
Select
token will be the one that is the first after the start position of its parentSelectStmt
node. We can get at least an estimation of the location of theSelectSmt
from- the location property on the node itself,
- if not there, from the location property of its immediate parent, recursively
The result should be some data structure that holds all tokens, and, if found, their parent node. In theory, the only tokens without a parent node should be whitespaces, brackets and other syntactical elements.
From here, we can determine whether or not to close a node before applying a token by validating that the token has the current node as its parent.
Beta Was this translation helpful? Give feedback.
All reactions
Replies: 1 comment
-
started with attempt 2: #38
Beta Was this translation helpful? Give feedback.