I've been trying to figure out an algorithm to go from an infix equation to a syntax tree, like so:
(1+3)*4+5
+
* 5
+ 4
1 3
However, I don't just want it to handle operators, I want it to handle functions with arbitrary argument numbers as well, i.e:
max(1,3,7)*4+5
+
* 5
max 4
1 3 7
Here's the general algorithm I've come up with:
You start with the root node of the tree, containing a null
value. You have a pointer which moves around the tree as you parse the expression, and starts pointed at the root node.
There are also some aspects of the tree I should probably clarify:
- Inserting at a node means adding to the end of the node's children.
- Injecting at a node means adding to a specific index in the node, and removing the node at that index and inserting it to the injected node. So, if node
A
has childB
at index 0, and we inject nodeC
at index 0, nodeA
will have a childC
which will have a childB
. - Replacing at an index removes the node at that index and puts the alternate node in its stead. So if we have node
A
with childB
at index 0, and we replace usingC
at index 0, we will have nodeA
with childC
.
Ok, so here's the algorithm so far.
For every token in the infix string:
- if the token is a number
- insert it as a child of the current node
- if the token is an argument separator
- traverse up the tree until the value of your current node is a function
- if the token is a left parenthesis
- if the value of the current node is not a function, insert our token as a child node, and set our current node to the token's node.
- if the token is a right parenthesis
- traverse until the current node is either a left parenthesis or a function
- if the current node is a left parenthesis, replace it with its first child (index 0). This is equivalent to removing the parenthesis node from the tree structure, while keeping its first child intact.
- traverse up one level, to the parent of the current node
- if the token is a function
- insert the token as a child node of the current node, and set the current node to the newly inserted child node
- if the token is an operator
- if the current node is not a left parenthesis or the root node
- traverse up if
- the current node is not at the root, or
- the token is right associative and the precedence of the token is less than the precedence of the current node or
- the token is left associative and the precedence of the token is less than or equal to the precedence of the current node
- traverse up if
- inject the token as a new node at the last index of the current node
- set the current node to its newly added token child node
- if the current node is not a left parenthesis or the root node
Once you have gone through all the tokens, return the first child of the root node.
Is there an existing algorithm I can check this against? Are there any obvious problems with this? Are there any particularly difficult to parse problems I can plug in using this and see if they work?
2 Answers 2
Treat the comma as an infix operator. Then
max(1,3,7)*4+5
becomes
+
/ \
* 5
/ \
max 4
|
,
/ \
, 7
/ \
1 3
The comma should have a lower precedence than your calculation operators (+ - * / etc.).
-
Ok, that's an interesting notion. Why?MirroredFate– MirroredFate2015年07月23日 18:02:00 +00:00Commented Jul 23, 2015 at 18:02
-
Different approach - apply the function on two arguments and two arguments again.
max(1,2,3)
gets changed tomax(1,(max(2,3))
user40980– user409802015年07月23日 19:15:23 +00:00Commented Jul 23, 2015 at 19:15 -
1: You already have the mechanics in place for parsing binary operators. 2: Standardizes the number of children a node can have so that walking the tree is simpler.A. I. Breveleri– A. I. Breveleri2015年07月24日 03:22:35 +00:00Commented Jul 24, 2015 at 3:22
-
@ MichaelT: Good idea but only works if the function composes its arguments associatively. It wouldn't work for eg. conditional( boolean, trueValue, Falsevalue ) or substring( string, count, start ) etc.A. I. Breveleri– A. I. Breveleri2015年07月24日 03:25:37 +00:00Commented Jul 24, 2015 at 3:25
At some point you just need to graduate to a proper parser, because there are too many things to track and too many boundary cases. Adding function calls to the peg.js example grammar, you get.
primary
= integer
/ "(" additive:additive ")" { return additive; }
/ function_call
function_call
= [a-z]+ "(" args ")"
args
= (additive ("," additive)*)?
Note this easily handles boundary cases like function calls and expressions nested within function calls:
max(3+4,min(6,(1+2)*3))+2
max(1,2,3 @3)
or something. This is pretty ugly and why I moved to a syntax tree. Right now I'm calling it shunting-tree... although there is probably already a named algorithm that does this I just haven't found.