I'm reading this paper (Tactic Learning and Proving for the Coq Proof Assistant written by Blaauwbroek et al). In section 3 (Proof Recording) it mentions one-shingles and two-shingles :
After creating a database of tactics and recorded proof states, we want to predict the correct tactic to make progress in a new, unseen proof state. For this, we must find a good characterization of the sentences present in a proof state. Currently, this characterization is a set of features. The features of a sentence are all of its one-shingles and two-shingles, meaning they are all identifiers and pairs of adjacent identifiers in the abstract syntax tree. For example, the sentence (x + y) + z = x + (y + z) will be converted to the feature set $$\{eq, plus, x, y, z, eq\_plus, plus\_x, plus\_y, plus\_z, eq\_z, eq\_x\}$$
What does a two-shingle mean? Here is my interpretation, first we need to draw the abstract syntax tree:
=
/ \
+ +
/ \ | \
+ z x +
/ \ / \
x y y z
Now, I assume one-shingles mean the set of all of the nodes, so it would be S1 = {eq, plus, x, y, z}. I also assume that two-shingles mean the node and it's parent, then it would be S2 = { eq plus, plus plus, plus x, plus y, plus z }.
How "eq_z" is considered a two shingle? I'm totally confused!
1 Answer 1
By the context you provided I suspect the author is talking about a way to automatically apply a tactic to a proof goal having some lexical property. In that case he/she is trying to solve some sort of matching problem.
The paper already told you what they are: "The features of a sentence are [...] all identifiers and pairs of adjacent identifiers in the abstract syntax tree." The notation and the language employed are definitely strange. By some books' standards (Dragon book, Modern compiler design) plain wrong.
Here, the author is finding all pairs of symbols appearing in the input string. In fact, "shingles" are lexems of language. "Two-shingles" are pair of lexems, but apparently, independently by the position of the adjacent lexem. This is why there is 'z=' and '=x'. To me, the mistake is the presence of '=+'.
I assume parentheses are not included because the precedence of the operations is encoded by the abstract syntax tree.
Explore related questions
See similar questions with these tags.
eq_zandeq_xshouldn't be there. You can email the authors for confirmation. $\endgroup$