How do I calculate the Nth result of a context-free grammar?

Question 1

Given a context-free grammar and a maximum depth, how do I directly compute the Nth sentence without calculating or caching intermediary sentences?

Take as an example the following grammar:

(from http://www.nltk.org/howto/generate.html)

S -> NP VP
NP -> Det N
PP -> P NP
VP -> 'slept'
VP -> 'saw' NP
VP -> 'walked' PP
Det -> 'the'
Det -> 'a'
N -> 'man'
N -> 'park'
N -> 'dog'
P -> 'in'
P -> 'with'

Given a maximum depth of 4, it will deterministically generate the following 6 sentences:

0 the man slept
1 the park slept
2 the dog slept
3 a man slept
4 a park slept
5 a dog slept

So the brute force way of finding the Nth sentence is to compute sentences 0..N, discard them, then return the Nth. But with a more complex grammar it's wasteful to recompute it every time and it's not possible to hold the current state in memory.

But my question is, is there a way to directly compute the Nth sentence?

Also, is it possible to do the reverse: given a the grammar, depth, and sentence, determent what's the sentences index of all possible sentences. Something like:

>>> find_index_in_grammar("a man slept", grammar, depth=4)
3

Update

To clarify: traversal order, lexicographical order, or ordering by length is not important as long as it's deterministic. The example above is just one possible ordering that is deterministic and would work.

What I've tried so far is to consider the grammar as a different numeric "base", with its sentences as specific numbers in that base. With the key difference being the "base" varies with the number of possible replacements at each branch in the grammar.


def grammar_index(grammar, index):
 sentence = [grammar.start()]
 while set(sentence) & set(grammar._lhs_index): # any non-terminals?
 for i, symbol in enumerate(sentence): # loop for the non terminal
 possible_replacements = grammar._lhs_index.get(symbol)
 if possible_replacements: # if there is a replacement
 base = len(possible_replacements) # the len of replacements is the "base"
 rem = index % base
 index = index // base 
 replacement = possible_replacements[rem]
 del sentence[i]
 sentence[i:i] = replacement.rhs()
 if index <= 0:
 break
 if index <= 0:
 break
 return ' '.join(str(s) for s in sentence)

So this returns partially generated sentences with some duplicates as well.

Pros: it is deterministic, fast

Cons: has non terminals in the results...

>>> for i in range(20): 
...: s = grammar_index(grammar, i)
...: print(i, s) 
...:
0 NP VP
1 NP saw NP
2 NP walked PP
3 Det park slept
4 NP saw Det park
5 NP walked P Det park
6 Det dog slept
7 NP saw Det dog
8 NP walked P Det dog
9 a man slept
10 Det park saw Det man

Question 2

What is "the Nth result"!

Question 3

@YuvalFilmus the Nth sentence of all possible sentences of a grammar and depth. I expanded the question to give examples.

Question 4

The Nth sentence in what order? Please specify your question in full.

Question 5

You should update the question with this information.

Question 6

Your problem is likely hard. Given a SAT formula, you can write a grammar for all non-satisfying assignments. We can tell whether the formula is satisfiable by attempting to query word number 2ドル^n$.

Question 7

If the grammar is ambiguous, then the problem is NP-hard in general, as Yuval Filmus showed, so you should not expect any efficient algorithm.

If the grammar is unambiguous, and the corresponding context-free language is finite, then there is an efficient algorithm to find the $i$th word in the language. I'll describe it below.

First, note that you can count the number of words in the language efficiently. If $A$ is a symbol in the grammar (e.g., a non-terminal), let $n(A)$ denote the number of words in the language $L(A)$. If $a$ is a terminal, we see that $n(a)=1$. If $\alpha_1 \cdots \alpha_k$ is a sequence of terminals and non-terminals, then

$$n(\alpha_1 \cdots \alpha_k) = n(\alpha_1) \times \cdots \times n(\alpha_k).$$

Finally, if you have a rule $A ::= \beta | \gamma | \cdots$, then $n(A) = n(\beta) + n(\gamma) + \cdots$. So, you can use dynamic programming on this system of equations to calculate $n(A)$ for each non-terminal $A$. This lets you compute the number of words in the language as $n(S)$, where $S$ is the start symbol for the grammar.

This induces a bijection between the words of $L(S)$ and the set of integers $\{0,1,\dots,n(S)-1\}$. In particular, let $f_S(w)$ denote the integer corresponding to word $w \in L(S)$, under this bijection (the bijection associated with $S$). If we have a rule $S ::= \alpha | \beta$, then the bijection works as follows:

$$f_S(w) = \begin{cases} f_\alpha(w) &\text{if }w \in L(\alpha)\\ n(\alpha)+f_\beta(w) &\text{if }w \in L(\beta). \end{cases}$$

Also, if $w \in L(\alpha_1 \cdots \alpha_k)$ by $w=w_1\cdots w_k$ with $w_i \in L(\alpha_i)$, then

$$f_{\alpha_1 \cdots \alpha_k}(w) = f_{\alpha_1}(w_1) n(\alpha_2) \cdots n(\alpha_k) + f_{\alpha_2}(w_2) n(\alpha_3) \cdots n(\alpha_k) + \cdots + f_{\alpha_k}(w_k).$$

This then gives a recursive construction of a bijection between the words of the language and integers in the range 0ドル\dots n(S)-1$. This bijection can be efficiently computed in either direction via a straightforward algorithm. In particular, given $w$, to compute its index $f_S(w)$, you just apply the above rules recursively. Given an index $i$, to compute $f_S^{-1}(i)$ (i.e., to find the word $w$ such that $f_S(w)=i$), basically apply the above rules backwards. The only tricky part is the computation of $f_S^{-1}(i)$ when $S ::= \alpha|\beta$; that can be done with the rule

$$f_S^{-1}(i) = \begin{cases} f_\alpha^{-1}(i) &\text{if }i < n(\alpha)\\ f_\beta^{-1}(i-n(\alpha)) &\text{if }i \ge n(\alpha). \end{cases}$$

The rule for computing $f_{\alpha_1 \cdots \alpha_k}^{-1}(i)$ basically amounts to expressing $i$ in mixed-base, i.e., set $w_k=f_{\alpha_k}^{-1}(i \bmod n(\alpha_k))$, $w_{k-1} = f_{\alpha_{k-1}}^{-1}(\lfloor i/n(\alpha_k)\rfloor \bmod n(\alpha_{k-1}))$, etc., and then set $f_{\alpha_1 \cdots \alpha_k}^{-1}(i) = w_1 \cdots w_k$. This looks messy but is basically just the base-conversion algorithm so it isn't nearly as bad as it looks.

This latter business for computing $f^{-1}$ then solves exactly your problem.

Question 8

This is super helpful, I have had to reread it multiple times, but it's sinking in.

Question 9

@Tim, glad it helps! It's a simple idea, but the math and notation makes it all look complicated and obscure. I wish I knew a better way to describe the idea. Maybe you'll find some better way to extract the essence of what I'm getting at and explain it more simply. I also corrected a few errors just now.

Question 10

Thank you for the correction, I couldn't figure out what $w_k$ was in that context but it makes more sense now. And I was able to implement the length of the language in code from it from your first formula, so progress was made. And this is the first time I wrote LaTeX!

Question 11

The problem is likely hard, by reduction from SAT.

Let $\varphi$ be a CNF. For each clause $C$, we can construct a production which generates all truth assignments falsifying $C$. By going over all clauses, we can construct a grammar for all nonsatisfying truth assignments. For example, if the CNF is $(x_1 \lor x_2) \land (\lnot x_1 \lor x_3)$, the grammar is $$ \begin{align} &S \to 00B \mid 1B0 \\ &B \to 0 \mid 1 \end{align} $$ This grammar generates 2ドル^n$ words iff $\varphi$ is unsatisfiable. Therefore, by querying the 2ドル^n$'th word, we will be able to tell whether $\varphi$ is satisfiable or not, and so solve SAT.

Question 12

Can you phrase your answer more plainly? youtube.com/watch?v=BxacATCHrpo

Question 13

It seems I switched AND and OR. Hopefully now it makes more sense.

Question 14

He describes truth assignments that don't satisfy all the clauses. For example, x1 = false, x2 = false, x3 = whatever doesn't satisfy the first clause x1 or x2. x1 = true, x2 = whatever, x3 = false doesn't satisfy the second clause not x1 or x3. Every three letter string not in the language satisfies all clauses. If it is possible to satisfy all clauses then some n-letter string is not in the language, so the number of different strings in the language is less than 2^n.

Question 15

This is about the n'th derivation. If the grammar is not unique, and you want the n'th string in the grammar (with duplicates removed), that would be an awful lot harder. As Yuval shows, that would actually make the problem NP-hard.

For every symbol, calculate how many sentences can be derived in one step, two steps, three steps etc. Then we define some canonical order for derivations. For example, to derive from S in 10 steps, a canonical order would be after deriving S -> NP VP which is one step: NP (9 steps) + VP (0 steps), followed by NP (i steps) + VP (1 step) and so on.

Let S(k) = number of strings that can be derived from S in k steps. If you wanted the 1 billionth sentence derived from S, you would sum S(0), S(1), S(2) as long as the sum is less than a billion. Let the sum after k-1 steps be s < 1 billion, and after k steps>= 1 billion. Then you know you want sentence #(1 billion - s) that can be derived in k steps.

You would then add up NP(k)VP(0), NP(k-1)VP(1) etc. Obviously this won't work too well with a finite language with a small number of sentences.

Question 16

Thanks for the response, but when you say, "S(k) = number of strings that can be derived from S in k steps" , I don't quite understand. "This is about the n'th derivation." Sorry if my question was poorly worded, I would prefer no duplicates, but I can work around them if present. My code in the update has duplicates in the results.

Question 17

As Yuval said, if you want the n'th string without duplicates, then the problem is NP-complete. I'm only counting derivations, so if the same string is produced with two different derivations, it counts twice. That makes the problem an awful lot easier.

D.W. ♦ 169k23 gold badges236 silver badges519 bronze badges · Accepted Answer · 2019-10-29 06:10:11Z