5
$\begingroup$

Given a context-free grammar and a maximum depth, how do I directly compute the Nth sentence without calculating or caching intermediary sentences?

Take as an example the following grammar:

(from http://www.nltk.org/howto/generate.html)

S -> NP VP
NP -> Det N
PP -> P NP
VP -> 'slept'
VP -> 'saw' NP
VP -> 'walked' PP
Det -> 'the'
Det -> 'a'
N -> 'man'
N -> 'park'
N -> 'dog'
P -> 'in'
P -> 'with'

Given a maximum depth of 4, it will deterministically generate the following 6 sentences:

0 the man slept
1 the park slept
2 the dog slept
3 a man slept
4 a park slept
5 a dog slept

So the brute force way of finding the Nth sentence is to compute sentences 0..N, discard them, then return the Nth. But with a more complex grammar it's wasteful to recompute it every time and it's not possible to hold the current state in memory.

But my question is, is there a way to directly compute the Nth sentence?

Also, is it possible to do the reverse: given a the grammar, depth, and sentence, determent what's the sentences index of all possible sentences. Something like:

>>> find_index_in_grammar("a man slept", grammar, depth=4)
3

Update

To clarify: traversal order, lexicographical order, or ordering by length is not important as long as it's deterministic. The example above is just one possible ordering that is deterministic and would work.

What I've tried so far is to consider the grammar as a different numeric "base", with its sentences as specific numbers in that base. With the key difference being the "base" varies with the number of possible replacements at each branch in the grammar.


def grammar_index(grammar, index):
 sentence = [grammar.start()]
 while set(sentence) & set(grammar._lhs_index): # any non-terminals?
 for i, symbol in enumerate(sentence): # loop for the non terminal
 possible_replacements = grammar._lhs_index.get(symbol)
 if possible_replacements: # if there is a replacement
 base = len(possible_replacements) # the len of replacements is the "base"
 rem = index % base
 index = index // base 
 replacement = possible_replacements[rem]
 del sentence[i]
 sentence[i:i] = replacement.rhs()
 if index <= 0:
 break
 if index <= 0:
 break
 return ' '.join(str(s) for s in sentence)

So this returns partially generated sentences with some duplicates as well.

Pros: it is deterministic, fast

Cons: has non terminals in the results...

>>> for i in range(20): 
...: s = grammar_index(grammar, i)
...: print(i, s) 
...:
0 NP VP
1 NP saw NP
2 NP walked PP
3 Det park slept
4 NP saw Det park
5 NP walked P Det park
6 Det dog slept
7 NP saw Det dog
8 NP walked P Det dog
9 a man slept
10 Det park saw Det man
asked Oct 26, 2019 at 11:55
$\endgroup$
7
  • 2
    $\begingroup$ What is "the Nth result"! $\endgroup$ Commented Oct 26, 2019 at 12:49
  • $\begingroup$ @YuvalFilmus the Nth sentence of all possible sentences of a grammar and depth. I expanded the question to give examples. $\endgroup$ Commented Oct 26, 2019 at 14:06
  • $\begingroup$ The Nth sentence in what order? Please specify your question in full. $\endgroup$ Commented Oct 26, 2019 at 14:10
  • $\begingroup$ You should update the question with this information. $\endgroup$ Commented Oct 26, 2019 at 14:21
  • 3
    $\begingroup$ Your problem is likely hard. Given a SAT formula, you can write a grammar for all non-satisfying assignments. We can tell whether the formula is satisfiable by attempting to query word number 2ドル^n$. $\endgroup$ Commented Oct 26, 2019 at 17:54

3 Answers 3

4
$\begingroup$

If the grammar is ambiguous, then the problem is NP-hard in general, as Yuval Filmus showed, so you should not expect any efficient algorithm.

If the grammar is unambiguous, and the corresponding context-free language is finite, then there is an efficient algorithm to find the $i$th word in the language. I'll describe it below.

First, note that you can count the number of words in the language efficiently. If $A$ is a symbol in the grammar (e.g., a non-terminal), let $n(A)$ denote the number of words in the language $L(A)$. If $a$ is a terminal, we see that $n(a)=1$. If $\alpha_1 \cdots \alpha_k$ is a sequence of terminals and non-terminals, then

$$n(\alpha_1 \cdots \alpha_k) = n(\alpha_1) \times \cdots \times n(\alpha_k).$$

Finally, if you have a rule $A ::= \beta | \gamma | \cdots$, then $n(A) = n(\beta) + n(\gamma) + \cdots$. So, you can use dynamic programming on this system of equations to calculate $n(A)$ for each non-terminal $A$. This lets you compute the number of words in the language as $n(S)$, where $S$ is the start symbol for the grammar.

This induces a bijection between the words of $L(S)$ and the set of integers $\{0,1,\dots,n(S)-1\}$. In particular, let $f_S(w)$ denote the integer corresponding to word $w \in L(S)$, under this bijection (the bijection associated with $S$). If we have a rule $S ::= \alpha | \beta$, then the bijection works as follows:

$$f_S(w) = \begin{cases} f_\alpha(w) &\text{if }w \in L(\alpha)\\ n(\alpha)+f_\beta(w) &\text{if }w \in L(\beta). \end{cases}$$

Also, if $w \in L(\alpha_1 \cdots \alpha_k)$ by $w=w_1\cdots w_k$ with $w_i \in L(\alpha_i)$, then

$$f_{\alpha_1 \cdots \alpha_k}(w) = f_{\alpha_1}(w_1) n(\alpha_2) \cdots n(\alpha_k) + f_{\alpha_2}(w_2) n(\alpha_3) \cdots n(\alpha_k) + \cdots + f_{\alpha_k}(w_k).$$

This then gives a recursive construction of a bijection between the words of the language and integers in the range 0ドル\dots n(S)-1$. This bijection can be efficiently computed in either direction via a straightforward algorithm. In particular, given $w$, to compute its index $f_S(w)$, you just apply the above rules recursively. Given an index $i$, to compute $f_S^{-1}(i)$ (i.e., to find the word $w$ such that $f_S(w)=i$), basically apply the above rules backwards. The only tricky part is the computation of $f_S^{-1}(i)$ when $S ::= \alpha|\beta$; that can be done with the rule

$$f_S^{-1}(i) = \begin{cases} f_\alpha^{-1}(i) &\text{if }i < n(\alpha)\\ f_\beta^{-1}(i-n(\alpha)) &\text{if }i \ge n(\alpha). \end{cases}$$

The rule for computing $f_{\alpha_1 \cdots \alpha_k}^{-1}(i)$ basically amounts to expressing $i$ in mixed-base, i.e., set $w_k=f_{\alpha_k}^{-1}(i \bmod n(\alpha_k))$, $w_{k-1} = f_{\alpha_{k-1}}^{-1}(\lfloor i/n(\alpha_k)\rfloor \bmod n(\alpha_{k-1}))$, etc., and then set $f_{\alpha_1 \cdots \alpha_k}^{-1}(i) = w_1 \cdots w_k$. This looks messy but is basically just the base-conversion algorithm so it isn't nearly as bad as it looks.

This latter business for computing $f^{-1}$ then solves exactly your problem.

answered Oct 29, 2019 at 6:10
$\endgroup$
3
  • $\begingroup$ This is super helpful, I have had to reread it multiple times, but it's sinking in. $\endgroup$ Commented Oct 29, 2019 at 18:12
  • $\begingroup$ @Tim, glad it helps! It's a simple idea, but the math and notation makes it all look complicated and obscure. I wish I knew a better way to describe the idea. Maybe you'll find some better way to extract the essence of what I'm getting at and explain it more simply. I also corrected a few errors just now. $\endgroup$ Commented Oct 29, 2019 at 21:24
  • $\begingroup$ Thank you for the correction, I couldn't figure out what $w_k$ was in that context but it makes more sense now. And I was able to implement the length of the language in code from it from your first formula, so progress was made. And this is the first time I wrote LaTeX! $\endgroup$ Commented Oct 30, 2019 at 11:14
5
$\begingroup$

The problem is likely hard, by reduction from SAT.

Let $\varphi$ be a CNF. For each clause $C$, we can construct a production which generates all truth assignments falsifying $C$. By going over all clauses, we can construct a grammar for all nonsatisfying truth assignments. For example, if the CNF is $(x_1 \lor x_2) \land (\lnot x_1 \lor x_3)$, the grammar is $$ \begin{align} &S \to 00B \mid 1B0 \\ &B \to 0 \mid 1 \end{align} $$ This grammar generates 2ドル^n$ words iff $\varphi$ is unsatisfiable. Therefore, by querying the 2ドル^n$'th word, we will be able to tell whether $\varphi$ is satisfiable or not, and so solve SAT.

answered Oct 27, 2019 at 8:57
$\endgroup$
3
  • $\begingroup$ Can you phrase your answer more plainly? youtube.com/watch?v=BxacATCHrpo $\endgroup$ Commented Oct 27, 2019 at 13:33
  • $\begingroup$ It seems I switched AND and OR. Hopefully now it makes more sense. $\endgroup$ Commented Oct 27, 2019 at 13:34
  • 1
    $\begingroup$ He describes truth assignments that don't satisfy all the clauses. For example, x1 = false, x2 = false, x3 = whatever doesn't satisfy the first clause x1 or x2. x1 = true, x2 = whatever, x3 = false doesn't satisfy the second clause not x1 or x3. Every three letter string not in the language satisfies all clauses. If it is possible to satisfy all clauses then some n-letter string is not in the language, so the number of different strings in the language is less than 2^n. $\endgroup$ Commented Oct 27, 2019 at 13:39
3
$\begingroup$

This is about the n'th derivation. If the grammar is not unique, and you want the n'th string in the grammar (with duplicates removed), that would be an awful lot harder. As Yuval shows, that would actually make the problem NP-hard.

For every symbol, calculate how many sentences can be derived in one step, two steps, three steps etc. Then we define some canonical order for derivations. For example, to derive from S in 10 steps, a canonical order would be after deriving S -> NP VP which is one step: NP (9 steps) + VP (0 steps), followed by NP (i steps) + VP (1 step) and so on.

Let S(k) = number of strings that can be derived from S in k steps. If you wanted the 1 billionth sentence derived from S, you would sum S(0), S(1), S(2) as long as the sum is less than a billion. Let the sum after k-1 steps be s < 1 billion, and after k steps>= 1 billion. Then you know you want sentence #(1 billion - s) that can be derived in k steps.

You would then add up NP(k)VP(0), NP(k-1)VP(1) etc. Obviously this won't work too well with a finite language with a small number of sentences.

answered Oct 27, 2019 at 0:01
$\endgroup$
2
  • $\begingroup$ Thanks for the response, but when you say, "S(k) = number of strings that can be derived from S in k steps" , I don't quite understand. "This is about the n'th derivation." Sorry if my question was poorly worded, I would prefer no duplicates, but I can work around them if present. My code in the update has duplicates in the results. $\endgroup$ Commented Oct 27, 2019 at 13:29
  • 1
    $\begingroup$ As Yuval said, if you want the n'th string without duplicates, then the problem is NP-complete. I'm only counting derivations, so if the same string is produced with two different derivations, it counts twice. That makes the problem an awful lot easier. $\endgroup$ Commented Oct 27, 2019 at 13:34

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.