Efficiently finding an infix of a word that is rejected by an NFA

Question 1

I am interested in the fine-grained complexity of the following NFA infix rejection problem:

Input: a word $w$, a nondeterministic finite automaton $A$
Output: is there an infix of $w$ (i.e., some $u$ with $w = sut$ for some $s,t$) that is rejected by $A$

The naive algorithm for NFA infix rejection is to consider every infix of $w$ and run a state-set simulation of $A$ on each infix: this gives complexity $O(|w|^2 \times |w| \cdot |A|)$. Less naively, we can run $|w|$ state-set simulations from every starting position and see if we ever encounter a set of states that contains no final state (witnessing that some infix is rejected): this gives complexity $O(|w|^2 \cdot |A|)$.

In terms of lower bounds, it is easy to adapt the result of Backurs and Indyk. They show that, assuming SETH, we cannot determine whether an NFA $A'$ accepts a word $w'$ in time $O((|A'| \cdot |w'|)^{1-\epsilon})$ for any $\epsilon$. Up to adding unique delimiters at the beginning and end of the word, this easily implies that NFA infix rejection also does not have an $O((|w| \cdot |A|)^{1-\epsilon})$ algorithm assuming SETH.

Hence the question: it is possible to show a better conditional lower bound on NFA infix rejection, or can one somehow improve on the upper bound above?

Note that the problem where we want a infix accepted by $A$ is easily seen to be in $O(|w| \cdot |A|)$, simply by building $A'$ from $A$ that recognizes the language $\Sigma^* L(A) \Sigma^*$ and testing whether $A'$ accepts $w$ by state-set simulation.

(This problem is from a recent paper where we investigate related questions. We thought a bit about this NFA infix rejection problem specifically, but did not see what could be said to pinpoint its precise complexity. I am asking the question here in case there are insights from neighboring communities.)

Question 2

Yes, there is a strictly stronger SETH-based conditional lower bound for NFA infix rejection, and the best general upper bound remains quadratic in $n$ with word-parallel speedups. In particular, when $m \ge n$ no general improvement below $O(n^{2-\varepsilon})$ is possible under SETH. The following binary and $\varepsilon$-free construction, together with a word-RAM upper bound, gives a complete picture.

Let $n = |w|$ and $m = |Q|$. Fix any $\alpha \in (0,1]$ and $\varepsilon > 0$. Unless SETH fails, there is no deterministic or randomized algorithm that, on all instances with $m = \Theta(n^\alpha)$, decides whether there exists a nonempty infix $u$ of $w$ with $\Delta_u(I) \cap F = \varnothing$ in time

$$O\big(n^{1+\alpha-\varepsilon}\big).$$

For $\alpha = 1$, that is $m = \Theta(n)$, this rules out $O(n^{2-\varepsilon})$. For $\alpha > 1$ the $\alpha = 1$ case already implies the same barrier. The claim holds even for binary, $\varepsilon$-free NFAs.

Reduce from orthogonal vectors on $U, V \subseteq \{0,1\}^d$ with $|U| = |V| = N$ and $d = c\log N$, which is the standard SETH setting. Operate over the binary alphabet $\Sigma = \{0,1\}$. Encode each bit with the 2-bit cell $\mathrm{enc}(0) = 01$ and $\mathrm{enc}(1) = 10$. Any concatenation of such cells may contain 00ドル$ or 11ドル$ across boundaries but never 000ドル$ or 111ドル$. Use self-synchronizing markers that contain a forbidden triple. Take headers $H_U = 0000$ and $H_V = 0001$ and a unique barrier $B = 1110$. Since cell streams avoid 000ドル$ and 111ドル$, neither a header nor the barrier can appear inside cells or straddling a cell boundary. Any substring containing 000ドルx$ aligns to a header and any substring containing 1110ドル$ aligns to the barrier. For $x \in \{0,1\}^d$ define $E_U(x) = H_U \cdot \mathrm{enc}(x_1)\cdots\mathrm{enc}(x_d)$ and $E_V(x) = H_V \cdot \mathrm{enc}(x_1)\cdots\mathrm{enc}(x_d)$, and form the text

$$w = E_U(u^{(1)})\cdots E_U(u^{(N)}) \cdot B \cdot E_V(v^{(1)})\cdots E_V(v^{(N)}).$$

By synchronization, every infix spanning $B$ decomposes uniquely as $E_U(u^{(p)}) \cdot B \cdot E_V(v^{(q)})$, and any other infix is malformed because it chops a header or has the wrong number of cells or the wrong number of barriers.

Build a binary, $\varepsilon$-free NFA $A$ as a disjoint union $A = A_{\mathrm{syn}} \uplus A_{\mathrm{pair}}$ with multiple initial states, so no $\varepsilon$ is used for the union. The component $A_{\mathrm{syn}}$ is an $\varepsilon$-free binary DFA with $O(d)$ states that accepts exactly the malformed strings. It checks that there is exactly one barrier $B$, that headers align, and that there are exactly $d$ cells after each header by maintaining alternation and a small counter. The component $A_{\mathrm{pair}}$ handles the well-formed case. It recognizes exactly strings of the shape $H_U \cdot \text{cells} \cdot B \cdot H_V \cdot \text{cells}$ and otherwise rejects. Upon $H_U$ it nondeterministically chooses $i \in \{1,\dots,d\}$, consumes exactly $d$ cells while checking that the $i$-th $U$-cell is $\mathrm{enc}(1) = 10$, then requires $B$ and $H_V$, then consumes exactly $d$ cells and accepts iff the $i$-th $V$-cell is $\mathrm{enc}(1) = 10$. Accepting states are placed only at end of input so strict superstrings do not accept. All transitions consume one input bit and there are no $\varepsilon$-moves. For a well-formed infix $x = E_U(u^{(p)}) \cdot B \cdot E_V(v^{(q)})$, $A_{\mathrm{pair}}$ accepts iff there exists $i$ with $u^{(p)}_i = v^{(q)}_i = 1$, hence $A$ rejects $x$ iff $\langle u^{(p)}, v^{(q)}\rangle = 0$. For malformed infixes $A_{\mathrm{syn}}$ accepts and thus the union accepts, so

$$\Big(\exists\ \text{nonempty infix } u \text{ of } w\ \text{with}\ u \notin L(A)\Big)\ \Longleftrightarrow\ \exists p,q\ \text{with}\ \langle u^{(p)}, v^{(q)}\rangle = 0.$$

The instance sizes satisfy $n = \Theta(N \cdot d) = \Theta(N\log N)$ and $|A| = \Theta(d)$ before padding. To realize $m = \Theta(n^\alpha)$ for any fixed $\alpha \in (0,1]$, pad by the disjoint union of $\varepsilon$-free binary DFA gadgets that never accept any substring of $w$, for example tiny DFAs for the pattern 1111ドル$, which never appears because cell streams avoid 111ドル$ and the unique barrier is 1110ドル$. This preserves the answer. The construction, including padding, takes time linear in the output size $O(n+m)$ with $n = \Theta(N\log N)$ and $m = \Theta(d)$ before padding. If an algorithm ran in $O(n^{1+\alpha-\varepsilon})$ time for all inputs with $m = \Theta(n^\alpha)$, plugging in $n = \Theta(N\log N)$ gives an algorithm for orthogonal vectors in

$$O\big((N\log N)^{1+\alpha-\varepsilon}\big) = O\big(N^{1+\alpha-\varepsilon/2}\big),$$

which contradicts SETH.

For the upper bound, use the quadratic-frontier dynamic program with bit-parallelism. Let $\Gamma$ be the reflexive $\varepsilon$-closure on $Q$. Define $\Delta_a(S) = \Gamma(\delta(\Gamma(S),a))$ and for $u = a_1\cdots a_k$ define $\Delta_u = \Delta_{a_k} \circ \cdots \circ \Delta_{a_1}$. These keep sets $\varepsilon$-closed. Compute $\varepsilon$-SCCs and the condensation and let $\pi: 2^Q \to \{0,1\}^s$ project to SCC ids. Let $F_{\mathrm{SCC}}$ be the reflexive $\varepsilon$-upward closure of SCCs containing finals. For each $a \in \Sigma$ precompute an SCC-level transfer mapping that from $\pi(T)$ produces $\pi(\Delta_a(T))$ in $O(s/W)$ operations on a word-RAM with size $W = \Theta(\log(n+m))$. Process starts sequentially to minimize space. For each $i$ from 1ドル$ to $n$ set $S \leftarrow \Gamma(I)$ and for $j$ from $i$ to $n$ update $S \leftarrow \Delta_{w[j]}(S)$ using the precomputed SCC transfer, and after each update test rejection of the current factor via $(\pi(S) \wedge F_{\mathrm{SCC}}) = 0$. The one-step predicate using a mask $\mathrm{Good}[a]$ answers whether $\Delta_a(T) \cap F = \varnothing$ for the next letter and is not used to certify the current factor. Each update takes $O(m/W)$ operations, there are $\Theta(n^2)$ updates overall, so the running time is

$$O\left(n^2 \cdot \frac{m}{W}\right).$$

The frontier uses $O(m/W)$ memory, and the precomputed tables are near-linear in $m$ and in the number of letter edges. When $m \ge n$ this matches the lower bound up to the factor $W$ and therefore cannot be improved to $O(n^{2-\varepsilon})$ in general even for binary and $\varepsilon$-free NFAs. When $m < n$ the same algorithm remains the best general bound without extra structure, and any improvement to $o(n^{1+\alpha})$ for $m = \Theta(n^\alpha)$ would contradict the lower bound at $\alpha$.

If empty infixes are allowed, first check whether $\Gamma(I)\cap F=\varnothing$. If it holds, the empty infix is already a witness and the answer is "yes". Otherwise continue, and from then on consider only nonempty infixes.

Question 3

Hi, thanks for the answer! I see the proposed coding. I wonder if it couldn't be simplified a bit by having beginning and ending markers for the vectors (and different beginning and ending markers for the u and v vectors) so that $A_{\text{syn}}$ can accept anything that does not start by an u start marker and end by a v end marker or that contains no boundary $B$ (no need to count). I also think there may be a mistake in what $A_{\text{pairs}}$ accepts (the infix will span several $H_U,ドル not just one followed by cells, likewise after $B$).

Question 4

Fortunately I think you can still define $A_{\text{pairs}}$ to have linear size. Code the vectors of $v$ as their mirror image. Now $A_{\text{pairs}}$ reads $H_U,ドル read some number of cells, guess $i$ and read a 10, move to a self-loop $q_i$ that skips until it nondeterminically chooses to read a 10, count $i$ characters, and then must see a v end marker and stop. Crucially the ending part can be mutualized between the various $q_i,ドル i.e., the nondetermininstic choices to read a 10 can jump at some point at distance $i$ from the end of one single path shared by all the $q_i$

Question 5

So up to these adjustments I think this can work. I think it's a great coding to use infix choice to guess the pair of vectors, use nondeterminism to check non-orthogonality, and take union with an automaton accepting malformed infixes. Thanks a lot for these insights!

Question 6

PS: in fact as $d$ is logarithmic it's not so crucial that $A_{\text{pairs}}$ is in $O(d)$. The naive construction in $O(d^2)$ would also work (or indeed anything polynomial in $d$).

Question 7

In the end, I thought the lower bound in this answer could be rewritten more clearly (and I didn't fact-check the padding argument or the log speedup in the answer), so I wrote my own answer which focuses on what I cared about in the question. Thanks a lot again!

Question 8

For clarity let me write a self-contained answer inspired by the arguments by @AlexKarpy in this answer.

Let $n$ be the size of the input word and $m$ the size of the input NFA. The lower bound applies for NFAs even without epsilon-transitions. We show:

Conditionally to the OV-conjecture (i.e., conditionally to SETH), there do not exist $q \in \mathbb{N}$ and $\epsilon > 0$ such that we can solve the NFA infix rejection problem on alphabet $\Sigma = \{0,1\}$ in time $O(n^{2-ε} m^q)$.

We reduce from the orthogonal vectors problem. Given 2ドルn$ Boolean $d$-dimensional vectors $U = u_1, \ldots, u_n$ and $V = v_1, \ldots, v_n$ with $d = \log n$, we want to determine if there exist 1ドル \leq i, j \leq n$ such that $u_i$ and $v_j$ are orthogonal. Let $h \colon \Sigma^* \to \Sigma^*$ be the morphism defined by $h(0) := 01$ and $h(1) := 10$, and let $B := 11011$ and $E := 11111$ be special markers. For simplicity we assume without loss of generality that $d \geq 6$, which is the case for sufficiently large $n$. We write $u^R$ to denote the reverse of a word $u$.

Build from the input $U,V$ the following word $w$:

$$w = (B h(u_1)) \cdots (B h(u_n)) (h(v_1)^R E) \cdots (h(v_n)^R E)$$

And build an NFA $A$ as the disjunction of three NFAs $A_B, A_E, A'$:

$A_B$ accepts precisely the words that do not start with $B$.
$A_E$ accepts precisely the words that do not end with $E$.
$A'$ accepts precisely the words having the following form for some 0ドル \leq k < d$:

$$B ~ \Sigma^{2k} ~ 10 ~ \Sigma^* ~ (10)^R ~\Sigma^{2k} ~E$$

The word $w$ is build from $U,V$ in time $O(n d) = \tilde{O}(n)$, and the NFA $A$ can be naively built in time $O(d^2) = \tilde{O}(1)$. (Note that we can in fact build $A$ in $O(d)$, but this is not needed for the proof to work.)

We will show the following correctness claim (*): $A$ rejects some infix of $w$ iff the input features two orthogonal vectors. Assuming (*), we can conclude the proof by proceeding by contradiction: let us assume that we solve NFA infix rejection in time $O(n^{2-\epsilon} m^q)$ for some $q \in \mathbb{N}$ and $\epsilon > 0$, then we can solve the orthogonal vectors problem in time $\tilde{O}(n^{2-\epsilon})$, contradicting the OV-conjecture. So all that remains is to show claim (*).

In the forward direction, assume that $u_i$ and $v_j$ are orthogonal, and consider the infix $w_{ij} := B h(u_i) \cdots (h(v_j)^R) E$ of $w$. The infix $w_{ij}$ starts with $B$ and ends with $E$ so it is rejected by $A_B$ and $A_E$. To show that it is rejected by $A'$ (hence by $A$), let us assume by contradiction that $A'$ accepts it. Then it would witness that there is 0ドル \leq k \leq d$ such that the $k+1$-th letter of $u_i$ is 1ドル$ and the $k+1$-th letter of $v_j^R$ from the end is 1ドル$, contradicting the fact that $u_i$ and $v_j$ are orthogonal. This establishes the forward direction of (*).

In the backward direction, assume that some infix $w'$ of $w$ is rejected by $A$. As $w'$ is rejected by $A_B$ and $A_E$, we know that $w' = B w'' E$. From the definition of $h$ we know that the images of $h$ do not feature two contiguous 1ドル$'s, and from the assumption that $d \geq 6$ we know that the $B$'s and $E$'s in the definition of $w$ above are far apart. This implies that the only infixes of the form $w' = B w'' E$ in $w$ are infixes of the form $w_{ij}$ as defined in the previous paragraph. So the rejected infix $w' = B w'' E$ gives us 1ドル \leq i, j \leq n$ such that $w_{ij}$ is rejected by $A'$. This implies that, for each choice of 0ドル \leq k < d$, it cannot be the case that the $k+1$-th letter of $u_i$ and the $k+1$-th letter of $v_j$ are both 1ドル$, i.e., $u_i$ and $v_j$ are orthogonal. This establishes the backward direction of (*) and concludes the proof.

Alex Karpy 113 bronze badges · Answer 1 · 2025-10-09 17:35:35Z