Algorithm to find number of covering sets

Question 1

UPD: The following problem comes from error correction codes, to be precise -- either maximum-likelihood or belief-propagation decoding when the spectrum of special failing subsets is known (to some extent). Then the failure probability can be formulated as follows.

Let us say the universe set is $\mathcal U = \{1,2,\dotsc,N\}$ and we have $M$ subsets of it: $S_1, S_2, \dotsc, S_M$ ($S_i \subseteq \mathcal U$ for each $i$). Given $W,ドル find number of $W$-subsets of $\mathcal U$ that cover at least one of $S_1, S_2, \dotsc, S_M$. Mathematically speaking -- find cardinality of the following set: $$ \{S \subseteq \mathcal U : |S|=W, \exists i \text{ s.t. } S_i \subseteq S\} ,円. $$

NB: union of the sets $S_1, S_2, \dotsc, S_M$ is not necessarily $\mathcal U$.

The algorithm does not need to be the most efficient, as I just need to find these numbers for my other problem. Hence, say, running time of 1 hour is acceptable. (Well, in the worst case, even 1 day might be okay if there is nothing better). The order of parameters I am interested in: $N \approx 200,ドル $M \leq 50,ドル $|S_i| \leq 8,ドル $W \leq 30$.

UPD: It seems that to find exact answer is very hard, therefore I am also interested in good bounds and/or approximations.

Example. $N=4, M=2, S_1=\{1,2\}, S_2=\{2,3\}$.

If $W=1,ドル the answer is 0: there are no 1-subsets that cover either $S_1$ or $S_2$.

If $W=2,ドル the answer is 2: $S_1$ and $S_2$ themselves.

If $W=3,ドル the answer is 3: $\{1,2,3\}$ (covers both $S_1$ and $S_2$), $\{1,2,4\}$ (covers $S_1$), $\{2,3,4\}$ (covers $S_2$).

If $W=4,ドル the answer is 1: $\mathcal U = \{1,2,3,4\}$ covers anything.

UPD: $M$ is now much smaller.

Question 2

Do you need the exact number, or is an approximate answer OK too?

Question 3

Well, very close approximation will work. But not heuristics or so

Question 4

Perhaps, I should consider also done bounds on the correct answer. I might need to start with Bonferroni inequalities...

Question 5

Interesting problem! What's the context where you encountered this problem? Can you credit the source? Is there a motivation from a practical situation?

Question 6

Updated. Also, because of no progress approximate answer is now OK too :)

Question 7

I can suggest two possible methods.

#SAT

This is closely related to #SAT, and one approach might be to try applying an off-the-shelf #SAT (or approximate-#SAT) solver.

Let $t_k$ denote the number of $W$-sets that contain $S_k$ but not any of $S_1,\dots,S_{k-1},ドル i.e., $t_k$ is the cardinality of the set

$$\{S \subseteq \mathcal{U} : |S|=W, S_k \subseteq S, S_1 \not\subseteq S, \dots, S_{k-1} \not\subseteq S\}.$$

It is easy to see that you want to compute $t_1+t_2+\dots+t_M,ドル so if we can figure out how to compute $t_k,ドル we are done.

How shall we compute $t_k$? Well, any set $S$ that meets the conditions must cover $S_k,ドル so all of the elements of $S_k$ are in $S,ドル and the only question is the remaining elements. Set $\mathcal{U}' = \mathcal{U} \setminus S_k,ドル $S'_i = S_i \setminus S_k,ドル and $W' = W-|S_k|$. Then $t_k$ is the number of $W'$-sets (over universe $\mathcal{U}'$) that don't cover any of $S'_1,\dots,S'_{k-1},ドル i.e., the cardinality of the set

$$\{S' \subseteq \mathcal{U}' : |S'|=W', S'_1 \not\subseteq S', \dots, S'_{k-1} \not\subseteq S'\}.$$

As you stare at this, the condition on $S'$ turns out to be expressible as a boolean formula in CNF form. Let $x_1,\dots,x_{N'}$ (where $N'=|\mathcal{U}'|$) be boolean variables; the idea is that $x_i$ is true iff $i \in S$. Then the condition $S'_1 \not\subseteq S', \dots, S'_{k-1} \not\subseteq S'$ is equivalent to a CNF formula on $x_1,\dots,x_{N'}$ with $k-1$ clauses and $N'$ variables; each constraint $S'_i \not\subseteq S'$ induces a clause $\lor_{j \in S'_i} \neg x_j$. Then, $t_k$ is the number of satisfying assignments of weight $k$ for this formula. You can easily conjoin clauses to this formula that require the assignment to have weight $k$.

In this way, we have reduced the problem of computing $t_k$ to the problem of counting the number of satisfying assignments of a boolean formula in CNF form. This is exactly the #SAT problem. The problem is known to be hard, but there are off-the-shelf solvers you could try. There are also solvers that yield an approximation to the number, which are worth trying as well. See, e.g., Is there a sometimes-efficient algorithm to solve #SAT? and https://cstheory.stackexchange.com/q/18135/5038.

I don't know whether this will be fast enough for your needs, but it is a possible approach you could consider.

Inclusion-Exclusion

If $I \subseteq \{1,2,\dots,M\}$ is a set of indices, let $u_I$ denote number of $W$-sets that contain $S_i$ for each $i \in I,ドル i.e., the cardinality of the set

$$\{S \subseteq \mathcal{U} : |S|=W, \forall i \in I . S_i \subseteq S\}.$$

You can express the number you want using the inclusion-exclusion formula. In particular, you want to compute

$$u = \sum_I (-1)^{|I|} \cdot u_I$$

where $I$ ranges over all non-empty subsets of $\{1,2,\dots,M\}$.

Heuristically, we can expect that the larger $I$ is, the smaller $u_I$ will be, so we can approximate the above sum as follows:

$$u \approx \sum_{|I| \le 8} (-1)^{|I|} \cdot u_I.$$

Note that you can compute $u_I$ easily for any given $I$; in particular,

$$u_I = {N-|S_I| \choose W-|S_I|}$$

where we define $S_I = \cup_{i \in U} S_i$. Now since ${50 \choose 8} \approx 2^{29},ドル for your particular parameters it is easy enough to enumerate all index sets $I$ such that $|I|\le 8$ and compute $u_I$ for each, then use the approximate inclusion-exclusion formula above. You can adjust the parameter 8 to trade off computation time vs quality of the approximation.

The disadvantage is that there is no a priori guarantee on how good the approximation will be -- but this might be useful nonetheless.

It might be possible to refine and improve this approach further. While it's true that on average we expect larger sets $I$ correspond to smaller values of $u_I,ドル this isn't necessarily true. Therefore, it's possible that you might get a better approximation by picking a threshold $\tau$ and then using

$$u \approx \sum_{u_I \ge \tau} (-1)^{|I|} \cdot u_I.$$

This requires you to iterate through sets $I$ in order of decreasing $u_I,ドル and then stop once you reach the threshold $\tau$. How do you do this? The key insight is that as you add elements to $I,ドル the value of $u_I$ gets monotonically smaller. Therefore, you can use breadth-first search to explore which elements to add to the set. You start out with all sets $I$ of size 1 and push them onto a priority queue. The elements of the priority queue are ordered by the value of $u_I$. In each step, you pop from the priority queue a set $I$ with the largest value of $u_I$ out of all on the queue; and assuming you haven't processed $I$ before, you add $(-1)^{|I|} \cdot u_I$ to your running total and then push the sets $I \cup \{j\}$ onto the queue for each $j \notin I$.

I suggest trying both variants on inclusion-exclusion to see which gives a better approximation.

Question 8

Isn't that the case that in inclusion-exclusion principle $u = \sum_{k=1}^{M} (-1)^k \left( \sum_{|I|=k} u_I \right)$ terms in brackets decrease with increase of $k$?

Question 9

@YauhenYakimenka, I'm not sure. If you are asking does $u_I$ decrease as $|I|$ increases, not necessarily. I think you can have sets $I,I'$ with $|I'|>|I|$ but $u_{I'} > u_I$. If you are asking does $\sum_{|I|=k} u_I$ decrease as $k$ increases, I don't know the answer to that. That sounds plausible but you might want to double-check it to be sure.

Question 10

I was meaning the latter...

Question 11

Is using the Inclusion-Exclusion-Principle (also called Sieve-formula) too inefficient? According to the principle, if the union of any $L+1$ sets out of $S_{1}, S_{2}, \ldots S_{M}$ is bigger than $W$ then the number you are looking for is $$ \sum_{n=1}^{L}{(-1)^n \sum_{i_{1},\ldots,i_{n}} \binom{N - |S_{i_{1}} \cup \ldots \cup S_{i_{n}}|}{W - |S_{i_{1}} \cup \ldots \cup S_{i_{n}}|}} $$ Edit: The complexity is around $M^{\frac{W}{|S_{i}|}}$ if the $S_{i}$s are close to being disjoint and around 2ドル^M$ if the $S_{i}$s are close to being equal, so probably to big indeed.

Question 12

And the complexity then is around 2ドル^M$? For $M=5000$ it doesn't look anything close to realistic...

Question 13

The complexity is around $M^{\frac{W}{|S_{i}|}},ドル but yes probably in most cases to big.

Question 14

on the other hand, I now think I can at least use Bonferroni inequalities (i.e. partial sums from Inclusion-Exclusion Principle) to at least bound the exact answer.

Question 15

By the way, do you know some good algorithm to calculate the formula, better than straightforward calculation?

D.W. ♦D.W. 169k23 gold badges234 silver badges517 bronze badges · Accepted Answer · 2018-01-29 05:52:15Z