Python set intersection question

Question 1

I have three sets:

s0 = [set([16,9,2,10]), set([16,14,22,15]), set([14,7])] # true, 16 and 14
s1 = [set([16,9,2,10]), set([16,14,22,15]), set([7,8])] # false

I want a function that will return True if every set in the list intersects with at least one other set in the list. Is there a built-in for this or a simple list comprehension?

Question 2

all(any(a & b for a in s if a is not b) for b in s)

Question 3

Elegant, but I would prefer 2 straight loops and a counter to avoid compare each two items 2 times, it would speed things up at least by a factor of two

Question 4

The bool() call is redundant.

Question 5

@Victor Ionescu, why not throw up an answer with your preferred method. It'd be worth at least ten rep :)

Question 6

@spoon16 : I tried to remove unnecessary comparisons. Of course, with all the additional code, I am not sure if this yields an optimized version over a quite elegant solution by jchl. I will probably try to time both over a large set sequence and see what it results in.

Question 7

@KennyTM: Thanks, I didn't realize any() performed automatic boolean conversion. I removed the bool() call.

Question 8

Here's a very simple solution that's very efficient for large inputs:

def g(s):
 import collections
 count = collections.defaultdict(int)
 for a in s:
 for x in a:
 count[x] += 1
 return all(any(count[x] > 1 for x in a) for a in s)

Question 9

you can just use count.setdefault(x, 0) += 1 instead of the if/else statement.

Question 10

No you can't. count.setdefault(x, 0) isn't an lvalue.

Question 11

+1 because it circumvents the "obvious" set intersection stuff and simplifies the problem. Unfortunately, I don't think you will get enough upvotes. BTW, either collections.Counter, or count= collections.defaultdict(int) and count[x]+=1

Question 12

Thanks for the hit about defaultdict. I've simplified my answer accordingly.

Question 13

It's a little verbose but I think it's a pretty efficient solution. It takes advantage of the fact that when two sets intersect, we can mark them both as connected. It does this by keeping a list of flags as long as the list of sets. when set i and set j intersect, it sets the flag for both of them. It then loops over the list of sets and only tries to find a intersection for sets that haven't already been intersected. After reading the comments, I think this is what @Victor was talking about.

s0 = [set([16,9,2,10]), set([16,14,22,15]), set([14,7])] # true, 16 and 14
s1 = [set([16,9,2,10]), set([16,14,22,15]), set([7,8])] # false
def connected(sets):
 L = len(sets)
 if not L: return True
 if L == 1: return False
 passed = [False] * L
 i = 0
 while True:
 while passed[i]: 
 i += 1
 if i == L: 
 return True
 for j, s in enumerate(sets):
 if j == i: continue
 if sets[i] & s: 
 passed[i] = passed[j] = True
 break
 else:
 return False
print connected(s0)
print connected(s1)

I decided that an empty list of sets is connected (If you produce an element of the list, I can produce an element that it intersects ;). A list with only one element is dis-connected trivially. It's one line to change in either case if you disagree.

Question 14

lots of good answers, I'm using 2.5.2 right now so I ended up going with this solution

Question 15

Here's a more efficient (if much more complicated) solution, that performs a linear number of intersections and a number of unions of order O( n*log(n) ), where n is the length of s:

def f(s):
 import math
 j = int(math.log(len(s) - 1, 2)) + 1
 unions = [set()] * (j + 1)
 for i, a in enumerate(s):
 unions[:j] = [set.union(set(), *s[i+2**k:i+2**(k+1)]) for k in range(j)]
 if not (a & set.union(*unions)):
 return False
 j = int(math.log(i ^ (i + 1), 2))
 unions[j] = set.union(a, *unions[:j])
 return True

Note that this solution only works on Python>= 2.6.

Question 16

This is significantly slower performing that your previous answer. whereas your previous answer averages 4.7 msecs, (on large random input) this one takes 250. Unions are what's killing you (with this implementation) and intuited. Theoretically appealing but expensive.

Question 17

Interesting. I think it's the large constant factor due to all the Python code (compared to the previous solution that can be evaluated almost all in C). Also, my first solution will work well if the any() call can shortcut often, whereas this solution works well in all cases. This solution performs better (9 vs 24 seconds) than the first on this input: [set((i,)) for i in range(N)] + [set(range(N))] with N = 10000, and I suspect would perform better still with even larger N.

Question 18

I just realized that this solution is far more complicated than necessary, and still not optimal. My third solution in much better for both reasons.

Question 19

As usual I'd like to give the inevitable itertools solution ;-)

from itertools import combinations, groupby
from operator import itemgetter
def any_intersects( sets ):
 # we are doing stuff with combinations of sets
 combined = combinations(sets,2) 
 # group these combinations by their first set
 grouped = (g for k,g in groupby( combined, key=itemgetter(0)))
 # are any intersections in each group
 intersected = (any((a&b) for a,b in group) for group in grouped)
 return all( intersected )
s0 = [set([16,9,2,10]), set([16,14,22,15]), set([14,7])]
s1 = [set([16,9,2,10]), set([16,14,22,15]), set([7,8])] 
print any_intersects( s0 ) # True
print any_intersects( s1 ) # False

This is really lazy and will only do the intersections that are required. It can also be a very confusing and unreadable oneliner ;-)

Question 20

combinated? How about "combined"? ;-)

Question 21

Heh, some online dictionaries list "combinated", but "combined" does sound much better ;-)

Question 22

On further reflection, "combos" would probably be a more descriptive -- and therefore even better -- variable name.

Question 23

To answer your question, no, there isn't a built-in or simple list comprehension that does what you want. Here's another itertools based solution that is very efficient -- surprisingly about twice as fast as @THC4k's itertools answer using groupby() in timing tests using your sample input. It could probably be optimized a bit further, but is very readable as presented. Like @AaronMcSmooth, I arbitrarily decided what to return when there are no or is only one set in the input list.

from itertools import combinations
def all_intersect(sets):
 N = len(sets)
 if not N: return True
 if N == 1: return False
 intersected = [False] * N
 for i,j in combinations(xrange(N), 2):
 if not intersected[i] or not intersected[j]:
 if sets[i] & sets[j]:
 intersected[i] = intersected[j] = True
 return all(intersected)

Question 24

This strategy isn't likely to be as efficient as @Victor's suggestion, but might be more efficient than jchl's answer due to increased use of set arithmetic (union).

s0 = [set([16,9,2,10]), set([16,14,22,15]), set([14,7])]
s1 = [set([16,9,2,10]), set([16,14,22,15]), set([7,8])]
def freeze(list_of_sets):
 """Transform a list of sets into a frozenset of frozensets."""
 return frozenset(frozenset(set_) for set_ in list_of_sets)
def all_sets_have_relatives(set_of_sets):
 """Check if all sets have another set that they intersect with.
 >>> all_sets_have_relatives(s0) # true, 16 and 14
 True
 >>> all_sets_have_relatives(s1) # false
 False
 """
 set_of_sets = freeze(set_of_sets)
 def has_relative(set_):
 return set_ & frozenset.union(*(set_of_sets - set((set_,))))
 return all(has_relative(set) for set in set_of_sets)

Question 25

@jchl: sets aren't hashable, so they can't be elements in a set. I'm not sure if there's much point in making the top-level set a frozen one, but it seemed like it might be faster than a regular set, and it suits the logic.

Question 26

@jchl: Ahh, I see what you're talking about now. Looks like I had some sort of mental indigestion issue when refactoring and then posting that code. "The doctest passed, though!" Anyway, it's fixed now.

Question 27

my profiling indicates that they are slightly faster. only barely though

Question 28

Bad news. I just profiled this one and it takes around 2.5 seconds on the same type of large random input that @jchl's takes around 4.7 milli-seconds on. The bad news for me is that with all the effort I put into it, I only beat him by .2 milliseconds.

Question 29

"O! I am slain!" Interesting.. was that with large sets, or a lot of sets, or both?

Question 30

This may give better performance depending on the distribution of the sets.

def all_intersect(s):
 count = 0
 for x, a in enumerate(s):
 for y, b in enumerate(s):
 if a & b and x!=y:
 count += 1
 break
 return count == len(s)

Question 31

1. Don't you mean break instead of continue? 2. Are you sure about count = -1?? 3. Lose the 2nd line and replace the last 3 lines by return count == len(s). 4. Wouldn't x != y and a & b be better? 5. You don't exit early if a set has no buddies -- what is the basis for "may give better performance"? Better than what?

Question 32

Hello, hello, ... your code is BROKEN. count becomes the number of non-empty pair-wise set intersections, less 1. This is nothing like what the OP wants. By ACCIDENT it gives the correct answers for the OP's two test cases. Try this one: s2 = [set([16, 9, 2, 10]), set([16, 22, 14, 15]), set([14, 2])] ... it should return True but your count is 6-1 == 5 and so your code returns False.

Question 33

@John, +1, thanks for pointing out all the errors. I have edited to make all the necessary corrections. 4. I preferred a & b over x != y because the latter is more commonly true. Is this bad? 5. Yes, that's right. This may give better performance when compared to the ones doing the intersection check with all the sets.

Question 34

the whole point of checking if x != y is to avoid calculating a & b for a is b. the point is that it's much cheaper to check x != y then performing a set intersection.

jchl 6,5824 gold badges29 silver badges51 bronze badges · Accepted Answer · 2010-10-01 08:32:24Z

14

all(any(a & b for a in s if a is not b) for b in s)

Share

Improve this answer

edited Oct 1, 2010 at 9:14

answered Oct 1, 2010 at 8:32

jchl's user avatar

jchl

6,5824 gold badges29 silver badges51 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

Victor Ionescu

Victor Ionescu Over a year ago

Elegant, but I would prefer 2 straight loops and a counter to avoid compare each two items 2 times, it would speed things up at least by a factor of two

2010年10月01日T08:44:33.033Z+00:00

kennytm

kennytm Over a year ago

The bool() call is redundant.

2010年10月01日T08:46:04.81Z+00:00

Eric Schoonover

Eric Schoonover Over a year ago

@Victor Ionescu, why not throw up an answer with your preferred method. It'd be worth at least ten rep :)

2010年10月01日T08:49:08.42Z+00:00

pyfunc

pyfunc Over a year ago

@spoon16 : I tried to remove unnecessary comparisons. Of course, with all the additional code, I am not sure if this yields an optimized version over a quite elegant solution by jchl. I will probably try to time both over a large set sequence and see what it results in.

2010年10月01日T09:14:52.68Z+00:00

jchl

jchl Over a year ago

@KennyTM: Thanks, I didn't realize any() performed automatic boolean conversion. I removed the bool() call.

2010年10月01日T09:15:50.603Z+00:00

|

CollectivesTM on Stack Overflow

Python set intersection question

8 Answers 8

8 Comments

4 Comments

1 Comment

3 Comments

3 Comments

Comments

5 Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

8 Answers 8

8 Comments

4 Comments

1 Comment

3 Comments

3 Comments

Comments

5 Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related