I have 10,000 sequentially numbered items I wish to characterize. For each item, I created a list of the other items I think might have a similar characteristic.
i.e.
Item 1: [2,4,7,8,9,...,9489] (list contains 250 items)
Item 2: [1,4,13,23,...,9424] (list contains 12 items)
Item 3: [1,2,4,7,...,9489] (list contains 140 items)
Item 4: [1,3,7,9,...,9211] (list contains 250 items)
Item 5: [1,3,7,9,...,9221] (list contains 250 items)
Item 6: [1,2,7,25,...,9248] (list contains 241 items)
Item 7: [4,5,6] (list contains 3 items)
If I only have time to test 50 items, How might I choose which items to test so that the maximum number of items is represented? The identified similarity is not necessarily bi-directional - notice item 7's list contains 4,5,6 but not 1 and 3(which both have 7 in their list).
Some lists may be very similar or even identical, as seen in items 4 and 5. Testing both 4 and 5 would not achieve the maximum coverage.
-
$\begingroup$ The first step would be to find a quantitative objective function to optimize. Without this, all we can offer is common-sense heuristics that you can think of on your own. $\endgroup$Yuval Filmus– Yuval Filmus2017年03月25日 07:31:43 +00:00Commented Mar 25, 2017 at 7:31
-
$\begingroup$ I don't yet have a function to optimize, which Is why I didn't post this on StackOverflow. Let's say I order the lists by the number of items in descending order. I then check if the items in list 1 match more than 99% of the items in every other list. If so, I remove the other list. Then I go back through and do the same starting with list 2. Does this sound reasonable? $\endgroup$Mark Brown– Mark Brown2017年03月25日 07:55:20 +00:00Commented Mar 25, 2017 at 7:55
-
$\begingroup$ That's your common sense heuristic, and it's as good as anybody's. $\endgroup$Yuval Filmus– Yuval Filmus2017年03月25日 08:07:46 +00:00Commented Mar 25, 2017 at 8:07
1 Answer 1
It sounds like you want to select a set of 50 lists whose union contains as many elements as possible. This is the maximum coverage problem, which is NP-complete, but there are standard techniques that can be used to get decent-quality solutions (there is an approximation algorithm using a greedy algorithm, and a formulation as ILP).