Find pairs of similar (by hamming distance) bit strings

Question 1

I have a list of binary strings where each binary string max length is 15. I need to find list of integers which is count of similar (should differ by max 1 bit position) binary strings present in the list for each of the binary strings present in the list. This is current solution with time complexity of O(N2) but getting TLE for larger inputs.

#include <iostream>
#include <string>
#include <vector>
bool is_similar(const long long num) {
 return num == 0 || (num & (num - 1)) == 0;
}
int main() {
 const std::vector<std::string> str_nums {
 "100", "110", "010", "011", "100"
 };
 std::vector<long long> nums;
 for (const auto& str_num : str_nums)
 nums.push_back(std::stoll(str_num, nullptr, 2));
 const int n = nums.size();
 std::vector<int> answers(n);
 for (int i = 0; i < n - 1; ++i) {
 for (int j = i + 1; j < n; ++j) {
 if (is_similar(nums[i] ^ nums[j])) {
 ++answers[i];
 ++answers[j];
 }
 }
 }
 for (const auto answer : answers)
 std::cout << answer << ", ";
 std::cout << std::endl;
}

Is there an efficient way to do this?

Question 2

I have rolled back Rev 3 → 2. Please see What to do when someone answers.

Question 3

Toby Speight's answer already addresses improvements in your code style. This answer presents an improved algorithm that makes use of the fact that the strings are short.

Consider, instead of iterating over all existing strings to check if they are similar, we can iterate over all similar strings to check if they exist.

To do this, we need to store a mapping from each string to the number of times it appears (in my solution an unordered_map).

Where n is the number of strings and l is the maximum length of each string, this solution has complexity O(n*l), compared to your original O(n*n+n*l).

#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
int main() {
 const std::vector<std::string> str_nums {
 "100", "110", "010", "011", "100"
 };
 std::vector<int> nums;
 std::unordered_map<int, int> nums_cnt;
 for (const auto& str_num : str_nums) {
 const int encoded_value = std::stoll(str_num, nullptr, 2);
 nums.push_back(encoded_value);
 nums_cnt[encoded_value] += 1;
 }
 const int n = nums.size();
 std::vector<int> answers(n);
 for (int i = 0; i < n; i++) {
 // add strings that are exactly the same (but don't include this one)
 answers[i] += nums_cnt[nums[i]] - 1;
 for (int bit = 0; bit <= 15; bit++) {
 // add strings which differ exactly in this bit
 answers[i] += nums_cnt[nums[i] ^ (1 << bit)];
 }
 }
 for (const auto answer : answers)
 std::cout << answer << ", ";
 std::cout << std::endl;
}

Question 4

On my system, replacing std::unordered_map with std::map improves speed by about 20%.

Question 5

We're told that the maximum string length is 15, yet we're storing values as long long. We could safely use a std::uint_fast16_t instead. Alternatively, consider using a std::bitset<15>, which handily comes with a constructor that converts from string.

The is_similar() test is misleadingly named. I expected it to take two arguments, but it actually takes one argument and tests whether its bit count is one or less. It can be made both shorter and clearer using std::popcount():

static constexpr is_similar(unsigned difference) {
 return std::popcount(difference) <= 1;
}

We're using signed types in many places where unsigned types are more suitable (e.g. I'd expect n to be a std::size_t, and consequently i and j too).

If we have repeated values in the input, we might be able to get a speed-up by using a Bag type (a histogram of values), and use that to reduce the pairings we need to consider.

Beyond that, you'll need to think of a better algorithm. One algorithm (suggested in comment by Deduplicator - thank you!) that's more likely to reduce the search space is to divide into buckets according to population count (i.e. number of set bits). If we use a std::bitset as the storage type, that information is readily accessible as .count(). Then we only need to compare adjacent buckets.

Another way that might work is to divide the input into buckets according to the top N bits (for some good choice of N). Then, we only have to consider pairs of buckets that are 1 step apart, and count how many values are common to both buckets, and within each bucket count how many values are 1 step apart (we could make this recursive, using multiple levels of bit-slicing).

I'm going to develop this idea, but first we need to split the program so that we have an independently testable (and measurable) function:

#include <cstdint>
#include <string>
#include <vector>
using Number = std::uint_fast16_t;
bool is_similar(const long long num) {
 return num == 0 || (num & (num - 1)) == 0;
}
auto count_similar_pairs(const std::vector<Number>& nums)
{
 const std::size_t n = nums.size();
 std::vector<std::size_t> answers(n);
 for (std::size_t i = 0; i < n - 1; ++i) {
 for (std::size_t j = i + 1; j < n; ++j) {
 if (is_similar(nums[i] ^ nums[j])) {
 ++answers[i];
 ++answers[j];
 }
 }
 }
 return answers;
}
auto count_similar_pairs(const std::vector<std::string>& str_nums)
{
 std::vector<Number> nums;
 for (const auto& str_num : str_nums)
 nums.push_back(std::stoll(str_num, nullptr, 2));
 return count_similar_pairs(nums);
}
#include <algorithm>
#include <iostream>
#include <iterator>
int main()
{
 const std::vector<std::string> str_nums {
 "100", "110", "010", "011", "100"
 };
 auto answers = count_similar_pairs(str_nums);
 std::ranges::copy(answers, std::ostream_iterator<std::size_t>(std::cout, ", "));
 std::cout << '\n';
}

Now we can replace main() with some tests (I'll use Google Test because it's convenient for me; other test frameworks will do equally well):

#include <gtest/gtest.h>
#include <numeric>
TEST(CountPairs, FromQuestion)
{
 const std::vector<std::string> str_nums {
 "100", "110", "010", "011", "100"
 };
 const std::vector<std::size_t> expected {
 2, 3, 2, 1, 2
 };
 EXPECT_EQ(count_similar_pairs(str_nums), expected);
}
TEST(CountPairs, BigInput)
{
 std::vector<Number> nums(20000);
 std::iota(nums.begin(), nums.end(), 20);
 // No assertion here - we're just interested in execution time
 count_similar_pairs(nums);
}

Now we can start implementing the algorithm. How many bits should we consider at a time? Taking things to the limit, I suggest one bit at a time. Then the algorithm looks like this:

Divide the N-bit inputs into two sets those that begin with 0 and those that begin with 1, and mask them down to N-1 bits.
Count numbers that are present in both sets (we can do this in a single pass if both sets are sorted, just like how we merge two sorted lists)
For each set, count how many pairs in that set (sharing the same prefix) are near-neighbours. This is a recursive call into our algorithm. If the bit-width reaches zero, then we just count each element.

If we sort the numbers into ascending order, then they are naturally arranged how we want them, and we can divide into sets using subranges. So let's sort them, remembering which answer count is associated with each number:

using count_iter = std::vector<std::size_t>::iterator;
struct num_pair {
 Number value;
 count_iter output;
};
using num_vector = std::vector<num_pair>;
auto count_similar_pairs(std::vector<Number> const& nums)
{
 auto const n = nums.size();
 std::vector<Number> counts(n);
 num_vector nv;
 nv.reserve(n);
 auto it = counts.begin();
 for (auto n: nums) {
 nv.emplace_back(n, it++);
 }
 std::ranges::sort(nv, {}, &num_pair::value);
 /// TODO: call our recursive function here
 return counts;
}

With this structure in place, we can now implement our recursive function.

void count_similar_pairs(std::ranges::random_access_range auto& nums,
 Number mask)
{
 if (!mask) {
 // We've reached the end of the recursion.
 // We just need to account for the identical pairs in input.
 auto it = nums.begin();
 while ((it = std::ranges::adjacent_find(it, nums.end(), {}, &num_pair::value)) != nums.end()) {
 auto const last = std::ranges::adjacent_find(it, nums.end(), std::not_equal_to<>{}, &num_pair::value);
 // each of the N values in this range is similar to the other N-1
 auto dup_count = std::distance(it, last) - 1;
 for (auto i = it; i != last; ++i) {
 *i->output += dup_count;
 }
 it = last;
 }
 return;
 }
 assert(std::has_single_bit(mask));
 // Create two views, one each for the mask bit unset and set
 // i.e. n < mask and n >= mask
 auto split_point = std::ranges::lower_bound(nums, mask, {}, &num_pair::value);
 auto view_0 = std::ranges::subrange(nums.begin(), split_point);
 auto view_1 = std::ranges::subrange(split_point, nums.end());
 // strip top bit off each number
 std::ranges::for_each(nums, [mask](auto& e) { e.value &= ~mask; });
 // find matching values
 // We can't use std::ranges::set_intersection() for this, because
 // it undercounts when a value exists multiple times in either set.
 for (auto const& e: view_0) {
 auto values_1 = std::ranges::equal_range(view_1, e.value, {}, &num_pair::value);
 *e.output += values_1.size();
 std::ranges::for_each(values_1, [](auto const& e) { ++*e.output; });
 }
 count_similar_pairs(view_0, mask >> 1);
 count_similar_pairs(view_1, mask >> 1);
 return;
}

And invoke it:

 std::ranges::sort(nv, {}, &num_pair::value);
 count_similar_pairs(nv, 1u << 14); // magic constant from question
 return counts;

This ought to scale as O(n log n), albeit with higher constant factor than the original O(n2) code, since we call equal_range (O(log n)) for each element of the view (O(n)). We can add some more tests to demonstrate this. The results I get are:

Input size	Original code	apilat code	Improved code
5	0 ms	0 ms	0 ms
20000	157 ms	7 ms	3 ms
200000	16,050 ms	50 ms	15 ms
2000000	timed out	521 ms	124 ms
20000000	timed out	`bad_alloc`	1421 ms

Additionally, the divide-and-conquer nature of the improved algorithm lends itself to parallelisation (though this answer is already long enough, so I won't demonstrate that).

Here's the full code I ended up with, including tests and main() function:

#include <algorithm>
#include <bit>
#include <cassert>
#include <cstdint>
#include <ranges>
#include <string>
#include <vector>
using Number = std::uint_fast16_t;
using count_iter = std::vector<std::size_t>::iterator;
struct num_pair {
 Number value;
 count_iter output;
};
using num_vector = std::vector<num_pair>;
void count_similar_pairs(std::ranges::random_access_range auto& nums,
 Number mask)
{
 if (!mask) {
 // We've reached the end of the recursion.
 // We just need to account for the identical pairs in input.
 auto it = nums.begin();
 while ((it = std::ranges::adjacent_find(it, nums.end(), {}, &num_pair::value)) != nums.end()) {
 auto const last = std::ranges::adjacent_find(it, nums.end(), std::not_equal_to<>{}, &num_pair::value);
 // each of the N values in this range is similar to the other N-1
 auto dup_count = std::distance(it, last) - 1;
 for (auto i = it; i != last; ++i) {
 *i->output += dup_count;
 }
 it = last;
 }
 return;
 }
 assert(std::has_single_bit(mask));
 // Create two views, one each for the mask bit unset and set
 // i.e. n < mask and n >= mask
 auto split_point = std::ranges::lower_bound(nums, mask, {}, &num_pair::value);
 auto view_0 = std::ranges::subrange(nums.begin(), split_point);
 auto view_1 = std::ranges::subrange(split_point, nums.end());
 // strip top bit off each number
 std::ranges::for_each(nums, [mask](auto& e) { e.value &= ~mask; });
 // find matching values
 // We can't use std::ranges::set_intersection() for this, because
 // it undercounts when a value exists multiple times in either set.
 if (view_0.size() > view_1.size()) {
 // Reduce number of calls to equal_range()
 std::swap(view_0, view_1);
 }
 for (auto const& e: view_0) {
 auto values_1 = std::ranges::equal_range(view_1, e.value, {}, &num_pair::value);
 *e.output += values_1.size();
 std::ranges::for_each(values_1, [](auto const& e) { ++*e.output; });
 }
 count_similar_pairs(view_0, mask >> 1);
 count_similar_pairs(view_1, mask >> 1);
 return;
}
auto count_similar_pairs(std::ranges::random_access_range auto const& nums)
{
 auto const n = nums.size();
 std::vector<std::size_t> counts(n);
 num_vector nv;
 nv.reserve(n);
 auto it = counts.begin();
 for (auto n: nums) {
 nv.emplace_back(n, it++);
 }
 std::ranges::sort(nv, {}, &num_pair::value);
 count_similar_pairs(nv, 1u << 14); // magic constant from question
 return counts;
}
auto count_similar_pairs(const std::vector<std::string>& str_nums)
{
 std::vector<Number> nums;
 for (const auto& str_num : str_nums)
 nums.push_back(std::stoll(str_num, nullptr, 2));
 return count_similar_pairs(nums);
}
#define UNIT_TEST
#ifdef UNIT_TEST
static bool is_similar(const long long num) {
 return num == 0 || (num & (num - 1)) == 0;
}
auto original(const std::vector<Number>& nums)
{
 const std::size_t n = nums.size();
 std::vector<Number> answers(n);
 for (std::size_t i = 0; i < n - 1; ++i) {
 for (std::size_t j = i + 1; j < n; ++j) {
 if (is_similar(nums[i] ^ nums[j])) {
 ++answers[i];
 ++answers[j];
 }
 }
 }
 return answers;
}
#include <gtest/gtest.h>
#include <numeric>
TEST(CountPairs, FromQuestion)
{
 const std::vector<std::string> str_nums {
 "100", "110", "010", "011", "100"
 };
 const std::vector<Number> expected {
 2, 3, 2, 1, 2
 };
 EXPECT_EQ(count_similar_pairs(str_nums), expected);
}
TEST(CountPairs, BigInput_Values)
{
 std::vector<Number> nums(20000);
 std::iota(nums.begin(), nums.end(), 20);
 EXPECT_EQ(count_similar_pairs(nums), original(nums));
}
TEST(CountPairs, BigInput_Slow)
{
 std::vector<Number> nums(20000);
 std::iota(nums.begin(), nums.end(), 20);
 // No assertion here - we're just interested in execution time
 original(nums);
}
TEST(CountPairs, BigInput_Fast)
{
 std::vector<Number> nums(20000);
 std::iota(nums.begin(), nums.end(), 20);
 count_similar_pairs(nums);
}
TEST(CountPairs, VeryBigInput_Slow)
{
 std::vector<Number> nums(200000);
 std::iota(nums.begin(), nums.end(), 20);
 original(nums);
}
TEST(CountPairs, VeryBigInput_Fast)
{
 std::vector<Number> nums(200000);
 std::iota(nums.begin(), nums.end(), 20);
 count_similar_pairs(nums);
}
TEST(CountPairs, HugeInput_Fast)
{
 std::vector<Number> nums(2000000);
 std::iota(nums.begin(), nums.end(), 20);
 count_similar_pairs(nums);
}
#else
#include <iostream>
#include <iterator>
int main()
{
 const std::vector<std::string> str_nums {
 "100", "110", "010", "011", "100"
 };
 auto answers = count_similar_pairs(str_nums);
 std::ranges::copy(answers, std::ostream_iterator<std::size_t>(std::cout, ", "));
 std::cout << '\n';
}
#endif

I came back to this question, and refined the implementation further to make it more readable and performant.

The changes I made:

Provide conversions from the num_pair type, and a comparison operator, so it appears as a Number in most contexts (this means we no longer need to pass a projection to standard algorithms).
Replace the equal_range() calls with linear find_adjacent(), giving better memory locality.
Recognised that in the final recursive step, all the values are equal, so removed a loop.
Terminate recursion early when we have fewer than two elements. (this gives no benefit on the larger test-cases, where every single 15-bit value is present many times over).

struct num_pair {
 Number value;
 std::reference_wrapper<Number> count;
 constexpr operator Number const&() const { return value; };
 constexpr operator Number&() { return value; };
 constexpr auto operator<=>(const num_pair& other) const
 {
 return value <=> other.value;
 }
};
// Return the longest initial subrange whose elements all have the same value
auto equal_prefix(std::ranges::forward_range auto range)
{
 auto it = std::ranges::adjacent_find(range, std::not_equal_to<Number>{});
 if (it != range.end()) {
 ++it;
 }
 return std::ranges::subrange(range.begin(), it);
}
void count_similar_pairs(std::ranges::random_access_range auto nums,
 Number mask)
{
 namespace r = std::ranges;
 if (nums.size() <= 1) {
 // not enough elements to have any duplicates
 return;
 }
 if (!mask) {
 // We've reached the end of the recursion.
 // each of the N values in this range is similar to the other N-1
 auto const dup_count = nums.size() - 1;
 for (auto i: nums) {
 i.count += dup_count;
 }
 return;
 }
 assert(std::has_single_bit(mask));
 // Create two views, one each for the mask bit unset and set
 // i.e. n < mask and n >= mask
 auto split_point = r::lower_bound(nums, mask);
 auto const view_a = r::subrange(nums.begin(), split_point);
 auto const view_b = r::subrange(split_point, nums.end());
 // strip top bit off each number
 r::for_each(nums, [mask](auto& e) { e &= ~mask; });
 // find matching values
 // We can't use r::set_intersection() for this, because
 // it undercounts when a value exists multiple times in either set.
 {
 // From C++23, we can use chunk_by view to walk the distinct
 // values in views a and b.
 auto a = view_a;
 auto b = view_b;
 while (a && b) {
 if (b.front() < a.front()) {
 std::swap(a, b);
 }
 if (a.front() < b.front()) {
 a = r::subrange(r::lower_bound(a, b.front().value), a.end());
 } else {
 // values are equal - count them
 auto ca = equal_prefix(a);
 auto cb = equal_prefix(b);
 for (auto i: ca) {
 i.count += cb.size();
 }
 for (auto i: cb) {
 i.count += ca.size();
 }
 // advance both views to next element
 a = r::subrange(ca.end(), a.end());
 b = r::subrange(cb.end(), b.end());
 }
 }
 }
 // recurse to find any similar pairs in each sub set
 mask >>= 1;
 count_similar_pairs(view_a, mask);
 count_similar_pairs(view_b, mask);
}

Input size	Original code	First version	Second Version
5	0 ms	0 ms	0 ms
20000	157 ms	3 ms	1 ms
200000	16,050 ms	15 ms	10 ms
2000000	timed out	124 ms	134 ms
20000000	1421 ms	1417 ms

Question 6

That's a prefix-tree., or suffix-tree Or a trie, if you prefer. All those cache-misses due to jumping around though...

Question 7

So it is! Answering this one has been educational for me. :) I hope to get time to compare this to sorting by popcount.

Question 8

Thank you so much for this detailed analysis and solution.

apilat apilat 2631 silver badge7 bronze badges · Accepted Answer · 2022-09-14 19:31:16Z

Toby Speight's answer already addresses improvements in your code style. This answer presents an improved algorithm that makes use of the fact that the strings are short.

Consider, instead of iterating over all existing strings to check if they are similar, we can iterate over all similar strings to check if they exist.

To do this, we need to store a mapping from each string to the number of times it appears (in my solution an unordered_map).

Where n is the number of strings and l is the maximum length of each string, this solution has complexity O(n*l), compared to your original O(n*n+n*l).

#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
int main() {
 const std::vector<std::string> str_nums {
 "100", "110", "010", "011", "100"
 };
 std::vector<int> nums;
 std::unordered_map<int, int> nums_cnt;
 for (const auto& str_num : str_nums) {
 const int encoded_value = std::stoll(str_num, nullptr, 2);
 nums.push_back(encoded_value);
 nums_cnt[encoded_value] += 1;
 }
 const int n = nums.size();
 std::vector<int> answers(n);
 for (int i = 0; i < n; i++) {
 // add strings that are exactly the same (but don't include this one)
 answers[i] += nums_cnt[nums[i]] - 1;
 for (int bit = 0; bit <= 15; bit++) {
 // add strings which differ exactly in this bit
 answers[i] += nums_cnt[nums[i] ^ (1 << bit)];
 }
 }
 for (const auto answer : answers)
 std::cout << answer << ", ";
 std::cout << std::endl;
}

On my system, replacing std::unordered_map with std::map improves speed by about 20%.

Stack Exchange Network

Find pairs of similar (by hamming distance) bit strings

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Find pairs of similar (by hamming distance) bit strings

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions