Finding anagram groups using hashing and mapping the alphabet to random numbers

Question 1

I was solving Group Anagram problem in leetcode. Link for the problem https://leetcode.com/problems/group-anagrams/

Below is the code which I wrote

class Solution {
public:
vector<vector<string>> groupAnagrams(vector<string>& inp) {
 
 if(inp.size() == 0)
 return vector<vector<string> > {{}};
 
 int count = 1;
 map<char, long long> alphaVal;
 for (char i = 'a'; i <= 'z'; i++) {
 alphaVal[i] = int(rand());
 }
 map<long long, vector<string> > prep;
 for (auto value : inp) {
 long long temp = 0;
 for (auto c_p : value) {
 temp += alphaVal[c_p];
 }
 //cout << "temp " << temp << endl;
 prep[temp].push_back(value);
 }
 
 vector < vector<string> > ans;
 for (auto it = prep.begin(); it != prep.end(); it++) {
 vector<string> temp;
 for (int i = 0; i < it->second.size(); i++) {
 //cout << it->second[i] << " ";
 temp.push_back(it->second[i]);
 }
 ans.push_back(temp);
 }
 
 return ans;
}
};

I have two questions:

Even though the above solution worked, is using random is a really good idea?

Previously I wrote a logic to check the characters and store the string.

 map<set<char>, vector<string> > prep;
 for (auto value : inp) {
 set<char> temp;
 for (auto c_p : value) {
 temp.insert(c_p);
 }
 prep[temp].push_back(value);
 }

For the above logic one test case failed

ip: ["ddddddddddg","dgggggggggg"]
op: [["ddddddddddg","dgggggggggg"]]

Even though the two strings are unique, the set hash calculation says it is same. Any idea why this behavior on the selected test case? Was just curious to know. Thanks for the answers!

Question 2

(The characters occurring in "ddddddddddg" are {d, g}, same as in "dgggggggggg" (or ggd).)

Question 3

The single mention of random is in is using random is a really good idea? - did you forget to present something?

Question 4

Hi @greybeard I was using rand() in the for loop for (char i = 'a'; i <= 'z'; i++) { alphaVal[i] = int(rand()); } So i was referring to this function

Question 5

Answering your questions

Even though the above solution worked, is using random is a really good idea?

It's not a good idea. It doesn't ensure a correct result. To paraphrase the code at a high level:

Assign random long long values to the alphabet
For each word in the input
- Sum the random value of the letters
- Since anagrams will have the same sum, group the words by the sum

This approach is flawed, because:

Words that are not anagrams may also have the same sum. With some bad luck due to the randomness, words may get assigned to the same bucket incorrectly.
Although long long is a big range, it's possible that more than one letter gets assigned the same random number. That will significantly increase the risk of collisions.

To group anagrams correctly, the grouping key must be something that distinctly identifies the anagrams, and never shared by words that shouldn't be in the same group. This unique key could be the map of letter counts:

map<map<char, int>, vector<string>> counts2words;
for (auto word : words) {
 map<char, int> counts;
 for (auto c : word) {
 counts[c]++;
 }
 counts2words[counts].push_back(word);
}

Even though the two strings are unique, the set hash calculation says it is same. Any idea why this behavior on the selected test case?

First of all, that statement is misleading. In this example, it's much more than the hash calculation that's the same, here the two sets themselves are the same, consisting of the letters "d" and "g".

But let's just assume that hash calculation for two strings S1 and S2 were the same. That wouldn't be too surprising, that's just how hash functions work. Collisions happen, that's why it's important that when working with hashes, when the hashed values are equal, you also must compare the unhashed values to verify if they are really equal.

Use descriptive names

The names used throughout the implementation are cryptic. See the example code above where I used a map of counts: everything has a descriptive name, except the trivial c for the letters.

Drop unnecessary input validation

This is unnecessary:

if(inp.size() == 0)
 return vector<vector<string> > {{}};

For one thing, the problem description guarantees that the input is never empty.

For another, the implementation naturally handles correctly the case of empty input.

Question 6

Hi janos thanks for the answer. First I felt stupid to ask the question regarding the hash function where I used set<char> so it comes down to {d,g}, so same calculation will be there, which I realized just after posting it. Regarding the randomness thanks for the explanation and will definitely will follow correct naming convention. I also dry ran the logic snippet which you have written, that also works and reliable :)

Question 7

Actually @janos the second sample test case where the empty input is accepted, input is [""] Output: [[""]]

Question 8

... and the program returns that correctly, doesn't it?

Question 9

yeah.. you are right! :)

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2022-08-28 06:37:56Z

Answering your questions

Even though the above solution worked, is using random is a really good idea?

It's not a good idea. It doesn't ensure a correct result. To paraphrase the code at a high level:

Assign random long long values to the alphabet
For each word in the input
- Sum the random value of the letters
- Since anagrams will have the same sum, group the words by the sum

This approach is flawed, because:

Words that are not anagrams may also have the same sum. With some bad luck due to the randomness, words may get assigned to the same bucket incorrectly.
Although long long is a big range, it's possible that more than one letter gets assigned the same random number. That will significantly increase the risk of collisions.

To group anagrams correctly, the grouping key must be something that distinctly identifies the anagrams, and never shared by words that shouldn't be in the same group. This unique key could be the map of letter counts:

map<map<char, int>, vector<string>> counts2words;
for (auto word : words) {
 map<char, int> counts;
 for (auto c : word) {
 counts[c]++;
 }
 counts2words[counts].push_back(word);
}

Even though the two strings are unique, the set hash calculation says it is same. Any idea why this behavior on the selected test case?

First of all, that statement is misleading. In this example, it's much more than the hash calculation that's the same, here the two sets themselves are the same, consisting of the letters "d" and "g".

But let's just assume that hash calculation for two strings S1 and S2 were the same. That wouldn't be too surprising, that's just how hash functions work. Collisions happen, that's why it's important that when working with hashes, when the hashed values are equal, you also must compare the unhashed values to verify if they are really equal.

Use descriptive names

The names used throughout the implementation are cryptic. See the example code above where I used a map of counts: everything has a descriptive name, except the trivial c for the letters.

Drop unnecessary input validation

This is unnecessary:

if(inp.size() == 0)
 return vector<vector<string> > {{}};

For one thing, the problem description guarantees that the input is never empty.

For another, the implementation naturally handles correctly the case of empty input.

Hi janos thanks for the answer. First I felt stupid to ask the question regarding the hash function where I used set<char> so it comes down to {d,g}, so same calculation will be there, which I realized just after posting it. Regarding the randomness thanks for the explanation and will definitely will follow correct naming convention. I also dry ran the logic snippet which you have written, that also works and reliable :)
Actually @janos the second sample test case where the empty input is accepted, input is [""] Output: [[""]]

Stack Exchange Network

Finding anagram groups using hashing and mapping the alphabet to random numbers

1 Answer 1

Answering your questions

Use descriptive names

Drop unnecessary input validation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Finding anagram groups using hashing and mapping the alphabet to random numbers

1 Answer 1

Answering your questions

Use descriptive names

Drop unnecessary input validation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions