I was solving Group Anagram problem in leetcode. Link for the problem https://leetcode.com/problems/group-anagrams/
Below is the code which I wrote
class Solution {
public:
vector<vector<string>> groupAnagrams(vector<string>& inp) {
if(inp.size() == 0)
return vector<vector<string> > {{}};
int count = 1;
map<char, long long> alphaVal;
for (char i = 'a'; i <= 'z'; i++) {
alphaVal[i] = int(rand());
}
map<long long, vector<string> > prep;
for (auto value : inp) {
long long temp = 0;
for (auto c_p : value) {
temp += alphaVal[c_p];
}
//cout << "temp " << temp << endl;
prep[temp].push_back(value);
}
vector < vector<string> > ans;
for (auto it = prep.begin(); it != prep.end(); it++) {
vector<string> temp;
for (int i = 0; i < it->second.size(); i++) {
//cout << it->second[i] << " ";
temp.push_back(it->second[i]);
}
ans.push_back(temp);
}
return ans;
}
};
I have two questions:
Even though the above solution worked, is using random is a really good idea?
Previously I wrote a logic to check the characters and store the string.
map<set<char>, vector<string> > prep; for (auto value : inp) { set<char> temp; for (auto c_p : value) { temp.insert(c_p); } prep[temp].push_back(value); }
For the above logic one test case failed
ip: ["ddddddddddg","dgggggggggg"]
op: [["ddddddddddg","dgggggggggg"]]
Even though the two strings are unique, the set hash calculation says it is same. Any idea why this behavior on the selected test case? Was just curious to know. Thanks for the answers!
1 Answer 1
Answering your questions
Even though the above solution worked, is using random is a really good idea?
It's not a good idea. It doesn't ensure a correct result. To paraphrase the code at a high level:
- Assign random
long long
values to the alphabet - For each word in the input
- Sum the random value of the letters
- Since anagrams will have the same sum, group the words by the sum
This approach is flawed, because:
- Words that are not anagrams may also have the same sum. With some bad luck due to the randomness, words may get assigned to the same bucket incorrectly.
- Although
long long
is a big range, it's possible that more than one letter gets assigned the same random number. That will significantly increase the risk of collisions.
To group anagrams correctly, the grouping key must be something that distinctly identifies the anagrams, and never shared by words that shouldn't be in the same group. This unique key could be the map of letter counts:
map<map<char, int>, vector<string>> counts2words;
for (auto word : words) {
map<char, int> counts;
for (auto c : word) {
counts[c]++;
}
counts2words[counts].push_back(word);
}
Even though the two strings are unique, the set hash calculation says it is same. Any idea why this behavior on the selected test case?
First of all, that statement is misleading. In this example, it's much more than the hash calculation that's the same, here the two sets themselves are the same, consisting of the letters "d" and "g".
But let's just assume that hash calculation for two strings S1 and S2 were the same. That wouldn't be too surprising, that's just how hash functions work. Collisions happen, that's why it's important that when working with hashes, when the hashed values are equal, you also must compare the unhashed values to verify if they are really equal.
Use descriptive names
The names used throughout the implementation are cryptic.
See the example code above where I used a map of counts:
everything has a descriptive name, except the trivial c
for the letters.
Drop unnecessary input validation
This is unnecessary:
if(inp.size() == 0) return vector<vector<string> > {{}};
For one thing, the problem description guarantees that the input is never empty.
For another, the implementation naturally handles correctly the case of empty input.
-
\$\begingroup\$ Hi janos thanks for the answer. First I felt stupid to ask the question regarding the hash function where I used set<char> so it comes down to {d,g}, so same calculation will be there, which I realized just after posting it. Regarding the randomness thanks for the explanation and will definitely will follow correct naming convention. I also dry ran the logic snippet which you have written, that also works and reliable :) \$\endgroup\$susil95– susil952022年08月28日 07:00:35 +00:00Commented Aug 28, 2022 at 7:00
-
\$\begingroup\$ Actually @janos the second sample test case where the empty input is accepted, input is [""] Output: [[""]] \$\endgroup\$susil95– susil952022年08月28日 07:05:11 +00:00Commented Aug 28, 2022 at 7:05
-
2\$\begingroup\$ ... and the program returns that correctly, doesn't it? \$\endgroup\$janos– janos2022年08月28日 07:18:10 +00:00Commented Aug 28, 2022 at 7:18
-
\$\begingroup\$ yeah.. you are right! :) \$\endgroup\$susil95– susil952022年08月28日 07:38:38 +00:00Commented Aug 28, 2022 at 7:38
"ddddddddddg"
are {d, g}, same as in"dgggggggggg"
(orggd
).) \$\endgroup\$random
is inis using random is a really good idea?
- did you forget to present something? \$\endgroup\$