Which hashing algorithm is best for uniqueness and speed? Example (good) uses include hash dictionaries.
I know there are things like SHA-256 and such, but these algorithms are designed to be secure, which usually means they are slower than algorithms that are less unique. I want a hash algorithm designed to be fast, yet remain fairly unique to avoid collisions.
- Orbling (Feb 19, 2011): For what purpose, security or other?
- Earlz (Feb 19, 2011): @Orbling, for implementation of a hash dictionary. So collisions should be kept to a minimum, but it has no security purpose at all.
- Dean Harding (Feb 19, 2011): Note that you will need to expect at least some collisions in your hash table, otherwise the table will need to be enormous to be able to handle even a relatively small number of keys...
- user55191 (May 29, 2012): Great post! Could you also check Yann Collet's xxHash (creator of LZ4), which is twice as fast as Murmur? Homepage: code.google.com/p/xxhash More info: fastcompression.blogspot.fr/2012/04/…
- Izkata (Oct 31, 2013): @zvrba Depends on the algorithm. bcrypt is designed to be slow.
11 Answers
I tested some different algorithms, measuring speed and number of collisions.
I used three different key sets:
- A list of 216,553 English words (in lowercase)
- The numbers "1" to "216553" (think ZIP codes, and how a poor hash took down msn.com)
- 216,553 "random" (i.e. type 4 UUID) GUIDs
For each corpus, the number of collisions and the average time spent hashing was recorded.
I tested:
- DJB2
- DJB2a (variant using xor rather than +)
- FNV-1 (32-bit)
- FNV-1a (32-bit)
- SDBM
- CRC32
- Murmur2 (32-bit)
- SuperFastHash
Results
Each result lists the average hash time and the number of collisions:

Hash           Lowercase      Random UUID  Numbers
=============  =============  ===========  ==============
Murmur            145 ns      259 ns          92 ns
                    6 collis    5 collis       0 collis
FNV-1a            152 ns      504 ns          86 ns
                    4 collis    4 collis       0 collis
FNV-1             184 ns      730 ns          92 ns
                    1 collis    5 collis       0 collis▪
DJB2a             158 ns      443 ns          91 ns
                    5 collis    6 collis       0 collis▪▪▪
DJB2              156 ns      437 ns          93 ns
                    7 collis    6 collis       0 collis▪▪▪
SDBM              148 ns      484 ns          90 ns
                    4 collis    6 collis       0 collis▪▪▪
SuperFastHash     164 ns      344 ns         118 ns
                   85 collis    4 collis   18742 collis
CRC32             250 ns      946 ns         130 ns
                    2 collis    0 collis       0 collis
LoseLose          338 ns      -               -
               215178 collis
Notes:
- The LoseLose algorithm (where hash = hash + character) is truly awful. Everything collides into the same 1,375 buckets.
- SuperFastHash is fast, with things looking pretty scattered; but my goodness, the number of collisions. I'm hoping the guy who ported it got something wrong; it's pretty bad.
- CRC32 is pretty good. Slower, and it needs a 1k lookup table.
Do collisions actually happen?
Yes. I started writing my test program to see if hash collisions actually happen - and are not just a theoretical construct. They do indeed happen:
FNV-1 collisions

- creamwove collides with quists

FNV-1a collisions

- costarring collides with liquid
- declinate collides with macallums
- altarage collides with zinke
- altarages collides with zinkes

Murmur2 collisions

- cataract collides with periti
- roquette collides with skivie
- shawl collides with stormbound
- dowlases collides with tramontane
- cricketings collides with twanger
- longans collides with whigs

DJB2 collisions

- hetairas collides with mentioner
- heliotropes collides with neurospora
- depravement collides with serafins
- stylist collides with subgenera
- joyful collides with synaphea
- redescribed collides with urites
- dram collides with vivency

DJB2a collisions

- haggadot collides with loathsomenesses
- adorablenesses collides with rentability
- playwright collides with snush
- playwrighting collides with snushing
- treponematoses collides with waterbeds

CRC32 collisions

- codding collides with gnu
- exhibiters collides with schlager

SuperFastHash collisions

- dahabiah collides with drapability
- encharm collides with enclave
- grahams collides with gramary
- ...snip 79 collisions...
- night collides with vigil
- nights collides with vigils
- finks collides with vinic
Randomnessification
The other subjective measure is how randomly distributed the hashes are. Mapping the resulting HashTables shows how evenly the data is distributed. All the hash functions show good distribution when mapping the table linearly:
[image: hash tables mapped linearly]
Or as a Hilbert Map (XKCD is always relevant):
[image: hash tables mapped along a Hilbert curve]
Except when hashing number strings ("1", "2", ..., "216553") (for example, ZIP codes), where patterns begin to emerge in most of the hashing algorithms:
SDBM:
[image: SDBM number-string map]
DJB2a:
[image: DJB2a number-string map]
FNV-1:
[image: FNV-1 number-string map]
All except FNV-1a, which still looks pretty random to me:
[image: FNV-1a number-string map]
In fact, Murmur2 seems to have even better randomness with Numbers than FNV-1a:
[image: Murmur2 number-string map]
When I look at the FNV-1a "number" map, I think I see subtle vertical patterns. With Murmur I see no patterns at all. What do you think?
The extra ▪ marks in the table denote how bad the randomness is, with FNV-1a being the best and LoseLose being the worst:

Murmur2:       .
FNV-1a:        .
FNV-1:         ▪
DJB2:          ▪▪
DJB2a:         ▪▪
SDBM:          ▪▪▪
SuperFastHash: .
CRC:           ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
LoseLose:      ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
               ▪
               ▪▪▪▪▪▪▪▪▪▪▪▪▪
               ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
               ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
I originally wrote this program to decide if I even had to worry about collisions: I do.
And then it turned into making sure that the hash functions were sufficiently random.
FNV-1a algorithm
The FNV1 hash comes in variants that return 32, 64, 128, 256, 512 and 1024 bit hashes.
The FNV-1a algorithm is:
hash = FNV_offset_basis
for each octetOfData to be hashed
    hash = hash xor octetOfData
    hash = hash * FNV_prime
return hash
Where the constants FNV_offset_basis and FNV_prime depend on the return hash size you want:
Hash Size
===========
32-bit
    prime: 2^24 + 2^8 + 0x93 = 16777619
    offset: 2166136261
64-bit
    prime: 2^40 + 2^8 + 0xb3 = 1099511628211
    offset: 14695981039346656037
128-bit
    prime: 2^88 + 2^8 + 0x3b = 309485009821345068724781371
    offset: 144066263297769815596495629667062367629
256-bit
    prime: 2^168 + 2^8 + 0x63 = 374144419156711147060143317175368453031918731002211
    offset: 100029257958052580907070968620625704837092796014241193945225284501741471925557
512-bit
    prime: 2^344 + 2^8 + 0x57 = 35835915874844867368919076489095108449946327955754392558399825615420669938882575126094039892345713852759
    offset: 9659303129496669498009435400716310466090418745672637896108374329434462657994582932197716438449813051892206539805784495328239340083876191928701583869517785
1024-bit
    prime: 2^680 + 2^8 + 0x8d = 5016456510113118655434598811035278955030765345404790744303017523831112055108147451509157692220295382716162651878526895249385292291816524375083746691371804094271873160484737966720260389217684476157468082573
    offset: 1419779506494762106872207064140321832088062279544193396087847491461758272325229673230371772250864096521202355549365628174669108571814760471015076148029755969804077320157692458563003215304957150157403644460363550505412711285966361610267868082893823963790439336411086884584107735010676915
See the main FNV page for details.
All my results are with the 32-bit variant.
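Following the pseudocode and the 32-bit constants above, a minimal C implementation of FNV-1a looks like this (my own sketch, not the exact code used in the tests):

```c
#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a: xor the octet in first, then multiply by the prime. */
uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t hash = 2166136261u;          /* FNV_offset_basis (32-bit) */
    for (size_t i = 0; i < len; i++) {
        hash ^= p[i];                     /* xor before multiply: the "1a" order */
        hash *= 16777619u;                /* FNV_prime (32-bit) */
    }
    return hash;
}
```

Note that hashing zero bytes returns the offset basis itself, 2166136261.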
FNV-1 better than FNV-1a?
No. FNV-1a is all-around better. There were more collisions with FNV-1a when using the English word corpus:
Hash Word Collisions
====== ===============
FNV-1 1
FNV-1a 4
Now compare lowercase and uppercase:
Hash lowercase word Collisions UPPERCASE word collisions
====== ========================= =========================
FNV-1 1 9
FNV-1a 4 11
In this case, FNV-1a isn't "400%" worse than FNV-1, only 20% worse.
I think the more important takeaway is that there are two classes of algorithms when it comes to collisions:
- collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM
- collisions common: SuperFastHash, Loselose
And then there's how evenly distributed the hashes are:
- outstanding distribution: Murmur2, FNV-1a, SuperFastHash
- excellent distribution: FNV-1
- good distribution: SDBM, DJB2, DJB2a
- horrible distribution: Loselose
Update
Murmur? Sure, why not
Update
@whatshisname wondered how a CRC32 would perform, added numbers to the table.
CRC32 is pretty good. Few collisions, but slower, and the overhead of a 1k lookup table.
Snip all erroneous stuff about CRC distribution - my bad
Up until today I was going to use FNV-1a as my de facto hash-table hashing algorithm. But now I'm switching to Murmur2:
- Faster
- Better randomnessification of all classes of input
And I really, really hope there's something wrong with the SuperFastHash algorithm I found; it's too bad to be as popular as it is.
Update: From the MurmurHash3 homepage on Google:
(1) - SuperFastHash has very poor collision properties, which have been documented elsewhere.
So I guess it's not just me.
Update: I realized why Murmur is faster than the others. MurmurHash2 operates on four bytes at a time. Most algorithms are byte by byte:
for each octet in Key
AddTheOctetToTheHash
This means that as keys get longer Murmur gets its chance to shine.
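The four-bytes-at-a-time inner loop is easy to see in the 32-bit MurmurHash2. Here is a sketch following the public-domain reference implementation (I've swapped the original's direct unaligned read for a portable memcpy load; the constants are the published ones):

```c
#include <stdint.h>
#include <string.h>

uint32_t MurmurHash2(const void *key, int len, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995;        /* multiplication constant */
    const int r = 24;                     /* rotation/shift constant */
    uint32_t h = seed ^ (uint32_t)len;
    const unsigned char *data = (const unsigned char *)key;

    while (len >= 4) {                    /* consume four bytes per iteration */
        uint32_t k;
        memcpy(&k, data, sizeof(k));
        k *= m; k ^= k >> r; k *= m;
        h *= m; h ^= k;
        data += 4;
        len -= 4;
    }

    switch (len) {                        /* handle the last 1-3 bytes */
    case 3: h ^= (uint32_t)data[2] << 16; /* fall through */
    case 2: h ^= (uint32_t)data[1] << 8;  /* fall through */
    case 1: h ^= data[0];
            h *= m;
    }

    h ^= h >> 13;                         /* final avalanche */
    h *= m;
    h ^= h >> 15;
    return h;
}
```

The byte-by-byte algorithms above do one mixing step per octet; this loop does one mixing step per 32-bit word, which is where the long-key speedup comes from.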
Update
GUIDs are designed to be unique, not random
A timely post by Raymond Chen reiterates the fact that "random" GUIDs are not meant to be used for their randomness. They, or a subset of them, are unsuitable as a hash key:
Even the Version 4 GUID algorithm is not guaranteed to be unpredictable, because the algorithm does not specify the quality of the random number generator. The Wikipedia article for GUID contains primary research which suggests that future and previous GUIDs can be predicted based on knowledge of the random number generator state, since the generator is not cryptographically strong.
Randomness is not the same as collision avoidance, which is why it would be a mistake to try to invent your own "hashing" algorithm by taking some subset of a "random" GUID:
int HashKeyFromGuid(Guid type4uuid)
{
//A "4" is put somewhere in the GUID.
//I can't remember exactly where, but it doesn't matter for
//the illustrative purposes of this pseudocode
int guidVersion = ((type4uuid.D3 & 0x0f00) >> 8);
Assert(guidVersion == 4);
return (int)GetFirstFourBytesOfGuid(type4uuid);
}
Note: Again, I put "random GUID" in quotes, because it's the "random" variant of GUIDs. A more accurate description would be Type 4 UUID. But nobody knows what type 4, or types 1, 3 and 5 are. So it's just easier to call them "random" GUIDs.
- Michael (May 25, 2012): It would be really interesting to see how SHA compares, not because it's a good candidate for a hashing algorithm here, but it would be really interesting to see how any cryptographic hash compares with these made-for-speed algorithms.
- th_in_gs (May 29, 2012): A new hash by the name of 'xxHash', by Yann Collet, was doing the rounds recently. I'm always suspicious of a new hash. It would be interesting to see it in your comparison (if you aren't tired of people suggesting random hashes they've heard of to be added...)
- ATTracker (May 29, 2012): Indeed. The performance numbers announced by the xxHash project page look impressive, maybe too much to be true. Well, at least it's an open-source project: code.google.com/p/xxhash
- Davy Landman (Jun 4, 2012): Hi Ian, my Delphi implementation of SuperFastHash is correct. When implementing I created a test set in C and Delphi to compare the results of my implementation and the reference implementation. There are no differences. So what you see is the actual badness of the hash... (That is why I also published a MurmurHash implementation: landman-code.blogspot.nl/2009/02/…)
- MaiaVictor (May 29, 2014): Is the poster aware this is not just an awesome answer - this is the world's de facto reference resource on the subject? Anytime I need to deal with hashes, that solves my issue so fast and authoritatively that I don't ever need anything else.
If you want to create a hash map from an unchanging dictionary, you might want to consider perfect hashing (https://en.wikipedia.org/wiki/Perfect_hash_function) - during the construction of the hash function and hash table, you can guarantee, for a given dataset, that there will be no collisions.
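A toy way to see the idea: for a small, fixed key set you can brute-force a seed for an ordinary hash until every key lands in a distinct bucket, then ship that seed with the table. This sketch is purely illustrative (the seeded FNV-1a-style hash, the key set, the table size and the search budget are all my own choices, not from any perfect-hashing library):

```c
#include <stdint.h>

#define TABLE_SIZE 8

/* Seeded FNV-1a-style hash; the seed is folded into the offset basis.
   Taking the top 3 bits gives a bucket in [0, 8). */
unsigned int seeded_hash(const char *s, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h >> 29;
}

/* Search for a seed that puts every key in its own bucket.
   Returns 0 if no collision-free seed was found in the budget. */
uint32_t find_perfect_seed(const char **keys, int nkeys)
{
    for (uint32_t seed = 1; seed <= 1000000; seed++) {
        int used[TABLE_SIZE] = {0};
        int ok = 1;
        for (int i = 0; i < nkeys && ok; i++) {
            unsigned int b = seeded_hash(keys[i], seed);
            if (used[b]++)
                ok = 0;       /* bucket taken: this seed is no good */
        }
        if (ok)
            return seed;      /* zero collisions for this key set */
    }
    return 0;
}
```

Real minimal-perfect-hash generators (CHD, BDZ, the cmph library mentioned below) are far more sophisticated, but the guarantee is the same: for the known key set, lookups never collide.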
- Ellie K (May 29, 2012): Here's more about (minimal) perfect hashing: burtleburtle.net/bob/hash/perfect.html, including performance data, although it doesn't use the most current processor, etc.
- devios1 (Apr 4, 2013): It's pretty obvious, but worth pointing out that in order to guarantee no collisions, the keys would have to be the same size as the values, unless there are constraints on the values the algorithm can capitalize on.
- Jim Balter (Oct 20, 2018): @devios1 Your statement is meaningless. First, the values in a hash table, perfect or not, are independent of the keys. Second, a perfect hash table is just a linear array of values, indexed by the result of a function that has been crafted so that all the indices are unique.
- Jim Balter (Oct 20, 2018): @MarcusJ Perfect hashing is usually used with less than 100 keys, but take a look at cmph.sourceforge.net ... still far short of your range.
- Jim Balter (Oct 20, 2018): @DavidCary Nothing at your link supports your claim. Possibly you have confused O(1) with "no collisions", but they aren't at all the same thing. Of course, perfect hashing guarantees no collisions, but it requires that all the keys are known in advance and that there are relatively few of them. (But see the link to cmph above.)
Here is a list of hash functions, but the short version is:
If you just want a good hash function and cannot wait, djb2 is one of the best string hash functions I know. It has excellent distribution and speed on many different sets of keys and table sizes.

unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
- rurban (Aug 20, 2014): Actually djb2 is zero-sensitive, as most such simple hash functions are, so you can easily break such hashes. It has a bad bias, too many collisions and a bad distribution; it breaks on most smhasher quality tests: See github.com/rurban/smhasher/blob/master/doc/bernstein His cdb database uses it, but I wouldn't use it with public access.
- Conrad Meyer (Oct 30, 2016): DJB is pretty bad from a performance and distribution standpoint. I wouldn't use it today.
- maaartinus (Jun 4, 2017): @ConradMeyer I'd bet DJB can be sped up by a factor of three, just like in this question of mine, and then it'd probably beat most usable algorithms. Concerning the distribution, I agree. A hash producing collisions even for two-letter strings can't be really good.
- Leopoldo Sanczyk (Oct 31, 2019): Guys, I have doubts. You are saying djb2 is bad, but the test results of the accepted answer show it is good.
- Dr. Hans-Peter Störr (Jun 22, 2020): You might at least use a sensible prime that produces fewer collisions instead of 33. stackoverflow.com/a/2816747/21499
CityHash by Google is the algorithm you are looking for. It is not good for cryptography but is good for generating unique hashes.
Read the blog for more details and the code is available here.
CityHash is written in C++. There is also a plain C port.
All the CityHash functions are tuned for 64-bit processors. That said, they will run (except for the new ones that use SSE4.2) in 32-bit code. They won't be very fast though. You may want to use Murmur or something else in 32-bit code.
- user105009 (Oct 15, 2013): Have a look at SipHash too; it is meant to replace MurmurHash/CityHash/etc.: 131002.net/siphash
- stevendaniels (Mar 18, 2015): Also see FarmHash, a successor to CityHash. code.google.com/p/farmhash
- Clay Bridges (May 22, 2015): xxHash claims to be 5x faster than CityHash.
I've plotted a short speed comparison of different hashing algorithms when hashing files.
The individual plots only differ slightly in the reading method and can be ignored here, since all files were stored in a tmpfs. Therefore the benchmark was not IO-bound, if you are wondering.
Algorithms include: SpookyHash, CityHash, Murmur3, MD5, SHA{1,256,512}.
Conclusions:
- Non-cryptographic hash functions like Murmur3, CityHash and Spooky are pretty close together. One should note that CityHash may be faster on CPUs with SSE 4.2's CRC instruction, which my CPU does not have. SpookyHash was in my case always a tiny bit ahead of CityHash.
- MD5 seems to be a good tradeoff when using cryptographic hash functions, although SHA256 may be more secure against the collision vulnerabilities of MD5 and SHA1.
- The complexity of all algorithms is linear - which is really not surprising, since they work blockwise. (I wanted to see if the reading method makes a difference, so you can just compare the rightmost values.)
- SHA256 was slower than SHA512.
- I did not investigate the randomness of the hash functions. But here is a good comparison of the hash functions that are missing in Ian Boyd's answer. This points out that CityHash has some problems in corner cases.
The source used for the plots:
- https://github.com/sahib/rmlint/tree/gh-pages/plots (sorry for the ugly code)
- Craig McQueen (Dec 13, 2015): The linear scale graph cuts off the y-axis label, which says what quantity it is plotting. I guess it probably would be "time in seconds", same as the logarithmic scale. It's worth fixing.
I know there are things like SHA-256 and such, but these algorithms are designed to be secure, which usually means they are slower than algorithms that are less unique.
The assumption that cryptographic hash functions are more unique is wrong, and in fact it can be shown to be often backwards in practice. In truth:
- Cryptographic hash functions ideally should be indistinguishable from random;
- But with non-cryptographic hash functions, it's desirable for them to interact favorably with likely inputs.
Which means that a non-cryptographic hash function may well have fewer collisions than a cryptographic one for "good" data sets—data sets that it was designed for.
We can actually demonstrate this with the data in Ian Boyd's answer and a bit of math: the Birthday problem. The formula for the expected number of colliding pairs if you pick n integers at random from the set [1, d] is this (taken from Wikipedia):

n - d + d * ((d - 1) / d)^n

Plugging in n = 216,553 and d = 2^32 we get about 5.5 expected collisions. Ian's tests mostly show results around that neighborhood, but with one dramatic exception: most of the functions got zero collisions in the consecutive numbers tests. The probability of choosing 216,553 32-bit numbers at random and getting zero collisions is about 0.43%. And that's just for one function - here we have five distinct hash function families with zero collisions!
So what we're seeing here is that the hashes that Ian tested are interacting favorably with the consecutive numbers dataset—i.e., they're dispersing minimally different inputs more widely than an ideal cryptographic hash function would. (Side note: this means that Ian's graphical assessment that FNV-1a and MurmurHash2 "look random" to him in the numbers data set can be refuted from his own data. Zero collisions on a data set of that size, for both hash functions, is strikingly nonrandom!)
This is not a surprise because this is a desirable behavior for many uses of hash functions. For example, hash table keys are often very similar; Ian's answer mentions a problem MSN once had with ZIP code hash tables. This is a use where collision avoidance on likely inputs wins over random-like behavior.
Another instructive comparison here is the contrast in the design goals between CRC and cryptographic hash functions:
- CRC is designed to catch errors resulting from noisy communications channels, which are likely to be a small number of bit flips;
- Crypto hashes are designed to catch modifications made by malicious attackers, who are allotted limited computational resources but arbitrarily much cleverness.
So for CRC it is again good to have fewer collisions than random in minimally different inputs. With crypto hashes, this is a no-no!
The SHA algorithms (including SHA-256) are designed to be fast.
In fact, their speed can be a problem sometimes. In particular, a common technique for storing a password-derived token is to run a standard fast hash algorithm 10,000 times (storing the hash of the hash of the hash of the hash of the ... password).
#!/usr/bin/env ruby
require 'securerandom'
require 'digest'
require 'benchmark'
def run_random_digest(digest, count)
v = SecureRandom.random_bytes(digest.block_length)
count.times { v = digest.digest(v) }
v
end
Benchmark.bmbm do |x|
x.report { run_random_digest(Digest::SHA256.new, 1_000_000) }
end
Output:
Rehearsal ------------------------------------
1.480000 0.000000 1.480000 ( 1.391229)
--------------------------- total: 1.480000sec
user system total real
1.400000 0.000000 1.400000 ( 1.382016)
- Dean Harding (Feb 19, 2011): It's relatively fast, sure, for a cryptographic hashing algorithm. But the OP just wants to store values in a hashtable, and I don't think a cryptographic hash function is really appropriate for that.
- yfeldblum (Feb 22, 2011): The question brought up (tangentially, it now appears) the subject of the cryptographic hash functions. That's the bit I am responding to.
- TC1 (Oct 14, 2013): Just to put people off the idea of "In particular, a common technique for storing a password-derived token is to run a standard fast hash algorithm 10,000 times" -- while common, that's just plain stupid. There are algorithms designed for these scenarios, e.g., bcrypt. Use the right tools.
- mirabilos (Dec 6, 2013): Cryptographic hashes are designed to have a high throughput, but that often means they have high setup, teardown, .rodata and/or state costs. When you want an algorithm for a hashtable, you usually have very short keys, and lots of them, but do not need the additional guarantees of a cryptographic hash. I use a tweaked Jenkins' one-at-a-time myself.
- Lie Ryan (Nov 12, 2017): @ChrisMorgan: rather than using a cryptographically secure hash, HashTable DoS can be solved much more efficiently using hash randomization, so that on every run of the program, or even on every hashtable, the data doesn't get grouped into the same bucket every time.
Use SipHash. It has many desirable properties:
- Fast. An optimized implementation takes around 1 cycle per byte.
- Secure. SipHash is a strong PRF (pseudorandom function). This means that it is indistinguishable from a random function (unless you know the 128-bit secret key). Hence:
  - No need to worry about your hash table probes becoming linear time due to collisions. With SipHash, you know that you will get average-case performance on average, regardless of inputs.
  - Immunity to hash-based denial of service attacks.
  - You can use SipHash (especially the version with a 128-bit output) as a MAC (Message Authentication Code). If you receive a message and a SipHash tag, and the tag is the same as that from running SipHash with your secret key, then you know that whoever created the hash was also in possession of your secret key, and that neither the message nor the hash have been altered since.
- bryc (Apr 2, 2018): Isn't SipHash overkill unless you need security? Requires a 128-bit key, which is just a glorified hash seed. Not to mention MurmurHash3 has 128-bit output and SipHash only has a 64-bit output. Obviously the larger digest has a lower collision chance.
- Demi (Apr 2, 2018): @bryc The difference is that SipHash will continue to be well-behaved, even on malicious input. A hash table based on SipHash can be used for data from potentially hostile sources, and can use an algorithm such as linear probing that is very sensitive to the details of the hash function.
- Frank Hileman (Feb 15, 2021): SipHash (and related newer PRF-style functions) is my default choice for security. For performance, xxHash is hard to beat. There's tons of bad hashing advice on the internet, even in the discussions here. Good performance on random or semi-random inputs is meaningless. What is the worst-case performance, with real-world inputs? What is the result with malicious inputs? Your hash table will eventually become an attack vector.
It depends on the data you are hashing. Some hashing works better with specific data like text. Some hashing algorithms were specifically designed to be good for specific data.
Paul Hsieh once made a fast hash. He lists source code and explanations. But it was already beaten. :)
Java uses this simple multiply-and-add algorithm:

The hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the i-th character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)

There are probably much better ones out there, but this is fairly widespread and seems to be a good trade-off between speed and uniqueness.
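Evaluated iteratively (Horner's rule), the same formula never needs an explicit power. A C sketch for plain ASCII strings (Java operates on UTF-16 chars, so this only matches String.hashCode() for ASCII input):

```c
/* Java's String.hashCode() formula, evaluated via Horner's rule:
   h = ((s[0]*31 + s[1])*31 + s[2])*31 + ...  */
int java_string_hash(const char *s)
{
    int h = 0;
    while (*s)
        h = 31 * h + (unsigned char)*s++;
    return h;
}
```

For example, java_string_hash("abc") is 96354, the same value Java returns for "abc".hashCode().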
- Joachim Sauer (Apr 23, 2012): I wouldn't use the exact same one used here, as it's still relatively easy to produce collisions with this. It's definitely not terrible, but there are much better ones out there. And if there's no significant reason to be compatible with Java, it should not be chosen.
- Dr. Hans-Peter Störr (Jul 1, 2014): If you still choose this way of hashing for some reason, you could at least use a better prime like 92821 as a multiplicator. That reduces collisions much. stackoverflow.com/a/2816747/21499
- bryc (Jan 15, 2019): You might as well use FNV1a instead. It's also a simple multiplication-based hash, but uses a larger multiplier, which disperses the hash better.
- Leopoldo Sanczyk (Oct 31, 2019): You don't want to do s[0]*31^3 + s[1]*31^2 + s[2]*31 + s[3]. Avoid the power operator (^) and do it this way: ((s[0]*31 + s[1])*31 + s[2])*31 + s[3].
- biziclop (Nov 1, 2019): @LeopoldoSanczyk Yes, in the code it is (and should be) done iteratively; it was just easier to understand in a closed formula.
First of all, why do you need to implement your own hashing? For most tasks you should get good results with data structures from a standard library, assuming there's an implementation available (unless you're just doing this for your own education).
As far as actual hashing algorithms go, my personal favorite is FNV.
Here's an example implementation of the 32-bit version in C:
unsigned long int FNV_hash(void* dataToHash, unsigned long int length)
{
unsigned char* p = (unsigned char *) dataToHash;
unsigned long int h = 2166136261UL;
unsigned long int i;
for(i = 0; i < length; i++)
h = (h * 16777619) ^ p[i] ;
return h;
}
- Ian Boyd (Apr 23, 2012): The FNV-1a variant is slightly better with randomness. Swap the order of the * and ^: h = (h * 16777619) ^ p[i] ==> h = (h ^ p[i]) * 16777619