Created on 2012年11月07日 12:38 by serhiy.storchaka, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
| Files | |
|---|---|
| File name | Uploaded |
| test_hash_align.patch | pitrou, 2012年11月10日 23:37 |
| fast_hash_3.patch | serhiy.storchaka, 2012年11月11日 14:19 |
| bench_hash.txt | vstinner, 2013年04月08日 21:51 |
| bench_hash.py | vstinner, 2013年04月08日 21:51 |
| cityhash.txt | ebfe, 2013年06月02日 09:30 |
| cityhash_arm.txt | ebfe, 2013年06月02日 14:21 |
| cityhash_fashhash3_arm.txt | ebfe, 2013年06月05日 16:30 |
| cityhash_fasthast3.txt | ebfe, 2013年06月05日 16:30 |
| Messages (36) | |||
|---|---|---|---|
| msg175093 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年11月07日 12:38 | |
In the discussion of issue14621 it was noted that much more complex hash algorithms can overtake the current one because they process more data at a time. Here is a patch that implements this idea for the current algorithm; it also removes some code duplication.
Microbenchmarks:
$ ./python -m timeit -n 1 -s "t = b'a' * 10**8" "hash(t)"
$ ./python -m timeit -n 1 -s "t = 'a' * 10**8" "hash(t)"
$ ./python -m timeit -n 1 -s "t = '\u0100' * 10**8" "hash(t)"
$ ./python -m timeit -n 1 -s "t = '\U00010000' * 10**8" "hash(t)"
Results on 32-bit Linux on an AMD Athlon 64 X2 4600+:
      | original | patched   | speedup
bytes | 181 msec | 45.7 msec | 4x
UCS1  | 429 msec | 45.7 msec | 9.4x
UCS2  | 179 msec | 92 msec   | 1.9x
UCS4  | 183 msec | 183 msec  | 1x
If the idea is acceptable, I will create benchmarks for short strings.
|
|||
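For orientation, the pre-patch hash is the classic multiply/xor loop (multiplier 1000003) over one byte or code unit at a time; the patch's idea is to keep that recurrence but consume one Py_uhash_t-sized word per iteration. The sketch below only illustrates the idea: it is not the attached patch, hash_words and MULTIPLIER are made-up names, and the hash-secret seeding and final length mixing are omitted. Note that it dereferences word pointers directly, i.e. it assumes aligned data, which is exactly the problem raised in the follow-up messages.

```c
#include <stddef.h>

typedef size_t Py_uhash_t;                  /* de facto, see msg175300 */
#define MULTIPLIER ((Py_uhash_t)1000003UL)  /* the traditional multiplier */

static Py_uhash_t
hash_words(const unsigned char *p, size_t len)
{
    Py_uhash_t x = (Py_uhash_t)len;                 /* stand-in for the real seeding */
    const Py_uhash_t *wp = (const Py_uhash_t *)p;   /* assumes p is suitably aligned! */
    size_t nwords = len / sizeof(Py_uhash_t);

    while (nwords--)                                /* one word per iteration ... */
        x = MULTIPLIER * x ^ *wp++;

    p = (const unsigned char *)wp;
    len %= sizeof(Py_uhash_t);
    while (len--)                                   /* ... then the leftover tail bytes */
        x = MULTIPLIER * x ^ (Py_uhash_t)*p++;
    return x;
}
```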
| msg175298 - (view) | Author: Gregory P. Smith (gregory.p.smith) * (Python committer) | Date: 2012年11月10日 22:01 | |
yes please! Any reason you're using an unsigned int in your loop instead of a Py_uhash_t? |
|||
| msg175300 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年11月10日 22:39 | |
> Any reason you're using an unsigned int in your loop instead of a Py_uhash_t?
In fact, there is no serious reason. It should be a type whose alignment matches the minimal alignment of void*, size_t and Py_hash_t. Since Py_uhash_t is de facto size_t, we can just use size_t.
|
|||
| msg175302 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年11月10日 23:26 | |
The patch is too optimistic: it gives different results depending on the alignment of the memory buffer.
>>> b = b"abcd"*100
>>> hash(b[1:])
7264687738704559842
>>> hash(memoryview(b)[1:])
9054791208347464792
>>> memoryview(b)[1:] == b[1:]
True
|
|||
| msg175303 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年11月10日 23:37 | |
Here is a test case for the hash/alignment issue. |
|||
| msg175330 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年11月11日 08:50 | |
> The patch is too optimistic: it gives different results depending on the alignment of the memory buffer.
So this method is not applicable to bytes. Here is a patch for strings only.
If a fast hash for bytes/memoryview is desirable, I can write a fast, robust implementation for unaligned data, but it will be more cumbersome and a bit slower.
|
|||
| msg175332 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年11月11日 08:53 | |
> Here is a test case for the hash/alignment issue.
I think there should be a test for shifted data. Something like hash(b'abcd...') == hash(memoryview(b'xabcd...')[1:]).
|
|||
| msg175333 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年11月11日 09:13 | |
Oh, I see, it's already here. |
|||
| msg175349 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年11月11日 11:41 | |
> If a fast hash for bytes/memoryview is desirable, I can write a fast,
> robust implementation for unaligned data, but it will be more
> cumbersome and a bit slower.
Unaligned accesses are not a problem on x86(-64), but they will segfault (bus error, IIRC) on other architectures such as SPARC, unfortunately. (Blame RISC for being really too "simplified".)
|
|||
| msg175351 - (view) | Author: Stefan Krah (skrah) * (Python committer) | Date: 2012年11月11日 11:55 | |
FWIW, on x86/x64 gcc often generates identical code for x = y and memcpy(x, y, 8). See e.g. the PACK_SINGLE and UNPACK_SINGLE macros in Objects/memoryobject.c. I didn't look at the patch yet, just an idea. |
|||
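To make the memcpy suggestion concrete: the portable way to read a word from a possibly unaligned buffer is a small fixed-size memcpy, which mainstream compilers reduce to a single load on x86/x86-64 while staying well-defined on strict-alignment targets. A minimal sketch (load_word is a hypothetical helper, not code from any attached patch):

```c
#include <string.h>
#include <stddef.h>

typedef size_t Py_uhash_t;

/* Portable unaligned read: the fixed-size memcpy is normally compiled down
 * to a single load on x86/x86-64, and remains correct (if a bit slower) on
 * strict-alignment targets such as SPARC or older ARM. */
static inline Py_uhash_t
load_word(const unsigned char *p)
{
    Py_uhash_t w;
    memcpy(&w, p, sizeof(w));
    return w;
}

/* Usage inside the word-at-a-time loop sketched earlier:
 *     x = MULTIPLIER * x ^ load_word(p);
 *     p += sizeof(Py_uhash_t);
 */
```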
| msg175353 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年11月11日 12:20 | |
> Unaligned accesses are not a problem on x86(-64), but they will segfault (bus error, IIRC) on other architectures such as SPARC, unfortunately.
On x86(-64) this kills performance and makes the optimization pointless.
> FWIW, on x86/x64 gcc often generates identical code for x = y and memcpy(x, y, 8).
The code can be identical, but the time will differ significantly for aligned and non-aligned data.
|
|||
| msg175361 - (view) | Author: Stefan Krah (skrah) * (Python committer) | Date: 2012年11月11日 12:45 | |
> The code can be identical, but the time will differ significantly for
> aligned and non-aligned data.
Of course, but in most cases the data *is* aligned, so only code that does something quite special pays the performance penalty.
|
|||
| msg175367 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年11月11日 14:19 | |
Stefan, thank you for the suggestion. The test showed that, in fact, at least on some x86 CPUs there is no performance decrease when using memcpy on unaligned data. This is good news. The code can stay simple, and some doubtful, potentially undefined behavior was even removed.
Additional microbenchmarks:
$ ./python -m timeit -n 1 -s "t = memoryview(b'a' * 10**8)" "hash(t)"
$ ./python -m timeit -n 1 -s "t = memoryview(b'a' * 10**8)[1:]" "hash(t)"
$ ./python -m timeit -n 1 -s "t = memoryview(b'a' * 10**8)[8:]" "hash(t)"
                 | original | patched   | speedup
bytes            | 181 msec | 46 msec   | 3.9x
UCS1             | 429 msec | 46.2 msec | 9.3x
UCS2             | 179 msec | 91.9 msec | 1.9x
UCS4             | 183 msec | 184 msec  | 1x
memoryview()     | 362 msec | 91.7 msec | 3.9x
memoryview()[1:] | 362 msec | 93.2 msec | 3.9x
memoryview()[8:] | 362 msec | 92.4 msec | 3.9x
I don't know how it will be on other platforms.
|
|||
| msg175385 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012年11月11日 19:12 | |
New changeset 797de1864fd9 by Antoine Pitrou in branch '3.3':
Add a test for hashing of unaligned memory buffers (from issue #16427).
http://hg.python.org/cpython/rev/797de1864fd9
New changeset 9cb1366b251b by Antoine Pitrou in branch 'default':
Add a test for hashing of unaligned memory buffers (from issue #16427).
http://hg.python.org/cpython/rev/9cb1366b251b
|
|||
| msg186352 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年04月08日 21:51 | |
fast_hash_3.patch is a little bit (6%) slower for Unicode strings shorter than 10 characters, but much faster for strings of 100 characters or longer (up to 10x faster).
I used the str type and disabled its hash cache (the "_PyUnicode_HASH(self) = x;" line in unicode_hash()) to run my benchmark.
Summary | original | patched
---------------+------------+---------------
Length 1 | 231 ns (*) | 244 ns (+6%)
Length 3 | 238 ns (*) | 253 ns (+6%)
Length 10 | 254 ns (*) | 251 ns
Length 20 | 280 ns (*) | 256 ns (-8%)
Length 100 | 528 ns (*) | 321 ns (-39%)
Length 10 ** 4 | 32 us (*) | 9.49 us (-70%)
Length 10 ** 8 | 329 ms (*) | 104 ms (-68%)
---------------+------------+---------------
Total | 329 ms (*) | 104 ms (-68%)
|
|||
| msg186353 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年04月08日 21:59 | |
Does anyone know if fast_hash_3.patch may reduce the quality of the hash function? (May the patched hash function produce more collisions? The "Avalanche effect" thing.) |
|||
| msg186403 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013年04月09日 13:34 | |
Well, the quality of the hash function is clearly reduced:
>>> hash("abcdefgh") & 0xff
206
>>> hash("abcdefgi") & 0xff
206
>>> hash("abcdefgj") & 0xff
206
>>> hash("abxxxxxx") & 0xff
206
>>> hash("aaaaaa11") & 0xff
206
>>> hash("aaaaaa12") & 0xff
206
Now to know if that may produce slowdowns in some situations... (dicts and sets have a sophisticated probing algorithm which takes into account the whole hash value, not the masked one).
|
|||
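For readers unfamiliar with the probing Antoine refers to: CPython's dicts and sets start at hash & mask but then fold the remaining high bits of the hash into the probe sequence through a "perturb" value, so keys that collide only in the low bits quickly diverge. A simplified, self-contained sketch of the probe recurrence (adapted from Objects/dictobject.c; the demo hash values below are made up):

```c
#include <stdio.h>
#include <stddef.h>

#define PERTURB_SHIFT 5   /* as in Objects/dictobject.c */

/* Print the first few slot indices a table of mask+1 slots would probe for
 * a given hash: i starts at hash & mask, then i = i*5 + perturb + 1 with
 * perturb >>= PERTURB_SHIFT, so higher hash bits enter the sequence. */
static void
print_probes(size_t hash, size_t mask, int nprobes)
{
    size_t perturb = hash;
    size_t i = hash & mask;
    printf("hash %#zx probes:", hash);
    for (int k = 0; k < nprobes; k++) {
        printf(" %zu", i & mask);
        i = (i << 2) + i + perturb + 1;
        perturb >>= PERTURB_SHIFT;
    }
    printf("\n");
}

int main(void)
{
    /* Two hashes that collide in the low byte (0xce == 206, as in the
     * examples above) but differ above it: their probe sequences split
     * apart after a few steps, so the masked collision is not fatal. */
    print_probes((size_t)0x1234a1ce, 7, 6);
    print_probes((size_t)0x8765f3ce, 7, 6);
    return 0;
}
```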
| msg186405 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年04月09日 13:40 | |
> (dicts and sets have a sophisticated probing algorithm which takes into account the whole hash value, not the masked one).
Correct, so your specific example should not be a problem, since the full hash values differ for all six strings.
> Now to know if that may produce slowdowns in some situations...
pybench and perf.py can be used to measure the performance of the patch. The speedup may not be detected, but at least a slowdown would be detected.
|
|||
| msg186406 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2013年04月09日 13:42 | |
Note that the patch uses type punning through a union: while GCC allows this, it's not allowed by ANSI (although since we're using a char [], it's somewhat of a grey area). An aggressive compiler could optimize the read/write away. |
|||
| msg186421 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年04月09日 14:59 | |
> Note that the patch uses type punning through a union
What is the standard and portable way to cast an array of bytes to size_t?
|
|||
| msg186439 - (view) | Author: Gregory P. Smith (gregory.p.smith) * (Python committer) | Date: 2013年04月09日 16:23 | |
> > Note that the patch uses type punning through a union
> > What is the standard and portable way to cast an array of bytes to size_t?
I'd expect just casting the pointer type before dereferencing:
unsigned char *p;
...
hash = (multiplier * hash) ^ *((Py_uhash_t *)p);
(don't use size_t, use Py_uhash_t)
|
|||
| msg186441 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013年04月09日 16:36 | |
> pybench and perf.py can be used to measure the performance of the patch.
> The speedup may not be detected, but at least a slowdown would be detected.
The slowdown would only occur for specific, well-chosen patterns. Also it may make DoS attacks easier. (Remember, the reason we randomized hashes is so that it's hard for attackers to find collisions.)
|
|||
| msg186442 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2013年04月09日 16:51 | |
> I'd expect just casting the pointer type before dereferencing:
>
> unsigned char *p;
> ...
> hash = (multiplier * hash) ^ *((Py_uhash_t *)p);
>
> (don't use size_t, use Py_uhash_t)
Is p guaranteed to be size_t aligned? If not, unaligned access can segfault (e.g. on Sparc IIRC).
> Also it may make DoS attacks easier.
Indeed. And the increase in collisions you demonstrated in your previous message worries me (both security- and performance-wise).
|
|||
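As an aside, the "fast but robust for unaligned data" variant Serhiy offered earlier usually takes one of two shapes: either use the memcpy idiom shown above for every word, or hash a few leading bytes one at a time until the pointer is aligned for Py_uhash_t and only then switch to direct word loads. A hedged sketch of the second shape (align_head and MULTIPLIER are hypothetical names, not from any attached patch):

```c
#include <stdint.h>
#include <stddef.h>

typedef size_t Py_uhash_t;
#define MULTIPLIER ((Py_uhash_t)1000003UL)

/* Consume single bytes until p is aligned for Py_uhash_t, updating the
 * running hash in *x; afterwards the caller may dereference Py_uhash_t
 * pointers directly, even on strict-alignment CPUs such as SPARC.
 * (Relies on sizeof(Py_uhash_t) being a power of two.) */
static const unsigned char *
align_head(const unsigned char *p, size_t *len, Py_uhash_t *x)
{
    while (*len > 0 && ((uintptr_t)p & (sizeof(Py_uhash_t) - 1)) != 0) {
        *x = MULTIPLIER * (*x) ^ (Py_uhash_t)(*p++);
        (*len)--;
    }
    return p;
}
```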
| msg186443 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2013年04月09日 17:02 | |
> Is p guaranteed to be size_t aligned?
> If not, unaligned access can segfault (e.g. on Sparc IIRC).
Apparently yes.
|
|||
| msg186481 - (view) | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2013年04月10日 07:42 | |
> Note that the patch uses type punning through a union: while GCC allows
> this, it's not allowed by ANSI.
I believe it's legal under C99 + TC3.
|
|||
| msg190475 - (view) | Author: Lukas Lueg (ebfe) | Date: 2013年06月02日 09:30 | |
I was investigating a callgrind dump of my code, showing how badly unicode_hash() was affecting my performance. Using Google's CityHash instead of the built-in algorithm to hash unicode objects improves overall performance by about 15 to 20 percent for my case, which is quite a lot. Valgrind shows that the number of instructions spent in unicode_hash() drops from ~20% to ~11%; Amdahl's law crunches that two-fold improvement down to the mentioned 15 percent overall. CityHash was chosen because of its MIT license and its advertised performance on short strings.
I've now found this bug and attached a log of haypo's benchmark comparing the native algorithm vs. CityHash. Caching was disabled during the test. CityHash was compiled using -O3 -msse4.2 (CityHash uses CPU-native CRC instructions). CPython's unit tests fail due to known_hash and the gdb output; besides that, everything else seems to work fine.
CityHash is advertised for its performance on short strings, which does not seem to show in the benchmark. However, longer strings perform *much* better. If people are interested, I can repeat the test on an armv7l.
|
|||
| msg190476 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013年06月02日 09:57 | |
> I was investigating a callgrind dump of my code, showing how badly
> unicode_hash() was affecting my performance.
Can you tell us about your use case?
There are several CityHash variants; which one did you use? CityHash64?
|
|||
| msg190478 - (view) | Author: Lukas Lueg (ebfe) | Date: 2013年06月02日 10:10 | |
It's a cache sitting between an Informix DB and an internal web service. Stuff comes out of the DB, gets processed, JSON'ified, cached and put on the wire. 10**6s of strings pass through this process per request if uncached...
I use CityHash64WithSeed, the seed being CPython's hash prefix (which I don't care about but found reassuring to put in anyway).
|
|||
| msg190488 - (view) | Author: Lukas Lueg (ebfe) | Date: 2013年06月02日 14:21 | |
Here are some benchmarks for an armv7l on an RK30 board. CityHash was compiled with -mcpu=native -O3. CityHash is around half as fast as the native algorithm for small strings and way, way slower on larger ones. My guess would be that the complex arithmetic in CityHash outweighs the gains from better scheduling. The results are somewhat inconclusive, as the performance increases again for very long strings. |
|||
| msg190489 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013年06月02日 14:34 | |
> Here are some benchmarks for an armv7l on an RK30 board. CityHash was
> compiled with -mcpu=native -O3.
The results look unbelievable. If you take "Length 10 ** 4", it means the armv7l is able to hash 20 GB/s using the default unicode hash function. (Did you disable caching?)
|
|||
| msg190491 - (view) | Author: Lukas Lueg (ebfe) | Date: 2013年06月02日 17:06 | |
The 10**4 case is an error (see the insane percentage); I've never been able to reproduce it. Having done more tests with the CPU frequency fixed and other daemons' process priority reduced, CityHash always comes out much slower on armv7l. |
|||
| msg190676 - (view) | Author: Lukas Lueg (ebfe) | Date: 2013年06月05日 16:30 | |
Here are more benchmarks of vanilla 3.4 vs. CityHash vs. fast_hash_3 on both armv7l and x86-64. The patch was applied verbatim, with only caching disabled. On armv7l, the CPU was fixed at its maximum frequency (it seems to take ages to switch frequencies; at least there is a lot of jitter with ondemand). The CityHash implementation was compiled with -O3 on both platforms and -msse4.2 on x86-64.
CityHash and fh3 come out much better than vanilla on x86-64, with CityHash being slightly faster (which is surprising). On ARMv7, CityHash performs much worse than vanilla while fh3 performs significantly better.
|
|||
| msg190704 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年06月06日 00:48 | |
I'm not sure that micro-benchmarks are relevant for this issue. Hash collisions may have a higher cost than the gain from a faster hash function. Some people are also concerned about the dict denial-of-service issue.
It would be interesting to compare performance using the benchmark suite: http://hg.python.org/benchmarks/
|
|||
| msg201493 - (view) | Author: Christian Heimes (christian.heimes) * (Python committer) | Date: 2013年10月27日 20:27 | |
Antoine, I have addressed your concern "Well, the quality of the hash function is clearly reduced" in patch http://hg.python.org/features/pep-456/rev/765930d944a5
>>> s = set()
>>> for i in range(256):
...     s.add(hash("abcdfeg" + chr(i)) & 0xff)
...
>>> len(s)
256
>>> s = set()
>>> for i in range(256):
...     s.add(hash("abcdfeghabcdefg" + chr(i)) & 0xff)
...
>>> len(s)
256
|
|||
| msg201525 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013年10月28日 11:06 | |
Shouldn't this issue be obsoleted? |
|||
| msg201529 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年10月28日 11:58 | |
Since issue19183 supersedes it, yes. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:38 | admin | set | github: 60631 |
| 2013年10月28日 11:58:03 | serhiy.storchaka | set | status: open -> closed; superseder: PEP 456 Secure and interchangeable hash algorithm; messages: + msg201529; resolution: duplicate; stage: patch review -> resolved |
| 2013年10月28日 11:06:44 | pitrou | set | messages: + msg201525 |
| 2013年10月27日 20:28:49 | christian.heimes | set | assignee: christian.heimes |
| 2013年10月27日 20:27:40 | christian.heimes | set | messages: + msg201493 |
| 2013年06月06日 00:48:54 | vstinner | set | messages: + msg190704 |
| 2013年06月05日 16:30:17 | ebfe | set | files: + cityhash_fasthast3.txt |
| 2013年06月05日 16:30:04 | ebfe | set | files: + cityhash_fashhash3_arm.txt; messages: + msg190676 |
| 2013年06月02日 17:06:53 | ebfe | set | messages: + msg190491 |
| 2013年06月02日 14:34:21 | pitrou | set | messages: + msg190489 |
| 2013年06月02日 14:21:08 | ebfe | set | files: + cityhash_arm.txt; messages: + msg190488 |
| 2013年06月02日 10:10:32 | ebfe | set | messages: + msg190478 |
| 2013年06月02日 09:57:29 | pitrou | set | messages: + msg190476 |
| 2013年06月02日 09:30:23 | ebfe | set | files: + cityhash.txt; nosy: + ebfe; messages: + msg190475 |
| 2013年04月23日 22:04:30 | isoschiz | set | nosy: + isoschiz |
| 2013年04月10日 07:42:41 | mark.dickinson | set | messages: + msg186481 |
| 2013年04月09日 17:02:58 | neologix | set | messages: + msg186443 |
| 2013年04月09日 16:51:46 | neologix | set | messages: + msg186442 |
| 2013年04月09日 16:36:54 | pitrou | set | messages: + msg186441 |
| 2013年04月09日 16:23:16 | gregory.p.smith | set | messages: + msg186439 |
| 2013年04月09日 14:59:40 | vstinner | set | messages: + msg186421 |
| 2013年04月09日 13:42:57 | neologix | set | nosy: + neologix; messages: + msg186406 |
| 2013年04月09日 13:40:02 | vstinner | set | messages: + msg186405 |
| 2013年04月09日 13:34:33 | pitrou | set | messages: + msg186403 |
| 2013年04月09日 02:06:06 | rhettinger | set | nosy: + rhettinger |
| 2013年04月08日 21:59:22 | vstinner | set | messages: + msg186353 |
| 2013年04月08日 21:51:18 | vstinner | set | files: + bench_hash.py |
| 2013年04月08日 21:51:03 | vstinner | set | files: + bench_hash.txt; messages: + msg186352 |
| 2013年04月08日 10:13:28 | serhiy.storchaka | set | files: - fast_hash_2.patch |
| 2013年04月08日 10:13:13 | serhiy.storchaka | set | files: - fast_str_hash.patch |
| 2013年04月07日 23:18:12 | vstinner | set | nosy: + vstinner |
| 2012年11月15日 15:51:27 | asvetlov | set | nosy: + asvetlov |
| 2012年11月11日 19:12:04 | python-dev | set | nosy: + python-dev; messages: + msg175385 |
| 2012年11月11日 14:19:22 | serhiy.storchaka | set | files: + fast_hash_3.patch; messages: + msg175367 |
| 2012年11月11日 12:45:05 | skrah | set | messages: + msg175361 |
| 2012年11月11日 12:20:51 | serhiy.storchaka | set | messages: + msg175353 |
| 2012年11月11日 11:55:35 | skrah | set | messages: + msg175351 |
| 2012年11月11日 11:41:43 | pitrou | set | messages: + msg175349 |
| 2012年11月11日 09:13:51 | serhiy.storchaka | set | messages: + msg175333 |
| 2012年11月11日 08:53:47 | serhiy.storchaka | set | messages: + msg175332 |
| 2012年11月11日 08:50:20 | serhiy.storchaka | set | files: + fast_str_hash.patch; messages: + msg175330 |
| 2012年11月10日 23:37:41 | pitrou | set | files: + test_hash_align.patch; messages: + msg175303 |
| 2012年11月10日 23:27:39 | pitrou | set | nosy: + tim.peters, mark.dickinson, skrah |
| 2012年11月10日 23:26:40 | pitrou | set | nosy: + pitrou; messages: + msg175302; stage: patch review |
| 2012年11月10日 22:43:17 | serhiy.storchaka | set | files: - fast_hash.patch |
| 2012年11月10日 22:39:51 | serhiy.storchaka | set | files: + fast_hash_2.patch; messages: + msg175300 |
| 2012年11月10日 22:01:56 | gregory.p.smith | set | nosy: + gregory.p.smith; messages: + msg175298 |
| 2012年11月09日 21:48:10 | jcea | set | nosy: + jcea |
| 2012年11月07日 12:45:53 | christian.heimes | set | nosy: + christian.heimes |
| 2012年11月07日 12:38:49 | serhiy.storchaka | create | |