This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.
Created on 2010-11-03 22:44 by ebfe, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Messages (6) | | | |
|---|---|---|---|
| msg120351 | Author: Lukas Lueg (ebfe) | Date: 2010-11-03 22:44 | |
The objects provided by hashlib mainly serve the purpose of computing hashes over strings of arbitrary size. The user creates a new object (e.g. hashlib.sha1()), calls .update() with chunks of data and then finally uses .digest() or .hexdigest() to get the hash. For convenience, these steps can also be collapsed into almost one step (e.g. hashlib.sha1('foobar').hexdigest()).
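The incremental and one-shot forms described above produce identical results; a minimal illustration with the real hashlib API:

```python
import hashlib

# Incremental hashing: feed chunks via update(), then read the digest.
h = hashlib.sha1()
h.update(b"foo")
h.update(b"bar")
incremental = h.hexdigest()

# One-shot convenience form: construct and digest in a single expression.
oneshot = hashlib.sha1(b"foobar").hexdigest()

assert incremental == oneshot
```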
While the above approach basically covers all use cases for hash functions, it is still inefficient when computing hashes of many small strings (e.g. due to interpreter overhead) and leaves no room for performance improvements.
There are many cases where we need the hashes of numerous (small) objects, most or all of which are available in memory at the same time.
I therefore propose extending the classes provided by hashlib with an additional function that takes an iterable object, computes the hash over the string representation of each member and returns the results. Given the aim of this interface, the function is a member of the class (not the instance) and therefore has no state bound to an instance. Memory requirements are to be anticipated and met by the programmer.
For example:

```
foo = ['my_database_key1', 'my_database_key2']
hashlib.sha1.compute(foo)
>>> ('\x00\x00', '\xff\xff')
```
I consider this interface to hashlib particularly useful, as we can take advantage of vector-based implementations that compute multiple hashes in one pass (e.g. through SSE2). GCC has a vector extension that provides a *somewhat* standard way to write code that can be compiled to SSE2 or similar machine code. Examples of vector-based implementations of SHA1 and MD5 can be found at https://code.google.com/p/pyrit/issues/detail?id=207
Contingency plan: we fall back to code iterating over OpenSSL's EVP functions if the compiler is other than GCC or SSE2 is not available. The same approach can cover hashlib objects for which we don't have an optimized implementation.
|
|||
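For reference, the proposed classmethod can be emulated today with a plain loop over the existing API; this pure-Python baseline is what a vectorized implementation would aim to beat. (`batch_digest` is a hypothetical helper name, not part of hashlib.)

```python
import hashlib

def batch_digest(name, items):
    """Hash each item with a fresh hash object of algorithm `name`.
    Pure-Python stand-in for the proposed hashlib.sha1.compute();
    this helper is hypothetical, not part of the hashlib module."""
    return tuple(hashlib.new(name, item).digest() for item in items)

foo = [b'my_database_key1', b'my_database_key2']
digests = batch_digest('sha1', foo)  # one 20-byte SHA-1 digest per key
```

Each call constructs a fresh hash object per item, which is exactly the interpreter overhead the proposal wants to amortize.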
| msg120441 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010-11-04 21:07 | |
This sounds dangerously like a micro-optimization to me, and I'm not sure an additional API is ok for that. |
|||
| msg120450 | Author: Lukas Lueg (ebfe) | Date: 2010-11-04 21:40 | |
Thanks for your comment; it is a very valid point to consider. However, as a vector-based implementation is roughly three to four times faster than what the current API can provide by design (reduced overhead and GIL-relaxation not included), I may disagree with it. I'm willing to make a proposal (read: patch) if you and the other overlords have enough confidence in this API-change and it has a chance to get submitted. |
|||
| msg120455 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010-11-04 21:55 | |
> Thanks for your comment; it is a very valid point to consider. However, as a vector-based implementation is roughly three to four times faster than what the current API can provide by design (reduced overhead and GIL-relaxation not included), I may disagree with it.

I fear that this kind of optimization is a moving target. If some CPU architecture gains hardware support for cryptographic hashes, OpenSSL will support it and be much faster than Python's parallel version.

> I'm willing to make a proposal (read: patch) if you and the other overlords have enough confidence in this API-change and it has a chance to get submitted.

If the only advantage is some speedup on parallel hashing of many-small-strings, I don't think a new API is warranted for such a specialized use case. |
|||
| msg120457 | Author: Raymond Hettinger (rhettinger) * (Python committer) | Date: 2010-11-04 22:20 | |
I concur with Overlord Antoine ;-) The API of "hashfunc *files" looks like something that should be done at the command line in bash. For the most part, I think users are better-off using normal Python loops so they can do something with the computed hashes as the files are read. |
|||
| msg120564 | Author: Gregory P. Smith (gregory.p.smith) * (Python committer) | Date: 2010-11-06 00:27 | |
This is an application-specific optimization that I'd like to see as its own library or a feature added to NSS or OpenSSL. Regardless, if you want this for Python, write an extension module and put it on PyPI before aiming for the stdlib. I understand the optimization potential, but it is a bit obscure. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:08 | admin | set | github: 54511 |
| 2010-11-06 00:27:23 | gregory.p.smith | set | status: open -> closed resolution: rejected messages: + msg120564 |
| 2010-11-04 22:20:14 | rhettinger | set | nosy: + rhettinger messages: + msg120457 |
| 2010-11-04 21:55:26 | pitrou | set | messages: + msg120455 |
| 2010-11-04 21:40:23 | ebfe | set | messages: + msg120450 |
| 2010-11-04 21:07:59 | pitrou | set | nosy: + gregory.p.smith, pitrou messages: + msg120441 |
| 2010-11-03 22:44:16 | ebfe | create | |