I have undertaken a project concerning database deduplication. I did some research and have found that the Python dict type actually is a hashmap that uses open addressing.
In the deduplication module, we'll have some rules that determine whether two records are identical, with the rules essentially spelling out the attributes that uniquely identify the record (did not call it a candidate key as the DB is going to be non-relational, no-sql). Now let's say we deal with a really large dataset. Naturally, hashing is the way to go ( found that advice here).
The question(s):
- does the module need to compute a hash and then store it in a dict? Would that not be unnecessary, as the dict implementation is itself a hashmap?
- how costly is the conversion of a list to a set? That conversion should remove all the duplicates, but given the huge scale, is that practical?
- what is the cost incurred in checking membership in a dict/set using the "in" keyword?
Hadoop MapReduce is not an option at least for now.
I can't really dive into the Python sources to figure this out, as I'm strictly time limited :|
-
What exactly do you mean by "conversion of a dict to a set"? A dict is a mapping from unique keys to values, and a set is a collection of unique values. So, what do you want to be in the set?murgatroid99– murgatroid9910/05/2011 21:33:33Commented Oct 5, 2011 at 21:33
-
Oh I'm sorry, I meant converting a list to a set :)yati sagade– yati sagade10/05/2011 21:40:21Commented Oct 5, 2011 at 21:40
-
The dict implementation is a hashmap, true, but is the hash function suitable for what you're doing? For deduplication, you probably want hash functions with large outputs that almost never collide, which is probably not ideal for a general-purpose in-memory hash.David Thornley– David Thornley10/05/2011 21:59:19Commented Oct 5, 2011 at 21:59
-
@DavidThornley yes, but I thought the collisions are taken care of internally using open addressing/closed hashingyati sagade– yati sagade10/05/2011 22:12:08Commented Oct 5, 2011 at 22:12
1 Answer 1
http://wiki.python.org/moin/TimeComplexity
That should pretty much cover everything. What more do you need that's not already on this page?
-
searching a dict using in? let me guess: O(1) average and O(n) worst?yati sagade– yati sagade10/05/2011 22:15:44Commented Oct 5, 2011 at 22:15
-
@yatisagade:
in
is effectivelytry: __getitem__(key); return True; except KeyError: return False
. So (generally)in
generally behaves like__getitem__
.S.Lott– S.Lott10/06/2011 00:20:48Commented Oct 6, 2011 at 0:20 -
Explore related questions
See similar questions with these tags.