How costly are the Python dict and set in-built types?

Question 1

I have undertaken a project concerning database deduplication. I did some research and have found that the Python dict type actually is a hashmap that uses open addressing.

In the deduplication module, we'll have some rules that determine whether two records are identical, with the rules essentially spelling out the attributes that uniquely identify the record (did not call it a candidate key as the DB is going to be non-relational, no-sql). Now let's say we deal with a really large dataset. Naturally, hashing is the way to go ( found that advice here).

The question(s):

does the module need to compute a hash and then store it in a dict? Would that not be unnecessary, as the dict implementation is itself a hashmap?
how costly is the conversion of a list to a set? That conversion should remove all the duplicates, but given the huge scale, is that practical?
what is the cost incurred in checking membership in a dict/set using the "in" keyword?

Hadoop MapReduce is not an option at least for now.

I can't really dive into the Python sources to figure this out, as I'm strictly time limited :|

Question 2

What exactly do you mean by "conversion of a dict to a set"? A dict is a mapping from unique keys to values, and a set is a collection of unique values. So, what do you want to be in the set?

Question 3

Oh I'm sorry, I meant converting a list to a set :)

Question 4

The dict implementation is a hashmap, true, but is the hash function suitable for what you're doing? For deduplication, you probably want hash functions with large outputs that almost never collide, which is probably not ideal for a general-purpose in-memory hash.

Question 5

@DavidThornley yes, but I thought the collisions are taken care of internally using open addressing/closed hashing

Question 6

http://wiki.python.org/moin/TimeComplexity

That should pretty much cover everything. What more do you need that's not already on this page?

Question 7

searching a dict using in? let me guess: O(1) average and O(n) worst?

Question 8

@yatisagade: in is effectively try: __getitem__(key); return True; except KeyError: return False. So (generally) in generally behaves like __getitem__.

Question 9

How about space complexity?

S.Lott S.Lott 45.5k6 gold badges93 silver badges155 bronze badges · Accepted Answer · 2011-10-05 22:01:26Z

7

http://wiki.python.org/moin/TimeComplexity

That should pretty much cover everything. What more do you need that's not already on this page?

Share

Improve this answer

answered Oct 5, 2011 at 22:01

S.Lott's user avatar

S.Lott S.Lott

45.5k6 gold badges93 silver badges155 bronze badges

3

searching a dict using in? let me guess: O(1) average and O(n) worst?

yati sagade
– yati sagade

10/05/2011 22:15:44
Commented Oct 5, 2011 at 22:15
@yatisagade: in is effectively try: __getitem__(key); return True; except KeyError: return False. So (generally) in generally behaves like __getitem__.

S.Lott
– S.Lott

10/06/2011 00:20:48
Commented Oct 6, 2011 at 0:20
How about space complexity?

Pat
– Pat

10/14/2014 23:29:21
Commented Oct 14, 2014 at 23:29

Add a comment |

Stack Exchange Network

How costly are the Python dict and set in-built types?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

How costly are the Python dict and set in-built types?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions