Tuesday, March 10, 2009
Hashing, Shingling and HashTrees
Hashing is my favorite methodology for data mining, machine learning and information retrieval. For computing “similarities” between a set of N objects (such as text, images, videos, web pages, etc), you need two steps:
Another interesting hash technique is the HashTree construction. Here an order set of items (itemset) is hashed with an hash function. The hash of item in position i-th is used to select a specific children of the node currently visited. In short, hashing is used to drive the visit of a path from the root to a leaf. Each leave of the HashTree contains a bucket, where similar itemsets are stored. The main difference with shingling is that order is preserved and that to items are considered as candidate for similarity if their hash collides in the first k positions. It turns our that this property if very useful for computing frequent itemsets in the Apriori algorithm. Here you have the C++ code for HashTree.
- define a distance metric among objects;
- compute the matrix of distances in O(N^2);
- define the fingerprint of the object;
- map the object to buckets; one different bucket for each different fingerprint;
- compare directly the objects in the same bucket.
Another interesting hash technique is the HashTree construction. Here an order set of items (itemset) is hashed with an hash function. The hash of item in position i-th is used to select a specific children of the node currently visited. In short, hashing is used to drive the visit of a path from the root to a leaf. Each leave of the HashTree contains a bucket, where similar itemsets are stored. The main difference with shingling is that order is preserved and that to items are considered as candidate for similarity if their hash collides in the first k positions. It turns our that this property if very useful for computing frequent itemsets in the Apriori algorithm. Here you have the C++ code for HashTree.
Subscribe to:
Post Comments (Atom)
2 comments:
Very useful blog. Thank you ..
Reply Deletethe c++ code is corrupted. Can you please check that.
Reply Delete