I have a set of n elements (1,000 <= n <= 100,000), and I can compute the grade of similarity between each pair, that is, a value from 0 (very similar) to 1 (very different). I would like to cluster the elements based on their grade of similarity.
I thought about representing them as a graph: the elements are the vertices, and the weighted edges represent the similarity between them. I read about the MCL algorithm, but I don't think it is the best approach since my graph is complete.
On the other hand, since there are a lot of elements, computing the similarity between every pair may not be the best practice (I want a fast algorithm). I also read about leader clustering algorithms but, again, I am not sure they are the best approach because, as far as I know, they are quite prone to fail due to their greediness (I would like something more robust).
Edit: I forgot to mention that I know a threshold: when the comparison between two elements is higher than it, I know that they belong to different clusters.
-
If I know Similarity(a,b) and Similarity(b,c), does that tell me (or allow me to bound) Similarity(a,c)? – Ben Aaronson, Mar 19, 2015 at 13:49
-
No. Sometimes it does, but I never know when it is likely to occur. – ibci, Mar 19, 2015 at 14:03
-
The type of clustering algorithm you might want to use (given your requirements) is probably complete-linkage clustering, which is necessary to avoid the chaining problem. If it is more complex than that, you may have to reconsider how that "similarity" is defined, e.g. if it is originally computed from a vector of scores, you may have to use that whole vector instead of a single similarity value. – rwong, Mar 19, 2015 at 15:33
2 Answers
I don't think any meaningful clustering is possible if similarity(a,b) and similarity(b,c) don't upper-bound similarity(a,c). To demonstrate, let's consider the following simple (and extreme) example with only 3 items:

similarity(a,b) == 0
similarity(b,c) == 0
similarity(a,c) == 1

a should thus be in the same cluster as b, and b in the same cluster as c. But a and c should be in different clusters, which contradicts the previous expectations.
This is a spectral clustering problem, which has been studied in the research community for a long time. Generally speaking, a spectral clustering algorithm uses eigenvalue (a.k.a. spectrum) analysis of the similarity matrix to split the data into two or more clusters at a time. Each split is globally optimized in some sense, which tends to produce a good overall final clustering.
The Wikipedia entry on spectral clustering gives more details.
PS: Conventionally, an element of a similarity matrix (a similarity measure for two objects) has a smaller value for dissimilar objects and a larger value for similar ones, which is the opposite of the scale used in the question.
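As a minimal sketch of how this could look in practice (assuming scikit-learn is acceptable, and using a toy 6-item matrix I made up for illustration): because of the inverted scale noted in the PS, the question's dissimilarity values must first be converted into affinities, e.g. affinity = 1 - dissimilarity, before being handed to a spectral method.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

# Toy dissimilarity matrix: items 0-2 and 3-5 form two obvious groups
# (0 = very similar, 1 = very different, matching the question's scale).
n = 6
dissim = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        if (i < 3) == (j < 3):
            d = rng.uniform(0.0, 0.1)   # same group: nearly identical
        else:
            d = rng.uniform(0.8, 1.0)   # different group: very different
        dissim[i, j] = dissim[j, i] = d

# Spectral clustering expects an affinity (similarity) matrix,
# so invert the scale: similar pairs get values near 1.
affinity = 1.0 - dissim

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(labels)
```

Note that this works on a precomputed n-by-n matrix, so it does not by itself address the question's wish to avoid computing all O(n^2) pairwise similarities.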
-
An answer that amounts to 'go google it' is not a strong answer. Please provide information about the nature of the problem and a brief summary of the information. If appropriate, include references to your summary for further reading for the OP, though these references should not make up the bulk of the answer. – user40980, Apr 12, 2015 at 16:23