
I have a set of n elements (1,000 <= n <= 100,000) and I can compute a grade of similarity between each pair: a value from 0 (very similar) to 1 (very different). I would like to cluster the elements based on this grade of similarity.

I thought about representing them as a graph: the elements are the vertices and the edge weights are the similarities between them. I read about the MCL algorithm, but I think it isn't the best approach since my graph is complete.

On the other hand, as there are a lot of elements, computing the similarity between every pair may not be practical (I want a fast algorithm). I also read about leader clustering algorithms but, again, I am not sure they are the best approach because, as far as I know, they are quite prone to fail due to their greediness (I would like something more robust).

Edit: I forgot to mention that I know a threshold: when the grade of similarity between two elements is higher than it, I know that they belong to different clusters.

asked Mar 19, 2015 at 13:44
  • If I know Similarity(a,b) and Similarity(b,c), does that tell me (or allow me to bound) Similarity(a,c)? Commented Mar 19, 2015 at 13:49
  • No. Sometimes it does, but I never know when it is likely to occur. Commented Mar 19, 2015 at 14:03
  • The type of clustering algorithm you might want to use (given your requirements) is probably complete-linkage clustering, which is needed to avoid the chaining problem (a sketch follows these comments). If your problem is more complex than that, you may have to reconsider how that "similarity" is defined - e.g. if it is originally computed from a vector of scores, you may have to use the whole vector instead of a single similarity value. Commented Mar 19, 2015 at 15:33
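
As a quick illustration of that suggestion (a sketch only, assuming SciPy is available and using made-up data; the names D and threshold are placeholders), complete-linkage clustering on a precomputed dissimilarity matrix can be cut at the known threshold, which guarantees that no cluster contains a pair more dissimilar than the threshold:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    # Illustrative data: a symmetric n x n dissimilarity matrix in [0, 1],
    # where 0 means "very similar" and 1 means "very different".
    rng = np.random.default_rng(0)
    n = 200
    D = rng.random((n, n))
    D = (D + D.T) / 2          # symmetrize
    np.fill_diagonal(D, 0.0)   # an element is identical to itself

    threshold = 0.7            # placeholder for the known threshold

    # Complete-linkage agglomerative clustering on the precomputed matrix.
    Z = linkage(squareform(D), method='complete')

    # Cutting the dendrogram at the threshold ensures that every pair
    # inside a cluster has dissimilarity <= threshold.
    labels = fcluster(Z, t=threshold, criterion='distance')
    print(len(set(labels)), "clusters")

Note that the full matrix has about n*(n-1)/2 distinct pairs, roughly 5 x 10^9 for n = 100,000, so this exact approach only fits in memory toward the lower end of the stated range; beyond that, subsampling or a sparser representation is needed.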

2 Answers


I don't think any meaningful clustering is possible if similarity(a,b) and similarity(b,c) don't upper-bound similarity(a,c). To demonstrate, let's consider the following simple (and extreme) example with only 3 items:

  • similarity(a,b) == 0
  • similarity(b,c) == 0
  • similarity(a,c) == 1

Since 0 means "very similar" here, a should be in the same cluster as b, and b in the same cluster as c. But similarity(a,c) == 1 means a and c should be in different clusters, which contradicts the first two expectations.
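
To see how this plays out in standard tooling, here is a small sketch (assuming SciPy; the values are exactly those of the example above) that runs the three items through single-linkage and complete-linkage hierarchical clustering. Single linkage chains a, b and c into one cluster even though similarity(a,c) == 1, while complete linkage keeps a and c apart, at the price of separating b from one of its two perfect matches (which one depends on tie-breaking):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Condensed dissimilarity vector for the items (a, b, c),
    # ordered as d(a,b), d(a,c), d(b,c).
    d = np.array([0.0, 1.0, 0.0])

    for method in ('single', 'complete'):
        Z = linkage(d, method=method)
        labels = fcluster(Z, t=0.5, criterion='distance')
        print(method, labels)

    # single   -> one cluster containing a, b and c (chaining)
    # complete -> two clusters, with a and c kept apart

Either way, one of the two expectations above has to be given up, which is the point of the answer.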

answered Apr 11, 2015 at 22:38

This can be framed as a spectral clustering problem, which has been studied for a long time. Generally speaking, a spectral clustering algorithm uses the eigenvalues (the spectrum) of a similarity matrix, or of a graph Laplacian derived from it, to split the data into two or more clusters at a time. Each split is globally optimized in some sense, which tends to lead to good overall results in the final clustering.

The Wikipedia entry on spectral clustering gives more details.

PS: Conventionally, an element of a similarity matrix (a similarity measure for two objects) has a smaller value for dissimilar objects and a larger value for similar ones, which is the opposite of the convention used in the question.
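
As a concrete illustration, here is a minimal sketch (assuming scikit-learn; D is a made-up dissimilarity matrix following the question's convention, and k is an assumed number of clusters, which spectral clustering needs as an input or as an estimate from the eigenvalue spectrum):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    # Illustrative symmetric n x n dissimilarity matrix in [0, 1],
    # with 0 = very similar and 1 = very different (question's convention).
    rng = np.random.default_rng(0)
    n = 200
    D = rng.random((n, n))
    D = (D + D.T) / 2
    np.fill_diagonal(D, 0.0)

    # Spectral clustering expects an affinity (similarity) matrix where
    # larger means more similar, so invert the question's convention.
    A = 1.0 - D

    k = 5  # assumed number of clusters
    labels = SpectralClustering(n_clusters=k,
                                affinity='precomputed',
                                random_state=0).fit_predict(A)
    print(np.bincount(labels))

Keep in mind that a dense n x n affinity matrix is O(n^2) in memory, so toward n = 100,000 you would need a sparse affinity (for example a k-nearest-neighbour graph) rather than the full matrix.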

answered Apr 12, 2015 at 5:22
  • An answer that amounts to 'go google it' is not a strong answer. Please provide information about the nature of the problem and a brief summary of the information. If appropriate, include references to your summary for further reading for the OP, though these references should not make up the bulk of the answer. Commented Apr 12, 2015 at 16:23
