I have a set of n elements (1,000 <= n <= 100,000), and I can compute the grade of similarity between each pair, that is, a value from 0 (very similar) to 1 (very different). I would like to cluster the elements based on their grade of similarity.
I thought about representing them as a graph: the elements are the vertices, and the weighted edges represent the similarity between them. I read about the MCL algorithm, but I don't think it is the best approach since my graph is complete.
On the other hand, since there are a lot of elements, computing the similarity between every pair may not be the best practice (I want a fast algorithm). I also read about leader clustering algorithms but, again, I am not sure they are the best approach because, as far as I know, they are quite prone to fail due to their greediness (I would like something more robust).
Edit: I forgot to mention that I know a threshold: when the comparison between two elements is higher than it, I know that they belong to different clusters.
-
If I know Similarity(a,b) and Similarity(b,c), does that tell me (or allow me to bound) Similarity(a,c)? – Ben Aaronson, Mar 19, 2015 at 13:49
-
No. Sometimes it does, but I never know when it is likely to occur. – ibci, Mar 19, 2015 at 14:03
-
The type of clustering algorithm you might want to use (given your requirements) is probably complete-linkage clustering, which is necessary to avoid the chaining problem. If it is more complex than that, you may have to reconsider how that "similarity" is defined, e.g. if it is originally computed from a vector of scores, you may have to use that whole vector instead of a single similarity value. – rwong, Mar 19, 2015 at 15:33
2 Answers
I don't think any meaningful clustering is possible if similarity(a,b) and similarity(b,c) don't upper-bound similarity(a,c). To demonstrate, let's consider the following simple (and extreme) example with only 3 items:

similarity(a,b) == 0
similarity(b,c) == 0
similarity(a,c) == 1

a should thus be in the same cluster as b, and b in the same cluster as c. But a and c should be in different clusters, which contradicts the previous expectations.
This is a spectral clustering problem, which has been studied in the research community for a long time. Generally speaking, a spectral clustering algorithm uses eigenvalue (a.k.a. spectrum) analysis of the similarity matrix to split the data into two or more clusters at a time. Each split is globally optimized in some sense, which tends to produce a good overall final clustering.
The Wikipedia entry on spectral clustering gives more details.
PS: Conventionally, an element of a similarity matrix (a similarity measure for two objects) has a smaller value for dissimilar objects and a larger value for similar ones, which is the opposite of the scale used in the question.
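As a minimal sketch of how this could look in practice (assuming scikit-learn is acceptable, and using a toy 6-item matrix I made up for illustration): because of the inverted scale noted in the PS, the question's dissimilarity values must first be converted into affinities, e.g. affinity = 1 - dissimilarity, before being handed to a spectral method.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

# Toy dissimilarity matrix: items 0-2 and 3-5 form two obvious groups
# (0 = very similar, 1 = very different, matching the question's scale).
n = 6
dissim = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        if (i < 3) == (j < 3):
            d = rng.uniform(0.0, 0.1)   # same group: nearly identical
        else:
            d = rng.uniform(0.8, 1.0)   # different group: very different
        dissim[i, j] = dissim[j, i] = d

# Spectral clustering expects an affinity (similarity) matrix,
# so invert the scale: similar pairs get values near 1.
affinity = 1.0 - dissim

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(labels)
```

Note that this works on a precomputed n-by-n matrix, so it does not by itself address the question's wish to avoid computing all O(n^2) pairwise similarities.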
-
An answer that amounts to 'go google it' is not a strong answer. Please provide information about the nature of the problem and a brief summary of the information. If appropriate, include references to your summary for further reading for the OP, though these references should not make up the bulk of the answer. – user40980, Apr 12, 2015 at 16:23