Appropriate graph clustering algorithm

Question 1

I'm looking for an appropriate technique to search for clusters. My underlying data is the responses of 70,000 respondents to about 2,500 multiple-choice questions. Most respondents have not answered most questions. I have no expectations as to how many clusters there might be or how clustered this data is, but I am interested in exploring whether there are personality types with respect to the kinds of questions asked and, if so, how well defined they are.

My thinking is that the best approach might be to transform the data into a graph and run a cluster analysis on that. Each node is a respondent, and each edge carries a distance between 0 and 1 derived from how similarly the two respondents answered whatever questions they both responded to (or there is no edge at all if they answered very few common questions). However, I'm struggling to identify an appropriate algorithm for this situation.

Note: it seems likely that this graph setup will violate the triangle inequality, i.e. there is no reason to suppose that $\operatorname{dist}(i, j) \leq \operatorname{dist}(j, k) + \operatorname{dist}(k, i)$ will always hold.
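To make the note above concrete, here is a minimal sketch of one possible distance, assuming it is defined as the fraction of co-answered questions on which two respondents disagree. The function name `pairwise_distance`, the `min_common` threshold, and the three respondents are all made up for illustration; the question does not specify the exact measure.

```python
from typing import Dict, Optional

Answers = Dict[int, str]  # question id -> chosen option (hypothetical layout)


def pairwise_distance(a: Answers, b: Answers, min_common: int = 1) -> Optional[float]:
    """Fraction of co-answered questions on which two respondents disagree.

    Returns None (meaning "no edge") when fewer than `min_common`
    questions were answered by both respondents.
    """
    common = a.keys() & b.keys()
    if len(common) < min_common:
        return None
    disagreements = sum(1 for q in common if a[q] != b[q])
    return disagreements / len(common)


# Three made-up respondents, each pair sharing exactly one question.
resp_i = {1: "A", 2: "B"}
resp_j = {1: "C", 3: "D"}
resp_k = {2: "B", 3: "D"}

d_ij = pairwise_distance(resp_i, resp_j)  # they disagree on question 1
d_jk = pairwise_distance(resp_j, resp_k)  # they agree on question 3
d_ki = pairwise_distance(resp_k, resp_i)  # they agree on question 2

print(d_ij, d_jk, d_ki)
print("triangle inequality holds:", d_ij <= d_jk + d_ki)
```

Because each pair is compared on a different set of questions, `resp_i` and `resp_j` end up far apart even though both are at distance zero from `resp_k`, which is exactly the kind of violation described.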

Question 2

That would make the question on-topic, yes. Since graph clustering is quite an established field, it would also help if you gave some indication of what research you'd done to try to find an appropriate algorithm.

Question 3

I would treat each multiple-choice answer as a binary "feature" of a given user. If we assume each of your 2,500 multiple-choice questions had 4 possible answers, you would then have a sparse matrix of size 70,000 × 10,000 (one row per respondent, one column per question/answer pair). You could have a varying number of answers per question and still use this approach.
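As a rough sketch of that encoding, assuming each respondent's answers arrive as a dictionary from question id to chosen option (a hypothetical layout), each (question, option) pair can be given its own column and each respondent stored as the set of columns that are "on". The helper name `one_hot_rows` is invented for this example; in practice a library sparse matrix type would likely replace the list of sets.

```python
from typing import Dict, List, Set, Tuple


def one_hot_rows(
    responses: List[Dict[int, str]],
) -> Tuple[List[Set[int]], Dict[Tuple[int, str], int]]:
    """Assign a column to every observed (question, option) pair.

    Each respondent becomes the set of column indices they selected,
    which is a simple sparse representation of a 0/1 feature row.
    """
    column_of: Dict[Tuple[int, str], int] = {}
    rows: List[Set[int]] = []
    for answers in responses:
        row: Set[int] = set()
        for question, option in sorted(answers.items()):
            # New pairs get the next free column index.
            col = column_of.setdefault((question, option), len(column_of))
            row.add(col)
        rows.append(row)
    return rows, column_of


# Made-up data: three respondents, three questions, mostly unanswered.
responses = [
    {1: "A", 2: "B"},
    {1: "A"},
    {2: "C", 3: "D"},
]
rows, column_of = one_hot_rows(responses)
print(rows)
print(len(column_of), "columns")
```

Note that this sketch only creates columns for answers that actually occur, so the column count can be smaller than questions times options, and a varying number of options per question needs no special handling.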

You could then cluster this via k-means or similar data clustering.
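For illustration, here is a bare-bones version of Lloyd's k-means iteration on a few made-up one-hot rows. It assumes the number of clusters and the starting centroids are supplied by hand, which is a simplification; the function names and data are hypothetical, and a real run on data of this size would use a library implementation rather than this sketch.

```python
from typing import List, Sequence, Tuple


def squared_distance(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(
    points: List[List[float]],
    initial_centroids: List[List[float]],
    max_iter: int = 100,
) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd iteration: assign points, recompute means, repeat."""
    centroids = [list(c) for c in initial_centroids]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Four made-up respondents over four one-hot columns.
points = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[2]])
print(labels)
print(centroids)
```

One thing to keep in mind with this encoding is that an unanswered question shows up as zeros, so two respondents who skipped the same questions look similar to k-means even though they expressed no shared opinion.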

If you really want to do graph clustering, you could treat the sparse matrix as a bipartite graph with 70,000 + 10,000 nodes, where an edge exists wherever a user has given a certain answer. You could then apply graph clustering techniques like modularity maximization or graph partitioning.
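As a small sketch of what modularity maximization optimizes, the snippet below builds a tiny bipartite edge list (respondent nodes on one side, answer nodes on the other) and scores two candidate partitions with the standard Newman modularity, $Q = \sum_c \left[ L_c/m - (d_c/2m)^2 \right]$, where $L_c$ is the number of edges inside community $c$, $d_c$ is the total degree of its nodes, and $m$ is the number of edges. The node names and the `modularity` helper are invented for this example. It only evaluates a given partition and does not search for the best one, and bipartite-specific variants of modularity exist that may suit this graph better.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def modularity(edges: List[Edge], community_of: Dict[Hashable, int]) -> float:
    """Standard (unipartite) Newman modularity of a given partition."""
    m = len(edges)
    internal = defaultdict(int)  # edges with both ends in the community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Made-up bipartite graph: respondents r1..r4, answer nodes "q1=A" and "q2=B".
edges = [
    ("r1", "q1=A"),
    ("r2", "q1=A"),
    ("r3", "q2=B"),
    ("r4", "q2=B"),
]

# Partition that groups each answer with the respondents who chose it.
two_groups = {"r1": 0, "r2": 0, "q1=A": 0, "r3": 1, "r4": 1, "q2=B": 1}
# Trivial partition that puts every node in one community.
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

A modularity-maximizing algorithm would prefer the two-group partition here, since it scores higher than lumping everything together.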

Detecting these groups/clusters/communities falls under the category of unsupervised learning.

Question 4

k-means expects the parameter $k$ to be given, but the OP does not know it, so another technique for finding $k$ must be used (such as a Gaussian test or BIC), which makes k-means not that good a fit in this case. It also assumes that every data point belongs to some cluster, which is not always true or expected.

Question 5

Why do you expect this to be good? Why choose this model and algorithm over others?

dlasalle · Answer 1 · 2017-06-03 01:11:11Z
