$\begingroup$

I'm looking for an appropriate technique to search for clusters. My underlying data is the responses of 70,000 respondents to about 2,500 multiple-choice questions. Most respondents have not answered most questions. I have no expectations as to how many clusters there might be or how clustered the data is, but I'm interested in exploring whether there are personality types with respect to the kinds of questions asked and, if so, how well defined they are.

My thinking is that the best approach might be to transform the data into a graph and run a cluster analysis on that. Each node is a respondent, and each edge carries a distance between 0 and 1 derived from how similarly the two respondents answered the questions they both responded to (or there is no edge at all if they answered very few questions in common). However, I'm struggling to identify an appropriate algorithm for this situation.
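To make the edge definition concrete, here is a minimal sketch of one possible distance: the fraction of co-answered questions on which two respondents disagree. The function name `answer_distance`, the `min_shared` threshold and the sample respondents are all hypothetical, chosen only for illustration; other similarity measures would fit the same graph setup.

```python
from typing import Dict, Optional

# Hypothetical representation: question id -> index of the chosen option.
Answers = Dict[int, int]


def answer_distance(a: Answers, b: Answers, min_shared: int = 3) -> Optional[float]:
    """Fraction of co-answered questions on which a and b disagree.

    Returns None (no edge) when fewer than min_shared questions overlap.
    """
    shared = a.keys() & b.keys()
    if len(shared) < min_shared:
        return None
    disagreements = sum(1 for q in shared if a[q] != b[q])
    return disagreements / len(shared)


alice = {1: 0, 2: 1, 3: 2, 4: 0}
bob = {2: 1, 3: 0, 4: 0, 5: 3}
carol = {7: 1, 8: 2}

print(answer_distance(alice, bob))    # shared questions 2, 3, 4; one disagreement
print(answer_distance(alice, carol))  # no shared questions, so no edge
```

The threshold matters: with very few shared questions the distance is dominated by noise, which is why such pairs get no edge here.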

Note: it seems likely that this graph setup will violate the triangle inequality, i.e. there is no reason to suppose that $\operatorname{dist}(i, j) \leq \operatorname{dist}(j, k) + \operatorname{dist}(k, i)$ will always hold.
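A small worked example of such a violation, assuming the distance is the fraction of disagreements over co-answered questions (this particular measure, the helper name `disagreement` and the three respondents are assumptions made for illustration):

```python
def disagreement(a, b):
    """Fraction of co-answered questions answered differently (assumed measure)."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # no common questions, so no edge
    return sum(1 for q in shared if a[q] != b[q]) / len(shared)


# i and j disagree on everything they share, but each agrees with k
# on the single question it shares with k.
i = {1: 0, 2: 0, 3: 0}
j = {1: 1, 2: 1, 4: 0}
k = {3: 0, 4: 0}

d_ij = disagreement(i, j)  # shared questions 1 and 2, both answered differently
d_jk = disagreement(j, k)  # shared question 4, answered identically
d_ki = disagreement(k, i)  # shared question 3, answered identically

print(d_ij, d_jk, d_ki)
print("triangle inequality holds:", d_ij <= d_jk + d_ki)
```

Because each pair is compared on a different subset of questions, the three distances are not constrained by one another.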

ice1000
asked May 28, 2017 at 21:41
$\endgroup$
  • $\begingroup$ That would make the question on-topic, yes. Since graph clustering is quite an established field, it would also help if you gave some indication of what research you'd done to try to find an appropriate algorithm. $\endgroup$ Commented May 29, 2017 at 9:57

1 Answer

$\begingroup$

I would treat each multiple-choice answer as a "feature" of a given user, i.e. one binary column per (question, option) pair. If we assume your 2,500 multiple-choice questions each had 4 possible answers, you would then have a sparse matrix that is 70,000 × 10,000. You could have a varying number of answers per question and still use this approach.
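As a rough sketch of that encoding, the snippet below maps each (question, option) pair to a column index and stores every respondent as the sorted list of columns that are set. The function names and the toy data are hypothetical; in practice you would probably hand these indices to a sparse matrix type from a numerical library rather than keep plain lists.

```python
from itertools import accumulate


def column_offsets(options_per_question):
    """First column index of each question's block of option columns."""
    return [0] + list(accumulate(options_per_question))[:-1]


def one_hot_rows(responses, options_per_question):
    """Encode each respondent as the sorted column indices that equal 1.

    responses: one dict per respondent, question index -> chosen option index.
    Unanswered questions simply contribute no columns, so rows stay sparse.
    """
    offsets = column_offsets(options_per_question)
    return [
        sorted(offsets[q] + opt for q, opt in answers.items())
        for answers in responses
    ]


# Three questions with 4, 3 and 2 options: 9 feature columns in total.
options = [4, 3, 2]
responses = [
    {0: 1, 2: 0},  # answered questions 0 and 2
    {1: 2},        # answered question 1 only
]

rows = one_hot_rows(responses, options)
print(rows)
print("matrix shape:", (len(responses), sum(options)))
```

Using per-question offsets rather than a fixed multiplier is what lets the number of options vary from question to question.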

You could then cluster this via k-means or similar data clustering.

If you really want to do graph clustering, you could treat the sparse matrix as a bipartite graph with 70,000 + 10,000 nodes, where an edge exists wherever a user has given a certain answer. You could then apply graph clustering techniques like modularity maximization or graph partitioning.
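To illustrate what modularity maximization optimizes, here is a minimal sketch that builds a tiny user/answer bipartite edge list and scores two candidate partitions. It uses the standard Newman modularity for unweighted graphs, $Q = \sum_c \left[ L_c/m - (d_c/2m)^2 \right]$, not a bipartite-specific variant, and it only evaluates a given partition rather than searching for the best one. The node labels and the `modularity` helper are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: node -> community label.
    """
    m = len(edges)
    intra = defaultdict(int)       # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of nodes in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Bipartite edges: a user node is linked to each answer node it selected.
edges = [
    (("user", 0), ("answer", 0)),
    (("user", 1), ("answer", 0)),
    (("user", 2), ("answer", 1)),
    (("user", 3), ("answer", 1)),
]
nodes = {n for edge in edges for n in edge}

# Partition A: each answer grouped with the users who chose it.
split = {n: (0 if n in {("user", 0), ("user", 1), ("answer", 0)} else 1) for n in nodes}
# Partition B: everything in one community.
single = {n: 0 for n in nodes}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(q_split, q_single)
```

A community-detection algorithm would search over partitions for a high value of this score, which also sidesteps having to fix the number of clusters in advance.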

Detecting these groups/clusters/communities falls under the category of unsupervised learning.

answered Jun 3, 2017 at 1:11
$\endgroup$
  • $\begingroup$ k-means expects the parameter $k$ to be given, but it is unknown to the OP, so another technique for choosing $k$ must be used (such as a Gaussian test or BIC), which makes k-means not that good a fit in this case. It also assumes that every answer belongs to some cluster, which is not always true or expected. $\endgroup$ Commented Jun 3, 2017 at 1:53
  • $\begingroup$ Why do you expect this to be good? Why choose this model and algorithm over others? $\endgroup$ Commented Sep 1, 2017 at 5:38
