I have data that indicates which topics each student got right or wrong in an assessment:
| Student | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | ... (perhaps 30 topics) |
|---------|---------|---------|---------|---------|---------|-------------------------|
| 1       | T       | T       | T       | T       | F       |                         |
| 2       | F       | T       | F       | T       | T       |                         |
| 3       | T       | F       | F       | T       | T       |                         |
| 4       | T       | F       | F       | T       | F       |                         |
| 5       | F       | F       | T       | T       | F       |                         |

... (perhaps 200 students)
I would like to design an algorithm to identify groups of students with similar areas of misunderstanding. For example, it might return:
- 70 students who all got at least 80% of topics 4, 6, 7, 12, 14, 19, 20, 21 and 25 incorrect.
- 56 students who all got at least 80% of topics 4, 8, 10, 11, 19 and 24 incorrect.
- etc.
It would need to seek the largest possible subsets of topics, subject to matching a minimum number of students.
I realise this is not a clearly defined problem, so I would appreciate help in stating it more clearly - as well as suggestions for an algorithm design (or machine learning approach?).
- It sounds like you may be looking for some pattern recognition assistance. It's a pretty big field. Cluster analysis might help, but it's not a casual "write up some code in 10 minutes" sort of thing. – Dan Pichelman, Apr 20, 2016 at 17:46
- Thanks for the links, I'll look into those. I don't suppose there's an efficient way of checking combinations of topics to see which ones match the threshold number of students? For 30 topics, there are over a billion combinations; can it be optimised? – James, Apr 20, 2016 at 17:56
3 Answers
You need a clustering algorithm. The basic idea is that you treat each student's data as an n-dimensional coordinate, i.e. a 30-dimensional coordinate in this case.
So if you had 3 dimensions, you could have one student at (x, y, z) = (1, 1, 0), another at (0, 1, 1), and so on.
The distance between two students can be found using the Euclidean formula, and the same formula works for n-dimensional coordinates:
https://en.wikipedia.org/wiki/Euclidean_distance#n_dimensions
However, in this type of clustering problem the Euclidean formula often does not work best, and you may get better results using the Manhattan distance instead.
One of the simple and easy algorithms for clustering is k-means clustering:
https://en.wikipedia.org/wiki/K-means_clustering
A popular and well-documented Python machine learning library is scikit-learn; it includes many types of clustering, including the k-means I mentioned above, as well as many distance metrics:
http://scikit-learn.org/stable/modules/clustering.html#clustering
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
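As a rough sketch of what this might look like in practice (the data here is random 0/1 values standing in for the real student table, and the cluster count of 5 is an arbitrary choice you would need to tune):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 200 students x 30 topics; 1 = correct, 0 = incorrect (made-up data)
scores = rng.integers(0, 2, size=(200, 30))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(scores)

# For each cluster, report its size and the topics its members
# most often got wrong
for c in range(5):
    members = scores[labels == c]
    if len(members) == 0:
        continue
    error_rate = 1 - members.mean(axis=0)
    worst_topics = np.argsort(error_rate)[::-1][:5]
    print(c, len(members), worst_topics)
```

With real data you would then inspect each cluster's high-error topics to see whether they form a coherent "area of misunderstanding".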
Hope this helps! I've implemented simple clustering as an exercise myself, but not on real n-dimensional data, so this is all based on what I've read (mostly the 'Data Smart' machine learning book, which is very good but uses Excel formulas for illustration, which is a big shortcoming from my point of view as a Python programmer).
Let me know if you have questions.
[EDIT: one nuance you need to consider is whether you want to treat two students both getting a topic right as more significant, less significant, or equally significant compared to both getting a topic wrong. If they are equal, you can simply use one of the distance metrics above; if you want one to weigh more heavily, you'll need to make some adjustments.
This is called an "asymmetric distance calculation". One example used in the book is "cosine distance". scikit-learn has some distance metrics that can help with this; look for metrics on boolean-valued vector spaces, read up on them, and experiment with different ones.]
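To make the symmetric/asymmetric distinction concrete, here is a small sketch using SciPy's distance functions (made-up vectors; Hamming treats matching False values the same as matching True values, while Jaccard ignores positions where both are False):

```python
from scipy.spatial.distance import hamming, jaccard

# Two students' results on 5 topics (True = correct), made-up data
a = [True, True, False, False, True]
b = [True, False, False, False, True]

# Hamming: fraction of all positions that disagree
print(hamming(a, b))  # 0.2 (1 mismatch out of 5 positions)

# Jaccard: fraction of disagreements among positions where
# at least one vector is True
print(jaccard(a, b))  # 1/3 (1 mismatch out of 3 such positions)
```

Whether an asymmetric metric like Jaccard is appropriate depends on whether shared correct answers should count toward similarity at all.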
Starting from permutations and combinations, I think we're interested in combinations:
For 30 topics, you would have the following count of combinations for each size:
(In other words there are 30 combinations of size 1, and 435 combinations of size 2 (also 435 of size 28 as it turns out)).
- 30
- 435
- 4060
- 27405
- 142506
- 593775
- 2035800
- 5852925
- 14307150
- 30045015
- 54627300
- 86493225
- 119759850
- 145422675
- 155117520
- 145422675
- 119759850
- 86493225
- 54627300
- 30045015
- 14307150
- 5852925
- 2035800
- 593775
- 142506
- 27405
- 4060
- 435
- 30
- 1
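These counts are just the binomial coefficients C(30, k), so you can generate and sanity-check them without enumerating anything:

```python
from math import comb

# C(30, k) for k = 1..30
counts = [comb(30, k) for k in range(1, 31)]
print(counts[:3], counts[14])  # [30, 435, 4060] 155117520
print(sum(counts))             # 2**30 - 1 = 1073741823 non-empty combinations
```

The total confirms the "over a billion combinations" figure mentioned in the comments.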
I think you could generate each of the combinations for each combination size from 1-30. You can google "generate all combinations"; I have also attached a generator I whipped up below.
For each combination size, we could find the combinations, count their occurrences among the students, and sort them by count. You could have a cutoff of interest, keeping only the top percentage for each size.
We can store each student's topic score with one bit. Thus, for 30 topics, we need 30 bits per student. On a 64-bit machine, we can store up to 64 bits in a single integer.
We can represent a combination also as an integer, whose bits are set to represent what elements are in the combination. In other words, a combination of size 6 is represented by an integer having exactly 6 of its bits set.
So, given a combination, we can test each of the ~200 students for match with a simple logical bitwise AND operation followed by a compare.
for ( var i = 0; i < 200; i++ )
{
    if ( (studentScores [i] & currentCombination) == currentCombination )
    {
        // this combination matched for this student
    }
}
NOTE: as written this checks for success combinations rather than failure combinations, but that is easily reversed (use == 0 instead of == currentCombination).
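The same bitwise test is just as cheap in a higher-level language. A small Python sketch with made-up scores (bit k set meaning the student got topic k right):

```python
# Made-up data: one integer per student, bit k set = topic k correct
student_scores = [0b10110, 0b00110, 0b10100, 0b01010]

# A combination covering topics 2 and 4 (bits 2 and 4 set)
combination = (1 << 2) | (1 << 4)

# Students who got every topic in the combination right
right = sum(1 for s in student_scores if s & combination == combination)
# Students who got every topic in the combination wrong
wrong = sum(1 for s in student_scores if s & combination == 0)
print(right, wrong)  # 2 1
```

For ~200 students this inner loop is effectively free; the cost of the whole approach is dominated by how many combinations you enumerate.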
So, that's a start.
I would think that you could do an analysis of combinations relative to each other, such as what popular combinations of size 4 or 5 are found in the popular combinations of size 16.
Below is a C# program I wrote that generates combinations of T/F values as bit masks. You could put that combination generator in a size loop, and use the above student-occurrence loop for each such size.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication17
{
    class Program
    {
        static ulong NextCombination ( int outOf, ulong last )
        {
            // find the lowest set bit
            int i;
            for ( i = 0; i < outOf; i++ )
            {
                if ( (last & (1UL << i)) != 0 )
                    break;
            }
            // find the first zero bit above it
            int j;
            for ( j = i + 1; j < outOf; j++ )
            {
                if ( (last & (1UL << j)) == 0 )
                    break;
            }
            // we alter the number by removing all the consecutive ones
            // that occur followed by a zero (viewed from low order to high order)
            //     as in ...0111...
            // replace with a single 1
            //     as in ...1000... and set the low bits to one fewer ones in a row, i.e.
            //     ...1000.....011
            int numberOfOnesInARowBeforeFirstZero = j - i;
            var onesMask = ((1UL << numberOfOnesInARowBeforeFirstZero) - 1) << i;
            last &= ~onesMask;  // clear the series of ones
            last |= 1UL << j;   // set the first zero bit after the ones in a row
            var lowOnesReplacement = (1UL << (numberOfOnesInARowBeforeFirstZero - 1)) - 1;
            last |= lowOnesReplacement;  // set the lowest bits to make up for the cleared bits
            return last;
        }

        static void Main ( string [] args )
        {
            int comboSize = 3;
            int outOfSize = 30;
            int count = 1;
            ulong currentCombination = (1UL << comboSize) - 1;
            ulong lastCombination = currentCombination << (outOfSize - comboSize);
            System.Console.WriteLine ( "Last Combination for size " + comboSize + " is " + Convert.ToString ( (long) lastCombination, 2 ).PadLeft ( outOfSize, '0' ) );
            for ( ;; )
            {
                System.Console.WriteLine ( "Combination " + count + " is " + Convert.ToString ( (long) currentCombination, 2 ).PadLeft ( outOfSize, '0' ) );
                if ( currentCombination == lastCombination )
                    break;
                currentCombination = NextCombination ( outOfSize, currentCombination );
                count++;
            }
            System.Console.ReadLine ();
        }
    }
}
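For reference, the same "advance to the next combination" step can also be written very compactly using a well-known bit trick (Gosper's hack). This Python sketch enumerates the same size-3 masks and confirms the count matches C(30, 3) = 4060:

```python
def next_combination(x):
    # Gosper's hack: the next larger integer with the same number of set bits
    u = x & -x                        # lowest set bit
    v = x + u                         # carry through the low run of ones
    return v | (((v ^ x) // u) >> 2)  # restore the remaining low ones

size, out_of = 3, 30
combo = (1 << size) - 1  # first combination: lowest `size` bits set
count = 0
while combo < (1 << out_of):
    count += 1
    combo = next_combination(combo)
print(count)  # 4060
```

It performs the same clear-the-run / set-the-next-bit / refill-low-bits steps as the C# version, just folded into three arithmetic operations.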
> It would need to seek the largest possible subsets of topics, subject to matching a minimum number of students.
So you essentially have two criteria. Let's look at "largest possible subset" first:
This is brute force, unfortunately. In other words, you don't know the largest group until you've tallied all of the results. (You could optimise on very large datasets, but let's ignore that for now.)
So you iterate over all topics and sum the votes (here, a "vote" for a topic is one student getting it wrong). Then make sure the votes exceed the minimum.
At this point you have all topics with sufficient votes. One optimisation in this process is to recognise that if the topic with the fewest votes exceeds the total of all remaining votes, then you can stop. But the brute-force approach is likely to be fast enough without optimisation unless voting runs into the millions.
Pseudocode might look like:
Array[Integer] topicVotes = new Array[numTopics];
for (i = 0; i < numTopics; i++) {
    int count = 0;
    for (j = 0; j < votes[i].size; j++) {
        if (votes[i][j] == true) count++;
    }
    if (count > minimum) topicVotes[i] = count;
}
This might not be the best way to implement it, but it should illustrate how to filter out the topics that don't meet the minimum requirement, while emphasising that brute force is required (in most cases) to guarantee that all topics meeting the minimum are included.
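A runnable version of that filter, with a tiny made-up dataset (rows are topics, columns are students, True meaning a "vote", and a threshold picked arbitrarily):

```python
# Made-up data: votes[t][s] is True if student s got topic t wrong
votes = [
    [True,  True,  False, True ],   # topic 0: 3 votes
    [False, True,  False, False],   # topic 1: 1 vote
    [True,  True,  True,  True ],   # topic 2: 4 votes
]
minimum = 2

# Keep only topics whose vote count exceeds the minimum
qualifying = {}
for t, row in enumerate(votes):
    count = sum(row)
    if count > minimum:
        qualifying[t] = count
print(qualifying)  # {0: 3, 2: 4}
```

The surviving topics would then feed into whatever subset search you run next.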