I am new to machine learning, so forgive me if I am doing something absolutely absurd.
I have a classification task (~100 classes) with about 2 million training data points in a 2000-dimensional space. The coordinates of the data points are integers (discrete), and each point has non-zero coordinates in fewer than 10 dimensions. That is, each point can be uniquely defined in a subspace of fewer than 10 dimensions.
If I use a Gaussian mixture model (GMM) for each class, I will end up with ~100 GMMs in a 2000-dimensional space. Given that each point is definable in fewer than 10 dimensions, I feel there should be a better way of doing this.
What am I missing here?
Why not do something like sparse PCA? What was your reasoning for using a Gaussian mixture? Does the data conform to normal distributions? Methods that explicitly account for the sparsity of your problem would likely far outperform it; a simple Gaussian mixture model may overfit. – Nicholas Mancuso, Mar 3, 2014 at 17:20
I was not aware of sparse models. GMMs and SVMs have been my choices for classification tasks because they were easy to implement when I started out. I will go through the literature on sparse models now. It would be nice if you could suggest some references for a beginner like me. Thank you. – hrs, Mar 3, 2014 at 17:43
1 Answer
Since your data are extremely sparse, using GMMs or a traditional SVM will result in an over-fit model. Methods that exploit the sparse structure should give you much better results. Such methods typically add a penalty term to the objective that measures the number (or size) of the non-zero coefficients; this is usually referred to as "regularization". Penalizing the exact number of non-zero coefficients (the $L_0$ norm) is computationally intractable, so relaxations are used in practice (lasso: $L_1,ドル ridge: $L_2$).
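To make the penalty idea concrete (my notation, not part of the original answer), a lasso-style objective for a linear model with per-example loss $\ell$ is
$$\min_{w}\; \sum_{i=1}^{n} \ell\!\left(y_i,\, w^\top x_i\right) \;+\; \lambda \lVert w \rVert_1,$$
where larger $\lambda$ drives more coefficients of $w$ to exactly zero, which is why the $L_1$ relaxation still yields sparse models.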
If you are comfortable with Python, scikit-learn has implementations of several approaches that use either lasso or ridge penalties, including linear SVMs trained with a lasso penalty. I would try that and see how well it works. These models take a parameter that controls the amount of regularization; use cross-validation to find the value that works best.
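Here is a minimal sketch of what an $L_1$-penalized linear SVM with a cross-validated regularization parameter could look like. This is my own illustration, not the example originally referenced; the toy data sizes and the grid of C values are assumptions, chosen to mimic the shape of the data in the question:

    # A minimal sketch (illustration only): an L1-penalized linear SVM on
    # sparse data, with cross-validation over the regularization strength C.
    # The data are a toy stand-in shaped like the question (sparse rows,
    # ~100 classes), not real data.
    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import GridSearchCV

    # Toy data: 5,000 sparse rows in 2,000 dimensions, ~5 non-zeros per row.
    X = sparse_random(5000, 2000, density=5 / 2000, format="csr", random_state=0)
    y = np.random.RandomState(0).randint(0, 100, size=5000)  # ~100 classes

    # penalty="l1" produces sparse weight vectors; dual=False is required
    # when using the L1 penalty with LinearSVC.
    svm = LinearSVC(penalty="l1", dual=False, max_iter=5000)

    # Smaller C means stronger regularization; choose it by cross-validation.
    search = GridSearchCV(svm, {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
    search.fit(X, y)
    print("best C:", search.best_params_["C"])

The appeal of the L1 penalty here is that it zeroes out most of the 2000 weights per class, which matches the sparse structure of your data.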