Difference between missing data and sparse data in machine learning algorithms

Question 1

What are the main differences between sparse data and missing data, and how do they influence machine learning? More specifically, what effect do sparse data and missing data have on classification algorithms and on regression (number-predicting) algorithms? I'm talking about a situation where the percentage of missing data is significant and we can't drop the rows containing missing data.

Question 2

Sparse data means that many of the values are zero, but you know that they are zero. Missing data means that you don't know what some or many of the values are.
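As a minimal sketch of this distinction (the readings are made-up values, and using `None` as the marker for a missing entry is an assumption for illustration), a zero takes part in a computation as a known value, while a missing entry has to be skipped or filled in first:

```python
from statistics import mean

# Hypothetical readings: 0.0 is a known value, None marks an unknown one.
sparse_readings = [0.0, 0.0, 3.0, 0.0]
missing_readings = [None, None, 3.0, None]

# Zeros are real observations, so they count toward the mean.
sparse_mean = mean(sparse_readings)

# Missing entries carry no value; here they are skipped before averaging.
observed = [x for x in missing_readings if x is not None]
observed_mean = mean(observed)

print(sparse_mean)    # 0.75
print(observed_mean)  # 3.0
```

The same three "empty" positions give very different summaries depending on whether they are known zeros or unknown values, which is why the two cases should not be encoded the same way.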

Question 3

Thanks. That's what I also thought, but I wanted to confirm. Also, as mentioned in the question, I would like to know how, in general, these types of datasets are handled in machine learning problems.

Question 4

I think that your question is a little vague. "Machine learning" includes a wide range of methods and tools, so the answer depends on what you have or what you want to do. Here they discuss some methods for handling missing data: stats.stackexchange.com/questions/103500/…

Question 5

Thanks. I'm aware of the broad range of tools and types of ML algorithms, but I wanted to know if there are any general approaches.

Question 6

To add a bit of nuance, you don't necessarily "know" that the true value is zero; that is just what is present in your data set or returned by your measurement device. I think an easier way to think about it is that sparse data has a bunch of zeros and missing data has a bunch of totally missing entries (e.g. 'NA', 'NULL', or simply blank entries with no value).

Question 7

For ease of understanding, I'll describe this using an example. Let's say that you are collecting data from a device that has 12 sensors, and you have collected data for 10 days.

The data you have collected is as follows: [image not preserved]

This is called sparse data because most of the sensor outputs are zero, which means those sensors are functioning properly but the actual reading is zero. Although this matrix is high-dimensional (12 dimensions), it can be said to contain little information.
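One common way to exploit this is to store only the nonzero entries. The following is a rough sketch in plain Python, not the answerer's method; the small matrix is hypothetical (the original table was an image), and `to_sparse` and `density` are illustrative helper names:

```python
# Hypothetical 3-day x 4-sensor readings; most entries are true zeros.
dense = [
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
]

def to_sparse(rows):
    """Keep only the nonzero entries, keyed by (row, column)."""
    return {
        (i, j): value
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value != 0.0
    }

def density(rows):
    """Fraction of entries that are nonzero."""
    total = sum(len(row) for row in rows)
    return len(to_sparse(rows)) / total

sparse = to_sparse(dense)
print(sparse)            # {(0, 2): 5.0, (2, 0): 2.0}
print(density(dense))    # 2 nonzero entries out of 12

# Any entry that is not stored is known to be zero, not unknown.
print(sparse.get((1, 1), 0.0))
```

Because an absent key is read back as a known zero, this representation is only appropriate for sparse data; it would silently turn missing values into zeros if used for them.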

Let's say 2 sensors of your device are malfunctioning.
Then your data will look like this: [image not preserved]

In this case, you can see that you cannot use the data from Sensor1 and Sensor6. Either you have to fill in the missing values in a way that does not distort the results, or you have to redo the experiment.
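As a hedged illustration of the "fill in" option, the sketch below drops any sensor with no usable readings and fills the remaining gaps with that sensor's mean. Mean imputation is just one simple choice picked here for illustration (it can bias results when many values are missing), and the readings and the `clean` helper are hypothetical:

```python
from statistics import mean

# Hypothetical readings over four days; None marks a missing value.
data = {
    "Sensor1": [None, None, None, None],  # malfunctioning: nothing recorded
    "Sensor2": [0.0, 4.0, None, 2.0],     # one gap
    "Sensor3": [1.0, 0.0, 0.0, 3.0],      # complete
}

def clean(columns):
    """Drop columns with no observed values; mean-impute the remaining gaps."""
    result = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue  # nothing to impute from, so the column is unusable
        fill = mean(observed)
        result[name] = [fill if v is None else v for v in values]
    return result

cleaned = clean(data)
print(cleaned["Sensor2"])  # [0.0, 4.0, 2.0, 2.0]
```

Note that the genuine zeros in Sensor2 and Sensor3 are left untouched and count toward the mean; only the `None` entries are treated as unknown.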

Question 8

This is really clear & helpful. Thank you.

Lahiru Karunaratne · Accepted Answer · 2018-01-29 06:41:42Z

Comment by Ciaran Haines, May 11, 2021 at 8:39
