What are the main differences between sparse data and missing data, and how do they influence machine learning? More specifically, what effect do sparse data and missing data have on classification algorithms and on regression (number-predicting) algorithms? I'm talking about a situation where the percentage of missing data is significant and we can't drop the rows containing missing data.
- Sparse data means that many of the values are zero, but you know that they are zero. Missing data means that you don't know what some or many of the values are. – Anna SdTC, Mar 14, 2017 at 6:54
- Thanks. That's what I also thought, but I wanted to confirm. Also, as mentioned in the question, I would like to know how, in general, these types of datasets are handled in machine learning problems. – tired and bored dev, Mar 14, 2017 at 7:40
- I think that your question is a little vague. "Machine learning" includes a wide range of methods and tools, so the answer depends on what you have or what you want to do. Here they discuss some methods for handling missing data: stats.stackexchange.com/questions/103500/… – Anna SdTC, Mar 14, 2017 at 7:50
- Thanks. I'm aware of the broad range of tools and types of ML algorithms, but I wanted to know if there are any general approaches. – tired and bored dev, Mar 14, 2017 at 9:23
- To add a bit of nuance, you don't necessarily "know" that the true value is zero; that is just what is present in your data set or returned by your measurement device. I think an easier way to think about it is that sparse data has a bunch of zeros and missing data has a bunch of totally missing entries (e.g. 'NA', 'NULL', or simply blank entries with no value). – Cole, Sep 27, 2023 at 16:48
1 Answer
For ease of understanding, I'll describe this using an example. Let's say that you are collecting data from a device that has 12 sensors, and you have collected data for 10 days.
The data you have collected is as follows: (image in the original post, not reproduced here)
This is called sparse data because most of the sensor outputs are zero. That means those sensors are functioning properly, but the actual reading is zero. Although this matrix is high-dimensional (12 axes), it can be said to contain relatively little information.
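As a rough sketch of what "mostly zeros, but known zeros" looks like in practice, the snippet below stores only the non-zero readings and treats every other cell as a genuine zero. The readings, the names `nonzero`, `to_dense`, and `sparsity`, and the dictionary-based layout are all assumptions made for illustration; they are not taken from the original images.

```python
# Hypothetical readings: 10 days x 12 sensors, mostly zeros
# (values invented for illustration, not from the original post).
N_DAYS, N_SENSORS = 10, 12

# Coordinate-style storage: keep only the non-zero readings
# as {(day_index, sensor_index): value}. Every absent key is a known zero.
nonzero = {(0, 2): 3.5, (4, 7): 1.2, (9, 0): 8.0}


def to_dense(entries, n_rows, n_cols):
    """Expand the sparse mapping into a full list-of-lists table."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries.items():
        dense[row][col] = value
    return dense


def sparsity(entries, n_rows, n_cols):
    """Fraction of cells that are zero."""
    return 1.0 - len(entries) / (n_rows * n_cols)


dense = to_dense(nonzero, N_DAYS, N_SENSORS)
print(f"stored values: {len(nonzero)} of {N_DAYS * N_SENSORS}")
print(f"sparsity: {sparsity(nonzero, N_DAYS, N_SENSORS):.3f}")
```

The point of the sketch is that nothing is unknown here: the dense table can always be rebuilt exactly, so an algorithm can use the zeros as real values.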
Now let's say 2 sensors of your device are malfunctioning.
Then your data will look like this: (image in the original post, not reproduced here)
In this case, you cannot use the data from Sensor1 and Sensor6 as it stands, because their values are missing. Either you have to fill in the missing values in a way that does not distort the results, or you have to redo the experiment.
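One possible way of "filling in" missing values is mean imputation combined with a flag recording which entries were filled. This is a swapped-in illustration rather than the method the answer has in mind, and the readings, the `None` marker for missing entries, and the helper name `impute_mean` are all assumptions.

```python
from statistics import mean

# Hypothetical readings for 3 sensors over 5 days; None marks a reading that
# was never recorded (values invented for illustration).
readings = {
    "Sensor1": [None, 2.0, None, 4.0, None],   # partly missing
    "Sensor2": [0.0, 0.0, 1.5, 0.0, 0.0],      # sparse but complete: zeros are real
    "Sensor6": [None, None, None, None, None],  # entirely missing
}


def impute_mean(values):
    """Replace None with the mean of the observed values.

    Returns (filled_values, was_missing_flags). If nothing was observed,
    the values are returned unchanged, since there is nothing to estimate from.
    """
    observed = [v for v in values if v is not None]
    flags = [v is None for v in values]
    if not observed:
        return list(values), flags
    fill = mean(observed)
    return [fill if v is None else v for v in values], flags


for name, values in readings.items():
    filled, flags = impute_mean(values)
    print(name, filled, flags)
```

Note that a sensor with no observed readings at all cannot be imputed from its own history, which matches the answer's point that sometimes the only option is to redo the experiment. Mean imputation can also bias results when a large share of values is missing, so treat it as a baseline rather than a recommendation.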
- This is really clear & helpful. Thank you. – Ciaran Haines, May 11, 2021 at 8:39