I am trying to cluster time series data in Python using different clustering techniques. K-means didn't give good results. The images below show what I got with agglomerative clustering. I also tried Dynamic Time Warping (DTW). The two give similar results.
What I would ideally like is to split the time series in the second image into two clusters. The first image is a cluster of rapid increases, the second is a cluster with no increase (roughly stable), and the third is a cluster of decreasing trends. I would like to know which time series are both stable and popular (by popular, I mean a high count). I tried hierarchical clustering, but the results showed far too many levels and I am not sure how to pick the level at which to cut. Can someone shed light on how to split the time series in the second image into two clusters, one with low counts and the other with high counts? Is that possible? Or should I just pick a threshold visually to cut them into two?
Cluster with rapid increases:
Cluster with stable counts:
Cluster with decreasing trends:
This is very vague, but this is the result of my hierarchical clustering.
I know this particular image is not useful at all, but this is a dead end for me as well.
In general, if you would like to differentiate trends, say for YouTube videos, how do only some get picked for the "trending" section and others for the "trending this week" section? I understand the "trending" videos are the ones that show characteristics similar to the first image. The "trending this week" section has a collection of videos that have very high view counts but are quite stable in terms of counts (i.e. not showing rapid increases). I know that in the case of YouTube, many other factors are considered in addition to view counts. With the second image, what I am trying to do is similar to the "trending this week" section: I would like to pick the series that have very high counts. How do I split the time series in this case?
I know DTW captures trends. DTW gave the same results as the images above. It has identified the trend in the second image, which is "stable", but it doesn't capture the "count" element. I want both the trend and the count to be captured, in this case stable and high count.
The above images are time series clustered based on counts. Am I missing out on any other clustering techniques that could achieve this? Even with just counts, how do I cluster differently according to my needs?
Any ideas would be much appreciated. Thanks in advance!
- It's not about missing any clustering technique. If you feed K-means (or any other algorithm) the raw data, the results won't be good. You need to construct features from the time series (such as the average day-over-day increase, the number of times an observation is above the previous one, and so on). Regarding the high counts, I think you should define a threshold yourself. No algorithm will do this for you. – Stergios, Aug 10, 2017 at 13:56
- Can you edit your question to say which clustering technique you tried with DTW as the distance, and which distance metrics you tried for K-means apart from Euclidean? – nth-attempt, Aug 10, 2017 at 19:02
- K-means with Euclidean distance does not by itself make use of the time ordering. To see this, shuffle the time steps (applying the same permutation to every series) and you should get the same clusters, since the Euclidean distance is unchanged. @Stergios You are essentially proposing to build time-based features to feed to K-means. Do you know of any other clustering methods that work directly on raw time series? The one I know is to use DTW as the distance with hierarchical clustering. – nth-attempt, Aug 10, 2017 at 19:09
- @ultramarine I'm not aware of any algorithm that would take raw time series and cluster them. – Stergios, Aug 11, 2017 at 5:24
- Improve your preprocessing and feature extraction! – Has QUIT--Anony-Mousse, Aug 11, 2017 at 6:53
3 Answers
The best thing you can do is to extract some features from your time series. The first feature to extract in your case is the trend; see linear trend estimation.
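A minimal sketch of this idea, assuming each series is a plain array of counts: fit a least-squares line to get the slope, keep the mean as a level feature, and then split the stable series into low and high counts with a 1-D two-means on the log level. The function names, the `rel_change=0.2` cutoff, and the toy series are illustrative assumptions, not anything taken from the question's data.

```python
import numpy as np

def trend_and_level(series):
    """Return (slope, mean level) of one series from a least-squares line."""
    y = np.asarray(series, dtype=float)
    t = np.arange(len(y))
    slope, _intercept = np.polyfit(t, y, 1)
    return slope, y.mean()

def trend_label(series, rel_change=0.2):
    """Label a series by its fitted total change relative to its mean level.

    rel_change is a hypothetical cutoff; tune it on real data.
    """
    slope, level = trend_and_level(series)
    change = slope * len(series) / max(abs(level), 1e-9)
    if change > rel_change:
        return "increasing"
    if change < -rel_change:
        return "decreasing"
    return "stable"

def split_by_level(levels, n_iter=50):
    """1-D two-means on log(1 + level): 0 = low group, 1 = high group."""
    x = np.log1p(np.asarray(levels, dtype=float))
    lo, hi = x.min(), x.max()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        new_labels = (np.abs(x - hi) < np.abs(x - lo)).astype(int)
        if new_labels.min() == new_labels.max():  # only one group found
            return new_labels
        lo, hi = x[new_labels == 0].mean(), x[new_labels == 1].mean()
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
    return labels

# Toy data (assumed for illustration only).
t = np.arange(30)
series = {
    "rising": 5 + 4 * t + np.sin(t),
    "stable_low": 20 + np.sin(t),
    "stable_high": 500 + 5 * np.sin(t),
    "falling": 200 - 3 * t + np.sin(t),
}

labels = {name: trend_label(y) for name, y in series.items()}
stable = [name for name, lab in labels.items() if lab == "stable"]
levels = [trend_and_level(series[name])[1] for name in stable]
groups = dict(zip(stable, split_by_level(levels)))
```

Handling the trend first and the level second keeps the two questions apart: "is it stable?" is answered by the slope, and "is it popular?" by the level, so neither feature has to dominate a single distance measure.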
Another thing you can do is to cluster the cumulative version of your time series, as suggested and explained in this other post: Time series distance metrics
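As a rough illustration, assuming "cumulative version" means the running total of each series, the sketch below compares plain Euclidean distance with Euclidean distance on cumulative sums. The helper names and the toy spike series are made up for the example.

```python
import numpy as np

def raw_distance(a, b):
    """Euclidean distance between two equal-length series."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def cumulative_distance(a, b):
    """Euclidean distance between the running totals of two series."""
    ca = np.cumsum(np.asarray(a, dtype=float))
    cb = np.cumsum(np.asarray(b, dtype=float))
    return float(np.linalg.norm(ca - cb))

# Three series with one spike of the same size at different times.
early = [5, 0, 0, 0]
next_step = [0, 5, 0, 0]
late = [0, 0, 0, 5]

raw_near = raw_distance(early, next_step)
raw_far = raw_distance(early, late)
cum_near = cumulative_distance(early, next_step)
cum_far = cumulative_distance(early, late)
```

On the raw values the two pairs are equally far apart, because Euclidean distance ignores when the spike happens. On the cumulative series the pair whose spikes are close in time ends up closer than the pair whose spikes are far apart.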
Comments
You can use DTW to cluster trends by computing the total minimum distance; see my answer here to a similar question. I had a problem very close to this one and ended up publishing my own Python package for the purpose. Check this for details. You can also see a demo here.
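For readers without the package at hand, here is a minimal sketch of the textbook DTW distance using dynamic programming with an absolute-difference cost. It is not the package mentioned above, and the function name and toy series are assumptions for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW: minimum total |a_i - b_j| over monotone alignments."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

bump = [0, 0, 1, 2, 1, 0]
shifted = [0, 1, 2, 1, 0, 0]          # same bump, one step earlier
scaled = [100 * v for v in bump]      # same shape, much higher counts

d_shifted = dtw_distance(bump, shifted)
d_scaled = dtw_distance(bump, scaled)
```

The shifted copy is at distance zero because warping absorbs the time shift. The scaled copy is not, so on unnormalized series the level still contributes to the distance. If the series are normalized before DTW, that level information is lost and has to be carried as a separate feature.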
Comments
First, you should make your data stationary: remove the trend and cyclic components. You can then run an ANOVA with hour or month as the categorical factor and your recorded value as the numerical response. You can also use these time steps as categorical features in ML training. That is still a supervised approach, though, for evaluating the X-Y dependency.
Nevertheless, the assumption of stationarity is important for any kind of time series analysis.
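One common way to remove a trend and a cyclic component is differencing. The sketch below assumes a linear trend and a known period of 12 steps; the `difference` helper and the synthetic series are illustrative only.

```python
import numpy as np

def difference(series, lag=1):
    """Return y[t] - y[t - lag] for t >= lag."""
    y = np.asarray(series, dtype=float)
    return y[lag:] - y[:-lag]

# Synthetic series: linear trend plus a cycle with period 12 (assumed known).
t = np.arange(48)
y = 2.0 * t + 3.0 * np.sin(2 * np.pi * t / 12)

deseasoned = difference(y, lag=12)          # cycle cancels, trend leaves a constant
stationary = difference(deseasoned, lag=1)  # constant removed as well
```

Note that differencing discards the level of the series, so if the count itself matters (as in the question), it has to be kept as a separate feature.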
For clustering (as an unsupervised approach), feature selection is important as well. The tsfresh Python package provides useful features for this kind of analysis (mean, median, standard deviation, entropy); see the example here. Whether to compute them in a sliding window is up to you.
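To show what such summary features look like in a sliding window, here is a small NumPy-only sketch. It deliberately does not use the tsfresh API, and the `window_features` helper, window size, and toy series are assumptions for illustration.

```python
import numpy as np

def window_features(series, window):
    """Mean, median and standard deviation for every full sliding window."""
    y = np.asarray(series, dtype=float)
    rows = []
    for start in range(len(y) - window + 1):
        w = y[start:start + window]
        rows.append((w.mean(), np.median(w), w.std()))
    return np.array(rows)  # shape: (number of windows, 3)

features = window_features([1, 2, 3, 4, 5, 6], window=3)
```

Each row can be used as a feature vector for one window, or the rows can be aggregated per series (for example by averaging) before being passed to a clustering algorithm.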
I would also demean all the data first to get comparable values for analysis.
Also, the uncertainty from quantile regression can be used as a feature for clustering.
DTW distances are not the only option for clustering time series: in R, the dtwclust package offers a wider variety of distances than is available in Python.
Everything depends on the aim of your analysis. And, of course, it would help to add some example test data to your question.
P.S. For hierarchical clustering, see this example.
P.P.S. An interesting topic is "Ignoring Temporal Dependencies" (though not for unsupervised clustering):
"Remember that time series data has temporal dependencies. Make sure your feature selection methods account for this."
Perhaps this can be achieved by optimizing the window size, or by something like wavelength selection in spectroscopy; see here.
Comments