Clustering time series data in Python

Question 1

I am trying to cluster time series data in Python using different clustering techniques. K-means didn't give good results. The images below show what I get from agglomerative clustering. I also tried Dynamic Time Warping (DTW). The two seem to give similar results.

Ideally, I would like two different clusters for the time series in the second image. The first image is a cluster of rapid increases, the second a cluster with no increase (roughly stable), and the third a cluster of decreasing trends. I would like to know which time series are both stable and popular (by popular, I mean a high count). I tried hierarchical clustering, but the result had far too many levels and I am not sure how to pick the level of the hierarchy. Can someone shed light on how to split the time series in the second image into two clusters, one with low counts and the other with high counts? Is that possible, or should I just visually pick a threshold to cut them in two?

Cluster with rapid increases:

[image 1]

Cluster with stable counts:

[image 2]

Cluster with decreasing trends:

[image 3]

This is very vague, but here is the result of my hierarchical clustering:

[image 4]

I know this particular image is not useful at all, but this is a dead end for me as well.

More generally, when differentiating trends, say for YouTube videos, how do only some get picked for the "trending" section and others for the "trending this week" section? As I understand it, the "trending" videos are the ones with characteristics similar to the first image, while "trending this week" is a collection of videos with very high view counts that are quite stable (i.e. not showing rapid increases). I know that YouTube considers many other factors in addition to view counts. What I am trying to do with the second image is similar to the "trending this week" section: I would like to pick the series that have very high counts. How do I split the time series in this case?

I know DTW captures trends. DTW gave the same results as the images above: it identified the trend in the second image as "stable", but it doesn't capture the count. I want both the trend and the count to be captured, in this case stable and high count.

The above images are time series clustered based on counts. Am I missing out on any other clustering techniques that could achieve this? Even with just counts, how do I cluster differently according to my needs?

Any ideas would be much appreciated. Thanks in advance!

Question 2

It's not about missing any clustering techniques. If you feed K-means (or any other algorithm) the raw data, the results won't be good. You need to construct features from the time series (such as the average day-over-day increase, the number of times an observation is above the previous one, and so on). Regarding the high counts, I think you should define a threshold yourself. No algorithm will do this for you.
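To make the feature idea concrete, here is a minimal sketch using only NumPy. The toy data, the function name `trend_features`, and both thresholds (`max_abs_step`, `min_level`) are invented for illustration; in practice the thresholds would have to be chosen by looking at your own counts.

```python
import numpy as np

def trend_features(series):
    """Return (mean step change, fraction of upward steps, mean level)."""
    x = np.asarray(series, dtype=float)
    steps = np.diff(x)  # day-over-day changes
    return np.array([steps.mean(), (steps > 0).mean(), x.mean()])

# Toy data, invented for illustration: one row per series, one column per day.
data = np.array([
    [10, 20, 40, 80, 160],      # rapid increase
    [500, 505, 498, 502, 500],  # stable, high count
    [12, 11, 13, 12, 12],       # stable, low count
    [90, 70, 50, 30, 10],       # decreasing
])
features = np.vstack([trend_features(row) for row in data])

# Hypothetical hand-picked thresholds (assumptions, not derived from data).
max_abs_step = 5.0   # "stable" means the average daily change is small
min_level = 100.0    # "popular" means the average count is high

stable = np.abs(features[:, 0]) <= max_abs_step
stable_and_popular = stable & (features[:, 2] >= min_level)
print(stable_and_popular)
```

The same `features` matrix could also be passed to a clustering algorithm instead of being thresholded by hand; including the mean level as a feature is what lets the count influence the grouping.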

Question 3

Can you edit your question to say which clustering technique you tried with DTW as the distance, and which distance metrics other than Euclidean you tried for K-means?

Question 4

K-means with Euclidean distance does not by itself make use of the time ordering. To see this, shuffle the time steps (using the same permutation for every series) and you should get the same clusters, since Euclidean distance ignores the order of the coordinates. @Stergios You are essentially building time-based features to feed to K-means. Do you know of any other clustering methods that work directly on the raw time series? One approach I know is to use DTW as the distance and apply hierarchical clustering.
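The shuffling argument can be checked numerically. This is a small sketch on random-walk data generated purely for illustration: it applies one permutation of the time steps to every series and compares the pairwise Euclidean distances before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
# Five made-up random-walk series with 20 time steps each.
data = rng.normal(size=(5, 20)).cumsum(axis=1)

def pairwise_euclidean(a):
    """Matrix of Euclidean distances between all rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Apply the SAME permutation of time steps to every series.
perm = rng.permutation(data.shape[1])
shuffled = data[:, perm]

same = np.allclose(pairwise_euclidean(data), pairwise_euclidean(shuffled))
print(same)
```

Because the distance matrix does not change, any method that depends only on Euclidean distances between series cannot tell the original ordering from the shuffled one.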

Question 5

@ultramarine I'm not aware of any algorithm that would take raw time-series and cluster them.

Question 6

Improve your preprocessing and feature extraction!

paolof89 · Answer 1 · 2018-01-31 14:37:22Z

The best thing you can do is to extract some features from your time series. The first feature to extract in your case is the trend (see linear trend estimation).

Another thing you can do is to cluster the cumulative version of your time series, as suggested and explained in this other post: Time series distance metrics
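As a rough sketch of both suggestions, assuming an ordinary least-squares line for the trend and a running sum for the "cumulative version" (the linked post may define it differently), with toy series made up for the example:

```python
import numpy as np

def linear_trend_slope(series):
    """Slope of the least-squares line fitted to the series."""
    x = np.asarray(series, dtype=float)
    t = np.arange(len(x))
    slope, _intercept = np.polyfit(t, x, deg=1)
    return slope

# Toy series, invented for illustration.
rising = [1, 2, 3, 4, 5]
falling = [5, 4, 3, 2, 1]
flat = [3, 3, 3, 3, 3]

slopes = [linear_trend_slope(s) for s in (rising, falling, flat)]
print(slopes)

# Running sums: series with the same values in a different order
# now differ at the intermediate time steps.
cumulative = np.cumsum([rising, falling, flat], axis=1)
print(cumulative)
```

Note that `rising` and `falling` contain the same values, so they look alike to anything that ignores order, while their cumulative versions only agree at the final step.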

Dogan Askan · Answer 2 · 2020-06-24 03:39:46Z

You can use DTW to cluster trends by computing the total minimum distance; see my answer here to a similar question. I had a very similar problem and ended up publishing my own Python package for this purpose. Check this for details. You can also see a demo here.
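For readers without access to the linked package, here is a generic sketch of DTW-based hierarchical clustering. It is not that package's API: `dtw_distance` is a textbook dynamic-programming implementation written for this example, the series are invented, and the clustering uses SciPy's `linkage` and `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW with absolute-difference step cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Toy series, invented for illustration; lengths may differ under DTW.
series = [
    [1, 2, 4, 8, 16, 32],
    [1, 1, 2, 4, 8, 16, 32],  # same shape, delayed by one step
    [30, 24, 18, 12, 6, 1],
    [32, 25, 18, 11, 5, 1],
]

k = len(series)
dist = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        dist[i, j] = dist[j, i] = dtw_distance(series[i], series[j])

# Average-linkage tree on the condensed distance matrix, cut into 2 clusters.
tree = linkage(squareform(dist), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Cutting the tree with `criterion="maxclust"` is one way to avoid picking a hierarchy level by eye: you state how many clusters you want and let SciPy find the corresponding cut.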

JeeyCi · Answer 3 · 2025-02-09 17:16:19Z

First, you should make your data stationary by removing the trend and cyclic components. You can then run an ANOVA with hour or month as the categorical factor and the recorded value as the numerical response. You can also use these time steps as categorical features when training an ML model. That is still a supervised approach for evaluating the X-Y dependency, though.

Nevertheless, the assumption of stationarity is important for any kind of time series analysis.
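Two common ways to remove a trend are first differencing and subtracting a fitted line. The sketch below shows both with NumPy on a made-up series; it only handles a linear trend, does not address cyclic components, and does not test whether the result is actually stationary.

```python
import numpy as np

def difference(series):
    """First differences: x[t] - x[t-1]."""
    return np.diff(np.asarray(series, dtype=float))

def remove_linear_trend(series):
    """Subtract the least-squares line from the series."""
    x = np.asarray(series, dtype=float)
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, deg=1)
    return x - (slope * t + intercept)

# Toy series with a perfectly linear trend, invented for illustration.
trending = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

diffed = difference(trending)            # constant steps remain
residual = remove_linear_trend(trending)  # approximately zero everywhere
print(diffed)
print(residual)
```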

For clustering (an unsupervised approach), feature selection matters as well. The tsfresh Python package provides useful features for your analysis (mean, median, standard deviation, entropy); example here. Whether to compute them over a sliding window is up to you.
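To show what such features look like without depending on tsfresh, here is a hand-rolled NumPy sketch. None of these functions are tsfresh calls: the names are hypothetical, and the entropy here is a simple Shannon entropy of binned values, which is only one of several entropy definitions and may differ from what tsfresh computes.

```python
import numpy as np

def histogram_entropy(x, bins=10):
    """Shannon entropy (in nats) of the binned value distribution."""
    counts, _edges = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def summary_features(x):
    """Mean, median, standard deviation and binned entropy of one series."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": float(x.mean()),
        "median": float(np.median(x)),
        "std": float(x.std()),
        "entropy": histogram_entropy(x),
    }

def sliding_window_features(x, window):
    """Summary features for every full window of the given length."""
    x = np.asarray(x, dtype=float)
    return [summary_features(x[i:i + window]) for i in range(len(x) - window + 1)]

# Toy series, invented for illustration.
series = [3, 5, 4, 6, 8, 7, 9, 12]
whole = summary_features(series)
windows = sliding_window_features(series, window=4)
print(whole)
print(len(windows))
```

Each dictionary can be turned into one row of a feature matrix, so that every series (or every window) becomes a fixed-length vector for clustering.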

I would also demean all the data first to get comparable values for analysis.

Also, the uncertainty from quantile regression can be used as a feature for clustering.

DTW is not the only distance that can be used to cluster time series: the dtwclust package in R offers a wider variety of distances than what is available in Python.

Everything depends on the aim of your analysis. And, of course, you should add some example test data to your question.

P.S. For hierarchical clustering, see this example.

P.P.S. An interesting related topic is "Ignoring Temporal Dependencies" (though not for unsupervised clustering):

"Remember that time series data has temporal dependencies. Make sure your feature selection methods account for this."

Perhaps this can be achieved by optimizing the window size, or by wavelength selection as in spectroscopy; see here.
