Question: Clustering w/ Motifs · stumpy-dev/stumpy · Discussion #71

pindapuj
Oct 3, 2019

Thank you for this great package!

I'm trying to cluster a set of time series, based on a "shared" set of motifs, and corresponding sub-sequences. I thought of extracting the motifs given the matrix profile for each time series separately, and then doing some k-means clustering based on the subsequences of each time series over all motifs. (The subsequences for each found motif for a given time series would represent the "features.")

//pseudo-code
for sample in sample_set:
 motif_set += find_motifs_from_sample(sample)
//some post processing b/c the motifs will result in different lengths
clusters = kmeans(motif_set)

But, I think that this approach amplifies individual time series differences, so the motifs aren't generalizable. Instead, I was wondering if it made more sense to find motifs shared across the set of time series. (Not sure if this is the correct way to think about the time series set. Its not really a multi-dimensional dataset b/c each sample is a representation of the same phenomenon.)

for i in sample_set:
 for j in sample_set:
 mp = matrix_profile(i,j)
 motifs += find_motifs_from_profile(mp)
//somehow compute which motifs are most shared
// compute the subsequences for this shared set of motifs
// use those as "features" for kmeans.

Is there a natural or better way of finding these motifs or using matrix-profiles/motifs for clustering? For example, I assume it would be incorrect to simply concatenate all my time series together separated by nans, and find the matrix profile and motifs for the entire profile. Then, treat those motifs as "shared" for clustering.

Any help would be appreciated! I'm very new to the world of time series mining. Thank you!

Replies: 6 comments 3 replies

seanlaw
Oct 3, 2019
Maintainer

@pindapuj Thank you for your interest in STUMPY and please correct me if I'm not understanding correctly. It sounds like you are essentially looking for a distance measure that allows you to compare multiple time series (of course, you can use that distance measure as input for clustering). Is this correct?

If so, then you might be interested in this metric called MPdist (published the original authors of the matrix profile papers). There is currently an open issue #46 that @Deepakgthomas is working on but you are welcome to pitch in as well. Have a look at the paper and let me know if that helps.

0 replies

seanlaw
Oct 3, 2019
Maintainer

@pindapuj I'm always curious as to what applications there are. Can you give me a brief description of what specific use case you are using STUMPY for?

0 replies

pindapuj
Oct 3, 2019
Author

@seanlaw Yes, I believe that is correct. If I understood correctly, I would have pairwise similarity and not necessarily full set similarity. (This is why I planning to compute the motifs across the entire set of observations and using the corresponding sub sequences as "features"-- I'd capture "global" structure.) MPdist does implicitly capture pair-wise motif-like information b/c it looks for the nearest neighbor subsequences. If I understand correctly, I'd be doing the following.

 // From #46 
 Compute the matrix profile AB for stumpy.stump(A, m=50, T_B=B, ignore_trivial=False)
 Compute the matrix profile BA for stumpy.stump(B, m=50, T_B=A, ignore_trivial=False)
 Concatenate both matrix profiles into PABBA
 Choose some k that is 5 percent of 2 * n

Say I have n samples. Then I would need to do n^2 matrix profile computations, and pull out k for each of them.

//len(obs_a) = 30
//k = 3
// mpdist is upper triangular
mpdist = np.empty((n,n))
for i in range(len(observations)):
 for j in range(i,len(observations)):
 if i != j:
 mp_ab= stumpy.stump(observations[i], m=4, T_B=observations[j], ignore_trivial=False)
 mp_ba = stumpy.stumpy(observations[j], m=4, T_B=observation[i], ignore_trivial=False)
 mp_abba = np.concat(mp_ab,mp_ba)
 \\ Can I really just pull out the k lowest?
 k_smallest = np.argpartition(mp_abba,k=3)
 mp_dist[i][j] = k_smallest
 mp_dist[j][i] = k_smallest
\\ Now, given the full matrix of mp_dist, each row represents the features of each observation.

Each row of the matrix, can be seen as that observation's feature vector. The maximum of the row gives the 1NN, but it need not be symmetric. ( NN(A) = B != NN(B) = A). I can use kmeans or some other clustering algorithm to group the edges.

I'm using STUMPY to do some exploratory analysis on some graph edge streams. I'd like understand the different types of edges that occur in my graph setting, so that I can use the edge clusters as "prototypes" for predicting graph behavior.

Thank you for your help!

0 replies

seanlaw
Oct 3, 2019
Maintainer

Yes, I believe that what you showed above is the right approach. I strongly suggest that you go over the MPdist paper if you haven’t already.

Please report back and let us know if it worked out!

_{Sent with GitHawk}

0 replies

seanlaw
Oct 17, 2019
Maintainer

@pindapuj Closing this for now but feel free to re-open if you have any further questions or comments!

0 replies

marwaajouz
Aug 28, 2023

Hello, I am currently working on a similar issue. I have 3 time series and I need to extract representative days. The approach that I am thinking of is to extract motifs for each time series and then cluster based on these motifs. I do understand this conversation occurred around 4 years ago. Are there new techniques that could be used using stumpy?

3 replies

@NimaSarajpoor

NimaSarajpoor Aug 28, 2023
Collaborator

@marwaajouz
Since your question is a little bit vague (e.g. what are those time series data? (like their length, etc), and what is "representative day"?), I am going to add some assumptions/ questions based on my previous experience.

Assumptions [Please correct those assumptions if they are wrong]
1- The length of the three time series (T1, T2, T3) are the same, maybe 8760 each as 8760 hours of one year?
2- Each time series is basically 365 days, each with 24 timestamp, that are concatenated with each other, ergo 8760 hr.
3- THEREFORE: Your time series is two-dimensional, in other words, the value at i-th time stamp is the array [T1[i], T2[i], T3[i]]
4- When you group "similar" days, you want the patterns that are from T1 to be similar to each other. The same goes for the patterns that are from T2, and the patterns that are from T3.

Questions
what is the definition of "similarity" in your problem?

@marwaajouz

marwaajouz Aug 31, 2023

Hello,
To answer your questions, yes you are correct in assumptions 1,2 and 3. All 3 time series have the same length with 1-hr timestamp. What I need to do is to get the motifs for time-series 1, motifs for time-series 2, and motifs for time-series 3. Then I cluster the three time-series together based on the motifs of each one; meaning, I want to extract for example 2 clusters. The hrs included in cluster 1 are the same for time-series 1, time-series 2 and time-series 3. And the hrs of cluster 2 are the same for time-series 1, time-series 2 and time-series 3. Hope this is clear

@NimaSarajpoor

NimaSarajpoor Sep 3, 2023
Collaborator

So, let's say you have something like this:

T_1 = [0, 0, 0, 1, 1, 1, 0 , 0, 0, 0]
T_2 =[10, 10, 10, -1, -1, -1, 10, 10, 10, 10]
T_3 =[-10, -10, -10, 0, 0, 0, -1, -1, -1, -1]

And you want to have two clusters, where one cluster has elements with indices [0, 1, 2, 6, 7, 8, 9] and the other has elements with indices [3, 4, 5]. Well, IMO, you are dealing with a complicated problem as, according to my knowledge, I have not seen any algorithm (or even a clear definition!) for it. You can try to take a look at the matrix profile basics tutorial, motifs tutorial, and maybe snippets tutorial as well. They may give you some idea. No guarantee though.

Question: Clustering w/ Motifs #71

Uh oh!

Uh oh!

pindapuj Oct 3, 2019

Replies: 6 comments · 3 replies

Uh oh!

Uh oh!

seanlaw Oct 3, 2019 Maintainer

Uh oh!

seanlaw Oct 3, 2019 Maintainer

Uh oh!

pindapuj Oct 3, 2019 Author

Uh oh!

seanlaw Oct 3, 2019 Maintainer

Uh oh!

seanlaw Oct 17, 2019 Maintainer

Uh oh!

marwaajouz Aug 28, 2023

Uh oh!

Uh oh!

NimaSarajpoor Aug 28, 2023 Collaborator

Uh oh!

marwaajouz Aug 31, 2023

Uh oh!

NimaSarajpoor Sep 3, 2023 Collaborator

pindapuj
Oct 3, 2019

Replies: 6 comments 3 replies

seanlaw
Oct 3, 2019
Maintainer

seanlaw
Oct 3, 2019
Maintainer

pindapuj
Oct 3, 2019
Author

seanlaw
Oct 3, 2019
Maintainer

seanlaw
Oct 17, 2019
Maintainer

marwaajouz
Aug 28, 2023

NimaSarajpoor Aug 28, 2023
Collaborator

NimaSarajpoor Sep 3, 2023
Collaborator