Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Question: Clustering w/ Motifs #71

Unanswered
pindapuj asked this question in Q&A
Discussion options

Thank you for this great package!

I'm trying to cluster a set of time series, based on a "shared" set of motifs, and corresponding sub-sequences. I thought of extracting the motifs given the matrix profile for each time series separately, and then doing some k-means clustering based on the subsequences of each time series over all motifs. (The subsequences for each found motif for a given time series would represent the "features.")

//pseudo-code
for sample in sample_set:
 motif_set += find_motifs_from_sample(sample)
//some post processing b/c the motifs will result in different lengths
clusters = kmeans(motif_set)

But, I think that this approach amplifies individual time series differences, so the motifs aren't generalizable. Instead, I was wondering if it made more sense to find motifs shared across the set of time series. (Not sure if this is the correct way to think about the time series set. Its not really a multi-dimensional dataset b/c each sample is a representation of the same phenomenon.)

for i in sample_set:
 for j in sample_set:
 mp = matrix_profile(i,j)
 motifs += find_motifs_from_profile(mp)
//somehow compute which motifs are most shared
// compute the subsequences for this shared set of motifs
// use those as "features" for kmeans.

Is there a natural or better way of finding these motifs or using matrix-profiles/motifs for clustering? For example, I assume it would be incorrect to simply concatenate all my time series together separated by nans, and find the matrix profile and motifs for the entire profile. Then, treat those motifs as "shared" for clustering.

Any help would be appreciated! I'm very new to the world of time series mining. Thank you!

You must be logged in to vote

Replies: 6 comments 3 replies

Comment options

@pindapuj Thank you for your interest in STUMPY and please correct me if I'm not understanding correctly. It sounds like you are essentially looking for a distance measure that allows you to compare multiple time series (of course, you can use that distance measure as input for clustering). Is this correct?

If so, then you might be interested in this metric called MPdist (published the original authors of the matrix profile papers). There is currently an open issue #46 that @Deepakgthomas is working on but you are welcome to pitch in as well. Have a look at the paper and let me know if that helps.

You must be logged in to vote
0 replies
Comment options

@pindapuj I'm always curious as to what applications there are. Can you give me a brief description of what specific use case you are using STUMPY for?

You must be logged in to vote
0 replies
Comment options

@seanlaw Yes, I believe that is correct. If I understood correctly, I would have pairwise similarity and not necessarily full set similarity. (This is why I planning to compute the motifs across the entire set of observations and using the corresponding sub sequences as "features"-- I'd capture "global" structure.) MPdist does implicitly capture pair-wise motif-like information b/c it looks for the nearest neighbor subsequences. If I understand correctly, I'd be doing the following.

 // From #46 
 Compute the matrix profile AB for stumpy.stump(A, m=50, T_B=B, ignore_trivial=False)
 Compute the matrix profile BA for stumpy.stump(B, m=50, T_B=A, ignore_trivial=False)
 Concatenate both matrix profiles into PABBA
 Choose some k that is 5 percent of 2 * n

Say I have n samples. Then I would need to do n^2 matrix profile computations, and pull out k for each of them.

//len(obs_a) = 30
//k = 3
// mpdist is upper triangular
mpdist = np.empty((n,n))
for i in range(len(observations)):
 for j in range(i,len(observations)):
 if i != j:
 mp_ab= stumpy.stump(observations[i], m=4, T_B=observations[j], ignore_trivial=False)
 mp_ba = stumpy.stumpy(observations[j], m=4, T_B=observation[i], ignore_trivial=False)
 mp_abba = np.concat(mp_ab,mp_ba)
 \\ Can I really just pull out the k lowest?
 k_smallest = np.argpartition(mp_abba,k=3)
 mp_dist[i][j] = k_smallest
 mp_dist[j][i] = k_smallest
\\ Now, given the full matrix of mp_dist, each row represents the features of each observation. 

Each row of the matrix, can be seen as that observation's feature vector. The maximum of the row gives the 1NN, but it need not be symmetric. ( NN(A) = B != NN(B) = A). I can use kmeans or some other clustering algorithm to group the edges.

I'm using STUMPY to do some exploratory analysis on some graph edge streams. I'd like understand the different types of edges that occur in my graph setting, so that I can use the edge clusters as "prototypes" for predicting graph behavior.

Thank you for your help!

You must be logged in to vote
0 replies
Comment options

Yes, I believe that what you showed above is the right approach. I strongly suggest that you go over the MPdist paper if you haven’t already.

Please report back and let us know if it worked out!

Sent with GitHawk

You must be logged in to vote
0 replies
Comment options

@pindapuj Closing this for now but feel free to re-open if you have any further questions or comments!

You must be logged in to vote
0 replies
Comment options

Hello, I am currently working on a similar issue. I have 3 time series and I need to extract representative days. The approach that I am thinking of is to extract motifs for each time series and then cluster based on these motifs. I do understand this conversation occurred around 4 years ago. Are there new techniques that could be used using stumpy?

You must be logged in to vote
3 replies
Comment options

@marwaajouz
Since your question is a little bit vague (e.g. what are those time series data? (like their length, etc), and what is "representative day"?), I am going to add some assumptions/ questions based on my previous experience.

Assumptions [Please correct those assumptions if they are wrong]
1- The length of the three time series (T1, T2, T3) are the same, maybe 8760 each as 8760 hours of one year?
2- Each time series is basically 365 days, each with 24 timestamp, that are concatenated with each other, ergo 8760 hr.
3- THEREFORE: Your time series is two-dimensional, in other words, the value at i-th time stamp is the array [T1[i], T2[i], T3[i]]
4- When you group "similar" days, you want the patterns that are from T1 to be similar to each other. The same goes for the patterns that are from T2, and the patterns that are from T3.

Questions
what is the definition of "similarity" in your problem?

Comment options

Hello,
To answer your questions, yes you are correct in assumptions 1,2 and 3. All 3 time series have the same length with 1-hr timestamp. What I need to do is to get the motifs for time-series 1, motifs for time-series 2, and motifs for time-series 3. Then I cluster the three time-series together based on the motifs of each one; meaning, I want to extract for example 2 clusters. The hrs included in cluster 1 are the same for time-series 1, time-series 2 and time-series 3. And the hrs of cluster 2 are the same for time-series 1, time-series 2 and time-series 3. Hope this is clear

Comment options

So, let's say you have something like this:

T_1 = [0, 0, 0, 1, 1, 1, 0 , 0, 0, 0]
T_2 =[10, 10, 10, -1, -1, -1, 10, 10, 10, 10]
T_3 =[-10, -10, -10, 0, 0, 0, -1, -1, -1, -1]

And you want to have two clusters, where one cluster has elements with indices [0, 1, 2, 6, 7, 8, 9] and the other has elements with indices [3, 4, 5]. Well, IMO, you are dealing with a complicated problem as, according to my knowledge, I have not seen any algorithm (or even a clear definition!) for it. You can try to take a look at the matrix profile basics tutorial, motifs tutorial, and maybe snippets tutorial as well. They may give you some idea. No guarantee though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Category
Q&A
Labels
None yet
Converted from issue

This discussion was converted from issue #71 on December 11, 2020 18:30.

AltStyle によって変換されたページ (->オリジナル) /