UCR Suite for Time Series Subsequence Search
The UCR Suite:
Funded by NSF IIS-1161997 II.
This webpage was built in support of the UCR Suite, software
that enables ultrafast subsequence search under both Dynamic Time Warping (DTW)
and Euclidean Distance (ED). The work first appeared in a SIGKDD 2012 paper.
ACM SIGKDD Best Paper Award Winner 2012
ACM SIGKDD Test of Time Winner 2022
We observe that UCR-Suite wins in exact query answering
and on hard queries. (Echihabi et al., VLDB 2019)
Recent optimizations on DTW similarity search (the UCR Suite) can make this entire operation feasible in real time. (Stuart Russell et al., CHI 2013)
The UCR Suite was developed by:
- UC Riverside: Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Qiang Zhu, Jesin Zakaria, Eamonn Keogh
- University of Sao Paulo: Gustavo Batista
- Brigham and Women's Hospital: Brandon Westover
- Authors Rakthanmanon, Campana, Mueen and Batista contributed equally, and should be considered joint first authors.
How fast is the UCR-Suite? It
depends on the data, query length, query shape, hardware, warping
constraint, etc. However, to a first approximation:
- We can search a million datapoints in a second...
- We can search billions of datapoints in minutes...
- We can search trillions of datapoints in hours.
What are the advantages of the UCR-Suite?
- It is exact, not approximate.
- It does not require parameters to be set.
- It requires zero preprocessing time.
- It correctly z-normalizes the data.
- It has no minimum or maximum query length (we have searched with queries as short as 16 and as long as 72,500 datapoints; see the DNA video).
- It can also handle exact queries under uniform scaling.
- The same idea works for both streaming data and batch offline search.
- Finally, we are simply much faster than any known technique.
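To pin down what "exact" and "z-normalized" mean in the list above, here is a minimal brute-force sketch of subsequence search under z-normalized Euclidean distance. It is an illustrative baseline only, not the UCR Suite code: the names `znorm` and `best_match_ed` are hypothetical, every window is renormalized from scratch, and the only speed-up shown is simple early abandoning.

```python
import math

def znorm(x):
    """Return x rescaled to zero mean and unit standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0.0:
        # A constant window has no shape; map it to all zeros (an assumption).
        return [0.0] * n
    return [(v - mean) / std for v in x]

def best_match_ed(series, query):
    """Exhaustively find the window of `series` closest to `query`.

    The query and every candidate window are z-normalized before being
    compared. Returns (start_index, euclidean_distance).
    """
    m = len(query)
    if m == 0 or m > len(series):
        raise ValueError("query must be non-empty and no longer than series")
    q = znorm(query)
    best_idx, best_sq = -1, math.inf
    for i in range(len(series) - m + 1):
        w = znorm(series[i:i + m])
        sq = 0.0
        for a, b in zip(q, w):
            sq += (a - b) ** 2
            if sq >= best_sq:
                break  # early abandon: this window cannot beat the best so far
        if sq < best_sq:
            best_idx, best_sq = i, sq
    return best_idx, math.sqrt(best_sq)
```

Because both sides are z-normalized, a query such as `[10, 40, 10]` matches a window such as `[1, 4, 1]` with distance (numerically) zero: offset and amplitude are ignored and only shape is compared.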
Here we show we can search a day-long ECG tracing in 35 seconds under DTW, using a single core.
Using the same query, we can search a year of ECG (8,518,554,188 datapoints) in 18 minutes using a multi-core machine.
Thus we can search 256 Hz signals about thirty thousand times faster than real time.
[Embedded video]
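The "thirty thousand times faster than real time" figure can be checked from the numbers quoted for the year-long ECG search. The sketch below assumes the 256 Hz sampling rate stated above and takes "18 minutes" as exactly 1,080 seconds; `real_time_factor` is a hypothetical helper name.

```python
def real_time_factor(num_points, sample_rate_hz, search_seconds):
    """Seconds of recorded signal searched per second of computation."""
    signal_seconds = num_points / sample_rate_hz
    return signal_seconds / search_seconds

# Year-long ECG: 8,518,554,188 datapoints searched in 18 minutes.
factor = real_time_factor(8_518_554_188, 256, 18 * 60)
print(f"{factor:,.0f}x real time")
```

Under these assumptions the ratio comes out a little above 30,000, consistent with the rounded claim.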
Here we show we can support very long queries. We search for a query of length 72,500 in 21,435,268 datapoints in 18 seconds.
The reference dendrogram we compared to at the end of this video is from:
D. P. Locke, et al. 2011.
Comparative and demographic analysis of orangutan genomes. Nature 469, 529-533.
[Embedded video]
How does changing the width of the warping constraint affect the speed-up? See here for the numbers;
in brief, it makes very little difference. Over the range of 0
to 15, which would include the best accuracy setting for the vast
majority of the UCR archive problems, the difference is barely
perceptible.
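For readers unfamiliar with the warping width, the sketch below shows a textbook DTW restricted to a Sakoe-Chiba band. It is a plain dynamic-programming illustration, not the UCR Suite implementation. The name `dtw_band` is hypothetical, and the width `r` is expressed, as an assumption, in number of datapoints.

```python
import math

def dtw_band(a, b, r):
    """DTW distance (squared point costs) between equal-length a and b.

    r is the band half-width: a[i] may align with b[j] only if |i - j| <= r.
    With r = 0 the result equals the Euclidean distance.
    """
    n = len(a)
    if n != len(b):
        raise ValueError("sequences must have equal length")
    if n == 0:
        return 0.0
    # prev[j] holds the best cumulative cost for the previous row of the table.
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (n + 1)
        # Only cells inside the band are filled; the rest stay at infinity.
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return math.sqrt(prev[n])
```

A wider band means more cells per row, so the naive cost grows roughly in proportion to `r`. For example, two copies of a single spike shifted by one position are at Euclidean distance sqrt(2) with `r = 0`, but align perfectly (distance 0) with `r = 1`.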
Code:
The code.
Data:
- The Face (four) dataset has been available for 8 years here, along with the Gun/NoGun data and all UCR archive data.
- The raw DNA came from UCSC; our code to convert it to time series is here.
- The music symbols were collected by Alicia Fornes; they are here. See Fig. 10 of this paper for samples.
- The online motif data is here.
- The code for random walk is here, including the exact seeds we used. See also.
- The 20 million random walk dataset is here, including all the queries used.
- The 22 hours and 23 minutes of ECG data (20,140,000 datapoints) shown in the video above is here, together with the exact query.
- The 1,000 star light curve data is the entire training set from StarLightCurves, archived here.
- The 1.08 year of ECG data came from Physionet.org.
Here we list the exact set of data we trawled. It is too large for
our servers to host. If you want the exact data, just send us a 16 GB thumb
drive with your return address, and we will pay return shipping.