I have a list of dates on which a certain event happened. Taking into consideration that a particular occurrence of the event can:
- Be a one time thing or
- Be part of a periodic series, which happens every N months, where N is an integer
I want to classify and group the dates as they relate to each other. That is, I want to be able to detect when a periodic series is happening and which of the dates in the set belong to that periodic series and which ones are one-time events.
Examples:
20/01/2016, 20/02/2016, 20/03/2016, 20/04/2016
is a set of events which happen once a month.

20/01/2016, 25/01/2016, 20/02/2016, 20/03/2016, 20/04/2016
is a set of events which happen once a month, with the exception of the second one, which is a one-time-only event.

20/01/2016, 20/03/2016, 20/04/2016, 20/05/2016, 20/07/2016
is a set of events which happen once every two months, with the exception of the third event, which is a one-time-only event.
I realise that it is not possible to create a conclusive algorithm (who is to say the last example does not show three different streams of events: one happening every two months, until 20/03/2016; then one happening every month until 20/05/2016; and finally a one-time event at 20/07/2016?), but I was wondering if there is some known algorithm which can do this kind of frequency detection, even if non-conclusively.
Happy to implement the algorithm myself if it's easy enough and explained in detail. If libraries are necessary, please consider the project is in Python.
5 Answers
What you've left out of your requirements is what bit of info will be used to classify them.
Given the data you've presented, I have to make two assumptions to make this solvable:
- That each date given in a set is the same event, that is, there is no other distinguishing data to consider
- That you want to use the day on which the event happened to classify
If those hold, then all you have to do is dump each event into a dictionary using the day of the month on which it occurred as the key. Let's call these events A.
Now, over the same period of time during which the set of events was collected, create an event for every day that occurred. Put that set in another dictionary. You now have a way to count how many firsts, seconds, thirds, etc. of the month have happened. Let's call these events B.
Each day of the month is a bucket. Could be as many as 31 buckets. For each bucket divide A by B.
If a monthly event happened without fail on the day of the month in question, its score will be 1. If it never happened, its score will be 0. You can set your classifying cutoff value anywhere in between those two numbers.
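A minimal Python sketch of this idea (the observation window bounds and the function name are my assumptions, not part of the answer):

```python
from collections import Counter
from datetime import date, timedelta

def day_of_month_scores(event_dates, window_start, window_end):
    """Divide A (events seen on each day of the month) by B (how often that
    day of the month occurred within the observation window)."""
    observed = Counter(d.day for d in event_dates)      # the "A" dictionary
    possible = Counter()                                 # the "B" dictionary
    day = window_start
    while day <= window_end:
        possible[day.day] += 1
        day += timedelta(days=1)
    return {dom: observed[dom] / possible[dom] for dom in possible if observed[dom]}

events = [date(2016, 1, 20), date(2016, 2, 20), date(2016, 3, 20), date(2016, 4, 20)]
print(day_of_month_scores(events, date(2016, 1, 1), date(2016, 4, 30)))
# {20: 1.0} -> something happened on the 20th of every month in the window
```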
-
Thanks for the suggestion. The assumptions are fine, but I believe the algorithm is missing two important bits: it does not work for events which happen e.g. every two months, and fails to classify individual events as part of a certain pattern. My last example would translate into a single bucket with 5 events over 7 possible ones. That indicates that something which looks almost monthly is occurring, but that's not what I was looking for. – user2891462, Nov 25, 2019 at 9:38
-
@user2891462, as I suggest in my answer, if you want to look for a bimonthly pattern, then you'd have to place the events into (up to) 61 possible buckets according to the day on which they occur within a bimester. In your last example data, the spurious single event on 20-Apr would then pale into insignificance compared to the four events falling on 20-Jan/Mar/May/Jul, because the former would fall into the day 51 bucket (31+20 within a bimestral cycle beginning 01-Jan), whereas the latter would fall into the day 20 bucket. – Steve, Nov 25, 2019 at 10:28
-
@user2891462 you haven't defined what you're looking for. 'Pattern' is far too broad. The third Tuesday of every month is a pattern. But adding 'every N months' is as simple as counting the days in N months and letting that be your key. – candied_orange, Nov 26, 2019 at 0:41
Not completely distinct from other answers but I'd probably go with a series of histograms. Take your data and sort it into different sets of 'buckets' or 'bins':
- day of week
- day of month
- day within a bi-weekly period
- day of year
- day within a semi-annual period
- other interesting scenarios ...
Some of these bins will contain more dates than most of the others. You might see one such bin, or several. Each of these bin sizes is a score that you can then associate back to the dates in the bin.
For each date, you then have a score under each binning strategy, and you can sort the bins by score. The top-scored bins are then the most likely candidates for the frequency. For example, if you have a series of 20 dates and 15 land on a Monday, then it's likely that you are looking at a meeting that is weekly.
You need to consider that some of these bins coincide. If a meeting is semi-annual, for example, it may also score well on the monthly binning. You probably want to assume the less frequent candidate is more likely.
I'm assuming here that you already know that these meetings are part of a series. If there are multiple series and non-series meetings in the mix, other information such as the name of the meeting and attendees can be used to distinguish them. In that case, I would also tweak the above to not count meetings on the same day towards a bin's score for a date.
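As a rough sketch of this approach in Python (the particular strategies and the scoring are simplified assumptions on my part):

```python
from collections import Counter
from datetime import date

# A few illustrative binning strategies; each maps a date to a bucket key.
strategies = {
    "day_of_week": lambda d: d.weekday(),
    "day_of_month": lambda d: d.day,
    "day_of_year": lambda d: d.timetuple().tm_yday,
}

def score_dates(dates):
    """For each date, count how many dates (itself included) share its bucket
    under each binning strategy; bigger counts suggest a stronger candidate."""
    histograms = {name: Counter(binner(d) for d in dates)
                  for name, binner in strategies.items()}
    return {d: {name: histograms[name][binner(d)]
                for name, binner in strategies.items()}
            for d in dates}

dates = [date(2016, 1, 20), date(2016, 2, 20), date(2016, 3, 20), date(2016, 4, 20)]
for d, scores in score_dates(dates).items():
    print(d, scores)   # day_of_month scores 4 for every date: a monthly candidate
```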
If the period itself is defined as involving months, then a basic frequency analysis is just looking at which days of the month accumulate a disproportionately large number of events.
If months can be skipped or if the pattern operates over a longer period, then you'd be looking at the day of the bimester, trimester, or whatever other longer-range cycle.
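For example, a sketch of how dates might be bucketed by their day within an N-month cycle (the anchor date and function name are my assumptions):

```python
from collections import Counter
from datetime import date

def day_of_cycle(d, cycle_months=2, anchor=date(2016, 1, 1)):
    """Day number within a repeating cycle of `cycle_months` months, counted
    from `anchor`. With a bimestral cycle starting 01-Jan, 20-Mar is day 20
    and 20-Apr is day 51 (31 + 20)."""
    months_from_anchor = (d.year - anchor.year) * 12 + (d.month - anchor.month)
    start_offset = months_from_anchor - months_from_anchor % cycle_months
    total = anchor.month - 1 + start_offset
    cycle_start = date(anchor.year + total // 12, total % 12 + 1, anchor.day)
    return (d - cycle_start).days + 1

dates = [date(2016, 1, 20), date(2016, 3, 20), date(2016, 4, 20),
         date(2016, 5, 20), date(2016, 7, 20)]
print(Counter(day_of_cycle(d) for d in dates))
# Counter({20: 4, 51: 1}) -> four events on day 20 of the bimester, one outlier
```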
If there is some human input in this process, the human eye will quickly detect cyclic patterns when reviewing the results of these basic analyses.
If you're looking at trying to detect and isolate individual patterns from a large number of potentially overlapping and concurrent patterns and random noise, and do it without human input, then I wouldn't know what method to use exactly, but I suspect it would be some technique from the audio processing field.
Abstract
Conceptually, you need the following elements:
- A method for enumerating hypotheses or possible interpretations for the series of dates. Note that this may possibly be of complexity O(n^2), so for larger lists of dates it may require some pruning.
- A fitness function that determines how 'good' a given hypothesis is.
Then you simply enumerate the hypotheses, keeping the one with the best fitness as the answer.
Example
Here's a simple example using these dates: 20/01/2019, 20/02/2019, 20/04/2019.
Enumerating hypotheses
There are 4 possible hypotheses:
- 3 single events
- A monthly event for 01 and 02, plus a single event for 04
- A single event for 01 and a bi-monthly event for 02 and 04
- A quarterly event for 01 and 04, plus a single event for 02
Fitness function
The fitness function may assign different penalties to the total number of elements (single events or series) in a hypothesis, and it may prefer monthly repetitions over larger intervals. Depending on the weights, you may get different 'best' results, but in this case alternative 2 would probably be chosen.
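A brute-force sketch of this scheme for small inputs (the partition enumeration, the "same day of the month" rule and the penalty weights are my assumptions, not part of the answer):

```python
from datetime import date

def month_index(d):
    return d.year * 12 + d.month

def partitions(items):
    """Yield every way of splitting the list into groups (feasible only for small lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                      # `first` starts its own group
        for i in range(len(part)):                  # or joins an existing group
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def series_period(group):
    """Period in months if the group is a valid series (0 for a single event), else None."""
    group = sorted(group)
    if len(group) == 1:
        return 0
    if len({d.day for d in group}) != 1:            # assume a series keeps the same day of month
        return None
    gaps = {month_index(b) - month_index(a) for a, b in zip(group, group[1:])}
    return gaps.pop() if len(gaps) == 1 else None

def fitness(partition):
    """Lower is better: each element costs 10, longer periods cost a little extra."""
    score = 0
    for group in partition:
        period = series_period(group)
        if period is None:
            return float("inf")                     # not a legal hypothesis
        score += 10 + period
    return score

dates = [date(2019, 1, 20), date(2019, 2, 20), date(2019, 4, 20)]
best = min(partitions(dates), key=fitness)
for group in best:
    print(sorted(group), "period (months):", series_period(group))
# picks the monthly series for 01 and 02 plus a single event for 04
```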
-
Note that this would apply to events that are already sorted into day-of-month buckets. One thing you also need to consider is how you treat days at the end of the month. Would you merge 31/01 with 28/02? 28/02 with 30/03? – Hans-Martin Mosner, Nov 25, 2019 at 10:48
-
Agreed Hans, he'd have to consider the effect of "date rolling", and also run the analysis on the assumption that events were anchored to the ends of months (that is, a fixed number of days from the end) rather than the beginnings. Defining exactly the periods he's looking for, given the variety that exists in relation to dates, could be a significant task in itself. – Steve, Nov 25, 2019 at 12:08
Proposed algorithm
- Sort the list of dates chronologically in ascending order.
- If there are any dates on the list, pop the first one (we'll call it current_date). Otherwise, the algorithm has finished.
- Create a new list of integers, with as many elements as there are dates left in the original dates list. The i-th element of this new list is the number of months separating current_date from the i-th date in the original list. If the i-th date is not separated by an integer number of months from current_date, signal it somehow (with -1, for example).
- In this list, if we ignore -1, we have a sorted list of numbers in ascending order. Find the first element which is not -1. That number will represent the frequency (in months) of the pattern we are analyzing. If there is no such element, current_date represents a one-time event, so go back to step 2.
- Divide each element of the list by this first number, so as to normalize the list. If a number does not divide evenly by this first number, make it a -1.
- Look for the consecutive numbers (omitting -1) 1, 2, 3, 4... Each entry which belongs to this sequence represents a date which has occurred in a recurring pattern every N months.
- Pop from the list all the dates which were part of this pattern (since a date cannot be part of two different patterns).
- Go back to step 2.
See it in action
So, for the last example in the original post: [20/01/2016, 20/03/2016, 20/04/2016, 20/05/2016, 20/07/2016]
- current_date = 20/01/2016; list = [20/03/2016, 20/04/2016, 20/05/2016, 20/07/2016]
- number_list = [2, 3, 4, 6]
- First element of the list: 2; Normalized list: number_list = [1, -1, 2, 3]
- Entries 0, 2 and 3 from the list form a sequence of 1, 2, 3, so they are part of the same series, which repeats every 2 months (since 2 is the element with which we have normalized).
- Having classified them, we pop them from the list, so that list = [20/05/2016].
- Start again with a new unclassified date. Pop the first element from the list, so current_date = 20/05/2016; list = [].
- There are no more elements in the list, so current_date is a one-time event.
As you can see, we have classified each individual event as either a one-time event or part of a recurring pattern every N months, with N being known for each identified pattern.
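A minimal Python sketch of the steps above (it treats "separated by an integer number of months" as falling on the same day of the month, which is an assumption on my part):

```python
from datetime import date

def month_diff(d1, d2):
    """Whole months between d1 and d2, or -1 if they don't share the same day of the month."""
    if d1.day != d2.day:
        return -1
    return (d2.year - d1.year) * 12 + (d2.month - d1.month)

def classify(dates):
    """Return (period_in_months, dates) groups; period 0 marks a one-time event."""
    remaining = sorted(dates)                                   # step 1
    groups = []
    while remaining:
        current = remaining.pop(0)                              # step 2
        diffs = [month_diff(current, d) for d in remaining]     # step 3
        period = next((n for n in diffs if n != -1), None)      # step 4
        if period is None:
            groups.append((0, [current]))                       # one-time event
            continue
        series, expected = [current], 1                         # steps 5-6
        for d, n in zip(remaining, diffs):
            if n != -1 and n % period == 0 and n // period == expected:
                series.append(d)
                expected += 1
        remaining = [d for d in remaining if d not in series]   # step 7
        groups.append((period, series))
    return groups

dates = [date(2016, 1, 20), date(2016, 3, 20), date(2016, 4, 20),
         date(2016, 5, 20), date(2016, 7, 20)]
for period, group in classify(dates):
    label = "one-time" if period == 0 else f"every {period} month(s)"
    print(label, [d.strftime("%d/%m/%Y") for d in group])
# every 2 month(s) ['20/01/2016', '20/03/2016', '20/05/2016', '20/07/2016']
# one-time ['20/04/2016']
```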