I have a list of dates on which a certain event happened. Taking into consideration that a particular occurrence of the event can:
- Be a one time thing or
- Be part of a periodic series, which happens every N months, where N is an integer
I want to classify and group the dates as they relate to each other. That is, I want to be able to detect when a periodic series is happening and which of the dates in the set belong to that periodic series and which ones are one-time events.
Examples:
20/01/2016, 20/02/2016, 20/03/2016, 20/04/2016
is a set of events which happen once a month.

20/01/2016, 25/01/2016, 20/02/2016, 20/03/2016, 20/04/2016
is a set of events which happen once a month, with the exception of the second one, which is a one-time-only event.

20/01/2016, 20/03/2016, 20/04/2016, 20/05/2016, 20/07/2016
is a set of events which happen once every two months, with the exception of the third event, which is a one-time-only event.
I realise that it is not possible to create a conclusive algorithm (who is to say the last example does not show three different streams of events: one happening every two months, until 20/03/2016; then one happening every month until 20/05/2016; and finally a one-time event at 20/07/2016?), but I was wondering if there is some known algorithm which can do this kind of frequency detection, even if non-conclusively.
Happy to implement the algorithm myself if it's easy enough and explained in detail. If libraries are necessary, please consider the project is in Python.
5 Answers
What you've left out of your requirements is what bit of info will be used to classify them.
Given the data you've presented, I have to make two assumptions to make this solvable:
- That each date given in a set is the same event, that is, there is no other distinguishing data to consider
- That you want to use the day on which the event happened to classify
If those hold, then all you have to do is dump each event into a dictionary using the day of the month on which it occurred as the key. Let's call these events A.
Now, over the same period of time during which the set of events was collected, create an event for every day that occurred. Put that set in another dictionary. You now have a way to count how many firsts, seconds, thirds, etc. of the month have happened. Let's call these events B.
Each day of the month is a bucket. Could be as many as 31 buckets. For each bucket divide A by B.
If a monthly event happened without fail on the day of the month in question, its score will be 1. If it never happened, its score will be 0. You can set your classifying cutoff value anywhere in between those two numbers.
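A minimal Python sketch of this idea (the observation window bounds and the function name are my assumptions, not part of the answer):

```python
from collections import Counter
from datetime import date, timedelta

def day_of_month_scores(event_dates, window_start, window_end):
    """Divide A (events seen on each day of the month) by B (how often that
    day of the month occurred within the observation window)."""
    observed = Counter(d.day for d in event_dates)      # the "A" dictionary
    possible = Counter()                                 # the "B" dictionary
    day = window_start
    while day <= window_end:
        possible[day.day] += 1
        day += timedelta(days=1)
    return {dom: observed[dom] / possible[dom] for dom in possible if observed[dom]}

events = [date(2016, 1, 20), date(2016, 2, 20), date(2016, 3, 20), date(2016, 4, 20)]
print(day_of_month_scores(events, date(2016, 1, 1), date(2016, 4, 30)))
# {20: 1.0} -> something happened on the 20th of every month in the window
```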
-
Thanks for the suggestion. The assumptions are fine, but I believe the algorithm is missing two important bits: it does not work for events which happen e.g. every two months, and fails to classify individual events as part of a certain pattern. My last example would translate into a single bucket with 5 events over 7 possible ones. That indicates that something which looks almost monthly is occurring, but that's not what I was looking for. – user2891462, Nov 25, 2019 at 9:38
-
@user2891462, as I suggest in my answer, if you want to look for a bimonthly pattern, then you'd have to place the events into (up to) 61 possible buckets according to the day on which they occur within a bimester. In your last example data, the spurious single event on 20-Apr would then pale into insignificance compared to the four events falling on 20-Jan/Mar/May/Jul, because the former would fall into the day 51 bucket (31+20 within a bimestral cycle beginning 01-Jan), whereas the latter would fall into the day 20 bucket. – Steve, Nov 25, 2019 at 10:28
-
@user2891462 you haven't defined what you're looking for. 'Pattern' is far too broad. The third Tuesday of every month is a pattern. But adding 'every N months' is as simple as counting the days in N months and letting that be your key. – candied_orange, Nov 26, 2019 at 0:41
Not completely distinct from other answers but I'd probably go with a series of histograms. Take your data and sort it into different sets of 'buckets' or 'bins':
- day of week
- day of month
- day within a bi-weekly period
- day of year
- day within a semi-annual period
- other interesting scenarios ...
Some of these bins will contain more dates than most of the others. You might see one such bin, or several. Each of these bin sizes is a score that you can then associate back to the dates in the bin.
For each date, you then have a score under each binning strategy, and you can sort the bins by score. The top-scored bins are then the most likely candidates for the frequency. For example, if you have a series of 20 dates and 15 land on a Monday, then it's likely that you are looking at a meeting that is weekly.
You need to consider that some of these bins coincide. If a meeting is semi-annual, for example, it may also score well on the monthly binning. You probably want to assume the less frequent candidate is more likely.
I'm assuming here that you already know that these meetings are part of a series. If there are multiple series and non-series meetings in the mix, other information such as the name of the meeting and attendees can be used to distinguish them. In that case, I would also tweak the above to not count meetings on the same day towards a bin's score for a date.
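As a rough sketch of this approach in Python (the particular strategies and the scoring are simplified assumptions on my part):

```python
from collections import Counter
from datetime import date

# A few illustrative binning strategies; each maps a date to a bucket key.
strategies = {
    "day_of_week": lambda d: d.weekday(),
    "day_of_month": lambda d: d.day,
    "day_of_year": lambda d: d.timetuple().tm_yday,
}

def score_dates(dates):
    """For each date, count how many dates (itself included) share its bucket
    under each binning strategy; bigger counts suggest a stronger candidate."""
    histograms = {name: Counter(binner(d) for d in dates)
                  for name, binner in strategies.items()}
    return {d: {name: histograms[name][binner(d)]
                for name, binner in strategies.items()}
            for d in dates}

dates = [date(2016, 1, 20), date(2016, 2, 20), date(2016, 3, 20), date(2016, 4, 20)]
for d, scores in score_dates(dates).items():
    print(d, scores)   # day_of_month scores 4 for every date: a monthly candidate
```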
If the period itself is defined as involving months, then a basic frequency analysis is just looking at which days of the month accumulate a disproportionately large number of events.
If months can be skipped or if the pattern operates over a longer period, then you'd be looking at the day of the bimester, trimester, or whatever other longer-range cycle.
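For example, a sketch of how dates might be bucketed by their day within an N-month cycle (the anchor date and function name are my assumptions):

```python
from collections import Counter
from datetime import date

def day_of_cycle(d, cycle_months=2, anchor=date(2016, 1, 1)):
    """Day number within a repeating cycle of `cycle_months` months, counted
    from `anchor`. With a bimestral cycle starting 01-Jan, 20-Mar is day 20
    and 20-Apr is day 51 (31 + 20)."""
    months_from_anchor = (d.year - anchor.year) * 12 + (d.month - anchor.month)
    start_offset = months_from_anchor - months_from_anchor % cycle_months
    total = anchor.month - 1 + start_offset
    cycle_start = date(anchor.year + total // 12, total % 12 + 1, anchor.day)
    return (d - cycle_start).days + 1

dates = [date(2016, 1, 20), date(2016, 3, 20), date(2016, 4, 20),
         date(2016, 5, 20), date(2016, 7, 20)]
print(Counter(day_of_cycle(d) for d in dates))
# Counter({20: 4, 51: 1}) -> four events on day 20 of the bimester, one outlier
```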
If there is some human input in this process, the human eye will quickly detect cyclic patterns when reviewing the results of these basic analyses.
If you're looking at trying to detect and isolate individual patterns from a large number of potentially overlapping and concurrent patterns and random noise, and do it without human input, then I wouldn't know what method to use exactly, but I suspect it would be some technique from the audio processing field.
Abstract
Conceptually, you need the following elements:
- A method for enumerating hypotheses or possible interpretations for the series of dates. Note that this may possibly be of complexity O(n^2), so for larger lists of dates it may require some pruning.
- A fitness function that determines how 'good' a given hypothesis is.
Then you simply enumerate the hypotheses, keeping the one with the best fitness as the answer.
Example
Here's a simple example using these dates: 20/01/2019, 20/02/2019, 20/04/2019.
Enumerating hypotheses
There are 4 possible hypotheses:
- 3 single events
- A monthly event for 01 and 02, plus a single event for 04
- A single event for 01 and a bi-monthly event for 02 and 04
- A quarterly event for 01 and 04, plus a single event for 02
Fitness function
The fitness function may assign different penalties to the total number of elements (single events or series) in a hypothesis, and it may prefer monthly repetitions over larger intervals. Depending on the weights, you may get different 'best' results, but in this case alternative 2 would probably be chosen.
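A brute-force sketch of this scheme for small inputs (the partition enumeration, the "same day of the month" rule and the penalty weights are my assumptions, not part of the answer):

```python
from datetime import date

def month_index(d):
    return d.year * 12 + d.month

def partitions(items):
    """Yield every way of splitting the list into groups (feasible only for small lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                      # `first` starts its own group
        for i in range(len(part)):                  # or joins an existing group
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def series_period(group):
    """Period in months if the group is a valid series (0 for a single event), else None."""
    group = sorted(group)
    if len(group) == 1:
        return 0
    if len({d.day for d in group}) != 1:            # assume a series keeps the same day of month
        return None
    gaps = {month_index(b) - month_index(a) for a, b in zip(group, group[1:])}
    return gaps.pop() if len(gaps) == 1 else None

def fitness(partition):
    """Lower is better: each element costs 10, longer periods cost a little extra."""
    score = 0
    for group in partition:
        period = series_period(group)
        if period is None:
            return float("inf")                     # not a legal hypothesis
        score += 10 + period
    return score

dates = [date(2019, 1, 20), date(2019, 2, 20), date(2019, 4, 20)]
best = min(partitions(dates), key=fitness)
for group in best:
    print(sorted(group), "period (months):", series_period(group))
# picks the monthly series for 01 and 02 plus a single event for 04
```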
-
Note that this would apply to events that are already sorted into day-of-month buckets. One thing you also need to consider is how you treat days at the end of the month. Would you merge 31/01 with 28/02? 28/02 with 30/03? – Hans-Martin Mosner, Nov 25, 2019 at 10:48
-
Agreed Hans, he'd have to consider the effect of "date rolling", and also run the analysis on the assumption that events were anchored to the ends of months (that is, a fixed number of days from the end) rather than the beginnings. Defining exactly the periods he's looking for, given the variety that exists in relation to dates, could be a significant task in itself. – Steve, Nov 25, 2019 at 12:08
Proposed algorithm
- Sort the list of dates chronologically in ascending order.
- If there are any dates on the list, pop the first one (we'll call it current_date). Otherwise, the algorithm has finished.
- Create a new list of integers, with as many elements as there are dates left in the original dates list. The i-th element of this new list is the number of months separating current_date from the i-th date in the original list. If the i-th date is not separated by an integer number of months from current_date, signal it somehow (with -1, for example).
- In this list, if we ignore -1, we have a sorted list of numbers in ascending order. Find the first element which is not -1. That number will represent the frequency (in months) of the pattern we are analyzing. If there is no such element, current_date represents a one-time event, so go back to step 2.
- Divide each element of the list by this first number, so as to normalize the list. If a number does not divide evenly by this first number, make it a -1.
- Look for the consecutive numbers (omitting -1) 1, 2, 3, 4... Each entry which belongs to this sequence represents a date which has occurred in a recurring pattern every N months.
- Pop from the list all the dates which were part of this pattern (since a date cannot be part of two different patterns).
- Go back to step 2.
See it in action
So, for the last example in the original post: [20/01/2016, 20/03/2016, 20/04/2016, 20/05/2016, 20/07/2016]
- current_date = 20/01/2016; list = [20/03/2016, 20/04/2016, 20/05/2016, 20/07/2016]
- number_list = [2, 3, 4, 6]
- First element of the list: 2; Normalized list: number_list = [1, -1, 2, 3]
- Entries 0, 2 and 3 from the list form a sequence of 1, 2, 3, so they are part of the same series, which repeats every 2 months (since 2 is the element with which we have normalized).
- Having classified them, we pop them from the list, so that list = [20/05/2016].
- Start again with a new unclassified date. Pop the first element from the list, so current_date = 20/05/2016; list = [].
- There are no more elements in the list, so current_date is a one-time event.
As you can see, we have classified each individual event as either a one-time event or part of a recurring pattern every N months, with N being known for each identified pattern.
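A minimal Python sketch of the steps above (it treats "separated by an integer number of months" as falling on the same day of the month, which is an assumption on my part):

```python
from datetime import date

def month_diff(d1, d2):
    """Whole months between d1 and d2, or -1 if they don't share the same day of the month."""
    if d1.day != d2.day:
        return -1
    return (d2.year - d1.year) * 12 + (d2.month - d1.month)

def classify(dates):
    """Return (period_in_months, dates) groups; period 0 marks a one-time event."""
    remaining = sorted(dates)                                   # step 1
    groups = []
    while remaining:
        current = remaining.pop(0)                              # step 2
        diffs = [month_diff(current, d) for d in remaining]     # step 3
        period = next((n for n in diffs if n != -1), None)      # step 4
        if period is None:
            groups.append((0, [current]))                       # one-time event
            continue
        series, expected = [current], 1                         # steps 5-6
        for d, n in zip(remaining, diffs):
            if n != -1 and n % period == 0 and n // period == expected:
                series.append(d)
                expected += 1
        remaining = [d for d in remaining if d not in series]   # step 7
        groups.append((period, series))
    return groups

dates = [date(2016, 1, 20), date(2016, 3, 20), date(2016, 4, 20),
         date(2016, 5, 20), date(2016, 7, 20)]
for period, group in classify(dates):
    label = "one-time" if period == 0 else f"every {period} month(s)"
    print(label, [d.strftime("%d/%m/%Y") for d in group])
# every 2 month(s) ['20/01/2016', '20/03/2016', '20/05/2016', '20/07/2016']
# one-time ['20/04/2016']
```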