Predicting time series data with Python

Question 1

I'm starting with machine learning and so far have only tested scikit-learn but I couldn't find the right algorithm or an example similar to my problem.

I have a time series showing where an event happened. The location of each event is identified by an integer between 1 and 25 (inclusive). On a given date, an event cannot happen at the same place twice, and it always happens in 5 places.

My data looks like this:

2015-01-01,1,3,5,8,9,10
2015-01-03,23,16,3,5,9
2015-01-05,22,16,6,13,11

The first column is the date and the others are the places. Dates aren't included if nothing happened.

Do you have any recommendations on which algorithm I should look at to predict the numbers (places) for the next entry in the time series?

An algorithm that is available in a Python library like scikit-learn would be perfect!

Question 2

Maybe you could use the pandas library to read this CSV.

Question 3

One idea would be to treat it as a multi-label classification problem. You can imagine your target y as having 25 columns (actually 24, but forget about that for now), where each column is 1 or 0, representing whether or not the event happened at that location.
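
A minimal sketch of that target encoding, assuming the locations are the integers 1 to 25 from the question. The three events below are illustrative data only. Passing `classes` explicitly is an assumption made here so that all 25 columns exist even when some locations never appear in the sample.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative events: each inner list holds the 5 locations hit on one date.
events = [
    [1, 3, 5, 8, 9],
    [23, 16, 3, 5, 9],
    [22, 16, 6, 13, 11],
]

# Fix the label set to 1..25 so every location gets a column,
# even those that never occur in the data.
mlb = MultiLabelBinarizer(classes=list(range(1, 26)))
y = mlb.fit_transform(events)

print(y.shape)        # (3, 25)
print(y.sum(axis=1))  # [5 5 5]: five locations per event
```

Column j of `y` corresponds to location `mlb.classes_[j]`, so column 0 is location 1.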

As predictors for your X you can choose some lagged average or the last, let's say, 3 observations. See this question for more details.
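
The lagged-average option could look like the sketch below. It assumes `labels` is the binarized event matrix (rows are events in time order, columns are locations); the small matrix and the window size of 3 are illustrative only. Shifting by one row means each feature row uses only events strictly before the one being predicted.

```python
import numpy as np
import pandas as pd

# Illustrative binarized events: rows = events in time order, columns = locations.
labels = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 0, 1],
     [1, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 1, 1]],
    columns=[1, 2, 3, 4],
)

window = 3  # assumed window size

# Per-location mean over the previous `window` events;
# shift(1) keeps the current event out of its own features.
lagged_mean = labels.rolling(window).mean().shift(1)

# The first `window` rows have no complete history, so drop them.
X = lagged_mean.iloc[window:].to_numpy()
y = labels.iloc[window:].to_numpy()

print(X.shape, y.shape)  # (2, 4) (2, 4)
```

Each value in `X` is the fraction of the last three events in which that location occurred.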

Some code:

from io import StringIO

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Small example: one event per row, date first, then the places.
s = """
2015-01-01,1,2,3
2015-01-03,1,2,4
2015-01-05,1,2,4
2015-01-07,1,4,3
"""
df = pd.read_csv(StringIO(s), index_col=0, parse_dates=True, header=None)

# One 0/1 column per place that occurs in the data.
mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(df.values)
print(labels)
[[1 1 1 0]
 [1 1 0 1]
 [1 1 0 1]
 [1 0 1 1]]

We have 4 classes and 4 examples so we get a 4x4 matrix. Columns represent classes/locations and rows are events.

Now we will use the first 3 observations to predict the fourth one:

X = labels[:-1]
print(X)
[[1 1 1 0]
 [1 1 0 1]
 [1 1 0 1]]

X now holds 3 observations of 4 classes. Together they make up a single training sample, so we need to flatten them into one vector:

>>> X.flatten()
[1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

Each column here is a feature/predictor that can be interpreted in the following way: a 1 in the first column means that class 1 was present three observations ago. A 0 in the 7th column means that class 3 was not present two observations ago, and so on.

So now we have one sample/event (one row of the final X matrix) and the corresponding label (one row of the target y):

>>> labels[-1]
[1 0 1 1]

If you follow this procedure you will be able to get a training set that can be fed to a classifier.
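
Repeating this for every position in the history can be sketched as follows. `make_windows` is a hypothetical helper (not part of scikit-learn), and `window_size=3` is assumed from the example above.

```python
import numpy as np

def make_windows(labels, window_size=3):
    """Turn a (n_events, n_classes) 0/1 matrix into a supervised data set.

    Row i of X is the flattened block labels[i:i + window_size];
    row i of y is the event that follows it, labels[i + window_size].
    """
    labels = np.asarray(labels)
    n_events = labels.shape[0]
    if n_events <= window_size:
        raise ValueError("need more events than window_size")
    X = np.array([labels[i:i + window_size].ravel()
                  for i in range(n_events - window_size)])
    y = labels[window_size:]
    return X, y

labels = np.array([[1, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 0, 1],
                   [1, 0, 1, 1]])

X, y = make_windows(labels, window_size=3)
print(X)  # [[1 1 1 0 1 1 0 1 1 1 0 1]]
print(y)  # [[1 0 1 1]]
```

With only four events and a window of three there is a single training row; a longer history yields n_events - window_size rows.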

Question 4

Do you mean something like this? gist.github.com/anonymous/d80d2a571b7b53c518d2 What should my predictors look like?

Question 5

Thanks for the code but it looks like enc is not defined. Besides that I understand your explanation. I'm just not sure how I should call RandomForestClassifier.fit(training, labels). Should I call it for each place history? Example: [1,1,1,1] for place 1, [1,1,1,0] for place 2 etc.

Question 6

That line was a mistake, see my edit. What I explained is how to obtain one row of X and y. You should do the same for each position in the history, and in this case you will get n_events - window_size rows like the one I have just shown. Then you can call RandomForestClassifier.fit(X, y).
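
To make that final step concrete, here is a hedged sketch of fitting and predicting. The random event history, `window_size = 3` and `n_estimators=50` are all assumptions for illustration; random data carries no real signal, so only the shapes and the workflow matter. Since the question says exactly 5 places occur per date, the sketch ranks places by predicted probability and keeps the top 5. That is one possible way to enforce the constraint, not something the answer prescribes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_events, n_places, window_size = 60, 25, 3

# Synthetic history: each event picks 5 distinct places out of 25 (illustrative only).
labels = np.zeros((n_events, n_places), dtype=int)
for row in labels:
    row[rng.choice(n_places, size=5, replace=False)] = 1

# One training row per window: the previous `window_size` events, flattened.
X = np.array([labels[i:i + window_size].ravel()
              for i in range(n_events - window_size)])
y = labels[window_size:]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)  # a 2-D y is handled as a multi-output problem

# Features for the next, unseen event: the last `window_size` events.
x_next = labels[-window_size:].ravel().reshape(1, -1)

# predict_proba returns one array per place; pick out P(place occurs).
proba = np.zeros(n_places)
for j, (p, classes) in enumerate(zip(clf.predict_proba(x_next), clf.classes_)):
    if 1 in classes:  # a place never seen in y has no "1" class
        proba[j] = p[0, list(classes).index(1)]

# Keep the 5 most likely places (places are numbered from 1).
top5 = np.sort(np.argsort(proba)[-5:] + 1)
print(top5)
```

Calling `clf.predict(x_next)` instead returns a 0/1 row directly, but it may mark more or fewer than 5 places.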

Question 7

Got it! It took some time to understand, but I think I got it. Thanks a lot for your explanation.

elyase · Accepted Answer · 2015-01-12 19:59:29Z

