I'm just getting started with machine learning and so far have only tried scikit-learn, but I couldn't find the right algorithm or an example similar to my problem.
I have a time series showing where an event happened. Each location is identified by an integer between 1 and 25 (inclusive). On a given date, an event cannot happen at the same place twice, and events always happen in exactly 5 places.
My data looks like this:
2015-01-01,1,3,5,8,9,10
2015-01-03,23,16,3,5,9
2015-01-05,22,16,6,13,11
The first column is the date and the others are the places. Dates aren't included if nothing happened.
Do you have any recommendations on which algorithm I should look at to predict the numbers (places) for the next date in the time series?
An algorithm that is available in a Python library like scikit-learn would be perfect!
Maybe you could use the pandas library for this CSV. – tumbleweed, Jan 13, 2015 at 0:16
1 Answer
One idea would be to treat it as a multi-label classification problem. You can imagine your target y as having 25 columns (actually 24, but forget about that for now), where each column is 1 or 0, representing whether or not the event happened at that place.
As predictors for your X you can choose some lagged average or the last, let's say, 3 observations. See this question for more details.
Some code:
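from io import StringIO
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
# Toy data: a date followed by the places where events happened on that date
s = """
2015-01-01,1,2,3
2015-01-03,1,2,4
2015-01-05,1,2,4
2015-01-07,1,4,3
"""
# The first column (the date) becomes the index; the remaining columns are places
df = pd.read_csv(StringIO(s), index_col=0, parse_dates=True, header=None)
# Turn each row of places into a binary indicator row, one column per place
mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(df.values)
labels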
[[1 1 1 0]
[1 1 0 1]
[1 1 0 1]
[1 0 1 1]]
We have 4 classes and 4 examples so we get a 4x4 matrix. Columns represent classes/locations and rows are events.
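On the real data, a small sample might not contain every place, in which case the binarizer would produce fewer than 25 columns. As a minimal sketch, assuming the places are exactly the integers 1 to 25 as the question describes, the full set can be passed through the `classes` parameter of `MultiLabelBinarizer`. The names `all_places`, `events` and `mlb_fixed` are illustrative, and the two rows are copied from the question's sample data:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: places are the integers 1..25, as described in the question.
all_places = list(range(1, 26))

# Two rows taken from the question's sample data.
events = [
    [23, 16, 3, 5, 9],
    [22, 16, 6, 13, 11],
]

# Fixing the classes keeps the column layout stable regardless of the sample.
mlb_fixed = MultiLabelBinarizer(classes=all_places)
indicator = mlb_fixed.fit_transform(events)

print(indicator.shape)        # (2, 25)
print(indicator.sum(axis=1))  # number of places per date
```

With the classes fixed, column `k - 1` always corresponds to place `k`, so windows built from different parts of the history line up.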
Now we will use the first 3 observations to predict the fourth one:
X = labels[:-1]
[[1 1 1 0]
[1 1 0 1]
[1 1 0 1]]
We get 4 classes and 3 observations. We need to flatten this into a vector, because together these 3 observations make up only a single sample:
>>> X.flatten()
[1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
Each column here is a feature/predictor that can be interpreted in the following way: a 1 in the first column means that class 1 was present 3 observations ago. A 0 in the 7th column means that class 3 was not present 2 observations ago, and so on.
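To make that mapping explicit, here is a small sketch that converts a position in the flattened vector back into a lag and a class label. `describe_feature` is a hypothetical helper written for this illustration, not part of scikit-learn, and it assumes the window of 3 observations and the 4 classes from the toy example:

```python
import numpy as np

window_size = 3                   # observations per sample in the toy example
classes = np.array([1, 2, 3, 4])  # class labels, in column order

def describe_feature(position):
    """Map a 0-based position in the flattened vector to (lag, class label)."""
    row, col = divmod(position, len(classes))
    lag = window_size - row       # row 0 holds the oldest observation
    return lag, int(classes[col])

print(describe_feature(0))  # (3, 1): first column, class 1, 3 observations ago
print(describe_feature(6))  # (2, 3): 7th column, class 3, 2 observations ago
```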
So now we have one sample/event (one row of the final X matrix) and the corresponding label (one row of the target y):
>>> labels[-1]
[1 0 1 1]
If you follow this procedure you will be able to get a training set that can be fed to a classifier.
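The whole procedure can be sketched roughly as follows. This is only an illustration under some assumptions: `make_windows` is a hypothetical helper, the window is shortened to 2 observations so the tiny toy data yields more than one sample, and the forest settings are arbitrary. It relies on `RandomForestClassifier` accepting a 2-D indicator matrix as `y`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Binarized observations from the toy example (rows = dates, columns = classes).
labels = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
])

def make_windows(labels, window_size):
    """Build (X, y): each X row is `window_size` flattened observations,
    each y row is the observation that immediately follows them."""
    X, y = [], []
    for start in range(len(labels) - window_size):
        X.append(labels[start:start + window_size].flatten())
        y.append(labels[start + window_size])
    return np.array(X), np.array(y)

window_size = 2  # assumption: shorter than in the text, to get two samples
X, y = make_windows(labels, window_size)
print(X.shape, y.shape)  # (2, 8) (2, 4)

# Arbitrary settings; a real model needs far more history than this.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# Predict the next observation from the most recent window.
latest = labels[-window_size:].flatten().reshape(1, -1)
print(clf.predict(latest).shape)  # (1, 4)
```

Note that nothing here forces the prediction to contain exactly 5 places; one possible workaround is to rank places by the per-class probabilities and keep the top 5, but that is a design choice beyond what the answer describes.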
4 Comments
X and y. You should do the same for each point in the history; in this case you will get n_events - window_size rows like the one I have just shown. Then you can do RandomForestClassifier().fit(X, y).