I'm starting with machine learning and so far have only tested scikit-learn but I couldn't find the right algorithm or an example similar to my problem.

I have a time series showing where an event happened. The location of the event is identified by an integer between 1 and 25 (inclusive). On a given date, an event cannot happen at the same place twice, and it always happens in 5 places.

My data looks like this:

2015-01-01,1,3,5,8,9,10
2015-01-03,23,16,3,5,9
2015-01-05,22,16,6,13,11

The first column is the date and the others are the places. Dates aren't included if nothing happened.

Do you have any recommendations on which algorithm I should look at to try to predict the numbers (places) for the next entry in the time series?

An algorithm that is available in a Python library like scikit-learn would be perfect!

asked Jan 12, 2015 at 19:48
  • Maybe you could use the pandas library for this CSV. Commented Jan 13, 2015 at 0:16

1 Answer

One idea would be to treat it as a multi-label classification problem. You can imagine your target y as having 25 columns (actually 24 but forget about it for now), where each column is 1 or 0, representing whether the event happened at that place or not.

As predictors for your X you can choose some lagged average or the last, let's say, 3 observations. See this question for more details.

Some code:

from io import StringIO
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: a date followed by the places where the event happened
s = """
2015-01-01,1,2,3
2015-01-03,1,2,4
2015-01-05,1,2,4
2015-01-07,1,4,3
"""
df = pd.read_csv(StringIO(s), index_col=0, parse_dates=True, header=None)
mlb = MultiLabelBinarizer()
# One row per date, one 0/1 column per place
labels = mlb.fit_transform(df.values)
labels
[[1 1 1 0]
 [1 1 0 1]
 [1 1 0 1]
 [1 0 1 1]]

We have 4 classes and 4 examples so we get a 4x4 matrix. Columns represent classes/locations and rows are events.

Now we will use the first 3 observations to predict the fourth one:

X = labels[:-1] 
[[1 1 1 0]
 [1 1 0 1]
 [1 1 0 1]]

We get 4 classes and 3 observations. We need to flatten this into a vector because these three observations together form a single sample:

>>> X.flatten()
[1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

Each column here is a feature/predictor that can be interpreted in the following way: a 1 in the first column means that class 1 was present three observations ago. A 0 in the 7th column means that class 3 was not present two observations ago, and so on.
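To make that column interpretation concrete, here is a small sketch that maps a position in the flattened vector back to "how many observations ago" and "which class". The helper name describe_feature is made up for illustration, and it assumes the classes are numbered 1 to 4 in column order, as in the toy data above (in general you would look the class up in mlb.classes_).

```python
import numpy as np

# Same toy window as above: 3 observations (rows) x 4 classes (columns)
window = np.array([[1, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 0, 1]])
n_obs, n_classes = window.shape
flat = window.flatten()

def describe_feature(col):
    """Map a 0-based flattened column to (observations ago, 1-based class).

    Hypothetical helper; assumes class numbers equal column position + 1.
    """
    row, cls = divmod(col, n_classes)
    return n_obs - row, cls + 1

print(describe_feature(0), flat[0])  # first column
print(describe_feature(6), flat[6])  # 7th column
```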

So now we have one sample/event (one row of the final X matrix) and the corresponding label (one row of the target y):

>>> labels[-1]
[1 0 1 1]

If you follow this procedure you will be able to get a training set that can be fed to a classifier.
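As a rough sketch of that procedure, the snippet below slides a window over the binarized history to build X and y, then fits a random forest on them. The extra toy events, the window_size value, the make_windows helper and the forest settings are all assumptions for illustration, not part of the original data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy history: each inner list holds the places hit on one date.
# The first four entries match the example above; the rest are made up.
events = [[1, 2, 3], [1, 2, 4], [1, 2, 4], [1, 4, 3],
          [2, 3, 4], [1, 2, 3], [1, 3, 4]]
window_size = 3  # assumed number of past observations used as predictors

mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(events)  # shape (n_events, n_classes)

def make_windows(labels, window_size):
    """Build one flattened window per row of X and the next observation as y."""
    X = np.array([labels[i:i + window_size].flatten()
                  for i in range(len(labels) - window_size)])
    y = labels[window_size:]
    return X, y

X, y = make_windows(labels, window_size)  # n_events - window_size rows each

# A 2-D 0/1 target makes this a multi-output fit
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Predict the next observation from the most recent window
latest = labels[-window_size:].flatten().reshape(1, -1)
pred = clf.predict(latest).astype(int)
print(mlb.inverse_transform(pred))
```

Note that nothing here forces the prediction to contain a fixed number of places, so with the real data you may get more or fewer than 5 and would have to handle that yourself.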

answered Jan 12, 2015 at 19:59

4 Comments

Do you mean something like this? gist.github.com/anonymous/d80d2a571b7b53c518d2 What should my predictors look like?
Thanks for the code, but it looks like enc is not defined. Besides that, I understand your explanation. I'm just not sure how I should call RandomForestClassifier.fit(training, labels). Should I call it for each place's history? Example: [1,1,1,1] for place 1, [1,1,1,0] for place 2, etc.
That line was a mistake, see my edit. What I explained is how to obtain one row of X and y. You should do the same for each position in the history, and in this case you will get n_events - window_size rows like the one I have just shown. Then you can call RandomForestClassifier.fit(X, y).
Got it! It took some time to understand, but I think I've got it. Thanks a lot for your explanation.
