
I want to find a more efficient solution to the following problem:

My dataset is about online games. Each player may have multiple plays (rows in the df), and each play has a corresponding timestamp. However, some events are missing their id (session identifier). I need an efficient method to fill the NaNs in the id column using information from the other two columns.

This is the dataset:

import numpy as np
import pandas as pd

d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3', '4'],
 'date': ['2018-01-01 00:17:01', '2018-01-01 00:17:05','2018-01-01 00:19:05', '2018-01-01 00:21:07', '2018-01-01 00:22:09', 
 '2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
 '2018-01-01 00:30:35', '2018-02-01 00:31:16', '2018-02-01 00:35:22', '2018-02-01 00:38:16', 
 '2018-02-01 00:38:20', '2018-02-01 00:55:15', '2018-01-03 00:55:22', '2018-01-03 00:58:16', 
 '2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:20:16', '2018-03-01 01:31:16', '2018-03-01 02:44:21'], 
 'id': [np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'e', 'e', np.nan, 'f', 'f', 
 'g', np.nan, 'f', 'g', 'h']}
#create dataframe
df = pd.DataFrame(data=d)
df
   player                date   id
0       1 2018-01-01 00:17:01  NaN
1       1 2018-01-01 00:17:05  NaN
2       1 2018-01-01 00:19:05    a
3       1 2018-01-01 00:21:07    a
4       1 2018-01-01 00:22:09    b
5       1 2018-01-01 00:22:17  NaN
6       1 2018-01-01 00:25:09    b
7       1 2018-01-01 00:25:11    c
8       1 2018-01-01 00:27:28    c
9       1 2018-01-01 00:29:29    c
10      1 2018-01-01 00:30:35    c
11      2 2018-02-01 00:31:16    d
12      2 2018-02-01 00:35:22    d
13      2 2018-02-01 00:38:16    e
14      2 2018-02-01 00:38:20    e
15      2 2018-02-01 00:55:15  NaN
16      3 2018-01-03 00:55:22    f
17      3 2018-01-03 00:58:16    f
18      3 2018-01-03 00:58:21    g
19      3 2018-03-01 01:00:35  NaN
20      3 2018-03-01 01:20:16    f
21      3 2018-03-01 01:31:16    g
22      4 2018-03-01 02:44:21    h

This was my line of reasoning:

1. Groupby player and id

computing the first and last date for each session.

my_agg = df.groupby(['player', 'id']).date.agg([min, max])
my_agg
                          min                 max
player id
1      a  2018-01-01 00:19:05 2018-01-01 00:21:07
       b  2018-01-01 00:22:09 2018-01-01 00:25:09
       c  2018-01-01 00:25:11 2018-01-01 00:30:35
2      d  2018-02-01 00:31:16 2018-02-01 00:35:22
       e  2018-02-01 00:38:16 2018-02-01 00:38:20
3      f  2018-01-03 00:55:22 2018-03-01 01:20:16
       g  2018-01-03 00:58:21 2018-03-01 01:31:16
4      h  2018-03-01 02:44:21 2018-03-01 02:44:21

2. Create a function

that compares the timestamp of the play against every available session id and checks whether it falls within the session's range. If it falls within the range of exactly one session id, I want to fill the NaN with that id label. If it is not within range of any session, I want to fill the NaN with 0. If it falls within the range of multiple sessions (which in theory should not happen), it fills the missing value with -99.

#define a function to sort the missing values
def check_function(time):
    #compare every date event with the range of the sessions.
    current_sessions = group.loc[(group['min']<time) & (group['max']>time)]
    #store length, that is the number of matches.
    count = len(current_sessions)
    #How many matches are there for any missing id value?
    # if 0 the event lies outside all the possible ranges
    if count == 0:
        return 0
    #if >1, it is impossible to say to which session the event belongs
    if count > 1:
        return -99
    #in this case the event belongs clearly to just one session
    return current_sessions.index[0][1]

3. Apply the function player by player

And store the results to create a new dataframe, with the desired output.

grouped = my_agg.groupby(level=0)
final = pd.DataFrame()
for name, group in grouped:
    mv_per_player = df.loc[df['player'] == name]
    mv_per_player.loc[mv_per_player.id.isnull(),'id'] = mv_per_player.loc[mv_per_player.id.isnull(),'date'].apply(check_function)
    final = final.append(mv_per_player)
final
   player                date   id
0       1 2018-01-01 00:17:01    0
1       1 2018-01-01 00:17:05    0
2       1 2018-01-01 00:19:05    a
3       1 2018-01-01 00:21:07    a
4       1 2018-01-01 00:22:09    b
5       1 2018-01-01 00:22:17    b
6       1 2018-01-01 00:25:09    b
7       1 2018-01-01 00:25:11    c
8       1 2018-01-01 00:27:28    c
9       1 2018-01-01 00:29:29    c
10      1 2018-01-01 00:30:35    c
11      2 2018-02-01 00:31:16    d
12      2 2018-02-01 00:35:22    d
13      2 2018-02-01 00:38:16    e
14      2 2018-02-01 00:38:20    e
15      2 2018-02-01 00:55:15    0
16      3 2018-01-03 00:55:22    f
17      3 2018-01-03 00:58:16    f
18      3 2018-01-03 00:58:21    g
19      3 2018-03-01 01:00:35  -99
20      3 2018-03-01 01:20:16    f
21      3 2018-03-01 01:31:16    g
22      4 2018-03-01 02:44:21    h

Conclusion

This process works, but is there a way to make it faster? I think the bottleneck is the for loop, but I cannot find an alternative solution.

asked Sep 4, 2018 at 13:41

1 Answer

When you construct df, date should be an actual datetime index, potentially via to_datetime().

There's a feature of dataframes where you can access columns as attributes, which you use in e.g. .date. I discourage this style, and suggest that you instead write ['date'] - it's easier for a static analyser to digest, and makes it clearer which attributes are from Pandas, as distinct from columns you added.
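For instance, attribute access silently loses to same-named DataFrame attributes (the 'size' column below is made up for the demo):

```python
import pandas as pd

# a column named 'size' collides with the built-in DataFrame.size attribute
df = pd.DataFrame({'date': ['2018-01-01'], 'size': [3]})

column = df['size']   # the actual column: a Series containing 3
attribute = df.size   # DataFrame.size: total element count (1 row x 2 cols = 2)
```

Bracket access always means the column; attribute access means "whatever pandas resolves first".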

When you call agg, don't pass the built-in Python functions min, max. Write the strings 'min', 'max' instead for Pandas to use its own implementations.
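Roughly, the two spellings look like this; the string form lets pandas dispatch to its own optimized reductions rather than routing through the Python built-ins:

```python
import pandas as pd

s = pd.Series([3, 1, 2])

# preferred: string names dispatch to pandas' own implementations
via_strings = s.agg(['min', 'max'])
# same result, but routed through the Python built-ins
via_builtins = s.agg([min, max])
```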

Cache mv_per_player.id.isnull() since you need it twice.

final = final.append(mv_per_player) is no longer supported, and even if it were it isn't a good idea. For better performance populate a list, and then call concat outside of the loop.
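The pattern looks like this (toy frames for illustration):

```python
import pandas as pd

pieces = []
for start in (0, 3):
    # collect each per-group frame; concatenating once at the end avoids
    # recopying all accumulated rows on every iteration
    pieces.append(pd.DataFrame({'x': range(start, start + 3)}))

combined = pd.concat(pieces, ignore_index=True)
```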

There's probably more to improve here, but it's convoluted enough that I stopped at the easy wins. I also show a way to express regression tests at the end.

from pathlib import Path
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'player': [
        '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
        '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3', '4',
    ],
    'date': pd.to_datetime([
        '2018-01-01 00:17:01', '2018-01-01 00:17:05', '2018-01-01 00:19:05', '2018-01-01 00:21:07',
        '2018-01-01 00:22:09', '2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11',
        '2018-01-01 00:27:28', '2018-01-01 00:29:29', '2018-01-01 00:30:35', '2018-02-01 00:31:16',
        '2018-02-01 00:35:22', '2018-02-01 00:38:16', '2018-02-01 00:38:20', '2018-02-01 00:55:15',
        '2018-01-03 00:55:22', '2018-01-03 00:58:16', '2018-01-03 00:58:21', '2018-03-01 01:00:35',
        '2018-03-01 01:20:16', '2018-03-01 01:31:16', '2018-03-01 02:44:21',
    ]),
    'id': [
        np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c',
        'd', 'd', 'e', 'e', np.nan, 'f', 'f', 'g', np.nan, 'f', 'g', 'h',
    ],
})

date_extrema = df.groupby(['player', 'id'])['date'].agg(['min', 'max'])


def check_function(time: pd.Timestamp) -> object:
    """sort the missing values"""
    # compare every date event with the range of the sessions.
    current_sessions = player_times.loc[(player_times['min'] < time) & (player_times['max'] > time)]
    # store length, that is the number of matches.
    count = len(current_sessions)
    # How many matches are there for any missing id value?
    # if 0 the event lies outside all the possible ranges
    if count == 0:
        return 0
    # if >1, it is impossible to say to which session the event belongs
    if count > 1:
        return -99
    # in this case the event belongs clearly to just one session
    return current_sessions.index[0][1]


by_player = date_extrema.groupby(level='player')
subframes = []
for name, player_times in by_player:
    mv_per_player = df.loc[df['player'] == name]
    is_null = mv_per_player['id'].isnull()
    mv_per_player.loc[is_null, 'id'] = mv_per_player.loc[is_null, 'date'].apply(check_function)
    subframes.append(mv_per_player)
final = pd.concat(subframes)
final['id'] = final['id'].astype(str)

path = Path('reference.csv')
if path.exists():
    reference = pd.read_csv(path, dtype={'player': str}, parse_dates=['date'])
    pd.testing.assert_frame_equal(reference, final)
else:
    final.to_csv(path, index=False)
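As a possible further step, the per-player loop and the row-wise apply could likely be replaced by a single merge: pair each unlabelled row with that player's session ranges, keep the in-range candidates, and count matches per row. A rough sketch on a toy frame (the id_session column name comes from the merge suffix chosen here; this is an illustration, not a benchmarked replacement):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'player': ['1', '1', '1', '1'],
    'date': pd.to_datetime(['2018-01-01 00:17:01', '2018-01-01 00:19:05',
                            '2018-01-01 00:20:00', '2018-01-01 00:21:07']),
    'id': [np.nan, 'a', np.nan, 'a'],
})

# per-session [min, max] ranges, as in the question's groupby
ranges = (df.dropna(subset=['id'])
            .groupby(['player', 'id'])['date'].agg(['min', 'max'])
            .reset_index())

# pair every unlabelled row with all of that player's sessions,
# then keep only the sessions whose open interval contains the timestamp
unlabelled = df[df['id'].isna()].reset_index()
hits = unlabelled.merge(ranges, on='player', suffixes=('', '_session'))
hits = hits[(hits['date'] > hits['min']) & (hits['date'] < hits['max'])]

# per original row: exactly one match -> that id, several -> -99, none -> 0
per_row = hits.groupby('index')['id_session'].agg(['size', 'first'])
fill = per_row['first'].where(per_row['size'] == 1, -99)
mask = df['id'].isna()
df.loc[mask, 'id'] = df.index[mask].map(fill).fillna(0)
```

This does one merge per call instead of one range scan per missing row, at the cost of materializing the row-by-session candidate frame.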
answered Jan 30 at 23:18
