I want to find a more efficient solution to the following problem:
My dataset is about online games. Each player may have multiple plays (rows in the df), and each play has a corresponding timestamp. However, some plays are missing their id (session identifier). I need an efficient method to fill the NaNs in the id column, using information from the other two columns.
This is the dataset:
import numpy as np
import pandas as pd

d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
                '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3', '4'],
     'date': ['2018-01-01 00:17:01', '2018-01-01 00:17:05', '2018-01-01 00:19:05', '2018-01-01 00:21:07',
              '2018-01-01 00:22:09', '2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11',
              '2018-01-01 00:27:28', '2018-01-01 00:29:29', '2018-01-01 00:30:35', '2018-02-01 00:31:16',
              '2018-02-01 00:35:22', '2018-02-01 00:38:16', '2018-02-01 00:38:20', '2018-02-01 00:55:15',
              '2018-01-03 00:55:22', '2018-01-03 00:58:16', '2018-01-03 00:58:21', '2018-03-01 01:00:35',
              '2018-03-01 01:20:16', '2018-03-01 01:31:16', '2018-03-01 02:44:21'],
     'id': [np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'e', 'e', np.nan,
            'f', 'f', 'g', np.nan, 'f', 'g', 'h']}
#create dataframe
df = pd.DataFrame(data=d)
df
   player                 date   id
0       1  2018-01-01 00:17:01  NaN
1       1  2018-01-01 00:17:05  NaN
2       1  2018-01-01 00:19:05    a
3       1  2018-01-01 00:21:07    a
4       1  2018-01-01 00:22:09    b
5       1  2018-01-01 00:22:17  NaN
6       1  2018-01-01 00:25:09    b
7       1  2018-01-01 00:25:11    c
8       1  2018-01-01 00:27:28    c
9       1  2018-01-01 00:29:29    c
10      1  2018-01-01 00:30:35    c
11      2  2018-02-01 00:31:16    d
12      2  2018-02-01 00:35:22    d
13      2  2018-02-01 00:38:16    e
14      2  2018-02-01 00:38:20    e
15      2  2018-02-01 00:55:15  NaN
16      3  2018-01-03 00:55:22    f
17      3  2018-01-03 00:58:16    f
18      3  2018-01-03 00:58:21    g
19      3  2018-03-01 01:00:35  NaN
20      3  2018-03-01 01:20:16    f
21      3  2018-03-01 01:31:16    g
22      4  2018-03-01 02:44:21    h
This was my line of reasoning:
1. Group by player and id
computing the first and last date for each session.
my_agg = df.groupby(['player', 'id']).date.agg([min, max])
my_agg
                           min                  max
player id
1      a   2018-01-01 00:19:05  2018-01-01 00:21:07
       b   2018-01-01 00:22:09  2018-01-01 00:25:09
       c   2018-01-01 00:25:11  2018-01-01 00:30:35
2      d   2018-02-01 00:31:16  2018-02-01 00:35:22
       e   2018-02-01 00:38:16  2018-02-01 00:38:20
3      f   2018-01-03 00:55:22  2018-03-01 01:20:16
       g   2018-01-03 00:58:21  2018-03-01 01:31:16
4      h   2018-03-01 02:44:21  2018-03-01 02:44:21
2. Create a function
that compares the timestamp of each play to every session of that player, and checks whether it falls within range. If the timestamp falls within the range of exactly one session id, I fill the NaN with that id label. If it is not within range of any session, I fill the NaN with 0. If it is within range of multiple sessions (which in theory should not happen), I fill the missing value with -99.
#define a function to sort the missing values
def check_function(time):
    #compare every date event with the range of the sessions.
    current_sessions = group.loc[(group['min'] < time) & (group['max'] > time)]
    #store length, that is the number of matches.
    count = len(current_sessions)
    #How many matches are there for any missing id value?
    #if 0 the event lies outside all the possible ranges
    if count == 0:
        return 0
    #if >1, it is impossible to say to which session the event belongs
    if count > 1:
        return -99
    #in this case the event belongs clearly to just one session
    return current_sessions.index[0][1]
3. Apply the function player by player
and store the results to create a new dataframe with the desired output.
grouped = my_agg.groupby(level=0)
final = pd.DataFrame()
for name, group in grouped:
    mv_per_player = df.loc[df['player'] == name]
    mv_per_player.loc[mv_per_player.id.isnull(), 'id'] = mv_per_player.loc[mv_per_player.id.isnull(), 'date'].apply(check_function)
    final = final.append(mv_per_player)
final
   player                 date   id
0       1  2018-01-01 00:17:01    0
1       1  2018-01-01 00:17:05    0
2       1  2018-01-01 00:19:05    a
3       1  2018-01-01 00:21:07    a
4       1  2018-01-01 00:22:09    b
5       1  2018-01-01 00:22:17    b
6       1  2018-01-01 00:25:09    b
7       1  2018-01-01 00:25:11    c
8       1  2018-01-01 00:27:28    c
9       1  2018-01-01 00:29:29    c
10      1  2018-01-01 00:30:35    c
11      2  2018-02-01 00:31:16    d
12      2  2018-02-01 00:35:22    d
13      2  2018-02-01 00:38:16    e
14      2  2018-02-01 00:38:20    e
15      2  2018-02-01 00:55:15    0
16      3  2018-01-03 00:55:22    f
17      3  2018-01-03 00:58:16    f
18      3  2018-01-03 00:58:21    g
19      3  2018-03-01 01:00:35  -99
20      3  2018-03-01 01:20:16    f
21      3  2018-03-01 01:31:16    g
22      4  2018-03-01 02:44:21    h
Conclusion
This process works, but is there a way to make it more efficient and faster? I think the bottleneck is the for loop, but I cannot find an alternative solution.
1 Answer
When you construct df, the date column should hold actual datetime values rather than strings, e.g. via to_datetime().
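For example, a single conversion of the existing column is enough, using the df from the question:

df['date'] = pd.to_datetime(df['date'])

After that, the range comparisons in check_function work on real timestamps instead of strings.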
There's a feature of dataframes where you can access columns as attributes, which you use in e.g. .date. I discourage this style, and suggest that you instead write ['date'] - it's easier for a static analyser to digest, and makes it clearer which attributes are from Pandas, as distinct from columns you added.
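For instance, both of the following read the same column, but only the second is unambiguous:

df.date     # attribute access; fails for names that collide with DataFrame methods
df['date']  # explicit indexing; always refers to the column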
When you call agg, don't pass the built-in Python functions min, max. Write the strings 'min', 'max' instead for Pandas to use its own implementations.
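Concretely, with the groupby from the question, the change is only in how the aggregations are named:

df.groupby(['player', 'id'])['date'].agg([min, max])      # built-ins; newer Pandas versions warn about this
df.groupby(['player', 'id'])['date'].agg(['min', 'max'])  # Pandas' own optimised implementations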
Cache mv_per_player.id.isnull() since you need it twice.
final = final.append(mv_per_player) is no longer supported, and even if it were it isn't a good idea. For better performance populate a list, and then call concat outside of the loop.
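In isolation the pattern looks like this, where frames stands in for whatever per-group pieces the loop produces:

subframes = []
for frame in frames:          # appending to a plain list is cheap
    subframes.append(frame)
final = pd.concat(subframes)  # one concatenation, one allocation

Repeated DataFrame.append copies everything accumulated so far on every iteration, giving quadratic work overall; the list-then-concat version copies each row once.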
There's probably more to improve here, but it's convoluted enough that I stopped at the easy wins. I also show a way to express regression tests at the end.
from pathlib import Path
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'player': [
        '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
        '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3', '4',
    ],
    'date': pd.to_datetime([
        '2018-01-01 00:17:01', '2018-01-01 00:17:05', '2018-01-01 00:19:05', '2018-01-01 00:21:07',
        '2018-01-01 00:22:09', '2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11',
        '2018-01-01 00:27:28', '2018-01-01 00:29:29', '2018-01-01 00:30:35', '2018-02-01 00:31:16',
        '2018-02-01 00:35:22', '2018-02-01 00:38:16', '2018-02-01 00:38:20', '2018-02-01 00:55:15',
        '2018-01-03 00:55:22', '2018-01-03 00:58:16', '2018-01-03 00:58:21', '2018-03-01 01:00:35',
        '2018-03-01 01:20:16', '2018-03-01 01:31:16', '2018-03-01 02:44:21',
    ]),
    'id': [
        np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c',
        'd', 'd', 'e', 'e', np.nan, 'f', 'f', 'g', np.nan, 'f', 'g', 'h',
    ],
})
date_extrema = df.groupby(['player', 'id'])['date'].agg(['min', 'max'])
def check_function(time: pd.Timestamp) -> object:
    """sort the missing values"""
    # compare every date event with the range of the sessions.
    current_sessions = player_times.loc[(player_times['min'] < time) & (player_times['max'] > time)]
    # store length, that is the number of matches.
    count = len(current_sessions)
    # How many matches are there for any missing id value?
    # if 0 the event lies outside all the possible ranges
    if count == 0:
        return 0
    # if >1, it is impossible to say to which session the event belongs
    if count > 1:
        return -99
    # in this case the event belongs clearly to just one session
    return current_sessions.index[0][1]
by_player = date_extrema.groupby(level='player')
subframes = []
for name, player_times in by_player:
    mv_per_player = df.loc[df['player'] == name]
    is_null = mv_per_player['id'].isnull()
    mv_per_player.loc[is_null, 'id'] = mv_per_player.loc[is_null, 'date'].apply(check_function)
    subframes.append(mv_per_player)
final = pd.concat(subframes)
final['id'] = final['id'].astype(str)
path = Path('reference.csv')
if path.exists():
    reference = pd.read_csv(path, dtype={'player': str}, parse_dates=['date'])
    pd.testing.assert_frame_equal(reference, final)
else:
    final.to_csv(path, index=False)
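As a possible direction beyond the easy wins, the per-player loop could be replaced by a single merge between the missing rows and date_extrema. This is only a sketch under the same conventions as above (strict < comparisons, 0 for no match, -99 for ambiguity), not a tested drop-in replacement:

missing = df.loc[df['id'].isnull(), ['player', 'date']].reset_index()
sessions = date_extrema.reset_index()  # columns: player, id, min, max
candidates = missing.merge(sessions, on='player')  # each missing row against all of that player's sessions
in_range = candidates[(candidates['min'] < candidates['date']) & (candidates['date'] < candidates['max'])]
counts = in_range.groupby('index').size()  # matching sessions per original row

filled = df.copy()
filled.loc[df['id'].isnull(), 'id'] = 0            # default: outside every session
filled.loc[counts[counts > 1].index, 'id'] = -99   # ambiguous: inside several sessions
single = in_range.set_index('index').loc[counts[counts == 1].index, 'id']
filled.loc[single.index, 'id'] = single            # unique match: take that session's id

This trades the Python-level loop for one vectorised merge, at the cost of materialising the per-player cross product - negligible here, but worth measuring on large data.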