Fill a pandas dataframe (of predetermined size) with results from a similar dataframe (of size less than or equivalent to the original)

Question 1

I'd like some feedback/suggestions on how to improve the following. Specifically I want to know if what I'm doing is reliable and fast, or if there is a better way to accomplish this.

The problem:

I have a dataset containing counts of sales made at different times throughout the day, across different locations/shops. Let's say there are 4 different shops in this data (A, B, C, D), and 4 different time bins in the day [0,1,2,3]. I query this data and get back a result set, but for any given query there may be no transactions for a certain time bin, or even none at all for a specific shop (maybe there was a rat infestation and it closed for the day).

Nevertheless, the end result must have the same number of rows (4 locations x 4 time bins), and simply contain zeros if there were no transactions there. In other words, I want records for all possible occurrences, even if they were not returned by the query itself.

Example:

import pandas as pd
# Specify the complete list of possible time bins
max_timebins = 3
bin_nums = list(range(max_timebins + 1))
# Specify the complete list of shops
shop_ids = ['A', 'B', 'C', 'D']
# Make a dataframe for all possible results without the counts
# This is just a dataframe with an index and no columns... this feels a little strange to me but it worked...
dat = {'shop': [], 'timebin': []}
for shop in shop_ids:
    dat['shop'] += [shop] * len(bin_nums)
    dat['timebin'] += bin_nums
df_all = pd.DataFrame(dat)
df_all = df_all.set_index(list(dat.keys()))
# Example of a result of a query
dfq = pd.DataFrame(
    {
        'shop': ['A', 'A', 'A', 'A',
                 'B', 'B',
                 'C', 'C', 'C',
                 'D'],
        'time_bins': [0, 1, 2, 3,
                      0, 3,
                      0, 2, 3,
                      2],
        'counts': [100, 220, 300, 440,
                   500, 660,
                   120, 340, 90,
                   400]}).set_index(['shop', 'time_bins'])
result_df = pd.concat([df_all, dfq], axis=1).fillna(0).astype(int)

Answer 1

You can create the index directly:

pd.MultiIndex.from_product(
    (shop_ids, range(max_timebins + 1)), names=("shop", "timebin")
)
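To see what that call produces, here is a minimal sketch that only builds and inspects the index. It reuses the shop IDs and bin count from the question and assumes nothing beyond pandas itself.

```python
import pandas as pd

shop_ids = ["A", "B", "C", "D"]
max_timebins = 3

# Cartesian product of shops and time bins, with named levels
target_index = pd.MultiIndex.from_product(
    (shop_ids, range(max_timebins + 1)), names=("shop", "timebin")
)

# One entry per (shop, timebin) pair: 4 shops x 4 bins = 16 entries
print(len(target_index))
# The first few entries, in shop-major order
print(list(target_index[:5]))
```

This replaces the manual loop that builds the `dat` dictionary, and the resulting index can be reused for every query.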

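Here is how you can achieve the same result (including the full index even if there is no data) in a simpler way, by reindexing the query result against that index. Note that this version fills the missing rows with pd.NA rather than 0: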

import pandas as pd
# Specify the complete list of possible time bins
max_timebins = 3
# Specify the complete list of shops
shop_ids = ["A", "B", "C", "D"]
target_index = pd.MultiIndex.from_product(
    (shop_ids, range(max_timebins + 1)), names=("shop", "timebin")
)
# Example of a result of a query
dfq = pd.DataFrame(
    {
        "shop": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "D"],
        "time_bins": [0, 1, 2, 3, 0, 3, 0, 2, 3, 2],
        "counts": [100, 220, 300, 440, 500, 660, 120, 340, 90, 400],
    }
).set_index(["shop", "time_bins"])
result_df = dfq.reindex(target_index, fill_value=pd.NA)
print(result_df)

Output:

              counts
shop timebin
A    0           100
     1           220
     2           300
     3           440
B    0           500
     1          <NA>
     2          <NA>
     3           660
C    0           120
     1          <NA>
     2           340
     3            90
D    0          <NA>
     1          <NA>
     2           400
     3          <NA>

There might be a better solution, but that would require knowing a bit more context.
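Since the question asks for zeros rather than missing values, one possible variation is to pass `fill_value=0` to `reindex`. The sketch below assumes the query result is indexed by `shop` and `timebin` (the second level is renamed here to match the target index) and that a missing row really does mean zero sales.

```python
import pandas as pd

shop_ids = ["A", "B", "C", "D"]
max_timebins = 3
target_index = pd.MultiIndex.from_product(
    (shop_ids, range(max_timebins + 1)), names=("shop", "timebin")
)

# Example query result, same data as in the question
dfq = pd.DataFrame(
    {
        "shop": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "D"],
        "timebin": [0, 1, 2, 3, 0, 3, 0, 2, 3, 2],
        "counts": [100, 220, 300, 440, 500, 660, 120, 340, 90, 400],
    }
).set_index(["shop", "timebin"])

# Missing (shop, timebin) pairs are filled with 0
result_df = dfq.reindex(target_index, fill_value=0)
print(result_df)
```

With an integer fill value the `counts` column stays an integer column, so the `fillna(0).astype(int)` step from the question should not be needed.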

AMC · Answer 1 · 2020-11-26 21:55:31Z

