\$\begingroup\$

I'd like some feedback/suggestions on how to improve the following. Specifically I want to know if what I'm doing is reliable and fast, or if there is a better way to accomplish this.

The problem:

I have a dataset containing counts of sales made at different times throughout the day, across different locations/shops. Let's say there are 4 different shops in this data (A, B, C, D) and 4 different time bins in the day [0, 1, 2, 3]. When I query it, the result may contain no transactions for a certain time bin, or even none at all for a specific shop (maybe there was a rat infestation and it closed for the day).

Nevertheless, the end result must have the same number of rows (4 locations x 4 time bins), and simply contain zeros if there were no transactions there. In other words, I want records for all possible occurrences, even if they were not returned by the query itself.

Example:

import pandas as pd

# Specify the complete list of possible time bins
max_timebins = 3
bin_nums = list(range(max_timebins + 1))
# Specify the complete list of shops
shop_ids = ['A', 'B', 'C', 'D']
# Make a dataframe for all possible results without the counts
# This is just a dataframe with an index and no columns... this feels a little strange to me but it worked...
dat = {'shop': [], 'timebin': []}
for shop in shop_ids:
    dat['shop'] += [shop] * len(bin_nums)
    dat['timebin'] += bin_nums
df_all = pd.DataFrame(dat)
df_all = df_all.set_index(list(dat.keys()))
# Example of a result of a query
dfq = pd.DataFrame(
    {
        'shop': ['A', 'A', 'A', 'A',
                 'B', 'B',
                 'C', 'C', 'C',
                 'D'],
        'time_bins': [0, 1, 2, 3,
                      0, 3,
                      0, 2, 3,
                      2],
        'counts': [100, 220, 300, 440,
                   500, 660,
                   120, 340, 90,
                   400]}).set_index(['shop', 'time_bins'])
result_df = pd.concat([df_all, dfq], axis=1).fillna(0).astype(int)
Jamal
asked Nov 26, 2020 at 10:12
\$\endgroup\$

1 Answer

\$\begingroup\$

You can create the index directly:

pd.MultiIndex.from_product(
    (shop_ids, range(max_timebins + 1)), names=("shop", "timebin")
)
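
As a quick sanity check, the following sketch builds that index and inspects it, assuming the same `shop_ids` and `max_timebins` as in the question. The names `n_combinations` and `first_pairs` are only illustrative.

```python
import pandas as pd

shop_ids = ["A", "B", "C", "D"]
max_timebins = 3

# Cartesian product of shops and time bins, one entry per (shop, timebin) pair
target_index = pd.MultiIndex.from_product(
    (shop_ids, range(max_timebins + 1)), names=("shop", "timebin")
)

# 4 shops x 4 time bins
n_combinations = len(target_index)
# The product iterates the last level fastest: all bins of "A", then "B", ...
first_pairs = list(target_index[:5])
```

The index holds one entry per shop/time-bin pair regardless of what any query returns, which is what makes it usable as a reindexing target.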

Here is how you can get the full index (even where there is no data) in a simpler way. Note that this version fills the missing combinations with pd.NA rather than 0:

import pandas as pd

# Specify the complete list of possible time bins
max_timebins = 3
# Specify the complete list of shops
shop_ids = ["A", "B", "C", "D"]
target_index = pd.MultiIndex.from_product(
    (shop_ids, range(max_timebins + 1)), names=("shop", "timebin")
)
# Example of a result of a query
dfq = pd.DataFrame(
    {
        "shop": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "D"],
        "time_bins": [0, 1, 2, 3, 0, 3, 0, 2, 3, 2],
        "counts": [100, 220, 300, 440, 500, 660, 120, 340, 90, 400],
    }
).set_index(["shop", "time_bins"])
result_df = dfq.reindex(target_index, fill_value=pd.NA)
print(result_df)

Output:

             counts
shop timebin
A    0          100
     1          220
     2          300
     3          440
B    0          500
     1         <NA>
     2         <NA>
     3          660
C    0          120
     1         <NA>
     2          340
     3           90
D    0         <NA>
     1         <NA>
     2          400
     3         <NA>
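
The question asks for zeros rather than missing values. A minimal variant of the same approach, passing `fill_value=0` to `reindex`, might look like the sketch below. The name `result_zero` is only illustrative, and the query's index level is named `timebin` here purely for consistency with the target index (the original `time_bins` also works, since the reindexing here matches on values).

```python
import pandas as pd

shop_ids = ["A", "B", "C", "D"]
max_timebins = 3

# Every (shop, timebin) combination that must appear in the result
target_index = pd.MultiIndex.from_product(
    (shop_ids, range(max_timebins + 1)), names=("shop", "timebin")
)

# Example query result with some combinations missing
dfq = pd.DataFrame(
    {
        "shop": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "D"],
        "timebin": [0, 1, 2, 3, 0, 3, 0, 2, 3, 2],
        "counts": [100, 220, 300, 440, 500, 660, 120, 340, 90, 400],
    }
).set_index(["shop", "timebin"])

# Missing combinations are filled with 0, so the column stays integer-typed
result_zero = dfq.reindex(target_index, fill_value=0)
```

With an integer fill value there is no need for the `fillna(0).astype(int)` step from the question, because no missing values are introduced in the first place.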

There might be a better solution, but that would require knowing a bit more context.

answered Nov 26, 2020 at 21:55
\$\endgroup\$
