Pandas calculations per columns and per rows for very big datasets

Question 1

I am using a panda's dataframe and I am doing filtering and some calculations per column and per row.

The dataframe looks this:

 100 200 300 400
0 1 1 0 1
1 1 1 1 0

Each header of the dataframe represents a company ID (x) and the rows represent specific users (y). If the user has accessed that firm, then the in the cell x*y the value is equal with 1 or 0 otherwise. What I want to do is see how searched are 2 firms are in comparison with the other firms. If it is necessary I can go into more detail about the general formula.

for base in list_of_companies:
 counter = 0
 for peer in list2_of_companies:
 counter += 1
 if base == peer:
 "do nothing"
 else:
 # Calculate first the denominator since we slice the big matrix
 # In dataframes that only have accessed the base firm
 denominator_df = df_matrix.loc[(df_matrix[base] == 1)]
 denominator = denominator_df.sum(axis=1).values.tolist()
 denominator = sum(denominator)-len(denominator)
 # Calculate the numerator. This is done later because
 # We slice up more the dataframe above by
 # Filtering records which have been accessed by both the base and the peer firm
 numerator_df = denominator_df.loc[(denominator_df[base] == 1) & (denominator_df[peer] == 1)]
 numerator = len(numerator_df.index)
 annual_search_fraction = numerator/denominator
 print("Base: {} and Peer: {} ==> {}".format(base, peer, annual_search_fraction))

In total I have data for 13 years. I have a ryzen 2700x and it only managed to do 2 days in about 12 hours.

Edit 1 (added example input):

The metric is the following:

enter image description here

1) The metric that I am trying to calculate is going to tell me how many times 2 companies are searched together in comparison with all the other searches.

2) The code is first selecting all the users which have accessed the base firm (denominator_df = df_matrix.loc[(df_matrix[base] == 1)])line. Then it calculates the denominator which counts how many unique combinations between the base firm and any other searched firm by the user are there and since I can count the number of firms accessed (by the user), I can subtract 1 to get the number of unique links between the base firm and the other firms.

3) Next, the code filters the previous denominator_df to select only the rows which accessed the base and the peer firm. Since I need to count the number of users which accessed the base and the peer firm, I use the command: numerator = len(numerator_df.index) to count the number of rows and that will give me the numerator.

The expected output from the dataframe at the top is the following:

Base: 100 and Peer: 200 ==> 0.5
Base: 100 and Peer: 300 ==> 0.25
Base: 100 and Peer: 400 ==> 0.25
Base: 200 and Peer: 100 ==> 0.5
Base: 200 and Peer: 300 ==> 0.25
Base: 200 and Peer: 400 ==> 0.25
Base: 300 and Peer: 100 ==> 0.5
Base: 300 and Peer: 200 ==> 0.5
Base: 300 and Peer: 400 ==> 0.0
Base: 400 and Peer: 100 ==> 0.5
Base: 400 and Peer: 200 ==> 0.5
Base: 400 and Peer: 300 ==> 0.0

4) The sanity check to see if the code gives the correct solution: all the metrics between 1 base firm and all the other peer firms have to sum up to 1 (all the metrics between base 100 and peer[200,300,400] summed together give 1.

Question 2

Please add a short example input, i.e. a valid df_matrix, list_of_companies and list2_of_companies, so reviewers can test if their approach also works and performs better than yours.

Question 3

if base == peer: "do nothing" Is that valid Python?

Question 4

@Mast ah, didn't know you can do that. Thanks for the tip

Question 5

Do you always want all combinations or can list_of_companies and list2_of_companies be 1) different from each other 2) be different from df.columns?

Question 6

I always want all the combinations. Thelist_of_companies and list2_of_companies are exactly the same and they are not different from df.columns, meaning that if you were to extract the list of column names and compare it to the list_of_companies there would be no difference.

Question 7

One quick and obvious improvement is possible in these two lines:

denominator = denominator_df.sum(axis=1).values.tolist()
denominator = sum(denominator)-len(denominator)

This is equivalent to:

denominator = denominator_df.sum(axis=1)
denominator = denominator.sum() - len(denominator)

which should be a lot faster since converting to a temporary list will be quite slow, as is summing using the Python built-in instead of using the pandas vectorized method.

Since you only care for the count in the numerator case, just use sum:

numerator = (denominator_df[peer] == 1).sum()

Note that checking for denominator_df[base] == 1 is unnecessary since that was already done in the construction of denominator_df.

But the real speed gains are probably in eliminating the double for loop altogether and writing this using vectorized methods. With some example input that might be possible.

Question 8

I implemented the differences in the code as you suggested and I did see some speed improvements. Regarding the example input, I will update the question now

Question 9

I'm not sure about the issue, but what I can gather so far from your code is, that your Ryzen will not be able to parralellize due to your 2 loops.

Question 10

Hi Philip, I'd recommend expanding a little bit about your point. Otherwise it doesn't help the OP much :)

Graipher Graipher 41.6k7 gold badges70 silver badges134 bronze badges · Accepted Answer · 2019-01-26 11:00:41Z

One quick and obvious improvement is possible in these two lines:

denominator = denominator_df.sum(axis=1).values.tolist()
denominator = sum(denominator)-len(denominator)

This is equivalent to:

denominator = denominator_df.sum(axis=1)
denominator = denominator.sum() - len(denominator)

which should be a lot faster since converting to a temporary list will be quite slow, as is summing using the Python built-in instead of using the pandas vectorized method.

Since you only care for the count in the numerator case, just use sum:

numerator = (denominator_df[peer] == 1).sum()

Note that checking for denominator_df[base] == 1 is unnecessary since that was already done in the construction of denominator_df.

But the real speed gains are probably in eliminating the double for loop altogether and writing this using vectorized methods. With some example input that might be possible.

I implemented the differences in the code as you suggested and I did see some speed improvements. Regarding the example input, I will update the question now

Stack Exchange Network

Pandas calculations per columns and per rows for very big datasets

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Pandas calculations per columns and per rows for very big datasets

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions