I am using a panda's dataframe and I am doing filtering and some calculations per column and per row.
The dataframe looks this:
100 200 300 400
0 1 1 0 1
1 1 1 1 0
Each header of the dataframe represents a company ID (x) and the rows represent specific users (y). If the user has accessed that firm, then the in the cell x*y the value is equal with 1 or 0 otherwise. What I want to do is see how searched are 2 firms are in comparison with the other firms. If it is necessary I can go into more detail about the general formula.
for base in list_of_companies:
counter = 0
for peer in list2_of_companies:
counter += 1
if base == peer:
"do nothing"
else:
# Calculate first the denominator since we slice the big matrix
# In dataframes that only have accessed the base firm
denominator_df = df_matrix.loc[(df_matrix[base] == 1)]
denominator = denominator_df.sum(axis=1).values.tolist()
denominator = sum(denominator)-len(denominator)
# Calculate the numerator. This is done later because
# We slice up more the dataframe above by
# Filtering records which have been accessed by both the base and the peer firm
numerator_df = denominator_df.loc[(denominator_df[base] == 1) & (denominator_df[peer] == 1)]
numerator = len(numerator_df.index)
annual_search_fraction = numerator/denominator
print("Base: {} and Peer: {} ==> {}".format(base, peer, annual_search_fraction))
In total I have data for 13 years. I have a ryzen 2700x and it only managed to do 2 days in about 12 hours.
Edit 1 (added example input):
The metric is the following:
1) The metric that I am trying to calculate is going to tell me how many times 2 companies are searched together in comparison with all the other searches.
2) The code is first selecting all the users which have accessed the base firm (denominator_df = df_matrix.loc[(df_matrix[base] == 1)]
)line. Then it calculates the denominator which counts how many unique combinations between the base firm and any other searched firm by the user are there and since I can count the number of firms accessed (by the user), I can subtract 1 to get the number of unique links between the base firm and the other firms.
3) Next, the code filters the previous denominator_df
to select only the rows which accessed the base and the peer firm. Since I need to count the number of users which accessed the base and the peer firm, I use the command: numerator = len(numerator_df.index)
to count the number of rows and that will give me the numerator.
The expected output from the dataframe at the top is the following:
Base: 100 and Peer: 200 ==> 0.5
Base: 100 and Peer: 300 ==> 0.25
Base: 100 and Peer: 400 ==> 0.25
Base: 200 and Peer: 100 ==> 0.5
Base: 200 and Peer: 300 ==> 0.25
Base: 200 and Peer: 400 ==> 0.25
Base: 300 and Peer: 100 ==> 0.5
Base: 300 and Peer: 200 ==> 0.5
Base: 300 and Peer: 400 ==> 0.0
Base: 400 and Peer: 100 ==> 0.5
Base: 400 and Peer: 200 ==> 0.5
Base: 400 and Peer: 300 ==> 0.0
4) The sanity check to see if the code gives the correct solution: all the metrics between 1 base firm and all the other peer firms have to sum up to 1 (all the metrics between base 100 and peer[200,300,400] summed together give 1.
2 Answers 2
One quick and obvious improvement is possible in these two lines:
denominator = denominator_df.sum(axis=1).values.tolist()
denominator = sum(denominator)-len(denominator)
This is equivalent to:
denominator = denominator_df.sum(axis=1)
denominator = denominator.sum() - len(denominator)
which should be a lot faster since converting to a temporary list will be quite slow, as is summing using the Python built-in instead of using the pandas
vectorized method.
Since you only care for the count in the numerator case, just use sum
:
numerator = (denominator_df[peer] == 1).sum()
Note that checking for denominator_df[base] == 1
is unnecessary since that was already done in the construction of denominator_df
.
But the real speed gains are probably in eliminating the double for
loop altogether and writing this using vectorized methods. With some example input that might be possible.
-
\$\begingroup\$ I implemented the differences in the code as you suggested and I did see some speed improvements. Regarding the example input, I will update the question now \$\endgroup\$Adrian– Adrian2019年01月28日 09:57:22 +00:00Commented Jan 28, 2019 at 9:57
I'm not sure about the issue, but what I can gather so far from your code is, that your Ryzen will not be able to parralellize due to your 2 loops.
-
1\$\begingroup\$ Hi Philip, I'd recommend expanding a little bit about your point. Otherwise it doesn't help the OP much :) \$\endgroup\$IEatBagels– IEatBagels2019年01月25日 14:36:54 +00:00Commented Jan 25, 2019 at 14:36
df_matrix
,list_of_companies
andlist2_of_companies
, so reviewers can test if their approach also works and performs better than yours. \$\endgroup\$if base == peer: "do nothing"
Is that valid Python? \$\endgroup\$list_of_companies
andlist2_of_companies
be 1) different from each other 2) be different fromdf.columns
? \$\endgroup\$list_of_companies
andlist2_of_companies
are exactly the same and they are not different fromdf.columns
, meaning that if you were to extract the list of column names and compare it to thelist_of_companies
there would be no difference. \$\endgroup\$