I asked this question here: https://stackoverflow.com/q/55640147/5202255 and was told to post on this forum. I would like to know whether my solution can be improved or if there is another approach to the problem. Any help is really appreciated!
I have a pandas dataframe in which the column values exist as lists. Each list has several elements and one element can exist in several rows. An example dataframe is:
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
X =
   A          B
0  1  [a, b, c]
1  2     [a, b]
2  3     [c, d]
I want to find all the rows, i.e. the dataframe indexes, corresponding to each element in the lists, and create a dictionary out of it. Disregard column A here, as column B is the one of interest! So element 'a' occurs at indexes 0 and 1, which gives {'a': [0, 1]}. The solution for this example dataframe is:
Y = {'a':[0,1],'b':[0,1],'c':[0,2],'d':[2]}
I have written code that works fine and gives the right result. My problem is more to do with the speed of computation: my actual dataframe has about 350,000 rows, and the lists in column 'B' can contain up to 1,000 elements, so at present the code runs for several hours. I was wondering whether my solution is very inefficient. Any help with a faster, more efficient approach would be really appreciated! Here is my solution code:
import itertools
import pandas as pd

X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
B_dict = []
for idx,val in X.iterrows():
    B = val['B']
    B_dict.append(dict(zip(B,[[idx]]*len(B))))
    B_dict = [{k: list(itertools.chain.from_iterable(list(filter(None.__ne__, [d.get(k) for d in B_dict])))) for k in set().union(*B_dict)}]
print ('Result:',B_dict[0])
Output
Result: {'d': [2], 'c': [0, 2], 'b': [0, 1], 'a': [0, 1]}
The code for the final line in the for loop was borrowed from https://stackoverflow.com/questions/45649141/combine-values-of-same-keys-in-a-list-of-dicts and https://stackoverflow.com/questions/16096754/remove-none-value-from-a-list-without-removing-the-0-value.
2 Answers
itertuples

When traversing the rows of a DataFrame, itertuples is generally faster than iterrows. The latter constructs a new Series for each row, while the former returns a lightweight namedtuple.
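A minimal illustration of the two access patterns, using the example frame X from the question:

# iterrows yields (index, Series) pairs; building a Series per row is the overhead
for idx, row in X.iterrows():
    print(idx, row['B'])

# itertuples yields lightweight namedtuples: row.Index, row.A, row.B
for row in X.itertuples():
    print(row.Index, row.B)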
defaultdict

Appending items to a list under a key is typically done with a defaultdict.
from collections import defaultdict

result = defaultdict(list)
for row in X.itertuples():
    idx = row.Index
    for item in row.B:
        result[item].append(idx)
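Applied to the example dataframe, this produces the expected mapping (converted to a plain dict for display):

print(dict(result))
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}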
The method you use to combine the values of the dictionaries deliberately includes missing items as None: you add them and then filter them out again. Using a defaultdict simplifies all of this.
naming

Try to follow PEP 8 (a small sketch follows this list):
- snake_case for variables, etc.
- spaces around operators
- ...
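For instance, the first line of the question's code might look like this (frame is an illustrative name, not from the original):

import pandas as pd

# lowercase snake_case name instead of X; spaces after commas
frame = pd.DataFrame(
    [(1, ['a', 'b', 'c']), (2, ['a', 'b']), (3, ['c', 'd'])],
    columns=['A', 'B'],
)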
timings
dummy data
import random
import string

import numpy as np
import pandas as pd

def make_dummydata(rows, max_length, seed=0):
    letters = string.ascii_letters
    np.random.seed(seed)
    random.seed(seed)
    col1 = np.random.randint(0, 10, size=rows)
    items_per_row = np.random.randint(0, max_length, size=rows) + 1
    col2 = [random.choices(letters, k=amount) for amount in items_per_row]
    return pd.DataFrame({"A": col1, "B": col2})
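A quick sanity check on the shape (the exact letters depend on the seed):

df = make_dummydata(rows=100, max_length=10)
print(df.shape)             # (100, 2)
print(df.columns.tolist())  # ['A', 'B']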
benchmark method
import timeit

def benchmark(cases, functions):
    for rows, max_length in cases:
        df = make_dummydata(rows, max_length)
        for name, function in functions.items():
            result = timeit.timeit(
                stmt="function(df)",
                globals={"df": df, "function": function},
                number=1,
            )
            yield rows, max_length, name, result
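The find_* functions used below are not shown in full; a sketch of how two of them might wrap the approaches in this post (find_op would wrap the question's code unchanged; these bodies are my assumption, not verbatim from the benchmark):

import pandas as pd
from collections import defaultdict

def find_maarten(df):
    # the itertuples/defaultdict approach from above
    result = defaultdict(list)
    for row in df.itertuples():
        for item in row.B:
            result[item].append(row.Index)
    return dict(result)

def find_jezrael(df):
    # the stack/groupby approach from the other answer
    expanded = (pd.DataFrame(df.B.tolist(), index=df.index)
                .stack()
                .reset_index(level=1, drop=True)
                .reset_index(name="B"))
    return expanded.groupby("B")["index"].apply(list).to_dict()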
results
cases = [(10, 2), (100, 10), (1000, 40), (10000, 200)]
functions = {
    "OP": find_op,
    "maarten": find_maarten,
    "jezrael": find_jezrael,
}
list(benchmark(cases, functions))
[(10, 2, 'OP', 0.001344002000003286),
 (10, 2, 'maarten', 0.0003913850000003549),
 (10, 2, 'jezrael', 0.005293956000002709),
 (100, 10, 'OP', 0.027166392000005146),
 (100, 10, 'maarten', 0.0004795910000012782),
 (100, 10, 'jezrael', 0.013824836999994261),
 (1000, 40, 'OP', 0.3434149869999956),
 (1000, 40, 'maarten', 0.0032574399999987236),
 (1000, 40, 'jezrael', 0.018533767000000978),
 (10_000, 200, 'OP', 33.48681208600001),
 (10_000, 200, 'maarten', 0.10972772499999905),
 (10_000, 200, 'jezrael', 0.7631061700000004),
 (350_000, 1000, 'maarten', 22.097186581000003),
 (350_000, 1000, 'jezrael', 516.128048978)]
(The OP's method was not run for the 350,000-row case.)
The method using defaultdict is a lot faster. This only measures run time, not memory usage; there my method has the additional advantage of not creating a large intermediary DataFrame.
Comment (network_coder, Apr 13, 2019): Thanks a lot Maarten for this insightful response! Plenty to learn from it. The defaultdict approach is indeed the best solution here. I have accepted your solution and also the solution to my problem posted here: stackoverflow.com/questions/55640147/…, which are the same.
First, I would use this solution by @jezrael to expand your lists into rows of the dataframe, repeating the values of the index where necessary:
df2 = pd.DataFrame(df.B.tolist(), index=df.index) \
        .stack() \
        .reset_index(level=1, drop=True) \
        .reset_index(name="B")
#    index  B
# 0      0  a
# 1      0  b
# 2      0  c
# 3      1  a
# 4      1  b
# 5      2  c
# 6      2  d
Then you can simply group by B and collect all values of index:
df2.groupby("B")["index"].apply(list).to_dict()
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}
This should be faster for large dataframes (but you should profile it to make sure it is). However, it creates a largish intermediate dataframe (basically duplicating your current one), so it might not be usable for very large dataframes.
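As an aside not in the original answer: newer pandas versions (0.25+) offer Series.explode, which performs the same expansion more directly; a minimal sketch, assuming the same df:

# assumes pandas >= 0.25, where Series.explode is available
df2 = df["B"].explode().reset_index(name="B")
df2.groupby("B")["index"].apply(list).to_dict()
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}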
Comment (network_coder, Apr 13, 2019): Thanks for the response and help Graipher! As pointed out in the other answer, this solution becomes slower as the data size grows. The defaultdict approach is much faster, so I have accepted that as the solution. I used the solution given to my problem here: stackoverflow.com/questions/55640147/…. Thanks!