I asked this question here: https://stackoverflow.com/q/55640147/5202255 and was told to post on this forum. I would like to know whether my solution can be improved or if there is another approach to the problem. Any help is really appreciated!
I have a pandas dataframe in which the column values exist as lists. Each list has several elements and one element can exist in several rows. An example dataframe is:
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
X =
   A          B
0  1  [a, b, c]
1  2     [a, b]
2  3     [c, d]
I want to find all the rows, i.e. the dataframe indexes, corresponding to each element in the lists, and create a dictionary out of it. Disregard column A here, as column B is the one of interest! So element 'a' occurs at indexes 0 and 1, which gives {'a': [0, 1]}. The solution for this example dataframe is:
Y = {'a':[0,1],'b':[0,1],'c':[0,2],'d':[2]}
I have written code that works fine and gives the right result. My problem is more to do with the speed of computation: my actual dataframe has about 350,000 rows, and the lists in column 'B' can contain up to 1,000 elements, so at present the code runs for several hours. I was wondering whether my solution is very inefficient. Any help with a faster, more efficient approach would be really appreciated! Here is my solution code:
import itertools
import pandas as pd

X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
B_dict = []
for idx,val in X.iterrows():
    B = val['B']
    B_dict.append(dict(zip(B,[[idx]]*len(B))))
    B_dict = [{k: list(itertools.chain.from_iterable(list(filter(None.__ne__, [d.get(k) for d in B_dict])))) for k in set().union(*B_dict)}]
print ('Result:',B_dict[0])
Output
Result: {'d': [2], 'c': [0, 2], 'b': [0, 1], 'a': [0, 1]}
The code for the final line in the for loop was borrowed from https://stackoverflow.com/questions/45649141/combine-values-of-same-keys-in-a-list-of-dicts and https://stackoverflow.com/questions/16096754/remove-none-value-from-a-list-without-removing-the-0-value.
2 Answers
itertuples

When traversing the rows of a DataFrame, itertuples is generally faster than iterrows. The latter constructs a new Series for each row, while the former returns a lightweight namedtuple.
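A minimal illustration of the two access patterns, using the example frame X from the question:

# iterrows yields (index, Series) pairs; building a Series per row is the overhead
for idx, row in X.iterrows():
    print(idx, row['B'])

# itertuples yields lightweight namedtuples: row.Index, row.A, row.B
for row in X.itertuples():
    print(row.Index, row.B)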
defaultdict

Appending items to a list under a key is typically done with a defaultdict.
from collections import defaultdict

result = defaultdict(list)
for row in X.itertuples():
    idx = row.Index
    for item in row.B:
        result[item].append(idx)
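Applied to the example dataframe, this produces the expected mapping (converted to a plain dict for display):

print(dict(result))
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}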
The method you use to combine the values of the dictionaries deliberately includes missing items as None: you add them and then filter them out again. Using a defaultdict simplifies all of this.
naming

Try to follow PEP 8 (a small sketch follows this list):
- snake_case for variables, etc.
- spaces around operators
- ...
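For instance, the first line of the question's code might look like this (frame is an illustrative name, not from the original):

import pandas as pd

# lowercase snake_case name instead of X; spaces after commas
frame = pd.DataFrame(
    [(1, ['a', 'b', 'c']), (2, ['a', 'b']), (3, ['c', 'd'])],
    columns=['A', 'B'],
)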
timings
dummy data
import random
import string

import numpy as np
import pandas as pd

def make_dummydata(rows, max_length, seed=0):
    letters = string.ascii_letters
    np.random.seed(seed)
    random.seed(seed)
    col1 = np.random.randint(0, 10, size=rows)
    items_per_row = np.random.randint(0, max_length, size=rows) + 1
    col2 = [random.choices(letters, k=amount) for amount in items_per_row]
    return pd.DataFrame({"A": col1, "B": col2})
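A quick sanity check on the shape (the exact letters depend on the seed):

df = make_dummydata(rows=100, max_length=10)
print(df.shape)             # (100, 2)
print(df.columns.tolist())  # ['A', 'B']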
benchmark method
import timeit

def benchmark(cases, functions):
    for rows, max_length in cases:
        df = make_dummydata(rows, max_length)
        for name, function in functions.items():
            result = timeit.timeit(
                stmt="function(df)",
                globals={"df": df, "function": function},
                number=1,
            )
            yield rows, max_length, name, result
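The find_* functions used below are not shown in full; a sketch of how two of them might wrap the approaches in this post (find_op would wrap the question's code unchanged; these bodies are my assumption, not verbatim from the benchmark):

import pandas as pd
from collections import defaultdict

def find_maarten(df):
    # the itertuples/defaultdict approach from above
    result = defaultdict(list)
    for row in df.itertuples():
        for item in row.B:
            result[item].append(row.Index)
    return dict(result)

def find_jezrael(df):
    # the stack/groupby approach from the other answer
    expanded = (pd.DataFrame(df.B.tolist(), index=df.index)
                .stack()
                .reset_index(level=1, drop=True)
                .reset_index(name="B"))
    return expanded.groupby("B")["index"].apply(list).to_dict()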
results
cases = [(10, 2), (100, 10), (1000, 40), (10000, 200)]
functions = {
    "OP": find_op,
    "maarten": find_maarten,
    "jezrael": find_jezrael,
}
list(benchmark(cases, functions))
[(10, 2, 'OP', 0.001344002000003286),
 (10, 2, 'maarten', 0.0003913850000003549),
 (10, 2, 'jezrael', 0.005293956000002709),
 (100, 10, 'OP', 0.027166392000005146),
 (100, 10, 'maarten', 0.0004795910000012782),
 (100, 10, 'jezrael', 0.013824836999994261),
 (1000, 40, 'OP', 0.3434149869999956),
 (1000, 40, 'maarten', 0.0032574399999987236),
 (1000, 40, 'jezrael', 0.018533767000000978),
 (10_000, 200, 'OP', 33.48681208600001),
 (10_000, 200, 'maarten', 0.10972772499999905),
 (10_000, 200, 'jezrael', 0.7631061700000004),
 (350_000, 1000, 'maarten', 22.097186581000003),
 (350_000, 1000, 'jezrael', 516.128048978)]
(The OP's method was not run for the 350,000-row case.)
The method using defaultdict is a lot faster. This only measures run time, not memory usage; there my method has the additional advantage of not creating a large intermediary DataFrame.
Comment (network_coder, Apr 13, 2019): Thanks a lot Maarten for this insightful response! Plenty to learn from it. The defaultdict approach is indeed the best solution here. I have accepted your solution and also the solution to my problem posted here: stackoverflow.com/questions/55640147/…, which are the same.
First, I would use this solution by @jezrael to expand your lists into rows of the dataframe, repeating the values of the index where necessary:
df2 = pd.DataFrame(df.B.tolist(), index=df.index) \
        .stack() \
        .reset_index(level=1, drop=True) \
        .reset_index(name="B")
#    index  B
# 0      0  a
# 1      0  b
# 2      0  c
# 3      1  a
# 4      1  b
# 5      2  c
# 6      2  d
Then you can simply group by B and collect all values of index:
df2.groupby("B")["index"].apply(list).to_dict()
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}
This should be faster for large dataframes (but you should profile it to make sure it is). However, it creates a largish intermediate dataframe (basically duplicating your current one), so it might not be usable for very large dataframes.
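As an aside not in the original answer: newer pandas versions (0.25+) offer Series.explode, which performs the same expansion more directly; a minimal sketch, assuming the same df:

# assumes pandas >= 0.25, where Series.explode is available
df2 = df["B"].explode().reset_index(name="B")
df2.groupby("B")["index"].apply(list).to_dict()
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}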
Comment (network_coder, Apr 13, 2019): Thanks for the response and help Graipher! As pointed out in the other answer, this solution becomes slower as the data size grows. The defaultdict approach is much faster, so I have accepted that as the solution. I used the solution given to my problem here: stackoverflow.com/questions/55640147/…. Thanks!