I need to classify URLs in a DataFrame and label each row using both "exact match" and "contains" conditions:
import pandas as pd


class PageClassifier:
    def __init__(self, contains_pat, match_pat):
        """
        :param contains_pat: A dict mapping page classes to lists of "contains" patterns
        :type contains_pat: dict
        :param match_pat: A dict mapping page classes to lists of exact-match patterns
        :type match_pat: dict
        """
        self.match_pat = match_pat
        self.contains_pat = contains_pat

    def worker(self, data_frame):
        """
        Classifies pages by type (URL patterns)
        :param data_frame: DataFrame to classify
        :return: DataFrame classified by URL patterns
        """
        try:
            # Label rows whose URL contains any of the patterns for this class.
            for key, value in self.contains_pat.items():
                reg_exp = '|'.join(value)
                data_frame.loc[data_frame['url'].str.contains(reg_exp, regex=True), ['page_class']] = key
            # Label rows whose URL exactly matches one of the patterns.
            for key, value in self.match_pat.items():
                data_frame.loc[data_frame['url'].isin(value), ['page_class']] = key
            return data_frame
        except Exception as e:
            print('page_classifier(): ', e, type(e))
df = pd.read_csv('logs.csv',
                 delimiter='\t', parse_dates=['date'],
                 chunksize=1000000)

contains = {'catalog': ['/category/', '/tags', '/search'],
            'resources': ['.css', '.js', '.woff', '.ttf', '.html', '.php']}
match = {'info_pages': ['/information', '/about-us']}
classify = PageClassifier(contains, match)

new_pd = pd.DataFrame()
for num, chunk in enumerate(df):
    print('Start chunk ', num)
    new_pd = pd.concat([new_pd, classify.worker(chunk)])

new_pd.to_csv('classified.csv', sep='\t', index=False)
But it is very slow and takes too much RAM when I work with files over 10 GB. How can I search and modify the data faster? I need both the "exact match" and "contains" pattern searches in one function.
1 Answer
The first thing I notice here, and the one that will tank performance the most, is:
new_pd = pd.concat([new_pd, classify.worker(chunk)])
cs95 outlines this issue very well in their answer here. The general advice is "NEVER grow a DataFrame!". Growing the DataFrame this way is quadratic in time: every iteration copies the entire accumulated DataFrame, and since that DataFrame only gets larger, each copy costs more than the last.
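To see the effect concretely, here is a rough, self-contained timing sketch (with made-up data rather than the real logs) comparing growing a DataFrame inside a loop against concatenating once at the end:

import time
import pandas as pd

# Made-up chunks purely for illustration.
chunks = [pd.DataFrame({'url': ['/about-us'] * 10_000}) for _ in range(200)]

# Anti-pattern: grow the DataFrame inside the loop (copies everything so far each time).
start = time.perf_counter()
grown = pd.DataFrame()
for chunk in chunks:
    grown = pd.concat([grown, chunk])
print('grow in loop:', time.perf_counter() - start)

# Better: a single concat at the end copies the data only once.
start = time.perf_counter()
combined = pd.concat(chunks, ignore_index=True)
print('concat once: ', time.perf_counter() - start)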
If we wanted to improve this approach we might consider something like:
df_list = []
for num, chunk in enumerate(df):
    df_list.append(classify.worker(chunk))

new_pd = pd.concat(df_list, ignore_index=True)
new_pd.to_csv('classified.csv', sep='\t', index=False)
However, assuming we don't ever need the entire DataFrame in memory at once, and given that our logs.csv is so large that we need to read it in chunks, we should also consider writing out our DataFrame in chunks:
for num, chunk in enumerate(df):
    classify.worker(chunk).to_csv(
        'classified.csv', sep='\t', index=False,
        header=(num == 0),             # only write the header for the first chunk
        mode='w' if num == 0 else 'a'  # append mode after the first iteration
    )
In terms of reading in the file, we only appear to be using the url and page_class columns. Since we're not using the DateTime functionality of the date column, we don't need to take the time to parse it:
df = pd.read_csv('logs.csv', delimiter='\t', chunksize=1000000)
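Putting the pieces together, a sketch of the whole streaming pipeline might look like the following (same file names, patterns, and chunk size as in the question; the PageClassifier class itself is unchanged). Only one chunk is ever held in memory, and each classified chunk is appended straight to the output file:

import pandas as pd

contains = {'catalog': ['/category/', '/tags', '/search'],
            'resources': ['.css', '.js', '.woff', '.ttf', '.html', '.php']}
match = {'info_pages': ['/information', '/about-us']}
classify = PageClassifier(contains, match)

# Read without parse_dates and stream each classified chunk straight to disk.
reader = pd.read_csv('logs.csv', delimiter='\t', chunksize=1000000)
for num, chunk in enumerate(reader):
    classify.worker(chunk).to_csv(
        'classified.csv', sep='\t', index=False,
        header=(num == 0),
        mode='w' if num == 0 else 'a'
    )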
Comment (drkwng, Oct 15, 2021): Thanks for your answer, especially for header=(num == 0), mode='w' if num == 0 else 'a'. This was a very useful example for my practice.