
I need to classify URLs from a DataFrame and modify it based on exact-match and contains conditions:

import pandas as pd


class PageClassifier:
    def __init__(self, contains_pat, match_pat):
        """
        :param contains_pat: A dict with contains patterns in values as lists
        :type contains_pat: dict
        :param match_pat: A dict with exact match patterns in values as lists
        :type match_pat: dict
        """
        self.match_pat = match_pat
        self.contains_pat = contains_pat

    def worker(self, data_frame):
        """
        Classifies pages by type (URL patterns)
        :param data_frame: DataFrame to classify
        :return: DataFrame classified by URL patterns
        """
        try:
            # Label rows whose URL contains any of the patterns
            for key, value in self.contains_pat.items():
                reg_exp = '|'.join(value)
                data_frame.loc[data_frame['url'].str.contains(reg_exp, regex=True), ['page_class']] = key
            # Label rows whose URL matches one of the patterns exactly
            for key, value in self.match_pat.items():
                data_frame.loc[data_frame['url'].isin(value), ['page_class']] = key
            return data_frame
        except Exception as e:
            print('page_classifier(): ', e, type(e))


df = pd.read_csv('logs.csv',
                 delimiter='\t', parse_dates=['date'],
                 chunksize=1000000)

contains = {'catalog': ['/category/', '/tags', '/search'], 'resources': ['.css', '.js', '.woff', '.ttf', '.html', '.php']}
match = {'info_pages': ['/information', '/about-us']}

classify = PageClassifier(contains, match)
new_pd = pd.DataFrame()
for num, chunk in enumerate(df):
    print('Start chunk ', num)
    new_pd = pd.concat([new_pd, classify.worker(chunk)])
new_pd.to_csv('classified.csv', sep='\t', index=False)

But it is very slow and takes too much RAM when I work with files over 10 GB. How can I search and modify the data faster? I need both "exact match" and "contains" pattern searching in one function.

asked Oct 14, 2021 at 19:00

1 Answer


The first thing I notice here that will tank performance the most is:

new_pd = pd.concat([new_pd, classify.worker(chunk)])

cs95 outlines this issue very well in their answer here. The general advice is "NEVER grow a DataFrame!". Essentially, creating a new copy of the DataFrame on every iteration is quadratic in time: the entire DataFrame is copied each time, and since it only gets larger, each copy costs more than the last.
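To make the cost concrete, here is a minimal sketch of the anti-pattern using toy data (not the asker's logs): every pass through the loop copies all of the rows accumulated so far, so N chunks cost roughly O(N²) row copies overall.

import pandas as pd

grown = pd.DataFrame()
for i in range(5):
    # stand-in for classify.worker(chunk); real chunks would be ~1,000,000 rows
    chunk = pd.DataFrame({'url': ['/page/%d' % i] * 3, 'page_class': 'demo'})
    # copies every previously accumulated row again on each iteration
    grown = pd.concat([grown, chunk])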

If we wanted to improve this approach we might consider something like:

df_list = []
for num, chunk in enumerate(df):
 df_list.append(classify.worker(chunk))
new_pd = pd.concat(df_list, ignore_index=True)
new_pd.to_csv('classified.csv', sep='\t', index=False)

However, assuming we don't ever need the entire DataFrame in memory at once, and given that our logs.csv is so large that we need to read it in chunks, we should also consider writing out our DataFrame in chunks:

for num, chunk in enumerate(df):
 classify.worker(chunk).to_csv(
 'classified.csv', sep='\t', index=False,
 header=(num == 0), # only write the header for the first chunk
 mode='w' if num == 0 else 'a' # append mode after the first chunk
 )

In terms of reading in the file, we appear to be using only the url and page_class columns. Since we're not using the DateTime functionality of the date column, we don't need to take the time to parse it.

df = pd.read_csv('logs.csv', delimiter='\t', chunksize=1000000)
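Going one step further, and only if downstream consumers of classified.csv can live with just the url and page_class columns (an assumption; the code above writes out every column), usecols would keep pandas from loading the unused columns at all:

# Hypothetical variant: drops all other columns from each chunk, so only use
# this if the output really only needs url plus the derived page_class.
df = pd.read_csv('logs.csv', delimiter='\t', usecols=['url'], chunksize=1000000)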
answered Oct 15, 2021 at 4:35
  • Thanks for your answer, especially for "header=(num == 0), mode='w' if num == 0 else 'a'". This was a very useful example for my practice. Commented Oct 15, 2021 at 7:00
