I need to classify URLs in a DataFrame and label each row using both "exact match" and "contains" conditions:
import pandas as pd


class PageClassifier:
    def __init__(self, contains_pat, match_pat):
        """
        :param contains_pat: A dict mapping page classes to lists of "contains" patterns
        :type contains_pat: dict
        :param match_pat: A dict mapping page classes to lists of exact-match patterns
        :type match_pat: dict
        """
        self.match_pat = match_pat
        self.contains_pat = contains_pat

    def worker(self, data_frame):
        """
        Classifies pages by type (URL patterns)
        :param data_frame: DataFrame to classify
        :return: DataFrame classified by URL patterns
        """
        try:
            # Label rows whose URL contains any of the patterns for this class.
            for key, value in self.contains_pat.items():
                reg_exp = '|'.join(value)
                data_frame.loc[data_frame['url'].str.contains(reg_exp, regex=True), ['page_class']] = key
            # Label rows whose URL exactly matches one of the patterns.
            for key, value in self.match_pat.items():
                data_frame.loc[data_frame['url'].isin(value), ['page_class']] = key
            return data_frame
        except Exception as e:
            print('page_classifier(): ', e, type(e))
df = pd.read_csv('logs.csv',
                 delimiter='\t', parse_dates=['date'],
                 chunksize=1000000)

contains = {'catalog': ['/category/', '/tags', '/search'],
            'resources': ['.css', '.js', '.woff', '.ttf', '.html', '.php']}
match = {'info_pages': ['/information', '/about-us']}
classify = PageClassifier(contains, match)

new_pd = pd.DataFrame()
for num, chunk in enumerate(df):
    print('Start chunk ', num)
    new_pd = pd.concat([new_pd, classify.worker(chunk)])

new_pd.to_csv('classified.csv', sep='\t', index=False)
But it is very slow and takes too much RAM when I work with files over 10 GB. How can I search and modify the data faster? I need both the "exact match" and "contains" pattern searches in one function.
1 Answer
The first thing I notice here, and the one that will tank performance the most, is:
new_pd = pd.concat([new_pd, classify.worker(chunk)])
cs95 outlines this issue very well in their answer here. The general advice is "NEVER grow a DataFrame!". Growing the DataFrame this way is quadratic in time: every iteration copies the entire accumulated DataFrame, and since that DataFrame only gets larger, each copy costs more than the last.
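To see the effect concretely, here is a rough, self-contained timing sketch (with made-up data rather than the real logs) comparing growing a DataFrame inside a loop against concatenating once at the end:

import time
import pandas as pd

# Made-up chunks purely for illustration.
chunks = [pd.DataFrame({'url': ['/about-us'] * 10_000}) for _ in range(200)]

# Anti-pattern: grow the DataFrame inside the loop (copies everything so far each time).
start = time.perf_counter()
grown = pd.DataFrame()
for chunk in chunks:
    grown = pd.concat([grown, chunk])
print('grow in loop:', time.perf_counter() - start)

# Better: a single concat at the end copies the data only once.
start = time.perf_counter()
combined = pd.concat(chunks, ignore_index=True)
print('concat once: ', time.perf_counter() - start)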
If we wanted to improve this approach we might consider something like:
df_list = []
for num, chunk in enumerate(df):
    df_list.append(classify.worker(chunk))

new_pd = pd.concat(df_list, ignore_index=True)
new_pd.to_csv('classified.csv', sep='\t', index=False)
However, assuming we don't ever need the entire DataFrame in memory at once, and given that our logs.csv is so large that we need to read it in chunks, we should also consider writing out our DataFrame in chunks:
for num, chunk in enumerate(df):
    classify.worker(chunk).to_csv(
        'classified.csv', sep='\t', index=False,
        header=(num == 0),             # only write the header for the first chunk
        mode='w' if num == 0 else 'a'  # append mode after the first iteration
    )
In terms of reading in the file, we only appear to be using the url and page_class columns. Since we're not using the DateTime functionality of the date column, we don't need to take the time to parse it:
df = pd.read_csv('logs.csv', delimiter='\t', chunksize=1000000)
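Putting the pieces together, a sketch of the whole streaming pipeline might look like the following (same file names, patterns, and chunk size as in the question; the PageClassifier class itself is unchanged). Only one chunk is ever held in memory, and each classified chunk is appended straight to the output file:

import pandas as pd

contains = {'catalog': ['/category/', '/tags', '/search'],
            'resources': ['.css', '.js', '.woff', '.ttf', '.html', '.php']}
match = {'info_pages': ['/information', '/about-us']}
classify = PageClassifier(contains, match)

# Read without parse_dates and stream each classified chunk straight to disk.
reader = pd.read_csv('logs.csv', delimiter='\t', chunksize=1000000)
for num, chunk in enumerate(reader):
    classify.worker(chunk).to_csv(
        'classified.csv', sep='\t', index=False,
        header=(num == 0),
        mode='w' if num == 0 else 'a'
    )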
Comment (drkwng, Oct 15, 2021): Thanks for your answer, especially for header=(num == 0), mode='w' if num == 0 else 'a'. This was a very useful example for my practice.