I have around 300 log files in a directory, and each log file contains around 3,300,000 lines. I need to read each file line by line, extract the hostname from every line, and count how many times each hostname appears. I wrote basic code for that task, but it takes more than an hour to run and uses a lot of memory as well. How can I improve this code to make it run faster?
import os
import gzip
import pandas as pd

directory=os.fsdecode("/home/scratch/mdsadmin/publisher_report/2018-07-25") #folder with 300 log files
listi=os.listdir(directory) #listing the log files
df_final=pd.DataFrame(columns=['hostname']) #dataframe collecting the hostnames from every file

for file in listi: #taking each log file in the list
    tt=os.path.join(directory,file) #joining the log file name with the directory path
    with gzip.open(tt,'rt') as f: #unzipping the log file
        rows=[] #clearing the list for every file
        for line in f: #reading each line in the file
            s=len(line.split('|'))
            a=line.split('|')[s-3]
            b=a.split('/')[0] #slicing just the hostname out of each line in the log file
            if len(b.split('.'))==None:
                ''
            else:
                b=b.split('.')[0]
            rows.append(b) #appending it to a list
        df_temp=pd.DataFrame(columns=['hostname'],data=rows) #list to a dataframe after every file is read
        df_final=df_final.append(df_temp,ignore_index=True) #appending that dataframe to the final one to avoid overwriting
        del df_temp #deleting the temp dataframe to clear memory

df_final=df_final.groupby(["hostname"]).size().reset_index(name="Topic_Count") #doing the count
Sample log lines
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokdd333.sc.sc.com/16604/#001b0001|7632663/2344|342344|23244
Desired output: a table with one row per hostname and its total count across all files (columns hostname and Topic_Count).
2 Answers
So I think you can improve the efficiency of your code like this.
First, as I said in one comment, you can replace:
s=len(line.split('|'))
a=line.split('|')[s-3]
by
a=line.split('|')[-3]
since there is no need to know the total length of a list to get the third element from the end.
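As a tiny illustration (the line here is made up, with the hostname field third from the end), negative indices count from the end of the list, so the length lookup is redundant:

parts = "tx:...|rx:...|dt:0|host01.example.com/16604/#001b0001|2344|455".split('|')  # made-up line
print(parts[len(parts) - 3])  # host01.example.com/16604/#001b0001
print(parts[-3])              # the same element, without computing len() first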
Second, assigning a and then b takes some time; you can do it in one line:
a=line.split('|')[-3]
b=a.split('/')[0]
becomes
b=line.split('|')[-3].split('/')[0]
Third, I'm not sure len(...) can ever equal None; maybe you wanted to check for 0, but if your code runs like this, I would say that:
if len(b.split('.'))==None:
    ''
else:
    b=b.split('.')[0]
is not useful, so you can compute the final b directly with:
b=line.split('|')[-3].split('/')[0].split('.')[0]
Fourth, because you don't actually need to assign b anymore, you can append the value to rows directly, such as:
rows=[]
for line in f:
    rows.append(line.split('|')[-3].split('/')[0].split('.')[0])
or as a list comprehension:
rows = [line.split('|')[-3].split('/')[0].split('.')[0] for line in f]
Fifth, you create df_temp only to use it once and then delete it; you can append directly into df_final, so:
df_temp= pd.DataFrame(columns=['hostname'],data=rows)
df_final=df_final.append(df_temp,ignore_index=True)
del df_temp
is better this way:
df_final=df_final.append(pd.DataFrame(columns=['hostname'],data=rows),
                         ignore_index=True)
Ultimately, rows is not necessary anymore, so the whole block from the with gzip.open(...) line down to the del df_temp line can be written:
with gzip.open(tt,'rt') as f:
    df_final=df_final.append(pd.DataFrame(columns=['hostname'],
                                          data=[line.split('|')[-3].split('/')[0].split('.')[0] for line in f]),
                             ignore_index=True)
So far, I think we have saved some time, but I know that appending dataframes in a loop is not best practice, especially because you need to reassign df_final each time. It's better to collect all the dataframes that you want to append in a list, and then use pd.concat once, outside of the loop. Your code becomes:
list_final = []
for file in listi:
    tt=os.path.join(directory,file)
    with gzip.open(tt,'rt') as f:
        list_final.append(pd.DataFrame(columns=['hostname'],
                                       data=[line.split('|')[-3].split('/')[0].split('.')[0]
                                             for line in f]))
df_final = (pd.concat(list_final,ignore_index=True)
              .groupby(["hostname"]).size().reset_index(name="Topic_Count"))
Timing
I created one file with around 3 million rows: running your method took 8.9 seconds while mine took 5.8 (a gain of more than 30%). I then ran the code on a listi containing 10 copies of this file, and your method took more than 91 seconds (a bit more than strictly 10 times the single-file run) while mine took about 57 seconds (a bit less than 10 times the single-file run).
I don't know much about multiprocessing or otherwise parallelizing calculations in Python, but it may be a good option too.
- Thank you for such a detailed explanation! Doing all the splits in one line really helped :) – Mamatha, Aug 13, 2018 at 14:57
- @Mamatha you are welcome. Good luck improving your code :) – Ben.T, Aug 13, 2018 at 16:31
Splitting a string and taking only one substring from that split does a lot of work, just to throw away most of the results.
Consider the following line:
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455
It seems safe to assume that tx: ... |rx: ... | is a fixed format. Starting at dt:, we might see some variation; for instance, dt:10 is longer than dt:0. So while the position of the hostname might vary a little bit, it is easy to find the starting point: just after the first | character after the first 62 characters. Similarly, the end point is the first . (if any) before the first /:
start = line.index('|', 62)+1
slash = line.index('/', start)
dot = line.find('.', start, slash)
end = dot if dot > 0 else slash
b = line[start:end]
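To make the offsets concrete, here is a small check of that slicing against the first sample line from the question (nothing beyond the code already shown above):

line = ("tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST"
        "|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455")
start = line.index('|', 62) + 1     # '|' just before the hostname field
slash = line.index('/', start)      # end of the hostname field
dot = line.find('.', start, slash)  # first '.' inside the hostname field, if any
end = dot if dot > 0 else slash
print(line[start:end])              # prints: wokpa22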
Running timing tests, I find this isolates the hostname in 40% of the time of:
a = line.split('|')[-4]
b = a.split('/')[0]
if len(b.split('.')) > 0:
    b = b.split('.')[0]
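(If you want to reproduce such a comparison yourself, a minimal sketch with timeit, using the sample line above and an arbitrary repeat count, could look like this.)

import timeit

line = ("tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST"
        "|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455")

def by_index():
    # index-based slicing, as above
    start = line.index('|', 62) + 1
    slash = line.index('/', start)
    dot = line.find('.', start, slash)
    end = dot if dot > 0 else slash
    return line[start:end]

def by_split():
    # split-based extraction, as above
    a = line.split('|')[-4]
    b = a.split('/')[0]
    if len(b.split('.')) > 0:
        b = b.split('.')[0]
    return b

print(timeit.timeit(by_index, number=1_000_000))
print(timeit.timeit(by_split, number=1_000_000))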
Finally, if all you are doing is getting a total count of each hostname across all of the files, appending the hostname to a rows list and using pandas to count the occurrences is painful. Simply use a Counter:
import collections

counter = collections.Counter()

for file in ...:
    for line in ...:
        ...
        hostname = line[start:end]
        counter[hostname] += 1
And then create your panda from the counter, with the hostname counts already totaled.
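For example, a minimal sketch of that last step (assuming counter holds the totals from the loop above, and reusing the column names from the question):

import pandas as pd

# one row per distinct hostname, counts already totaled
df_final = pd.DataFrame(list(counter.items()),
                        columns=['hostname', 'Topic_Count'])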
Lastly, as shown above, use better variable names, such as hostname instead of b.
Assuming that you are not I/O bound, you may be able to gain some speed using the multiprocessing module. Below, the list of files is distributed to a number of workers, one per CPU. Each worker process takes a file, unzips and reads it line by line, counting hostnames, and returns the counter. The main process receives the result for each file from the pool of processes and accumulates them into a single counter using sum(). Since the order of the results does not matter, .imap_unordered() can be used to avoid the overhead of ensuring the order of results matches the order of the inputs.
from multiprocessing import Pool
from collections import Counter
import gzip
import os

def count_hostnames(file):
    counter = Counter()
    with gzip.open(file, 'rt') as f:
        for line in f:
            # isolate the hostname (the start/end slicing shown above)
            start = line.index('|', 62) + 1
            slash = line.index('/', start)
            dot = line.find('.', start, slash)
            end = dot if dot > 0 else slash
            hostname = line[start:end]
            counter[hostname] += 1
    return counter

if __name__ == '__main__':  # Guard for multiprocessing re-import of __main__
    files = os.listdir(directory)  # directory as defined in the question
    files = [os.path.join(directory, file) for file in files]
    with Pool() as pool:
        counter = sum(pool.imap_unordered(count_hostnames, files), Counter())
    print(counter)  # or create your panda
- I've read about the Counter method a few times but never thought about using it; this is awesome :) – Ben.T, Aug 10, 2018 at 1:17
- Yes, Counter became my life saver! The program came down to 45 mins, so would that be the maximum speed, or is there a way to make it even faster? – Mamatha, Aug 13, 2018 at 14:53
- Python is an interpreted, loosely typed language; you could speed up your program by rewriting it in C. If you want to leave it in Python, you may get additional speed from bigger-picture optimizations. Are all 300 log files different each time you run, or are some "history" files that you can cache the hostname counts from? Processing each file in a separate process (not a Python thread!) may help if you are not I/O bound. (See the multiprocessing package.) – AJNeufeld, Aug 13, 2018 at 15:16
- Added an attempted speedup using multiprocessing.Pool. If you have more than one CPU and are not I/O bound, you may gain some speed. I'd be interested in hearing your result. – AJNeufeld, Aug 13, 2018 at 20:29