I have around 300 log files in a directory, and each log file contains around 3,300,000 lines. I need to read each file line by line, extract the hostname from every line, and count how many times each hostname appears. I wrote basic code for that task, but it takes more than an hour to run and uses a lot of memory as well. How can I improve this code to make it run faster?
import os
import gzip
import pandas as pd

directory=os.fsdecode("/home/scratch/mdsadmin/publisher_report/2018-07-25") #folder with 300 log files
listi=os.listdir(directory) #listing the log files
df_final=pd.DataFrame(columns=['hostname']) #dataframe collecting the hostnames from every file

for file in listi: #taking each log file in the list
    tt=os.path.join(directory,file) #joining the log file name with the directory path
    with gzip.open(tt,'rt') as f: #unzipping the log file
        rows=[] #clearing the list for every file
        for line in f: #reading each line in the file
            s=len(line.split('|'))
            a=line.split('|')[s-3]
            b=a.split('/')[0] #slicing just the hostname out of each line in the log file
            if len(b.split('.'))==None:
                ''
            else:
                b=b.split('.')[0]
            rows.append(b) #appending it to a list
        df_temp=pd.DataFrame(columns=['hostname'],data=rows) #list to a dataframe after every file is read
        df_final=df_final.append(df_temp,ignore_index=True) #appending that dataframe to the final one to avoid overwriting
        del df_temp #deleting the temp dataframe to clear memory

df_final=df_final.groupby(["hostname"]).size().reset_index(name="Topic_Count") #doing the count
Sample log lines
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokdd333.sc.sc.com/16604/#001b0001|7632663/2344|342344|23244
Desired output: a table with one row per hostname and its total count across all files (columns hostname and Topic_Count).
2 Answers
So I think you can improve the efficiency of your code like this.
First, as I said in one comment, you can replace:
s=len(line.split('|'))
a=line.split('|')[s-3]
by
a=line.split('|')[-3]
since there is no need to know the total length of a list to get the third element from the end.
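As a tiny illustration (the line here is made up, with the hostname field third from the end), negative indices count from the end of the list, so the length lookup is redundant:

parts = "tx:...|rx:...|dt:0|host01.example.com/16604/#001b0001|2344|455".split('|')  # made-up line
print(parts[len(parts) - 3])  # host01.example.com/16604/#001b0001
print(parts[-3])              # the same element, without computing len() first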
Second, assigning a and then b takes some time; you can do it in one line:
a=line.split('|')[-3]
b=a.split('/')[0]
becomes
b=line.split('|')[-3].split('/')[0]
Third, I'm not sure len(...) can ever equal None; maybe you wanted to check for 0, but if your code runs like this, I would say that:
if len(b.split('.'))==None:
    ''
else:
    b=b.split('.')[0]
is not useful, so you can compute the final b directly with:
b=line.split('|')[-3].split('/')[0].split('.')[0]
Fourth, because you don't actually need to assign b anymore, you can append the value to rows directly, such as:
rows=[]
for line in f:
    rows.append(line.split('|')[-3].split('/')[0].split('.')[0])
or as a list comprehension:
rows = [line.split('|')[-3].split('/')[0].split('.')[0] for line in f]
Fifth, you create df_temp only to use it once and then delete it; you can append directly into df_final, so:
df_temp= pd.DataFrame(columns=['hostname'],data=rows)
df_final=df_final.append(df_temp,ignore_index=True)
del df_temp
is better this way:
df_final=df_final.append(pd.DataFrame(columns=['hostname'],data=rows),
                         ignore_index=True)
Ultimately, rows is not necessary anymore, so the whole block from the with gzip.open(...) line down to the del df_temp line can be written:
with gzip.open(tt,'rt') as f:
    df_final=df_final.append(pd.DataFrame(columns=['hostname'],
                                          data=[line.split('|')[-3].split('/')[0].split('.')[0] for line in f]),
                             ignore_index=True)
So far, I think we have saved some time, but I know that appending dataframes in a loop is not best practice, especially because you need to reassign df_final each time. It's better to collect all the dataframes that you want to append in a list, and then use pd.concat once, outside of the loop. Your code becomes:
list_final = []
for file in listi:
    tt=os.path.join(directory,file)
    with gzip.open(tt,'rt') as f:
        list_final.append(pd.DataFrame(columns=['hostname'],
                                       data=[line.split('|')[-3].split('/')[0].split('.')[0]
                                             for line in f]))
df_final = (pd.concat(list_final,ignore_index=True)
              .groupby(["hostname"]).size().reset_index(name="Topic_Count"))
Timing
I created one file with around 3 million rows: running your method took 8.9 seconds while mine took 5.8 (a gain of more than 30%). I then ran the code on a listi containing 10 copies of this file, and your method took more than 91 seconds (a bit more than strictly 10 times the single-file run) while mine took about 57 seconds (a bit less than 10 times the single-file run).
I don't know much about multiprocessing or otherwise parallelizing calculations in Python, but it may be a good option too.
- Thank you for such a detailed explanation! Doing all the splits in one line really helped :) – Mamatha, Aug 13, 2018 at 14:57
- @Mamatha you are welcome. Good luck improving your code :) – Ben.T, Aug 13, 2018 at 16:31
Splitting a string and taking only one substring from that split does a lot of work, just to throw away most of the results.
Consider the following line:
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455
It seems safe to assume that tx: ... |rx: ... | is a fixed format. Starting at dt:, we might see some variation; for instance, dt:10 is longer than dt:0. So while the position of the hostname might vary a little bit, it is easy to find the starting point: just after the first | character after the first 62 characters. Similarly, the end point is the first . (if any) before the first /:
start = line.index('|', 62)+1
slash = line.index('/', start)
dot = line.find('.', start, slash)
end = dot if dot > 0 else slash
b = line[start:end]
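To make the offsets concrete, here is a small check of that slicing against the first sample line from the question (nothing beyond the code already shown above):

line = ("tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST"
        "|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455")
start = line.index('|', 62) + 1     # '|' just before the hostname field
slash = line.index('/', start)      # end of the hostname field
dot = line.find('.', start, slash)  # first '.' inside the hostname field, if any
end = dot if dot > 0 else slash
print(line[start:end])              # prints: wokpa22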
Running timing tests, I find this isolates the hostname in 40% of the time of:
a = line.split('|')[-4]
b = a.split('/')[0]
if len(b.split('.')) > 0:
    b = b.split('.')[0]
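(If you want to reproduce such a comparison yourself, a minimal sketch with timeit, using the sample line above and an arbitrary repeat count, could look like this.)

import timeit

line = ("tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST"
        "|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455")

def by_index():
    # index-based slicing, as above
    start = line.index('|', 62) + 1
    slash = line.index('/', start)
    dot = line.find('.', start, slash)
    end = dot if dot > 0 else slash
    return line[start:end]

def by_split():
    # split-based extraction, as above
    a = line.split('|')[-4]
    b = a.split('/')[0]
    if len(b.split('.')) > 0:
        b = b.split('.')[0]
    return b

print(timeit.timeit(by_index, number=1_000_000))
print(timeit.timeit(by_split, number=1_000_000))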
Finally, if all you are doing is getting a total count of each hostname across all of the files, appending the hostname to a rows list and using pandas to count the occurrences is painful. Simply use a Counter:
import collections

counter = collections.Counter()

for file in ...:
    for line in ...:
        ...
        hostname = line[start:end]
        counter[hostname] += 1
And then create your panda from the counter, with the hostname counts already totaled.
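For example, a minimal sketch of that last step (assuming counter holds the totals from the loop above, and reusing the column names from the question):

import pandas as pd

# one row per distinct hostname, counts already totaled
df_final = pd.DataFrame(list(counter.items()),
                        columns=['hostname', 'Topic_Count'])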
Lastly, as shown above, use better variable names, such as hostname instead of b.
Assuming that you are not I/O bound, you may be able to gain some speed using the multiprocessing module. Below, the list of files is distributed to a number of workers, one per CPU. Each worker process takes a file, unzips and reads it line by line, counting hostnames, and returns the counter. The main process receives the result for each file from the pool of processes and accumulates them into a single counter using sum(). Since the order of the results does not matter, .imap_unordered() can be used to avoid the overhead of ensuring the order of results matches the order of the inputs.
from multiprocessing import Pool
from collections import Counter
import gzip
import os

def count_hostnames(file):
    counter = Counter()
    with gzip.open(file, 'rt') as f:
        for line in f:
            # isolate the hostname (the start/end slicing shown above)
            start = line.index('|', 62) + 1
            slash = line.index('/', start)
            dot = line.find('.', start, slash)
            end = dot if dot > 0 else slash
            hostname = line[start:end]
            counter[hostname] += 1
    return counter

if __name__ == '__main__':  # Guard for multiprocessing re-import of __main__
    files = os.listdir(directory)  # directory as defined in the question
    files = [os.path.join(directory, file) for file in files]
    with Pool() as pool:
        counter = sum(pool.imap_unordered(count_hostnames, files), Counter())
    print(counter)  # or create your panda
- I've read about the Counter method a few times but never thought about using it; this is awesome :) – Ben.T, Aug 10, 2018 at 1:17
- Yes, Counter became my life saver! The program came down to 45 mins, so would that be the maximum speed, or is there a way to make it even faster? – Mamatha, Aug 13, 2018 at 14:53
- Python is an interpreted, loosely typed language; you could speed up your program by rewriting it in C. If you want to leave it in Python, you may get additional speed from bigger-picture optimizations. Are all 300 log files different each time you run, or are some "history" files that you can cache the hostname counts from? Processing each file in a separate process (not a Python thread!) may help if you are not I/O bound. (See the multiprocessing package.) – AJNeufeld, Aug 13, 2018 at 15:16
- Added an attempted speedup using multiprocessing.Pool. If you have more than one CPU and are not I/O bound, you may gain some speed. I'd be interested in hearing your result. – AJNeufeld, Aug 13, 2018 at 20:29