I wrote a small utility for searching our web app log files for certain strings. The idea is that if the search string is found somewhere in the entry, the entire entry should be printed.
The constraints are:
- One log entry can span one to many lines (stack traces, for example)
- Each log entry starts with an uppercase string specifying log type, followed by a pipe character ("ERROR|...")
- Each log entry ends with a pipe character followed by an exclamation mark ("|!")
- The utility should work much like a regular grep command, meaning that it takes input either from a list of files or from STDIN.
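For example (made-up sample data), a one-line entry and a multi-line entry might look like this:

INFO|2016-03-01 10:15:00|Application started|!
ERROR|2016-03-01 10:15:02|Unhandled exception while processing request
java.lang.NullPointerException
    at com.example.web.RequestHandler.handle(RequestHandler.java:42)
    at com.example.web.Dispatcher.dispatch(Dispatcher.java:17)|!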
The script works as expected, but it is quite slow. Writing the same logic in an AWK script is about twice as fast when processing a large amount of data. Any suggestions on how to improve the performance?
logentrygrep.py
#!/bin/python2.7
from __future__ import print_function
import argparse
import fileinput
import re
import signal
import sys
# Disable printing stacktrace on KeyboardInterrupt
signal.signal(signal.SIGINT, lambda x,y: sys.exit(1))
# Parse arguments
parser = argparse.ArgumentParser(description='Tool for grepping log entry contents')
parser.add_argument('pattern', help='Pattern to search for')
parser.add_argument('file', nargs='*', help='Files to search')
parser.add_argument('-i', '--ignore-case', dest="ignorecase", action='store_true', help='Case-insensitive grepping')
parser.add_argument('-v', '--verbose', dest="verbose", action='store_true', help='Print verbose messages')
args = parser.parse_args()
VERBOSE = args.verbose
if VERBOSE:
    print("Pattern to search for: " + args.pattern)
    print("File(s): " + (str(args.file) if args.file else "STDIN"))
class LogDataProcessor:

    entryHeadPattern = re.compile("^[A-Z]{3,}\|")
    entryTailPattern = re.compile("\|!$")

    totalLinesProcessed = 0
    lineBuffer = []
    matchFound = False

    def __init__(self, regex_pattern, ignorecase):
        if ignorecase:
            self.pattern = re.compile(regex_pattern, re.IGNORECASE)
        else:
            self.pattern = re.compile(regex_pattern)

    def process(self, indata):
        for line in fileinput.input(indata):
            self.processLine(line.rstrip())

    def processLine(self, line):
        self.totalLinesProcessed += 1
        if self.isEntryHead(line):
            # Flush in case previous entry did not terminate correctly
            self.flush()
            self.addToBuffer(line)
            # If it's a one-line entry (head and tail on the same line)
            if self.isEntryTail(line):
                self.flush()
        elif not self.isBufferEmpty():
            self.addToBuffer(line)
            if self.isEntryTail(line):
                self.flush()

    def flush(self):
        if self.matchFound:
            self.printBuffer()
        self.lineBuffer = []
        self.matchFound = False

    def printBuffer(self):
        for line in self.lineBuffer:
            print(line)

    def addToBuffer(self, line):
        if not self.matchFound and self.lineMatches(line):
            self.matchFound = True
        self.lineBuffer.append(line)

    def isBufferEmpty(self):
        return not self.lineBuffer

    def isEntryHead(self, line):
        return self.entryHeadPattern.match(line)

    def isEntryTail(self, line):
        return self.entryTailPattern.search(line)

    def lineMatches(self, line):
        return self.pattern.search(line)
processor = LogDataProcessor(args.pattern, args.ignorecase)
processor.process(args.file)
if VERBOSE:
    print("Lines searched: " + str(processor.totalLinesProcessed))
1 Answer
On my Windows 10 laptop, I ran your program on a dummy log file under the Python profiler using the command line:
python -m cProfile -s cumulative loggrep.py "speech" \data\test.log
The dummy log file has about 4.3 Mbytes of text spread across about 100k lines comprising 32772 log entries. The search pattern was "speech", which occurred 444 times.
This is the output of the profiler:
1322924 function calls (1322872 primitive calls) in 2.006 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
7/1 0.000 0.000 2.006 2.006 {built-in method builtins.exec}
1 0.000 0.000 2.006 2.006 loggrep.py:3(<module>)
1 0.077 0.077 1.997 1.997 loggrep.py:38(process)
100003 0.136 0.000 1.788 0.000 loggrep.py:42(processLine)
32772 0.013 0.000 1.369 0.000 loggrep.py:60(flush) <-----
444 0.014 0.000 1.355 0.003 loggrep.py:67(printBuffer) <-----
9722 1.341 0.000 1.341 0.000 {built-in method builtins.print} <-----
100003 0.061 0.000 0.132 0.000 loggrep.py:71(addToBuffer)
100004 0.042 0.000 0.116 0.000 fileinput.py:248(__next__)
100003 0.028 0.000 0.075 0.000 loggrep.py:80(isEntryHead)
100004 0.060 0.000 0.071 0.000 {method 'readline' of '_io.TextIOWrapper' objects}
195293 0.068 0.000 0.068 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
100003 0.027 0.000 0.062 0.000 loggrep.py:83(isEntryTail)
95290 0.027 0.000 0.060 0.000 loggrep.py:86(lineMatches)
100010 0.047 0.000 0.047 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
The header shows the program took 2.006 seconds to run. Looking at the lines marked with "<-----", we see that flush() was called 32772 times and took about 1.369 seconds. From the code, we know flush() calls printBuffer(), which calls print(). The profile says print() was called 9722 times, taking 1.341 seconds, or about 65% of the total run time.
I changed entryTailPattern to "\|!\s*$". That let me get rid of the .rstrip() on each line, so the lines keep their '\n' at the end, which in turn let me replace the print() for loop in printBuffer() with a single call to sys.stdout.writelines(). This saved about 0.5 seconds over the print() loop.
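Put together, the changes look roughly like this (my reconstruction of the affected members of LogDataProcessor, not the exact code I ran):

    # inside LogDataProcessor
    entryTailPattern = re.compile("\|!\s*$")    # tolerate the trailing newline/whitespace

    def process(self, indata):
        for line in fileinput.input(indata):
            self.processLine(line)              # no .rstrip(); lines keep their '\n'

    def printBuffer(self):
        # one buffered write per flushed entry instead of one print() call per line
        sys.stdout.writelines(self.lineBuffer)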
All the other function calls seem to be in the noise, so I don't see any other promising things to try.
Why not use awk for what it is good for?... If I remember correctly it's possible to parse command line arguments with an awk script too. Hint: Python could still be used; you could even pipe things around to-and-fro between gawk and Python, and using unbuffering tricks it's still possible to sorta yield a sub-process into Python. It isn't easy, but it reads like you want more performance than convenience.
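One way to read that suggestion is to let gawk do the record splitting and matching while Python just drives it as a subprocess. This is only an illustrative sketch: the awk program, the argument handling, and the assumption that gawk is on PATH are mine, not from the comment.

    #!/usr/bin/env python2.7
    # Hypothetical sketch: delegate per-entry matching to a gawk subprocess.
    import subprocess
    import sys

    # gawk treats a multi-character RS as a regex, so "|!\n" ends each record;
    # matching records are reprinted with the "|!" terminator restored.
    AWK_PROG = r'BEGIN { RS = "\\|!\n" } $0 ~ pat { print $0 "|!" }'

    def grep_entries(pattern, files):
        # -v pat=... passes the search pattern into the awk program;
        # with no files, gawk falls back to reading STDIN, like grep.
        cmd = ["gawk", "-v", "pat=" + pattern, AWK_PROG] + files
        return subprocess.call(cmd)

    if __name__ == "__main__":
        sys.exit(grep_entries(sys.argv[1], sys.argv[2:]))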