I'm trying to speed up a Python script that reads a large log file (JSON lines, 50 GB+) and filters out the results that match any of 2,000 CIDR ranges.
Logfile
20 million lines
{"ip":"xxx.xxx.xxx.xxx","timestamp":"2017-05-27T04:00:35-04:00","data":{},"error":"EOF","error_component":"banner"}
{"ip":"xxx.xxx.xxx.xxx","timestamp":"2017-05-27T04:00:35-04:00","data":{"banner":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","ehlo":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","starttls":"500 Unknown command\r\n"},"error":"Bad return code for STARTTLS","error_component":"starttls"}
{"ip":"xxx.xxx.xxx.xxx","timestamp":"2017-05-27T04:00:35-04:00","data":{},"error":"EOF","error_component":"banner"}
{"ip":"xxx.xxx.xxx.xxx","timestamp":"2017-05-27T04:00:35-04:00","data":{"banner":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","ehlo":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","starttls":"502 No such command\r\n"},"error":"Bad return code for STARTTLS","error_component":"starttls"}
CIDR file
2,000 lines
86.80.0.0/12
77.160.0.0/12
188.200.0.0/13
178.224.0.0/13
84.24.0.0/13
Script
import sys
import json
from netaddr import *

reload(sys)
sys.setdefaultencoding('utf-8')

filename = 'results.json'
filename = unicode(filename, 'utf-8')
cidr_filename = 'cidr.txt'
rowcount = 0
count = 0

# Load CIDR ranges
with open(cidr_filename, 'r') as f:
    cidr = [line.strip() for line in f]

# Load JSON line by line
with open(filename) as f:
    for line in f:
        output = json.loads(line)
        rowcount += 1
        # Match if IP is in CIDR ranges
        if all_matching_cidrs(output['ip'], cidr):
            if 'banner' in output['data']:
                print(output['ip'] + '\t' + output['data']['banner'])
                count += 1

print('---------------------------------------')
print('LINES:   {}'.format(rowcount))
print('RESULTS: {}'.format(count))
print('---------------------------------------')
Current results
Parsing an example set of 100,000 rows now takes 8 minutes using:
- PyPy
- MacBook Pro with 2.8 GHz Intel Core i7, 16 GB RAM, SSD
Parsing the complete set of 20,000,000 rows would take a staggering 26 hours.
---------------------------------------
LINES: 100000
RESULTS: 1243
---------------------------------------
real 7m57.739s
user 7m52.127s
sys 0m4.177s
The bottleneck is the number of CIDR ranges to search within: when I run an example set of 100,000 rows against a single CIDR range, it takes only 1.2 seconds.
---------------------------------------
LINES: 100000
RESULTS: 4
---------------------------------------
real 0m1.201s
user 0m1.095s
sys 0m0.090s
Is there a faster way of accomplishing this? Would Multithreading/Multiprocessing speed things up? Any help or other feedback would be much appreciated!
Things I've done:
- Using PyPy, which is 9x(!) faster than Python 2.7 for this job.
- Tried Tim Bray's Wide Finder, but couldn't make it work, as it focuses on regex searches.
UPDATE
rolfl's solution brought the time to parse 20,344,457 rows down from ±26 hours to 4.5 minutes!
---------------------------------------
LINES: 20344457
RESULTS: 130863
---------------------------------------
real 4m27.661s
user 3m55.171s
sys 0m26.793s
TemporalWolf's advice to profile my code with cProfile showed that json.loads() was indeed a bottleneck:
ncalls tottime percall cumtime percall filename:lineno(function)
16607394 131.960 0.000 131.960 0.000 {_pypyjson.loads}
Following his advice to slice the IP address out of each line directly, instead of parsing every line as JSON, made it another 2.5x faster!
---------------------------------------
LINES: 20344457
RESULTS: 130863
---------------------------------------
real 1m40.548s
user 1m13.885s
sys 0m22.664s
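For anyone curious, here is a minimal sketch of what that native slicing might look like, combined with the CIDR tree from rolfl's answer below (names are taken from the script above; the slice offsets assume every line starts with {"ip":" exactly as in the samples, so adjust if your format differs):

with open(filename) as f:
    for line in f:
        rowcount += 1
        # The address starts right after the 7 characters of {"ip":"
        # and runs up to the next double quote.
        ip = line[7:line.index('"', 7)]
        if not cidr_tree.matches(ip):
            continue
        # Only pay the JSON-parsing cost for lines that actually match
        output = json.loads(line)
        if 'banner' in output['data']:
            print(output['ip'] + '\t' + output['data']['banner'])
            count += 1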
4 Answers
Is there a faster way of accomplishing this? Would Multithreading/Multiprocessing speed things up? Any help or other feedback would be much appreciated!
No, for the most part, multi-threading will make no difference for you. At some point the bottleneck should be the IO speed of reading 50GB of file content, and not the speed of the processing. You also need to read the file sequentially (I presume) to get the output in the same order as the input.
But, fundamentally, the solution should not need to have multi-threaded execution in order to improve the performance.
Learning how to measure performance of various parts of your code is an important skill. For the moment, it may be as simple as timing this code:
# Load JSON line by line
with open(filename) as f:
    for line in f:
        output = json.loads(line)
        rowcount += 1

i.e. just convert each line from JSON and count the lines... how fast is that? I would expect the whole program to be nearly as fast once the IP CIDR lookups are fast too.
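A minimal way to time just that decode-and-count loop might be something like this (a sketch; the filename is the one from the question's script):

import json
import time

start = time.time()
rowcount = 0
with open('results.json') as f:
    for line in f:
        json.loads(line)
        rowcount += 1
elapsed = time.time() - start
print('decoded {} lines in {:.1f} seconds'.format(rowcount, elapsed))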
Your performance issue is almost certainly related to this line here:
if all_matching_cidrs(output['ip'], cidr):
Your timings already support this... it takes about 1 second to check all records against 1 CIDR, but significantly longer for 2,000 CIDRs...
So, you have a performance problem in the order of \$O(mn)\$ where \$m\$ is the number of rows in the file, and \$n\$ is the number of CIDRs.
You can't improve the performance related to the number of rows in the file, but you can improve the cost of the CIDR lookups. What if checking all the CIDR matches were a fixed cost? Then your overall performance becomes \$O(m)\$ and no longer depends on the number of CIDR records.
You can do this by preprocessing the CIDR data in to a structure that allows a fixed-cost lookup.
The structure I would use is a binary tree consisting of nodes representing each bit in the CIDR specs. Each leaf node represents a CIDR to include. I.e. you preprocess the CIDRs into a tree that has at most 32 levels (for a /32 CIDR).

Then, for the lookup, you take your IP from the JSON, convert it into an integer, and start inspecting bits from the most significant. For each bit, you descend the CIDR tree, and if you can keep descending until you hit a leaf node, then you have found a matching CIDR. At most this will be 32 iterations down the tree, but for the most part CIDRs are seldom that specific. So, assuming at most a /24 CIDR, you reduce your lookups to at most 24 descents, instead of as many as 2,000 complete checks.
It comes down to the algorithm.
Update - example lookup
Note, I hacked together this tree for supporting faster lookups of IPs in a number of CIDR ranges. Python is not my primary language, so inspect it carefully, and adjust as needed. Specifically, I have used some naive mechanisms for parsing IP addresses into integers. Use dedicated libraries to do that instead.
You can see it running on ideone: https://ideone.com/cd0O2I
def parseIPPart(ipx, shift):
    try:
        return int(ipx) << shift
    except (TypeError, ValueError):
        return 0


def parseIP(ipString):
    # Pair each dotted octet with its bit shift: 24, 16, 8, 0
    ips_shifts = zip(ipString.split("."), range(24, -1, -8))
    addr = [parseIPPart(ip, shift) for ip, shift in ips_shifts]
    return sum(addr)


def parseCIDR(cidr):
    addrString, bitsString = cidr.split('/')
    try:
        bits = int(bitsString)
    except (TypeError, ValueError):
        bits = 32
    addr = parseIP(addrString)
    return addr, bits


class CIDRTree:

    class CIDRNode:
        def __init__(self, depth):
            self.depth = depth
            self.isset = None
            self.unset = None
            self.leaf = False

    def __init__(self):
        self.root = CIDRTree.CIDRNode(-1)

    def addCIDR(self, cidr):
        ip, bits = parseCIDR(cidr)
        node = self.root
        for b in range(bits):
            if node.leaf:
                # Previous bigger CIDR covers this subnet
                return
            mask = 1 << (31 - b)
            if (ip & mask) != 0:
                if node.isset is None:
                    node.isset = CIDRTree.CIDRNode(b)
                kid = node.isset
            else:
                if node.unset is None:
                    node.unset = CIDRTree.CIDRNode(b)
                kid = node.unset
            node = kid
        # node is now a representation of the leaf that comes from this CIDR.
        # Clear out any more specific CIDRs that are no longer relevant
        # (this CIDR includes a previous CIDR)
        node.isset = None
        node.unset = None
        node.leaf = True
        # print("Added CIDR ", ip, " and bits ", bits)

    def matches(self, ipString):
        ip = parseIP(ipString)
        node = self.root
        shift = 0
        while node is not None and not node.leaf:
            shift += 1
            mask = 1 << (32 - shift)
            val = (ip & mask) != 0
            node = node.isset if val else node.unset
        return node is not None and node.leaf


if __name__ == "__main__":
    cidrTree = CIDRTree()
    cidrTree.addCIDR("8.0.0.0/8")
    cidrTree.addCIDR("9.8.7.0/24")

    print("Tree matches 8.8.8.8:", cidrTree.matches("8.8.8.8"))
    print("Tree matches 9.9.9.9:", cidrTree.matches("9.9.9.9"))
    print("Tree matches 9.8.7.6:", cidrTree.matches("9.8.7.6"))
- WOW, many thanks. It is insanely fast: 100,000 rows now take only 1.3 seconds instead of 7 minutes:
  real 0m1.350s  user 0m1.174s  sys 0m0.146s
  – JMK09, Jun 14, 2017
- I've run it on the complete set of 20,344,457 rows:
  real 4m27.661s  user 3m55.171s  sys 0m26.793s
  That's even faster than my initial attempt of 100,000 records :) – JMK09, Jun 14, 2017
- Just FYI, I crunched some numbers, and it's processing about 75,000 records per second now, or 190 MB/s, which I feel is in the right sort of ball-park for your system. SSD should be a bit faster, but there are some overheads, and I expect that if you had a concurrent IO/processing system now (computing matches in one thread while blocking on IO in another), you could reduce the time a bit... but it's probably not worth it. There's still room for improvement, but it will be hard work... – rolfl, Jun 14, 2017
- Nitpick for the explanation: a binary search tree reduces the complexity to \$O(\log n)\$, so the end complexity would be \$O(m \log n)\$. – TemporalWolf, Jun 14, 2017
- @TemporalWolf - that's not strictly accurate. The reality is that the n in the \$\log n\$ is a constant, 32 (the number of bits in an IP address), not the number of CIDRs, so, being a constant, the complexity is just \$O(m)\$. – rolfl, Jun 14, 2017
I do not see the reason behind using sys.setdefaultencoding() for the log and CIDR files you are dealing with.

You may be interested in reading up on why sys.setdefaultencoding() is generally discouraged.
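If encoding issues were the motivation for those two lines, a less invasive sketch is to declare the encoding where the file is opened (io.open accepts an explicit encoding on both Python 2 and Python 3):

import io

# Instead of reload(sys) + sys.setdefaultencoding('utf-8'), decode at the
# point where the file is opened; each line comes back as unicode text.
with io.open('results.json', 'r', encoding='utf-8', errors='replace') as f:
    for line in f:
        pass  # process the decoded line here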
- I ran into some encoding issues when parsing (I believe it was) Japanese characters. I removed these two lines and ran the script again; it didn't give any performance improvement.
  real 7m17.190s  user 7m14.737s  sys 0m1.853s
  – JMK09, Jun 14, 2017
Use a lookup table to avoid quadratic behaviour
Assuming that you have N IPs to match against M CIDR ranges, checking the way you do costs O(N*M), since you're checking each CIDR once per IP.

Assuming that the bottleneck of this program is not IO but checking CIDR ranges, we can improve this via pre-computation.
Changing anything without proper performance profiling is pointless, though; I suggest using perf to confirm the big assumptions first.
There's not that many IPv4s
There are 2^32 IPv4 addresses = 4,294,967,296 IPs, just above 4 billion. A lot for a human, but not for a CPU.

Each IP either is or isn't part of one of the CIDR ranges you loaded, so storing that information for all 4 billion IPs means one boolean = 1 bit x 2^32.

So conceptually, we could store this in 2^32 bits = 512 MiB of memory as a bit array, and for each given IP check the is-IP-in-a-CIDR lookup table at that offset, getting an instant answer without having to iterate through the CIDR ranges for each IP.
Trading CPU time for memory
We're not doing magic, here, just trading off CPU time against memory: we have plenty of memory nowadays, but (CPU) time is expensive, the thing we're optimizing for.
At the same time, 512 MiB is cheap enough to allocate on a modern machine, and the frequently touched parts of the table stay resident in the CPU caches, which keeps lookups fast.
Suggested implementation
So, let's dream up the solution:
from ipaddress import IPv4Address, ip_network

def ip_offset(ip: str) -> int:
    """Convert an IP string into a consistent integer offset"""
    # Magic trick: ANY function here could work, as long as 1 IP = 1 int value
    # Here done by using the "packed" version:
    # https://docs.python.org/3/library/ipaddress.html#ipaddress.IPv4Address.packed
    # > The binary representation of this address - a bytes object of the
    # > appropriate length (most significant octet first).
    return int.from_bytes(IPv4Address(ip).packed, 'big')

ipv4_in_cidr_lookup: list[bool] = [False] * 2**32  # see optimizations section below

# Offline: pre-compute which IPv4s the CIDR ranges hit before we run
for cidr in CIDR_RANGES:
    for cidr_ip in ip_network(cidr):  # Enumerate all IPs in a CIDR range
        cidr_ip_integer = ip_offset(str(cidr_ip))
        ipv4_in_cidr_lookup[cidr_ip_integer] = True
# Save/reload the array to file if you want to remember that.

some_ip = "188.200.2.2"
some_ip_integer = ip_offset(some_ip)
is_ip_in_listed_ranges = ipv4_in_cidr_lookup[some_ip_integer]
Exact code not run, this is just for illustration purposes
Python byte-array for memory reduction
I assumed an array of bits, so 8 bits per byte, but the naive implementation as list[bool] costs a lot more memory than that, due to Python's focus on convenience first, not performance.

To make things faster, I'd suggest using one of the many bit-array implementations; search PyPI.org for a package that does it, no big deal.

The biggest downside of not using such an optimized array of bits is that once memory use grows far beyond the size of the CPU caches, less of the lookup table can be kept in cache, and performance degrades as the code keeps fetching parts of the table from main memory, only for them to be displaced by other memory needs.

Keeping the memory footprint small keeps the data in tight, close-to-CPU locations, keeping caches warm.
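As one concrete possibility, here is a sketch assuming the third-party bitarray package from PyPI (any packed bit-vector would do; enumerating every IP in every range is the simple-but-slow one-time pre-compute step):

from bitarray import bitarray            # pip install bitarray
from ipaddress import IPv4Address, ip_network

# 2**32 bits = 512 MiB, instead of many GB for a list of Python bools
ipv4_in_cidr_lookup = bitarray(2**32)
ipv4_in_cidr_lookup.setall(False)

for cidr in ["188.200.0.0/13", "178.224.0.0/13"]:   # your 2,000 ranges here
    for addr in ip_network(cidr):                   # every IP in the range
        ipv4_in_cidr_lookup[int(addr)] = True

print(ipv4_in_cidr_lookup[int(IPv4Address("188.200.2.2"))])   # True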
Why not fancier? binary-trees etc
I am making an assumption here: that a simple lookup table, with an IP-byte-packing-based offset computation, yields memory accesses the CPU can predict well, allowing a lot of successful instruction pipelining.

It's hard for me to explain briefly, but it's public knowledge that an algorithm that minimizes instruction count can still run slowly on CPUs that rely heavily on instruction pipelining (as modern CPUs do) if it does hard-to-predict work, flushing the pipeline whenever branch prediction fails.

In this case, a simple lookup table keyed by the bytes of the IP, in byte order, makes for an easy-to-predict memory access pattern, which I hope gives an efficient result, whereas a theoretically more optimal binary tree may thrash around memory more and be costlier in practice.
If we REALLY want to know what CIDR range is hit
Right now I assumed we just want to know who's in the naughty CIDR list, not which CIDR range is hit this time.
I am taking the bold decision that it's more important to separate concerns and do a two-pass approach, for performance reasons:
First, iterate with the above solution through the 50 GB of JSON lines, generating the list of IPs that fall in known-bad CIDR ranges.

Only then, once the lookup has been completed for all IPs, do I suggest finding which range each matching IP actually came from.

If it's performance critical, one can build a map similar to the one before, but with an integer per IP instead of a bit (sized to the number of CIDR ranges; 2,000 ranges fit in a 16-bit integer), where the value 0 means no CIDR hit, and any other value is 1 + the offset of the CIDR range that was hit. This assumes CIDR ranges don't overlap.
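A sketch of that second map, assuming NumPy for the compact integer array (any packed uint16 array would do; note the memory cost in the comment):

import numpy as np
from ipaddress import IPv4Address, ip_network

CIDR_RANGES = ["188.200.0.0/13", "178.224.0.0/13"]   # your 2,000 ranges

# One uint16 per IPv4 address: 2**32 * 2 bytes = 8 GiB, so only do this
# if that memory trade-off is acceptable. 0 = no hit, k = CIDR_RANGES[k - 1].
ip_to_cidr = np.zeros(2**32, dtype=np.uint16)
for index, cidr in enumerate(CIDR_RANGES, start=1):
    net = ip_network(cidr)
    first = int(net.network_address)
    ip_to_cidr[first:first + net.num_addresses] = index   # a CIDR is a contiguous block

hit = ip_to_cidr[int(IPv4Address("188.200.2.2"))]
if hit:
    print("matched", CIDR_RANGES[hit - 1])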
Conclusion
This post is taking an algorithmic approach to performance management, trying to trade off (plentiful) memory against (limited, to be optimized) CPU time, by making assumptions that I/O is not the bottleneck, and that CIDR ranges don't change at runtime.
Note that this lookup table solution is valid for multi-processing, as long as the different processes/threads use Copy-On-Write memory to share the (read-only) IP-to-CIDR lookup table!
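A sketch of what that sharing could look like with a process pool (build_lookup_table, extract_ip and chunked_lines are hypothetical stand-ins; assumes a fork-capable platform so children inherit the table copy-on-write):

from multiprocessing import get_context

# Built once in the parent, before the pool is created; with the "fork"
# start method the children inherit this read-only table without copying it.
ipv4_in_cidr_lookup = build_lookup_table()      # hypothetical helper from above

def worker(lines):
    # extract_ip is a hypothetical helper that slices the IP out of a line
    return [line for line in lines
            if ipv4_in_cidr_lookup[ip_offset(extract_ip(line))]]

if __name__ == "__main__":
    ctx = get_context("fork")                   # "spawn" would rebuild the table per child
    with ctx.Pool(processes=4) as pool:
        for matches in pool.imap(worker, chunked_lines):   # chunked_lines: batches of lines
            for line in matches:
                print(line)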
First and foremost, Python is not my language of choice (or in other words, I don't know it at all), so I'm just trying to add something regardless of the language.
Yes, you would benefit a lot from multithreading here; it's quite a basic case. I take it for granted that Python does nothing to complicate using multiple threads, because multithreading itself is an easy task even when using the basic OS APIs. As a general rule of thumb, unless you are dealing with a really fast computation where the additional overhead added by threading would worsen performance, there is no point in not using parallelism everywhere. It's one of the easiest things to do in computer programming, and it allows you to "choke" the hardware to the point that, no matter what, you are guaranteed to be going as fast as the PC allows.
So, one of the things happening here is that the CPU waits for your disk to provide data, and then the disk waits for the CPU to ask for something.
The first and most obvious point is that accessing the disk to get a few bytes at a time is terribly inefficient: disks (HDD or SSD, it's the same in this case) are best at sustained reading. The OS cache will help you by reading a bunch of data ahead for every request you make, but since you know you are going through the entire file, you shouldn't rely on the cache; try to be as efficient as possible in your own code.
In fact, as rolfl points out
At some point the bottleneck should be the IO speed of reading 50GB of file content
And one of the comments to his answer shows that it's exactly what is happening:
I crunched some numbers, and it's processing about 75,000 records per second now, or 190MB/s
190 MB/s is ridiculously low for a hard disk, unless it is a 5-8 year old cheap model; nowadays even an SD card can sometimes be faster than that. From a decent SSD I'd expect at least twice that speed, and today even SSDs in the $100 range can easily saturate the SATA interface.
That is, you want to read big chunks of data every time; whatever you do, don't read one line at a time. There is no magic number, but unless the computer has serious memory constraints, 100 megabytes at a time should be more than enough.
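A sketch of that kind of chunked reading in Python (the chunk size is arbitrary; the leftover handling keeps lines intact across chunk boundaries):

CHUNK_SIZE = 100 * 1024 * 1024   # ~100 MB per read

def iter_lines_in_chunks(path, chunk_size=CHUNK_SIZE):
    leftover = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            lines = chunk.split(b'\n')
            leftover = lines.pop()        # possibly incomplete last line
            for line in lines:
                yield line
    if leftover:
        yield leftover

for raw_line in iter_lines_in_chunks('results.json'):
    pass  # hand each complete line (or whole batch of lines) to a worker here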
Now, the problem is that while waiting for data the CPU sits idle doing nothing, and then, while the CPU is crunching your data, the disk sits idle waiting for something to do: needless to say, this is an opportunity for some free multithreading.
Deciding what to implement is a bit complicated without knowing the exact details and limits of the project: you can simply have a number of threads equal to the number of cores, each one working on an equal part of the file (easily doable in an hour total), or you can write a central dispatcher that reads chunks of the file, hands them to worker threads up to a certain limit (maybe with some throttling), and collects the results (which can take up to a day of work). It all depends on the money and time you have available, but, yeah, go for it.
- The default Python implementation doesn't play well with parallelism, as it uses a global lock: wiki.python.org/moin/GlobalInterpreterLock – D. Jurcau, Jun 17, 2017
Comments on the question:
- What exactly does all_matching_cidrs(output['ip'], cidr) do?
- 50 GB... of JSON... dear god...
- If json.loads() takes a significant amount of time and most lines don't match your cidr list, naively slicing for IP addresses (and processing for shorter than max length) takes about 1/20th the time of json.loads(), which you only need to do for the lines you are processing... a 95% reduction in cost for lines which are skipped. Is that significant? You won't know until you profile. – TemporalWolf