I have a script that compares, in some way, each line from file1 and file2, and outputs the lines if there is a difference. I want to make it faster; right now it's in Python. I could use threads, but I would like to know if there is some easier way to improve it.
Since each test is independent, it could run in parallel; I just need to make sure that each line from file1 is compared with each line from file2.
EDIT: The bottleneck so far is the processor (the comparison process); the disk usage isn't that big, but the core running the program is at 100%. Note that the files are "large" (e.g. over 20 MB), so I understand that it takes some time to process them.
- related: Evaluating concurrent application design approaches on Linux – gnat, Jul 7, 2016 at 9:39
- What is your bottleneck? Are the files big, or is the comparison complex? If it's a matter of size, the disk will always be slower than the CPU no matter what you try. If the calculations on the lines are complex, then you might get good results by loading the file and forking through the multiprocessing module instead of using threads. – Diane M, Jul 7, 2016 at 12:11
- @ArthurHavlicek I've updated the question to mention the bottleneck. – MatthewRock, Jul 7, 2016 at 12:52
3 Answers
If you want real CPU parallelization, then as Mason stated you need to get around the GIL by forking instead of using threads. This has extra overhead compared to threads, but it may work if processing time is the bottleneck.
The best non-hacky way to achieve this is to use multiprocessing.Pool with one of its map variants. This dispatches your iterable to a pool of workers that consume the input and aggregate the results in your parent process.
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    # Spin up 5 worker processes; map blocks until every result is back.
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))  # [1, 4, 9]
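Applied to the question, a sketch along these lines could load the second file once in each worker and fan the first file's lines out to the pool. The comparator, the output policy (emit a line of file1 that differs from every line of file2), and the file names are illustrative assumptions, not the asker's actual logic:

from multiprocessing import Pool

_lines_b = None

def init_worker(path):
    # Runs once per worker: load file2 there instead of pickling it
    # and resending it with every task.
    global _lines_b
    with open(path) as f:
        _lines_b = f.read().splitlines()

def compare_line(line_a):
    # Hypothetical comparator: keep line_a if it differs from every line of file2.
    a = line_a.rstrip('\n')
    if all(a != b for b in _lines_b):
        return line_a
    return None

if __name__ == '__main__':
    with open('file1') as f1, Pool(initializer=init_worker, initargs=('file2',)) as pool:
        # imap preserves order and streams results; chunksize cuts IPC overhead.
        for hit in pool.imap(compare_line, f1, chunksize=256):
            if hit is not None:
                print(hit, end='')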
Probably not.
Python has what's known as the Global Interpreter Lock, which ensures that the interpreter is never running on more than one thread of a process at a time. This means that, unless your processing makes very heavy use of native code such as NumPy, which spends most of its time outside the interpreter, it is impossible to speed it up by parallelizing it across threads.
You might be able to get some speed gains by parallelizing via multiprocessing, but that can impose some heavy overhead for setup and communication, so it's hard to say for sure without testing it.
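One way to convince yourself of the GIL's effect is to time a pure-Python, CPU-bound function sequentially and on two threads; a minimal sketch (the workload is made up, and your absolute timings will differ):

import threading
import time

def burn(n):
    # Pure-Python loop: the executing thread holds the GIL throughout.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    n = 5000000

    start = time.time()
    burn(n)
    burn(n)
    print('sequential: %.2fs' % (time.time() - start))

    start = time.time()
    threads = [threading.Thread(target=burn, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Expect roughly the same wall time as the sequential run:
    # only one thread executes Python bytecode at any moment.
    print('two threads: %.2fs' % (time.time() - start))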
- Turns out I can, with the use of parallel. ;) – MatthewRock, Jul 11, 2016 at 7:45
I've managed to parallelize it using GNU Parallel.
First, I had to make a slight change to the script: I had to make sure that only the "file" part uses seek to rewind the file pointer (you can't seek on a pipe). I also had to use this hack to get UTF-8 stdout/stdin:
import sys
# Python 2 hack: reload(sys) restores setdefaultencoding, which site.py removes.
reload(sys)
sys.setdefaultencoding('utf8')
After that, I was ready to call the script:
parallel --pipepart -a file_to_stdin ./myscript.py --secondfile second_file > result
(--pipepart splits the file given with -a into chunks and feeds each chunk to the command's stdin, instead of passing lines from the file as arguments)
This way, the changes to the script were minimal, and concurrency was achieved.
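For context, a minimal sketch of how such a script could be structured; the compare function and the decision to print the stdin line on the first difference are placeholders, not the asker's real logic:

#!/usr/bin/env python
import argparse
import sys

def compare(line_a, line_b):
    # Hypothetical comparator; the real script's comparison goes here.
    return line_a != line_b

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--secondfile', required=True)
    args = parser.parse_args()

    with open(args.secondfile) as second:
        # parallel --pipepart delivers a chunk of the first file on stdin.
        # stdin is a pipe and cannot be rewound, but the second file can.
        for line_a in sys.stdin:
            second.seek(0)  # rewind before each full pass over the second file
            for line_b in second:
                if compare(line_a, line_b):
                    sys.stdout.write(line_a)
                    break

Each parallel job runs this loop over its own slice of the first file, so the jobs execute concurrently without the script itself knowing about it.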
- Hm, what are you using seek for? It kind of seems to me you are trying to do things at too low a level. Reading two files line by line and comparing them normally isn't all that CPU-intensive. – Roel Schroeven, Jul 11, 2016 at 8:09
- Oh, now I see you're not doing a simple string comparison between the lines from each file. If you're doing complex calculations, that can explain the CPU usage, of course. – Roel Schroeven, Jul 11, 2016 at 8:31
- @RoelSchroeven I compare two files, each line with each line; in other words, I map a comparator function over the Cartesian product of the lines in both files. The files are big, so for each line in one file I go through the whole second file. Then I need to go back to the beginning, so I use seek. I don't know another way to do this in Python, and this one works (but it's ugly). – MatthewRock, Jul 11, 2016 at 9:22
- Yes, now I see. I thought you had to compare each line from the first file with the corresponding line from the second file (like a simple diff), for a total of n comparisons (assuming both files have the same number of lines). But I completely misunderstood; you need n*m comparisons. – Roel Schroeven, Jul 11, 2016 at 9:51