I have a script that compares, in some way, each line from file1 and file2, and outputs the lines if there is a difference. I want to make it faster; right now it's in Python. I could use threads, but I would like to know if there is some easier way to improve it.
Since each test is independent, it could run in parallel; I just need to make sure that each line from file1 is compared with each line from file2.
EDIT: The bottleneck so far is the processor (the comparison process); the disk usage isn't that big, but the core running the program is at 100%. Note that the files are "large" (e.g. over 20 MB), so I understand that it takes some time to process them.
- related: Evaluating concurrent application design approaches on Linux – gnat, Jul 7, 2016 at 9:39
- What is your bottleneck? Are the files big, or is the comparison complex? If it's a matter of size, the disk will always be slower than the CPU no matter what you try. If the calculations on the lines are complex, then you might get good results by loading the file and forking through the multiprocessing module instead of using threads. – Diane M, Jul 7, 2016 at 12:11
- @ArthurHavlicek I've updated the question to mention the bottleneck. – MatthewRock, Jul 7, 2016 at 12:52
3 Answers
If you want real CPU parallelization, then as Mason stated you need to get around the GIL by forking instead of using threads. This has extra overhead compared to threads, but it may work if processing time is the bottleneck.
The best non-hacky way to achieve this is to use multiprocessing.Pool with one of its map variants. This dispatches your iterable to a pool of workers that consume the input and aggregate the results in your parent process.
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    # Spin up 5 worker processes; map blocks until every result is back.
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))  # [1, 4, 9]
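Applied to the question, a sketch along these lines could load the second file once in each worker and fan the first file's lines out to the pool. The comparator, the output policy (emit a line of file1 that differs from every line of file2), and the file names are illustrative assumptions, not the asker's actual logic:

from multiprocessing import Pool

_lines_b = None

def init_worker(path):
    # Runs once per worker: load file2 there instead of pickling it
    # and resending it with every task.
    global _lines_b
    with open(path) as f:
        _lines_b = f.read().splitlines()

def compare_line(line_a):
    # Hypothetical comparator: keep line_a if it differs from every line of file2.
    a = line_a.rstrip('\n')
    if all(a != b for b in _lines_b):
        return line_a
    return None

if __name__ == '__main__':
    with open('file1') as f1, Pool(initializer=init_worker, initargs=('file2',)) as pool:
        # imap preserves order and streams results; chunksize cuts IPC overhead.
        for hit in pool.imap(compare_line, f1, chunksize=256):
            if hit is not None:
                print(hit, end='')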
Probably not.
Python has what's known as the Global Interpreter Lock, which ensures that the interpreter is never running on more than one thread of a process at a time. This means that, unless your processing makes very heavy use of native code such as NumPy, which spends most of its time outside the interpreter, it is impossible to speed it up by parallelizing it across threads.
You might be able to get some speed gains by parallelizing via multiprocessing, but that can impose some heavy overhead for setup and communication, so it's hard to say for sure without testing it.
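One way to convince yourself of the GIL's effect is to time a pure-Python, CPU-bound function sequentially and on two threads; a minimal sketch (the workload is made up, and your absolute timings will differ):

import threading
import time

def burn(n):
    # Pure-Python loop: the executing thread holds the GIL throughout.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    n = 5000000

    start = time.time()
    burn(n)
    burn(n)
    print('sequential: %.2fs' % (time.time() - start))

    start = time.time()
    threads = [threading.Thread(target=burn, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Expect roughly the same wall time as the sequential run:
    # only one thread executes Python bytecode at any moment.
    print('two threads: %.2fs' % (time.time() - start))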
- Turns out I can, with the use of parallel. ;) – MatthewRock, Jul 11, 2016 at 7:45
I've managed to parallelize it using GNU Parallel.
First, I had to make a slight change to the script: I had to make sure that only the "file" part uses seek to rewind the file pointer (you can't seek on a pipe). I also had to use this hack to get UTF-8 stdout/stdin:
import sys
# Python 2 hack: reload(sys) restores setdefaultencoding, which site.py removes.
reload(sys)
sys.setdefaultencoding('utf8')
After that, I was ready to call the script:
parallel --pipepart -a file_to_stdin ./myscript.py --secondfile second_file > result
(--pipepart splits the file given with -a into chunks and feeds each chunk to the command's stdin, instead of passing lines from the file as arguments)
This way, the changes to the script were minimal, and concurrency was achieved.
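For context, a minimal sketch of how such a script could be structured; the compare function and the decision to print the stdin line on the first difference are placeholders, not the asker's real logic:

#!/usr/bin/env python
import argparse
import sys

def compare(line_a, line_b):
    # Hypothetical comparator; the real script's comparison goes here.
    return line_a != line_b

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--secondfile', required=True)
    args = parser.parse_args()

    with open(args.secondfile) as second:
        # parallel --pipepart delivers a chunk of the first file on stdin.
        # stdin is a pipe and cannot be rewound, but the second file can.
        for line_a in sys.stdin:
            second.seek(0)  # rewind before each full pass over the second file
            for line_b in second:
                if compare(line_a, line_b):
                    sys.stdout.write(line_a)
                    break

Each parallel job runs this loop over its own slice of the first file, so the jobs execute concurrently without the script itself knowing about it.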
- Hm, what are you using seek for? It kind of seems to me you are trying to do things at too low a level. Reading two files line by line and comparing them normally isn't all that CPU-intensive. – Roel Schroeven, Jul 11, 2016 at 8:09
- Oh, now I see you're not doing a simple string comparison between the lines from each file. If you're doing complex calculations, that can explain the CPU usage, of course. – Roel Schroeven, Jul 11, 2016 at 8:31
- @RoelSchroeven I compare two files, each line with each line; in other words, I map a comparator function over the Cartesian product of the lines in both files. The files are big, so for each line in one file I go through the whole second file. Then I need to go back to the beginning, so I use seek. I don't know another way to do this in Python, and this one works (but it's ugly). – MatthewRock, Jul 11, 2016 at 9:22
- Yes, now I see. I thought you had to compare each line from the first file with the corresponding line from the second file (like a simple diff), for a total of n comparisons (assuming both files have the same number of lines). But I completely misunderstood; you need n*m comparisons. – Roel Schroeven, Jul 11, 2016 at 9:51