
I have a script that compares, in some way, each line from file1 with each line from file2, and outputs the lines if there is a difference. I want to make it faster - right now it's in Python. I could use threads, but I would like to know whether there is some easier way to improve it.

Since each test is independent, it could run in parallel - I just need to make sure that each line from file1 is compared with each line from file2.

EDIT: The bottleneck so far is the processor (the comparison itself); disk usage isn't that high, but the core running the program sits at 100%. Note that the files are "large" (e.g. over 20 MB), so I understand that it takes some time to process them.
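For reference, the script is shaped roughly like this (a simplified sketch - compare() stands in for the real, more expensive test):

import sys

def compare(line_a, line_b):
    # stand-in for the actual test
    return line_a != line_b

with open('file1') as f1, open('file2') as f2:
    for a in f1:
        f2.seek(0)                    # rewind file2 for every line of file1
        for b in f2:
            if compare(a, b):
                sys.stdout.write(a)
                sys.stdout.write(b)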

asked Jul 7, 2016 at 9:37
  • related: Evaluating concurrent application design approaches on Linux Commented Jul 7, 2016 at 9:39
  • What is your bottleneck? Are the files big, or is the comparison complex? If it's a matter of size, the disk will always be slower than the CPU no matter what you try. If the per-line calculations are complex, then you might get good results by loading the file and forking via the multiprocessing module instead of using threads. Commented Jul 7, 2016 at 12:11
  • @ArthurHavlicek I've updated the question to mention the bottleneck. Commented Jul 7, 2016 at 12:52

3 Answers


If you want real CPU parallelization then, as Mason stated, you need to get around the GIL by forking instead of using threads. This has extra overhead compared to threads, but it may work if processing time is the bottleneck.

The least hacky way to achieve this is to use multiprocessing.Pool with a variant of map. This dispatches your iterable to a pool of workers that consume the input, and the results are aggregated in your parent process.

from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    with Pool(5) as p:                # pool of 5 worker processes
        print(p.map(f, [1, 2, 3]))    # [1, 4, 9]
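Applied to the question's workload, a rough sketch could look like the following - compare() is a placeholder for the real test, and the file names and chunksize are assumptions. Each line of the first file is dispatched to a worker, which scans the whole second file for it:

from multiprocessing import Pool

def compare(line_a, line_b):
    # placeholder for the real comparison logic
    return line_a != line_b

def check_line(a):
    # each worker scans the whole second file for one line of the first;
    # re-opening file2 per line keeps the sketch simple
    hits = []
    with open('file2') as f2:
        for b in f2:
            if compare(a, b):
                hits.append((a, b))
    return hits

if __name__ == '__main__':
    with open('file1') as f1:
        lines = f1.readlines()
    with Pool() as p:                 # defaults to one worker per CPU core
        for hits in p.imap_unordered(check_line, lines, chunksize=100):
            for a, b in hits:
                print(a.rstrip(), b.rstrip())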
answered Jul 7, 2016 at 14:02

Probably not.

Python has what's known as the Global Interpreter Lock (GIL), which ensures that the interpreter is never running on more than one thread of a process at a time. This means that, unless your processing makes very heavy use of native code (such as NumPy) that spends most of its time outside the interpreter, you cannot speed it up by parallelizing it with threads.

You might be able to get some speed gains by parallelizing via multiprocessing, but that can impose significant overhead for setup and communication, so it's hard to say for sure without testing it.

answered Jul 7, 2016 at 9:54
  • Turns out I can, with the use of parallel. ;) Commented Jul 11, 2016 at 7:45

I've managed to parallelize it using GNU Parallel.

First, I had to make a few slight changes to the script - I had to make sure that only the "file" part uses seek to rewind the file pointer (you can't seek on a pipe). I also had to use this hack to get UTF-8 stdout/stdin:

reload(sys)                       # Python 2: re-expose setdefaultencoding
sys.setdefaultencoding('utf8')

After that, I was ready to call the script:

parallel --pipepart -a file_to_stdin ./myscript.py --secondfile second_file > result

(--pipepart makes the file given with -a be fed to the command's stdin in chunks, instead of passing lines from the file as arguments)

This way, the changes to the script were minimal, and concurrency was achieved.
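The relevant part of the script ends up shaped roughly like this (simplified - compare() stands in for the real test, and the --secondfile handling is shown in its most minimal form):

import sys

def compare(line_a, line_b):
    # stand-in for the actual test
    return line_a != line_b

second_path = sys.argv[sys.argv.index('--secondfile') + 1]

with open(second_path) as second:
    for a in sys.stdin:               # the chunk of the first file that parallel pipes in
        second.seek(0)                # only the real file can be rewound, not the pipe
        for b in second:
            if compare(a, b):
                sys.stdout.write(a)
                sys.stdout.write(b)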

answered Jul 11, 2016 at 7:44
  • Hm, what are you using seek for? It seems to me you are doing overly low-level stuff. Reading two files line-by-line and comparing them normally isn't all that CPU-intensive. Commented Jul 11, 2016 at 8:09
  • Oh, now I see you're not doing a simple string comparison between the lines from each file. If you're doing complex calculations, that can explain the CPU usage of course. Commented Jul 11, 2016 at 8:31
  • @RoelSchroeven I compare two files, each line with each line - in other words, I am mapping a comparator function over the Cartesian product of the lines in both files. The files are big, so for each line in one file, I go through the whole second file. Then I need to go back to the beginning, so I use seek. I don't know another way to do this in Python, and this one works (but it's ugly). Commented Jul 11, 2016 at 9:22
  • Yes, now I see. I thought you had to compare each line from the first file with the corresponding line from the second file (like a simple diff), with a total of n comparisons (assuming both files have the same number of lines). But I completely misunderstood; you need n*m comparisons. Commented Jul 11, 2016 at 9:51
