I have a program that, among other things, parses some big files, and I would like to have this done in parallel to save time. The code flow looks something like this:
if __name__ == '__main__':
    obj = program_object()
    obj.do_so_some_stuff(argv)
    obj.field1 = parse_file_one(f1)
    obj.field2 = parse_file_two(f2)
    obj.do_some_more_stuff()
I tried running the file parsing methods in separate processes like this:
p_1 = multiprocessing.Process(target=parse_file_one, args=(f1,))  # args must be a tuple, hence the trailing comma
p_2 = multiprocessing.Process(target=parse_file_two, args=(f2,))
p_1.start()
p_2.start()
p_1.join()
p_2.join()
There are two problems here. One is how to have the separate processes modify the fields, but more importantly, forking the process duplicates my whole main! I get an exception regarding argv when executing
do_so_some_stuff(argv)
the second time. That really is not what I wanted. It even happened when I ran only one of the Processes.
How could I get just the file parsing methods to run in parallel to each other, and then continue back with main process like before?
- Have you not read this: docs.python.org/3.1/library/threading.html – bosnjak, Apr 2, 2014 at 11:32
- Or this: docs.python.org/2/library/multiprocessing.html – bosnjak, Apr 2, 2014 at 11:39
- Is it possible to optimize the stuff that needs to happen, besides doing it in separate threads? What is your parse_file_one(f1)? Normally, multiprocessing should be a last option, after everything else has been tried. – usethedeathstar, Apr 2, 2014 at 12:14
- Those are two XML parsing methods. Each works with a different structure of XML and then does a lot of math on the values found there. The files are about 100 MB each, so parsing them consecutively can take significant time. In the system monitor I see only one core loaded while three are almost idle. I thought that running both at the same time on separate cores should help. – Stabby, Apr 2, 2014 at 12:24
- Are you sure there isn't some different problem with the code that fails? Multiprocessing shouldn't run your main twice. – sth, Apr 2, 2014 at 14:46
2 Answers
Try putting the parsing methods in a separate module.
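For example, a minimal sketch (the module name parsers.py and the bodies of the functions are placeholders; f1 and f2 stand for the file paths from the question). With the parsers in their own module, a spawned child process can import them without re-executing anything in your main script:

# parsers.py -- only the parsing functions, no top-level work
def parse_file_one(f1):
    # ... parse the first file and return the result ...
    pass

def parse_file_two(f2):
    # ... parse the second file and return the result ...
    pass

# main.py
import multiprocessing
from parsers import parse_file_one, parse_file_two

if __name__ == '__main__':
    p_1 = multiprocessing.Process(target=parse_file_one, args=(f1,))
    p_2 = multiprocessing.Process(target=parse_file_two, args=(f2,))
    p_1.start()
    p_2.start()
    p_1.join()
    p_2.join()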
First, I guess that instead of:
obj = program_object()
program_object.do_so_some_stuff(argv)
you mean:
obj = program_object()
obj.do_so_some_stuff(argv)
Second, try using threading like this:
#!/usr/bin/python
import thread

if __name__ == '__main__':
    try:
        # the args parameter must be a tuple, hence the trailing commas
        thread.start_new_thread(parse_file_one, (f1,))
        thread.start_new_thread(parse_file_two, (f2,))
    except:
        print "Error: unable to start thread"
But, as pointed out by Wooble, depending on the implementation of your parsing functions, this might not execute truly in parallel because of the GIL.
In that case, you should check the Python multiprocessing module, which does truly concurrent execution:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.
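A minimal sketch of that approach (assuming parse_file_one and parse_file_two simply return their parsed results): a Pool hands each function's return value back to the parent, which also solves the first problem of assigning the results to the object's fields:

import multiprocessing

if __name__ == '__main__':
    obj = program_object()
    obj.do_so_some_stuff(argv)

    pool = multiprocessing.Pool(processes=2)
    # apply_async schedules each call in a worker process and returns
    # an AsyncResult immediately, so both parsers run concurrently
    r1 = pool.apply_async(parse_file_one, (f1,))
    r2 = pool.apply_async(parse_file_two, (f2,))
    pool.close()

    # get() blocks until the worker finishes and hands back the
    # function's return value
    obj.field1 = r1.get()
    obj.field2 = r2.get()
    pool.join()

    obj.do_some_more_stuff()

Note that the parsing functions still need to be importable from the main module (or live in a separate module, as the other answer suggests), or spawning the workers will fail on Windows.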