What I am trying to accomplish is to stream tweets from Twitter for an hour, write the list of tweets to a file, clean and run analysis on the most recent hour of tweets, and then repeat the process indefinitely.
The problem I am running into is that if I run the cleaning and analysis of the tweets in the same script that's handling the streaming - by either hard-coding it or importing the functionality from a module - the whole script waits until these procedures are complete, and then begins again with the streaming. Is there a way to call the cleaning and analysis module within the streaming script so they run concurrently and the streaming doesn't stop while the cleaning and analysis is happening?
I've tried to achieve this by using subprocess.call('python cleaner.py', shell=True) and subprocess.Popen('python cleaner.py', shell=True), but I don't really know how to use these tools properly, and the two examples above have resulted in the streaming being stopped, cleaner.py being run, and then the streaming resumed.
1 Answer
Subprocess
You can use subprocess.Popen, as you tried, to run a different script concurrently:
import subprocess
the_other_process = subprocess.Popen(['python', 'cleaner.py'])
That line alone does what you want. What you don't want to do is:
the_other_process.communicate()
# or
the_other_process.wait()
Those would block the current process until the other one finishes, which is a very useful feature in other circumstances.
If you want to know whether the subprocess is finished (but not wait for it):
result = the_other_process.poll()
if result is not None:
    print('the other process has finished and returned %s' % result)
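As a rough sketch of how this could fit the streaming loop from the question: start the cleaner once, keep working, and check on it with poll() between chunks of work. The child command below is a stand-in (a one-line script that just sleeps), not the real cleaner.py, and sys.executable is used on the assumption that the child should run under the same interpreter as the parent.

```python
import subprocess
import sys
import time

# Stand-in for ['python', 'cleaner.py']: a child that takes a moment to finish.
child_command = [sys.executable, "-c", "import time; time.sleep(0.5)"]

cleaner = subprocess.Popen(child_command)  # returns immediately, no waiting

ticks = 0
while cleaner.poll() is None:
    # The real script would keep reading tweets from the stream here.
    ticks += 1
    time.sleep(0.1)

exit_code = cleaner.returncode  # set once poll() has seen the child exit
print("did %d rounds of other work while the cleaner ran; exit code %s" % (ticks, exit_code))
```

In the real script the loop body would be the streaming itself, and poll() would only be consulted when it matters, for example before starting the next cleaning run.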
Thread
Concurrency can also be achieved using threads. In that case, you are not running a new process, you are just splitting the current process into concurrent parts. Try this:
import threading
import time

def function_to_be_executed_concurrently():
    for i in range(5):
        time.sleep(1)
        print('running in separate thread', i)

thread = threading.Thread(target=function_to_be_executed_concurrently)
thread.start()

for i in range(5):
    time.sleep(1)
    print('running in main thread', i)
The above code should produce interleaved output from running in separate thread and running in main thread.
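Applied to the situation in the question, the same idea might look like the sketch below: each collected batch is handed to a new thread, and the loop goes straight back to collecting. The names collect_batch and clean_and_analyse are hypothetical stand-ins for the real streaming and analysis code, and the sleeps only simulate work.

```python
import threading
import time

def clean_and_analyse(batch, results):
    # Hypothetical stand-in for the real cleaning and analysis step.
    time.sleep(0.2)
    results.append(len(batch))

def collect_batch(batch_number, size=3):
    # Hypothetical stand-in for an hour of streaming.
    return ["tweet %d-%d" % (batch_number, i) for i in range(size)]

results = []
workers = []
for batch_number in range(3):
    batch = collect_batch(batch_number)
    worker = threading.Thread(target=clean_and_analyse, args=(batch, results))
    worker.start()  # returns immediately, so the loop moves on to the next batch
    workers.append(worker)

# Joining is only needed here so the example can print the final results.
for worker in workers:
    worker.join()

print(results)
```

Passing the batch as an argument, rather than having the thread read a shared variable, means the streaming loop can start filling a new list without disturbing the one being analysed.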
Thread vs process
- Using subprocess, you can run anything which could be run standalone from the shell. It does not have to be Python.
- Using threading, you can run any function in a concurrent thread of execution.
- Threads share the same memory, so it is easy to share data between them (although there are issues when synchronization is needed). With processes, sharing data can become a problem. If a lot of data has to be shared, subprocesses can be much slower.
- Starting a new process is slower and consumes more resources than starting a thread.
- Since threads run in the same process, they are bound to the same GIL. In CPython this means only one thread executes Python code at a time, so CPU-bound work effectively stays on one core. If very slow CPU-consuming tasks need to be sped up, running them in separate processes may be faster.
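To illustrate the synchronization point, here is a minimal sketch in which several threads update one shared counter. The names totals and count_tweets are made up for the example. The lock ensures that each read-modify-write of the counter completes before another thread touches it.

```python
import threading

totals = {"tweets": 0}          # shared by every thread
totals_lock = threading.Lock()  # guards updates to totals

def count_tweets(n):
    for _ in range(n):
        # Only one thread at a time may run the read-modify-write below.
        with totals_lock:
            totals["tweets"] += 1

threads = [threading.Thread(target=count_tweets, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals["tweets"])
```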
Multiprocessing
The multiprocessing module provides an interface similar to threading, but it runs separate processes instead of threads. This is useful when you need to take full advantage of all CPU cores.
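A small sketch of what that could look like, assuming the analysis can be written as a plain function that takes one batch of tweets. The function count_words is a hypothetical stand-in for the real analysis. The if __name__ == "__main__" guard matters because, on platforms that start workers by re-importing the script, unguarded code would run again in every worker.

```python
import multiprocessing

def count_words(tweets):
    # Hypothetical CPU-bound analysis step: total number of words in a batch.
    return sum(len(tweet.split()) for tweet in tweets)

if __name__ == "__main__":
    batches = [["hello world", "a b c"], ["one two three four"]]
    with multiprocessing.Pool(processes=2) as pool:
        # map_async returns at once; the worker processes do the counting.
        pending = pool.map_async(count_words, batches)
        # The main process is free to do other work here.
        print(pending.get(timeout=30))
```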
Note that subprocess.Popen(['python', 'cleaner.py']) has the same effect here as subprocess.Popen('python cleaner.py', shell=True), but the list form is the better practice to learn.
For example, if there is a space in the path, this will fail:
subprocess.Popen('python My Documents\\cleaner.py', shell=True)
It fails because the shell interprets My and Documents\cleaner.py as two separate arguments.
On the other hand, this will work as expected:
subprocess.Popen(['python', 'My Documents\\cleaner.py'])
It works because the arguments are explicitly separated by using a list.
The list form is especially useful when one of the arguments is held in a variable:
subprocess.Popen(['python', path_to_file])
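The following sketch checks that behaviour end to end. It creates a throwaway script inside a temporary folder whose name contains a space, then runs it with the list form. The folder name, the script contents and the use of sys.executable in place of 'python' are all assumptions made for the demonstration.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "My Documents")  # path with a space in it
    os.mkdir(folder)
    path_to_file = os.path.join(folder, "cleaner.py")
    with open(path_to_file, "w") as f:
        f.write("print('cleaning done')\n")

    # The path is a single list element, so the space needs no quoting.
    process = subprocess.Popen(
        [sys.executable, path_to_file],
        stdout=subprocess.PIPE,
        text=True,
    )
    # communicate() blocks; it is used here only to capture the output.
    output, _ = process.communicate()

print(output.strip(), process.returncode)
```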