
I need to iterate through a file of words and perform tests on them.

import multiprocessing
import datetime


class line_Worker(object):
    def worker(self, chunks):
        for l in chunks:
            for num in range(100):
                print l, num
        return


if __name__ == '__main__':
    lines = [line.rstrip('\n') for line in open('words.txt')]  # read the file into a list of words
    chunkNum = len(lines) / 26
    wordLists = [lines[x:x + chunkNum] for x in range(0, len(lines), chunkNum)]  # split the word list into chunks of chunkNum words each
    jobs = []
    timeStart = str(datetime.datetime.now())
    for i in range(27):  # one process per chunk (26 full chunks plus the remainder)
        p = multiprocessing.Process(target=line_Worker().worker, args=(wordLists[i],))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()  # wait for every process to finish
    timeStop = str(datetime.datetime.now())
    print timeStart
    print timeStop

I have a problem set of 35,498,500 individual lines, and I need the full run to finish in roughly 3 minutes; the current run time is about 16 minutes.
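For scale, a 3-minute target means roughly 35,498,500 / 180 ≈ 197,000 words per second, while the current 16 minutes corresponds to about 37,000 words per second; with the 100-pass inner loop, the script as written issues on the order of 3.5 billion print calls in total.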

Is there any way to speed this up? Thanks all!

Jake Dube
asked Mar 22, 2017 at 0:19
  • If you are trying to be performant, why are you printing? Commented Mar 22, 2017 at 1:42
  • Besides the obvious performance cost of printing, you're printing from multiple processes, which means there is no way to determine the order in which output appears on screen. Is that acceptable? Commented Mar 22, 2017 at 14:00
  • Have you measured how much of those 16 minutes is spent reading in the words.txt file? Commented Mar 23, 2017 at 9:45
  • It's negligible, under 1 second. I'm going to try taking the print statements out and see what happens; I will post back. Commented Mar 23, 2017 at 15:48
  • The purpose of the script is to demonstrate what a dictionary attack would look like, as part of an internet-safety demo. Commented Mar 23, 2017 at 15:50

1 Answer


Getting rid of the print statement fixed it; the run time dropped from 16 minutes to about 4 seconds. Thanks all!

import multiprocessing
import datetime


class line_Worker(object):
    def worker(self, chunks):
        for l in chunks:
            for num in range(100):
                # print l, num  # removing this print cut the run time to about 4 seconds
                pass
        return


if __name__ == '__main__':
    lines = [line.rstrip('\n') for line in open('words.txt')]  # read the file into a list of words
    chunkNum = len(lines) / 26
    wordLists = [lines[x:x + chunkNum] for x in range(0, len(lines), chunkNum)]  # split the word list into chunks of chunkNum words each
    jobs = []
    timeStart = str(datetime.datetime.now())
    for i in range(27):  # one process per chunk (26 full chunks plus the remainder)
        p = multiprocessing.Process(target=line_Worker().worker, args=(wordLists[i],))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()  # wait for every process to finish
    timeStop = str(datetime.datetime.now())
    print timeStart
    print timeStop
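
If you need more than just dropping the prints, a multiprocessing.Pool sketch along these lines avoids both the hand-rolled chunking and the hardcoded process count. The check_word body and the chunksize value are placeholders for whatever the real per-word test is; the only thing kept from the question is the words.txt input file.

    import datetime
    import multiprocessing

    def check_word(word):
        # Placeholder for the real per-word test; mirrors the question's
        # 100-pass inner loop but does no printing.
        for num in range(100):
            pass
        return word

    if __name__ == '__main__':
        with open('words.txt') as f:
            lines = [line.rstrip('\n') for line in f]

        time_start = datetime.datetime.now()
        pool = multiprocessing.Pool()  # one worker process per CPU core by default
        # chunksize hands each worker a batch of words instead of one at a time
        results = pool.map(check_word, lines, chunksize=10000)
        pool.close()
        pool.join()
        time_stop = datetime.datetime.now()

        print(time_stop - time_start)

Pool() starts one worker per CPU core and Pool.map spreads the batches across them, so there is no range(27) that has to stay in sync with the number of chunks.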
answered Mar 28, 2017 at 22:45