
I need to iterate through a file of words and perform tests on them.

import multiprocessing
import datetime


class line_Worker(object):
    def worker(self, chunks):
        for l in chunks:
            for num in range(100):
                print l, num
        return


if __name__ == '__main__':
    lines = [line.rstrip('\n') for line in open('words.txt')]  # read the file into a list of words
    chunkNum = len(lines) / 26
    wordLists = [lines[x:x + chunkNum] for x in range(0, len(lines), chunkNum)]  # split the word list into chunks of chunkNum words each
    jobs = []
    timeStart = str(datetime.datetime.now())
    for i in range(27):  # one process per chunk (26 full chunks plus the remainder)
        p = multiprocessing.Process(target=line_Worker().worker, args=(wordLists[i],))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()  # wait for every process to finish
    timeStop = str(datetime.datetime.now())
    print timeStart
    print timeStop

I have a problem set of 35,498,500 individual lines, and I need the full run to finish in roughly 3 minutes; the current run time is about 16 minutes.
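For scale, a 3-minute target means roughly 35,498,500 / 180 ≈ 197,000 words per second, while the current 16 minutes corresponds to about 37,000 words per second; with the 100-pass inner loop, the script as written issues on the order of 3.5 billion print calls in total.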

Is there any way to speed this up? Thanks all!

Jake Dube
asked Mar 22, 2017 at 0:19
  • If you are trying to be performant, why are you printing? Commented Mar 22, 2017 at 1:42
  • Besides the obvious performance cost of printing, you're printing from multiple processes, which means there is no way to determine the order in which output appears on screen. Is that acceptable? Commented Mar 22, 2017 at 14:00
  • Have you measured how much of those 16 minutes is spent reading in the words.txt file? Commented Mar 23, 2017 at 9:45
  • It's negligible, under 1 second. I'm going to try taking the print statements out and see what happens; I will post back. Commented Mar 23, 2017 at 15:48
  • The purpose of the script is to demonstrate what a dictionary attack would look like, as part of an internet-safety demo. Commented Mar 23, 2017 at 15:50

1 Answer


Getting rid of the print statement fixed it; the run time dropped from 16 minutes to about 4 seconds. Thanks all!

import multiprocessing
import datetime


class line_Worker(object):
    def worker(self, chunks):
        for l in chunks:
            for num in range(100):
                # print l, num  # removing this print cut the run time to about 4 seconds
                pass
        return


if __name__ == '__main__':
    lines = [line.rstrip('\n') for line in open('words.txt')]  # read the file into a list of words
    chunkNum = len(lines) / 26
    wordLists = [lines[x:x + chunkNum] for x in range(0, len(lines), chunkNum)]  # split the word list into chunks of chunkNum words each
    jobs = []
    timeStart = str(datetime.datetime.now())
    for i in range(27):  # one process per chunk (26 full chunks plus the remainder)
        p = multiprocessing.Process(target=line_Worker().worker, args=(wordLists[i],))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()  # wait for every process to finish
    timeStop = str(datetime.datetime.now())
    print timeStart
    print timeStop
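
If you need more than just dropping the prints, a multiprocessing.Pool sketch along these lines avoids both the hand-rolled chunking and the hardcoded process count. The check_word body and the chunksize value are placeholders for whatever the real per-word test is; the only thing kept from the question is the words.txt input file.

    import datetime
    import multiprocessing

    def check_word(word):
        # Placeholder for the real per-word test; mirrors the question's
        # 100-pass inner loop but does no printing.
        for num in range(100):
            pass
        return word

    if __name__ == '__main__':
        with open('words.txt') as f:
            lines = [line.rstrip('\n') for line in f]

        time_start = datetime.datetime.now()
        pool = multiprocessing.Pool()  # one worker process per CPU core by default
        # chunksize hands each worker a batch of words instead of one at a time
        results = pool.map(check_word, lines, chunksize=10000)
        pool.close()
        pool.join()
        time_stop = datetime.datetime.now()

        print(time_stop - time_start)

Pool() starts one worker per CPU core and Pool.map spreads the batches across them, so there is no range(27) that has to stay in sync with the number of chunks.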
answered Mar 28, 2017 at 22:45