4
\$\begingroup\$

My code below generates entries for my program, but it's very slow. I'm looking to generate about 10 million entries. Is there any way to speed it up?

FirstNames.txt, LastNames.txt, and Objects.txt are all files with one entry per line.

TempList=[]
maxData=10000000 #The maximum amount of entries that can be produced
import random,pickle,time,math,statistics
FirstNames = './FirstNames.txt'
LastNames = './LastNames.txt'
Objects = './Objects.txt'
def rawCount(filename):
    with open(filename, 'rb') as f:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.raw.read
        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)
    return lines
def randomLine(filename):
    num = int(random.uniform(0, rawCount(filename)))
    with open(filename, 'r') as f:
        for i, line in enumerate(f, 1):
            if i == num:
                break
    return line.strip('\n')
def str_time_prop(start, end, format, prop):
    stime = time.mktime(time.strptime(start, format))
    etime = time.mktime(time.strptime(end, format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(format, time.localtime(ptime))
def random_date(start, end, prop):
    return str_time_prop(start, end, '%m/%d/%Y', prop)
def numCheck(question,low,high):
    global errorState
    errorState = True
    while errorState == True:
        checkString = input(question)
        if len(checkString) == 0:
            print("\nYou have to enter something!\n")
        elif not checkString.isdigit():
            print("\nThat's not a number!\n")
        elif not low <= int(checkString) <= high:
            print("\nThe number must be between "+str(low)+" and "+str(high)+"!\n")
        else:
            errorState = False
    return checkString
def yesNoCheck(question):
    while True:
        sel = input("> ")
        if sel.lower() == "y":
            return True
        elif sel.lower() == "n":
            return False
        else:
            print("\nPlease type either 'y' or 'n'.\n")
 
last_times = []
def get_remaining_time(i, total, time):
    last_times.append(time)
    len_last_t = len(last_times)
    if len_last_t > 500:
        last_times.pop(0)
    mean_t = statistics.median(last_times)
    remain_s_tot = mean_t * (int(total) - i + 1)
    remain_m = round(remain_s_tot / 60)
    remain_s = round(remain_s_tot % 60)
    #return "Time left: "+str(remain_m)+"m "+str(remain_s)+"s"
    return "Time left: "+str(remain_m)+"m "+str(remain_s)+"s."
#Ordered
MainList=[]
RaffleList=[]
TempList=[]
def addstuff():
    global TempList,MainList

    Name = str(randomLine(FirstNames)+" "+randomLine(LastNames))
    Amount = random.choice(range(1,500))
    Datehire = random_date("1/1/2008", "1/1/2030", random.random())
    Datereturn = random_date("2/1/2030", "1/1/2060", random.random())
    RandomObject = str(randomLine(Objects))
    TempList.append(Name) #Customer name
    TempList.append(str(random.choice(range(10000000,99999999)))) #Reciept number
    TempList.append(RandomObject) #Item hired
    TempList.append(str(Amount)) #Item Amount
    TempList.append(Datehire) #Date hired
    TempList.append(Datereturn) #Date returned
    TempList.append(str(math.ceil(int(Amount) / 25))) #Boxes needed
    raffle=str(random.choice(range(1,1000)))
    RaffleList.append(raffle)
    MainList+=[TempList]
    lista=TempList
    TempList=[]
    return lista,raffle
print("Random data generator\nHow many entries do you want?")
copies = numCheck("> ",1,maxData)
last_t = 0
print("Generating entries...\n")
for x in range(1,int(copies)):
    t = time.time()
    lista = addstuff()
    last_t = time.time() - t
    remain = get_remaining_time(x, copies, last_t)
    if x % 250 == 0:
        print(str(x)+")\t"+str(remain))
 
print("\nGeneration done.\n\nDo you want to save? (y/n)")
sel = yesNoCheck("> ")
if sel == True:
    with open('data1.dat', 'wb') as x:
        pickle.dump(MainList, x)
    with open('data2.dat', 'wb') as x:
        pickle.dump(RaffleList, x)
    print("\nSaved.")
    time.sleep(2)
else:
    print("Okay, don't know why you generated but cya!")
    time.sleep(2)
Peilonrayz
asked Jul 29, 2020 at 22:16
\$\endgroup\$

2 Answers

3
\$\begingroup\$

If you really want to know where to look at speeding things up, use a profiler. There is one in the standard library. There are also third party libraries.
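As a minimal sketch of what that could look like with the standard library's cProfile and pstats modules: the function make_entry() below is a hypothetical stand-in for the question's addstuff(), and the name list is made up for illustration.

```python
import cProfile
import io
import pstats
import random


def make_entry(names):
    """Hypothetical stand-in for the question's addstuff()."""
    return random.choice(names) + " " + str(random.randrange(1, 500))


def profile_entries(count):
    """Generate `count` entries under the profiler and return (entries, report)."""
    names = ["Ada", "Grace", "Linus"]  # made-up sample data
    profiler = cProfile.Profile()
    profiler.enable()
    entries = [make_entry(names) for _ in range(count)]
    profiler.disable()

    # Render the five most expensive functions by cumulative time as text.
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return entries, out.getvalue()


entries, report = profile_entries(1000)
print(report)
```

Profiling a few thousand entries rather than the full 10 million is usually enough to see which functions dominate the cumulative time.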

My guess is that randomLine() and rawCount() are the biggest time sinks.

rawCount() reads an entire file to count its lines. randomLine() first calls rawCount() and then reads part of the file again. So to select one random line, randomLine() reads the file an average of 1.5 times and makes two calls to open(), two to close(), and at least two to read().

(3 files) × (6 function calls) × (10 million random records) = 180 million calls. That's a lot of I/O.

Instead, read a file into a list once. Then use random.choice() to pick an item. The functionality can be put into a convenient class:

import random

class RandomLineChooser:
    def __init__(self, filename):
        with open(filename) as f:
            # splitlines() drops the trailing newline that readlines() would keep
            self.lines = f.read().splitlines()

    def choose(self):
        return random.choice(self.lines)

firstnames = RandomLineChooser(FirstNames)
lastnames = RandomLineChooser(LastNames)
objects = RandomLineChooser(Objects)
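To show how entry generation might look once the lines are held in memory, here is a rough sketch. The three lists are hypothetical stand-ins for the contents of the question's files, make_entry() is an illustrative name, and the two date fields are left out to keep it short.

```python
import math
import random

# Hypothetical in-memory stand-ins for the lines loaded from the three files.
first_names = ["Ada", "Grace"]
last_names = ["Lovelace", "Hopper"]
objects = ["Chair", "Table"]


def make_entry():
    """Build one record without touching the disk (dates omitted)."""
    amount = random.randint(1, 499)  # same values as random.choice(range(1, 500))
    return [
        random.choice(first_names) + " " + random.choice(last_names),  # customer name
        str(random.randint(10000000, 99999998)),  # receipt number, same range as the question
        random.choice(objects),                   # item hired
        str(amount),                              # item amount
        str(math.ceil(amount / 25)),              # boxes needed
    ]


print(make_entry())
```

Each call now costs a few list lookups instead of several passes over a file.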

I'll also point out two useful Python libraries:

  • Faker, which is designed to generate fake data, and

  • Hypothesis, which is designed for testing, but can be used to generate fake data as well.

answered Aug 2, 2020 at 1:08
\$\endgroup\$
1
  • \$\begingroup\$ Hi, thanks for your answer. Reading the file into a list sounds like it would speed up the process heaps, so I'll try that when I get home. I also appreciate the library suggestions, but I'd like to at least try and make it myself, to improve my skill in python :). \$\endgroup\$ Commented Aug 2, 2020 at 1:17
4
\$\begingroup\$

Numpy

Use it, or perhaps its wrapper Pandas. Vectorization with these libraries will get you most of the way to a performant solution. This would replace your pickle.dump, and change the internal format of MainList and RaffleList.
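A rough sketch of what vectorized generation could look like, assuming NumPy is installed. The function name generate_columns, the column names, and the sample inputs are illustrative assumptions, and the date columns are omitted.

```python
import numpy as np


def generate_columns(n, first_names, last_names, objects, seed=None):
    """Generate n entries as whole columns instead of one row at a time."""
    rng = np.random.default_rng(seed)
    first = rng.choice(first_names, size=n)
    last = rng.choice(last_names, size=n)
    amount = rng.integers(1, 500, size=n)  # upper bound exclusive, like range(1, 500)
    return {
        "name": [f + " " + l for f, l in zip(first, last)],
        "receipt": rng.integers(10_000_000, 99_999_999, size=n),
        "item": rng.choice(objects, size=n),
        "amount": amount,
        "boxes": -(-amount // 25),  # integer ceiling of amount / 25
        "raffle": rng.integers(1, 1000, size=n),
    }


cols = generate_columns(
    1000, ["Ada", "Grace"], ["Lovelace", "Hopper"], ["Chair", "Table"], seed=0
)
print(cols["amount"][:5], cols["boxes"][:5])
```

Columns stored this way could then be written with NumPy's or pandas' own save functions rather than pickle.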

Divmod

Use divmod rather than a separate division and modulo here:

remain_m = round(remain_s_tot / 60)
remain_s = round(remain_s_tot % 60)
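One way this might look is sketched below; format_remaining is a hypothetical helper name. Note that it rounds the total once and then takes whole minutes, whereas round(remain_s_tot / 60) in the original can round the minutes up.

```python
def format_remaining(total_seconds):
    """Format a duration in seconds as minutes and seconds using divmod."""
    # divmod returns the quotient and remainder in a single call.
    minutes, seconds = divmod(round(total_seconds), 60)
    return "Time left: " + str(minutes) + "m " + str(seconds) + "s."


print(format_remaining(125.4))
```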

Boolean selection

if sel.lower() == "y":
    return True
elif sel.lower() == "n":
    return False
else:
    print("\nPlease type either 'y' or 'n'.\n")

can be

sel = input('> ').lower()
if sel in {'y', 'n'}:
    return sel == 'y'
print("\nPlease type either 'y' or 'n'.\n")
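Put back inside the loop, the whole function might look like the sketch below. The read parameter is an assumption added here so the function can be exercised without a keyboard, and the prompt is passed through instead of being hard-coded.

```python
def yes_no_check(question, read=input):
    """Keep asking until the user answers 'y' or 'n'; return True for 'y'."""
    while True:
        sel = read(question).strip().lower()
        if sel in {"y", "n"}:
            return sel == "y"
        print("\nPlease type either 'y' or 'n'.\n")


# Example with scripted answers instead of real keyboard input.
answers = iter(["maybe", "Y"])
print(yes_no_check("> ", read=lambda prompt: next(answers)))
```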

Randomly-chosen line

randomLine does not need to iterate at all. Instead, assuming that the line lengths are (within reason) on the same order of magnitude, you can simply

  1. Get the length of the file
  2. Seek to a random position in the file
  3. Read a buffer large enough to probably contain a newline
  4. Consume to that newline
  5. Consume to the next newline, and you have your random line.
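The steps above can be sketched roughly as follows. This version leans on readline() in binary mode rather than managing a buffer by hand, assumes a non-empty UTF-8 file, and wraps around to the first line when the random position lands in the last one. The name random_line_by_seek is illustrative. The pick is only approximately uniform, since a line is more likely to be chosen when the line before it is long.

```python
import os
import random


def random_line_by_seek(path, rng=random):
    """Return a random line from `path` without scanning the whole file."""
    size = os.path.getsize(path)     # step 1: length of the file in bytes
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))  # step 2: jump to a random byte offset
        f.readline()                 # steps 3-4: discard the rest of the line we landed in
        line = f.readline()          # step 5: the next complete line
        if not line:                 # landed in the last line: wrap to the first
            f.seek(0)
            line = f.readline()
    return line.decode("utf-8").rstrip("\r\n")
```

For small files, reading everything into a list once is simpler and gives a uniform choice; seeking mainly pays off when the file is too large to hold in memory.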
answered Jul 30, 2020 at 1:34
\$\endgroup\$
1
  • \$\begingroup\$ Hi, thanks for answering. I'm not exactly sure on how to do the stuff about buffers and all that, but I'm probably just going to use the other guy's solution for putting the whole file into a list, as that seems like it would speed it up a lot. \$\endgroup\$ Commented Aug 2, 2020 at 1:14
