My code below generates entries for my program, but it's very slow. I'm looking to generate about 10 million entries. Is there any way to speed it up?
FirstNames, LastNames, and Objects (.txt) are all files with one entry per line.
TempList=[]
maxData=10000000 #The maximum amount of entries that can be produced
import random,pickle,time,math,statistics
FirstNames = './FirstNames.txt'
LastNames = './LastNames.txt'
Objects = './Objects.txt'
def rawCount(filename):
    with open(filename, 'rb') as f:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.raw.read
        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)
        return lines
def randomLine(filename):
    num = int(random.uniform(0, rawCount(filename)))
    with open(filename, 'r') as f:
        for i, line in enumerate(f, 1):
            if i == num:
                break
    return line.strip('\n')
def str_time_prop(start, end, format, prop):
    stime = time.mktime(time.strptime(start, format))
    etime = time.mktime(time.strptime(end, format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(format, time.localtime(ptime))
def random_date(start, end, prop):
    return str_time_prop(start, end, '%m/%d/%Y', prop)
def numCheck(question,low,high):
    global errorState
    errorState = True
    while errorState == True:
        checkString = input(question)
        if len(checkString) == 0:
            print("\nYou have to enter something!\n")
        elif not checkString.isdigit():
            print("\nThat's not a number!\n")
        elif not low <= int(checkString) <= high:
            print("\nThe number must be between "+str(low)+" and "+str(high)+"!\n")
        else:
            errorState = False
    return checkString
def yesNoCheck(question):
    while True:
        sel = input("> ")
        if sel.lower() == "y":
            return True
        elif sel.lower() == "n":
            return False
        else:
            print("\nPlease type either 'y' or 'n'.\n")
last_times = []
def get_remaining_time(i, total, time):
    last_times.append(time)
    len_last_t = len(last_times)
    if len_last_t > 500:
        last_times.pop(0)
    mean_t = statistics.median(last_times)
    remain_s_tot = mean_t * (int(total) - i + 1)
    remain_m = round(remain_s_tot / 60)
    remain_s = round(remain_s_tot % 60)
    #return "Time left: "+str(remain_m)+"m "+str(remain_s)+"s"
    return "Time left: "+str(remain_m)+"m "+str(remain_s)+"s."
#Ordered
MainList=[]
RaffleList=[]
TempList=[]
def addstuff():
    global TempList,MainList
    Name = str(randomLine(FirstNames)+" "+randomLine(LastNames))
    Amount = random.choice(range(1,500))
    Datehire = random_date("1/1/2008", "1/1/2030", random.random())
    Datereturn = random_date("2/1/2030", "1/1/2060", random.random())
    RandomObject = str(randomLine(Objects))
    TempList.append(Name) #Customer name
    TempList.append(str(random.choice(range(10000000,99999999)))) #Receipt number
    TempList.append(RandomObject) #Item hired
    TempList.append(str(Amount)) #Item Amount
    TempList.append(Datehire) #Date hired
    TempList.append(Datereturn) #Date returned
    TempList.append(str(math.ceil(int(Amount) / 25))) #Boxes needed
    raffle=str(random.choice(range(1,1000)))
    RaffleList.append(raffle)
    MainList+=[TempList]
    lista=TempList
    TempList=[]
    return lista,raffle
print("Random data generator\nHow many entries do you want?")
copies = numCheck("> ",1,maxData)
last_t = 0
print("Generating entries...\n")
for x in range(1,int(copies)):
    t = time.time()
    lista = addstuff()
    last_t = time.time() - t
    remain = get_remaining_time(x, copies, last_t)
    if x % 250 == 0:
        print(str(x)+")\t"+str(remain))
print("\nGeneration done.\n\nDo you want to save? (y/n)")
sel = yesNoCheck("> ")
if sel == True:
    with open('data1.dat', 'wb') as x:
        pickle.dump(MainList, x)
    with open('data2.dat', 'wb') as x:
        pickle.dump(RaffleList, x)
    print("\nSaved.")
    time.sleep(2)
else:
    print("Okay, don't know why you generated but cya!")
    time.sleep(2)
2 Answers
If you really want to know where to look at speeding things up, use a profiler. There is one in the standard library. There are also third party libraries.
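As a rough sketch of what that could look like with cProfile from the standard library: generate_entries below is a hypothetical stand-in for the generation loop, and profile_call is a helper name made up for this example.

```python
import cProfile
import io
import pstats
import random


def generate_entries(count):
    """Hypothetical stand-in for the generation loop being profiled."""
    return [str(random.randint(10000000, 99999999)) for _ in range(count)]


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Sort by cumulative time and list only the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, report.getvalue()


entries, report_text = profile_call(generate_entries, 10_000)
print(report_text)
```

Functions near the top of the report are where the time goes, so they are the first candidates for optimization.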
My guess is that randomLine() and rawCount() are the biggest time sinks. rawCount() reads an entire file to count its lines. randomLine() first calls rawCount() and then reads part of the file again. To randomly select a line, randomLine() reads each file an average of 1.5 times and makes two calls to open(), two to close(), and at least two to read().
(3 files)(6 function calls)(10 million random records) = 180 million calls. That's a lot of I/O.
Instead, read each file into a list once. Then use random.choice() to pick an item. The functionality can be put into a convenient class:
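import random

class RandomLineChooser:
    def __init__(self, filename):
        # Read the file once; splitlines() drops the trailing newlines.
        with open(filename) as f:
            self.lines = f.read().splitlines()

    def choose(self):
        # In-memory pick: no file I/O per call.
        return random.choice(self.lines)

# FirstNames, LastNames, and Objects are the paths defined in the question.
firstnames = RandomLineChooser(FirstNames)
lastnames = RandomLineChooser(LastNames)
objects = RandomLineChooser(Objects)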
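As a rough illustration of how the chooser might be used, the sketch below builds names the way addstuff() does. The temporary files and their contents are invented for the demo, write_sample is a hypothetical helper, and the class is restated (with trailing newlines stripped) so the block runs on its own.

```python
import os
import random
import tempfile


class RandomLineChooser:
    """Load a file's lines once, then pick from memory."""

    def __init__(self, filename):
        with open(filename) as f:
            # splitlines() drops the trailing newline from each entry.
            self.lines = f.read().splitlines()

    def choose(self):
        return random.choice(self.lines)


def write_sample(lines):
    """Write demo lines to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path


# Made-up sample data standing in for FirstNames.txt and LastNames.txt.
first_path = write_sample(["Ada", "Grace", "Linus"])
last_path = write_sample(["Lovelace", "Hopper", "Torvalds"])

firstnames = RandomLineChooser(first_path)
lastnames = RandomLineChooser(last_path)

# Each pick is now an in-memory lookup; the files are not reopened.
names = [firstnames.choose() + " " + lastnames.choose() for _ in range(5)]
print(names)

os.remove(first_path)
os.remove(last_path)
```

Each file is opened once in total instead of twice per generated entry.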
I'll also point out two useful Python libraries:
Faker, which is designed to generate fake data, and
Hypothesis, which is designed for testing, but can be used to generate fake data as well.
Hi, thanks for your answer. Reading the file into a list sounds like it would speed up the process heaps, so I'll try that when I get home. I also appreciate the library suggestions, but I'd like to at least try to make it myself, to improve my skill in Python :). (Commented by Joyte, Aug 2, 2020 at 1:17)
NumPy
Use it, or perhaps its wrapper Pandas. Vectorization with these libraries will get you most of the way to a performant solution. This would replace your pickle.dump, and change the internal format of MainList and RaffleList.
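A rough sketch of what vectorized generation could look like with NumPy follows. The function name, the column names, the seed, and the short in-line object list are illustrative assumptions. The numeric ranges mirror the question's code, and only the hire date is shown.

```python
import numpy as np


def generate_columns(count, objects, seed=None):
    """Generate `count` entries as a dict of NumPy arrays, one per column."""
    rng = np.random.default_rng(seed)

    # integers() excludes the upper bound, matching range() in the question.
    amount = rng.integers(1, 500, size=count)
    receipt = rng.integers(10_000_000, 99_999_999, size=count)
    raffle = rng.integers(1, 1000, size=count)

    # Pick every object in one call instead of one call per entry.
    item = rng.choice(np.asarray(objects), size=count)

    # Random hire dates: a start date plus a random whole number of days.
    start = np.datetime64("2008-01-01")
    span_days = int((np.datetime64("2030-01-01") - start) / np.timedelta64(1, "D"))
    hired = start + rng.integers(0, span_days, size=count).astype("timedelta64[D]")

    # Ceiling division on integers: boxes of 25 items each.
    boxes = -(-amount // 25)

    return {
        "amount": amount,
        "receipt": receipt,
        "raffle": raffle,
        "item": item,
        "hired": hired,
        "boxes": boxes,
    }


columns = generate_columns(1000, ["chair", "table", "tent"], seed=0)
print(columns["amount"][:5], columns["hired"][:5])
```

Arrays like these could then be written with np.savez rather than pickle.dump, which is one way to read the suggestion above.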
Divmod
Use divmod rather than a separate division and modulo here:
remain_m = round(remain_s_tot / 60)
remain_s = round(remain_s_tot % 60)
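A minimal sketch of the divmod version, wrapped in a hypothetical format_remaining helper so it can run on its own. It rounds the total once and then splits it, which also avoids a quirk of the two lines above: they round the minutes to the nearest whole number, so 45 seconds is reported as "1m 45s".

```python
def format_remaining(remain_s_tot):
    """Format a number of seconds as 'Time left: Xm Ys.'"""
    # divmod returns (quotient, remainder) in one call; rounding first
    # keeps both values whole numbers.
    remain_m, remain_s = divmod(round(remain_s_tot), 60)
    return "Time left: " + str(remain_m) + "m " + str(remain_s) + "s."


print(format_remaining(125))   # Time left: 2m 5s.
print(format_remaining(45))    # Time left: 0m 45s.
print(format_remaining(59.6))  # Time left: 1m 0s.
```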
Boolean selection
if sel.lower() == "y":
    return True
elif sel.lower() == "n":
    return False
else:
    print("\nPlease type either 'y' or 'n'.\n")
can be
sel = input('> ').lower()
if sel in {'y', 'n'}:
    return sel == 'y'
print("\nPlease type either 'y' or 'n'.\n")
Randomly-chosen line
randomLine does not need to iterate at all. Instead, assuming that the line lengths are (within reason) on the same order of magnitude, you can simply:
- Get the length of the file
- Seek to a random position in the file
- Read a buffer large enough to probably contain a newline
- Consume to that newline
- Consume to the next newline, and you have your random line.
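The steps above can be sketched as follows. This is one possible reading, not a definitive implementation: it leans on the file object's own buffering via readline() rather than a hand-managed buffer, and it wraps to the first line when the random position lands in the last one (an added assumption, so that every line can be picked). The function name and the demo file are made up for the example.

```python
import os
import random
import tempfile


def random_line_by_seek(filename):
    """Return an approximately random line without scanning the whole file.

    The pick is not perfectly uniform: a line is chosen when the random
    position lands in the line before it, so lines that follow long lines
    are favoured.
    """
    size = os.path.getsize(filename)
    if size == 0:
        raise ValueError("file is empty")
    with open(filename, "rb") as f:
        f.seek(random.randrange(size))  # jump to a random byte offset
        f.readline()                    # consume the rest of the partial line
        line = f.readline()             # the next complete line
        if not line:
            # The position was inside the last line: wrap to the first one.
            f.seek(0)
            line = f.readline()
    return line.decode().rstrip("\r\n")


# Demo with a small throwaway file (contents invented for the example).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as demo_file:
    demo_file.write("alpha\nbeta\ngamma\n")

picks = {random_line_by_seek(demo_path) for _ in range(200)}
print(sorted(picks))
os.remove(demo_path)
```

For files that fit comfortably in memory, loading the lines into a list once is simpler and gives a uniform pick. The seek approach mainly pays off when the files are too large for that.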
Hi, thanks for answering. I'm not exactly sure how to do the stuff about buffers and all that, but I'm probably just going to use the other guy's solution for putting the whole file into a list, as that seems like it would speed it up a lot. (Commented by Joyte, Aug 2, 2020 at 1:14)