Generates random entries in a particular format

Question 1

My code below generates entries for my program, but it's very slow. I'm looking to generate about 10 million entries; is there any way to speed it up?

FirstNames, LastNames, and Objects (.txt) are all files with one entry per line

TempList=[]
maxData=10000000 #The maximum amount of entries that can be produced
import random,pickle,time,math,statistics
FirstNames = './FirstNames.txt'
LastNames = './LastNames.txt'
Objects = './Objects.txt'
def rawCount(filename):
    with open(filename, 'rb') as f:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.raw.read
        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)
        return lines
def randomLine(filename):
    num = int(random.uniform(0, rawCount(filename)))
    with open(filename, 'r') as f:
        for i, line in enumerate(f, 1):
            if i == num:
                break
    return line.strip('\n')
def str_time_prop(start, end, format, prop):
    stime = time.mktime(time.strptime(start, format))
    etime = time.mktime(time.strptime(end, format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(format, time.localtime(ptime))
def random_date(start, end, prop):
    return str_time_prop(start, end, '%m/%d/%Y', prop)
def numCheck(question,low,high):
    global errorState
    errorState = True
    while errorState == True:
        checkString = input(question)
        if len(checkString) == 0:
            print("\nYou have to enter something!\n")
        elif not checkString.isdigit():
            print("\nThat's not a number!\n")
        elif not low <= int(checkString) <= high:
            print("\nThe number must be between "+str(low)+" and "+str(high)+"!\n")
        else:
            errorState = False
    return checkString
def yesNoCheck(question):
    while True:
        sel = input("> ")
        if sel.lower() == "y":
            return True
        elif sel.lower() == "n":
            return False
        else:
            print("\nPlease type either 'y' or 'n'.\n")

last_times = []
def get_remaining_time(i, total, time):
    last_times.append(time)
    len_last_t = len(last_times)
    if len_last_t > 500:
        last_times.pop(0)
    mean_t = statistics.median(last_times)
    remain_s_tot = mean_t * (int(total) - i + 1)
    remain_m = round(remain_s_tot / 60)
    remain_s = round(remain_s_tot % 60)
    #return "Time left: "+str(remain_m)+"m "+str(remain_s)+"s"
    return "Time left: "+str(remain_m)+"m "+str(remain_s)+"s."
#Ordered
MainList=[]
RaffleList=[]
TempList=[]
def addstuff():
    global TempList,MainList

    Name = str(randomLine(FirstNames)+" "+randomLine(LastNames))
    Amount = random.choice(range(1,500))
    Datehire = random_date("1/1/2008", "1/1/2030", random.random())
    Datereturn = random_date("2/1/2030", "1/1/2060", random.random())
    RandomObject = str(randomLine(Objects))
    TempList.append(Name) #Customer name
    TempList.append(str(random.choice(range(10000000,99999999)))) #Reciept number
    TempList.append(RandomObject) #Item hired
    TempList.append(str(Amount)) #Item Amount
    TempList.append(Datehire) #Date hired
    TempList.append(Datereturn) #Date returned
    TempList.append(str(math.ceil(int(Amount) / 25))) #Boxes needed
    raffle=str(random.choice(range(1,1000)))
    RaffleList.append(raffle)
    MainList+=[TempList]
    lista=TempList
    TempList=[]
    return lista,raffle
print("Random data generator\nHow many entries do you want?")
copies = numCheck("> ",1,maxData)
last_t = 0
print("Generating entries...\n")
for x in range(1,int(copies)):
    t = time.time()
    lista = addstuff()
    last_t = time.time() - t
    remain = get_remaining_time(x, copies, last_t)
    if x % 250 == 0:
        print(str(x)+")\t"+str(remain))

print("\nGeneration done.\n\nDo you want to save? (y/n)")
sel = yesNoCheck("> ")
if sel == True:
    with open('data1.dat', 'wb') as x:
        pickle.dump(MainList, x)
    with open('data2.dat', 'wb') as x:
        pickle.dump(RaffleList, x)
    print("\nSaved.")
    time.sleep(2)
else:
    print("Okay, don't know why you generated but cya!")
    time.sleep(2)

Answer 1

If you really want to know where to look at speeding things up, use a profiler. The standard library includes one (cProfile), and there are third-party profilers as well.
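As a rough illustration of the profiling step, here is a minimal sketch using the standard library's cProfile and pstats modules. The functions slow_step() and generate() are hypothetical stand-ins for the expensive call and the generation loop; they are not part of the original program.

```python
import cProfile
import io
import pstats


def slow_step():
    # Hypothetical stand-in for an expensive call such as randomLine().
    return sum(i * i for i in range(10_000))


def generate(n):
    # Hypothetical stand-in for the generation loop.
    return [slow_step() for _ in range(n)]


profiler = cProfile.Profile()
profiler.enable()
generate(50)
profiler.disable()

# Sort by cumulative time and report the most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

The same information is available without editing the script by running `python -m cProfile -s cumulative script.py`, where script.py is a placeholder for the program's file name.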

My guess is that randomLine() and rawCount() are the biggest time sinks.

rawCount() reads an entire file to count its lines. randomLine() first calls rawCount() and then reads part of the file again, on average about half of it. So to select one random line, randomLine() reads the file an average of 1.5 times and makes two calls to open(), two to close(), and at least two to read().

3 files × 6 function calls × 10 million random records = 180 million calls. That's a lot of I/O.

Instead, read a file into a list once. Then use random.choice() to pick an item. The functionality can be put into a convenient class:

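import random

class RandomLineChooser:
    """Load a file's lines once, then pick from them in memory."""
    def __init__(self, filename):
        with open(filename) as f:
            # splitlines() drops the trailing newline that readlines() would keep
            self.lines = f.read().splitlines()

    def choose(self):
        return random.choice(self.lines)

# FirstNames, LastNames and Objects are the paths defined in the question
firstnames = RandomLineChooser(FirstNames)
lastnames = RandomLineChooser(LastNames)
objects = RandomLineChooser(Objects)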
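To show how in-memory lists change the generation step, here is a hedged sketch of building one record without touching the disk. The names first_names, last_names, objects and make_entry are illustrative assumptions, the sample data stands in for the contents of the three .txt files, and the two date fields are left out to keep the example short.

```python
import math
import random

# Stand-in data (assumption); in the real program these lists would be
# loaded once from FirstNames.txt, LastNames.txt and Objects.txt.
first_names = ["Ada", "Grace", "Linus"]
last_names = ["Lovelace", "Hopper", "Torvalds"]
objects = ["Chair", "Table", "Tent"]


def make_entry(rng=random):
    """Build one record in the question's field order, minus the dates."""
    # randrange(1, 500) covers the same values as random.choice(range(1, 500)).
    amount = rng.randrange(1, 500)
    return [
        rng.choice(first_names) + " " + rng.choice(last_names),  # customer name
        str(rng.randrange(10000000, 99999999)),                  # receipt number
        rng.choice(objects),                                     # item hired
        str(amount),                                             # item amount
        str(math.ceil(amount / 25)),                             # boxes needed
    ]


entries = [make_entry() for _ in range(1000)]
print(entries[0])
```

Each call now costs a handful of in-memory operations instead of several file reads, which is where most of the saving should come from.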

I'll also point out two useful Python libraries:

Faker, which is designed to generate fake data, and
Hypothesis, which is designed for testing, but can be used to generate fake data as well.

Comment on Answer 1

Hi, thanks for your answer. Reading the file into a list sounds like it would speed up the process heaps, so I'll try that when I get home. I also appreciate the library suggestions, but I'd like to at least try and make it myself, to improve my skill in python :).

Answer 2

Numpy

Use it, or perhaps pandas, which is built on top of it. Vectorization with these libraries will get you most of the way to a performant solution. This would replace your pickle.dump, and change the internal format of MainList and RaffleList.

Divmod

Use divmod rather than separate division and modulo operations here:

remain_m = round(remain_s_tot / 60)
remain_s = round(remain_s_tot % 60)
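A minimal sketch of the divmod version follows; the function name format_remaining is an assumption for illustration. Besides being shorter, it avoids a rounding problem in the two-line form: for 100 seconds, round(100 / 60) gives 2 while 100 % 60 gives 40, so the original would report "2m 40s" instead of "1m 40s".

```python
def format_remaining(seconds_total):
    """Format a duration in seconds as minutes and seconds via one divmod call."""
    # Round to whole seconds first, then split into minutes and seconds.
    minutes, seconds = divmod(round(seconds_total), 60)
    return f"Time left: {minutes}m {seconds}s."


print(format_remaining(100))   # Time left: 1m 40s.
print(format_remaining(59.6))  # Time left: 1m 0s.
```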

Boolean selection

if sel.lower() == "y":
    return True
elif sel.lower() == "n":
    return False
else:
    print("\nPlease type either 'y' or 'n'.\n")

can be

sel = input('> ').lower()
if sel in {'y', 'n'}:
    return sel == 'y'
print("\nPlease type either 'y' or 'n'.\n")

Randomly-chosen line

randomLine does not need to iterate at all. Instead, assuming that the line lengths are (within reason) on the same order of magnitude, you can simply:

1. Get the length of the file.
2. Seek to a random position in the file.
3. Read a buffer large enough to probably contain a newline.
4. Consume to that newline.
5. Consume to the next newline, and you have your random line.
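The steps above can be sketched roughly as follows. This is one possible reading rather than a definitive implementation: the function name random_line_by_seek is an assumption, readline() is used so the file object's own buffering takes the place of a hand-managed buffer, and the file is assumed to be non-empty and UTF-8 encoded. Note that the choice is not perfectly uniform, since a line is more likely to be picked when the line before it is long.

```python
import os
import random
import tempfile


def random_line_by_seek(path, rng=random):
    """Return a pseudo-random line by seeking instead of scanning the file."""
    size = os.path.getsize(path)         # step 1: length of the file in bytes
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))      # step 2: jump to a random byte offset
        f.readline()                     # steps 3-4: skip the rest of this line
        line = f.readline()              # step 5: the next full line
        if not line:                     # landed in the last line: wrap around
            f.seek(0)
            line = f.readline()
    return line.decode("utf-8").rstrip("\r\n")


# Small demonstration on a temporary file with one entry per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as demo:
    demo.write("alpha\nbravo\ncharlie\n")
print(random_line_by_seek(demo.name))
os.unlink(demo.name)
```

For files small enough to fit in memory, loading the lines into a list once is simpler; the seek approach mainly pays off when the files are too large for that.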

Comment on Answer 2

Hi, thanks for answering. I'm not exactly sure on how to do the stuff about buffers and all that, but I'm probably just going to use the other guy's solution for putting the whole file into a list, as that seems like it would speed it up a lot.

RootTwo · Accepted Answer · 2020-08-02 01:08:16Z

