I wrote a program that generates some synthetic data. Everything runs fine, but the program is so slow that I am only generating 1,000 rows of data (and that still takes around 3 minutes). Ideally I would like to generate about 100,000 rows, which currently takes upwards of 10 minutes; I killed the program before it finished running.
I've narrowed down the problem to the way I am generating random dates (the last three lines below); once they finish, the rest of the program executes in a few seconds.
import numpy.random as rnd
import datetime
import pandas as pd
random_days = []
for num in range(0, n):
    random_days.append(pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date))))
What I need is, given some number n, to generate that many dates at random from a sequence of business days (the business-days restriction is important). I need to convert each value to a datetime because rnd.choice otherwise returns a numpy datetime64 object, which causes problems in other parts of the program.
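For illustration, here is a minimal standalone sketch of that type difference (the date range below is just a placeholder, not from my actual program):

import numpy.random as rnd
import pandas as pd

# Placeholder range, for illustration only.
day = rnd.choice(pd.bdate_range("2017年01月01日", "2017-12-31"))
print(type(day))                  # <class 'numpy.datetime64'>
print(type(pd.to_datetime(day)))  # pandas Timestamp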
Is there any way to improve my code to have it generate dates faster? Or do I need to settle for a small sample size?
EDIT:
Just to add a little more context:
I use this loop to generate the rest of my data:
for day in random_days:
    new_trans = one_trans(day)
    frame.append(new_trans)
frame = pd.concat(frame)
The function one_trans is the following:
def one_trans(date):
    trans = pd.Series([date.year, date.month, date.date(), fake.company(),
                       fake.company(), fake.ssn(),
                       rnd.normal(5000000, 10000),
                       random.sample(["USD", "EUR", "JPY", "BTC"], 1)[0]],
                      index=["YEAR", "MONTH", "DATE", "SENDER", "RECEIVER",
                             "TRANSACID", "AMOUNT", "CURRENCY"])
    return trans.to_frame().transpose()
EDIT 2: This is how I implemented Vogel612's suggestion:
def rng_dates(n, start_date, end_date):
    for _ in range(n):
        yield pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date)))
random_days = rng_dates(n, start_date, end_date)
for day in random_days:
    new_trans = one_trans(day)
    frame.append(new_trans)
frame = pd.concat(frame)
Comment: With Vogel612's suggestion, does it run quicker now? – Serge Stroobandt, May 31, 2018
1 Answer
One of the really easy ways to make this faster is to forgo the idea of "I generate n records and then do things with them".
Instead, think something like "I generate a record and do things with it, n times".
Python has the really handy concept of generators. Consider the following:
def rng_dates():
    while True:
        yield pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date)))
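As written, this generator never stops on its own, so the caller has to bound it. One usage sketch (assuming n, start_date, and end_date are already in scope) uses itertools.islice:

from itertools import islice

# Take only the first n dates from the otherwise endless generator.
random_days = list(islice(rng_dates(), n))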
Alternatively, to make this a little less infinity-ish, you could pass the number of records to it:
def rng_dates(n):
    # hat tip to Peilonrayz
    for _ in range(n):
        yield pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date)))
This should allow some optimizations w.r.t. memory management, cache misses, and list appending, which should translate into a pretty hefty speedup for larger sample sizes.
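Along the same lines, most of the per-iteration cost here likely comes from rebuilding the business-day index with pd.bdate_range on every single draw. A sketch of hoisting it out of the loop (assuming start_date and end_date are defined as before):

def rng_dates(n, start_date, end_date):
    # Build the business-day index once instead of once per draw.
    bdays = pd.bdate_range(start_date, end_date)
    # Draw all n dates in a single vectorized call.
    for day in rnd.choice(bdays, size=n):
        yield pd.to_datetime(day)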
Comment: Python, by choice, has no i++ operator. It'd also be more common to see for _ in range(n) rather than a while loop. – Jul 20, 2017
Comment: I added more details about my code and the way I use dates to generate the rest of my data. I'm just a little confused about how I could incorporate the function you wrote into generating my data. I feel like I might need to restructure the rest of the code (which I don't mind). – Sergei, Jul 20, 2017