I wrote a program that generates some synthetic data. Everything runs fine, but the program is so slow that I am only generating 1,000 rows of data (and that still takes around 3 minutes). Ideally I would like to generate about 100,000 rows, which currently takes upwards of 10 minutes; I killed the program before it finished running.
I've narrowed down the problem to the way I am generating random dates (the last three lines below); once they finish, the rest of the program executes in a few seconds.
import numpy.random as rnd
import datetime
import pandas as pd
random_days = []
for num in range(0, n):
    random_days.append(pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date))))
What I need is, given some number n, to generate that many dates at random from a sequence of business days (the business-days restriction is important). I need to convert each value to a datetime because rnd.choice otherwise returns a numpy datetime64 object, which causes problems in other parts of the program.
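For illustration, here is a minimal standalone sketch of that type difference (the date range below is just a placeholder, not from my actual program):

import numpy.random as rnd
import pandas as pd

# Placeholder range, for illustration only.
day = rnd.choice(pd.bdate_range("2017年01月01日", "2017-12-31"))
print(type(day))                  # <class 'numpy.datetime64'>
print(type(pd.to_datetime(day)))  # pandas Timestamp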
Is there any way to improve my code to have it generate dates faster? Or do I need to settle for a small sample size?
EDIT:
Just to add a little more context:
I use this loop to generate the rest of my data:
for day in random_days:
    new_trans = one_trans(day)
    frame.append(new_trans)
frame = pd.concat(frame)
The function one_trans is the following:
def one_trans(date):
    trans = pd.Series([date.year, date.month, date.date(), fake.company(),
                       fake.company(), fake.ssn(),
                       rnd.normal(5000000, 10000),
                       random.sample(["USD", "EUR", "JPY", "BTC"], 1)[0]],
                      index=["YEAR", "MONTH", "DATE", "SENDER", "RECEIVER",
                             "TRANSACID", "AMOUNT", "CURRENCY"])
    return trans.to_frame().transpose()
EDIT 2: This is how I implemented Vogel612's suggestion:
def rng_dates(n, start_date, end_date):
    for _ in range(n):
        yield pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date)))
random_days = rng_dates(n, start_date, end_date)
for day in random_days:
    new_trans = one_trans(day)
    frame.append(new_trans)
frame = pd.concat(frame)
Comment: With Vogel612's suggestion, does it run quicker now? – Serge Stroobandt, May 31, 2018
1 Answer
One of the really easy ways to make this faster is to forgo the idea of "I generate n records and then do things with them".
Instead, think something like "I generate a record and do things with it, n times".
Python has the really handy concept of generators. Consider the following:
def rng_dates():
    while True:
        yield pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date)))
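As written, this generator never stops on its own, so the caller has to bound it. One usage sketch (assuming n, start_date, and end_date are already in scope) uses itertools.islice:

from itertools import islice

# Take only the first n dates from the otherwise endless generator.
random_days = list(islice(rng_dates(), n))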
Alternatively, to make this a little less infinity-ish, you could pass the number of records to it:
def rng_dates(n):
    # hat tip to Peilonrayz
    for _ in range(n):
        yield pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date)))
This should allow some optimizations w.r.t. memory management, cache misses, and list appending, which should translate into a pretty hefty speedup for larger sample sizes.
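Along the same lines, most of the per-iteration cost here likely comes from rebuilding the business-day index with pd.bdate_range on every single draw. A sketch of hoisting it out of the loop (assuming start_date and end_date are defined as before):

def rng_dates(n, start_date, end_date):
    # Build the business-day index once instead of once per draw.
    bdays = pd.bdate_range(start_date, end_date)
    # Draw all n dates in a single vectorized call.
    for day in rnd.choice(bdays, size=n):
        yield pd.to_datetime(day)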
Comment: Python, by choice, has no i++ operator. It'd also be more common to see for _ in range(n) rather than a while loop. – Jul 20, 2017
Comment: I added more details about my code and the way I use dates to generate the rest of my data. I'm just a little confused about how I could incorporate the function you wrote into generating my data. I feel like I might need to restructure the rest of the code (which I don't mind). – Sergei, Jul 20, 2017