I have a table in pandas that has two columns, QuarterHourDimID and StartDateDimID; these columns give me an ID for each date / quarter-hour pairing. For instance, for January 1st, 2015, at 12:15PM the StartDateDimID would equal 1097 and the QuarterHourDimID would equal 26. This is how the data I'm reading is organized.

It's a large table that I'm reading using pyodbc and pandas.read_sql(), ~450M rows and ~60 columns, so performance is an issue.

To parse the QuarterHourDimID and StartDateDimID columns into workable datetime indexes, I'm running an apply function on every row to create an additional datetime column.

My code reading the table without the additional parsing takes around 800ms; however, when I run this apply function it adds around 4s to the total run time (anywhere between 5.8-6s per query is expected). The df that is returned is around ~45K rows and 5 columns (~450 days * ~100 quarter-hour parts).

I am hoping to rewrite what I've written more efficiently and would appreciate any input along the way.
Below is the code I've written thus far:
import pandas as pd
from datetime import datetime, timedelta
import pyodbc
def table(network, demo):
    connection_string = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"

    sql = """SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression] FROM TABLE_NAME
             WHERE (MarketDimID = 1
                    AND RecordTypeDimID = 2
                    AND EstimateTypeDimID = 1
                    AND DailyOrWeeklyDimID = 1
                    AND RecordSequenceCodeDimID = 5
                    AND ViewingTypeDimID = 4
                    AND NetworkDimID = {}
                    AND DemographicGroupDimID = {}
                    AND QuarterHourDimID IS NOT NULL)""".format(network, demo)

    with pyodbc.connect(connection_string) as cnxn:
        df = pd.read_sql(sql=sql, con=cnxn, index_col=None)

    def time_map(quarter_hour, date):
        if quarter_hour > 72:
            return date + timedelta(minutes=(quarter_hour % 73) * 15)
        return date + timedelta(hours=6, minutes=(quarter_hour - 1) * 15)

    map_date = {}
    init_date = datetime(year=2012, month=1, day=1)
    for x in df.StartDateDimID.unique():
        map_date[x] = init_date + timedelta(days=int(x) - 1)

    # this is the part of my code that is likely bogging things down
    df['datetime'] = df.apply(lambda row: time_map(int(row['QuarterHourDimID']),
                                                   map_date[row['StartDateDimID']]),
                              axis=1)

    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]

    df = df.set_index(['datetime'])
    return df
1 Answer
Style concerns
There are a bunch of constants defined inside the function that don't really need to be there. There is no need to (re)define your DB credentials at each call, for instance, nor the init_date. You should extract these out as the constants they are.

The same goes for time_map: nothing in it requires it to be defined inside table, so move it out too.

You should also try to shorten overly long lines and come up with better names: table doesn't convey much.
Prepared SQL
When dealing with SQL, it is generally recommended not to build the query string yourself but to let the driver do it for you. pandas lets you do this through the params parameter of pd.read_sql; you'll need to adapt the query a bit, changing the '{}' placeholders into ?.
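A minimal sketch of that change, reusing the connection string and filter columns from your question (the full WHERE clause is trimmed here for brevity):

sql = """SELECT [ID], [StartDateDimID], [QuarterHourDimID], [Impression]
         FROM TABLE_NAME
         WHERE NetworkDimID = ? AND DemographicGroupDimID = ?"""
with pyodbc.connect(connection_string) as cnxn:
    # pyodbc substitutes the "?" placeholders with the values passed in `params`
    df = pd.read_sql(sql=sql, con=cnxn, params=(network, demo))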
Pandas efficiency
When dealing with pandas, it is often more efficient to perform operations on a whole column at once; more often than not, going back and forth between the pure-Python world and the pandas one leads to exactly the kind of performance you're seeing.

pandas has its own objects for dealing with time, namely pd.Timestamp and pd.Timedelta. Converting your columns to these objects instead of datetime.datetime or datetime.timedelta should help speed up your computation; pd.to_timedelta can be quite handy for that.
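As a rough, self-contained illustration of the idea (the values below are made up to mirror QuarterHourDimID, not your actual data):

import pandas as pd

quarter_hours = pd.Series(['1', '26', '73'])   # strings, as they come back from the DB
offsets = pd.to_timedelta(quarter_hours.map(int) * 15, unit='m')
# 0   0 days 00:15:00
# 1   0 days 06:30:00
# 2   0 days 18:15:00
# dtype: timedelta64[ns]
# One vectorized call on the whole column, instead of building a
# datetime.timedelta per row inside df.apply.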
You should also try to reduce the amount of extra computation; even the tiniest operations add up. I'm talking about your offset management: why start at 2012年01月01日 and add x - 1 days? You can compute '2011-12-31' + x days instead. Same for the minutes: instead of starting at 6 o'clock and adding (x - 1) × 15 minutes, why not start at 5:45?

Unfortunately, you're dealing with strings that need converting before building timedeltas. You can take care of the conversion using df['QuarterHourDimID'].map(int), for instance; but it would be so much faster if you could extract integers directly out of your database.
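To make the offset remark concrete, here is a small sanity check, assuming the ID 1097 from your example really does mean 2015年01月01日:

from datetime import datetime, timedelta
import pandas as pd

x = 1097  # StartDateDimID from the question's example
# original anchor: 2012年01月01日 plus x - 1 days
old = datetime(2012, 1, 1) + timedelta(days=x - 1)
# shifted anchor: one day earlier, no "- 1" needed
new = pd.Timestamp('2011-12-31') + pd.Timedelta(x, unit='D')
assert old == new   # both give 2015年01月01日 00:00:00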
Proposed improvements
import pandas as pd
import pyodbc
DB_CREDENTIALS = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
SQL_QUERY = """
SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression]
FROM TABLE_NAME
WHERE (MarketDimID = 1
AND RecordTypeDimID = 2
AND EstimateTypeDimID = 1
AND DailyOrWeeklyDimID = 1
AND RecordSequenceCodeDimID = 5
AND ViewingTypeDimID = 4
AND NetworkDimID = ?
AND DemographicGroupDimID = ?
AND QuarterHourDimID IS NOT NULL
)"""
INIT_TIMESTAMP = pd.Timestamp('2011-12-31')
QUARTER_TO_SIX = pd.Timedelta(5.75, unit='h')
DAY = pd.Timedelta(1, unit='D')
def table(network, demo):
    with pyodbc.connect(DB_CREDENTIALS) as cnxn:
        df = pd.read_sql(
            sql=SQL_QUERY,
            con=cnxn,
            params=(network, demo),
            index_col=None
        )

    # This is what I understood of `time_map`; if this isn't quite right, adapt accordingly:
    # 15 minutes per quarter hour, anchored at 5:45 so no "- 1" correction is needed.
    quarter_offsets = pd.to_timedelta(df['QuarterHourDimID'].map(int), unit='m') * 15 + QUARTER_TO_SIX
    # Quarter hours past 72 wrap around to the beginning of the same day
    quarter_offsets[quarter_offsets >= DAY] -= DAY

    df['datetime'] = pd.to_timedelta(df['StartDateDimID'].map(int), unit='D') + quarter_offsets + INIT_TIMESTAMP

    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]

    df = df.set_index(['datetime'])
    return df
Comments

- Thanks for the reply Mathias, there's a lot of good information here. I'll do some testing this morning and see how some of your modifications perform. – mburke05, Mar 29, 2016 at 11:24
- @mburke05 Note that, if you can't do much about the order in which you apply your additions, the multiplication may yield faster results depending on when you apply it. You may want to compare pd.to_timedelta(df[..], unit='m') * 15 to pd.to_timedelta(df[..] * 15, unit='m'). – 301_Moved_Permanently, Mar 29, 2016 at 11:49
- I think the big bottleneck is using .apply, which is basically pure Python, as you say, mixed in with the fast C-level code that pandas benefits from. I'm going to try two tests: one just using a simple INNER JOIN on a date table that exists in the database, and another using some of the mods above. – mburke05, Mar 29, 2016 at 11:54
- I've posted a rehashed answer to the above in SQL here: stackoverflow.com/questions/36267498/… ; thanks again Mathias. – mburke05, Mar 29, 2016 at 14:27