# Style concerns
There are a bunch of constants that you define in the function that don't really need to be there. No need to (re)define your DB credentials at each call, for instance, nor the `init_date`. You should extract these out as the constants they are. Same thing for `time_map`: nothing in it makes it mandatory to define it inside `table`, so move it out too.

You should also try to shorten overly long lines and come up with better names: `table` doesn't convey much.
# Prepared SQL
When dealing with SQL, it is often recommended not to build your query string yourself, but to let the driver do it for you. `pandas` lets you do that through the `params` parameter of `pd.read_sql`. You'll need to adapt the query a bit, changing each `'{}'` into `?`.
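As a minimal sketch of the idea, here it is against an in-memory SQLite database (the table and values are made up for illustration; `?` also happens to be the placeholder style `pyodbc` uses):

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the real database; schema and rows are hypothetical.
cnxn = sqlite3.connect(':memory:')
cnxn.execute("CREATE TABLE ratings (NetworkDimID INTEGER, Impression REAL)")
cnxn.executemany("INSERT INTO ratings VALUES (?, ?)", [(1278, 1.5), (42, 2.0)])

# `params` fills the `?` placeholders; no string formatting involved.
df = pd.read_sql(
    "SELECT * FROM ratings WHERE NetworkDimID = ?",
    cnxn,
    params=(1278,),
)
print(len(df))  # -> 1
```

Besides being safer (no SQL injection through string interpolation), this lets the database reuse the prepared statement across calls.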
# Pandas efficiency
When dealing with `pandas`, it is often more efficient to perform operations on a whole column at once; and more often than not, going back and forth between the pure Python world and the `pandas` one will lead to the kind of performance you're seeing.
`pandas` has its own kinds of objects to deal with time, namely `pd.Timestamp` and `pd.Timedelta`. Converting your columns to these objects instead of `datetime.datetime` or `datetime.timedelta` should help speed up your computation. `pd.to_timedelta` can be quite handy for that.
You should also try to reduce the amount of extra computation; even the tiniest operations add up. I'm talking about your offset management: why start at 2012年01月01日 and add `x - 1` days? You can start at 2011年12月31日 and add `x` days instead. Same for the minutes: instead of starting at 6 o'clock and adding `(x - 1) ×ばつ 15` minutes, why not start at 5:45?
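A quick sanity check of that rebasing, with a made-up `x`: shifting the origin back one day lets you drop the `- 1` entirely.

```python
import pandas as pd

x = 5  # hypothetical StartDateDimID value

# Original formulation: start at 2012-01-01, add (x - 1) days.
old = pd.Timestamp('2012-01-01') + pd.Timedelta(x - 1, unit='D')

# Rebased formulation: start at 2011-12-31, add x days.
new = pd.Timestamp('2011-12-31') + pd.Timedelta(x, unit='D')

print(old == new)  # -> True, both land on 2012-01-05
```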
Unfortunately, you're dealing with strings that need to be converted to timedeltas. You can take care of the conversion using `df['QuarterHourDimID'].map(int)`, for instance; but it would be so much faster if you could extract integers directly out of your database.
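As a quick illustration of that conversion (sample values are made up), `map(int)` followed by `pd.to_timedelta` turns the whole column into proper timedeltas in one vectorized pass:

```python
import pandas as pd

# Hypothetical stand-in for the `QuarterHourDimID` column, stored as strings.
s = pd.Series(['1', '2', '96'])

# Convert to integers, then to 15-minute offsets past 5:45.
offsets = pd.to_timedelta(s.map(int), unit='m') * 15 + pd.Timedelta(5.75, unit='h')
print(offsets.iloc[0])  # quarter-hour 1 -> 0 days 06:00:00
```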
# Proposed improvements
```python
import pandas as pd
import pyodbc

DB_CREDENTIALS = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
SQL_QUERY = """
SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression]
FROM TABLE_NAME
WHERE (MarketDimID = 1
    AND RecordTypeDimID = 2
    AND EstimateTypeDimID = 1
    AND DailyOrWeeklyDimID = 1
    AND RecordSequenceCodeDimID = 5
    AND ViewingTypeDimID = 4
    AND NetworkDimID = ?
    AND DemographicGroupDimID = ?
    AND QuarterHourDimID IS NOT NULL
)"""
INIT_TIMESTAMP = pd.Timestamp('2011-12-31')
QUARTER_TO_SIX = pd.Timedelta(5.75, unit='h')  # 5:45, i.e. a quarter to six
DAY = pd.Timedelta(1, unit='D')


def table(network, demo):
    with pyodbc.connect(DB_CREDENTIALS) as cnxn:
        df = pd.read_sql(
            sql=SQL_QUERY,
            con=cnxn,
            params=(network, demo),
            index_col=None,
        )
        quarters = pd.to_timedelta(df['QuarterHourDimID'].map(int), unit='m') * 15 + QUARTER_TO_SIX
        # This is what I understood of `time_map`; if this isn't quite right, adapt accordingly
        quarters[quarters >= DAY] -= DAY
        df['datetime'] = pd.to_timedelta(df['StartDateDimID'].map(int), unit='D') + quarters + INIT_TIMESTAMP
        if network == 1278:
            df = df.loc[df.groupby('datetime')['Impression'].idxmin()]
        df = df.set_index(['datetime'])
        return df
```