# Style concerns
There are a bunch of constants that you define in the function that don't really need to be there. No need to (re)define your DB credentials at each call, for instance, nor the `init_date`. You should extract these out as the constants they are. Same thing for `time_map`: nothing in it makes it mandatory to define it inside `table`, so move it out too.

You should also try to shorten overly long lines and come up with better names: `table` doesn't convey much.
# Prepared SQL
When dealing with SQL, it is often recommended not to build your query string yourself, but to let the driver do it for you. `pandas` lets you do that through the `params` parameter of `pd.read_sql`. You'll need to adapt the query a bit, changing each `'{}'` into `?`.
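As a minimal sketch of the idea, here it is against an in-memory SQLite database (the table and values are made up for illustration; `?` also happens to be the placeholder style `pyodbc` uses):

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the real database; schema and rows are hypothetical.
cnxn = sqlite3.connect(':memory:')
cnxn.execute("CREATE TABLE ratings (NetworkDimID INTEGER, Impression REAL)")
cnxn.executemany("INSERT INTO ratings VALUES (?, ?)", [(1278, 1.5), (42, 2.0)])

# `params` fills the `?` placeholders; no string formatting involved.
df = pd.read_sql(
    "SELECT * FROM ratings WHERE NetworkDimID = ?",
    cnxn,
    params=(1278,),
)
print(len(df))  # -> 1
```

Besides being safer (no SQL injection through string interpolation), this lets the database reuse the prepared statement across calls.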
# Pandas efficiency
When dealing with `pandas`, it is often more efficient to perform operations on a whole column at once; and more often than not, going back and forth between the pure Python world and the `pandas` one will lead to the kind of performance you're seeing.
`pandas` has its own kinds of objects to deal with time, namely `pd.Timestamp` and `pd.Timedelta`. Converting your columns to these objects instead of `datetime.datetime` or `datetime.timedelta` should help speed up your computation. `pd.to_timedelta` can be quite handy for that.
You should also try to reduce the amount of extra computation; even the tiniest operations add up. I'm talking about your offset management: why start at 2012年01月01日 and add `x - 1` days? You can start at 2011年12月31日 and add `x` days instead. Same for the minutes: instead of starting at 6 o'clock and adding `(x - 1) ×ばつ 15` minutes, why not start at 5:45?
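A quick sanity check of that rebasing, with a made-up `x`: shifting the origin back one day lets you drop the `- 1` entirely.

```python
import pandas as pd

x = 5  # hypothetical StartDateDimID value

# Original formulation: start at 2012-01-01, add (x - 1) days.
old = pd.Timestamp('2012-01-01') + pd.Timedelta(x - 1, unit='D')

# Rebased formulation: start at 2011-12-31, add x days.
new = pd.Timestamp('2011-12-31') + pd.Timedelta(x, unit='D')

print(old == new)  # -> True, both land on 2012-01-05
```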
Unfortunately, you're dealing with strings that need to be converted to timedeltas. You can take care of the conversion using `df['QuarterHourDimID'].map(int)`, for instance; but it would be so much faster if you could extract integers directly out of your database.
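As a quick illustration of that conversion (sample values are made up), `map(int)` followed by `pd.to_timedelta` turns the whole column into proper timedeltas in one vectorized pass:

```python
import pandas as pd

# Hypothetical stand-in for the `QuarterHourDimID` column, stored as strings.
s = pd.Series(['1', '2', '96'])

# Convert to integers, then to 15-minute offsets past 5:45.
offsets = pd.to_timedelta(s.map(int), unit='m') * 15 + pd.Timedelta(5.75, unit='h')
print(offsets.iloc[0])  # quarter-hour 1 -> 0 days 06:00:00
```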
# Proposed improvements
```python
import pandas as pd
import pyodbc

DB_CREDENTIALS = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
SQL_QUERY = """
SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression]
FROM TABLE_NAME
WHERE (MarketDimID = 1
    AND RecordTypeDimID = 2
    AND EstimateTypeDimID = 1
    AND DailyOrWeeklyDimID = 1
    AND RecordSequenceCodeDimID = 5
    AND ViewingTypeDimID = 4
    AND NetworkDimID = ?
    AND DemographicGroupDimID = ?
    AND QuarterHourDimID IS NOT NULL
)"""
INIT_TIMESTAMP = pd.Timestamp('2011-12-31')
QUARTER_TO_SIX = pd.Timedelta(5.75, unit='h')  # 5:45, i.e. a quarter to six
DAY = pd.Timedelta(1, unit='D')


def table(network, demo):
    with pyodbc.connect(DB_CREDENTIALS) as cnxn:
        df = pd.read_sql(
            sql=SQL_QUERY,
            con=cnxn,
            params=(network, demo),
            index_col=None,
        )
        quarters = pd.to_timedelta(df['QuarterHourDimID'].map(int), unit='m') * 15 + QUARTER_TO_SIX
        # This is what I understood of `time_map`; if this isn't quite right, adapt accordingly
        quarters[quarters >= DAY] -= DAY
        df['datetime'] = pd.to_timedelta(df['StartDateDimID'].map(int), unit='D') + quarters + INIT_TIMESTAMP
        if network == 1278:
            df = df.loc[df.groupby('datetime')['Impression'].idxmin()]
        df = df.set_index(['datetime'])
        return df
```