I have a table in pandas that has two columns, QuarterHourDimID and StartDateDimID; these columns give me an ID for each date / quarter-hour pairing. For instance, for January 1st, 2015, at 12:15PM the StartDateDimID would equal 1097 and the QuarterHourDimID would equal 26. This is how the data I'm reading is organized.

It's a large table that I'm reading using pyodbc and pandas.read_sql(), ~450M rows and ~60 columns, so performance is an issue.

To parse the QuarterHourDimID and StartDateDimID columns into workable datetime indexes, I'm running an apply function on every row to create an additional datetime column.

My code reading the table without the additional parsing takes around 800ms; however, when I run this apply function it adds around 4s to the total run time (anywhere between 5.8-6s per query is expected). The df that is returned is around ~45K rows and 5 columns (~450 days * ~100 quarter-hour parts).

I am hoping to rewrite what I've written more efficiently and would appreciate any input along the way.
Below is the code I've written thus far:
import pandas as pd
from datetime import datetime, timedelta
import pyodbc
def table(network, demo):
    connection_string = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"

    sql = """SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression] FROM TABLE_NAME
             WHERE (MarketDimID = 1
                    AND RecordTypeDimID = 2
                    AND EstimateTypeDimID = 1
                    AND DailyOrWeeklyDimID = 1
                    AND RecordSequenceCodeDimID = 5
                    AND ViewingTypeDimID = 4
                    AND NetworkDimID = {}
                    AND DemographicGroupDimID = {}
                    AND QuarterHourDimID IS NOT NULL)""".format(network, demo)

    with pyodbc.connect(connection_string) as cnxn:
        df = pd.read_sql(sql=sql, con=cnxn, index_col=None)

    def time_map(quarter_hour, date):
        if quarter_hour > 72:
            return date + timedelta(minutes=(quarter_hour % 73) * 15)
        return date + timedelta(hours=6, minutes=(quarter_hour - 1) * 15)

    map_date = {}
    init_date = datetime(year=2012, month=1, day=1)
    for x in df.StartDateDimID.unique():
        map_date[x] = init_date + timedelta(days=int(x) - 1)

    # this is the part of my code that is likely bogging things down
    df['datetime'] = df.apply(lambda row: time_map(int(row['QuarterHourDimID']),
                                                   map_date[row['StartDateDimID']]),
                              axis=1)

    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]

    df = df.set_index(['datetime'])
    return df
1 Answer
Style concerns
There are a bunch of constants defined inside the function that don't really need to be there. There is no need to (re)define your DB credentials at each call, for instance, nor the init_date. You should extract these out as the constants they are.

The same goes for time_map: nothing in it requires it to be defined inside table, so move it out too.

You should also try to shorten overly long lines and come up with better names: table doesn't convey much.
Prepared SQL
When dealing with SQL, it is generally recommended not to build the query string yourself but to let the driver do it for you. pandas lets you do this through the params parameter of pd.read_sql; you'll need to adapt the query a bit, changing the '{}' placeholders into ?.
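A minimal sketch of that change, reusing the connection string and filter columns from your question (the full WHERE clause is trimmed here for brevity):

sql = """SELECT [ID], [StartDateDimID], [QuarterHourDimID], [Impression]
         FROM TABLE_NAME
         WHERE NetworkDimID = ? AND DemographicGroupDimID = ?"""
with pyodbc.connect(connection_string) as cnxn:
    # pyodbc substitutes the "?" placeholders with the values passed in `params`
    df = pd.read_sql(sql=sql, con=cnxn, params=(network, demo))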
Pandas efficiency
When dealing with pandas, it is often more efficient to perform operations on a whole column at once; more often than not, going back and forth between the pure-Python world and the pandas one leads to exactly the kind of performance you're seeing.

pandas has its own objects for dealing with time, namely pd.Timestamp and pd.Timedelta. Converting your columns to these objects instead of datetime.datetime or datetime.timedelta should help speed up your computation; pd.to_timedelta can be quite handy for that.
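As a rough, self-contained illustration of the idea (the values below are made up to mirror QuarterHourDimID, not your actual data):

import pandas as pd

quarter_hours = pd.Series(['1', '26', '73'])   # strings, as they come back from the DB
offsets = pd.to_timedelta(quarter_hours.map(int) * 15, unit='m')
# 0   0 days 00:15:00
# 1   0 days 06:30:00
# 2   0 days 18:15:00
# dtype: timedelta64[ns]
# One vectorized call on the whole column, instead of building a
# datetime.timedelta per row inside df.apply.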
You should also try to reduce the amount of extra computation; even the tiniest operations add up. I'm talking about your offset management: why start at 2012年01月01日 and add x - 1 days? You can compute '2011-12-31' + x days instead. Same for the minutes: instead of starting at 6 o'clock and adding (x - 1) × 15 minutes, why not start at 5:45?

Unfortunately, you're dealing with strings that need converting before building timedeltas. You can take care of the conversion using df['QuarterHourDimID'].map(int), for instance; but it would be so much faster if you could extract integers directly out of your database.
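To make the offset remark concrete, here is a small sanity check, assuming the ID 1097 from your example really does mean 2015年01月01日:

from datetime import datetime, timedelta
import pandas as pd

x = 1097  # StartDateDimID from the question's example
# original anchor: 2012年01月01日 plus x - 1 days
old = datetime(2012, 1, 1) + timedelta(days=x - 1)
# shifted anchor: one day earlier, no "- 1" needed
new = pd.Timestamp('2011-12-31') + pd.Timedelta(x, unit='D')
assert old == new   # both give 2015年01月01日 00:00:00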
Proposed improvements
import pandas as pd
import pyodbc
DB_CREDENTIALS = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
SQL_QUERY = """
SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression]
FROM TABLE_NAME
WHERE (MarketDimID = 1
AND RecordTypeDimID = 2
AND EstimateTypeDimID = 1
AND DailyOrWeeklyDimID = 1
AND RecordSequenceCodeDimID = 5
AND ViewingTypeDimID = 4
AND NetworkDimID = ?
AND DemographicGroupDimID = ?
AND QuarterHourDimID IS NOT NULL
)"""
INIT_TIMESTAMP = pd.Timestamp('2011-12-31')
QUARTER_TO_SIX = pd.Timedelta(5.75, unit='h')
DAY = pd.Timedelta(1, unit='D')
def table(network, demo):
    with pyodbc.connect(DB_CREDENTIALS) as cnxn:
        df = pd.read_sql(
            sql=SQL_QUERY,
            con=cnxn,
            params=(network, demo),
            index_col=None
        )

    # This is what I understood of `time_map`; if this isn't quite right, adapt accordingly:
    # 15 minutes per quarter hour, anchored at 5:45 so no "- 1" correction is needed.
    quarter_offsets = pd.to_timedelta(df['QuarterHourDimID'].map(int), unit='m') * 15 + QUARTER_TO_SIX
    # Quarter hours past 72 wrap around to the beginning of the same day
    quarter_offsets[quarter_offsets >= DAY] -= DAY

    df['datetime'] = pd.to_timedelta(df['StartDateDimID'].map(int), unit='D') + quarter_offsets + INIT_TIMESTAMP

    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]

    df = df.set_index(['datetime'])
    return df
Comments

- Thanks for the reply Mathias, there's a lot of good information here. I'll do some testing this morning and see how some of your modifications perform. – mburke05, Mar 29, 2016 at 11:24
- @mburke05 Note that, if you can't do much about the order in which you apply your additions, the multiplication may yield faster results depending on when you apply it. You may want to compare pd.to_timedelta(df[..], unit='m') * 15 to pd.to_timedelta(df[..] * 15, unit='m'). – 301_Moved_Permanently, Mar 29, 2016 at 11:49
- I think the big bottleneck is using .apply, which is basically pure Python, as you say, mixed in with the fast C-level code that pandas benefits from. I'm going to try two tests: one just using a simple INNER JOIN on a date table that exists in the database, and another using some of the mods above. – mburke05, Mar 29, 2016 at 11:54
- I've posted a rehashed answer to the above in SQL here: stackoverflow.com/questions/36267498/… ; thanks again Mathias. – mburke05, Mar 29, 2016 at 14:27