Given the following dataframe:
| ID | date |
|---------------------|--------------------------------|
| 1 | 2022年02月03日 22:01:12+01:00 |
| 2 | 2022年02月04日 21:11:21+01:00 |
| 3 | 2022年02月05日 11:11:21+01:00 |
| 4 | 2022年02月07日 23:01:12+01:00 |
| 5 | 2022年02月07日 14:31:14+02:00 |
| 6 | 2022年02月08日 18:12:01+02:00 |
| 7 | 2022年02月09日 20:21:02+02:00 |
| 8 | 2022年02月11日 15:41:25+02:00 |
| 9 | 2022年02月15日 11:21:27+02:00 |
I would like to optimise the code of the following function:
Returns all rows whose date (YYYY-MM-DD) matches the date passed by parameter.
# Function to search by exact %Y-%m-%d date
def select_exact_date(df, date):
date = pd.to_datetime(date, format='%Y-%m-%d %H:%M:%S')
date_dt = date.date()
# Create a copy of the original df
subset = df.copy()
# Creates a temporary column to store the values related to the specific date
subset['tmp_date'] = subset['date'].apply(lambda a: a.date())
mask = (subset['tmp_date'] == date_dt)
df = df.loc[mask]
return df
Result:
select_exact_date(df, '2022-02-07')
#Output
| ID | date |
|---------------------|--------------------------------|
| 4 | 2022年02月07日 23:01:12+01:00 |
| 5 | 2022年02月07日 14:31:14+02:00 |
I hope you can help me. Thank you in advance.
1 Answer 1
That doesn't matter
Context does matter. A strict interpretation of the site guidelines would consider your question off-topic due to insufficient context, since code can't be reviewed in a vacuum (a black box, as you put it). If you think context doesn't matter in a review, then you might as well say that reviewing the code doesn't matter, either.
In this case, it does matter where your date
comes from, because you've failed to type-hint it so it could be anything that pd.to_datetime
accepts:
arg: int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like
I'm going to make a sweeping assumption: that your date
parameter is a datetime.date
object, as it should be.
It seems that you've left the date
column full of datetime.datetime
timezone-aware objects. Working with Pandas will be made more difficult than it needs to be using this approach. Consider instead coercing the date column to an actual Pandas UTC datetime, avoiding apply()
like the plague:
from datetime import date
from io import StringIO
import pandas as pd
# ...
f = StringIO(
'''ID,date
1,2022年02月03日 22:01:12+01:00
2,2022年02月04日 21:11:21+01:00
3,2022年02月05日 11:11:21+01:00
4,2022年02月07日 23:01:12+01:00
5,2022年02月07日 14:31:14+02:00
6,2022年02月08日 18:12:01+02:00
7,2022年02月09日 20:21:02+02:00
8,2022年02月11日 15:41:25+02:00
9,2022年02月15日 11:21:27+02:00'''
)
df = pd.read_csv(
f,
parse_dates=['date'],
index_col='ID',
)
date_to_select = date(2022, 2, 7)
utc_dates = pd.to_datetime(df.date, utc=True).dt.date
matching_rows = df[utc_dates == date_to_select]
If it matters that the exact local timezone's day boundaries be respected you'll need to get more creative.
date
string parameter actually come from? I ask because if it's far-removed from user input or file content, it should have been converted to a datetime object earlier than where you show it. \$\endgroup\$input()
five functions ago? Read from a file into a different dataframe? Etc. \$\endgroup\$