I have a column in a pandas DataFrame containing birth dates in object/string format:
0 16MAR39
1 21JAN56
2 18NOV51
3 05MAR64
4 05JUN48
I want to convert them to a datetime type for processing. I have used
#Convert String to Datetime type
data['BIRTH'] = pd.to_datetime(data['BIRTH'])
but the result is ...
0 2039-03-16
1 2056-01-21
2 2051-11-18
3 2064-03-05
4 2048-06-05
Name: BIRTH, dtype: datetime64[ns]
Clearly the dates have the wrong century prefix ("20" instead of "19")
I handled this using ...
data['BIRTH'] = np.where(data['BIRTH'].dt.year > 2000, data['BIRTH'] - pd.offsets.DateOffset(years=100), data['BIRTH'])
Result
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: BIRTH, Length: 10302, dtype: datetime64[ns]
I am wondering:
- Is there a way to process the data that gets it right the first time?
- Is there a better way to fix the data after the incorrect conversion?
I'm an amateur coder and, as far as I understand things, pandas is optimised for processing efficiency, so I wanted to use the pandas datetime functionality for that reason. But is it better to consider NumPy's or pandas' datetime tools here? I know this dataset is small, but I am trying to improve my skills so that I know what to consider when I work on larger datasets.
2 Answers
This post on Stack Overflow explains why you are getting the wrong years.
https://stackoverflow.com/questions/37766353/pandas-to-datetime-parsing-wrong-year
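As a quick illustration of the behaviour described in that post, here is a minimal sketch using only the standard library. It assumes the strings follow the day, abbreviated month, two-digit year layout from the question (format "%d%b%y") and an English locale for the month names. The %y directive follows the POSIX convention: 00 to 68 become 2000 to 2068, and 69 to 99 become 1969 to 1999.

```python
from datetime import datetime

# Sample strings in the question's layout; the last two sit on either
# side of the point where %y switches century.
samples = ["16MAR39", "05MAR64", "01JAN68", "01JAN69"]

# Parse each string with an explicit format and record the resulting year.
years = {text: datetime.strptime(text, "%d%b%y").year for text in samples}

for text, year in years.items():
    print(text, "->", year)
# 16MAR39 -> 2039
# 05MAR64 -> 2064
# 01JAN68 -> 2068
# 01JAN69 -> 1969
```

So a birth year of 39 lands in 2039 simply because 39 is below the switch point, not because anything is wrong with the data.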
Based on your code, every date that was parsed into the 2000s is shifted back to 19XX. The only problem I can see is that if your data set includes dates across both centuries (19XX and 20XX), you'll end up forcing anything that should be 20XX to be 19XX the way you have it written. If your data set has dates in both centuries, I'd recommend preprocessing your date strings to make them unambiguous (change 16MAR39 to 16MAR1939). This will require additional information from another field in your data set, if you've got it.
To your specific questions:
- Since the data is ambiguous, there isn't a way to get it right the first time. If you preprocess the data, then it should work as you want with a single pd.to_datetime call.
- Processing the data on the front end to resolve the ambiguity (based on other information in your set) is probably a better solution than assuming you need to offset every date after 2000. For example, 1MAR05 will be read as 2005, then your code will offset it by 100 years and you'll get 1905, when maybe it should actually have been 2005.
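A minimal sketch of the preprocessing idea, assuming (and this is only an assumption for illustration) that every birth date in the column belongs to the 1900s. The column name BIRTH and the sample values come from the question; the variable with_century is a made-up name.

```python
import pandas as pd

data = pd.DataFrame(
    {"BIRTH": ["16MAR39", "21JAN56", "18NOV51", "05MAR64", "05JUN48"]}
)

# Assumption: all dates are in the 1900s, so insert "19" in front of the
# two-digit year, e.g. "16MAR39" -> "16MAR1939".
with_century = data["BIRTH"].str[:-2] + "19" + data["BIRTH"].str[-2:]

# With a four-digit year there is nothing left to guess; an explicit
# format also spares pandas from inferring the layout.
data["BIRTH"] = pd.to_datetime(with_century, format="%d%b%Y")

print(data["BIRTH"])
```

If the data spans both centuries, the fixed "19" would have to be replaced by a per-row value derived from whatever other information is available.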
Amateur coder here learning too, but I don't think there is a built-in function to make your data unambiguous or force a specific century prefix.
Since the date format itself is ambiguous, there is no way for Python to decide this automatically. You will have to do this manually.
You can do this slightly more clearly than you do now.
date_separator = pd.to_datetime("20000101")
century = pd.DateOffset(years=100)
The date_separator can be anything suitable for your dataset, or pd.Timestamp.now()
if you want to set it at the current date (the older pd.datetime alias is no longer available in current pandas versions)
# Dates after the separator were parsed into the wrong century
after_separator = data['BIRTH'] > date_separator
data.loc[after_separator, 'BIRTH'] = data.loc[after_separator, 'BIRTH'] - century
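The same idea can be wrapped into a small helper. This is only a sketch: the function name shift_future_dates and its cutoff parameter are hypothetical, and it assumes that any date after the cutoff really belongs to the previous century, which holds for birth dates when the cutoff is today.

```python
import pandas as pd


def shift_future_dates(dates: pd.Series, cutoff: pd.Timestamp) -> pd.Series:
    """Move every date after `cutoff` back by 100 years (hypothetical helper)."""
    century = pd.DateOffset(years=100)
    # Keep dates at or before the cutoff; replace the rest with date - 100 years.
    return dates.where(dates <= cutoff, dates - century)


data = pd.DataFrame(
    {"BIRTH": ["16MAR39", "21JAN56", "18NOV51", "05MAR64", "05JUN48"]}
)

# An explicit format avoids layout guessing; %y still picks the century.
parsed = pd.to_datetime(data["BIRTH"], format="%d%b%y")

# Assumption: nobody in the data was born after today.
data["BIRTH"] = shift_future_dates(parsed, pd.Timestamp.now())
print(data["BIRTH"])
```

Using today's date as the cutoff keeps a value such as 01MAR05 in 2005, whereas a fixed cutoff of 2000-01-01 would push it back to 1905.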