Parse through text and extract dates in uniform format

Question 1

I wrote a script to parse through text and extract all the dates. I would like it to recognize as many different ways of writing dates in text as possible while producing as few false negatives as possible. I know this is something many people already do, so I may be re-inventing the wheel here. If I am, I would like to know which tool people use, but I'd also be interested in any way to make my technique better. I'm not well-versed in regex, so there are likely optimizations I could make.

import re

test_cases = ['04/30/2009', '06/20/95', '8/2/69', '1/25/2011', '9/3/2002', '4-13-82', 'Mar-02-2009', 'Jan 20, 1974',
              'March 20, 1990', 'Dec. 21, 2001', 'May 25 2009', '01 Mar 2002', '2 April 2003', '20 Aug. 2004',
              '20 November, 1993', 'Aug 10th, 1994', 'Sept 1st, 2005', 'Feb. 22nd, 1988', 'Sept 2002', 'Sep 2002',
              'December, 1998', 'Oct. 2000', '6/2008', '12/2001', '1998', '2002']

# Create a dictionary to convert from month names to numbers (e.g. jan -> '01')
month_dict = dict(jan='01', feb='02', mar='03', apr='04', may='05', jun='06', jul='07', aug='08', sep='09',
                  oct='10', nov='11', dec='12')


def word_to_num(string):
    """
    Convert a month name to its two-digit number by lowercasing it and
    looking up its first three letters in month_dict.
    Example:
    word_to_num('January') -> '01'
    """
    s = string.lower()[:3]
    return month_dict[s]


def date_converter(line):
    """
    Extract the first date found in a line of text and return it as a
    one-element list in YYYYMMDD format. If nothing matches, the defaults
    below yield '19000101'.
    Example:
    date_converter("It was the May 1st, 2009") -> ['20090501']
    """
    results = []
    day = '01'
    month = '01'
    year = '1900'
    # If format is MM/DD/YYYY or M/D/YY or some combination
    regex = re.search(r'([0]?\d|[1][0-2])[/-]([0-3]?\d)[/-]([1-2]\d{3}|\d{2})', line)
    # If format is DD Month YYYY or D Mon YYYY or some combination (a day is required)
    month_regex = re.search(
        r'([0-3]?\d)\s*(Jan(?:uary)?(?:aury)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?'
        r'|Aug(?:ust)?|Sept?(?:ember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?(?:emeber)?)\.?,?\s([1-2]\d{3})',
        line)
    # If format is Month-DD-YYYY, Mon D YYYY, Month DDth, YYYY or some combination
    rev_month_regex = re.search(
        r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sept?(?:ember)?'
        r'|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\.?[-\s]([0-3]?\d)(?:st|nd|rd|th)?[-,\s]\s*([1-2]\d{3})',
        line)
    # If format is any combination of just Month or Mon and YY or YYYY
    no_day_regex = re.search(
        r'(Jan(?:uary)?(?:aury)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?'
        r'|Sept?(?:ember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?(?:emeber)?)\.?,?[\s]([1-2]\d{3}|\d{2})',
        line)
    # If format is MM/YYYY or M YYYY or some combination
    no_day_digits_regex = re.search(r'([0]?\d|[1][0-2])[/\s]([1-2]\d{3})', line)
    # If format only contains a year. If year is written alone it must be in form YYYY
    year_only_regex = re.search(r'([1-2]\d{3})', line)
    # The first pattern that matches wins, from most to least specific
    if regex:
        day = regex.group(2)
        month = regex.group(1)
        year = regex.group(3)
    elif month_regex:
        day = month_regex.group(1)
        month = word_to_num(month_regex.group(2))
        year = month_regex.group(3)
    elif rev_month_regex:
        day = rev_month_regex.group(2)
        month = word_to_num(rev_month_regex.group(1))
        year = rev_month_regex.group(3)
    elif no_day_regex:
        month = word_to_num(no_day_regex.group(1))
        year = no_day_regex.group(2)
    elif no_day_digits_regex:
        month = no_day_digits_regex.group(1)
        year = no_day_digits_regex.group(2)
    elif year_only_regex:
        year = year_only_regex.group(0)
    # Make sure all variables have correct number, add zeros if necessary
    month = month.zfill(2)
    day = day.zfill(2)
    if day == '00':
        day = '01'
    # Two-digit years are assumed to be in the 1900s
    if len(year) == 2:
        year = '19' + year
    results.append(year + month + day)
    return results


test_run = [date_converter(w) for w in test_cases]
print(test_run)

alecxe · Accepted Answer · 2017-08-14 03:55:06Z

I usually use the dateutil parser, which works for all your current test cases as is:

from dateutil.parser import parse

test_cases = ['04/30/2009', '06/20/95', '8/2/69', '1/25/2011', '9/3/2002', '4-13-82', 'Mar-02-2009', 'Jan 20, 1974',
              'March 20, 1990', 'Dec. 21, 2001', 'May 25 2009', '01 Mar 2002', '2 April 2003', '20 Aug. 2004',
              '20 November, 1993', 'Aug 10th, 1994', 'Sept 1st, 2005', 'Feb. 22nd, 1988', 'Sept 2002', 'Sep 2002',
              'December, 1998', 'Oct. 2000', '6/2008', '12/2001', '1998', '2002']

for date_string in test_cases:
    print(date_string, parse(date_string).strftime("%Y%m%d"))

The parser itself is really complicated: it relies on a lot of string manipulation, lookups, and regular expression techniques.
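One detail worth knowing when comparing the output with the regex version: when a string omits the day or the month, parse fills the missing fields from its default argument, which is the current date if you don't supply one. Two-digit years are also resolved relative to the current year, so they may not match the question's '19' prefix. Below is a minimal sketch, assuming python-dateutil is installed; DEFAULT and to_yyyymmdd are names made up for this example, not part of dateutil.

```python
from datetime import datetime

from dateutil.parser import parse

# Hypothetical default: a missing month or day falls back to January 1st,
# mirroring the '01' defaults in the question's date_converter.
DEFAULT = datetime(1900, 1, 1)


def to_yyyymmdd(date_string):
    """Parse one date string and format it as YYYYMMDD."""
    return parse(date_string, default=DEFAULT).strftime("%Y%m%d")


for s in ['Dec. 21, 2001', 'Sept 2002', '1998']:
    print(s, to_yyyymmdd(s))
```

With a fixed default, the result no longer depends on the day the script is run.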

And what about the "parsing through text" part? This case is simplified because the dates are already extracted in all the test cases. Try test_cases = ' '.join(test_cases) to get a good sample from which to extract all the dates.
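One possible way to approach the "parsing through text" part is to scan with re.finditer instead of re.search, so that every match is reported rather than only the first one. This is only a rough sketch, not the asker's method and not how dateutil works: MONTH, DATE_PATTERN and find_date_strings are hypothetical names, and the pattern is deliberately loose and covers just a few of the formats from the test cases. It returns the raw matched substrings, which could then be passed on to a converter.

```python
import re

# Hypothetical, deliberately loose month pattern: a three-letter prefix
# followed by any lowercase letters and an optional dot.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"

# Alternatives are tried left to right, from most to least specific.
DATE_PATTERN = re.compile(
    r"\b(?:"
    r"\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})"                       # 04/30/2009, 4-13-82
    r"|\d{1,2}\s+" + MONTH + r",?\s+\d{4}"                          # 20 Aug. 2004
    r"|" + MONTH + r"[\s-]+\d{1,2}(?:st|nd|rd|th)?[\s,-]+\d{4}"     # Aug 10th, 1994
    r"|" + MONTH + r",?\s+\d{4}"                                    # Sept 2002
    r")\b"
)


def find_date_strings(text):
    """Return every substring of text that looks like a date, in order of appearance."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


sample = "Seen on 04/30/2009, again on 20 Aug. 2004 and Aug 10th, 1994; next visit Sept 2002."
print(find_date_strings(sample))
```

Because the month pattern accepts any lowercase letters after the prefix, it will also accept words such as "Marching 2002"; tightening it is a trade-off between missed dates and false matches.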
