I worked on a problem from Automate the Boring Stuff Chapter 7:
Write a regular expression that can detect dates in the DD/MM/YYYY format. Assume that the days range from 01 to 31, the months range from 01 to 12, and the years range from 1000 to 2999. Note that if the day or month is a single digit, it’ll have a leading zero.
The regular expression doesn’t have to detect correct days for each month or for leap years; it will accept nonexistent dates like 31/02/2020 or 31/04/2021. Then store these strings into variables named month, day, and year, and write additional code that can detect if it is a valid date. April, June, September, and November have 30 days, February has 28 days, and the rest of the months have 31 days. February has 29 days in leap years. Leap years are every year evenly divisible by 4, except for years evenly divisible by 100, unless the year is also evenly divisible by 400. Note how this calculation makes it impossible to make a reasonably sized regular expression that can detect a valid date.
Here's my code:
#Program that detects dates in text and copies and prints them
import pyperclip, re
#DD/MM/YEAR format
dateRegex = re.compile(r'(\d\d)/(\d\d)/(\d\d\d\d)')
#text = str(pyperclip.paste())
text = 'Hello. Your birthday is on 29/02/1990. His birthday is on 40/09/1992 and her birthday is on 09/09/2000.'
matches = []
for groups in dateRegex.findall(text):
    day = groups[0]
    month = groups[1]
    year = groups[2]
    #convert to int for comparisons
    dayNum = int(day)
    monthNum = int(month)
    yearNum = int(year)
    #check if date and month values are valid
    if dayNum <= 31 and monthNum > 0 and monthNum <= 12:
        #months with 30 days
        if month in ('04', '06', '09', '11'):
            if not (dayNum > 0 and dayNum <= 30):
                continue
        #February only
        if month == '02':
            #February doesn't have more than 29 days
            if dayNum > 29:
                continue
            if yearNum % 4 == 0:
                #leap years have 29 days in February
                if yearNum % 100 == 0 and yearNum % 400 != 0:
                    #not a leap year even if divisible by 4
                    if dayNum > 28:
                        continue
            else:
                if dayNum > 28:
                    continue
        #all other months have up to 31 days
        if month not in ('02', '04', '06', '09', '11'):
            if dayNum <= 0 and dayNum > 31:
                continue
    else:
        continue
    date = '/'.join([groups[0],groups[1],groups[2]])
    matches.append(date)
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No dates found.')
I've tested it out with various different date strings and it works as far as I can tell. I wanted to know about better ways of doing this though. As a beginner and an amateur, I understand there might be methods of writing the above code that are better and I don't mind being guided in the right direction and learning more about them. What is a better way of doing all of the above without using so many if statements?
2 Answers
You can import and use datetime rather than validating the date yourself.
>>> import datetime
>>> datetime.date(2020, 9, 9)
datetime.date(2020, 9, 9)
>>> datetime.date(1990, 2, 29)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: day is out of range for month
As such, we can replace all of your checks with a call to datetime.date in a try/except block.
for groups in dateRegex.findall(text):
    try:
        datetime.date(int(groups[2]), int(groups[1]), int(groups[0]))
        matches.append('/'.join(groups))
    except ValueError:
        pass
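If a plain yes/no answer is handier than a try/except inside the loop, the same idea can be wrapped in a small helper. This is only a sketch: is_valid_date is a hypothetical name, and the 1000 to 2999 year check is an addition made here because the exercise asks for that range (datetime.date on its own accepts years 1 through 9999).

```python
import datetime


def is_valid_date(day, month, year):
    # Hypothetical helper: True when day/month/year is a real calendar date
    # inside the exercise's 1000-2999 year range.
    if not 1000 <= year <= 2999:
        return False
    try:
        datetime.date(year, month, day)
    except ValueError:
        # datetime.date rejects impossible dates such as 29 February 1990.
        return False
    return True


print(is_valid_date(29, 2, 1990))  # False: 1990 is not a leap year
print(is_valid_date(9, 9, 2000))   # True
```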
If we switch to finditer, each result is a match object, so we can use groups[0] (the whole matched text) rather than '/'.join(groups). The indexes of the captured groups then need to be incremented by one.
for groups in dateRegex.finditer(text):
    try:
        datetime.date(int(groups[3]), int(groups[2]), int(groups[1]))
        matches.append(groups[0])
    except ValueError:
        pass
We can also build the arguments to datetime.date with a comprehension (here, a generator expression).
We need to use * to unpack the generated values into separate arguments to the function.
for group in dateRegex.finditer(text):
    try:
        datetime.date(*(int(part) for part in group.groups()[::-1]))
        matches.append(group[0])
    except ValueError:
        pass
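To see what the * is doing, here is a small standalone illustration. The hard-coded tuple stands in for group.groups(), and the sample values are made up for the example.

```python
import datetime

# (day, month, year) as strings, the same shape the regex groups come back in.
parts = ('09', '09', '2000')

# Reverse to (year, month, day), convert each piece to int, then unpack
# the three values into three positional arguments.
unpacked = datetime.date(*(int(part) for part in parts[::-1]))

# The call above is equivalent to spelling the arguments out by hand.
explicit = datetime.date(int(parts[2]), int(parts[1]), int(parts[0]))

print(unpacked == explicit)  # True
```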
You should also use an if __name__ == '__main__': guard to prevent your code from running on import.
And I'd define more functions.
import pyperclip
import datetime
import re

DATE_MATCHER = re.compile(r'(\d\d)/(\d\d)/(\d\d\d\d)')


def find_dates(text):
    for group in DATE_MATCHER.finditer(text):
        try:
            # Groups are (day, month, year); reverse them for datetime.date.
            datetime.date(*(int(part) for part in group.groups()[::-1]))
            yield group[0]
        except ValueError:
            pass


def main():
    text = 'Hello. Your birthday is on 29/02/1990. His birthday is on 40/09/1992 and her birthday is on 09/09/2000.'
    dates = list(find_dates(text))
    if not dates:
        print('No dates found.')
    else:
        pyperclip.copy('\n'.join(dates))
        print('Copied to clipboard:')
        print('\n'.join(dates))


if __name__ == "__main__":
    main()
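A related option, not used in the code above, is datetime.datetime.strptime, which parses and validates the matched text in one step, so the groups never need to be reversed or converted by hand. The following is a sketch of that variant under the same DD/MM/YYYY assumption; the pattern is written without capture groups because only the whole match is needed.

```python
import datetime
import re

DATE_MATCHER = re.compile(r'\d\d/\d\d/\d{4}')


def find_dates(text):
    for match in DATE_MATCHER.finditer(text):
        try:
            # strptime raises ValueError for impossible dates such as 29/02/1990.
            datetime.datetime.strptime(match[0], '%d/%m/%Y')
        except ValueError:
            continue
        yield match[0]


text = 'Your birthday is on 29/02/1990. His is on 40/09/1992 and hers is on 09/09/2000.'
print(list(find_dates(text)))  # ['09/09/2000']
```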
- @FMc Sorry, I'm struggling to understand you a little. "if the else-block in main() had a way of running" - the else does run so I'm guessing you meant something else? (Commented Apr 12, 2021 at 0:28)
Reading into the question somewhat, the emphasis is not only on matching a date pattern where the digits and separators are in the right places (your pattern would accept 98/76/5432, for example) but also on narrowing this, within the pattern itself, to the digits that are allowable in each position. One basic improvement is to constrain the range of the leading digits:
date_regex = re.compile(
    r'([0-3]\d)'
    r'/([01]\d)'
    r'/([12]\d{3})'
)
This is not perfect but certainly gets you to "within the neighbourhood" of a pattern that will skip invalid dates.
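A quick check of what this pattern accepts and rejects makes the remaining gap concrete. The sample strings below are chosen purely for illustration, and fullmatch is used so that each string is tested on its own.

```python
import re

date_regex = re.compile(
    r'([0-3]\d)'
    r'/([01]\d)'
    r'/([12]\d{3})'
)

samples = ['98/76/5432', '09/09/2000', '00/09/1980', '38/10/2021']
# True means the whole string matches the pattern.
results = [bool(date_regex.fullmatch(sample)) for sample in samples]

for sample, accepted in zip(samples, results):
    print(sample, accepted)
# 98/76/5432 False
# 09/09/2000 True
# 00/09/1980 True
# 38/10/2021 True
```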
As an exercise to you: how do you think you would extend the above to disallow dates like 00/09/1980 or 38/10/2021?
- (0[1-9]|[1-2]\d|3[01])/(0[1-9]|1[012])/([12]\d{3}) would match the given requirements more precisely. (jcaron, commented Apr 12, 2021 at 13:34)
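As a rough check of the pattern from that comment, the sketch below runs it over a few illustrative strings. The name strict_regex and the sample values are assumptions made for the example. Note that it still accepts 31/02/2020, so a calendar check is needed afterwards, as the exercise itself points out.

```python
import re

# Pattern copied from the comment; strict_regex is a hypothetical name.
strict_regex = re.compile(r'(0[1-9]|[1-2]\d|3[01])/(0[1-9]|1[012])/([12]\d{3})')

samples = ['00/09/1980', '38/10/2021', '31/12/2999', '31/02/2020']
# True means the whole string matches the pattern.
results = [bool(strict_regex.fullmatch(sample)) for sample in samples]

for sample, accepted in zip(samples, results):
    print(sample, accepted)
# 00/09/1980 False
# 38/10/2021 False
# 31/12/2999 True
# 31/02/2020 True
```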