I want to extract the min/max salary range from text that contains either an hourly or an annual salary.
import re

# either of the following inputs should work
input1 = "$80,000 - $90,000 per annum"
input2 = "$20 - $24.99 per hour"

salary_text = re.findall("[\$0-9,\. ]*-[\$0-9,\. ]*", input1)
if salary_text and salary_text[0]:
    range_list = re.split("-", salary_text[0])
    if range_list and len(range_list) == 2:
        low = range_list[0].strip(' $').replace(',', '')
        high = range_list[1].strip(' $').replace(',', '')
The , commas are a nice twist. I feel you are stripping them a bit late, as they don't really contribute to the desired solution. Better to lose them from the get-go.
Calling .findall seems to be overkill for your problem specification; likely .search would suffice.
salary_text = re.findall("[\$0-9,\. ]*-[\$0-9,\. ]*", input1)
The dollar signs, similarly, do not contribute to the solution; your regex could probably just ignore them if your inputs are fairly sane. Or you could even scan for lines starting with a $ dollar sign, and then have the regex ignore them.
range_list = re.split("-", salary_text[0])
There is no need for this .split; the regex could have done this for you already.
Here is what I recommend:
import re

def find_range(text):
    if text.startswith('$'):
        m = re.search(r'([\d\.]+) *- *\$?([\d\.]+)', text.replace(",", ""))
        if m:
            lo, hi = m.groups()
            return float(lo), float(hi)
    return None, None

print(find_range('$80,000 - $90,000 per annum'))
print(find_range('$20 - $24.99 per hour'))
The raw string regex r'[\d\.]+' picks out one or more numeric characters, which can include a decimal point. Putting ( ) parens around a regex makes it a capturing group; we have two such groups here. Finally, *- *\$? lets us skip a single - dash with optional whitespace on either side and at most one $ dollar sign.
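To make the group mechanics concrete, here is a minimal sketch that applies the same pattern to one input and prints what each capturing group holds. The variable names are illustrative and not part of the answer's code.

```python
import re

# Same pattern as in find_range above; the names below are illustrative.
pattern = re.compile(r'([\d\.]+) *- *\$?([\d\.]+)')

# Commas are dropped up front, as recommended above.
text = '$20 - $24.99 per hour'.replace(',', '')
m = pattern.search(text)

if m:
    # group(0) is the whole match; groups 1 and 2 are the parenthesized parts.
    print(m.group(0))  # span from the first number through the second
    print(m.group(1))  # low end, still a string
    print(m.group(2))  # high end, still a string
    low, high = (float(g) for g in m.groups())
    print(low, high)
```

Note that the leading $ is not part of the match at all: .search simply starts matching at the first digit.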
- Thanks a lot, this is an excellent solution. – Hooman Bahreini, Jun 16, 2020 at 21:54
- What's your reasoning behind doing if text.startswith('$'): ... as a separate test? – 200_success, Apr 21, 2022 at 4:14
- Could probably simplify the regex too, to ([\d\.]+).+?([\d\.]+), since you don't really care what's in between the two numbers. – Polymer, Apr 21, 2022 at 8:22
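The simplified pattern from the last comment can be tried out with a short sketch like the one below. The helper name loose_range is hypothetical, and the single-number input is an assumed edge case that does not appear in the question.

```python
import re

# Pattern from the comment above: anything may sit between the two numbers.
loose = re.compile(r'([\d\.]+).+?([\d\.]+)')

def loose_range(text):
    """Hypothetical helper: apply the loose pattern after dropping commas."""
    m = loose.search(text.replace(',', ''))
    return m.groups() if m else None

# Two numbers present: both are captured as expected.
print(loose_range('$80,000 - $90,000 per annum'))
print(loose_range('$20 - $24.99 per hour'))

# Assumed edge case with only one number: backtracking lets the engine
# split that single number across both groups instead of failing.
print(loose_range('$80,000 per annum'))
```

For the single-number input the result is ('800', '0') rather than None, so the looser pattern seems safe only when every input is known to contain a range.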
Good job! And there is already a good answer here.
I'm only commenting on your regular expression, even though I'm not sure what your input ranges may look like. It would probably miss some edge cases, though. I'm assuming that these are all acceptable:
$80,000,000,000.00 - $90,000,000,000.00 per annum
$80,000,000 - $90,000,000 per annum
$80,000 - $90,000 per annum
$20 - $24.99 per hour
$20 - $24.99 per hour
$20 - $24.99 per hour
$20.00 - $24.99 per hour
and these are unacceptable:
$20.00 - $24.99 per day
$111,120.00 - $11,124.99 per week
$111,222,120.00 - $111,111,124.99 per month
You can see your own expression in this link:
Demo
- It would pass some cases that may not be desired, I guess.
- You also do not need to escape . and $ inside a character class:
Demo
Code
import re

def find_range(text: str) -> list:
    expression = r'^\s*\$([0-9]{1,3}(?:,[0-9]{1,3})*(?:\.[0-9]{1,2})?)\s*-\s*\$([0-9]{1,3}(?:,[0-9]{1,3})*(?:\.[0-9]{1,2})?)\s*per\s+(?:annum|hour)\s*$'
    # re.findall returns a list of (low, high) tuples, empty if nothing matches
    return re.findall(expression, text)

input_a = '$80,000 - $90,000 per annum'
input_b = '$20 - $24.99 per hour'
print(find_range(input_a))
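As a rough check, the expression can be run against a few of the acceptable and unacceptable strings listed earlier. This is only a sketch: the names AMOUNT, PATTERN, accepted, and rejected are made up for illustration, and the extra-whitespace input is an assumed variant.

```python
import re

# Same expression as above, assembled from a repeated piece for readability.
AMOUNT = r'\$([0-9]{1,3}(?:,[0-9]{1,3})*(?:\.[0-9]{1,2})?)'
PATTERN = re.compile(
    r'^\s*' + AMOUNT + r'\s*-\s*' + AMOUNT + r'\s*per\s+(?:annum|hour)\s*$'
)

accepted = [
    '$80,000,000,000.00 - $90,000,000,000.00 per annum',
    '$80,000 - $90,000 per annum',
    '$20 - $24.99 per hour',
    '  $20.00  -  $24.99   per hour ',  # extra whitespace, an assumed variant
]
rejected = [
    '$20.00 - $24.99 per day',
    '$111,120.00 - $11,124.99 per week',
]

for text in accepted:
    assert PATTERN.search(text), text
for text in rejected:
    assert PATTERN.search(text) is None, text
print('all cases behave as expected')
```

Anchoring with ^ and $ is what rejects the per day and per week inputs; without the anchors and the unit alternation, the numeric part alone would still match.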
If you wish to simplify, update, or explore the expression, it is explained in the top right panel of regex101.com. You can watch the matching steps or modify them in this debugger link, if you are interested. The debugger demonstrates how a regex engine might consume some sample input strings step by step and perform the matching process.
RegEx Circuit
jex.im visualizes regular expressions:
Demo
- Thanks for this, especially those corner cases are very good points. – Hooman Bahreini, Jun 16, 2020 at 11:10
all_examples = [input1, input2]
and run for input in all_examples: - this way you can test the code with all inputs without changing the code. Eventually you could create a list with inputs and expected outputs, all_examples = [(input1, "80000", "90000"), (input2, "20", "24.99")], and automatically check whether the results are correct: for input, expected_low, expected_hight in all_examples: ... low == expected_low ... high == expected_hight ...
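The suggestion above could be sketched roughly as follows. The wrapper extract_range is a hypothetical name introduced here so the question's logic can be called in a loop; the regex is the question's own, written as a raw string.

```python
import re

def extract_range(text):
    """Hypothetical wrapper around the question's logic.

    Returns (low, high) as strings, or None if no range is found.
    """
    matches = re.findall(r"[\$0-9,\. ]*-[\$0-9,\. ]*", text)
    if matches and matches[0]:
        parts = matches[0].split("-")
        if len(parts) == 2:
            low = parts[0].strip(" $").replace(",", "")
            high = parts[1].strip(" $").replace(",", "")
            return low, high
    return None

# Each entry pairs an input with its expected low and high values.
all_examples = [
    ("$80,000 - $90,000 per annum", "80000", "90000"),
    ("$20 - $24.99 per hour", "20", "24.99"),
]

for text, expected_low, expected_high in all_examples:
    result = extract_range(text)
    assert result == (expected_low, expected_high), (text, result)
print("all examples passed")
```

Adding a new case then only means appending one tuple to all_examples.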