Extracting min and max salary from string

Question 1

What I want is to extract the min/max salary range from text that contains either an hourly or an annual salary.

import re
# either of the following inputs should work
input1 = "$80,000 - $90,000 per annum"
input2 = "$20 - $24.99 per hour"
salary_text = re.findall("[\$0-9,\. ]*-[\$0-9,\. ]*", input1)
if salary_text and salary_text[0]:
    range_list = re.split("-", salary_text[0])
    if range_list and len(range_list) == 2:
        low = range_list[0].strip(' $').replace(',', '')
        high = range_list[1].strip(' $').replace(',', '')

Question 2

You could create a list of inputs, all_examples = [input1, input2], and run for input in all_examples: so that you can test the code with all inputs without changing it. You could then extend the list with the expected outputs, all_examples = [(input1, "80000", "90000"), (input2, "20", "24.99")], and automatically check that the results are correct: for input, expected_low, expected_high in all_examples: ... low == expected_low ... high == expected_high ...
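The table-driven check suggested here might look like the following sketch. The helper name extract_range is hypothetical: it only wraps the question's code in a function so that it can be called once per input.

```python
import re


def extract_range(text):
    """Hypothetical wrapper around the question's code; returns (low, high) as strings."""
    salary_text = re.findall(r"[\$0-9,\. ]*-[\$0-9,\. ]*", text)
    if salary_text and salary_text[0]:
        range_list = re.split("-", salary_text[0])
        if range_list and len(range_list) == 2:
            low = range_list[0].strip(' $').replace(',', '')
            high = range_list[1].strip(' $').replace(',', '')
            return low, high
    return None, None


# Each entry pairs an input with the low/high values we expect back.
all_examples = [
    ("$80,000 - $90,000 per annum", "80000", "90000"),
    ("$20 - $24.99 per hour", "20", "24.99"),
]

for text, expected_low, expected_high in all_examples:
    low, high = extract_range(text)
    assert (low, high) == (expected_low, expected_high), text
print("all examples passed")
```

Adding a new case is then a one-line change to all_examples rather than an edit to the code under test.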

Question 3

The , commas are a nice twist. I feel you are stripping them a bit late, as they don't really contribute to the desired solution. Better to lose them from the get go.

Calling .findall seems to be overkill for your problem specification -- likely .search would suffice.

salary_text = re.findall("[\$0-9,\. ]*-[\$0-9,\. ]*", input1)

The dollar signs, similarly, do not contribute to the solution; your regex could probably just ignore them if your inputs are fairly sane. Or you could scan only for lines starting with a $ dollar sign, and then let the regex ignore them.

range_list = re.split("-", salary_text[0])

There is no need for this .split -- the regex could have done this for you already. Here is what I recommend:

def find_range(text):
    if text.startswith('$'):
        m = re.search(r'([\d\.]+) *- *\$?([\d\.]+)', text.replace(",", ""))
        if m:
            lo, hi = m.groups()
            return float(lo), float(hi)
    return None, None
print(find_range('$80,000 - $90,000 per annum'))
print(find_range('$20 - $24.99 per hour'))

The raw string regex r'[\d\.]+' picks out one or more numeric characters, which can include a decimal point. And putting ( ) parens around a regex makes it a capturing group -- we have two such groups here. Finally, *- *\$? lets us skip a single - dash with optional whitespace and an optional $ dollar sign.
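To make the role of the two capturing groups concrete, the short sketch below runs the same pattern on the question's second input and prints the intermediate strings before they are converted to float. It adds nothing beyond what the code in this answer already does.

```python
import re

text = "$20 - $24.99 per hour"
pattern = r'([\d\.]+) *- *\$?([\d\.]+)'

# Commas are removed first, so the groups only ever see digits and dots.
m = re.search(pattern, text.replace(",", ""))

print(m.group(0))   # whole matched span: 20 - $24.99
print(m.groups())   # one string per pair of parentheses: ('20', '24.99')

lo, hi = (float(g) for g in m.groups())
print(lo, hi)
```

Note that the leading $ is not part of the match: re.search simply starts matching at the first digit.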

Question 4

Thanks a lot, this is an excellent solution.

Question 5

What's your reasoning behind doing if text.startswith('$'): ... as a separate test?

Question 6

Could probably simplify the regex too ([\d\.]+).+?([\d\.]+) since you don't really care what's in between the two numbers
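One way to weigh this suggestion is to run the looser pattern on the sample inputs and on a string that contains only one number. The sketch below does that; the single-number input is an assumption added for illustration and is not one of the question's cases.

```python
import re

loose = re.compile(r'([\d\.]+).+?([\d\.]+)')

# On the question's inputs (commas already stripped) the looser pattern works.
print(loose.search("$80000 - $90000 per annum").groups())  # ('80000', '90000')
print(loose.search("$20 - $24.99 per hour").groups())      # ('20', '24.99')

# With only one number present, backtracking splits that number in two
# instead of failing, so the caller would get a bogus range.
print(loose.search("$80000 per annum").groups())           # ('800', '0')
```

So the simplification seems fine when every input is known to contain a range, but it is more forgiving than the original pattern when that is not guaranteed.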

Question 7

Good job

And there is already a good answer here!

I'm only commenting on your regular expression, even though I'm not sure what your input ranges may look like. It would probably miss some edge cases, though. I'm assuming that these are all acceptable:

$80,000,000,000.00 - $90,000,000,000.00 per annum
$80,000,000 - $90,000,000 per annum
$80,000 - $90,000 per annum
$20 - $24.99 per hour
 $20 - $24.99 per hour
$20 - $24.99 per hour
 $20.00 - $24.99 per hour

and these are unacceptable:

 $20.00 - $24.99 per day
 $111,120.00 - $11,124.99 per week
 $111,222,120.00 - $111,111,124.99 per month

You can see your own expression in this link:

Demo

It would pass some cases that may not be desired, I guess.
You also do not need to escape . and $ inside a character class:

Demo
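The point about escaping can be checked directly in Python. The sketch below compares the question's character class with an unescaped version of it; the variable names are illustrative only.

```python
import re

escaped = re.compile(r'[\$0-9,\. ]*-[\$0-9,\. ]*')
unescaped = re.compile(r'[$0-9,. ]*-[$0-9,. ]*')

samples = ["$80,000 - $90,000 per annum", "$20 - $24.99 per hour"]
for sample in samples:
    # Inside [...], "$" and "." are literal characters with or without a backslash.
    assert escaped.findall(sample) == unescaped.findall(sample)

print(unescaped.findall(samples[1]))  # ['$20 - $24.99 ']
```

The printed match also shows the trailing space that the character class picks up, which is why the question's code has to strip it afterwards.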

Code

import re

def find_range(text: str) -> list:
    # findall returns a list of (low, high) tuples; the list is empty when the text does not match
    expression = r'^\s*\$([0-9]{1,3}(?:,[0-9]{1,3})*(?:\.[0-9]{1,2})?)\s*-\s*\$([0-9]{1,3}(?:,[0-9]{1,3})*(?:\.[0-9]{1,2})?)\s*per\s+(?:annum|hour)\s*$'
    return re.findall(expression, text)

input_a = '$80,000 - $90,000 per annum'
input_b = '$20 - $24.99 per hour'
print(find_range(input_a))
print(find_range(input_b))
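As a rough check of the expression against the acceptable and unacceptable lists given earlier, the sketch below compiles the same pattern, split across lines only for readability, and asserts on a few cases from each list. The name STRICT and the extra-whitespace variant are assumptions added here.

```python
import re

# Same expression as in find_range, compiled once.
STRICT = re.compile(
    r'^\s*\$([0-9]{1,3}(?:,[0-9]{1,3})*(?:\.[0-9]{1,2})?)'
    r'\s*-\s*'
    r'\$([0-9]{1,3}(?:,[0-9]{1,3})*(?:\.[0-9]{1,2})?)'
    r'\s*per\s+(?:annum|hour)\s*$'
)

acceptable = [
    '$80,000,000,000.00 - $90,000,000,000.00 per annum',
    '$80,000 - $90,000 per annum',
    '   $20.00   -   $24.99   per hour',  # extra whitespace (assumed variant)
]
unacceptable = [
    '$20.00 - $24.99 per day',
    '$111,120.00 - $11,124.99 per week',
    '$111,222,120.00 - $111,111,124.99 per month',
]

for text in acceptable:
    assert STRICT.findall(text), text
for text in unacceptable:
    assert not STRICT.findall(text), text

print(STRICT.findall(acceptable[1]))  # [('80,000', '90,000')]
```

The captured strings still contain the thousands separators, so a caller that wants numbers would need to remove the commas before converting.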

If you wish to simplify, update, or explore the expression, it is explained in the top right panel of regex101.com. You can watch the matching steps or modify them in this debugger link, if you are interested. The debugger demonstrates how a RegEx engine might consume some sample input strings step by step and perform the matching process.

RegEx Circuit

jex.im visualizes regular expressions:

Demo

Question 8

Thanks for this, especially those corner cases are very good points.

Accepted answer: J_H, 2020-06-15 02:41:23Z
