The question is pretty self-explanatory. I wrote a regex pattern which should (in theory) match all possible integers and decimal numbers. The pattern is as follows:
import re
pattern = '^[+-]?((\d+(\.\d*)?)|(\.\d+))$'
re.compile(pattern)
How foolproof is this pattern? I tested out quite a few scenarios, and they all worked fine. Am I missing some edge case here? Thanks for any help.
2 Answers
Your expression looks fine. If inputs with a trailing decimal point, such as 3. or 4., are undesired, you could change \d* to \d+ so that they fail to match:
^[+-]?((\d+(\.\d+)?)|(\.\d+))$
Other than that, the pattern has some capturing groups, which I'm guessing you'd like to keep.
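To make the difference between the two variants concrete, here is a minimal sketch. The names `original` and `stricter` are only labels chosen for this illustration.

```python
import re

# The pattern from the question: digits after the decimal point are optional.
original = re.compile(r"^[+-]?((\d+(\.\d*)?)|(\.\d+))$")
# The variant suggested in this answer: a decimal point must be followed by digits.
stricter = re.compile(r"^[+-]?((\d+(\.\d+)?)|(\.\d+))$")

# Print, for each sample, whether each pattern accepts it.
for sample in ["3.", "4.", "3.0", ".5", "."]:
    print(f"{sample!r}: original={bool(original.match(sample))}, "
          f"stricter={bool(stricter.match(sample))}")
```

Only the trailing-dot samples should differ between the two patterns; a lone "." is rejected by both.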
Test the capturing groups with re.finditer
import re
regex = r"^[+-]?((\d+(\.\d+)?)|(\.\d+))$"
test_str = ("0.00000\n"
"0.00\n"
"-200\n"
"+200\n"
"200\n"
"200.2\n"
"-200.2\n"
"+200.2\n"
".000\n"
".1\n"
".2\n"
"3.\n"
".")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
Test with re.findall
import re
regex = r"^[+-]?((\d+(\.\d+)?)|(\.\d+))$"
test_str = ("0.00000\n"
"0.00\n"
"-200\n"
"+200\n"
"200\n"
"200.2\n"
"-200.2\n"
"+200.2\n"
".000\n"
".1\n"
".2\n"
"3.\n"
".")
print(re.findall(regex, test_str, re.MULTILINE))
The expression is explained in the top-right panel of this demo, if you wish to explore, simplify, or modify it. In this link, you can watch how it matches some sample inputs step by step, if you like.
RegEx Circuit
jex.im visualizes regular expressions:
3 is an integer, 3. is a float value. I would expect the latter string to match as well, so I would leave the \.\d* unmodified. – AJNeufeld, Jul 12, 2019 at 5:05
Here's one:
number_regex = re.compile(
    r'^[-+]?(?:(?:(?:[1-9](?:_?\d)*|0+(_?0)*)|(?:0[bB](?:_?[01])+)'
    r'|(?:0[oO](?:_?[0-7])+)|(?:0[xX](?:_?[0-9a-fA-F])+))'
    r'|(?:(?:(?:\d(?:_?\d)*)?(?:\.(?:\d(?:_?\d)*))|(?:\d(?:_?\d)*)\.)'
    r'|(?:(?:(?:\d(?:_?\d)*)|(?:(?:\d(?:_?\d)*)?(?:\.(?:\d(?:_?\d)*))'
    r'|(?:\d(?:_?\d)*)\.))(?:[eE][-+]?(?:\d(?:_?\d)*)))))$',
    re.UNICODE)
But seriously, Python numbers are complicated
If you really want a regex that will match ALL valid forms of Python numbers, it will be a complex regex. Integers include decimal, binary, octal, and hexadecimal forms. Floating point numbers can be in exponent form. As of version 3.6, all kinds of numbers can have '_' in them, but it can't be first or last. And integers > 0 can't start with '0' unless the prefix is 0b, 0o, or 0x.
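These rules can be checked against Python's own parser. The sketch below uses `ast.literal_eval` as a rough oracle; the helper name `is_python_number_literal` is hypothetical, and it assumes Python 3.6 or later. Note that `literal_eval` also accepts a leading sign and surrounding parentheses, so it is not an exact test for a bare literal.

```python
import ast

def is_python_number_literal(text):
    """Return True if Python parses `text` as an int or float value."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return False
    # Compare the exact type so that bool ("True"), complex, str, etc. are excluded.
    return type(value) in (int, float)

# A few forms mentioned in the paragraph above.
for text in ["1_000", "0b101", "0x_ff", "1e5", "00", "012", "1_", "_1", "."]:
    print(f"{text!r}: {is_python_number_literal(text)}")
```

Here "012" is rejected because of the leading zero, "1_" because the underscore is last, and "_1" because it parses as a name rather than a number.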
From the Python documentation, here is the BNF for integer:
integer ::= decinteger | bininteger | octinteger | hexinteger
decinteger ::= nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
bininteger ::= "0" ("b" | "B") (["_"] bindigit)+
octinteger ::= "0" ("o" | "O") (["_"] octdigit)+
hexinteger ::= "0" ("x" | "X") (["_"] hexdigit)+
nonzerodigit ::= "1"..."9"
digit ::= "0"..."9"
bindigit ::= "0" | "1"
octdigit ::= "0"..."7"
hexdigit ::= digit | "a"..."f" | "A"..."F"
and here is the BNF for floatnumber:
floatnumber ::= pointfloat | exponentfloat
pointfloat ::= [digitpart] fraction | digitpart "."
exponentfloat ::= (digitpart | pointfloat) exponent
digitpart ::= digit (["_"] digit)*
fraction ::= "." digitpart
exponent ::= ("e" | "E") ["+" | "-"] digitpart
Note that the '+' or '-' isn't technically part of the number; it is a unary operator. But it is easy enough to include an optional sign in the regex.
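The point about the sign can be seen in the syntax tree. This is a small sketch, assuming Python 3.8 or later (where numeric literals appear as `ast.Constant` nodes); the variable name `node` is just for illustration.

```python
import ast

# Parse "-1" as an expression and look at the top-level node.
node = ast.parse("-1", mode="eval").body

# The minus sign is a unary operator applied to the literal 1,
# not part of the literal itself.
print(type(node).__name__)          # name of the outer node class
print(type(node.op).__name__)       # name of the operator class
print(type(node.operand).__name__)  # name of the operand node class
```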
To create the regex, simply translate the BNF into the corresponding regex patterns. Using non-capturing parentheses (?: ) and f-strings helps a lot (rf"..." is a raw format string).
Integer:
decint = r"(?:[1-9](?:_?\d)*|0+(_?0)*)"
binint = r"(?:0[bB](?:_?[01])+)"
octint = r"(?:0[oO](?:_?[0-7])+)"
hexint = r"(?:0[xX](?:_?[0-9a-fA-F])+)"
integer = rf"(?:{decint}|{binint}|{octint}|{hexint})"
floatnumber:
digitpart = r"(?:\d(?:_?\d)*)"
exponent = rf"(?:[eE][-+]?{digitpart})"
fraction = rf"(?:\.{digitpart})"
pointfloat = rf"(?:{digitpart}?{fraction}|{digitpart}\.)"
exponentfloat = rf"(?:(?:{digitpart}|{pointfloat}){exponent})"
floatnumber = rf"(?:{pointfloat}|{exponentfloat})"
and put it all together, with an optional sign, to get:
number = re.compile(rf"^[-+]?(?:{integer}|{floatnumber})$")
Which is how I got the regex at the top of this answer. This has not been thoroughly tested, just spot checked:
tests = """
0
1
123
100_000
1_2_3
1000000
1.0
1.
.2
0.2
3.4
1_234.567_89
0o123
0b1111_0000
0X12_34_ab_cd
1e-10
1E001
.2e-2
"""
tests = tests.split()
for s in tests:
    m = number.match(s)
    print(f"'{s}' => {m[0] if m else 'NOT a number'}")
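Since the spot check only covers strings that should match, one possible way to extend it is to compare the regex against Python's own parser on both valid and invalid inputs. The sketch below rebuilds the pattern from the fragments above and uses `ast.literal_eval` as an approximate oracle. The helper `python_accepts` and the sample list are assumptions made for this illustration, and the samples deliberately avoid whitespace and parentheses, which `literal_eval` tolerates but the regex does not.

```python
import ast
import re

# Rebuild the pattern from the fragments above so this sketch is self-contained.
decint = r"(?:[1-9](?:_?\d)*|0+(_?0)*)"
binint = r"(?:0[bB](?:_?[01])+)"
octint = r"(?:0[oO](?:_?[0-7])+)"
hexint = r"(?:0[xX](?:_?[0-9a-fA-F])+)"
integer = rf"(?:{decint}|{binint}|{octint}|{hexint})"

digitpart = r"(?:\d(?:_?\d)*)"
exponent = rf"(?:[eE][-+]?{digitpart})"
fraction = rf"(?:\.{digitpart})"
pointfloat = rf"(?:{digitpart}?{fraction}|{digitpart}\.)"
exponentfloat = rf"(?:(?:{digitpart}|{pointfloat}){exponent})"
floatnumber = rf"(?:{pointfloat}|{exponentfloat})"

number = re.compile(rf"^[-+]?(?:{integer}|{floatnumber})$")

def python_accepts(text):
    """Hypothetical oracle: True if Python evaluates `text` to an int or float."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return False
    return type(value) in (int, float)

samples = [
    # expected to be valid numbers
    "0", "00", "0_0", "123", "1_000", "0b1010", "0o17", "0xFF",
    "1.", ".5", "1e5", "1_0.0_1e-1_0", "-3.5", "+7",
    # expected to be invalid
    "012", "1__0", "_1", "1_", "0x", ".", "1e", "e5", "abc", "1.2.3",
]

# Collect every sample where the regex and the parser disagree.
disagreements = [s for s in samples if bool(number.match(s)) != python_accepts(s)]
print("disagreements:", disagreements)
```

An empty list only means the two agree on these samples; it is not a proof that the regex is complete.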