I have written a small Python function to extract version info from a string. The input has the format:
- v{Major}.{Minor}[b{beta-num}]
E.g.: v1.1, v1.10b100, v1.100b01
The goal is to extract the major, minor and beta numbers.
import re
def VersionInfo(x):
    reg = re.compile('^v([0-9]+)([.])([0-9]+)(b)?([0-9]+)?$')
    return reg.findall(x)
if __name__ == '__main__':
    print(VersionInfo('v1.01b10'))
    print(VersionInfo('v1.01'))
    print(VersionInfo('v1.0'))
Output:
[('1', '.', '01', 'b', '10')]
[('1', '.', '01', '', '')]
[('1', '.', '0', '', '')]
Question is:
- Is it sufficient to cover all cases?
- Better regex?
4 Answers
- Compile your regexp only once. Doing it each time you call `VersionInfo` kind of defeats its purpose. You can extract it out of the function and give it a proper capitalized name to indicate it is a constant.
- Use blank lines to separate logical sections of your code, especially after `import`s and before function definitions.
- According to PEP 8, function names should be snake_case and not TitleCase.
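Putting these three points together, a minimal sketch might look like the following. The names `VERSION_RE` and `version_info` are illustrative choices, not taken from the original code; the pattern is the question's own, only written as a raw string.

```python
import re

# Compiled once at import time; the uppercase name marks it as a constant.
# Same pattern as in the question.
VERSION_RE = re.compile(r'^v([0-9]+)([.])([0-9]+)(b)?([0-9]+)?$')


def version_info(text):
    """Return the list of captured groups, exactly as the original function did."""
    return VERSION_RE.findall(text)


print(version_info('v1.01b10'))
print(version_info('v1.01'))
```

The behaviour is unchanged; only the layout, the naming and the point at which the pattern is compiled differ.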
^v([0-9]+)([.])([0-9]+)(b)?([0-9]+)?$
Not much to say but:
- It seems a bit strange to capture the `.`, since that's the only value that could be there. Also, `[.]` could be written more simply as `\.`.
- It also seems a bit strange to capture the `b`, unless `v1.1b` is specifically allowed. And you're trying to find that last digit group even if `b` isn't present. That's not a bug, but it could be made clearer.
To address the second point, make the whole beta part optional:
^v([0-9]+)\.([0-9]+)(?:(b)([0-9]+)?)?$
If beta with no number isn't allowed, this can be simplified (and you can drop capturing `b` in this case too):
^v([0-9]+)\.([0-9]+)(?:b([0-9]+))?$
(The `(?:...)` construct is a non-capturing group.)
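To see how the two variants differ, here is a small sketch that runs both on two inputs from the question plus `v1.1b`. The constant names are made up for the illustration.

```python
import re

# Variant 1: the whole beta part is optional, and "b" may appear without a number.
BETA_NUMBER_OPTIONAL = re.compile(r'^v([0-9]+)\.([0-9]+)(?:(b)([0-9]+)?)?$')
# Variant 2: "b" must be followed by a number, so "b" itself is not captured.
BETA_NUMBER_REQUIRED = re.compile(r'^v([0-9]+)\.([0-9]+)(?:b([0-9]+))?$')

for text in ('v1.10b100', 'v1.1', 'v1.1b'):
    first = BETA_NUMBER_OPTIONAL.match(text)
    second = BETA_NUMBER_REQUIRED.match(text)
    print(text,
          first.groups() if first else None,
          second.groups() if second else None)

# v1.10b100 ('1', '10', 'b', '100') ('1', '10', '100')
# v1.1 ('1', '1', None, None) ('1', '1', None)
# v1.1b ('1', '1', 'b', None) None
```

Note that groups which did not take part in the match come back as `None` from `groups()`, whereas `findall()` reports them as empty strings.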
- Note that if you want to have `\.` in the regex, then it would be a good idea to use `r'raw strings'`. (200_success, Apr 4, 2016 at 7:15)
- Point noted on the capture of "." which is not required. (testcoder, Apr 4, 2016 at 7:19)
- Interesting: non-capturing group. I didn't know that. I will read into it. (testcoder, Apr 4, 2016 at 7:23)
- You might also mention that `\d` is shorter than `[0-9]`. (zondo, Apr 4, 2016 at 11:10)
- @zondo: `\d` is shorter, but not 100% equivalent on all regex engines/modes: it can match more than `0-9` if in Unicode mode. (Mat, Apr 4, 2016 at 11:25)
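As a quick sketch of the two points raised in these comments, assuming Python 3 (where `str` patterns are matched in Unicode mode by default), the following shows that a raw string avoids doubling the backslash and that `\d` accepts digits outside `0-9` unless `re.ASCII` is given. The variable name is only for illustration.

```python
import re

# A raw string spares you from doubling the backslash.
print(r'\.' == '\\.')  # True

# ARABIC-INDIC DIGIT ONE: a decimal digit that is not in 0-9.
arabic_indic_one = '\u0661'

print(bool(re.fullmatch(r'\d', arabic_indic_one)))            # True
print(bool(re.fullmatch(r'[0-9]', arabic_indic_one)))         # False
print(bool(re.fullmatch(r'\d', arabic_indic_one, re.ASCII)))  # False
```

For version strings, `[0-9]` (or `\d` with `re.ASCII`) is therefore probably the safer choice.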
I’d suggest using named capturing groups and non-capturing groups to improve the regex – you only get relevant information, and it makes it easier to access the attributes later.
For example:
import re

VERSION_REGEX = re.compile(
    r'^'                        # start of string
    r'v'                        # literal v character
    r'(?P<major>[0-9]+)'        # major number
    r'\.'                       # literal . character
    r'(?P<minor>[0-9]+)'        # minor number
    r'(?:b(?P<beta>[0-9]+))?'   # literal b character, and beta number
    r'$'                        # end of string
)
The `?P<name>` means that we can access this group by name in a match. (named)
The `?:` at the start of a group means the group is not captured, so it does not show up among the match's groups. (non-capturing)
Note: The `b` for beta is now part of a non-capturing group, which contains the `beta` capturing group. This means that the regex will only get the `b` if it’s followed by a number; it wouldn’t, for example, match `v1.0b`. (Which seems to be the desired behaviour, from the comments.)
This gives us the following, IMO nicer-to-work-with output:
versions = ['v1.01b10', 'v1.01', 'v1.0']
for version_str in versions:
    match = VERSION_REGEX.search(version_str)
    print(match.groupdict())
# {'major': '1', 'minor': '01', 'beta': '10'}
# {'major': '1', 'minor': '01', 'beta': None}
# {'major': '1', 'minor': '0', 'beta': None}
Some more minor points:
I’ve broken up the regex across multiple lines, with liberal comments to explain what’s going on. Although this is a fairly simple regex, I think this is good practice. It makes them substantially easier to work with, long-term.
I’ve prefixed my string with `r`. This is a raw string, which means I have to do slightly less work when escaping backslashes and the like. Again, perhaps not necessary in this case, but a good habit to get into.
Since your regex binds to the start of the string, `findall()` will only ever get one match. You’d be better off using something like `search()`, which returns a match object (`SRE_Match` in older Python versions, `re.Match` in current ones) instead of tuples. This gives you, for example, the nice named-group semantics I used above.
Function names should be lowercase_with_underscores, not CamelCase. See PEP 8.
Compile the regex once as a constant at the top of your file, not every time you call the function. This is more efficient.
Whitespace (more newlines) will help break up your code and make it easier to read.
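A sketch of how these points might fit together, assuming Python 3.4 or later (for `fullmatch()`, which makes the `^` and `$` anchors unnecessary). The function name `parse_version` and the conversion to `int` are my own assumptions, not something the question asks for. In particular, converting to `int` drops leading zeros, so `v1.01` and `v1.1` would give the same result; keep the strings if that distinction matters.

```python
import re

VERSION_REGEX = re.compile(
    r'v'                        # literal v character
    r'(?P<major>[0-9]+)'        # major number
    r'\.'                       # literal . character
    r'(?P<minor>[0-9]+)'        # minor number
    r'(?:b(?P<beta>[0-9]+))?'   # optional: literal b and beta number
)


def parse_version(version_str):
    """Return a dict of ints (beta may be None), or None for an invalid string."""
    match = VERSION_REGEX.fullmatch(version_str)
    if match is None:
        return None
    # Groups that did not take part in the match are None; leave them as None.
    return {name: int(value) if value is not None else None
            for name, value in match.groupdict().items()}


print(parse_version('v1.01b10'))  # {'major': 1, 'minor': 1, 'beta': 10}
print(parse_version('v1.0'))      # {'major': 1, 'minor': 0, 'beta': None}
print(parse_version('v1.0b'))     # None
```

Returning `None` for invalid input is one possible design; raising `ValueError` would work equally well if callers should not ignore bad version strings.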
I'd like to propose an alternate solution that just captures the version numbers themselves. If you need the `b` and `.` as legitimate capture groups, this solution might not work for you.
import re

REG = re.compile('[.bv]')

def VersionInfo(x):
    # Split on any of the separator characters '.', 'b' or 'v'.
    return REG.split(x)

if __name__ == '__main__':
    print(VersionInfo('v1.01b10'))
    print(VersionInfo('v1.01'))
    print(VersionInfo('v1.01b'))
    print(VersionInfo('v1.0'))
This gives the following result:
['', '1', '01', '10']
['', '1', '01']
['', '1', '01', '']
['', '1', '0']
As you can see, this only captures the numbers themselves, and the regex is far easier to read. If the 4th element is there but empty, your version number has a `b` without a beta number, which is what you said in a comment you want to handle.
One thing I'm not sure about is if the datatype that's returned is equally usable.
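One way to make the returned list easier to consume is to drop the empty leading field and convert the rest to numbers. This is only a sketch: `version_parts` is a hypothetical helper, and mapping an empty field to `None` is my assumption about how a bare `b` should be represented.

```python
import re

SEPARATORS = re.compile('[.bv]')


def version_parts(text):
    """Split on 'v', '.' and 'b'; return ints, with None for an empty field."""
    fields = SEPARATORS.split(text)
    # fields[0] is whatever precedes the first separator (empty for "v...").
    return [int(field) if field else None for field in fields[1:]]


print(version_parts('v1.01b10'))  # [1, 1, 10]
print(version_parts('v1.01b'))    # [1, 1, None]
print(version_parts('b1v2'))      # [1, 2]
```

The last call shows the trade-off: splitting does not validate the format, so a malformed string such as `b1v2` is accepted silently, and non-numeric fields would make `int()` raise `ValueError`.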
- Nice and short regex. Thx. (testcoder, Apr 4, 2016 at 23:54)
- `v1.1b` valid? (i.e. beta with no beta number)
- `b` and `.` as well?