regex to extract version info

Question 1

I have written a small Python function to extract version info from a string. The input has the format:

v{Major}.{Minor}[b{beta-num}]

E.g.: v1.1, v1.10b100, v1.100b01
The goal is to extract the major, minor and beta numbers.

import re
def VersionInfo(x):
 reg = re.compile('^v([0-9]+)([.])([0-9]+)(b)?([0-9]+)?$')
 return reg.findall(x)
if __name__ == '__main__':
 print(VersionInfo('v1.01b10'))
 print(VersionInfo('v1.01'))
 print(VersionInfo('v1.0'))

Output:

[('1', '.', '01', 'b', '10')]
[('1', '.', '01', '', '')]
[('1', '.', '0', '', '')]

My questions are:

Is this sufficient to cover all cases?
Is there a better regex?

Question 2

Is v1.1b valid? (i.e. beta with no beta number)

Question 3

No. But how do I handle it?

Question 4

Do you need to capture the b and . as well?

Question 5

Is it intentional that you are ignoring the micro version and most release levels? If not then there are a few versions that your regex won't work with: 3.5.1f0 (3.5.1), 2.1.0b1, 2.2.0a1.
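
If micro versions and release levels did need to be supported, one possible extension is sketched below. The pattern name, the optional v prefix, the group names, and the set of release-level letters (a, b, c, f) are all assumptions chosen to fit the three examples in this comment; none of them come from the original post.

```python
import re

# Hypothetical extended pattern: optional "v", optional micro number,
# optional release-level letter followed by a serial number.
EXTENDED_VERSION = re.compile(
    r'^v?'
    r'(?P<major>[0-9]+)'
    r'\.(?P<minor>[0-9]+)'
    r'(?:\.(?P<micro>[0-9]+))?'
    r'(?:(?P<level>[abcf])(?P<serial>[0-9]+))?'
    r'$'
)


def parse_extended(text):
    """Return a dict of the named groups, or None if text does not match."""
    match = EXTENDED_VERSION.match(text)
    return match.groupdict() if match else None


if __name__ == '__main__':
    for text in ('3.5.1f0', '2.1.0b1', '2.2.0a1', 'v1.10b100'):
        print(text, parse_extended(text))
```

Groups that are absent from the input (for example micro in v1.10b100) come back as None.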

Question 6

I have assumed that my input will contain only major, minor and beta versions. This was easy to begin with.

Question 7

Compile your regexp only once. Doing it each time you call VersionInfo kind of defeats the purpose of compiling it. You can move it out of the function and give it a capitalized name to indicate that it is a constant.
Use blank lines to separate logical sections of your code, especially after imports and before function definitions.
According to PEP 8, function names should be snake_case, not TitleCase.
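
Taken together, these three points might look like the following sketch. The names VERSION_REGEX and version_info are illustrative choices, and the pattern is the one from the question, left unchanged.

```python
import re

# Compiled once at import time; the upper-case name marks it as a constant.
VERSION_REGEX = re.compile('^v([0-9]+)([.])([0-9]+)(b)?([0-9]+)?$')


def version_info(text):
    """Return the captured groups for text, as the original function did."""
    return VERSION_REGEX.findall(text)


if __name__ == '__main__':
    print(version_info('v1.01b10'))
    print(version_info('v1.01'))
    print(version_info('v1.0'))
```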

Question 8

^v([0-9]+)([.])([0-9]+)(b)?([0-9]+)?$

Not much to say but:

It seems a bit strange to capture the . since that's the only value that could be there. Also, [.] could be written more simply as \. instead.
It also seems a bit strange to capture the b, unless v1.1b is specifically allowed. And you're trying to find that last digit group even if b isn't present; that's not a bug, but it could be made clearer.

To address the second point, make the whole beta part optional:

^v([0-9]+)\.([0-9]+)(?:(b)([0-9]+)?)?$

If beta with no number isn't allowed, this can be simplified (and there is no longer any need to capture the b):

^v([0-9]+)\.([0-9]+)(?:b([0-9]+))?$

(The (?:...) construct is a non-capturing group.)
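
To see how the two patterns above differ in practice, here is a small comparison. The constant names LENIENT and STRICT and the helper groups_or_none are made up for this illustration.

```python
import re

# Whole beta part optional; "b" is captured and its number is optional.
LENIENT = re.compile(r'^v([0-9]+)\.([0-9]+)(?:(b)([0-9]+)?)?$')
# Whole beta part optional; a "b" must be followed by a number.
STRICT = re.compile(r'^v([0-9]+)\.([0-9]+)(?:b([0-9]+))?$')


def groups_or_none(pattern, text):
    """Return the captured groups, or None when text does not match."""
    match = pattern.match(text)
    return match.groups() if match else None


if __name__ == '__main__':
    for text in ('v1.1', 'v1.1b', 'v1.1b7'):
        print(text, groups_or_none(LENIENT, text), groups_or_none(STRICT, text))
```

Only v1.1b separates the two: the first pattern accepts it, with 'b' captured and the number left as None, while the second rejects it. The non-capturing group itself never appears in the result.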

Question 9

Note that if you want to have \. in the regex, then it would be a good idea to use r'raw strings'.
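
A short illustration of why raw strings matter. The sequence \. happens to survive in a normal string because Python does not recognise it as an escape (recent Python versions warn about it), but other regex escapes are not so lucky. The example below uses \b, which is not part of the regex under review and is chosen only because it shows the difference clearly.

```python
import re

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees it.
plain = '\bv[0-9]+'
raw = r'\bv[0-9]+'

if __name__ == '__main__':
    print(len(plain), len(raw))        # 8 9
    print(re.search(plain, 'see v1'))  # None: there is no backspace in the text
    print(re.search(raw, 'see v1'))    # matches 'v1' at a word boundary
```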

Question 10

Point noted on the capture of "." which is not required.

Question 11

Interesting: non-capturing groups. I didn't know about those. I will read up on them.

Question 12

You might also mention that \d is shorter than [0-9].

Question 13

@zondo: \d is shorter, but not 100% equivalent on all regex engines/modes - can match more than 0-9 if in Unicode mode.
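
A quick way to check this in Python, assuming Python 3 and str (not bytes) patterns, where \d is Unicode-aware by default. The sample string is an arbitrary choice of non-ASCII decimal digits.

```python
import re

# The digits 1, 2, 3 written in Arabic-Indic script.
ARABIC_INDIC = '\u0661\u0662\u0663'

if __name__ == '__main__':
    # \d matches any Unicode decimal digit for str patterns.
    print(bool(re.fullmatch(r'\d+', ARABIC_INDIC)))            # True
    # With re.ASCII, \d is restricted to 0-9.
    print(bool(re.fullmatch(r'\d+', ARABIC_INDIC, re.ASCII)))  # False
    # An explicit character class only ever matches 0-9.
    print(bool(re.fullmatch(r'[0-9]+', ARABIC_INDIC)))         # False
```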

Question 14

I’d suggest using named capturing groups and non-capturing groups to improve the regex – you only get relevant information, and it makes it easier to access the attributes later.

For example:

import re

VERSION_REGEX = re.compile(
 r'^' # start of string
 r'v' # literal v character
 r'(?P<major>[0-9]+)' # major number
 r'\.' # literal . character
 r'(?P<minor>[0-9]+)' # minor number
 r'(?:b(?P<beta>[0-9]+))?' # optional: literal b character, then beta number
 r'$' # end of string
)

The ?P<name> means that we can access this attribute by name in a match. (named)
The ?: at the start of a group means its value is ignored. (non-capturing)

Note: The b for beta is now part of a non-capturing group, which contains the beta capturing group. This means that the regex will only accept the b if it’s followed by a number; it wouldn’t, for example, match v1.0b. (Which seems to be the desired behaviour, from the comments.)

This gives us the following, IMO nicer-to-work-with output:

versions = ['v1.01b10', 'v1.01', 'v1.0']
for version_str in versions:
 match = VERSION_REGEX.search(version_str)
 print(match.groupdict())
# {'major': '1', 'minor': '01', 'beta': '10'}
# {'major': '1', 'minor': '01', 'beta': None}
# {'major': '1', 'minor': '0', 'beta': None}

Some more minor points:

I’ve broken up the regex across multiple lines, with liberal comments to explain what’s going on. Although this is a fairly simple regex, I think this is good practice. It makes them substantially easier to work with, long-term.
I’ve prefixed my string with r. This is a raw string, which means I have to do slightly less work when escaping backslashes and the like. Again, perhaps not necessary in this case, but a good habit to get into.
Since your regex is anchored to the start of the string, findall() will only ever find at most one match. You’d be better off using something like search() or match(), which return match objects instead of tuples. This gives you, for example, the nice named-group semantics I used above.
Function names should be lowercase_with_underscores, not CamelCase. See PEP 8.
Compile the regex once as a constant at the top of your file, not every time you call the function. This is more efficient.
Whitespace (more newlines) will help break up your code and make it easier to read.
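
Putting these points together, a caller could get integers back rather than strings. This is one possible sketch, not the answer's own code: the function name version_info, raising ValueError on bad input, and converting to int (which treats v1.01 and v1.1 as the same minor version) are all assumptions.

```python
import re

# Compiled once, at module level.
VERSION_REGEX = re.compile(
    r'^v(?P<major>[0-9]+)\.(?P<minor>[0-9]+)(?:b(?P<beta>[0-9]+))?$'
)


def version_info(text):
    """Return (major, minor, beta) as ints; beta is None when absent."""
    match = VERSION_REGEX.match(text)
    if match is None:
        raise ValueError('not a version string: %r' % (text,))
    beta = match.group('beta')
    return (
        int(match.group('major')),
        int(match.group('minor')),
        int(beta) if beta is not None else None,
    )


if __name__ == '__main__':
    print(version_info('v1.01b10'))  # (1, 1, 10)
    print(version_info('v1.0'))      # (1, 0, None)
```

If leading zeros are significant in your scheme, keep the strings instead of calling int().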

Question 15

I'd like to propose an alternate solution that just captures the version numbers themselves. If you need the b and . as legitimate capture groups, this solution might not work for you.

import re

REG = re.compile('[.bv]')

def VersionInfo(x):
 return REG.split(x)

if __name__ == '__main__':
 print(VersionInfo('v1.01b10'))
 print(VersionInfo('v1.01'))
 print(VersionInfo('v1.01b'))
 print(VersionInfo('v1.0'))

This gives the following result:

['', '1', '01', '10']
['', '1', '01']
['', '1', '01', '']
['', '1', '0']

As you can see, this captures only the numbers themselves, and the regex is far easier to read. If the fourth element is present but empty, your version number has a b without a beta number, which is the case you said in a comment that you want to handle.

One thing I'm not sure about is whether the data type that's returned is equally usable.
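
On that last point, one thing worth knowing is that splitting does no validation: any string is accepted, and the separators may appear in any order. The sketch below, which uses the hypothetical names SEPARATORS and split_version, shows what comes back for well-formed and malformed input.

```python
import re

SEPARATORS = re.compile('[.bv]')


def split_version(text):
    """Split on '.', 'b' and 'v'; order and count are not checked."""
    return SEPARATORS.split(text)


if __name__ == '__main__':
    print(split_version('v1.01b'))  # ['', '1', '01', '']
    print(split_version('1b2v3'))   # ['1', '2', '3']
    print(split_version('vx.y'))    # ['', 'x', 'y']
```

The caller therefore has to skip the leading empty string and check for itself that each remaining piece is a number.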

Question 16

Nice and short regex. Thx.

score 5 · Accepted Answer · 2016-04-04 06:31:46Z

Compile your regexp only once. Doing it each time you call VersionInfo kind of defeats the purpose of compiling it. You can move it out of the function and give it a capitalized name to indicate that it is a constant.
Use blank lines to separate logical sections of your code, especially after imports and before function definitions.
According to PEP 8, function names should be snake_case, not TitleCase.
