The question comes from How to separate a list of serial numbers into multiple lists with matched prefix? on Stack Overflow.
Input:
sn = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']
Intended output:
# strings with the same prefix go in the same list, e.g.:
sn1 = ['bike-001', 'bike-002']
sn2 = ['car/001', 'car/002']
sn3 = ['bus/for/001', 'bus/for/002']
The original thread already had a brilliant answer using .startswith(<sub_str>), but I would still like to solve the problem with a regex.
Here's what I've tried: I use re.sub() to get the prefix and re.search() to get the 3-digit serial number. I'd like to know if there is a better way (such as a single regex call) to get the solution.
import re

sn = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']
sn_dict = {}
for item in sn:
    category = re.sub(r'\d{3}', "", item)
    number = re.search(r'\d{3}', item).group()
    if category not in sn_dict.keys():
        sn_dict[category] = []
    sn_dict[category].append(category + number)
After running the script, sn_dict contains:
{
    'bike-': ['bike-001', 'bike-002'],
    'car/': ['car/001', 'car/002'],
    'bus/for/': ['bus/for/001', 'bus/for/002']
}
3 Answers
You can do this with re.findall. Instead of iterating over each string, you can join all the serial numbers into one space-separated string and use a regex to find all the matches (this implementation assumes the serial numbers contain no whitespace).
import re

string_list = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']
string = ' '.join(string_list)
# Every non-digit character, including the separating spaces
matches = re.findall(r"([^0-9])", string)
# Every run of exactly three digits
numbers = re.findall(r"([0-9]{3})", string)
# Rebuild the prefixes by joining the characters and splitting on whitespace
prefixes = ''.join(c for c in matches).split()
result_dict = {}
for prefix, number in zip(prefixes, numbers):
    if prefix not in result_dict.keys():
        result_dict[prefix] = []
    result_dict[prefix].append(prefix + number)
The first re.findall collects every character that is not a digit. The second finds every run of three digits. The next line joins the characters in matches back into one string, and since we separated the serial numbers with ' ', calling split() recovers one prefix per serial number. Then we use the same code as in your question to populate the result dictionary.
In terms of what we accept, it might be worth anchoring the digit-string to the end of the item with $.
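A minimal sketch of what such anchoring could look like, assuming a serial number is a prefix ending in a non-digit followed by exactly three digits. The names SERIAL and split_serial are hypothetical and do not come from the original code:

```python
import re

# Assumed format: a prefix whose last character is not a digit,
# followed by exactly three digits at the end of the item.
SERIAL = re.compile(r"(.*\D)(\d{3})$")

def split_serial(item):
    """Return (prefix, number), or None if the item does not fit the assumed format."""
    match = SERIAL.match(item)
    return match.groups() if match else None

print(split_serial('bike-001'))   # ('bike-', '001')
print(split_serial('bike-001x'))  # None: the digits are not at the end
```

With the anchor in place, digits that appear earlier in the item (for example 'a4/001') stay part of the prefix instead of being mistaken for the serial number.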
The code looks to be reimplementing quite a lot of itertools.groupby. Assuming we don't care about order, we could easily rewrite it to build on that by sorting the input and passing a suitable key function.
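As a rough illustration of the groupby approach, assuming the key is "everything before the trailing digits" (prefix_of is a hypothetical helper name, not part of the original code):

```python
import re
from itertools import groupby

def prefix_of(item):
    # Assumed key: strip the trailing run of digits, keep the rest.
    return re.sub(r"\d+$", "", item)

sn = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']

# groupby only merges adjacent items, so the input must be sorted by the same key first.
groups = {
    prefix: list(items)
    for prefix, items in groupby(sorted(sn, key=prefix_of), key=prefix_of)
}
print(groups)
```

Because of the sort, the groups come out in prefix order rather than in the order the prefixes first appear in the input.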
Alternatively, write a more general split_to() function that accepts a key function in a similar manner to groupby, so we can separate the general mechanism from the particular instance we have here.
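One possible shape for such a function, as a sketch only: the signature of split_to() and the lambda used as the key are assumptions, since the suggestion above does not specify them.

```python
import re
from collections import defaultdict

def split_to(items, key):
    """Group items by key(item), keeping groups and items in first-seen order."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)

sn = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']

# The particular instance here: the key is the item with its trailing three digits removed.
by_prefix = split_to(sn, key=lambda item: re.sub(r"\d{3}$", "", item))
print(by_prefix)
```

Unlike the sorted groupby variant, this needs no sorting and preserves the input order, at the cost of holding every group in memory at once.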
As an alternative solution, you can avoid concatenating and recombining strings: match the prefix of each item and add the item to the dict directly.
import re

string_list = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']
result_dic = {}
for item in string_list:
    prefix = re.match("([^0-9]+)", item).group()
    if prefix not in result_dic:
        result_dic[prefix] = []
    result_dic[prefix].append(item)