use regex to separate a list of serial numbers into multiple lists with matched prefix

Question 1

The question comes from "How to separate a list of serial numbers into multiple lists with matched prefix?" on Stack Overflow.

Input:

sn = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']

Intended output:

# strings with the same prefix go in the same list, e.g.:
sn1 = ['bike-001', 'bike-002']
sn2 = ['car/001', 'car/002']
sn3 = ['bus/for/001', 'bus/for/002']

The original thread already had a brilliant answer using .startswith(<sub_str>); however, I would still like to solve the problem with regex.

Here's what I've tried: I use re.sub() to get the prefix and re.search() to get the three-digit serial number. I'd like to know if there is a better way (for example, a single regex call) to get the solution.

import re
sn = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']
sn_dict = {}
for item in sn:
    category = re.sub(r'\d{3}', "", item)
    number = re.search(r'\d{3}', item).group()
    if category not in sn_dict.keys():
        sn_dict[category] = []
    sn_dict[category].append(category + number)

After running the script we will have the following sn_dict:

{
 'bike-': ['bike-001', 'bike-002'], 
 'car/': ['car/001', 'car/002'], 
 'bus/for/': ['bus/for/001', 'bus/for/002']
}

Question 2

You can do this with re.findall. Instead of iterating over each string, you can combine all the serial numbers into one string and use regex to find all the matches (this implementation assumes there are no spaces in the serial numbers).

import re

string_list = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']
# Join everything into one space-separated string so each regex runs only once.
string = ' '.join(string_list)
matches = re.findall(r"([^0-9])", string)    # every non-digit character, spaces included
numbers = re.findall(r"([0-9]{3})", string)  # every run of three digits
prefixes = ''.join(matches).split()          # one prefix per serial number
result_dict = {}
for prefix, number in zip(prefixes, numbers):
    if prefix not in result_dict:
        result_dict[prefix] = []
    result_dict[prefix].append(prefix + number)

The first re.findall collects every character that is not a digit, including the separating spaces. The second finds every run of three digits. The next line joins the characters in matches back into one string; because the serial numbers were joined with ' ', splitting on whitespace recovers one prefix per serial number. Then, we use the same code present in your question to populate the result dictionary.
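To make the intermediate values concrete, here is a small trace of the same steps on a shortened two-item input. The shortened list is an assumption made purely for illustration; the regex calls are the ones used above.

```python
import re

string_list = ['bike-001', 'car/002']  # shortened input, for illustration only
string = ' '.join(string_list)         # 'bike-001 car/002'

matches = re.findall(r"([^0-9])", string)
numbers = re.findall(r"([0-9]{3})", string)
prefixes = ''.join(matches).split()

print(matches)   # ['b', 'i', 'k', 'e', '-', ' ', 'c', 'a', 'r', '/']
print(numbers)   # ['001', '002']
print(prefixes)  # ['bike-', 'car/']
```

Note that prefixes and numbers are paired up purely by position, so the approach relies on every serial number contributing exactly one prefix and exactly one three-digit run.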

Question 3

In terms of what we accept, it might be worth anchoring the digit-string to the end of the item with $.
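As a minimal sketch of the difference the anchor makes: the two compiled patterns and the malformed sample 'bike-0012x' below are assumptions chosen for illustration, not part of the original code. Without the anchor, a three-digit run anywhere in the item is accepted; with $, the digits have to come last.

```python
import re

UNANCHORED = re.compile(r'\d{3}')   # three digits anywhere in the item
ANCHORED = re.compile(r'\d{3}$')    # three digits at the very end

samples = ['bike-001', 'bike-0012x']  # the second item is a made-up malformed serial

# Map each item to (accepted without anchor, accepted with anchor).
results = {
    item: (UNANCHORED.search(item) is not None, ANCHORED.search(item) is not None)
    for item in samples
}
print(results)
# {'bike-001': (True, True), 'bike-0012x': (True, False)}
```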

The code appears to reimplement much of itertools.groupby. Assuming we don't care about order, we could easily rewrite it on top of groupby by sorting the input and passing a suitable key function.
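A rough sketch of the groupby idea might look like the following. The helper name prefix_of and its pattern are assumptions for illustration. groupby only merges adjacent items, which is why the input is sorted with the same key first, and why the original ordering of the prefixes is lost.

```python
import re
from itertools import groupby

sn = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']

def prefix_of(item):
    """Hypothetical key function: the item with its trailing three digits removed."""
    return re.sub(r'\d{3}$', '', item)

# Sort by prefix so equal prefixes are adjacent, then collect each group into a list.
groups = {
    prefix: list(items)
    for prefix, items in groupby(sorted(sn, key=prefix_of), key=prefix_of)
}
print(groups)
```

Because sorted() is stable, items within each group keep their relative input order; only the order of the groups themselves changes.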

Alternatively, write a more general split_to() function that accepts a key function in a similar manner to groupby, so we can separate the general mechanism from the particular instance we have here.
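One possible shape for such a function is sketched below. Only the name split_to comes from the suggestion above; the signature, the use of defaultdict, and the lambda key are assumptions.

```python
import re
from collections import defaultdict

def split_to(items, key):
    """Group items into lists keyed by key(item), in first-seen key order.

    The signature is an assumption; it mirrors the key= style of groupby.
    """
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)

sn = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']

# The serial-number specifics live entirely in the key function.
sn_dict = split_to(sn, key=lambda item: re.sub(r'\d{3}$', '', item))
print(sn_dict)
```

Unlike the sorted groupby approach, this needs no sort and keeps the prefixes in the order they first appear in the input.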

Question 4

As an alternative solution, you can avoid concatenating and recombining strings and instead just match the prefix and add each item to the dict directly.

import re

string_list = ['bike-001', 'bike-002', 'car/001', 'bus/for/001', 'car/002', 'bus/for/002']
result_dic = {}
for item in string_list:
    # The leading run of non-digit characters is the prefix.
    prefix = re.match(r"([^0-9]+)", item).group()
    if prefix not in result_dic:
        result_dic[prefix] = []
    result_dic[prefix].append(item)
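One caveat with the loop above: re.match returns None for an item that does not start with a non-digit character, so calling .group() on it would raise AttributeError. The variant below is a hedged sketch of one way to handle that. The SERIAL pattern, the unmatched list, and the extra sample '12345' are assumptions added for illustration.

```python
import re
from collections import defaultdict

# Assumed format: a non-digit prefix followed by exactly three trailing digits.
SERIAL = re.compile(r'(\D+)(\d{3})$')

string_list = ['bike-001', 'bike-002', 'car/001', 'bus/for/001',
               'car/002', 'bus/for/002', '12345']  # '12345' is a made-up bad item

result_dic = defaultdict(list)  # missing keys start out as empty lists
unmatched = []                  # items that do not fit the assumed format
for item in string_list:
    m = SERIAL.match(item)
    if m is None:
        unmatched.append(item)
        continue
    result_dic[m.group(1)].append(item)

print(dict(result_dic))
print(unmatched)  # ['12345']
```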

Ben A · Accepted Answer · 2021-03-02 15:28:54Z
