Separating integer substrings from those with alphabetical letters

Question 1

The goal of my code is, given a string, to separate the integer substrings from those containing solely alphabetical letters and return the individual substrings in a list.

For example:

Input: 1h15min
Output: ["1", "h", "15", "min"]

Here is the code:

def separate_text(input_text):
    if not input_text:
        return None
    character_list = []
    current_substring = ""
    string_type = ""
    for character in input_text:
        if character in "0123456789":
            if string_type in ("Int", ""):
                current_substring += character
                if string_type == "":
                    string_type = "Int"
            else:
                character_list.append(current_substring)
                string_type = "Int"
                current_substring = character
        elif character not in "0123456789":
            if string_type in ("Str", ""):
                current_substring += character
                if string_type == "":
                    string_type = "Str"
            else:
                character_list.append(current_substring)
                string_type = "Str"
                current_substring = character
    character_list.append(current_substring)
    return character_list

The function works as follows:

Take an input string (if empty, return None)
Begin looping through the string, character by character
If the character is a digit
- If string_type is Int or "", append the character to current_substring and change string_type if needed
- If string_type is Str, append current_substring to character_list, set current_substring to the character, and change string_type to "Int".
If the character is not a digit
- If string_type is Str or "", append the character to current_substring and change string_type if needed
- If string_type is Int, append current_substring to character_list, set current_substring to the character, and change string_type to "Str".
At the end of the string, append current_substring to character_list and return it.

My questions here are:

Is there any way to make the program more Pythonic?
Can I possibly shorten my code? The two main if statements in the for loop have nearly identical code.

Note that I do not treat floats specially, so 1.15min returns ['1', '.', '15', 'min'].

Answer 1

The elif in

if character in "0123456789":
    # ...
elif character not in "0123456789":
    # ...

can be replaced by an else.

if string_type in ("Int", ""):
    current_substring += character
    if string_type == "":
        string_type = "Int"
else:
    character_list.append(current_substring)
    string_type = "Int"
    current_substring = character

can be simplified to

if character in "0123456789":
    if string_type == "Str":
        character_list.append(current_substring)
        current_substring = ""
    string_type = "Int"
    current_substring += character

and similarly for the non-digits case. However, using strings ("", "Str", "Int") for the current state is error-prone. An enumeration would be an alternative, or in this case simply a boolean:

in_digits = None  # None until the first character has been seen
for character in input_text:
    if character in "0123456789":
        if in_digits is False:
            character_list.append(current_substring)
            current_substring = ""
        in_digits = True
        current_substring += character
    else:
        # ...

And now the two cases can be combined into one:

in_digits = None
for character in input_text:
    is_digit = character in "0123456789"
    if in_digits == (not is_digit):
        character_list.append(current_substring)
        current_substring = ""
    in_digits = is_digit
    current_substring += character
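
For comparison, the enumeration alternative mentioned above could look roughly like the following. This is only a sketch: the `CharKind` enum and the `classify` helper are names made up for illustration and do not appear in the original code.

```python
from enum import Enum


class CharKind(Enum):
    """Hypothetical state type: the kind of run we are currently in."""
    DIGIT = "digit"
    OTHER = "other"


def classify(character):
    """Hypothetical helper: map one character to its CharKind."""
    return CharKind.DIGIT if character in "0123456789" else CharKind.OTHER


def separate_text(input_text):
    string_list = []
    current_substring = ""
    current_kind = None  # None means no character has been seen yet
    for character in input_text:
        kind = classify(character)
        # Start a new substring whenever the kind of character changes.
        if current_kind is not None and kind is not current_kind:
            string_list.append(current_substring)
            current_substring = ""
        current_kind = kind
        current_substring += character
    if current_substring:
        string_list.append(current_substring)
    return string_list
```

A misspelled member such as `CharKind.DIGTI` raises an AttributeError immediately, whereas a misspelled string like "Itn" would simply never compare equal.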

More remarks:

if not input_text:
    return None

When called with an empty string, I would expect the function to return an empty list, not None.

character_list = []

The variable name is slightly misleading: this is a list of strings, not of characters.

Summarizing the suggestions so far, we have

def separate_text(input_text):
    string_list = []
    current_substring = ""
    in_digits = None
    for character in input_text:
        is_digit = character in "0123456789"
        if in_digits == (not is_digit):
            string_list.append(current_substring)
            current_substring = ""
        in_digits = is_digit
        current_substring += character
    # Only append a non-empty remainder, so that "" yields [] rather than [""].
    if current_substring:
        string_list.append(current_substring)
    return string_list

Finally, note that the same can be achieved simply by splitting the string on a regular expression:

import re

def separate_text(input_text):
    # Raw string, so that \d is not treated as a string escape sequence.
    return [item for item in re.split(r'(\d+)', input_text) if item]
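
The `if item` filter is needed because `re.split` with a capturing group returns an empty string whenever the text starts or ends with a match. As a possible alternative (not part of the original answer), `re.findall` with an alternation of digit runs and non-digit runs avoids the empty strings altogether. The function name `separate_text_findall` is made up for this sketch:

```python
import re


def separate_text_findall(input_text):
    # \d+ matches a run of digits, \D+ a run of non-digits.
    return re.findall(r"\d+|\D+", input_text)


print(re.split(r"(\d+)", "1h15min"))     # ['', '1', 'h', '15', 'min']
print(separate_text_findall("1h15min"))  # ['1', 'h', '15', 'min']
print(separate_text_findall(""))         # []
```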

Answer 2

Your goal can also easily be achieved using itertools.groupby, which groups an iterable using some function (here str.isdigit):

from itertools import groupby

def separate_text(input_text):
    return ["".join(g) for _, g in groupby(input_text, key=str.isdigit)]
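
One caveat: `str.isdigit` is broader than the `"0123456789"` test used in the question, because it also accepts characters such as the superscript `²`. If only ASCII digits should count, a key function along these lines could be used instead. The name `is_ascii_digit` and the optional `key` parameter are assumptions made for this sketch:

```python
from itertools import groupby
from string import digits  # the string "0123456789"


def is_ascii_digit(character):
    """Hypothetical key function: True only for the ASCII digits 0-9."""
    return character in digits


def separate_text(input_text, key=is_ascii_digit):
    return ["".join(group) for _, group in groupby(input_text, key=key)]


print(separate_text("x²3"))                   # ['x²', '3']
print(separate_text("x²3", key=str.isdigit))  # ['x', '²3']
```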

Whether you use the one-line regex solution by @MartinR or this one-line itertools solution does not really matter (you should test whether the runtime differs for your usual inputs, though). What matters is that both are far easier to read (and write) than the original loop.
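
A rough way to compare the two on your own inputs is `timeit`. The sketch below is only illustrative: the sample string and the repetition count are arbitrary, and the names `separate_regex` and `separate_groupby` are made up to keep both versions side by side.

```python
import re
import timeit
from itertools import groupby


def separate_regex(input_text):
    return [item for item in re.split(r"(\d+)", input_text) if item]


def separate_groupby(input_text):
    return ["".join(g) for _, g in groupby(input_text, key=str.isdigit)]


sample = "1h15min" * 100  # arbitrary test input; use your real data instead

# Both versions should agree before their speed is compared.
assert separate_regex(sample) == separate_groupby(sample)

for function in (separate_regex, separate_groupby):
    seconds = timeit.timeit(lambda: function(sample), number=1000)
    print(f"{function.__name__}: {seconds:.4f} s for 1000 calls")
```

The absolute numbers depend on the machine and the Python version, so only the relative difference on representative inputs is meaningful.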

You should also add a docstring describing what the function does:

def separate_text(input_text):
    """Separate `input_text` into runs of digits and non-digits.

    Examples:

    >>> separate_text("1h15min")
    ['1', 'h', '15', 'min']
    >>> separate_text("1.15min")
    ['1', '.', '15', 'min']
    """
    ...

The usage examples I added have the advantage that they double as automatic unit tests via doctest. Just run the doctest module on the file:

$ python -m doctest -v path/to/separate_text.py
Trying:
    separate_text("1h15min")
Expecting:
    ['1', 'h', '15', 'min']
ok
Trying:
    separate_text("1.15min")
Expecting:
    ['1', '.', '15', 'min']
ok
1 items had no tests:
    separate_text
1 items passed all tests:
   2 tests in separate_text.separate_text
2 tests in 2 items.
2 passed and 0 failed.
Test passed.
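
Alternatively, the examples can be checked whenever the file is executed directly, by calling `doctest.testmod()` under a `__main__` guard. A minimal sketch, assuming the `groupby` version as the function body:

```python
import doctest
from itertools import groupby


def separate_text(input_text):
    """Separate `input_text` into runs of digits and non-digits.

    Examples:

    >>> separate_text("1h15min")
    ['1', 'h', '15', 'min']
    >>> separate_text("1.15min")
    ['1', '.', '15', 'min']
    """
    return ["".join(g) for _, g in groupby(input_text, key=str.isdigit)]


if __name__ == "__main__":
    # Prints nothing when every example passes; pass verbose=True for details.
    doctest.testmod()
```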

Martin R · Accepted Answer · 2018-04-22 15:35:22Z
