The goal of my code is, given a string, to separate the integer substrings from the substrings containing only alphabetical letters, and to return the individual substrings in a list.
For example:
- Input: 1h15min
- Output: ["1", "h", "15", "min"]
Here is the code:

    def separate_text(input_text):
        if not input_text:
            return None
        character_list = []
        current_substring = ""
        string_type = ""
        for character in input_text:
            if character in "0123456789":
                if string_type in ("Int", ""):
                    current_substring += character
                    if string_type == "":
                        string_type = "Int"
                else:
                    character_list.append(current_substring)
                    string_type = "Int"
                    current_substring = character
            elif character not in "0123456789":
                if string_type in ("Str", ""):
                    current_substring += character
                    if string_type == "":
                        string_type = "Str"
                else:
                    character_list.append(current_substring)
                    string_type = "Str"
                    current_substring = character
        character_list.append(current_substring)
        return character_list

The function works as follows:
- Take an input string (if empty, return None).
- Begin looping through the string, character by character.
  - If the character is a digit:
    - If string_type is "Int" or "", append the character to current_substring and change string_type if needed.
    - If string_type is "Str", append the substring to character_list, reassign current_substring to the character, and change string_type to "Int".
  - If the character is not a digit:
    - If string_type is "Str" or "", append the character to current_substring and change string_type if needed.
    - If string_type is "Int", append the substring to character_list, reassign current_substring to the character, and change string_type to "Str".
- At the end of the string, append current_substring to the list and return character_list.
My questions here are:
- Is there any way to make the program more Pythonic?
- Can I possibly shorten my code? The two main if statements in the for loop have nearly identical code.

Note that I do not count floats, so 1.15min returns ['1', '.', '15', 'min'].

2 Answers

The elif in

    if character in "0123456789":
        # ...
    elif character not in "0123456789":
        # ...

can be replaced by an else.

    if string_type in ("Int", ""):
        current_substring += character
        if string_type == "":
            string_type = "Int"
    else:
        character_list.append(current_substring)
        string_type = "Int"
        current_substring = character

can be simplified to

    if character in "0123456789":
        if string_type == "Str":
            character_list.append(current_substring)
            current_substring = ""
        string_type = "Int"
        current_substring += character

and similarly for the non-digit case. However, using strings ("", "Str", "Int") for the current state is error-prone, since a mistyped literal would go unnoticed.
An enumeration would be an alternative, or in this case simply a boolean:

    in_digits = None
    for character in input_text:
        if character in "0123456789":
            if in_digits is False:
                character_list.append(current_substring)
                current_substring = ""
            in_digits = True
            current_substring += character
        else:
            ...  # and similarly for the non-digit case

And now the two cases can be combined into one:
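
    in_digits = None
    for character in input_text:
        is_digit = character in "0123456789"
        # True only when a run is in progress and its kind differs from
        # this character; the initial None never compares equal.
        if in_digits == (not is_digit):
            character_list.append(current_substring)
            current_substring = ""
        in_digits = is_digit
        current_substring += character
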
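For comparison, the enumeration alternative mentioned above could look roughly like the following sketch. The names CharKind and kind_of are made up for illustration and do not appear in the original code, and returning an empty list for empty input is an assumption.

```python
from enum import Enum


class CharKind(Enum):
    """Hypothetical state type: the kind of run currently being collected."""
    DIGIT = "digit"
    OTHER = "other"


def kind_of(character):
    # Hypothetical helper: classify a single character.
    return CharKind.DIGIT if character in "0123456789" else CharKind.OTHER


def separate_text(input_text):
    string_list = []
    current_substring = ""
    current_kind = None  # no run started yet
    for character in input_text:
        kind = kind_of(character)
        # A change of kind closes the current run (never on the first character).
        if current_kind is not None and kind is not current_kind:
            string_list.append(current_substring)
            current_substring = ""
        current_kind = kind
        current_substring += character
    if current_substring:  # assumption: empty input should give an empty list
        string_list.append(current_substring)
    return string_list
```

Compared with the boolean, the enumeration is more verbose here, but it would extend more naturally if a third kind of run (say, whitespace) were ever needed.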
More remarks:

    if not input_text:
        return None

When called with an empty string, I would expect the function to return an empty list, not None.

    character_list = []

The variable name is slightly misleading: this is a list of strings, not of characters.
Summarizing the suggestions so far, we have

    def separate_text(input_text):
        string_list = []
        current_substring = ""
        in_digits = None
        for character in input_text:
            is_digit = character in "0123456789"
            if in_digits == (not is_digit):
                string_list.append(current_substring)
                current_substring = ""
            in_digits = is_digit
            current_substring += character
        # Skip the final append for empty input, so that [] is returned
        # rather than [""].
        if current_substring:
            string_list.append(current_substring)
        return string_list

Finally, note that the same can be achieved simply by splitting the string on a regular expression:

    import re

    def separate_text(input_text):
        # Raw string, so that \d is not treated as a string escape sequence.
        return [item for item in re.split(r'(\d+)', input_text) if item]

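To see why the list comprehension filters out empty items, it may help to look at what re.split returns on its own. The snippet below is only an illustration; the function separate_text_findall is a hypothetical alternative that is not part of the answer.

```python
import re

# With a capturing group, re.split keeps the matched digit runs in the result.
# A match at the very start (or end) of the input leaves an empty string
# there, which is why the answer filters with `if item`.
parts = re.split(r'(\d+)', "1h15min")
print(parts)  # ['', '1', 'h', '15', 'min']


# Alternative sketch (not from the answer): match the runs directly, so no
# filtering is needed. \d+ matches a run of digits, \D+ a run of non-digits.
def separate_text_findall(input_text):
    return re.findall(r'\d+|\D+', input_text)


print(separate_text_findall("1h15min"))  # ['1', 'h', '15', 'min']
```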

Your goal can also easily be achieved using itertools.groupby, which groups an iterable using some function (here str.isdigit):

    from itertools import groupby

    def separate_text(input_text):
        return ["".join(g) for _, g in groupby(input_text, key=str.isdigit)]

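As a rough illustration of what groupby produces before the join, the following sketch prints each key together with its group. The variable name runs is chosen here only for the example.

```python
from itertools import groupby

# groupby yields (key, group) pairs for consecutive items sharing the same
# key. Here the key is str.isdigit applied to each character.
runs = [(is_digit, "".join(group))
        for is_digit, group in groupby("1h15min", key=str.isdigit)]
for is_digit, text in runs:
    print(is_digit, text)
# True 1
# False h
# True 15
# False min

# Caveat: str.isdigit also accepts some non-ASCII characters such as "²",
# so results can differ from the `in "0123456789"` test on such input.
print(["".join(g) for _, g in groupby("m²", key=str.isdigit)])  # ['m', '²']
```

If only the ASCII digits should count, a key such as `lambda c: c in "0123456789"` would keep the behaviour of the original loop.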
Whether you want to use the one-line regex solution by @MartinR or this one-line itertools solution does not really matter (you should test whether the runtime differs for your usual inputs, though). What matters is that both are way easier to read (and write).
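One possible way to run such a comparison is with the standard timeit module. This is only a sketch: the function names and the sample input are placeholders, and the numbers will depend on your machine and data.

```python
import re
import timeit
from itertools import groupby


def separate_text_regex(input_text):
    return [item for item in re.split(r'(\d+)', input_text) if item]


def separate_text_groupby(input_text):
    return ["".join(g) for _, g in groupby(input_text, key=str.isdigit)]


sample = "1h15min" * 100  # hypothetical input; substitute your typical data

# Both versions should agree before their speed is compared.
assert separate_text_regex(sample) == separate_text_groupby(sample)

for func in (separate_text_regex, separate_text_groupby):
    seconds = timeit.timeit(lambda: func(sample), number=1000)
    print(f"{func.__name__}: {seconds:.4f} s for 1000 calls")
```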
You should also add a docstring describing what the function does:

    def separate_text(input_text):
        """Separate `input_text` into runs of digits and non-digits.

        Examples:
        >>> separate_text("1h15min")
        ['1', 'h', '15', 'min']
        >>> separate_text("1.15min")
        ['1', '.', '15', 'min']
        """
        ...

The usage examples I added have the advantage that you can use them as automatic unit tests via doctest. Just run the file with the doctest module loaded:

    $ python -m doctest -v path/to/separate_text.py
    Trying:
        separate_text("1h15min")
    Expecting:
        ['1', 'h', '15', 'min']
    ok
    Trying:
        separate_text("1.15min")
    Expecting:
        ['1', '.', '15', 'min']
    ok
    1 items had no tests:
        separate_text
    1 items passed all tests:
       2 tests in separate_text.separate_text
    2 tests in 2 items.
    2 passed and 0 failed.
    Test passed.
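
If you prefer not to pass -m doctest on the command line, the tests can also be triggered from inside the file. The following is a minimal sketch that assumes the groupby implementation from above as the function body.

```python
import doctest
from itertools import groupby


def separate_text(input_text):
    """Separate `input_text` into runs of digits and non-digits.

    Examples:
    >>> separate_text("1h15min")
    ['1', 'h', '15', 'min']
    >>> separate_text("1.15min")
    ['1', '.', '15', 'min']
    """
    return ["".join(g) for _, g in groupby(input_text, key=str.isdigit)]


if __name__ == "__main__":
    # testmod() runs every doctest in this module. It reports failures only,
    # and its return value carries the failed and attempted counts.
    results = doctest.testmod()
    print(results)
```

Running the file directly with python then executes the examples in the docstring.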