I have this implementation of the split algorithm. It differs from the built-in .split() method in that it accepts multiple delimiters. Is this a good way of implementing it (in terms of performance)?
def split(str, delim=" "):
    index = 0
    string = ""
    array = []
    while index < len(str):
        if str[index] not in delim:
            string += str[index]
        else:
            if string:
                array.append(string)
                string = ""
        index += 1
    if string: array.append(string)
    return array
Using the standard .split() method:
>>> print "hello = 20".split()
['hello', '=', '20']
>>> print "one;two; abc; b ".split(";")
['one', 'two', ' abc', ' b ']
Using my implementation:
>>> print split("hello = 20")
['hello', '=', '20']
>>> print split("one;two; abc; b ", ";")
['one', 'two', ' abc', ' b ']
Multiple delimiters:
>>> print split("one;two; abc; b.e. b eeeeee.e.e;;e ;.", " .;")
['one', 'two', 'abc', 'b', 'e', 'b', 'eeeeee', 'e', 'e', 'e']
>>> print split("foo barfoo;bar;foo bar.foo", " .;")
['foo', 'barfoo', 'bar', 'foo', 'bar', 'foo']
>>> print split("foo*bar*foo.foo bar;", "*.")
['foo', 'bar', 'foo', 'foo bar;']
Note: we can do something similar using re.split().
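For comparison, here is a minimal sketch of the re.split() approach mentioned above. The name split_re is hypothetical, and the sketch assumes delim is a non-empty string of single-character delimiters. Empty strings are filtered out so the result matches the behaviour shown in the examples.

```python
import re

def split_re(s, delim=" "):
    # Build a character class such as [\ \.;]+ from the delimiters.
    # re.escape protects characters that are special inside a regex.
    pattern = "[" + re.escape(delim) + "]+"
    # re.split leaves empty strings at the edges when the input starts
    # or ends with a delimiter, so drop them.
    return [word for word in re.split(pattern, s) if word]

print(split_re("foo*bar*foo.foo bar;", "*."))
# ['foo', 'bar', 'foo', 'foo bar;']
```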
2 Answers
There's no need to iterate using that while loop; a for loop is good enough.
Also, string concatenation (+=) is expensive. It's better to use a list and join its elements at the end (see footnote 1).
def split(s, delim=" "):
    words = []
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        else:
            if word:
                words.append(''.join(word))
                word = []
    if word:
        words.append(''.join(word))
    return words
As Maarten Fabré suggested, you could also ditch the words list and transform the function into a generator that yields each word. This saves some memory if you're examining only one word at a time and don't need all of them at once, for example when you're counting word frequency (collections.Counter(isplit(s))).
def isplit(s, delim=" "): # iterator version
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        else:
            if word:
                yield ''.join(word)
                word = []
    if word:
        yield ''.join(word)

def split(*args, **kwargs): # only converts the iterator to a list
    return list(isplit(*args, **kwargs))
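As a rough illustration of the word-frequency use case, the generator can be fed straight into collections.Counter without building the intermediate list. The sample input is borrowed from the question; isplit is repeated here only so the snippet runs on its own.

```python
import collections

def isplit(s, delim=" "):
    # Same generator as above: yield each word as soon as it is complete.
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        elif word:
            yield ''.join(word)
            word = []
    if word:
        yield ''.join(word)

# Counter consumes the words one at a time.
counts = collections.Counter(isplit("foo barfoo;bar;foo bar.foo", " .;"))
print(counts.most_common(2))
# [('foo', 3), ('bar', 2)]
```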
There's also a one-liner solution based on itertools.groupby:
import itertools

def isplit(s, delim=" "): # iterator version
    # replace the outer parentheses (...) with brackets [...]
    # to transform the generator expression into a list comprehension
    # and return a list
    return (''.join(word)
            for is_word, word in itertools.groupby(s, lambda c: c not in delim)
            if is_word)

def split(*args, **kwargs): # only converts the iterator to a list
    return list(isplit(*args, **kwargs))
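To see why the groupby version works, it may help to look at the raw groups before the delimiter runs are filtered out. The sample string below is made up for illustration. Consecutive characters with the same key end up in one group, so a run of several delimiters collapses into a single group that the `if is_word` filter then discards.

```python
import itertools

s = "a;; b"       # hypothetical sample input
delim = " ;"

# Each group is a run of characters that are all words or all delimiters.
groups = [(is_word, ''.join(chars))
          for is_word, chars in itertools.groupby(s, lambda c: c not in delim)]
print(groups)
# [(True, 'a'), (False, ';; '), (True, 'b')]
```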
1 From https://wiki.python.org/moin/PythonSpeed: "String concatenation is best done with ''.join(seq) which is an O(n) process. In contrast, using the + or += operators can result in an O(n**2) process because new strings may be built for each intermediate step. The CPython 2.4 interpreter mitigates this issue somewhat; however, ''.join(seq) remains the best practice".
- It does not work properly. ['one', 'two', ' abc', ' b', 'e', [' ', 'b', ' ', 'b', ' ', 'b', ' ']] – Victor Martins, Apr 19, 2014 at 3:28
- It should return: ['one', 'two', ' abc', ' b', 'e', ' b b b '] – Victor Martins, Apr 19, 2014 at 3:29
- For what input? – Cristian Ciupitu, Apr 19, 2014 at 3:29
- For this: "one;two; abc; b.e. b b b " with these delimiters ";.". – Victor Martins, Apr 19, 2014 at 3:29
- Even more pythonic would be to replace the words.append(''.join(word)) with yield ''.join(word), and omit the words list altogether. – Maarten Fabré, Mar 12, 2018 at 14:59
I would suggest caution if you're concerned about performance compared with the built-in split. I am fairly sure you would be replacing C code with Python code.
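One way to check this on your own machine is a small timeit comparison. This is only a sketch: the sample text and the repeat count are arbitrary assumptions, and the numbers will vary between interpreters and inputs, so no particular result is claimed here.

```python
import timeit

def split(s, delim=" "):
    # Pure-Python splitter along the lines of the one in the other answer.
    words, word = [], []
    for c in s:
        if c not in delim:
            word.append(c)
        elif word:
            words.append(''.join(word))
            word = []
    if word:
        words.append(''.join(word))
    return words

text = "hello = 20 " * 1000  # hypothetical sample input

# Time 20 calls of each approach on the same input.
pure_python = timeit.timeit(lambda: split(text), number=20)
builtin = timeit.timeit(lambda: text.split(), number=20)
print("pure Python: %.4fs, built-in: %.4fs" % (pure_python, builtin))
```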
A couple of notes about your implementation:
- You use the variable name str, which is also a built-in type; you should avoid that if possible.
- Each time you loop around you add a character, which really builds another string. Perhaps you could keep going until you find a delimiter and then add all those characters at once.
- It might also be worth thinking about wrapping the built-in (i.e. calling it multiple times).
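The last two notes could be sketched roughly as follows. Both function names are hypothetical, and both sketches assume delim is a non-empty string of single-character delimiters. The first slices each word out of the input in one step instead of growing it character by character; the second rewrites every delimiter to the first one and then lets the built-in split do the work.

```python
def split_slices(s, delim=" "):
    words = []
    start = None  # index where the current word began, None between words
    for i, c in enumerate(s):
        if c in delim:
            if start is not None:
                words.append(s[start:i])  # copy the whole word at once
                start = None
        elif start is None:
            start = i
    if start is not None:
        words.append(s[start:])
    return words

def split_builtin(s, delim=" "):
    first = delim[0]
    # Replace every other delimiter with the first one.
    for d in delim[1:]:
        s = s.replace(d, first)
    # The built-in split keeps empty strings between adjacent
    # delimiters, so filter them out.
    return [word for word in s.split(first) if word]

print(split_slices("one;two; abc; b ", ";"))
# ['one', 'two', ' abc', ' b ']
print(split_builtin("foo*bar*foo.foo bar;", "*."))
# ['foo', 'bar', 'foo', 'foo bar;']
```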
- I'd like to add that choosing string for a variable name might hide the string module. – Cristian Ciupitu, Apr 19, 2014 at 9:39