Function to split strings on multiple delimiters

Question 1

I have this implementation of a split algorithm that, unlike the built-in .split() method, can be used with multiple delimiters. Is this a good way of implementing it (performance-wise)?

def split(str, delim=" "):
    index = 0
    string = ""
    array = []
    while index < len(str):
        if str[index] not in delim:
            string += str[index]
        else:
            if string:
                array.append(string)
                string = ""
        index += 1
    if string: array.append(string)
    return array

Using the standard .split() method:

>>> print "hello = 20".split()
['hello', '=', '20']
>>> print "one;two; abc; b ".split(";")
['one', 'two', ' abc', ' b ']

Using my implementation:

>>> print split("hello = 20")
['hello', '=', '20']
>>> print split("one;two; abc; b ", ";")
['one', 'two', ' abc', ' b ']

Multiple delimiters:

>>> print split("one;two; abc; b.e. b eeeeee.e.e;;e ;.", " .;")
['one', 'two', 'abc', 'b', 'e', 'b', 'eeeeee', 'e', 'e', 'e']
>>> print split("foo barfoo;bar;foo bar.foo", " .;")
['foo', 'barfoo', 'bar', 'foo', 'bar', 'foo']
>>> print split("foo*bar*foo.foo bar;", "*.")
['foo', 'bar', 'foo', 'foo bar;']

Note: something similar can be done with re.split().
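
As a rough illustration of the re.split() route, here is a minimal sketch. The name split_re is made up for this example, and it assumes delim is a non-empty string of single-character delimiters:

```python
import re

def split_re(s, delim=" "):
    # Hypothetical helper, not part of the original post.
    # Build a character class from the delimiters; re.escape guards
    # characters such as "." or "*" that are special in a regex.
    # Assumes delim is non-empty ("[]+" would not be a valid pattern).
    pattern = "[" + re.escape(delim) + "]+"
    # re.split can return empty strings at either end, so drop them.
    return [word for word in re.split(pattern, s) if word]

print(split_re("foo*bar*foo.foo bar;", "*."))
# ['foo', 'bar', 'foo', 'foo bar;']
```

Filtering out the empty strings mirrors the behaviour of the hand-written function, which never emits empty words.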

Question 2

There's no need to iterate using that while; a for loop is good enough.

Also string concatenation (+=) is expensive. It's better to use a list and join its elements at the end¹.

def split(s, delim=" "):
    words = []
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        else:
            if word:
                words.append(''.join(word))
                word = []
    if word:
        words.append(''.join(word))
    return words

As Maarten Fabré suggested, you could also ditch the words list and transform the function into a generator that iterates over (yields) each word. This saves some memory if you're examining only one word at a time and don't need all of them in one shot, for example when you're counting word frequency (collections.Counter(isplit(s))).

def isplit(s, delim=" "):  # iterator version
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        else:
            if word:
                yield ''.join(word)
                word = []
    if word:
        yield ''.join(word)

def split(*args, **kwargs):  # only converts the iterator to a list
    return list(isplit(*args, **kwargs))
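
To make the word-frequency use case concrete, here is a small sketch. It repeats the generator so that it runs on its own; the sample text is taken from the question, and the choice of most_common(2) is arbitrary:

```python
import collections

def isplit(s, delim=" "):
    # Same generator as above: yields one word at a time.
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        else:
            if word:
                yield ''.join(word)
                word = []
    if word:
        yield ''.join(word)

text = "foo barfoo;bar;foo bar.foo"  # sample input from the question
# Counter consumes the generator word by word; no intermediate list is built.
counts = collections.Counter(isplit(text, " .;"))
print(counts.most_common(2))  # [('foo', 3), ('bar', 2)]
```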

There's also a one-liner solution based on itertools.groupby:

import itertools

def isplit(s, delim=" "):  # iterator version
    # replace the outer parentheses (...) with brackets [...]
    # to transform the generator comprehension into a list comprehension
    # and return a list
    return (''.join(word)
            for is_word, word in itertools.groupby(s, lambda c: c not in delim)
            if is_word)

def split(*args, **kwargs):  # only converts the iterator to a list
    return list(isplit(*args, **kwargs))
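
To see why this works, it may help to look at what itertools.groupby produces before the delimiter runs are filtered out. The input string below is just an illustrative example:

```python
import itertools

s = "ab;;c"      # illustrative input
delim = ";"

# groupby bundles consecutive characters that share the same key;
# here the key is True for word characters and False for delimiters.
groups = [(is_word, ''.join(chars))
          for is_word, chars in itertools.groupby(s, lambda c: c not in delim)]
print(groups)  # [(True, 'ab'), (False, ';;'), (True, 'c')]

# Keeping only the groups whose key is True leaves the words.
words = [text for is_word, text in groups if is_word]
print(words)   # ['ab', 'c']
```

Because a whole run of delimiters collapses into a single False group, consecutive delimiters never produce empty words.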

¹ From https://wiki.python.org/moin/PythonSpeed: "String concatenation is best done with ''.join(seq) which is an O(n) process. In contrast, using the + or += operators can result in an O(n**2) process because new strings may be built for each intermediate step. The CPython 2.4 interpreter mitigates this issue somewhat; however, ''.join(seq) remains the best practice".

Question 3

It does not work properly. ['one', 'two', ' abc', ' b', 'e', [' ', 'b', ' ', 'b', ' ', 'b', ' ']]

Question 4

It should return: ['one', 'two', ' abc', ' b', 'e', ' b b b ']

Question 5

For what input?

Question 6

For this: "one;two; abc; b.e. b b b " with these delimiters ";.".
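
For reference, this input can be checked against the list-and-join version of split shown in the answer above. The sketch below repeats that function so it runs on its own; it says nothing about which revision of the code produced the nested list reported in the comment:

```python
def split(s, delim=" "):
    # List-and-join version, repeated here so the check is self-contained.
    words = []
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        else:
            if word:
                words.append(''.join(word))
                word = []
    if word:
        # Joining here as well keeps the last word a string, not a list.
        words.append(''.join(word))
    return words

result = split("one;two; abc; b.e. b b b ", ";.")
print(result)  # ['one', 'two', ' abc', ' b', 'e', ' b b b ']
```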

Question 7

Even more Pythonic would be to replace the words.append(''.join(word)) with yield ''.join(word), and omit the words list altogether.

Question 8

I would suggest caution if you're concerned about performance compared with the built-in split. I am fairly sure you would be replacing C code with Python code.
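
One way to check this on your own machine is to time both with the standard timeit module. This is only a rough sketch: split_chars is a name chosen here for a character-by-character implementation, and the sample text and repeat count are arbitrary.

```python
import timeit

def split_chars(s, delim=" "):
    # Pure-Python splitter (hypothetical name), used only for timing.
    words = []
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        else:
            if word:
                words.append(''.join(word))
                word = []
    if word:
        words.append(''.join(word))
    return words

text = "foo bar baz " * 1000  # arbitrary sample input

# Both approaches should agree on whitespace-separated input like this.
assert split_chars(text) == text.split()

builtin_time = timeit.timeit(lambda: text.split(), number=20)
custom_time = timeit.timeit(lambda: split_chars(text), number=20)
print("built-in: %.4fs, custom: %.4fs" % (builtin_time, custom_time))
```

The absolute numbers depend on the interpreter and the input, so treat them as a comparison on your own data rather than a general result.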

A couple of notes about your implementation:

You use the variable name str, which is also a built-in type; you should avoid that if possible.
Each time you loop around, you add a character, which really builds another string. Perhaps you could keep going until you find a delimiter and then add all of those characters at one time.
It might also be worth thinking about wrapping the built-in (i.e. calling it multiple times).
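
The last two notes could be sketched roughly as follows. Both function names are invented for this example, and split_builtin is only one possible reading of "wrapping the built-in":

```python
def split_slices(s, delim=" "):
    # Remember where the current word starts and copy it in one slice
    # when a delimiter is reached, instead of adding one character at a time.
    words = []
    start = 0
    for i, c in enumerate(s):
        if c in delim:
            if i > start:
                words.append(s[start:i])
            start = i + 1
    if start < len(s):
        words.append(s[start:])
    return words

def split_builtin(s, delim=" "):
    # Map every delimiter onto the first one, then let the built-in
    # str.split do the work; empty strings between delimiters are dropped.
    if not delim:
        return [s] if s else []
    first = delim[0]
    for d in delim[1:]:
        s = s.replace(d, first)
    return [word for word in s.split(first) if word]

print(split_slices("one;two; abc; b ", ";"))       # ['one', 'two', ' abc', ' b ']
print(split_builtin("foo*bar*foo.foo bar;", "*."))  # ['foo', 'bar', 'foo', 'foo bar;']
```

split_builtin makes one pass over the string per delimiter, so it is likely to pay off mainly when the delimiter set is small.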

Question 9

I'd like to add that choosing string for a variable name might hide the string module.

Accepted answer (listed above under "Question 2"): Cristian Ciupitu · 2014-04-19 03:23:03Z

