Separating a String of Text into Separate Words in Python

Question 1

Occasionally, we want to do a rudimentary parsing on English text; we separate the text into separate words.

# INPUT STRING 
istring = "she sat down beside the child and sang a melody like a wind in summer
blowing softly."
# OUTPUT LIST
olist = ["she", "sat", "down", "beside", "the", "child", "and", "sang", "a", "melody", "like", "a", "wind", "in", "summer", "blowing", "softly", "."]

I wrote some code, but the implementation is ugly.

I was hoping to make make_special_iter more readable to human beings.

Code

class StringPartitioner:
 def __init__(self, istring:str):
 self._istring = istring
 self._it = self.make_special_iter(istring, self._char_to_class)
 def __next__(self):
 return next(self._it)
 def __iter__(self):
 return self
 @classmethod
 def make_special_iter(cls, istring:str, char_to_class):
 assert(isinstance(istring, str))
 try:
 it = iter(istring)
 buffer = list()
 everything = list()
 ch = next(it)
 prev_class = char_to_class(ch)
 buffer.append(ch)
 for ch in it:
 everything.append(ch)
 cur_class = char_to_class(ch)
 if cur_class != prev_class:
 word = "".join(buffer)
 buffer = list()
 # print("word".ljust(30), repr(word))
 yield word
 buffer.append(ch)
 prev_class = cur_class
 yield "".join(buffer)
 except StopIteration:
 return
 def _char_to_class(self, ch:str):
 big_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
 sml_letters = big_letters.lower()
 letters = big_letters + sml_letters
 if ch in letters:
 return ord("A")
 return ord(ch)
 def __str__(self):
 return type(self).__name__ + "(" + repr(self._istring) + ")"

A Test Case

# Input for test one 
itest1 = """she sat down beside the child and sang a melody like a wind in summer blowing softly; a melody from wild woods and grassy plain; a melody of valleys loved by children. She sang a melody now lost, but for dreams; a melody
along the edges of oblivion; now flashing like stellars of the night,
a glimpse of some golden moment, now passing swiftly out of remembrance again.
Her melody danced upon the green like little shining feet. Then her song, faded into stillness."""
# correct output for test one 
otest1 = [
 'she',
 ' ',
 'sat',
 ' ',
 'down',
 ' ',
 'beside',
 ' ',
 'the',
 ' ',
 'child',
 ' ',
 'and',
 ' ',
 'sang',
 ' ',
 'a',
 ' ',
 'melody',
 ' ',
 'like',
 ' ',
 'a',
 ' ',
 'wind',
 ' ',
 'in',
 ' ',
 'summer',
 ' ',
 'blowing',
 ' ',
 'softly',
 ';',
 ' ',
 'a',
 ' ',
 'melody',
 ' ',
 'from',
 ' ',
 'wild',
 ' ',
 'woods',
 ' ',
 'and',
 ' ',
 'grassy',
 ' ',
 'plain',
 ';',
 ' ',
 'a',
 ' ',
 'melody',
 ' ',
 'of',
 ' ',
 'valleys',
 ' ',
 'loved',
 ' ',
 'by',
 ' ',
 'children',
 '.',
 ' ',
 'She',
 ' ',
 'sang',
 ' ',
 'a',
 ' ',
 'melody',
 ' ',
 'now',
 ' ',
 'lost',
 ',',
 ' ',
 'but',
 ' ',
 'for',
 ' ',
 'dreams',
 ';',
 ' ',
 'a',
 ' ',
 'melody',
 '\n',
 'along',
 ' ',
 'the',
 ' ',
 'edges',
 ' ',
 'of',
 ' ',
 'oblivion',
 ';',
 ' ',
 'now',
 ' ',
 'flashing',
 ' ',
 'like',
 ' ',
 'stellars',
 ' ',
 'of',
 ' ',
 'the',
 ' ',
 'night',
 ',',
 '\n',
 'a',
 ' ',
 'glimpse',
 ' ',
 'of',
 ' ',
 'some',
 ' ',
 'golden',
 ' ',
 'moment',
 ',',
 ' ',
 'now',
 ' ',
 'passing',
 ' ',
 'swiftly',
 ' ',
 'out',
 ' ',
 'of',
 ' ',
 'remembrance',
 ' ',
 'again',
 '.',
 '\n',
 'Her',
 ' ',
 'melody',
 ' ',
 'danced',
 ' ',
 'upon',
 ' ',
 'the',
 ' ',
 'green',
 ' ',
 'like',
 ' ',
 'little',
 ' ',
 'shining',
 ' ',
 'feet',
 '.',
 ' ',
 'Then',
 ' ',
 'her',
 ' ',
 'song',
 ',',
 ' ',
 'faded',
 ' ',
 'into',
 ' ',
 'stillness',
 '.'
]

Question 2

Is there a reason why you're writing your own instead of using NTLK? (github.com/nltk/nltk) The difference is that NTLK doesn't return the spaces, but, do you need those?

Question 3

@IsmaelMiguel I think you meant to write NLTK, but instead, you wrote the letters out of order as NTLK. Note that NLTK is a sub-sequence formed from the string of text NATURAL LANGUAGE TOOL KIT by deleting 21 letters. In general, the name of every python library every written in a sub-sequence of a fully-formed English phrase. Perhaps, some day, people can write import natural_language_tool_kit or write import nltk. People who are in a hurry write, import nltk. However, import natural_language_tool_kit is more self-documenting or self-explanatory.

Question 4

While your program is functional and solid, there are plenty of improvements to be made.

How I would solve it.

A better way to solve this would probably be:

import re
test_string = "she sat down beside the child and sang a melody like a wind in summer blowing softly."
print(re.split("(^A-Za-z)", test_string))

Outputs:

['she', ' ', ' sat', ' ', ' down', ' ', ' beside', ' ', ' the', ' ', ' child', ' ', ' and', ' ', ' sang', ' ', ' a', ' ', ' melody', ' ', ' like', ' ', ' a', ' ', ' wind', ' ', ' in', ' ', ' summer', ' ', ' blowing', ' ', ' softly', '.']

This program splits the string by anything not in the alphabet, and includes the delemeters.

Code review

The many variables

Having lots of variables decreases readability, and some variables, like everything don't seem to be doing much (I assume it was left over from development). I'd try to cut down on the amount of variables you're using, and make sure you're remaining ones are well named (e.g. it to string, etc.). However, the following changes will help you with that.

Your use of _char_to_class()

While it is clever the way you've used _char_to_class() to test if letters are in the alphabet, a more direct approach would be just to do:

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
if ch.lower() in ALPHABET:
 ...

or even just:

if ch.isalpha():
 ...

Your complex structure and generator expressions

for ch in it:
 everything.append(ch)
 cur_class = char_to_class(ch)
 if cur_class != prev_class:
 word = "".join(buffer)
 buffer = list()
 # print("word".ljust(30), repr(word))
 yield word
 buffer.append(ch)
 prev_class = cur_class
 yield "".join(buffer)

While your current structure is clever, it may just be easier to do something like:

word = ""
for ch in it:
 if ch.isalpha():
 word += ch
 else:
 yield word
 word = ""
 ...

Your assert block

assert(isinstance(istring, str))

While it is good to have an assert block where you've put it, the correct syntax for assert is assert <bool> or assert <bool>, <error-message> (no parentheses), so you'd have to do:

assert isinstance(istring, str)

However, I would just raise an error here instead. :

if not isinstance(istring, str):
 raise TypeError("Cannot use make_special_iter with non-string.")

Your use of list() In places like:

it = iter(istring)
buffer = list()
everything = list()

while it is technically correct to use list(), it's better to use [] for an empty list:

it = iter(istring)
buffer = []
everything = []

Your use of a class

While you can use a class for your make_special_iter, it isn't recommended. You could just use a method. And as the name make_special_iter wouldn't be as descriptive any more, I'd use something like partition_string instead.

Your use of generators

While it's easy (and slick) to use generators, I'd do something like:

words = []
word = ""
for ch in it:
 if ch.isalpha():
 word += ch
 else:
 words.append(word)
 word = ""
 ...
return words

But you don't need to (you could just use list()).

Full code

With generators

def partition_string(string):
 if not isinstance(string, str):
 raise TypeError("Cannot use make_special_iter with non-string.")
 word = ""
 for ch in string:
 if not ch.isalpha():
 yield word
 yield ch
 word = ""
 word += ch

test_str = "she sat down beside the child and sang a melody like a wind in summer blowing softly."
print(list(partition_string(test_str)))

With lists

def partition_string(string):
 if not isinstance(string, str):
 raise TypeError("Cannot use make_special_iter with non-string.")
 words = []
 word = ""
 for ch in string:
 if not ch.isalpha():
 words.append(word)
 words.append(ch)
 word = ""
 word += ch
 return words

test_str = "she sat down beside the child and sang a melody like a wind in summer blowing softly."
print(partition_string(test_str))

Both these give the same output:

['she', ' ', ' sat', ' ', ' down', ' ', ' beside', ' ', ' the', ' ', ' child', ' ', ' and', ' ', ' sang', ' ', ' a', ' ', ' melody', ' ', ' like', ' ', ' a', ' ', ' wind', ' ', ' in', ' ', ' summer', ' ', ' blowing', ' ', ' softly', '.']

These changes would make your code much easier to read. (I'd still use the code with the regex though.)

I hope this is useful, please comment for any improvements.

Question 5

That regular expression isn't inclusive enough. There are English words with accents, like "naïve" or other non-English letters like "façade". A regex like ([^\p{L}]+) using the regex module would probably be a better idea. The \p{L} indicates an Unicode letter. regular-expressions.info/unicode.html <-- Here's a list of other classes that maybe supported. pypi.org/project/regex <-- This is the regex module.

Question 6

`_char_to_class`

First of all, let's take a look at your _char_to_class method.

 def _char_to_class(self, ch:str):
 big_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
 sml_letters = big_letters.lower()
 letters = big_letters + sml_letters
 if ch in letters:
 return ord("A")
 return ord(ch)

big_letters and sml_letters and indeed letters are all members of the string stdlib module as string.ascii_uppercase, string.ascii_lowercase and string.ascii_letters respectively. So we don't need to define that ourselves.

The method also does not depend on self at all, and would be better as a @staticmethod, which says that this is a function which doesn't depend on its class, but it joined to it out of convenience.

Also, this method seems to be just reproducing the str.isalpha method (possibly in conjunction with str.isascii) so I'm not sure we need it at all.

`make_special_iter`

This method is declared as a @classmethod and yet doesn't rely on the class at all. A class method is a method which uses properties of the class, but not the instance, this uses neither and is probably a @staticmethod as it's currently written. In standard usage, it is taking the istring and char_to_class from self, and therefore should be converted into a regular method. Replacing istring with self._istring and char_to_class with self._char_to_class.

I also think the name make_special_iter is not terribly helpful as it doesn't reflect what's happening. split_words_punct might be more informative or split_by_char_class or similar. Examining it in more detail, however, what is actually happening is a group_by explicitly for strings.

You always want to minimise the scope of try blocks, as this makes debugging easier. In your case the only line which could raise a StopIteration is the call to next at the start. So we can make the try much tighter. In fact, I believe the only reason that that would fail is if the istring is empty. So why not check that first:

if not istring:
 return

Is much simpler and much more clear what we're checking and that also means we don't need a separate next call at all, we can set prev_class initially to an invalid value None, this also means we don't need to explicitly create the iter over string.

I also don't know why you are building everything when it appears to be unused? We can get rid of that.

That leaves us with:

 def make_special_iter(self):
 buffer = []
 prev_class = None
 for ch in self._istring:
 cur_class = self._char_to_class(ch)
 if cur_class != prev_class and prev_class is not None:
 word = "".join(buffer)
 buffer = []
 yield word
 buffer.append(ch)
 prev_class = cur_class
 yield "".join(buffer)

`iter`

Your iter method exhausts itself on its first call, which probably isn't what you want.

>>> X = tmp.StringPartitioner('Hello world')
X = tmp.StringPartitioner('Hello world')
>>> for i in X:
... print(i)
...
Hello
world
>>> for i in X:
... print(i)
>>>

You should probably make your __iter__ method reset your self._it.

 def __iter__(self):
 self._it = self.make_special_iter()
 return self

Though I'm not sure we want that either (see the end)

`init`

Finally, we should move our assert into when we create our StringPartitioner, however in my experience it is more common to do it via an if-raise rather than assert in Python

if not isinstance(istring, str):
 raise TypeError('Cannot construct StringPartitioner from non-string')

Overall

After all this, we're left with:

from string import ascii_letters as alphabet
class StringPartitioner:
 def __init__(self, istring: str):
 if not isinstance(istring, str):
 raise TypeError('Cannot construct StringPartitioner from non-string')
 self._istring = istring
 self._it = self.make_special_iter()
 def __next__(self):
 return next(self._it)
 def __iter__(self):
 self._it = self.make_special_iter()
 return self
 def make_special_iter(self):
 buffer = []
 prev_class = None
 for ch in self._istring:
 cur_class = self._char_to_class(ch)
 if cur_class != prev_class and prev_class is not None:
 word = "".join(buffer)
 buffer = []
 yield word
 buffer.append(ch)
 prev_class = cur_class
 yield "".join(buffer)
 @staticmethod
 def _char_to_class(ch: str):
 if ch in alphabet:
 return ord("A")
 return ord(ch)
 def __str__(self):
 return type(self).__name__ + "(" + repr(self._istring) + ")"

However, I would argue that this probably doesn't warrant being a class at all. make_special_iter is just a generator which takes a string and a constant function. I would say it stands alone perfectly well.

def split_on_char_class(string, comparison_func=char_to_class):
 buffer = []
 prev_class = None
 for ch in string:
 cur_class = comparison_func(ch)
 if cur_class != prev_class and prev_class is not None:
 word = "".join(buffer)
 buffer = []
 yield word
 buffer.append(ch)
 prev_class = cur_class
 yield "".join(buffer)

Question 7

Samuel, Would you consider an existing library, such as spaCy? SpaCy has a lemmatizer, which may do more then you need, depending on what you want to do with the words. Maybe you could have a poke around kaggle, or other parts of stackexchange, to see what the NLP folks are doing.

Question 8

@Simon Crase When you write, "see what the NLP folks are doing" am I correct in surmising that your sentence is equivalent in meaning to "see what the Natural Language Processing folks are doing"?

Question 9

@SamuelMuldoon Yes

Question 10

What you've implemented seems similar to itertools.groupby. Generally in python you're encouraged not to reinvent the wheel, so maybe you could try [list(g) for k, g in groupby(text,lambda x:x.isalpha())]? This also can ouput generators if needed.

sbottingota sbottingotasbottingota 1,1534 silver badges18 bronze badges · Answer 1 · 2023-04-08 06:05:09Z

While your program is functional and solid, there are plenty of improvements to be made.

How I would solve it.

A better way to solve this would probably be:

import re
test_string = "she sat down beside the child and sang a melody like a wind in summer blowing softly."
print(re.split("(^A-Za-z)", test_string))

Outputs:

['she', ' ', ' sat', ' ', ' down', ' ', ' beside', ' ', ' the', ' ', ' child', ' ', ' and', ' ', ' sang', ' ', ' a', ' ', ' melody', ' ', ' like', ' ', ' a', ' ', ' wind', ' ', ' in', ' ', ' summer', ' ', ' blowing', ' ', ' softly', '.']

This program splits the string by anything not in the alphabet, and includes the delemeters.

Code review

The many variables

Having lots of variables decreases readability, and some variables, like everything don't seem to be doing much (I assume it was left over from development). I'd try to cut down on the amount of variables you're using, and make sure you're remaining ones are well named (e.g. it to string, etc.). However, the following changes will help you with that.

Your use of _char_to_class()

While it is clever the way you've used _char_to_class() to test if letters are in the alphabet, a more direct approach would be just to do:

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
if ch.lower() in ALPHABET:
 ...

or even just:

if ch.isalpha():
 ...

Your complex structure and generator expressions

for ch in it:
 everything.append(ch)
 cur_class = char_to_class(ch)
 if cur_class != prev_class:
 word = "".join(buffer)
 buffer = list()
 # print("word".ljust(30), repr(word))
 yield word
 buffer.append(ch)
 prev_class = cur_class
 yield "".join(buffer)

While your current structure is clever, it may just be easier to do something like:

word = ""
for ch in it:
 if ch.isalpha():
 word += ch
 else:
 yield word
 word = ""
 ...

Your assert block

assert(isinstance(istring, str))

While it is good to have an assert block where you've put it, the correct syntax for assert is assert <bool> or assert <bool>, <error-message> (no parentheses), so you'd have to do:

assert isinstance(istring, str)

However, I would just raise an error here instead. :

if not isinstance(istring, str):
 raise TypeError("Cannot use make_special_iter with non-string.")

Your use of list() In places like:

it = iter(istring)
buffer = list()
everything = list()

while it is technically correct to use list(), it's better to use [] for an empty list:

it = iter(istring)
buffer = []
everything = []

Your use of a class

While you can use a class for your make_special_iter, it isn't recommended. You could just use a method. And as the name make_special_iter wouldn't be as descriptive any more, I'd use something like partition_string instead.

Your use of generators

While it's easy (and slick) to use generators, I'd do something like:

words = []
word = ""
for ch in it:
 if ch.isalpha():
 word += ch
 else:
 words.append(word)
 word = ""
 ...
return words

But you don't need to (you could just use list()).

Full code

With generators

def partition_string(string):
 if not isinstance(string, str):
 raise TypeError("Cannot use make_special_iter with non-string.")
 word = ""
 for ch in string:
 if not ch.isalpha():
 yield word
 yield ch
 word = ""
 word += ch

test_str = "she sat down beside the child and sang a melody like a wind in summer blowing softly."
print(list(partition_string(test_str)))

With lists

def partition_string(string):
 if not isinstance(string, str):
 raise TypeError("Cannot use make_special_iter with non-string.")
 words = []
 word = ""
 for ch in string:
 if not ch.isalpha():
 words.append(word)
 words.append(ch)
 word = ""
 word += ch
 return words

test_str = "she sat down beside the child and sang a melody like a wind in summer blowing softly."
print(partition_string(test_str))

Both these give the same output:

['she', ' ', ' sat', ' ', ' down', ' ', ' beside', ' ', ' the', ' ', ' child', ' ', ' and', ' ', ' sang', ' ', ' a', ' ', ' melody', ' ', ' like', ' ', ' a', ' ', ' wind', ' ', ' in', ' ', ' summer', ' ', ' blowing', ' ', ' softly', '.']

These changes would make your code much easier to read. (I'd still use the code with the regex though.)

I hope this is useful, please comment for any improvements.

That regular expression isn't inclusive enough. There are English words with accents, like "naïve" or other non-English letters like "façade". A regex like ([^\p{L}]+) using the regex module would probably be a better idea. The \p{L} indicates an Unicode letter. regular-expressions.info/unicode.html <-- Here's a list of other classes that maybe supported. pypi.org/project/regex <-- This is the regex module.

score 5 · Answer 2 · 2023-04-08 11:17:29Z

`_char_to_class`

First of all, let's take a look at your _char_to_class method.

 def _char_to_class(self, ch:str):
 big_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
 sml_letters = big_letters.lower()
 letters = big_letters + sml_letters
 if ch in letters:
 return ord("A")
 return ord(ch)

big_letters and sml_letters and indeed letters are all members of the string stdlib module as string.ascii_uppercase, string.ascii_lowercase and string.ascii_letters respectively. So we don't need to define that ourselves.

The method also does not depend on self at all, and would be better as a @staticmethod, which says that this is a function which doesn't depend on its class, but it joined to it out of convenience.

Also, this method seems to be just reproducing the str.isalpha method (possibly in conjunction with str.isascii) so I'm not sure we need it at all.

`make_special_iter`

This method is declared as a @classmethod and yet doesn't rely on the class at all. A class method is a method which uses properties of the class, but not the instance, this uses neither and is probably a @staticmethod as it's currently written. In standard usage, it is taking the istring and char_to_class from self, and therefore should be converted into a regular method. Replacing istring with self._istring and char_to_class with self._char_to_class.

I also think the name make_special_iter is not terribly helpful as it doesn't reflect what's happening. split_words_punct might be more informative or split_by_char_class or similar. Examining it in more detail, however, what is actually happening is a group_by explicitly for strings.

You always want to minimise the scope of try blocks, as this makes debugging easier. In your case the only line which could raise a StopIteration is the call to next at the start. So we can make the try much tighter. In fact, I believe the only reason that that would fail is if the istring is empty. So why not check that first:

if not istring:
 return

Is much simpler and much more clear what we're checking and that also means we don't need a separate next call at all, we can set prev_class initially to an invalid value None, this also means we don't need to explicitly create the iter over string.

I also don't know why you are building everything when it appears to be unused? We can get rid of that.

That leaves us with:

 def make_special_iter(self):
 buffer = []
 prev_class = None
 for ch in self._istring:
 cur_class = self._char_to_class(ch)
 if cur_class != prev_class and prev_class is not None:
 word = "".join(buffer)
 buffer = []
 yield word
 buffer.append(ch)
 prev_class = cur_class
 yield "".join(buffer)

`iter`

Your iter method exhausts itself on its first call, which probably isn't what you want.

>>> X = tmp.StringPartitioner('Hello world')
X = tmp.StringPartitioner('Hello world')
>>> for i in X:
... print(i)
...
Hello
world
>>> for i in X:
... print(i)
>>>

You should probably make your __iter__ method reset your self._it.

 def __iter__(self):
 self._it = self.make_special_iter()
 return self

Though I'm not sure we want that either (see the end)

`init`

Finally, we should move our assert into when we create our StringPartitioner, however in my experience it is more common to do it via an if-raise rather than assert in Python

if not isinstance(istring, str):
 raise TypeError('Cannot construct StringPartitioner from non-string')

Overall

After all this, we're left with:

from string import ascii_letters as alphabet
class StringPartitioner:
 def __init__(self, istring: str):
 if not isinstance(istring, str):
 raise TypeError('Cannot construct StringPartitioner from non-string')
 self._istring = istring
 self._it = self.make_special_iter()
 def __next__(self):
 return next(self._it)
 def __iter__(self):
 self._it = self.make_special_iter()
 return self
 def make_special_iter(self):
 buffer = []
 prev_class = None
 for ch in self._istring:
 cur_class = self._char_to_class(ch)
 if cur_class != prev_class and prev_class is not None:
 word = "".join(buffer)
 buffer = []
 yield word
 buffer.append(ch)
 prev_class = cur_class
 yield "".join(buffer)
 @staticmethod
 def _char_to_class(ch: str):
 if ch in alphabet:
 return ord("A")
 return ord(ch)
 def __str__(self):
 return type(self).__name__ + "(" + repr(self._istring) + ")"

However, I would argue that this probably doesn't warrant being a class at all. make_special_iter is just a generator which takes a string and a constant function. I would say it stands alone perfectly well.

def split_on_char_class(string, comparison_func=char_to_class):
 buffer = []
 prev_class = None
 for ch in string:
 cur_class = comparison_func(ch)
 if cur_class != prev_class and prev_class is not None:
 word = "".join(buffer)
 buffer = []
 yield word
 buffer.append(ch)
 prev_class = cur_class
 yield "".join(buffer)

Simon Crase Simon CraseSimon Crase 1312 bronze badges · Answer 3 · 2023-04-08 21:17:57Z

1

\$\begingroup\$

Samuel, Would you consider an existing library, such as spaCy? SpaCy has a lemmatizer, which may do more then you need, depending on what you want to do with the words. Maybe you could have a poke around kaggle, or other parts of stackexchange, to see what the NLP folks are doing.

Share

answered Apr 8, 2023 at 21:17

Simon Crase's user avatar

Simon Crase Simon CraseSimon Crase

1312 bronze badges

\$\endgroup\$

2

\$\begingroup\$ @Simon Crase When you write, "see what the NLP folks are doing" am I correct in surmising that your sentence is equivalent in meaning to "see what the Natural Language Processing folks are doing"? \$\endgroup\$

Samuel Muldoon
– Samuel Muldoon

2023年11月03日 14:26:24 +00:00
Commented Nov 3, 2023 at 14:26
\$\begingroup\$ @SamuelMuldoon Yes \$\endgroup\$

Simon Crase
– Simon Crase

2023年11月03日 18:59:27 +00:00
Commented Nov 3, 2023 at 18:59

Add a comment |

slow-jaunter slow-jaunterslow-jaunter 111 bronze badge · Answer 4 · 2023-04-09 09:38:17Z

What you've implemented seems similar to itertools.groupby. Generally in python you're encouraged not to reinvent the wheel, so maybe you could try [list(g) for k, g in groupby(text,lambda x:x.isalpha())]? This also can ouput generators if needed.

Stack Exchange Network

Separating a String of Text into Separate Words in Python

Code

A Test Case

4 Answers 4

How I would solve it.

Code review

Full code

`_char_to_class`

`make_special_iter`

`iter`

`init`

Overall

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Separating a String of Text into Separate Words in Python

Code

A Test Case

4 Answers 4

How I would solve it.

Code review

Full code

_char_to_class

make_special_iter

__iter__

__init__

Overall

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`_char_to_class`

`make_special_iter`

`iter`

`init`