Occasionally, we want to do a rudimentary parsing on English text; we separate the text into separate words.
# INPUT STRING
istring = "she sat down beside the child and sang a melody like a wind in summer
blowing softly."
# OUTPUT LIST
olist = ["she", "sat", "down", "beside", "the", "child", "and", "sang", "a", "melody", "like", "a", "wind", "in", "summer", "blowing", "softly", "."]
I wrote some code, but the implementation is ugly.
I was hoping to make make_special_iter
more readable to human beings.
Code
class StringPartitioner:
def __init__(self, istring:str):
self._istring = istring
self._it = self.make_special_iter(istring, self._char_to_class)
def __next__(self):
return next(self._it)
def __iter__(self):
return self
@classmethod
def make_special_iter(cls, istring:str, char_to_class):
assert(isinstance(istring, str))
try:
it = iter(istring)
buffer = list()
everything = list()
ch = next(it)
prev_class = char_to_class(ch)
buffer.append(ch)
for ch in it:
everything.append(ch)
cur_class = char_to_class(ch)
if cur_class != prev_class:
word = "".join(buffer)
buffer = list()
# print("word".ljust(30), repr(word))
yield word
buffer.append(ch)
prev_class = cur_class
yield "".join(buffer)
except StopIteration:
return
def _char_to_class(self, ch:str):
big_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
sml_letters = big_letters.lower()
letters = big_letters + sml_letters
if ch in letters:
return ord("A")
return ord(ch)
def __str__(self):
return type(self).__name__ + "(" + repr(self._istring) + ")"
A Test Case
# Input for test one
itest1 = """she sat down beside the child and sang a melody like a wind in summer blowing softly; a melody from wild woods and grassy plain; a melody of valleys loved by children. She sang a melody now lost, but for dreams; a melody
along the edges of oblivion; now flashing like stellars of the night,
a glimpse of some golden moment, now passing swiftly out of remembrance again.
Her melody danced upon the green like little shining feet. Then her song, faded into stillness."""
# correct output for test one
otest1 = [
'she',
' ',
'sat',
' ',
'down',
' ',
'beside',
' ',
'the',
' ',
'child',
' ',
'and',
' ',
'sang',
' ',
'a',
' ',
'melody',
' ',
'like',
' ',
'a',
' ',
'wind',
' ',
'in',
' ',
'summer',
' ',
'blowing',
' ',
'softly',
';',
' ',
'a',
' ',
'melody',
' ',
'from',
' ',
'wild',
' ',
'woods',
' ',
'and',
' ',
'grassy',
' ',
'plain',
';',
' ',
'a',
' ',
'melody',
' ',
'of',
' ',
'valleys',
' ',
'loved',
' ',
'by',
' ',
'children',
'.',
' ',
'She',
' ',
'sang',
' ',
'a',
' ',
'melody',
' ',
'now',
' ',
'lost',
',',
' ',
'but',
' ',
'for',
' ',
'dreams',
';',
' ',
'a',
' ',
'melody',
'\n',
'along',
' ',
'the',
' ',
'edges',
' ',
'of',
' ',
'oblivion',
';',
' ',
'now',
' ',
'flashing',
' ',
'like',
' ',
'stellars',
' ',
'of',
' ',
'the',
' ',
'night',
',',
'\n',
'a',
' ',
'glimpse',
' ',
'of',
' ',
'some',
' ',
'golden',
' ',
'moment',
',',
' ',
'now',
' ',
'passing',
' ',
'swiftly',
' ',
'out',
' ',
'of',
' ',
'remembrance',
' ',
'again',
'.',
'\n',
'Her',
' ',
'melody',
' ',
'danced',
' ',
'upon',
' ',
'the',
' ',
'green',
' ',
'like',
' ',
'little',
' ',
'shining',
' ',
'feet',
'.',
' ',
'Then',
' ',
'her',
' ',
'song',
',',
' ',
'faded',
' ',
'into',
' ',
'stillness',
'.'
]
4 Answers 4
While your program is functional and solid, there are plenty of improvements to be made.
How I would solve it.
A better way to solve this would probably be:
import re
test_string = "she sat down beside the child and sang a melody like a wind in summer blowing softly."
print(re.split("(^A-Za-z)", test_string))
Outputs:
['she', ' ', ' sat', ' ', ' down', ' ', ' beside', ' ', ' the', ' ', ' child', ' ', ' and', ' ', ' sang', ' ', ' a', ' ', ' melody', ' ', ' like', ' ', ' a', ' ', ' wind', ' ', ' in', ' ', ' summer', ' ', ' blowing', ' ', ' softly', '.']
This program splits the string by anything not in the alphabet, and includes the delemeters.
Code review
The many variables
Having lots of variables decreases readability, and some variables, like everything
don't seem to be doing much (I assume it was left over from development). I'd try to cut down on the amount of variables you're using, and make sure you're remaining ones are well named (e.g. it
to string
, etc.). However, the following changes will help you with that.
Your use of _char_to_class()
While it is clever the way you've used _char_to_class()
to test if letters are in the alphabet, a more direct approach would be just to do:
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
if ch.lower() in ALPHABET:
...
or even just:
if ch.isalpha():
...
Your complex structure and generator expressions
for ch in it:
everything.append(ch)
cur_class = char_to_class(ch)
if cur_class != prev_class:
word = "".join(buffer)
buffer = list()
# print("word".ljust(30), repr(word))
yield word
buffer.append(ch)
prev_class = cur_class
yield "".join(buffer)
While your current structure is clever, it may just be easier to do something like:
word = ""
for ch in it:
if ch.isalpha():
word += ch
else:
yield word
word = ""
...
Your assert block
assert(isinstance(istring, str))
While it is good to have an assert block where you've put it, the correct syntax for assert is assert <bool>
or assert <bool>, <error-message>
(no parentheses), so you'd have to do:
assert isinstance(istring, str)
However, I would just raise an error here instead. :
if not isinstance(istring, str):
raise TypeError("Cannot use make_special_iter with non-string.")
Your use of list()
In places like:
it = iter(istring)
buffer = list()
everything = list()
while it is technically correct to use list()
, it's better to use []
for an empty list:
it = iter(istring)
buffer = []
everything = []
Your use of a class
While you can use a class for your make_special_iter
, it isn't recommended. You could just use a method. And as the name make_special_iter
wouldn't be as descriptive any more, I'd use something like partition_string
instead.
Your use of generators
While it's easy (and slick) to use generators, I'd do something like:
words = []
word = ""
for ch in it:
if ch.isalpha():
word += ch
else:
words.append(word)
word = ""
...
return words
But you don't need to (you could just use list()
).
Full code
With generators
def partition_string(string):
if not isinstance(string, str):
raise TypeError("Cannot use make_special_iter with non-string.")
word = ""
for ch in string:
if not ch.isalpha():
yield word
yield ch
word = ""
word += ch
test_str = "she sat down beside the child and sang a melody like a wind in summer blowing softly."
print(list(partition_string(test_str)))
With lists
def partition_string(string):
if not isinstance(string, str):
raise TypeError("Cannot use make_special_iter with non-string.")
words = []
word = ""
for ch in string:
if not ch.isalpha():
words.append(word)
words.append(ch)
word = ""
word += ch
return words
test_str = "she sat down beside the child and sang a melody like a wind in summer blowing softly."
print(partition_string(test_str))
Both these give the same output:
['she', ' ', ' sat', ' ', ' down', ' ', ' beside', ' ', ' the', ' ', ' child', ' ', ' and', ' ', ' sang', ' ', ' a', ' ', ' melody', ' ', ' like', ' ', ' a', ' ', ' wind', ' ', ' in', ' ', ' summer', ' ', ' blowing', ' ', ' softly', '.']
These changes would make your code much easier to read. (I'd still use the code with the regex though.)
I hope this is useful, please comment for any improvements.
-
\$\begingroup\$ That regular expression isn't inclusive enough. There are English words with accents, like "naïve" or other non-English letters like "façade". A regex like
([^\p{L}]+)
using theregex
module would probably be a better idea. The\p{L}
indicates an Unicode letter. regular-expressions.info/unicode.html <-- Here's a list of other classes that maybe supported. pypi.org/project/regex <-- This is theregex
module. \$\endgroup\$Ismael Miguel– Ismael Miguel2023年04月10日 19:53:04 +00:00Commented Apr 10, 2023 at 19:53
_char_to_class
First of all, let's take a look at your _char_to_class
method.
def _char_to_class(self, ch:str):
big_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
sml_letters = big_letters.lower()
letters = big_letters + sml_letters
if ch in letters:
return ord("A")
return ord(ch)
big_letters
and sml_letters
and indeed letters
are all members of the string
stdlib module as string.ascii_uppercase
, string.ascii_lowercase
and string.ascii_letters
respectively. So we don't need to define that ourselves.
The method also does not depend on self
at all, and would be better as a @staticmethod
, which says that this is a function which doesn't depend on its class, but it joined to it out of convenience.
Also, this method seems to be just reproducing the str.isalpha
method (possibly in conjunction with str.isascii
) so I'm not sure we need it at all.
make_special_iter
This method is declared as a @classmethod
and yet doesn't rely on the class at all. A class method is a method which uses properties of the class, but not the instance, this uses neither and is probably a @staticmethod
as it's currently written. In standard usage, it is taking the istring
and char_to_class
from self
, and therefore should be converted into a regular method. Replacing istring
with self._istring
and char_to_class
with self._char_to_class
.
I also think the name make_special_iter
is not terribly helpful as it doesn't reflect what's happening. split_words_punct
might be more informative or split_by_char_class
or similar. Examining it in more detail, however, what is actually happening is a group_by
explicitly for strings.
You always want to minimise the scope of try
blocks, as this makes debugging easier. In your case the only line which could raise a StopIteration
is the call to next
at the start. So we can make the try
much tighter. In fact, I believe the only reason that that would fail is if the istring
is empty. So why not check that first:
if not istring:
return
Is much simpler and much more clear what we're checking and that also means we don't need a separate next
call at all, we can set prev_class
initially to an invalid value None
, this also means we don't need to explicitly create the iter
over string.
I also don't know why you are building everything
when it appears to be unused? We can get rid of that.
That leaves us with:
def make_special_iter(self):
buffer = []
prev_class = None
for ch in self._istring:
cur_class = self._char_to_class(ch)
if cur_class != prev_class and prev_class is not None:
word = "".join(buffer)
buffer = []
yield word
buffer.append(ch)
prev_class = cur_class
yield "".join(buffer)
__iter__
Your iter
method exhausts itself on its first call, which probably isn't what you want.
>>> X = tmp.StringPartitioner('Hello world')
X = tmp.StringPartitioner('Hello world')
>>> for i in X:
... print(i)
...
Hello
world
>>> for i in X:
... print(i)
>>>
You should probably make your __iter__
method reset your self._it
.
def __iter__(self):
self._it = self.make_special_iter()
return self
Though I'm not sure we want that either (see the end)
__init__
Finally, we should move our assert
into when we create our StringPartitioner
, however in my experience it is more common to do it via an if-raise
rather than assert
in Python
if not isinstance(istring, str):
raise TypeError('Cannot construct StringPartitioner from non-string')
Overall
After all this, we're left with:
from string import ascii_letters as alphabet
class StringPartitioner:
def __init__(self, istring: str):
if not isinstance(istring, str):
raise TypeError('Cannot construct StringPartitioner from non-string')
self._istring = istring
self._it = self.make_special_iter()
def __next__(self):
return next(self._it)
def __iter__(self):
self._it = self.make_special_iter()
return self
def make_special_iter(self):
buffer = []
prev_class = None
for ch in self._istring:
cur_class = self._char_to_class(ch)
if cur_class != prev_class and prev_class is not None:
word = "".join(buffer)
buffer = []
yield word
buffer.append(ch)
prev_class = cur_class
yield "".join(buffer)
@staticmethod
def _char_to_class(ch: str):
if ch in alphabet:
return ord("A")
return ord(ch)
def __str__(self):
return type(self).__name__ + "(" + repr(self._istring) + ")"
However, I would argue that this probably doesn't warrant being a class at all. make_special_iter
is just a generator which takes a str
ing and a constant function. I would say it stands alone perfectly well.
def split_on_char_class(string, comparison_func=char_to_class):
buffer = []
prev_class = None
for ch in string:
cur_class = comparison_func(ch)
if cur_class != prev_class and prev_class is not None:
word = "".join(buffer)
buffer = []
yield word
buffer.append(ch)
prev_class = cur_class
yield "".join(buffer)
Samuel, Would you consider an existing library, such as spaCy? SpaCy has a lemmatizer, which may do more then you need, depending on what you want to do with the words. Maybe you could have a poke around kaggle, or other parts of stackexchange, to see what the NLP folks are doing.
-
\$\begingroup\$
@Simon Crase
When you write, "see what the NLP folks are doing
" am I correct in surmising that your sentence is equivalent in meaning to "see what the Natural Language Processing folks are doing
"? \$\endgroup\$Samuel Muldoon– Samuel Muldoon2023年11月03日 14:26:24 +00:00Commented Nov 3, 2023 at 14:26 -
\$\begingroup\$ @SamuelMuldoon Yes \$\endgroup\$Simon Crase– Simon Crase2023年11月03日 18:59:27 +00:00Commented Nov 3, 2023 at 18:59
What you've implemented seems similar to itertools.groupby
. Generally in python you're encouraged not to reinvent the wheel, so maybe you could try [list(g) for k, g in groupby(text,lambda x:x.isalpha())]
? This also can ouput generators if needed.
Explore related questions
See similar questions with these tags.
NLTK
, but instead, you wrote the letters out of order asNTLK
. Note thatNLTK
is a sub-sequence formed from the string of textNATURAL LANGUAGE TOOL KIT
by deleting 21 letters. In general, the name of every python library every written in a sub-sequence of a fully-formed English phrase. Perhaps, some day, people can writeimport natural_language_tool_kit
or writeimport nltk
. People who are in a hurry write,import nltk
. However,import natural_language_tool_kit
is more self-documenting or self-explanatory. \$\endgroup\$