I have been given a text splitter class that takes a text input, uses re.sub to make replacements when matches are found, and splits the text into sentences that it stores in a list.
My idea to improve the code was to store the expressions in a list, iterate over that list, and look for a match in the given text for each expression. However, a different replacement is made depending on the expression, so I'm not sure how that would work.
What improvements could be made to simplify the code? Also, am I missing any limitations of the code that I could improve?
import re
alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|uk)"
digits = "([0-9])"
urls = "((http|https)\:\/\/)[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*"
class TextSplitter:
    @classmethod
    def split_into_sentences(self, text):
        text = " " + text + " "
        text = text.replace("\n"," ")
        for match in re.compile(urls).finditer(text):
            text = text.replace(match.group(0), match.group(0).replace('.', '<prd>'))
        text = re.sub(prefixes,"\\1<prd>",text)
        text = re.sub(websites,"<prd>\\1",text)
        text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
        if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
        text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
        text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
        text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
        text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
        text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
        text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
        text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
        if "”" in text: text = text.replace(".”","”.")
        if "\"" in text: text = text.replace(".\"","\".")
        if "!" in text: text = text.replace("!\"","\"!")
        if "?" in text: text = text.replace("?\"","\"?")
        # ellipses
        for match in re.compile(r"(\.)(\.+)").finditer(text):
            text = text.replace(match.group(0), (len(match.group(0)) * '<prd>') + '<stop>')
        text = text.replace(".",".<stop>")
        text = text.replace("?","?<stop>")
        text = text.replace("!","!<stop>")
        text = text.replace("<prd>",".")
        sentences = text.split("<stop>")
        sentences = [s.strip() for s in sentences]
        sentences = [s for s in sentences if not len(s) == 0]
        print(text)
        print(sentences)
        return sentences
text = "This is a test sentence! This is sentence two. I am sentence three?"
TextSplitter.split_into_sentences(text)
The code produces the following output:
This is a test sentence!<stop> This is sentence two.<stop> I am sentence three?<stop>
['This is a test sentence!', 'This is sentence two.', 'I am sentence three?']
1 Answer
Include spaces after commas. It's a virtually costless habit and it helps with code readability and editability.
Constants are typically named with uppercase. Your program begins with several constants. Format their names accordingly: ALPHABETS, PREFIXES, etc.
If you use the same value repeatedly, declare it as a constant. For example STOP = '<stop>' or PRD = '<prd>'. This is a good typo/bug-avoidance habit to adopt.
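As a small illustration, here is what a few of the later replacements could look like with those constants in place (the function name mark_stops is just a placeholder for this sketch):

STOP = '<stop>'
PRD = '<prd>'

def mark_stops(text):
    # Same replacements as in the original method, with the marker
    # strings spelled out once as constants instead of repeated literals.
    text = text.replace(".", "." + STOP)
    text = text.replace("?", "?" + STOP)
    text = text.replace("!", "!" + STOP)
    return text.replace(PRD, ".")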
Don't use classes without a reason. The simplest, most natural way to organize programs is with well-designed functions. Don't add classes to the picture unless you have a good reason. In the current code, TextSplitter isn't doing anything, as illustrated by the fact that self is never used. [A separate issue: the first argument to a classmethod is usually named cls, not self, because the method receives the class, not an instance, as its argument. In your case, staticmethod would have been more appropriate, since you never use self or cls. But since TextSplitter isn't doing anything, none of that matters.]
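For reference, a sketch of what the staticmethod variant would look like if you did keep the class (the body is elided; only the decorator and signature matter here):

class TextSplitter:
    @staticmethod
    def split_into_sentences(text):
        # No self or cls parameter: the method uses neither an
        # instance nor the class, so staticmethod is the better fit.
        ...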
Functions should be focused. Your function is doing two major things: (1) a bunch of search-replace operations, and (2) splitting into sentences. Split that behavior into two separate functions: for example, normalize_text() and split_into_sentences().
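A rough sketch of the first of those functions, assuming it reuses the module-level regex constants and simply returns the marked-up text (most of the substitutions are elided here, and split_into_sentences() is shown further below):

def normalize_text(text):
    # Apply the same marker substitutions as the original method,
    # but return the marked-up text instead of splitting it here.
    text = " " + text + " "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    # ... the remaining substitutions from the original code ...
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    return text.replace("<prd>", ".")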
Functions should return data, not print. Whenever it's practical to do so, functions should return data, not cause side-effects like printing. Do the printing elsewhere -- typically the outer edge of the program where there is as little algorithmic complexity as possible. Data-oriented functions can be conveniently tested and debugged. Side-effect-oriented functions are a hassle in those regards.
def main():
    text = '...'
    normalized = normalize_text(text)
    sentences = split_into_sentences(normalized)
    print(normalized)
    print(sentences)

if __name__ == '__main__':
    main()
Your conditional checks are not doing anything. They all check for a prerequisite that is already embedded in the replacement operation. They are the moral equivalent of if X in TEXT, replace X with Y. Just try to perform the replacement directly.
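For example, taking one of the checks from the posted code, the guarded and unguarded versions behave identically, because str.replace() is already a no-op when the substring is absent:

text = 'He said "Stop!"'

# Equivalent to: if "!" in text: text = text.replace("!\"", "\"!")
# The membership test adds nothing.
text = text.replace('!"', '"!')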
Alternative ways to split into sentences. Because empty strings are false, you don't need to check their length to reject them. Instead, you can just write [s for s in sentences if s]. Alternatively, and this is neither better nor worse, you could use the built-in filter():
STOP = '<stop>'

def split_into_sentences(text):
    m = map(str.strip, text.split(STOP))
    return list(filter(None, m))
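For instance, given text that already contains the <stop> markers:

sentences = split_into_sentences("One.<stop> Two.<stop> ")
# sentences == ['One.', 'Two.']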
Disclaimer. I did not think about the details of your regular expressions or the various text replacements. I don't have enough context about your broader goals to give specific advice on that front.
Should you try to generalize the replacement behavior? You asked about this and your question is a good one. For example, one could define a little Replacer class that would take various optional parameters like regex, search, and replace. The class would define its __call__(self, text) method to perform the appropriate type of replacing operation, depending on which parameters were provided. Then you would define a REPLACERS constant consisting of a sequence of those Replacer instances. And the normalize_text() function would just iterate over REPLACERS, calling each one with text as its argument: for r in REPLACERS: text = r(text). Note that with this callable-based implementation, any ordinary function that takes and returns text can also operate successfully as a "Replacer", so any unusual replacement needs can be defined via tiny utility functions.
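A minimal sketch of that idea, handling only the regex and literal search/replace cases (the parameter names and the two sample entries, taken from your prefixes and Ph.D. replacements, are illustrative rather than a finished design):

import re

class Replacer:
    # A callable that performs one replacement on a piece of text.
    def __init__(self, regex=None, search=None, replace=''):
        self.regex = re.compile(regex) if regex else None
        self.search = search
        self.replace = replace

    def __call__(self, text):
        # Use a regex substitution if a pattern was given,
        # otherwise fall back to a literal str.replace().
        if self.regex:
            return self.regex.sub(self.replace, text)
        return text.replace(self.search, self.replace)

REPLACERS = (
    Replacer(regex=r'(Mr|St|Mrs|Ms|Dr)[.]', replace='\\1<prd>'),
    Replacer(search='Ph.D.', replace='Ph<prd>D<prd>'),
    # ... the rest of the replacements, plus plain functions for unusual cases ...
)

def normalize_text(text):
    for r in REPLACERS:
        text = r(text)
    return text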
Is all of that worth the trouble? Maybe, if this code is part of a bigger project and you anticipate further growth and elaboration of the kinds of replacements occurring. Or maybe, if you want to try to implement it as a learning exercise. But absent those motivations, the normalize_text() function (after the other edits suggested above) is just a linear sequence of about 25 steps that is fairly easy to understand, so I probably would not bother with such a refactoring.