I have been given a text splitter class that takes a text input, uses re.sub to make replacements when matches are found, and splits the text into sentences that it stores in a list.
My idea to improve the code was to store the expressions in a list, iterate over that list, and look for a match in the given text for each expression. However, a different replacement is made depending on the expression, so I'm not sure how that would work.
What improvements could be made to simplify the code? Also, am I missing any limitations of the code that I could improve?
import re
alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|uk)"
digits = "([0-9])"
urls = "((http|https)\:\/\/)[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*"
class TextSplitter:
    @classmethod
    def split_into_sentences(self, text):
        text = " " + text + " "
        text = text.replace("\n"," ")
        for match in re.compile(urls).finditer(text):
            text = text.replace(match.group(0), match.group(0).replace('.', '<prd>'))
        text = re.sub(prefixes,"\\1<prd>",text)
        text = re.sub(websites,"<prd>\\1",text)
        text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
        if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
        text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
        text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
        text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
        text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
        text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
        text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
        text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
        if "”" in text: text = text.replace(".”","”.")
        if "\"" in text: text = text.replace(".\"","\".")
        if "!" in text: text = text.replace("!\"","\"!")
        if "?" in text: text = text.replace("?\"","\"?")
        # ellipses
        for match in re.compile(r"(\.)(\.+)").finditer(text):
            text = text.replace(match.group(0), (len(match.group(0)) * '<prd>') + '<stop>')
        text = text.replace(".",".<stop>")
        text = text.replace("?","?<stop>")
        text = text.replace("!","!<stop>")
        text = text.replace("<prd>",".")
        sentences = text.split("<stop>")
        sentences = [s.strip() for s in sentences]
        sentences = [s for s in sentences if not len(s) == 0]
        print(text)
        print(sentences)
        return sentences
text = "This is a test sentence! This is sentence two. I am sentence three?"
TextSplitter.split_into_sentences(text)
The code produces the following output:
This is a test sentence!<stop> This is sentence two.<stop> I am sentence three?<stop>
['This is a test sentence!', 'This is sentence two.', 'I am sentence three?']
1 Answer
Include spaces after commas. It's a virtually costless habit and it helps with code readability and editability.
Constants are typically named with uppercase. Your program begins with several constants. Format their names accordingly: ALPHABETS, PREFIXES, etc.
If you use the same value repeatedly, declare it as a constant. For example STOP = '<stop>' or PRD = '<prd>'. This is a good typo/bug-avoidance habit to adopt.
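As a small illustration, here is what a few of the later replacements could look like with those constants in place (the function name mark_stops is just a placeholder for this sketch):

STOP = '<stop>'
PRD = '<prd>'

def mark_stops(text):
    # Same replacements as in the original method, with the marker
    # strings spelled out once as constants instead of repeated literals.
    text = text.replace(".", "." + STOP)
    text = text.replace("?", "?" + STOP)
    text = text.replace("!", "!" + STOP)
    return text.replace(PRD, ".")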
Don't use classes without a reason. The simplest, most natural way to organize programs is with well-designed functions. Don't add classes to the picture unless you have a good reason. In the current code, TextSplitter isn't doing anything, as illustrated by the fact that self is never used. [A separate issue: the first argument to a classmethod is usually named cls, not self, because the method receives the class, not an instance, as its argument. In your case, staticmethod would have been more appropriate, since you never use self or cls. But since TextSplitter isn't doing anything, none of that matters.]
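For reference, a sketch of what the staticmethod variant would look like if you did keep the class (the body is elided; only the decorator and signature matter here):

class TextSplitter:
    @staticmethod
    def split_into_sentences(text):
        # No self or cls parameter: the method uses neither an
        # instance nor the class, so staticmethod is the better fit.
        ...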
Functions should be focused. Your function is doing two major things: (1) a bunch of search-replace operations, and (2) splitting into sentences. Split that behavior into two separate functions: for example, normalize_text() and split_into_sentences().
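A rough sketch of the first of those functions, assuming it reuses the module-level regex constants and simply returns the marked-up text (most of the substitutions are elided here, and split_into_sentences() is shown further below):

def normalize_text(text):
    # Apply the same marker substitutions as the original method,
    # but return the marked-up text instead of splitting it here.
    text = " " + text + " "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    # ... the remaining substitutions from the original code ...
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    return text.replace("<prd>", ".")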
Functions should return data, not print. Whenever it's practical to do so, functions should return data, not cause side-effects like printing. Do the printing elsewhere -- typically the outer edge of the program where there is as little algorithmic complexity as possible. Data-oriented functions can be conveniently tested and debugged. Side-effect-oriented functions are a hassle in those regards.
def main():
    text = '...'
    normalized = normalize_text(text)
    sentences = split_into_sentences(normalized)
    print(normalized)
    print(sentences)

if __name__ == '__main__':
    main()
Your conditional checks are not doing anything. They all check for a prerequisite that is already embedded in the replacement operation. They are the moral equivalent of if X in TEXT, replace X with Y. Just try to perform the replacement directly.
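For example, taking one of the checks from the posted code, the guarded and unguarded versions behave identically, because str.replace() is already a no-op when the substring is absent:

text = 'He said "Stop!"'

# Equivalent to: if "!" in text: text = text.replace("!\"", "\"!")
# The membership test adds nothing.
text = text.replace('!"', '"!')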
Alternative ways to split into sentences. Because empty strings are false, you don't need to check their length to reject them. Instead, you can just write [s for s in sentences if s]. Alternatively, and this is neither better nor worse, you could use the built-in filter():
STOP = '<stop>'

def split_into_sentences(text):
    m = map(str.strip, text.split(STOP))
    return list(filter(None, m))
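For instance, given text that already contains the <stop> markers:

sentences = split_into_sentences("One.<stop> Two.<stop> ")
# sentences == ['One.', 'Two.']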
Disclaimer. I did not think about the details of your regular expressions or the various text replacements. I don't have enough context about your broader goals to give specific advice on that front.
Should you try to generalize the replacement behavior? You asked about this and your question is a good one. For example, one could define a little Replacer class that would take various optional parameters like regex, search, and replace. The class would define its __call__(self, text) method to perform the appropriate type of replacing operation, depending on which parameters were provided. Then you would define a REPLACERS constant consisting of a sequence of those Replacer instances. And the normalize_text() function would just iterate over REPLACERS, calling each one with text as its argument: for r in REPLACERS: text = r(text). Note that with this callable-based implementation, any ordinary function that takes and returns text can also operate successfully as a "Replacer", so any unusual replacement needs can be defined via tiny utility functions.
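A minimal sketch of that idea, handling only the regex and literal search/replace cases (the parameter names and the two sample entries, taken from your prefixes and Ph.D. replacements, are illustrative rather than a finished design):

import re

class Replacer:
    # A callable that performs one replacement on a piece of text.
    def __init__(self, regex=None, search=None, replace=''):
        self.regex = re.compile(regex) if regex else None
        self.search = search
        self.replace = replace

    def __call__(self, text):
        # Use a regex substitution if a pattern was given,
        # otherwise fall back to a literal str.replace().
        if self.regex:
            return self.regex.sub(self.replace, text)
        return text.replace(self.search, self.replace)

REPLACERS = (
    Replacer(regex=r'(Mr|St|Mrs|Ms|Dr)[.]', replace='\\1<prd>'),
    Replacer(search='Ph.D.', replace='Ph<prd>D<prd>'),
    # ... the rest of the replacements, plus plain functions for unusual cases ...
)

def normalize_text(text):
    for r in REPLACERS:
        text = r(text)
    return text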
Is all of that worth the trouble? Maybe, if this code is part of a bigger project and you anticipate further growth and elaboration of the kinds of replacements occurring. Or maybe, if you want to try to implement it as a learning exercise. But absent those motivations, the normalize_text() function (after the other edits suggested above) is just a linear sequence of about 25 steps that is fairly easy to understand, so I probably would not bother with such a refactoring.