Cleaning up an utterance using multiple regex substitutions

Question 1

I need to apply several regex substitutions to a string (each one is described in the code comments). I wrote the following code, which works perfectly fine for my needs, but I would like to know whether there is a way to improve it.

Am I using re.compile correctly in this case?

import re

def regex(utterance):
    utterance = utterance.lower()
    # Replacing non ASCII characters with space
    message_ascii = re.compile(r'[^\x00-\x7F]+')
    message_ascii = message_ascii.sub(r' ', utterance)
    # If comma after number, replace comma with space
    message_comma_no = re.compile(r'(?<=[0-9]),')
    message_comma_no = message_comma_no.sub(r' ', message_ascii)
    # If comma after words, add space before and after
    message_comma_word = re.compile(r'(?<=[a-z]),')
    message_comma_word = message_comma_word.sub(r' , ', message_comma_no)
    # If "Dot and space" after word or number put space before and after
    message_dot = re.compile(r'(?<=[a-z0-9])[.] ')
    message_dot = message_dot.sub(r' . ', message_comma_word)
    # If any other punctuation found after word or number put space before and after
    message_punct = re.compile(r"(?<=[a-zA-Z0-9])(?=[?;!()'\"])|(?<=[?;!()'\"])(?=[a-zA-Z0-9])")
    message_punct = message_punct.sub(r' ', message_dot)
    # Remove excess whitespace
    message = ' '.join(message_punct.split())
    return message

Question 2

1. If you use a regular expression once, you don't get any performance improvement from compiling it. You could just use re.sub directly.
2. If a string doesn't contain any special characters, there's no point in using a raw literal: r' ' could be just ' '.
3. Using the same variable to represent different things is a bad practice. It confuses the people who read your code. It's not a good idea to do things like:
```
message_ascii = re.compile(r'[^\x00-\x7F]+')
message_ascii = message_ascii.sub(r' ', utterance)
```
because the same variable holds a compiled regex in the first line and is reassigned to a string in the next.
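
To make the points above concrete, here is a minimal sketch of the first two substitutions written that way. The function name `clean` and the variable names are made up for illustration and are not from the original code; the patterns are copied from the question.

```python
import re

def clean(utterance):
    """Apply the first two substitutions from the question (illustrative only)."""
    lowered = utterance.lower()
    # One-off use, so call re.sub directly instead of compiling first.
    # The patterns stay raw because they contain backslashes or lookbehinds;
    # the replacement is a plain ' ' since it has no special characters.
    ascii_only = re.sub(r'[^\x00-\x7F]+', ' ', lowered)
    # Each intermediate result gets its own name and always holds a string.
    without_digit_commas = re.sub(r'(?<=[0-9]),', ' ', ascii_only)
    return without_digit_commas

print(clean('Total: 1,000'))
```

The remaining substitutions would follow the same shape, each reading the previous result and writing to a new, descriptively named variable.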

If you call this function multiple times and want to benefit from pre-compiled regular expressions, you could create a new class that compiles the expressions in its constructor and reuses them:

class TextProcessor:
    def __init__(self):
        # Initialize the regular expressions here
        self.ascii_regex = re.compile(...)
        # Other expressions go here

    def process_text(self, text):
        ascii_text = self.ascii_regex.sub(' ', text)
        # The rest of the substitutions go here
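
For reference, a filled-in version of that skeleton might look like the sketch below. It assumes the five patterns from the question, applied in the same order; the attribute names are invented here and are not part of the original answer.

```python
import re

class TextProcessor:
    def __init__(self):
        # Every pattern is compiled exactly once, when the object is created.
        self.non_ascii_regex = re.compile(r'[^\x00-\x7F]+')
        self.comma_after_digit_regex = re.compile(r'(?<=[0-9]),')
        self.comma_after_word_regex = re.compile(r'(?<=[a-z]),')
        self.dot_space_regex = re.compile(r'(?<=[a-z0-9])[.] ')
        self.punct_boundary_regex = re.compile(
            r"(?<=[a-zA-Z0-9])(?=[?;!()'\"])|(?<=[?;!()'\"])(?=[a-zA-Z0-9])"
        )

    def process_text(self, text):
        # Each step works on the output of the previous one.
        text = text.lower()
        text = self.non_ascii_regex.sub(' ', text)
        text = self.comma_after_digit_regex.sub(' ', text)
        text = self.comma_after_word_regex.sub(' , ', text)
        text = self.dot_space_regex.sub(' . ', text)
        text = self.punct_boundary_regex.sub(' ', text)
        # Collapse runs of whitespace into single spaces.
        return ' '.join(text.split())


processor = TextProcessor()  # create once, reuse for every utterance
print(processor.process_text('Hello, world. Price is 1,000!'))
```

Creating the processor once and calling `process_text` repeatedly is what lets the compiled patterns be reused across calls.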

Question 3

Thanks. Regarding point 1: this function will be called again and again. Would it be right to use re.compile in that case, or should I just use re.sub?

Question 4

@InheritedGeek You recompile it every time the function is called, anyway. If you compile it once (say, in an object constructor) and then use the existing compiled object in a method every time it's called, it can make your code faster.

Question 5

@InheritedGeek Caching might also help you out here, see the note at docs.python.org/3/library/re.html#re.compile (no need to use re.compile for this though)
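
If you want to see how much the caching matters on your own machine, a rough comparison could look like the sketch below. The names `PATTERN`, `COMPILED` and `SAMPLE` and the two helper functions are made up for this example, and the timings will vary between machines and Python versions, so no numbers are quoted here.

```python
import re
import timeit

PATTERN = r'[^\x00-\x7F]+'
COMPILED = re.compile(PATTERN)   # compiled once, at import time
SAMPLE = 'caf\u00e9 cr\u00e8me'  # hypothetical input with non-ASCII characters

def with_module_function():
    # re.sub has to look the pattern up in the re module's cache on each call.
    return re.sub(PATTERN, ' ', SAMPLE)

def with_compiled_object():
    # The compiled object is used directly, with no lookup.
    return COMPILED.sub(' ', SAMPLE)

if __name__ == '__main__':
    for func in (with_module_function, with_compiled_object):
        seconds = timeit.timeit(func, number=100_000)
        print(f'{func.__name__}: {seconds:.3f} s')
```

Both functions return the same string; any difference in the timings comes from the cache lookup that the module-level function performs on every call.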

Question 6

I would create a list of regex patterns and then iterate over it like this:

import re

def regex(utterance):
    utterance = utterance.lower()
    regex_pattern = ["[^\x00-\x7F]+", "(?<=[0-9]),", "..."]
    for pattern in regex_pattern:
        message = re.compile(pattern)
        msg = message.sub(" ", utterance)
        ...
    return msg

Do you see what I mean? If you also want a different replacement for each pattern, I would create a dictionary like this:

regex_dict = {'[^\x00-\x7F]+': ' ', '(?<=[a-z]),': ' , '}

and then iterate over the regex_dict:

import re

def regex(utterance):
    utterance = utterance.lower()
    regex_dict = {'[^\x00-\x7F]+': ' ', '(?<=[a-z]),': ' , '}
    for key in regex_dict:
        message = re.compile(key)
        msg = message.sub(regex_dict[key], utterance)
        ...
    ...

It would be helpful to have some example utterances so that I could test this properly. Thanks.

Question 7

Here every substitution is applied to "utterance", i.e. the initial string, each time, whereas I want each substitution to be applied to the string as updated by the previous one.
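
One way to get that chaining while keeping the data-driven style is sketched below. The names `SUBSTITUTIONS` and `apply_substitutions` are made up for illustration, and only the first three patterns from the question are listed. A list of pairs is used instead of a dictionary so that the order of the substitutions is explicit.

```python
import re

# (compiled pattern, replacement) pairs, applied in this order.
SUBSTITUTIONS = [
    (re.compile(r'[^\x00-\x7F]+'), ' '),   # non-ASCII runs -> space
    (re.compile(r'(?<=[0-9]),'), ' '),     # comma after a digit -> space
    (re.compile(r'(?<=[a-z]),'), ' , '),   # comma after a letter -> spaced comma
]

def apply_substitutions(utterance):
    message = utterance.lower()
    for pattern, replacement in SUBSTITUTIONS:
        # Feed the result of the previous step into the next one.
        message = pattern.sub(replacement, message)
    # Collapse runs of whitespace into single spaces.
    return ' '.join(message.split())

print(apply_substitutions('Yes, 1,000'))
```

Because the patterns are compiled at module level, they are also built only once, no matter how often the function is called.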

kraskevich · Accepted Answer · 2017-06-26 09:24:19Z
