I need to apply several regex substitutions (described in the code comments) and wrote the following code. It works fine for my needs, but is there a way to improve it?
Am I using re.compile correctly in this case?
import re

def regex(utterance):
    utterance = utterance.lower()
    # Replacing non ASCII characters with space
    message_ascii = re.compile(r'[^\x00-\x7F]+')
    message_ascii = message_ascii.sub(r' ', utterance)
    # If comma after number, replace comma with space
    message_comma_no = re.compile(r'(?<=[0-9]),')
    message_comma_no = message_comma_no.sub(r' ', message_ascii)
    # If comma after words, add space before and after
    message_comma_word = re.compile(r'(?<=[a-z]),')
    message_comma_word = message_comma_word.sub(r' , ', message_comma_no)
    # If "Dot and space" after word or number put space before and after
    message_dot = re.compile(r'(?<=[a-z0-9])[.] ')
    message_dot = message_dot.sub(r' . ', message_comma_word)
    # If any other punctuation found after word or number put space before and after
    message_punct = re.compile(r"(?<=[a-zA-Z0-9])(?=[?;!()'\"])|(?<=[?;!()'\"])(?=[a-zA-Z0-9])")
    message_punct = message_punct.sub(r' ', message_dot)
    # Remove Excess whitespaces
    message = ' '.join(message_punct.split())
    return message
2 Answers
1. If you use a regular expression only once, you don't get any performance improvement from compiling it. You could just use re.sub directly.
2. If a string doesn't contain any special characters, there's no point in using a raw literal: r' ' could be just ' '.
3. Using the same variable to represent different things is bad practice. It confuses the people who read your code. It's not a good idea to do things like:
message_ascii = re.compile(r'[^\x00-\x7F]+')
message_ascii = message_ascii.sub(r' ', utterance)
because the same variable holds a compiled regex in the first line and is reassigned to a string in the second.
4. If you call this function multiple times and want to benefit from pre-compiled regular expressions, you could create a new class that compiles the expressions in its constructor and reuses them:
class TextProcessor:
    def __init__(self):
        # Initialize the regular expressions here
        self.ascii_regex = re.compile(...)
        # Other expressions go here

    def process_text(self, text):
        ascii_text = self.ascii_regex.sub(' ', text)
        # The rest of the substitutions go here
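For illustration, here is one way the outline above could be filled in with the patterns from the question. This is only a sketch: the attribute names are made up for the example, and each step gets its own variable name so that no variable changes meaning along the way.

```python
import re


class TextProcessor:
    """Compiles each pattern once and reuses it on every call."""

    def __init__(self):
        # Patterns are copied from the question; the attribute names are illustrative.
        self.non_ascii = re.compile(r'[^\x00-\x7F]+')
        self.comma_after_digit = re.compile(r'(?<=[0-9]),')
        self.comma_after_word = re.compile(r'(?<=[a-z]),')
        self.dot_space = re.compile(r'(?<=[a-z0-9])[.] ')
        self.punct_boundary = re.compile(
            r"(?<=[a-zA-Z0-9])(?=[?;!()'\"])|(?<=[?;!()'\"])(?=[a-zA-Z0-9])"
        )

    def process_text(self, text):
        lowered = text.lower()
        ascii_only = self.non_ascii.sub(' ', lowered)
        no_digit_commas = self.comma_after_digit.sub(' ', ascii_only)
        spaced_commas = self.comma_after_word.sub(' , ', no_digit_commas)
        spaced_dots = self.dot_space.sub(' . ', spaced_commas)
        spaced_punct = self.punct_boundary.sub(' ', spaced_dots)
        # Collapse runs of whitespace into single spaces
        return ' '.join(spaced_punct.split())


processor = TextProcessor()
print(processor.process_text("Hello, world. I have 1,000 apples!"))
# prints: hello , world . i have 1 000 apples !
```

The processor would be created once and its process_text method called for every utterance, so the patterns are compiled only at construction time.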
- thanks, regarding point 1: this function will be called again and again. Would it be right to use re.compile in such a case, or should I just use re.sub? – Inherited Geek, Jun 26, 2017 at 9:26
- @InheritedGeek You recompile it every time the function is called, anyway. If you compile it once (say, in an object constructor) and then use the existing compiled object in a method every time it's called, it can make your code faster. – kraskevich, Jun 26, 2017 at 9:29
- @InheritedGeek Caching might also help you out here, see the note at docs.python.org/3/library/re.html#re.compile (no need to use re.compile for this though) – Sebastian Proske, Jun 26, 2017 at 9:31
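To illustrate the caching remark: the re module keeps the compiled versions of recently used patterns, so a function that passes the same few pattern strings to re.sub on every call does not pay the full compilation cost each time. A minimal sketch, assuming only the first three substitutions from the question are needed (the function name normalize is made up for the example):

```python
import re


def normalize(utterance):
    # re.sub compiles each pattern on first use; later calls with the same
    # pattern string reuse the module's cached compiled object.
    lowered = utterance.lower()
    ascii_only = re.sub(r'[^\x00-\x7F]+', ' ', lowered)
    no_digit_commas = re.sub(r'(?<=[0-9]),', ' ', ascii_only)
    spaced_commas = re.sub(r'(?<=[a-z]),', ' , ', no_digit_commas)
    # Collapse runs of whitespace into single spaces
    return ' '.join(spaced_commas.split())


print(normalize("Yes, I paid 1,000"))
# prints: yes , i paid 1 000
```

Each call still involves a cache lookup, so explicitly pre-compiled patterns may remain slightly faster in a tight loop; measuring with timeit on real input is the safest way to decide.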
I would create a list of regex patterns and then iterate over it, like this:
import re

def regex(utterance):
    utterance = utterance.lower()
    regex_pattern = ["[^\x00-\x7F]+", "(?<=[0-9]),", "..."]
    for pattern in regex_pattern:
        message = re.compile(pattern)
        msg = message.sub(" ", utterance)
        ...
    return msg
Do you see what I mean? If you also want to use a different replacement for each pattern, I would create a dictionary like this:
regex_dict = {'[^\x00-\x7F]+': ' ', '(?<=[a-z]),': ' , '}
and then iterate over the regex_dict:
import re

def regex(utterance):
    utterance = utterance.lower()
    regex_dict = {'[^\x00-\x7F]+': ' ', '(?<=[a-z]),': ' , '}
    for key in regex_dict:
        message = re.compile(key)
        msg = message.sub(regex_dict[key], utterance)
        ...
    ...
It would be helpful to have some example utterances so that I could test this fully. Thanks
- Here you will be iterating over "utterance", i.e. the initial string, each time, while I want to iterate over the updated string each time. – Inherited Geek, Jun 27, 2017 at 14:45
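One way to address the point raised in this comment is to feed the result of each substitution into the next one. The sketch below assumes a list of (pattern, replacement) pairs rather than a dictionary, so that the order of the substitutions is explicit; the names SUBSTITUTIONS and apply_substitutions are made up for the example, and only three of the question's patterns are shown.

```python
import re

# (compiled pattern, replacement) pairs, applied in this order
SUBSTITUTIONS = [
    (re.compile(r'[^\x00-\x7F]+'), ' '),   # non-ASCII runs -> space
    (re.compile(r'(?<=[0-9]),'), ' '),     # comma after a digit -> space
    (re.compile(r'(?<=[a-z]),'), ' , '),   # comma after a letter -> spaced comma
]


def apply_substitutions(utterance):
    message = utterance.lower()
    for pattern, replacement in SUBSTITUTIONS:
        # Each step works on the output of the previous step
        message = pattern.sub(replacement, message)
    # Collapse runs of whitespace into single spaces
    return ' '.join(message.split())


print(apply_substitutions("Yes, I paid 1,000"))
# prints: yes , i paid 1 000
```

Because the patterns are compiled at module level, they are built once at import time and reused on every call.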