3
\$\begingroup\$

I need to use regex for several text clean-up steps (described in the code comments) and wrote the following code. It works fine for my needs, but I would like to know whether it can be improved.

Am I using re.compile correctly in this case?

import re

def regex(utterance):
    utterance = utterance.lower()
    # Replacing non ASCII characters with space
    message_ascii = re.compile(r'[^\x00-\x7F]+')
    message_ascii = message_ascii.sub(r' ', utterance)
    # If comma after number, replace comma with space
    message_comma_no = re.compile(r'(?<=[0-9]),')
    message_comma_no = message_comma_no.sub(r' ', message_ascii)
    # If comma after words, add space before and after
    message_comma_word = re.compile(r'(?<=[a-z]),')
    message_comma_word = message_comma_word.sub(r' , ', message_comma_no)
    # If "Dot and space" after word or number put space before and after
    message_dot = re.compile(r'(?<=[a-z0-9])[.] ')
    message_dot = message_dot.sub(r' . ', message_comma_word)
    # If any other punctuation found after word or number put space before and after
    message_punct = re.compile(r"(?<=[a-zA-Z0-9])(?=[?;!()'\"])|(?<=[?;!()'\"])(?=[a-zA-Z0-9])")
    message_punct = message_punct.sub(r' ', message_dot)
    # Remove Excess whitespaces
    message = ' '.join(message_punct.split())
    return message
asked Jun 26, 2017 at 9:04
\$\endgroup\$

2 Answers

7
\$\begingroup\$
  1. If you use a regular expression once, you don't get any performance improvement from compiling it. You could just use re.sub directly.

  2. If a string literal doesn't contain any backslashes, there's no point in making it a raw literal.
    r' ' could be just ' '.

  3. Using the same variable to represent different things is a bad practice. It confuses the people who read your code. It's not a good idea to do things like:

    message_ascii = re.compile(r'[^\x00-\x7F]+')
    message_ascii = message_ascii.sub(r' ', utterance)
    

    because the same variable holds a compiled regex in the first line and it's reassigned to a string later on.
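The three points above can be sketched as follows. This is only an illustration: the function name `normalize_utterance` and the variable name `text` are assumptions, not part of the original code. The patterns and replacements are copied from the question; each one is passed straight to `re.sub`, and `text` only ever holds a string.

```python
import re


def normalize_utterance(utterance):
    """Apply the question's substitutions without explicit re.compile calls."""
    text = utterance.lower()
    # Replace runs of non-ASCII characters with a space.
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    # A comma right after a digit becomes a space.
    text = re.sub(r'(?<=[0-9]),', ' ', text)
    # A comma right after a letter gets a space on both sides.
    text = re.sub(r'(?<=[a-z]),', ' , ', text)
    # "Dot and space" after a letter or digit gets a space on both sides.
    text = re.sub(r'(?<=[a-z0-9])[.] ', ' . ', text)
    # Other punctuation touching a letter or digit is separated by a space.
    text = re.sub(
        r"(?<=[a-zA-Z0-9])(?=[?;!()'\"])|(?<=[?;!()'\"])(?=[a-zA-Z0-9])",
        ' ', text)
    # Collapse excess whitespace.
    return ' '.join(text.split())


print(normalize_utterance("Hello, World! It costs 1,000 dollars. Ok?"))
# hello , world ! it costs 1 000 dollars . ok ?
```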

If you call this function multiple times and want to benefit from pre-compiled regular expressions, you could create a new class that compiles the expressions in its constructor and reuses them:

class TextProcessor:
    def __init__(self):
        # Initialize the regular expressions here
        self.ascii_regex = re.compile(...)
        # Other expressions go here

    def process_text(self, text):
        ascii_text = self.ascii_regex.sub(' ', text)
        # The rest of the substitutions go here
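For illustration, here is one way the outline could be filled in with the patterns from the question. Apart from `ascii_regex`, the attribute names are hypothetical, and the `lower()` call and whitespace clean-up are carried over from the question's function rather than from the outline, so treat this as a sketch and not the definitive implementation.

```python
import re


class TextProcessor:
    def __init__(self):
        # Each pattern is compiled once, when the object is created.
        self.ascii_regex = re.compile(r'[^\x00-\x7F]+')
        # Hypothetical attribute names; patterns are taken from the question.
        self.digit_comma_regex = re.compile(r'(?<=[0-9]),')
        self.word_comma_regex = re.compile(r'(?<=[a-z]),')
        self.dot_regex = re.compile(r'(?<=[a-z0-9])[.] ')
        self.punct_regex = re.compile(
            r"(?<=[a-zA-Z0-9])(?=[?;!()'\"])|(?<=[?;!()'\"])(?=[a-zA-Z0-9])")

    def process_text(self, text):
        # Reuse the precompiled patterns on every call.
        text = text.lower()
        text = self.ascii_regex.sub(' ', text)
        text = self.digit_comma_regex.sub(' ', text)
        text = self.word_comma_regex.sub(' , ', text)
        text = self.dot_regex.sub(' . ', text)
        text = self.punct_regex.sub(' ', text)
        # Collapse excess whitespace.
        return ' '.join(text.split())


processor = TextProcessor()
print(processor.process_text("It costs 1,000 dollars. Ok?"))
# it costs 1 000 dollars . ok ?
```

The object is created once and `process_text` is then called as often as needed, so no pattern is compiled more than once.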
answered Jun 26, 2017 at 9:24
\$\endgroup\$
3
  • \$\begingroup\$ thanks, regarding point 1 this function will be called again and again, will it be right to use re.compile in such a case or should I just use re.sub? \$\endgroup\$ Commented Jun 26, 2017 at 9:26
  • \$\begingroup\$ @InheritedGeek As written, the function calls re.compile every time it is called anyway. If you compile the pattern once (say, in an object constructor) and then use the existing compiled object in a method each time it's called, it can make your code faster. \$\endgroup\$ Commented Jun 26, 2017 at 9:29
  • 1
    \$\begingroup\$ @InheritedGeek Caching might also help you out here, see the note at docs.python.org/3/library/re.html#re.compile (no need to use re.compile for this though) \$\endgroup\$ Commented Jun 26, 2017 at 9:31
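The caching mentioned in the last comment can be checked with a rough timing sketch. The names `PATTERN`, `COMPILED`, `SAMPLE` and the two helper functions are made up for this example, and the timings depend on the machine and Python version, so no expected numbers are shown.

```python
import re
import timeit

PATTERN = r'(?<=[0-9]),'
COMPILED = re.compile(PATTERN)
SAMPLE = 'it costs 1,000 or 2,500'


def with_module_function(text):
    # re.sub reuses the compiled pattern from the re module's internal cache.
    return re.sub(PATTERN, ' ', text)


def with_precompiled(text):
    # The pattern object was compiled once at import time.
    return COMPILED.sub(' ', text)


if __name__ == '__main__':
    for func in (with_module_function, with_precompiled):
        seconds = timeit.timeit(lambda: func(SAMPLE), number=100_000)
        print(f'{func.__name__}: {seconds:.3f}s')
```

Both helpers return the same string; any difference in timing comes from the cache lookup that `re.sub` performs on each call.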
3
\$\begingroup\$

I would create a list of regex patterns and then iterate over it, like this:

import re

def regex(utterance):
    utterance = utterance.lower()
    regex_pattern = ["[^\x00-\x7F]+", "(?<=[0-9]),", "..."]
    for pattern in regex_pattern:
        message = re.compile(pattern)
        msg = message.sub(" ", utterance)
        ...
    return msg

Do you see what I mean? If you also want a different replacement string for each pattern, I would create a dictionary like this:

regex_dict = {'[^\x00-\x7F]+': ' ', '(?<=[a-z]),': ' , '}

and then iterate over the regex_dict:

import re

def regex(utterance):
    utterance = utterance.lower()
    regex_dict = {'[^\x00-\x7F]+': ' ', '(?<=[a-z]),': ' , '}
    for key in regex_dict:
        message = re.compile(key)
        msg = message.sub(regex_dict[key], utterance)
        ...
    ...

It would be helpful to have some example utterances so that I could test this properly. Thanks.

answered Jun 26, 2017 at 9:20
\$\endgroup\$
1
  • 1
    \$\begingroup\$ Here you will be applying each pattern to "utterance", i.e. the initial string, every time, whereas I want each pattern to be applied to the updated string. \$\endgroup\$ Commented Jun 27, 2017 at 14:45
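A minimal sketch of the loop with that concern addressed: each substitution is applied to the result of the previous one. The names `SUBSTITUTIONS` and `apply_substitutions` are hypothetical, only the first three patterns from the question are listed, and the final whitespace clean-up is borrowed from the question's code.

```python
import re

# (pattern, replacement) pairs, applied in this order.
SUBSTITUTIONS = [
    (r'[^\x00-\x7F]+', ' '),
    (r'(?<=[0-9]),', ' '),
    (r'(?<=[a-z]),', ' , '),
]


def apply_substitutions(utterance):
    message = utterance.lower()
    for pattern, replacement in SUBSTITUTIONS:
        # Feed the output of the previous step into the next one.
        message = re.sub(pattern, replacement, message)
    # Collapse excess whitespace.
    return ' '.join(message.split())


print(apply_substitutions("Hello, it costs 1,000"))
# hello , it costs 1 000
```

A list of tuples is used instead of a dictionary here so that the order of the substitutions is explicit.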
