For educational purposes, I am preprocessing multiple short texts that describe the symptoms of car faults. The texts are written by humans and are full of misspellings, capital letters and other noise.
I wanted to write a short preprocessing function, and I have three questions:
1. Why do I get two different results depending on how I format the re.escape() call? (The first version below is the correct one.)
2. Can I switch to f-string formatting in this expression: re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
3. Is there any way to improve the readability and performance of this code?
    example = "This, is just an example! Nothing serious :) "

    #convert to lowercase, strip and remove punctuations
    def preprocess(text):
        """convert to lowercase, strip and remove punctuations"""
        text = text.lower()
        text=text.strip()
        text=re.compile('<.*?>').sub('', text)
        text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
        text = re.sub('\s+', ' ', text)
        text = re.sub(r'\[[0-9]*\]',' ',text)
        text=re.sub(r'[^\w\s]', '', str(text).lower().strip())
        text = re.sub(r'\d',' ',text)
        text = re.sub(r'\s+',' ',text)
        return text
The wrong one:
    #convert to lowercase, strip and remove punctuations
    def preprocess(text):
        """convert to lowercase, strip and remove punctuations"""
        text = text.lower()
        text=text.strip()
        text=re.compile('<.*?>').sub('', text)
        escaping = re.escape(string.punctuation)
        test = re.compile('[{}s]'.format(escaping)).sub(' ',text)
        text = re.sub('\s+', ' ', text)
        text = re.sub(r'\[[0-9]*\]',' ',text)
        text=re.sub(r'[^\w\s]', '', str(text).lower().strip())
        text = re.sub(r'\d',' ',text)
        text = re.sub(r'\s+',' ',text)
        return text
1 Answer
To be PEP 8 compliant, you may wish to review your spacing. Specifically, text=text.strip() should become text = text.strip(), with spaces surrounding the assignment operator. This is done in some places in your code but not in others; I would recommend consistency.
Some parts of your code are redundant. In the statement text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text) you are replacing square brackets (along with every other punctuation character) with spaces. In a following line, text = re.sub(r'\[[0-9]*\]',' ',text), you are removing digits which are surrounded by square brackets. But since the square brackets are already gone by then, this pattern will never match anything!
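A minimal sketch of that redundancy, using a made-up sample string (`sample` is an assumption for illustration, not taken from the question): once the punctuation step has run, the bracketed-number pattern has nothing left to find.

```python
import re
import string

# Hypothetical input containing a bracketed number.
sample = "error code [123] on dashboard"

# Step 1: replace every punctuation character (including '[' and ']') with a space.
no_punct = re.sub(f"[{re.escape(string.punctuation)}]", " ", sample)

# Step 2: the bracketed-number pattern can no longer match.
bracket_match = re.search(r"\[[0-9]*\]", no_punct)

print(repr(no_punct))   # 'error code  123  on dashboard'
print(bracket_match)    # None
```

On the untouched `sample` the same pattern does match, so the step only becomes dead code because of its position after the punctuation replacement.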
Also, be aware that \ is an escape character in Python string literals. When you wish to use it as itself within a regular expression, it must itself be escaped (\\) or a raw string must be used. This applies to this line of code: text = re.sub('\s+', ' ', text)
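As a small illustration (the sample sentence is invented): a raw string and a doubly escaped normal string spell the same pattern. The plain `'\s+'` still runs, because Python leaves unrecognized escape sequences unchanged, but recent Python versions emit a warning for it, so the raw form is the safer habit.

```python
import re

# Two equivalent spellings of the same regular expression.
raw_pattern = r"\s+"
escaped_pattern = "\\s+"
assert raw_pattern == escaped_pattern

# Collapse runs of whitespace (spaces and a tab) into single spaces.
collapsed = re.sub(raw_pattern, " ", "engine   stalls \t at  idle")
print(collapsed)  # engine stalls at idle
```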
re.compile() followed by .sub() could just be re.sub(). You are not saving the compiled regular expression to use again.
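A sketch of the difference, assuming invented sample strings: for one-off use the two spellings give the same result, and compiling only pays off when the pattern object is kept and reused. `TAG_RE` is a hypothetical name, not something from the question. The `re` module also caches recently used patterns internally, so any speed difference is likely to be small in practice.

```python
import re

text = "<b>brake</b> pedal feels <i>soft</i>"

# One-off use: compile-then-substitute in a single expression...
via_compile = re.compile(r"<.*?>").sub("", text)
# ...is equivalent to calling re.sub directly.
via_sub = re.sub(r"<.*?>", "", text)
assert via_compile == via_sub

# Compiling is worthwhile when the pattern object is stored and reused.
# TAG_RE is a hypothetical module-level constant for illustration.
TAG_RE = re.compile(r"<.*?>")
cleaned = [TAG_RE.sub("", t) for t in (text, "<p>no start</p>")]
print(cleaned)  # ['brake pedal feels soft', 'no start']
```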
Several of your substitutions replace characters with spaces, so whitespace can reappear after you have stripped: if your text ended in a number, that number would be replaced by a space at the end of your string. You want text = text.strip() to be one of the last things your code does.
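The ordering issue can be seen on a short, made-up input (the string `raw` is an assumption for illustration):

```python
import re

raw = "  Error code 42"

# Stripping first: the trailing digits are turned into a space afterwards,
# so the result ends with whitespace again.
strip_first = re.sub(r"[\d\s]+", " ", raw.strip())

# Stripping last removes whatever whitespace the substitution introduced.
strip_last = re.sub(r"[\d\s]+", " ", raw).strip()

print(repr(strip_first))  # 'Error code '
print(repr(strip_last))   # 'Error code'
```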
The line text=re.sub(r'[^\w\s]', '', str(text).lower().strip()) has redundancy: you already converted everything to lowercase, so you do not need another .lower() here. Your variable, text, is already a string, so str(text) converts it unnecessarily. As mentioned, you want .strip() at the end; if you do that, the one in this line is not needed.
You should use type hints: def preprocess(text: str) -> str: to document that the function takes a string and returns a string.
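A brief sketch of what the hints provide. The function body here is a placeholder, not the real preprocessing logic. The annotations are stored on the function for tools and readers; Python itself does not enforce them at runtime, so a separate type checker is needed to catch a wrong argument type.

```python
def preprocess(text: str) -> str:
    """Placeholder body; only the signature matters for this illustration."""
    return text.lower().strip()


# The hints are recorded as metadata on the function object.
hints = preprocess.__annotations__
print(hints)  # {'text': <class 'str'>, 'return': <class 'str'>}
```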
Reworked code:
    import string
    import re

    def preprocess(text: str) -> str:
        """convert to lowercase, strip and remove punctuations"""
        text = text.lower()
        text = re.sub('<.*?>', '', text)
        text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
        text = re.sub(r'[^\w\s]', '', text)
        text = re.sub(r'[\d\s]+', ' ', text)
        text = text.strip()
        return text
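On the formatting part of the question, a small check (the sample string is invented) suggests that the f-string used above builds exactly the same pattern as the original %-formatting. In '[%s]' the whole %s is the placeholder, whereas in '[{}s]' only {} is, so the s stays behind as a literal character inside the character class. Note also that the second snippet in the question assigns the result to test rather than text, so that substitution is discarded there.

```python
import re
import string

escaped = re.escape(string.punctuation)

percent_pattern = "[%s]" % escaped         # working version from the question
fstring_pattern = f"[{escaped}]"           # f-string form, as in the reworked code
stray_s_pattern = "[{}s]".format(escaped)  # second snippet: literal 's' after {}

# %-formatting and the f-string build exactly the same pattern.
assert percent_pattern == fstring_pattern
# The second snippet's pattern differs only by an extra 's' in the class.
assert stray_s_pattern == fstring_pattern[:-1] + "s]"

sample = "loss, of pressure!"
good = re.sub(fstring_pattern, " ", sample)
bad = re.sub(stray_s_pattern, " ", sample)
print(repr(good))  # 'loss  of pressure '
print(repr(bad))   # 'lo    of pre  ure '
```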
string.punctuation leads to NameError: name 'string' is not defined. Do I miss some imports?
import string
print(preprocess(example + string.punctuation)) to see what's wrong in the 2nd output...