Short Text Pre-processing

Question 1

For educational purposes, I am preprocessing multiple short texts that describe the symptoms of car faults. The text is written by humans and is full of misspellings, capital letters and other noise.

I wanted to write a short pre-processing function and I have three questions:

1. Why do I get two different results depending on how I format the re.escape() pattern? (The first version is the correct piece of code.)
2. Can I adapt this section to f-string formatting: re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)?
3. Is there any way to improve the readability and performance of this code?

example = "This, is just an example! Nothing serious :) "

#convert to lowercase, strip and remove punctuations
def preprocess(text):
    """convert to lowercase, strip and remove punctuations"""
    text = text.lower()
    text=text.strip()
    text=re.compile('<.*?>').sub('', text)
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub('\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]',' ',text)
    text=re.sub(r'[^\w\s]', '', str(text).lower().strip())
    text = re.sub(r'\d',' ',text)
    text = re.sub(r'\s+',' ',text)
    return text

The wrong one:

#convert to lowercase, strip and remove punctuations
def preprocess(text):
    """convert to lowercase, strip and remove punctuations"""
    text = text.lower()
    text=text.strip()
    text=re.compile('<.*?>').sub('', text)
    escaping = re.escape(string.punctuation)
    test = re.compile('[{}s]'.format(escaping)).sub(' ',text)
    text = re.sub('\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]',' ',text)
    text=re.sub(r'[^\w\s]', '', str(text).lower().strip())
    text = re.sub(r'\d',' ',text)
    text = re.sub(r'\s+',' ',text)
    return text
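
As a minimal sketch of where the second version departs from the first, assuming only the standard library: the pattern '[{}s]' keeps a stray s inside the character class, because {} is the whole placeholder for str.format while %s is the whole placeholder for %-formatting. The names with_percent, with_format and sample are illustrative and not part of the original code.

```python
import re
import string

escaped = re.escape(string.punctuation)

# %-formatting: "%s" is the placeholder, so it disappears from the pattern.
with_percent = '[%s]' % escaped
# str.format: "{}" is the placeholder, so the trailing "s" from the second
# version stays inside the character class as a literal letter.
with_format = '[{}s]'.format(escaped)

sample = "misses, stalls!"
only_punct = re.sub(with_percent, ' ', sample)
punct_and_s = re.sub(with_format, ' ', sample)

print(repr(only_punct))   # punctuation replaced, letters untouched
print(repr(punct_and_s))  # every "s" is replaced as well
```

Note also that the second version binds the result to test rather than text, so that substitution is discarded. Punctuation is then removed only by the later r'[^\w\s]' step, which deletes it instead of replacing it with a space (so "a,b" becomes "ab" rather than "a b") and which keeps underscores, since _ counts as a word character.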

Comment 1

string.punctuation leads to NameError: name 'string' is not defined. Am I missing some imports?

Comment 2

import string

Comment 3

I'm afraid this question does not match what this site is about. Code Review is about improving existing, working code; it is not the site to ask for help in fixing or changing what your code does. Anyway, try print(preprocess(example + string.punctuation)) to see what's wrong in the second output...

Comment 4

I probably phrased my question badly, but points 2 and 3 are suitable for Code Review because the code works :) I left out the imports to be more concise, but I can add them. On point 1, I agree it is off-topic and more suitable for Stack Overflow.

Answer

To be PEP 8 compliant, you may wish to review your spacing. Specifically, text=text.strip() should become text = text.strip(), with spaces surrounding the assignment operator. This is done in some places in your code but not in others; I would recommend consistency.

Some parts of your code are redundant. The statement text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text) replaces every punctuation character, square brackets included, with a space. The later line text = re.sub(r'\[[0-9]*\]',' ',text) looks for digits surrounded by square brackets, but since the square brackets are already gone, it will never find anything that matches.
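
A small sketch of that redundancy, using a made-up sample string (sample and no_punct are illustrative names):

```python
import re
import string

sample = "error [12] on start"

# First step from the original function: every punctuation character,
# including "[" and "]", becomes a space.
no_punct = re.sub('[%s]' % re.escape(string.punctuation), ' ', sample)
print(repr(no_punct))  # the brackets are gone, the digits remain

# Later step from the original function: it looks for "[digits]", but no
# brackets are left by now, so there is nothing to match.
print(re.search(r'\[[0-9]*\]', no_punct))  # None
```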

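Also, be aware that \ is an escape character in ordinary Python string literals. To pass a backslash through to a regular expression, either double it (\\) or use a raw string. The line text = re.sub('\s+', ' ', text) happens to work because \s is not an escape sequence Python recognises, so the backslash is kept, but recent Python versions warn about it (DeprecationWarning since 3.6, SyntaxWarning since 3.12). Writing r'\s+' avoids the problem.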
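
To illustrate why the raw string is the safer habit, here is a short sketch using \b, an escape that Python does recognise; the sample sentence is made up:

```python
import re

# A raw string and a doubled backslash spell the same two characters.
assert r'\s+' == '\\s+'

# '\b' in a normal string is ONE character (backspace);
# r'\b' is two characters, which the regex engine reads as a word boundary.
assert len('\b') == 1
assert len(r'\b') == 2

sentence = "car scar cars car"
whole_word = re.findall(r'\bcar\b', sentence)      # word-boundary match
literal_backspace = re.findall('\bcar\b', sentence)  # backspace, "car", backspace

print(whole_word)         # ['car', 'car']
print(literal_backspace)  # []
```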

re.compile() followed by .sub() could just be re.sub(). You are not saving the compiled regular expression to use again.
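
If you do want to keep re.compile(), one option is to compile once at module level and reuse the pattern objects on every call. This is only a sketch: PUNCT_RE, WS_RE and strip_punctuation are illustrative names, and since the re module keeps its own cache of recently used patterns, the speed-up over plain re.sub() is likely to be small.

```python
import re
import string

# Compiled once at import time and reused on every call.
PUNCT_RE = re.compile(f'[{re.escape(string.punctuation)}]')
WS_RE = re.compile(r'\s+')


def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    text = PUNCT_RE.sub(' ', text)
    return WS_RE.sub(' ', text).strip()


descriptions = ["Engine stalls, then restarts!", "brake-pedal feels soft..."]
cleaned = [strip_punctuation(d) for d in descriptions]
print(cleaned)  # ['Engine stalls then restarts', 'brake pedal feels soft']
```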

Characters can be replaced with spaces - if your text ended in a number, it would be replaced by a space at the end of your string. You want text = text.strip() to be one of the last things your code does.
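
A quick sketch of the ordering issue, with a made-up sample string:

```python
import re

sample = "fault code 42"

# Stripping first: the digits are replaced afterwards, which leaves
# a trailing space behind.
strip_first = re.sub(r'\d+', ' ', sample.strip())
# Stripping last: the space left by the digits is removed.
strip_last = re.sub(r'\d+', ' ', sample).strip()

print(repr(strip_first))  # 'fault code  '
print(repr(strip_last))   # 'fault code'
```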

text=re.sub(r'[^\w\s]', '', str(text).lower().strip()) has redundancy - you already converted everything to lowercase, so you do not need another .lower() here. Your variable, text, is already a string, and so str(text) is converting it unnecessarily. As mentioned, you want .strip() at the end - if doing so, the one in this block of code is not needed.

You should use type hints: def preprocess(text: str) -> str: to document that the function takes type string, and returns type string.

Reworked code:

import string
import re


def preprocess(text: str) -> str:
    """convert to lowercase, strip and remove punctuations"""
    text = text.lower()
    text = re.sub('<.*?>', '', text)
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'[\d\s]+', ' ', text)
    text = text.strip()
    return text
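
As a usage sketch, the reworked function is repeated here so the snippet runs on its own. It is applied to the example string from the question and to a second, made-up description containing markup and digits:

```python
import re
import string


def preprocess(text: str) -> str:
    """convert to lowercase, strip and remove punctuations"""
    text = text.lower()
    text = re.sub('<.*?>', '', text)                                # drop HTML-like tags
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)  # ASCII punctuation -> space
    text = re.sub(r'[^\w\s]', '', text)                             # any other symbol is deleted
    text = re.sub(r'[\d\s]+', ' ', text)                            # digits and whitespace runs -> one space
    return text.strip()


example = "This, is just an example! Nothing serious :) "
print(preprocess(example))
# this is just an example nothing serious

print(preprocess("<b>Engine</b> stalls at 3000rpm!!"))
# engine stalls at rpm
```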

Polar Shift · Accepted Answer · 2022-12-23 06:09:41Z
