Compensating for typos in strings by removing spaces and making uppercase for comparison

Question 1

I'm working with a database that has typos in its values. To compensate for them I've made a function called standardizer() which removes spaces and converts all letters to uppercase. I do this so the value read in from the database can be interpreted correctly by the program. Also, in values starting with the prefix 'sp_etc_', I noticed a common mistake where the 't' is left out, giving 'sp_ec_', so I compensate for this as well. Below is an example:

import sys

# remove all spaces and convert to uppercase, for comparison purposes
def standardizer(str):
    str = str.replace("sp_ec_", "sp_etc_")  # this is a common typo in the db, the 't' is left out
    str = str.replace(" ", "")
    str = str.upper()
    return str

# this function is for testing purposes and would actually read in values from db
def getUserInput():
    return ' this is a test'

def main():
    str1 = "I'mok"
    str2 = 'this is a test'
    str3 = 'two spaces'
    str4 = ' spaces at front and end '
    str5 = 'sp_ec_blah'
    print(str1, standardizer(str1))
    print(str2, standardizer(str2))
    print(str3, standardizer(str3))
    print(str4, standardizer(str4))
    print(str5, standardizer(str5))
    # this is an example of how the function would actually be used
    print(standardizer(str2), standardizer(getUserInput()))
    if standardizer(str2) == standardizer(getUserInput()):
        print('matched')
    else:
        print('not a match')

if __name__ == '__main__':
    main()

Any suggestions on the standardizer() function? First off I think it needs a better name. I'm wondering if I should break it into two functions, one being for the missing 't' and the other being for removing spaces and making upper case (by the way, from what I've seen it's more common to convert everything to upper case than lower case, for comparison purposes). Also, how would you comment something like this?

Comment

I think the term that is commonly used is "canonical" instead of "standard".

Caridorc · Answer 1 · 2015-08-06 06:36:25Z

You can simplify by avoiding reassignment:

def standardizer(str):
    return str.replace("sp_ec_", "sp_etc_").replace(" ", "").upper()

You should use a for loop to avoid repetition in the following lines:

print(str1, standardizer(str1))
print(str2, standardizer(str2))
print(str3, standardizer(str3))
print(str4, standardizer(str4))
print(str5, standardizer(str5))

Edited Aug 6, 2015 at 15:07.
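The loop suggested in this answer might look like the following sketch. It reuses the sample strings from the question's main(); the list name `samples` is an assumption introduced here for illustration only.

```python
def standardizer(str):
    # One-expression version from the answer (parameter name kept from the question).
    return str.replace("sp_ec_", "sp_etc_").replace(" ", "").upper()


# Hypothetical list holding the sample strings from the question's main().
samples = [
    "I'mok",
    'this is a test',
    'two spaces',
    ' spaces at front and end ',
    'sp_ec_blah',
]

# One print call in a loop replaces the five near-identical print lines.
for sample in samples:
    print(sample, standardizer(sample))
```

Adding another test value then means adding one list entry rather than another variable and another print line.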

Comments:

Yes, that's clear. A question about Python and indentation: why doesn't having it on a new line mess it up? If you continue the command on a new line, do you just need to add one more level of indentation?
– Celeritas, 2015-08-06 07:02:30 +00:00

@Celeritas The only sensible answer is that it works with newlines because the grammar says so. Newlines are for readability only.
– Caridorc, 2015-08-06 07:04:43 +00:00

Oh, well then wouldn't you say that having all those functions chained together is less readable?
– Celeritas, 2015-08-06 16:50:29 +00:00

@Celeritas Reassigning a variable like that is noise. You should write in code exactly what you want. DRY.
– Caridorc, 2015-08-06 19:38:12 +00:00
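On the indentation question in these comments: in Python's grammar a statement can span several physical lines while it sits inside parentheses, brackets, or braces, and the indentation of the continuation lines is then not significant. A minimal sketch, assuming the chain is wrapped in parentheses (the function name `standardize_chained` is hypothetical and not from the original posts):

```python
def standardize_chained(text):
    # The enclosing parentheses let the expression continue over several
    # lines; the indentation of the continuation lines is for the reader only.
    return (
        text.replace("sp_ec_", "sp_etc_")
        .replace(" ", "")
        .upper()
    )


print(standardize_chained("sp_ec_blah"))
```

Whether this reads better than three reassignments is a matter of taste; both forms do the same work.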

janos · Answer 2 · 2015-08-06 17:55:36Z

Don't test like that: with a main function printing stuff. To verify it works correctly, you have to read and understand the output. Doc tests are perfect for this task:

def sanitize(text):
    """
    >>> sanitize("I'mok")
    "I'MOK"
    >>> sanitize('this is a test')
    'THISISATEST'
    >>> sanitize('two spaces')
    'TWOSPACES'
    >>> sanitize(' spaces at front and end ')
    'SPACESATFRONTANDEND'
    >>> sanitize('sp_ec_blah')
    'SP_ETC_BLAH'
    """
    text = text.replace("sp_ec_", "sp_etc_")
    text = text.replace(" ", "")
    text = text.upper()
    return text

If your script is in a file called sanitizer.py, you can run the doc tests with:

python -m doctest sanitizer.py

In this implementation, there is no noise, no messiness, a doc string that explains nicely what the function is expected to do, and the doctest verifies that it actually does it.
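As an alternative to the command line, the standard library's `doctest.testmod()` can run the same checks when the file is executed directly. The following is only a sketch with a shortened docstring, not part of the original answer:

```python
import doctest


def sanitize(text):
    """
    >>> sanitize('this is a test')
    'THISISATEST'
    >>> sanitize('sp_ec_blah')
    'SP_ETC_BLAH'
    """
    return text.replace("sp_ec_", "sp_etc_").replace(" ", "").upper()


if __name__ == "__main__":
    # testmod() runs every docstring example in this module and returns
    # a result with the number of failed and attempted examples.
    results = doctest.testmod()
    print(results)
```

With this guard in place, running the file with plain `python` executes the doc tests, while importing the module elsewhere does not.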

Other improvements:

str shadows the name of a built-in. Better rename that variable to something else.
"standardizer" is not a good name for a function, because it's a noun. Verbs are better, for example "standardize". I went further and used "sanitize", which is more common for this kind of purpose.

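On the question of whether to split the function in two, one possible arrangement with verb names is sketched below. The names `fix_prefix_typo` and `canonicalize` are hypothetical, not taken from the original posts; `canonicalize` borrows the "canonical" terminology suggested in the comment on the question.

```python
def fix_prefix_typo(text):
    """Restore the 't' that is often missing from 'sp_etc_'.

    >>> fix_prefix_typo('sp_ec_blah')
    'sp_etc_blah'
    """
    # Note: str.replace substitutes every occurrence, not only a leading one.
    return text.replace("sp_ec_", "sp_etc_")


def canonicalize(text):
    """Remove spaces and convert to uppercase, for comparison purposes.

    >>> canonicalize(' this is a test')
    'THISISATEST'
    """
    return text.replace(" ", "").upper()


def sanitize(text):
    """Apply the typo fix first, then the canonical form.

    >>> sanitize('sp_ec_blah') == sanitize(' SP_ETC_ blah')
    True
    """
    return canonicalize(fix_prefix_typo(text))
```

The typo fix runs before uppercasing because it looks for the lowercase pattern 'sp_ec_'. Whether the database also contains an uppercase variant of the typo is not stated in the question, so this sketch does not handle it.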