Replace characters using multiple dependent regular expression substitutions

Question 1

I've written the following Python module to iterate over the values in two fields, "NAME_LABEL" and "CATEGORY". The function "correct_invalid_char" applies three regular expression substitutions, each of which depends on the result of the previous one. The first replaces the characters "/", "-" and space with an underscore. Its result is passed to the second, which removes every character that is not a letter, digit or underscore. The last removes duplicate underscores, whether they were already in the field value or were generated by the first substitution. The result is then converted to title case.

Is there a more Pythonic way of writing these regular expressions to achieve the same result, one that would also allow me to write better unit tests for the function?

Original field values:

OBJECTID,NAME,NAME_LABEL,FACILITY_TYPE,CATEGORY,SHAPE_Length,SHAPE_Area
1,Ward 27 - New 1,Ward 27 - New 1,Settlements,Settlements 2,533.176039669,12288.746516
2,429 (Block R),(Block R) 429,Settlements,Settlements 4,508.411033555,9622.22635621
3,Kondelelani (Block 1),Kondelelani (Block 1),Settlements,Settlements 4,738.203751902,22815.0234794
4,Yebo Gogo,Yebo \ Gogo,Settlements,Settlements 1,674.979301727,23413.6988572
5,Tebogo Bottle Store,Tebogo / Bottle Store,Settlements,Settlements 1,329.239037836,7157.39741934
6,Block 1 Y,Block 1 [Y],Settlements,Settlements 2,1893.89651205,82883.9076782
7,"Stand 1427, Ga-Rankuwa X25","Stand_ 1427, Ga-Rankuwa X25",Settlements,Settlements 3,1209.46585836,66852.9597381
8,"Stand 1719, Ga-Rankuwa X23","Stand 1719, Ga-Rankuwa X23",Settlements,Settlements 3,997.901714538,51299.0275644

Original values: CSV

[Screenshot: updated field values]

"""
Created on 23 Oct 2018
Remove invalid characters
found in SAA fields values:
NAME_LABEL; CATEGORY
@author: Peter Wilson
"""
# import site-packages and modules
import re
import argparse
import arcpy

# set environment settings
arcpy.env.overwriteOutput = True


def sites_fields_list(sites):
    """
    Validate fields found
    and create list for
    update cursor.
    """
    fields = ['NAME_LABEL', 'CATEGORY']
    sites_fields = [f.name for f in arcpy.ListFields(sites) if f.name in fields]
    return sites_fields


def correct_invalid_char(field_value):
    """
    Correct field values
    by replacing characters,
    removing non-alphanumeric,
    non-numeric characters,
    duplicate underscores,
    and changing to title case.
    """
    try:
        # replace '/', '-' and space with an underscore
        underscore = re.sub(r'[/\- ]', '_', field_value)
        # drop everything that is not a letter, digit or underscore
        illegal_char = re.sub(r'[^0-9a-zA-Z_]+', '', underscore)
        # collapse runs of underscores into a single underscore
        dup_underscore = re.sub(r'(_)+', r'\1', illegal_char)
        update_value = dup_underscore.title()
        return update_value
    except TypeError as e:
        print("There's no value in the field: {0}".format(e))
        raise


def update_field_values(sites):
    """
    Iterate over field values
    in: NAME_LABEL; CATEGORY
    and correct field values
    by replacing characters.
    """
    sites_fields = sites_fields_list(sites)
    with arcpy.da.UpdateCursor(sites, sites_fields) as cursor:
        for row in cursor:
            for idx, val in enumerate(row):
                row[idx] = correct_invalid_char(val)
            cursor.updateRow(row)


if __name__ == '__main__':
    description = 'Remove invalid characters in SAA fields NAME_LABEL and CATEGORY'
    # pass the text by keyword; the first positional argument is the program name
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument('--sites', metavar='path', required=True,
                        help='path to input sites feature class')
    args = parser.parse_args()
    update_field_values(sites=args.sites)

Question 2

Can you copy-paste the values instead of posting screenshots?

Question 3

@MaartenFabré, I've attached the original values as a CSV, in a code snippet.

Answer 1

I can't say it would be more Pythonic, but we can reduce your number of re.sub() calls from 3 down to 2.

First, we just eliminate all of the invalid characters:

valid_chars = re.sub('[^-/_ 0-9a-zA-Z]', '', field_value)

Then, replace occurrences of one or more '-', '/', '_' and ' ' characters with a single underscore, convert to title case, and return:

return re.sub('[-/_ ]+', '_', valid_chars).title()
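To check that the two substitutions really give the same result as the original three, a small comparison over the NAME_LABEL values from the question's CSV can help. This is only a sketch: the function names three_pass and two_pass are made up for illustration and are not part of the original module.

```python
import re


def three_pass(value):
    # The question's approach: separators -> "_", drop the rest, collapse "_".
    value = re.sub(r'[/\- ]', '_', value)
    value = re.sub(r'[^0-9a-zA-Z_]+', '', value)
    return re.sub(r'_+', '_', value).title()


def two_pass(value):
    # The approach described above: drop invalid characters,
    # then collapse each run of separators into one underscore.
    value = re.sub(r'[^-/_ 0-9a-zA-Z]', '', value)
    return re.sub(r'[-/_ ]+', '_', value).title()


# NAME_LABEL values taken from the sample CSV in the question.
samples = [
    'Ward 27 - New 1',
    '(Block R) 429',
    'Yebo \\ Gogo',
    'Tebogo / Bottle Store',
    'Block 1 [Y]',
    'Stand_ 1427, Ga-Rankuwa X25',
]

for sample in samples:
    # Both versions should agree on every sample value.
    assert three_pass(sample) == two_pass(sample)
    print(sample, '->', two_pass(sample))
```

For example, 'Stand_ 1427, Ga-Rankuwa X25' comes out as 'Stand_1427_Ga_Rankuwa_X25' with either version.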

Since the try ... except block is unconditionally re-raising the raised exception, it is not really adding much, and could probably be eliminated.
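Without the try ... except block, a value that is not a string still raises TypeError from re.sub(), so nothing is lost by removing it. If empty fields should be skipped rather than abort the whole update, an explicit check is one option. The sketch below assumes an empty field reaches the function as None; the name clean_or_skip is hypothetical, and whether skipping is the right behaviour depends on your data.

```python
import re


def clean_or_skip(field_value):
    """Return the cleaned value, or None when the field holds no text."""
    if field_value is None:
        # Assumption: an empty field arrives here as None.
        return None
    valid_chars = re.sub(r'[^-/_ 0-9a-zA-Z]', '', field_value)
    return re.sub(r'[-/_ ]+', '_', valid_chars).title()


print(clean_or_skip(None))            # None
print(clean_or_skip('Yebo \\ Gogo'))  # Yebo_Gogo
```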

When using a regex over and over, it is usually more efficient to compile the regular expression once, and then reuse the compiled regular expression object.

INVALID_CHARS = re.compile('[^-/_ 0-9a-zA-Z]')
UNDERSCORE_CHARS = re.compile('[-/_ ]+')


def correct_invalid_char(field_value):
    """
    Correct field values by replacing characters, removing non-alphanumeric,
    non-numeric characters, duplicate underscores, and changing to title case.
    """
    valid_chars = INVALID_CHARS.sub('', field_value)
    return UNDERSCORE_CHARS.sub('_', valid_chars).title()

A quick "Hello world"-ish test:

>>> correct_invalid_char('Hello EMPEROR / ***WoRlD_-=-_lEaDeR***')
'Hello_Emperor_World_Leader'
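Since the question also asks about unit tests: the function only needs the re module, so it can be tested without arcpy. Below is a minimal sketch using unittest, with expected values worked out from the sample CSV rows. The test class name and the run_tests helper are illustrative, and the function is repeated here only to keep the example self-contained.

```python
import re
import unittest

INVALID_CHARS = re.compile(r'[^-/_ 0-9a-zA-Z]')
UNDERSCORE_CHARS = re.compile(r'[-/_ ]+')


def correct_invalid_char(field_value):
    valid_chars = INVALID_CHARS.sub('', field_value)
    return UNDERSCORE_CHARS.sub('_', valid_chars).title()


class CorrectInvalidCharTest(unittest.TestCase):
    def test_separators_become_single_underscore(self):
        self.assertEqual(correct_invalid_char('Ward 27 - New 1'),
                         'Ward_27_New_1')
        self.assertEqual(correct_invalid_char('Tebogo / Bottle Store'),
                         'Tebogo_Bottle_Store')

    def test_invalid_characters_are_removed(self):
        self.assertEqual(correct_invalid_char('(Block R) 429'), 'Block_R_429')
        self.assertEqual(correct_invalid_char('Block 1 [Y]'), 'Block_1_Y')

    def test_existing_underscores_are_collapsed(self):
        self.assertEqual(correct_invalid_char('Stand_ 1427, Ga-Rankuwa X25'),
                         'Stand_1427_Ga_Rankuwa_X25')

    def test_none_raises_type_error(self):
        # With the try ... except removed, the TypeError simply propagates.
        with self.assertRaises(TypeError):
            correct_invalid_char(None)


def run_tests():
    # Load and run the test case explicitly instead of calling unittest.main().
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CorrectInvalidCharTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == '__main__':
    run_tests()
```

Keeping the function in a module that does not import arcpy would let these tests run on a machine without ArcGIS installed.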

AJNeufeld · Answer 1 · 2018-10-25 18:38:00Z
