I've written the following Python module to iterate over the values in two fields: "NAME_LABEL" and "CATEGORY". The function "correct_invalid_char" applies three regular expressions in sequence, each operating on the result of the previous one. The first replaces the characters "/", "-" and space with an underscore. The second removes every remaining character that is not alphanumeric or an underscore. The third collapses duplicate underscores, whether they were in the original field value or were generated by the first substitution.
Is there a more Pythonic way of writing these regular expressions that achieves the same result and would let me write better unit tests for the function?
Original field values:
OBJECTID,NAME,NAME_LABEL,FACILITY_TYPE,CATEGORY,SHAPE_Length,SHAPE_Area
1,Ward 27 - New 1,Ward 27 - New 1,Settlements,Settlements 2,533.176039669,12288.746516
2,429 (Block R),(Block R) 429,Settlements,Settlements 4,508.411033555,9622.22635621
3,Kondelelani (Block 1),Kondelelani (Block 1),Settlements,Settlements 4,738.203751902,22815.0234794
4,Yebo Gogo,Yebo \ Gogo,Settlements,Settlements 1,674.979301727,23413.6988572
5,Tebogo Bottle Store,Tebogo / Bottle Store,Settlements,Settlements 1,329.239037836,7157.39741934
6,Block 1 Y,Block 1 [Y],Settlements,Settlements 2,1893.89651205,82883.9076782
7,Stand 1427, Ga-Rankuwa X25,Stand_ 1427, Ga-Rankuwa X25,Settlements,Settlements 3,1209.46585836,66852.9597381
8,Stand 1719, Ga-Rankuwa X23,Stand 1719, Ga-Rankuwa X23,Settlements,Settlements 3,997.901714538,51299.0275644
Original values: CSV
[Image: updated field values]
"""
Created on 23 Oct 2018

Remove invalid characters
found in SAA fields values:
NAME_LABEL; CATEGORY

@author: Peter Wilson
"""

# import site-packages and modules
import re
import argparse
import arcpy

# set environment settings
arcpy.env.overwriteOutput = True


def sites_fields_list(sites):
    """
    Validate fields found
    and create list for
    update cursor.
    """
    fields = ['NAME_LABEL', 'CATEGORY']
    sites_fields = [f.name for f in arcpy.ListFields(sites) if f.name in fields]
    return sites_fields


def correct_invalid_char(field_value):
    """
    Correct field values
    by replacing characters,
    removing non-alphanumeric,
    non-numeric characters,
    duplicate underscores,
    and changing to title case.
    """
    try:
        underscore = re.sub('[/\- ]', '_', field_value)
        illegal_char = re.sub('[^0-9a-zA-Z_]+', '', underscore)
        dup_underscore = re.sub(r'(_)+', r'\1', illegal_char)
        update_value = dup_underscore.title()
        return update_value
    except TypeError as e:
        print("There's no value in the field: {0}".format(e))
        raise


def update_field_values(sites):
    """
    Iterate over field values
    in: NAME_LABEL; CATEGORY
    and correct field values
    by replacing characters.
    """
    sites_fields = sites_fields_list(sites)
    with arcpy.da.UpdateCursor(sites, sites_fields) as cursor:
        for row in cursor:
            for idx, val in enumerate(row):
                row[idx] = correct_invalid_char(val)
            cursor.updateRow(row)


if __name__ == '__main__':
    description = 'Remove invalid characters in SAA fields NAME_LABEL and CATEGORY'
    parser = argparse.ArgumentParser(description)
    parser.add_argument('--sites', metavar='path', required=True,
                        help='path to input sites feature class')
    args = parser.parse_args()
    update_field_values(sites=args.sites)
- Can you copy-paste the values instead of posting screenshots? – Maarten Fabré, Oct 25, 2018 at 8:36
- @MaartenFabré, I've attached the original values as a CSV, in a code snippet. – Peter Wilson, Oct 25, 2018 at 8:48
1 Answer
I can't say it would be more Pythonic, but we can reduce the number of re.sub() calls from 3 to 2.
First, eliminate all of the invalid characters:
valid_chars = re.sub('[^-/_ 0-9a-zA-Z]', '', field_value)
Then, replace each run of one or more '-', '/', '_' and ' ' characters with a single underscore, convert to title case, and return:
return re.sub('[-/_ ]+', '_', valid_chars).title()
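One way to check that these two passes give the same results as the original three is to run both versions over the sample data. The sketch below does that for the NAME_LABEL values from the question; the function names `three_pass` and `two_pass` are made up for the comparison and are not part of the original module.

```python
import re


def three_pass(field_value):
    """The three substitutions from the question (hypothetical name)."""
    underscore = re.sub(r'[/\- ]', '_', field_value)
    illegal_char = re.sub(r'[^0-9a-zA-Z_]+', '', underscore)
    dup_underscore = re.sub(r'(_)+', r'\1', illegal_char)
    return dup_underscore.title()


def two_pass(field_value):
    """The two substitutions suggested above (hypothetical name)."""
    valid_chars = re.sub(r'[^-/_ 0-9a-zA-Z]', '', field_value)
    return re.sub(r'[-/_ ]+', '_', valid_chars).title()


# NAME_LABEL values copied from the CSV in the question.
samples = [
    'Ward 27 - New 1',
    '(Block R) 429',
    'Kondelelani (Block 1)',
    'Yebo \\ Gogo',
    'Tebogo / Bottle Store',
    'Block 1 [Y]',
    'Stand_ 1427, Ga-Rankuwa X25',
]

for value in samples:
    # Both versions should agree on every sample value.
    assert three_pass(value) == two_pass(value), value
    print('{!r} -> {!r}'.format(value, two_pass(value)))
```

Agreement on these samples is only evidence, not proof, that the two versions are equivalent for every possible input.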
Since the try ... except block unconditionally re-raises the exception it catches, it is not really adding much, and could probably be eliminated.
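To illustrate, re.sub() already raises TypeError when it is handed None, so the caller sees the same exception with or without the wrapper. If null field values are expected in the data, one option is to skip them before calling the function. The sketch below assumes that is the desired behaviour; the helper name `clean_or_none` is hypothetical and not part of the original module.

```python
import re


def correct_invalid_char(field_value):
    """Two-pass clean-up with no try/except wrapper."""
    valid_chars = re.sub(r'[^-/_ 0-9a-zA-Z]', '', field_value)
    return re.sub(r'[-/_ ]+', '_', valid_chars).title()


def clean_or_none(field_value):
    """Hypothetical helper: pass null values through untouched."""
    if field_value is None:
        return None
    return correct_invalid_char(field_value)


# The TypeError propagates on its own; no explicit re-raise is needed.
try:
    correct_invalid_char(None)
except TypeError as e:
    print('TypeError raised: {0}'.format(e))

print(clean_or_none(None))
print(clean_or_none('Tebogo / Bottle Store'))
```

Whether to skip nulls or let the exception stop the update is a design decision that depends on the data; the point is only that the try ... except block is not needed for the exception to surface.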
When using a regex over and over, it is usually more efficient to compile the regular expression once, and then reuse the compiled regular expression object.
INVALID_CHARS = re.compile('[^-/_ 0-9a-zA-Z]')
UNDERSCORE_CHARS = re.compile('[-/_ ]+')

def correct_invalid_char(field_value):
    """
    Correct field values by replacing characters, removing non-alphanumeric,
    non-numeric characters, duplicate underscores, and changing to title case.
    """
    valid_chars = INVALID_CHARS.sub('', field_value)
    return UNDERSCORE_CHARS.sub('_', valid_chars).title()
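If you want to see how much precompiling helps on your own machine, a rough timing sketch like the one below can be used. The re module keeps an internal cache of recently compiled patterns, so the difference is often modest; the function names and the call count of 20000 are arbitrary choices for this illustration.

```python
import re
import timeit

INVALID_CHARS = re.compile(r'[^-/_ 0-9a-zA-Z]')
UNDERSCORE_CHARS = re.compile(r'[-/_ ]+')


def with_module_functions(field_value):
    """Pass the pattern strings to re.sub() on every call."""
    valid_chars = re.sub(r'[^-/_ 0-9a-zA-Z]', '', field_value)
    return re.sub(r'[-/_ ]+', '_', valid_chars).title()


def with_compiled_patterns(field_value):
    """Reuse the pattern objects compiled once at import time."""
    valid_chars = INVALID_CHARS.sub('', field_value)
    return UNDERSCORE_CHARS.sub('_', valid_chars).title()


sample = 'Stand_ 1427, Ga-Rankuwa X25'

# Both variants must produce the same cleaned value.
assert with_module_functions(sample) == with_compiled_patterns(sample)

for func in (with_module_functions, with_compiled_patterns):
    seconds = timeit.timeit(lambda: func(sample), number=20000)
    print('{0}: {1:.4f} s for 20000 calls'.format(func.__name__, seconds))
```

The timings vary between machines and Python versions, so treat them as a rough guide only.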
A quick "Hello world"-ish test:
>>> correct_invalid_char('Hello EMPEROR / ***WoRlD_-=-_lEaDeR***')
'Hello_Emperor_World_Leader'
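On the unit-testing part of the question: because correct_invalid_char() depends only on re, it can be tested without arcpy or a feature class. The following is a minimal sketch using unittest, with the function repeated so the example is self-contained. The test class name is made up, and the expected values were derived by applying the two substitutions to values from the question's CSV.

```python
import re
import unittest

INVALID_CHARS = re.compile(r'[^-/_ 0-9a-zA-Z]')
UNDERSCORE_CHARS = re.compile(r'[-/_ ]+')


def correct_invalid_char(field_value):
    """Remove invalid characters, collapse separators, title-case."""
    valid_chars = INVALID_CHARS.sub('', field_value)
    return UNDERSCORE_CHARS.sub('_', valid_chars).title()


class CorrectInvalidCharTest(unittest.TestCase):
    """Hypothetical test case built from the question's sample data."""

    def test_sample_field_values(self):
        cases = [
            ('Ward 27 - New 1', 'Ward_27_New_1'),
            ('(Block R) 429', 'Block_R_429'),
            ('Kondelelani (Block 1)', 'Kondelelani_Block_1'),
            ('Yebo \\ Gogo', 'Yebo_Gogo'),
            ('Tebogo / Bottle Store', 'Tebogo_Bottle_Store'),
            ('Block 1 [Y]', 'Block_1_Y'),
            ('Settlements 2', 'Settlements_2'),
        ]
        for raw, expected in cases:
            self.assertEqual(correct_invalid_char(raw), expected)

    def test_none_raises_type_error(self):
        # A null field value is not silently accepted.
        with self.assertRaises(TypeError):
            correct_invalid_char(None)
```

Saved as a module, the tests can be run with `python -m unittest` followed by the module name. In the real script it may help to keep the cleaning function in a module that does not import arcpy, so the tests run on machines without ArcGIS installed.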