1
\$\begingroup\$

I've written the following Python module to iterate over the values in two fields: "NAME_LABEL" and "CATEGORY". The function "correct_invalid_char" applies three regular expressions in sequence, each operating on the result of the previous one. The first replaces "/", "-" and space characters with an underscore. The second removes any remaining characters that are not alphanumeric or underscores. The third collapses runs of duplicate underscores, whether they were present in the original field values or were produced by the first substitution.

Is there a more Pythonic way of writing these regular expressions that achieves the same result and would allow me to write better unit tests for the function?

Original field values:

OBJECTID,NAME,NAME_LABEL,FACILITY_TYPE,CATEGORY,SHAPE_Length,SHAPE_Area
1,Ward 27 - New 1,Ward 27 - New 1,Settlements,Settlements 2,533.176039669,12288.746516
2,429 (Block R),(Block R) 429,Settlements,Settlements 4,508.411033555,9622.22635621
3,Kondelelani (Block 1),Kondelelani (Block 1),Settlements,Settlements 4,738.203751902,22815.0234794
4,Yebo Gogo,Yebo \ Gogo,Settlements,Settlements 1,674.979301727,23413.6988572
5,Tebogo Bottle Store,Tebogo / Bottle Store,Settlements,Settlements 1,329.239037836,7157.39741934
6,Block 1 Y,Block 1 [Y],Settlements,Settlements 2,1893.89651205,82883.9076782
7,"Stand 1427, Ga-Rankuwa X25","Stand_ 1427, Ga-Rankuwa X25",Settlements,Settlements 3,1209.46585836,66852.9597381
8,"Stand 1719, Ga-Rankuwa X23","Stand 1719, Ga-Rankuwa X23",Settlements,Settlements 3,997.901714538,51299.0275644

Original values: CSV

[Image: updated field values]

"""
Created on 23 Oct 2018
Remove invalid characters
found in SAA fields values:
NAME_LABEL; CATEGORY
@author: Peter Wilson
"""
# import site-packages and modules
import re
import argparse
import arcpy

# set environment settings
arcpy.env.overwriteOutput = True


def sites_fields_list(sites):
    """
    Validate fields found
    and create list for
    update cursor.
    """
    fields = ['NAME_LABEL', 'CATEGORY']
    sites_fields = [f.name for f in arcpy.ListFields(sites) if f.name in fields]
    return sites_fields


def correct_invalid_char(field_value):
    """
    Correct field values
    by replacing characters,
    removing non-alphanumeric,
    non-numeric characters,
    duplicate underscores,
    and changing to title case.
    """
    try:
        underscore = re.sub(r'[/\- ]', '_', field_value)
        illegal_char = re.sub('[^0-9a-zA-Z_]+', '', underscore)
        dup_underscore = re.sub(r'(_)+', r'\1', illegal_char)
        update_value = dup_underscore.title()
        return update_value
    except TypeError as e:
        print("There's no value in the field: {0}".format(e))
        raise


def update_field_values(sites):
    """
    Iterate over field values
    in: NAME_LABEL; CATEGORY
    and correct field values
    by replacing characters.
    """
    sites_fields = sites_fields_list(sites)
    with arcpy.da.UpdateCursor(sites, sites_fields) as cursor:
        for row in cursor:
            for idx, val in enumerate(row):
                row[idx] = correct_invalid_char(val)
            cursor.updateRow(row)


if __name__ == '__main__':
    description = 'Remove invalid characters in SAA fields NAME_LABEL and CATEGORY'
    # pass the text by keyword; the first positional parameter is prog
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument('--sites', metavar='path', required=True,
                        help='path to input sites feature class')
    args = parser.parse_args()
    update_field_values(sites=args.sites)
asked Oct 25, 2018 at 7:31
\$\endgroup\$
2
  • Can you copy-paste the values instead of posting screenshots? Commented Oct 25, 2018 at 8:36
  • @MaartenFabré, I've attached the original values as a CSV, in a code snippet. Commented Oct 25, 2018 at 8:48

1 Answer

2
\$\begingroup\$

I can't say it would be more Pythonic, but we can reduce your number of re.sub() calls from 3 down to 2.

First, we just eliminate all of the invalid characters:

valid_chars = re.sub('[^-/_ 0-9a-zA-Z]', '', field_value)

Then, replace occurrences of one or more '-', '/', '_' and ' ' characters with a single underscore, convert to title case, and return:

return re.sub('[-/_ ]+', '_', valid_chars).title()
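
As a rough sanity check, the two-step version can be run side by side with the original three-step version on the NAME_LABEL values from the sample CSV. This is only a sketch: the function names `three_step` and `two_step` and the `SAMPLES` list are made up for the comparison, and agreement on these few values is not a proof of equivalence in general.

```python
import re


def three_step(value):
    """Approach from the question: three substitutions, then title case."""
    underscore = re.sub(r'[/\- ]', '_', value)
    legal = re.sub(r'[^0-9a-zA-Z_]+', '', underscore)
    return re.sub(r'_+', '_', legal).title()


def two_step(value):
    """Reduced approach: strip invalid characters, then collapse separators."""
    valid_chars = re.sub(r'[^-/_ 0-9a-zA-Z]', '', value)
    return re.sub(r'[-/_ ]+', '_', valid_chars).title()


# NAME_LABEL values copied from the sample CSV in the question.
SAMPLES = [
    'Ward 27 - New 1',
    '(Block R) 429',
    'Kondelelani (Block 1)',
    'Yebo \\ Gogo',
    'Tebogo / Bottle Store',
    'Block 1 [Y]',
    'Stand_ 1427, Ga-Rankuwa X25',
]

for sample in SAMPLES:
    # Both versions should produce the same cleaned value.
    assert three_step(sample) == two_step(sample), sample
    print('{0!r} -> {1!r}'.format(sample, two_step(sample)))
```

Both versions sort each character into the same three groups (kept, turned into a separator, dropped) and collapse separators only after the dropped characters are gone, which is why they line up on these inputs.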

Since the try ... except block is unconditionally re-raising the raised exception, it is not really adding much, and could probably be eliminated.
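
If empty fields really do occur (the error message in the question suggests the value can be None), an explicit check is one alternative to the try ... except. A minimal sketch, assuming that a null value should simply be passed through unchanged; that policy is an assumption on my part, not something the question states.

```python
import re


def correct_invalid_char(field_value):
    """Clean a field value; None is returned untouched (assumed policy)."""
    if field_value is None:
        # Assumption: a null field should stay null rather than raise.
        return None
    valid_chars = re.sub(r'[^-/_ 0-9a-zA-Z]', '', field_value)
    return re.sub(r'[-/_ ]+', '_', valid_chars).title()


print(correct_invalid_char(None))
print(correct_invalid_char('Tebogo / Bottle Store'))
```

Whether to skip, substitute a default, or still raise for nulls depends on what the data should look like afterwards, so treat the early return as a placeholder for that decision.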

When using a regex over and over, it is usually more efficient to compile the regular expression once, and then reuse the compiled regular expression object.

import re

INVALID_CHARS = re.compile('[^-/_ 0-9a-zA-Z]')
UNDERSCORE_CHARS = re.compile('[-/_ ]+')


def correct_invalid_char(field_value):
    """
    Correct field values by replacing characters, removing non-alphanumeric,
    non-numeric characters, duplicate underscores, and changing to title case.
    """
    valid_chars = INVALID_CHARS.sub('', field_value)
    return UNDERSCORE_CHARS.sub('_', valid_chars).title()

A quick "Hello world"-ish test:

>>> correct_invalid_char('Hello EMPEROR / ***WoRlD_-=-_lEaDeR***')
'Hello_Emperor_World_Leader'
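
Since the question asks about unit tests: because the function only depends on `re`, it can be tested without arcpy at all. Here is a sketch using `unittest`. The test class name and the `CASES` table are illustrative; the inputs come from the sample CSV, but the expected outputs were worked out by tracing the function by hand, not taken from the screenshot of updated values.

```python
import re
import unittest

INVALID_CHARS = re.compile(r'[^-/_ 0-9a-zA-Z]')
UNDERSCORE_CHARS = re.compile(r'[-/_ ]+')


def correct_invalid_char(field_value):
    """Strip invalid characters, collapse separators, apply title case."""
    valid_chars = INVALID_CHARS.sub('', field_value)
    return UNDERSCORE_CHARS.sub('_', valid_chars).title()


class CorrectInvalidCharTest(unittest.TestCase):
    # (input, expected) pairs; expected values were derived by hand.
    CASES = [
        ('Ward 27 - New 1', 'Ward_27_New_1'),
        ('(Block R) 429', 'Block_R_429'),
        ('Yebo \\ Gogo', 'Yebo_Gogo'),
        ('Block 1 [Y]', 'Block_1_Y'),
        ('Settlements 2', 'Settlements_2'),
    ]

    def test_sample_values(self):
        for raw, expected in self.CASES:
            self.assertEqual(correct_invalid_char(raw), expected, msg=raw)

    def test_none_raises_type_error(self):
        # Without the try ... except, a None value still raises TypeError.
        with self.assertRaises(TypeError):
            correct_invalid_char(None)


# Run the suite in-process; unittest.main() would also work in a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CorrectInvalidCharTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the cleaning function free of arcpy calls is what makes this possible; only `update_field_values` needs a real feature class.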
answered Oct 25, 2018 at 18:38
\$\endgroup\$
