I want to replace each location with just its location keyword, e.g. "Foxclore Road" with "road" and "Chesture Avenue" with "avenue". The file is several GB, with many millions of rows. Here is my current working code, using three methods:
startTime = time.time()
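# The inline flag (?i) must come first in the pattern; regex=True is needed on pandas versions where it is not the default.
mergedAllCrimes['crime_location_approx'] = mergedAllCrimes.crime_location_approx.str.replace(r'(?i).*road$', 'road', regex=True)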
endTime = time.time()
print(endTime - startTime)
startTime = time.time()
mergedAllCrimes.crime_location_approx = mergedAllCrimes.crime_location_approx.apply(lambda x: 'road' if 'road' in str.lower(x) else x)
endTime = time.time()
print(endTime - startTime)
startTime = time.time()
allCrimes.loc[allCrimes['crime_location_approx'].str.contains('Road', case=False), 'crime_location_approx'] = 'road'
endTime = time.time()
print(endTime - startTime)
My times, in seconds, are:
14.287408590316772
1.9554557800292969
5.129802942276001
respectively.
The problem is that the last two methods, while faster, also replace "Broadway" with "road", hence the need for a regex that only matches at the end of the string.
Is there any way to make the regex method much faster? With a large list of replacements, it could end up taking a long time.
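To make the "Broadway" issue concrete, here is a small sketch using plain Python strings rather than the real DataFrame. The two sample names and the `results` dictionary are only for illustration.

```python
import re

results = {}
for name in ("Foxclore Road", "Broadway"):
    # Substring test: True whenever "road" appears anywhere in the name.
    substring_hit = "road" in name.lower()
    # Anchored regex: True only when the name ends with "road" (any case).
    anchored_hit = re.search(r"road$", name, flags=re.IGNORECASE) is not None
    results[name] = (substring_hit, anchored_hit)

print(results)
```

The substring test flags both names, while the anchored regex flags only "Foxclore Road".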
1 Answer
There is not much to say about your code, other than that regex is slow.
A non-regex solution could be to use Python's str.endswith, which does the same job here as r"road$":
mergedAllCrimes.crime_location_approx = mergedAllCrimes.crime_location_approx.apply(lambda x: 'road' if x.lower().endswith('road') else x)
I'm assuming all the keywords to match are at the end of the string.
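If there are many keywords rather than just "road", the same endswith idea can be extended to a list of suffixes. This is only a minimal sketch: the KEYWORDS tuple and the to_keyword helper are hypothetical names, not from the question, and it assumes every value is a string with the keyword at the very end.

```python
# Hypothetical keyword list; extend it with whatever location types the data uses.
KEYWORDS = ("road", "avenue", "street")


def to_keyword(value, keywords=KEYWORDS):
    """Return the first keyword that value ends with (ignoring case), else value unchanged."""
    lowered = value.lower()
    for keyword in keywords:
        if lowered.endswith(keyword):
            return keyword
    return value


# Assumed usage with the DataFrame from the question:
# mergedAllCrimes['crime_location_approx'] = (
#     mergedAllCrimes['crime_location_approx'].apply(to_keyword)
# )
```

Like the r".*road$" regex, this matches any string whose last letters are a keyword, so a name such as "Abbey Broad" would still become "road". If that matters, the keywords could be written with a leading space, e.g. " road".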
Thank you very much, this has sped it up by ~6 times. For posterity, I had to change it to str.lower(x), but otherwise it's perfect. There are a few edge cases where I will have to use the other methods, but this should work for the vast bulk of the data. Have a good day! – Zulfiqaar, Nov 24, 2017 at 12:47
I have changed it slightly to x.lower(); doing str.lower(x) is not the correct format. – Ludisposed, Nov 24, 2017 at 12:51
Well, isn't that neat, another 20% faster. Fantastic :) – Zulfiqaar, Nov 24, 2017 at 12:54