3
\$\begingroup\$

I want to replace every location value with just its location keyword, e.g. "Foxclore Road" with "road" and "Chesture Avenue" with "avenue". The file is several GB, with many millions of rows. Here is my current working code, using three methods:

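import time

# Method 1: regex replace anchored at the end of the string.
# The inline (?i) flag must come first; newer pandas also needs regex=True.
startTime = time.time()
mergedAllCrimes['crime_location_approx'] = mergedAllCrimes.crime_location_approx.str.replace(r'(?i).*road$', 'road', regex=True)
endTime = time.time()
print(endTime - startTime)

# Method 2: apply with a plain substring test (also matches "Broadway").
startTime = time.time()
mergedAllCrimes.crime_location_approx = mergedAllCrimes.crime_location_approx.apply(lambda x: 'road' if 'road' in str.lower(x) else x)
endTime = time.time()
print(endTime - startTime)

# Method 3: boolean mask from str.contains (also matches "Broadway").
startTime = time.time()
allCrimes.loc[allCrimes['crime_location_approx'].str.contains('Road', case=False), 'crime_location_approx'] = 'road'
endTime = time.time()
print(endTime - startTime)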

My times in seconds, for the three methods respectively, are:

14.287408590316772
1.9554557800292969
5.129802942276001

The problem is that the last two methods, while faster, also replace "Broadway" with "road", hence the need for a regex that matches only at the end of the string.
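To make the difference concrete, here is a minimal sketch using plain Python strings rather than the real data (the sample names are made up for illustration). A substring test picks up "Broadway", whereas a suffix test does not:

```python
# Hypothetical sample values, for illustration only.
names = ["Foxclore Road", "Broadway", "Chesture Avenue"]

# Substring test: 'road' occurs inside 'broadway', giving a false positive.
substring_hits = [n for n in names if "road" in n.lower()]

# Suffix test: only names that actually end in 'road' match.
suffix_hits = [n for n in names if n.lower().endswith("road")]

print(substring_hits)  # ['Foxclore Road', 'Broadway']
print(suffix_hits)     # ['Foxclore Road']
```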

Is there any way to make the regex method much faster? With a large list of replacements, it could end up taking a long time.

asked Nov 24, 2017 at 11:43
\$\endgroup\$

1 Answer

1
\$\begingroup\$

There is not much to say about the code itself; the regex approach is simply slow.

A non-regex solution is to use Python's str.endswith. Combined with lower(), it behaves like a case-insensitive r"road$":

mergedAllCrimes.crime_location_approx = mergedAllCrimes.crime_location_approx.apply(lambda x: 'road' if x.lower().endswith('road') else x)

I'm assuming that all the keywords you want to match are at the end of the string.
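For a longer list of replacements, one possible extension is sketched below. It has not been benchmarked on the original data. The function name `collapse_to_keyword`, the `KEYWORDS` tuple and the sample values are assumptions for illustration. The value is lowercased once and then checked against each keyword in turn, and non-string values are passed through unchanged.

```python
# Hypothetical keyword list; extend as needed.
KEYWORDS = ("road", "avenue")


def collapse_to_keyword(value, keywords=KEYWORDS):
    """Return the keyword that `value` ends with (case-insensitive), else `value` unchanged."""
    if not isinstance(value, str):
        return value  # leave missing or non-string values (e.g. NaN) alone
    lowered = value.lower()  # lowercase once, then test every keyword
    for kw in keywords:
        if lowered.endswith(kw):
            return kw
    return value


# Hypothetical sample values, for illustration only.
samples = ["Foxclore Road", "Chesture Avenue", "Broadway", None]
print([collapse_to_keyword(s) for s in samples])
# ['road', 'avenue', 'Broadway', None]
```

With pandas, this could be passed to apply in the same way as the lambda above, for example mergedAllCrimes.crime_location_approx.apply(collapse_to_keyword).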

answered Nov 24, 2017 at 12:13
\$\endgroup\$
  • Thank you very much, this has sped it up by ~6 times. For posterity, I had to change it to str.lower(x), but otherwise it's perfect. There are a few edge cases where I will have to use the other methods, but this should work for the vast bulk of the data. Have a good day! Commented Nov 24, 2017 at 12:47
  • I have changed it slightly to x.lower(); calling str.lower(x) is not the correct format. Commented Nov 24, 2017 at 12:51
  • Well, isn't that neat, another 20% faster. Fantastic :) Commented Nov 24, 2017 at 12:54
