Faster way of replacing strings in large pandas dataframe with regex

Question

I want to replace every location with just its location keyword, e.g. "Foxclore Road" with "road" and "Chesture Avenue" with "avenue". The file is several GB, with many millions of rows. My current working code uses three methods:

import time

# Method 1: case-insensitive regex anchored at the end of the string.
# The inline flag (?i) must come first; regex=True is required on recent pandas.
startTime = time.time()
mergedAllCrimes['crime_location_approx'] = mergedAllCrimes.crime_location_approx.str.replace(r'(?i).*road$', 'road', regex=True)
endTime = time.time()
print(endTime - startTime)

# Method 2: substring test applied row by row
startTime = time.time()
mergedAllCrimes.crime_location_approx = mergedAllCrimes.crime_location_approx.apply(lambda x: 'road' if 'road' in str.lower(x) else x)
endTime = time.time()
print(endTime - startTime)

# Method 3: boolean mask built with str.contains
startTime = time.time()
allCrimes.loc[allCrimes['crime_location_approx'].str.contains('Road', case=False), 'crime_location_approx'] = 'road'
endTime = time.time()
print(endTime - startTime)

The times in seconds are, respectively:

14.287408590316772
1.9554557800292969
5.129802942276001

The problem is that the second and third methods, while faster, also replace "Broadway" with "road", hence the need for a regex that only matches at the end of the string.
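To make the false positive concrete, here is a minimal sketch in plain Python. The `names` list is a made-up sample built from the street names mentioned above; it is not taken from the real data.

```python
import re

# Hypothetical sample values, for illustration only.
names = ["Foxclore Road", "Broadway", "Chesture Avenue"]

# Substring test: "road" may appear anywhere, so "Broadway" is a false positive.
substring_hits = [n for n in names if "road" in n.lower()]

# End-anchored, case-insensitive regex: only names ending in "road" match.
ends_with_road = re.compile(r"road$", re.IGNORECASE)
anchored_hits = [n for n in names if ends_with_road.search(n)]

print(substring_hits)
print(anchored_hits)
```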

Is there any way to make the regex conditional method much faster? If I have a large list of replacements, it could end up taking a long time.

Answer

There is not much else to say about your code: regex is slow.

A non-regex solution is to use Python's str.endswith, which does the same job here as the pattern r"road$":

mergedAllCrimes.crime_location_approx = mergedAllCrimes.crime_location_approx.apply(lambda x: 'road' if x.lower().endswith('road') else x)

I'm assuming that all the keywords appear at the end of the string.
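If there is a longer list of keywords, the same idea could be extended roughly as follows. This is only a sketch: `KEYWORDS` and `simplify_location` are hypothetical names introduced for illustration, and the keyword list is an assumption about the data.

```python
# Hypothetical keyword list; extend it with whatever suffixes the data needs.
KEYWORDS = ("road", "avenue", "street")

def simplify_location(value, keywords=KEYWORDS):
    """Return the first keyword that value ends with, else value unchanged."""
    if not isinstance(value, str):
        return value  # leave non-string entries (e.g. missing values) untouched
    lowered = value.lower()
    for keyword in keywords:
        if lowered.endswith(keyword):
            return keyword
    return value
```

It could then be applied in the same way as the lambda, for example with mergedAllCrimes.crime_location_approx.apply(simplify_location). Like the regex r"road$", a plain suffix test also matches a name such as "Abroad", so it only helps if the keyword really is the last word.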

Comment

Thank you very much, this has sped it up by ~6 times. For posterity, I had to change it to str.lower(x), but otherwise it's perfect. There are a few edge cases where I will have to use the other methods, but this should work for the vast bulk of the data. Have a good day!

Comment

I have changed it slightly to x.lower(); str.lower(x) is not the correct way to call it.

Comment

Well, isn't that neat: another 20% faster. Fantastic :)

The answer above is the accepted answer (score 1), posted by Ludisposed on Nov 24, 2017 at 12:13 and edited on Nov 24, 2017 at 12:49. The three comments that follow it were posted on Nov 24, 2017 by Zulfiqaar (12:47), Ludisposed (12:51), and Zulfiqaar (12:54).
