I am parsing a CSV file with multi-character delimiters in pandas as follows:

big_df = pd.read_csv(os.path.expanduser('~/path/to/csv/with/special/delimiters.csv'), 
 encoding='utf8', 
 sep='\$\$><\$\$', 
 decimal=',', 
 engine='python')
big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace('\$\$>$', '')
big_df = big_df.replace(['^<', '>$'], ['', ''], regex=True)
big_df.columns = big_df.columns.to_series().replace(['^<', '>$', '>\$\$'], ['', '', ''], regex=True)

This worked fine until I recently upgraded my pandas installation. Now I see a lot of deprecation warnings:

<input>:3: DeprecationWarning: invalid escape sequence \$
<input>:3: DeprecationWarning: invalid escape sequence \$
<input>:3: DeprecationWarning: invalid escape sequence \$
<input>:3: DeprecationWarning: invalid escape sequence \$
<input>:3: DeprecationWarning: invalid escape sequence \$
<ipython-input-6-1ba5b58b9e9e>:3: DeprecationWarning: invalid escape sequence \$
 sep='\$\$><\$\$',
<ipython-input-6-1ba5b58b9e9e>:7: DeprecationWarning: invalid escape sequence \$
 big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace('\$\$>$', '')

As I need the special delimiters containing the $ symbols, I am unsure how to handle these warnings properly.

asked Jun 2, 2017 at 9:52
  • Use raw strings: r'\$\$><\$\$' etc. That way string escaping and regex escaping don't interfere. Commented Jun 2, 2017 at 10:05
  • Thanks, this is already the answer. If you want to, feel free to post it as an answer. Commented Jun 2, 2017 at 10:13
  • Thanks. I was going to refuse, but this deprecation seems to be a pretty new thing: I mostly find GitHub issues for libraries such as jinja, scikit, sympy, etc., all from the past week or so. Commented Jun 2, 2017 at 10:21

1 Answer

The problem is that escaping in string literals can interfere with escaping in regular expressions. While '\s' is a valid regex token, in a Python string literal it is an unrecognized escape sequence. Python keeps the backslash in that case, so the literal '\s' ends up as '\\s', i.e. r'\s'. It is this silent fallback that has been deprecated as of Python 3.6, which is why '\$' now triggers a warning.

The point is to always use raw string literals when constructing regular expressions, so that Python leaves the backslashes alone and passes them on to the regex engine unchanged. Python used to accept unrecognized escape sequences without complaint; since 3.6 it warns about them when the code is compiled, which pushes programmers to be explicit and unambiguous (which I fully support). The warning therefore comes from the Python version in use rather than from pandas itself.
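To make the fallback concrete, here is a minimal sketch that compiles the non-raw literal from the question and records what Python reports. The variable names and the "<example>" filename are made up for illustration. The warning category depends on the interpreter: as far as I know it is DeprecationWarning on 3.6 to 3.11 and SyntaxWarning from 3.12 on, and the sketch assumes a version that still only warns.

```python
import warnings

# Source text of a tiny module containing the non-raw literal from the question.
source = r"pattern = '\$\$><\$\$'"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning instead of filtering
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)

pattern = namespace["pattern"]

# The backslashes survive, so all three spellings denote the same string.
print(pattern == r'\$\$><\$\$')        # True
print(pattern == '\\$\\$><\\$\\$')     # True
print(len(pattern))                    # 10

# Which warning category was raised depends on the Python version.
print({w.category.__name__ for w in caught})
```

So the pattern that reaches the regex engine is the same either way; the raw string just states it without relying on the deprecated fallback.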

In your specific case, every pattern should be changed to a raw string, e.g. from '\$\$><\$\$' to r'\$\$><\$\$'. The same goes for the other patterns:

big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace(r'\$\$>$', '')

What actually happens is that the backslashes themselves have to be escaped for Python in order to have a literal length-2 '\$' string in your regex pattern. The raw string does that for you:

>>> r'\$\$><\$\$'
'\\$\\$><\\$\\$'
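As a quick check that the raw patterns behave as intended, here is a small sketch using only the standard re module. The sample line is invented; I am guessing at the file layout from the patterns in the question (fields wrapped in < and >, joined by $$><$$, with a trailing $$>), so treat it as an assumption rather than the real data.

```python
import re

# Raw-string versions of the patterns from the question.
DELIMITER = r'\$\$><\$\$'
TRAILER = r'\$\$>$'

# Hypothetical row, layout guessed from the question's patterns.
line = '<1,5>$$><$$<foo>$$><$$<bar>$$>'

# Drop the trailing "$$>", split on the delimiter, then strip the angle brackets.
fields = re.split(DELIMITER, re.sub(TRAILER, '', line))
fields = [re.sub(r'^<|>$', '', field) for field in fields]
print(fields)  # ['1,5', 'foo', 'bar']

# Alternative: let re.escape build the pattern from the literal delimiter text.
escaped_fields = re.split(re.escape('$$><$$'), re.sub(TRAILER, '', line))
print(escaped_fields)  # ['<1,5>', '<foo>', '<bar>']
```

re.escape can be handy when the delimiter is fixed text, since it avoids writing any backslashes by hand.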
answered Jun 2, 2017 at 10:19