Python function to strip inline repr'd unicode strings

Question 1

I have to write a Python function that takes a string containing inline repr'd unicode strings (e.g. "u'hello' there") and returns a string with the u'' identifiers stripped from it. For example:

strip_inline_unicode("u'hello' there") # -> 'hello there'

Without getting into too much detail as to why I have to do this, suffice it to say that I cannot simply replace the logic that generates these strings, and I have to strip the identifiers because the result is returned to a user as human-readable output.

Constraints:

Strings can have "malformed" unicode identifiers which should be handled correctly (e.g. "hello u'there" -> "hello there")
Strings can have empty unicode identifiers (e.g. "hello u''there" -> "hello there")
Strings will never have "nested" unicode identifiers (e.g. 'u"u\'foo\'"')
Quotes will always be single-quotes (so never u"<stuff>")

Here is what I came up with:

def strip_inline_unicode(stupid_string):
    """Takes a string that looks like "u'hello' there" and returns
    "hello there" """
    in_unicode = False
    pos = 0
    new_str = ''
    while (pos < len(stupid_string)):
        if pos + 1 >= len(stupid_string):
            if in_unicode and stupid_string[-1] == "'":
                new_str += stupid_string[pos:-1]
            else:
                new_str += stupid_string[pos:]
            break
        cur = stupid_string[pos]
        nxt = stupid_string[pos + 1]
        if cur == 'u' and nxt == "'" and not in_unicode:
            in_unicode = True
            pos += 1
        elif in_unicode and cur == "'":
            in_unicode = False
        else:
            new_str += cur
        pos += 1
    return new_str

When I plug it into the interpreter it seems to work correctly:

In [12]: strip_inline_unicode("u'hello' there")
Out[12]: 'hello there'
In [14]: strip_inline_unicode("hello there")
Out[14]: 'hello there'
In [15]: strip_inline_unicode("hello u'there")
Out[15]: 'hello there' 
In [16]: strip_inline_unicode("hello u''there")
Out[16]: 'hello there'
In [17]: strip_inline_unicode("au'b'")
Out[17]: 'ab'
In [18]: strip_inline_unicode("u'abc'")
Out[18]: 'abc'

However, I am by no means a Python expert, and it seems like I could accomplish something similar in a simpler and more robust manner, perhaps by using regexes. I was hoping to get some feedback on the implementation and maybe simplify it/make it better.

Answer

Boldewyn · Accepted Answer · 2015-03-18 21:37:46Z

Yes, regexps were my first thought:

re.sub(r"u'([^']*)'?", r'\1', string)

Dissected:

A literal u'.
Then anything that is not a ' zero or more times: [^']*.
Store that for later retrieval: ([^']*).
End with an optional '.
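As a minimal sketch, the one-liner above can be wrapped in a function and checked against the inputs from the question's interpreter session. The function name strip_inline_unicode is reused from the question; the precompiled pattern name _INLINE_UNICODE is an assumption made here for illustration.

```python
import re

# Hypothetical module-level name (not from the original answer).
# Matches a literal u', captures everything up to the next quote,
# and consumes the closing quote if there is one.
_INLINE_UNICODE = re.compile(r"u'([^']*)'?")


def strip_inline_unicode(text):
    """Replace each u'...' span with the text captured inside it."""
    return _INLINE_UNICODE.sub(r"\1", text)


# Inputs and expected outputs taken from the question.
cases = {
    "u'hello' there": "hello there",
    "hello there": "hello there",
    "hello u'there": "hello there",
    "hello u''there": "hello there",
    "au'b'": "ab",
    "u'abc'": "abc",
}

for raw, expected in cases.items():
    assert strip_inline_unicode(raw) == expected, (raw, expected)
```

One thing to keep in mind: the pattern matches any u followed by a single quote, so ordinary text such as "you've" would also be altered. Anchoring the u with a word boundary would avoid that, but it would then no longer strip the "au'b'" case shown in the question.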


Comment

Infinitely better. I missed the part in the re.sub docs that mentions that backreferences can be used to specify replacements. Really cool stuff. Thanks so much for your help!

Travis Kaufman · 2015-03-19 00:52:26 +00:00
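For readers unfamiliar with the feature the comment mentions: in a re.sub replacement string, \1 inserts whatever the first capturing group matched, and \g<1> is an equivalent spelling. A small sketch, with the sample string and variable names chosen here purely for illustration:

```python
import re

pattern = r"u'([^']*)'?"
sample = "[u'a', u'b']"

# \1 and \g<1> both insert the text captured by group 1.
with_number = re.sub(pattern, r"\1", sample)
with_g_form = re.sub(pattern, r"\g<1>", sample)

# A callable replacement receives the match object instead of a template.
with_callable = re.sub(pattern, lambda m: m.group(1), sample)

print(with_number)  # [a, b]
```

The \g<1> form is mainly useful when the backreference is followed directly by a digit, where \1 would be ambiguous.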
