I have to write a Python function that takes a string containing inline unicode string literals (e.g. "u'hello' there") and returns it with the u'' identifiers stripped. For example:

strip_inline_unicode("u'hello' there") # -> 'hello there'

Without getting into too much detail as to why I have to do this, suffice it to say that I cannot simply replace the logic that generates these strings, and I have to strip the identifiers because the result is returned as human-readable output to a user.

Constraints:

  • Strings can have "malformed" unicode identifiers which should be handled correctly (e.g. "hello u'there" -> "hello there")
  • Strings can have empty unicode identifiers (e.g. "hello u''there" -> "hello there")
  • Strings will never have "nested" unicode identifiers (e.g. 'u"u\'foo\'"')
  • Quotes will always be single-quotes (so never u"<stuff>")

Here is what I came up with:

def strip_inline_unicode(stupid_string):
    """Takes a string that looks like "u'hello' there" and returns
    "hello there" """
    in_unicode = False
    pos = 0
    new_str = ''
    while (pos < len(stupid_string)):
        if pos + 1 >= len(stupid_string):
            if in_unicode and stupid_string[-1] == "'":
                new_str += stupid_string[pos:-1]
            else:
                new_str += stupid_string[pos:]
            break
        cur = stupid_string[pos]
        nxt = stupid_string[pos + 1]
        if cur == 'u' and nxt == "'" and not in_unicode:
            in_unicode = True
            pos += 1
        elif in_unicode and cur == "'":
            in_unicode = False
        else:
            new_str += cur
        pos += 1
    return new_str

When I plug it into the interpreter it seems to work correctly:

In [12]: strip_inline_unicode("u'hello' there")
Out[12]: 'hello there'
In [14]: strip_inline_unicode("hello there")
Out[14]: 'hello there'
In [15]: strip_inline_unicode("hello u'there")
Out[15]: 'hello there' 
In [16]: strip_inline_unicode("hello u''there")
Out[16]: 'hello there'
In [17]: strip_inline_unicode("au'b'")
Out[17]: 'ab'
In [18]: strip_inline_unicode("u'abc'")
Out[18]: 'abc'

However, I am by no means a Python expert, and it seems like I could accomplish something similar in a simpler and more robust manner, perhaps by using regexes. I was hoping to get some feedback on the implementation and maybe simplify it/make it better.

Jamal
asked Mar 18, 2015 at 21:25

1 Answer


Yes, regexps were my first thought:

re.sub(r"u'([^']*)'?", r'\1', string)

Dissected:

  1. A literal u'.
  2. Then anything that is not a ' zero or more times: [^']*.
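  3. Store that as group 1 for later retrieval: ([^']*). The replacement \1 puts the stored text back in place of the whole match.
  4. End with an optional ', so an unterminated u' is matched as well.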
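
To tie this back to the question, here is a minimal sketch of the same substitution wrapped in a function with the asker's name, strip_inline_unicode. The precompiled pattern name _INLINE_UNICODE is my own choice for illustration, and the sample strings are the ones from the interpreter session in the question:

```python
import re

# u' followed by a captured run of non-quote characters and an optional
# closing quote. _INLINE_UNICODE is a hypothetical name, not from the question.
_INLINE_UNICODE = re.compile(r"u'([^']*)'?")


def strip_inline_unicode(text):
    """Return text with each u'...' wrapper replaced by its quoted content."""
    # \1 refers back to the first capture group, i.e. the text between quotes.
    return _INLINE_UNICODE.sub(r"\1", text)


if __name__ == "__main__":
    samples = [
        "u'hello' there",
        "hello there",
        "hello u'there",
        "hello u''there",
        "au'b'",
        "u'abc'",
    ]
    for sample in samples:
        print(repr(sample), "->", repr(strip_inline_unicode(sample)))
```

Compiling the pattern once is optional; calling re.sub directly as above should behave the same for these inputs.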
answered Mar 18, 2015 at 21:37
  • Infinitely better. I missed the part in the re.sub docs that mentions that backreferences can be used to specify replacements. Really cool stuff. Thanks so much for your help! Commented Mar 19, 2015 at 0:52
