I've received a unicode string from the wild that causes some of our psycopg2 statements to fail.
I have reduced the problem down to an SSCE:
import psycopg2
conn = psycopg2.connect(...)
cur = conn.cursor()
x = u'\ud837'
cur.execute("SELECT %s", (x,))
print cur.fetchone()
Running this gives the following exception:
Traceback (most recent call last):
File ".../run.py", line 65, in <module>
cur.execute("SELECT %s AS test", (x,))
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xb7
Based on some of the comments, it has become clear that this particular character is one half of a surrogate pair and is therefore invalid on its own.
Specifically then, I am looking for a mechanism to detect when a string contains an incomplete surrogate pair in Python 2.
One method I have found that does raise an exception is x.encode('utf16').decode('utf16'); however, since I don't fully understand the risks associated with relying on that behavior, I'm hesitant to depend on it.
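For reference, the check I experimented with looks roughly like this (the helper name is my own; it leans on the UTF-16 codec raising for unpaired surrogates, which is exactly the behavior I'm unsure I can rely on):
def has_lone_surrogate(s):
    # Round-trip through UTF-16; on my Python 2 build an unpaired
    # surrogate makes the round trip raise a UnicodeError. I don't
    # know whether this behavior is documented or guaranteed.
    try:
        s.encode('utf16').decode('utf16')
        return False
    except UnicodeError:
        return True

print has_lone_surrogate(u'\ud837')  # True on my build
print has_lone_surrogate(u'abc')     # False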
Edit: Reduced SSCE string to single character causing the problem, added information based on comments.
-
The character represents one half of a surrogate pair and doesn't represent a code point of its own. Presumably you obtained it through an API that split a UTF-16-encoded string without paying attention to character boundaries. – user4815162342, Nov 14, 2016 at 19:49
-
@user4815162342 so how can I detect whether a given string in python contains any such incomplete surrogate pairs? – Kevin Dolan, Nov 14, 2016 at 19:51
-
Just curious, has my answer helped with the question? – user4815162342, Nov 30, 2016 at 17:58
2 Answers
The string u'\ud837' consists of a lone member of a surrogate pair: two 16-bit code units that must appear in sequence to encode a single logical character. As such, it does not represent a Unicode character on its own; rather, it is an implementation detail of the UTF-16 encoding, which uses surrogate pairs to pack the full code point range into 16-bit code units. Python 3 correctly rejects attempts to encode lone surrogates in any byte encoding, including the UTF-* variants.
The string probably originated from a system that internally uses UTF-16 (such as Java, C#, Windows, or Python 2 built with 16-bit Py_UNICODE) and that naively truncated the string without respecting surrogate pair boundaries.
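To illustrate how that can happen, here is a small sketch (assuming a narrow Python 2 build, where a unicode string is a sequence of UTF-16 code units and slicing can split a surrogate pair):
# On a narrow build, a character outside the BMP occupies two code units:
s = u'\U0001dc00'              # chosen so its leading surrogate is u'\ud837'
print len(s)                   # 2, not 1: stored as u'\ud837' + u'\udc00'

# A naive length-based truncation cuts between the two surrogates:
truncated = s[:1]
print truncated == u'\ud837'   # True: exactly the string from the question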
Taking the regex from this answer, it should be possible to efficiently detect such strings using code such as:
import re

lone = re.compile(
    ur'''(?x)                 # verbose expression (allows comments)
    (                         # begin group
        [\ud800-\udbff]       # match leading surrogate
        (?![\udc00-\udfff])   # but only if not followed by trailing surrogate
    )                         # end group
    |                         # OR
    (                         # begin group
        (?<![\ud800-\udbff])  # if not preceded by leading surrogate
        [\udc00-\udfff]       # match trailing surrogate
    )                         # end group
    ''')

def invalid_unicode(s):
    assert isinstance(s, unicode)
    return lone.search(s) is not None
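Applied to the strings in question, the expected results look like this (assuming a narrow Python 2 build, where the pattern operates directly on UTF-16 code units):
print invalid_unicode(u'\ud837')        # True: lone leading surrogate
print invalid_unicode(u'\ud837\udc00')  # False: complete surrogate pair
print invalid_unicode(u'hello')         # False: no surrogates at all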
To detect that the string cannot be represented in valid UTF-8, just wrap an attempt to encode it in a try/except before executing the statement in psycopg2.
As for what caused the problem: there is a specific character in the middle of the original string that is UTF-16 encoded: \U000d8a85. So it's not that Postgres merely refuses to treat the string as UTF-8; it genuinely isn't valid UTF-8.
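A minimal sketch of that approach (the helper name is illustrative, not an established pattern; note also the caveat raised in the comments below):
def execute_checked(cur, query, params):
    # Hypothetical helper: pre-validate unicode parameters with a strict
    # UTF-8 encode so we fail locally instead of getting a DataError from
    # Postgres. Caveat (see the comments below): Python 2's UTF-8 codec
    # accepts lone surrogates, so this does NOT catch the u'\ud837' case.
    for p in params:
        if isinstance(p, unicode):
            try:
                p.encode('utf-8')
            except UnicodeEncodeError:
                raise ValueError('parameter not encodable as UTF-8: %r' % (p,))
    cur.execute(query, params)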
2 Comments
-
x.encode('utf-8') does not cause an exception. Neither does x.encode('utf-8').decode('utf-8'). Which leads me to believe either: python believes this to be valid utf-8, or python has fallbacks to decode utf-8 in a non-strict way.
-
\ud837 -- any idea what's going on there?
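As far as I can tell, what's going on is that Python 2's UTF-8 codec accepts lone surrogates: it encodes u'\ud837' to the three bytes 0xed 0xa0 0xb7 (exactly the bytes Postgres rejects in the traceback above) and decodes them back without complaint, whereas Python 3 refuses. A quick Python 2 demonstration:
print repr(u'\ud837'.encode('utf-8'))       # '\xed\xa0\xb7'
print repr('\xed\xa0\xb7'.decode('utf-8'))  # u'\ud837'
# Python 3 instead raises: UnicodeEncodeError: ... surrogates not allowed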