[Python-Dev] Reject characters bigger than U+10FFFF and Solaris issues

Thu Dec 8 10:17:52 CET 2011

Victor Stinner <victor.stinner at haypocalc.com> wrote:
> For localeconv(), it is the b'\xA0' byte string decoded from an encoding 
> looking like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It looks like 
> a bug in the decoder. It also looks like OpenIndiana doesn't use ISO-8859 
> locale anymore, only UTF-8 locales (which is much better!). I'm unable to 
> reproduce the issue on my OpenIndiana VM.

I think that b'\xA0' is a valid thousands separator. The 'fi_FI' locale also
uses it. Decimal.__format__() has to handle the 'n' specifier, which takes the
thousands separator directly from localeconv(). Currently I have this horrible
function to deal with the problem:

/* Convert decimal_point or thousands_sep, which may be multibyte or in
   the range [128, 255], to a UTF8 string. */
static PyObject *
dotsep_as_utf8(const char *s)
{
    PyObject *utf8;
    PyObject *tmp;
    wchar_t buf[2];
    size_t n;

    n = mbstowcs(buf, s, 2);
    if (n != 1) { /* Issue #7442 */
        PyErr_SetString(PyExc_ValueError,
            "invalid decimal point or unsupported "
            "combination of LC_CTYPE and LC_NUMERIC");
        return NULL;
    }
    tmp = PyUnicode_FromWideChar(buf, n);
    if (tmp == NULL) {
        return NULL;
    }
    utf8 = PyUnicode_AsUTF8String(tmp);
    Py_DECREF(tmp);
    return utf8;
}

The main issue is that there is no portable function mbst_to_utf8()
that uses the current locale. If possible, it would be great to have
such a thing in the C-API.
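
As a rough illustration of what such a helper might do, here is a minimal
sketch in plain C, without the Python C-API. The names encode_utf8() and
locale_str_to_utf8() are hypothetical and do not exist in CPython or libc.
The sketch assumes that wchar_t values are Unicode code points, which holds
where __STDC_ISO_10646__ is defined (for example glibc) but is not guaranteed
everywhere; it may well be false on the platforms where the problem shows up.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: encode one code point as UTF-8 into out (which must
   have room for 4 bytes). Returns the number of bytes written, or 0 for
   surrogates and for values above U+10FFFF. */
static size_t
encode_utf8(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0; /* surrogates are not valid scalar values */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: decode s using the current LC_CTYPE locale and write
   it as a NUL-terminated UTF-8 string into out. Returns the UTF-8 length,
   or (size_t)-1 on a decoding error or if out is too small.
   ASSUMPTION: wchar_t holds Unicode code points. */
static size_t
locale_str_to_utf8(const char *s, char *out, size_t outsize)
{
    mbstate_t state;
    size_t used = 0;
    size_t remaining;

    assert(s != NULL && out != NULL);
    remaining = strlen(s);
    memset(&state, 0, sizeof(state));

    while (remaining > 0) {
        wchar_t wc;
        unsigned char enc[4];
        size_t n;
        size_t consumed = mbrtowc(&wc, s, remaining, &state);

        if (consumed == (size_t)-1 || consumed == (size_t)-2) {
            return (size_t)-1; /* invalid or truncated sequence */
        }
        if (consumed == 0) {
            break; /* decoded a NUL character */
        }
        n = encode_utf8((unsigned long)wc, enc);
        if (n == 0 || used + n >= outsize) {
            return (size_t)-1; /* bad code point or no room for NUL */
        }
        memcpy(out + used, enc, n);
        used += n;
        s += consumed;
        remaining -= consumed;
    }
    if (used >= outsize) {
        return (size_t)-1;
    }
    out[used] = '\0';
    return used;
}
```

Under the stated assumption, a caller would first select the locale with
setlocale() and then pass localeconv()->thousands_sep to
locale_str_to_utf8(). In an ISO-8859-1 locale the single byte 0xA0 should
then come out as the two bytes 0xC2 0xA0 (U+00A0).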

I'm not sure why the b'\xA0' problem only occurs on Solaris. Many systems
have this thousands separator.
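
To compare systems, it may help to look at the raw bytes that localeconv()
returns for a given locale. The following is a small diagnostic sketch; the
helper names bytes_as_hex() and describe_separators() are made up for this
example, and whether a locale such as 'fi_FI' is installed depends on the
system.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s into out as space-separated
   uppercase hex. Returns 0 on success, -1 if out is too small. */
static int
bytes_as_hex(const char *s, char *out, size_t outsize)
{
    size_t used = 0;

    assert(s != NULL && out != NULL);
    if (outsize == 0) {
        return -1;
    }
    out[0] = '\0';
    for (; *s != '\0'; s++) {
        int n = snprintf(out + used, outsize - used, "%s%02X",
                         used ? " " : "",
                         (unsigned int)(unsigned char)*s);
        if (n < 0 || (size_t)n >= outsize - used) {
            return -1;
        }
        used += (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: switch to the named locale and describe the raw
   bytes of decimal_point and thousands_sep. Returns 0 on success, -1 if
   the locale is unavailable or out is too small. */
static int
describe_separators(const char *locale_name, char *out, size_t outsize)
{
    struct lconv *lc;
    char dp[32];
    char ts[32];
    int n;

    if (setlocale(LC_ALL, locale_name) == NULL) {
        return -1; /* locale not installed on this system */
    }
    lc = localeconv();
    if (bytes_as_hex(lc->decimal_point, dp, sizeof(dp)) < 0 ||
        bytes_as_hex(lc->thousands_sep, ts, sizeof(ts)) < 0) {
        return -1;
    }
    n = snprintf(out, outsize, "decimal_point=[%s] thousands_sep=[%s]",
                 dp, ts);
    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}
```

Calling describe_separators() with the locale name in question and printing
the result shows whether the separator arrives as the single byte A0 or as
a multibyte sequence such as C2 A0.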

Stefan Krah