2

I have Postgresql database that internally uses UTF-8 encoding. Some clients that connect to the database use LATIN 2 (ISO 8859-2) client encoding when they connect to database. I can't change the encoding on clients. The clients are selecting values from a table that contains text. The problem is that some (not all but a small percentage) rows in this table contain characters that can not be converted to LATIN 2 so there is an error when clients try to select this rows.

I can simulate this with psql:

database=> set client_encoding = 'LATIN2';
database=> select text_field from some_table;
ERROR: character with byte sequence 0xc4 0x9f in encoding "UTF8" has no equivalent in encoding "LATIN2"

Since only a small percentage of the rows contain 'non convertible' characters I can select rows that have 'convertible' if i put the limit 10 because first 10 rows have supported characters.

database=> set client_encoding = 'LATIN2';
database=> select text_field from some_table limit 10;
..... works fine, I get 10 rows with data since all of them can be converted.

Can I somehow select the problematic rows that are causing errors with UTF-8 to LATIN2 conversion? I would like to isolate and modify (update or delete, whatever...) this rows but I can't find a way to select only this rows.

I tried the following that is not working:

With '^[[:ascii:]]*$' - not OK because this select non ascii characters (hex value > 127) that have equivalents in LATIN2 so the conversion for them works:

select * from some_table where text_field !~ '^[[:ascii:]]*$';

Comparing char_length and octet_length - not OK since it also selects rows with characters that have equivalents in LATIN2.

select * from some_table where WHERE char_length(text_field)!=octet_length(text_field);
asked Dec 13, 2022 at 14:32

1 Answer 1

2

I'd use this SQL function:

create function is_latin2(text) returns boolean as
$$
 select 1ドル ~ '^[ !"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Ą ̆٤ĽŚ§ ̈ŠŞŤŹ­ŽŻ°ą ̨ł ́ľśˇ ̧šşťź ×ばつŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ ̇]*$'
$$ language sql strict immutable;

How it's built

According to https://en.wikipedia.org/wiki/ISO/IEC_8859-2, characters in LATIN2 have codes from 32 to 126 and 160 to 255, so a string containing all the valid characters can be generated with the following query (that should be run either in an UTF-8 or LATIN2-encoded database):

select convert_from(string_agg(set_byte('000円',0,i),'' order by i), 'LATIN2')
 from 
(select i from generate_series(32, 126) as i
 union select i from generate_series(160,255) as i) s;

It produces this result :

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Ą ̆٤ĽŚ§ ̈ŠŞŤŹ­ŽŻ°ą ̨ł ́ľśˇ ̧šşťź ×ばつŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ ̇

This can be injected into a regexp like your '^[[:ascii:]]*$' except that it has all the characters explicitly listed in the bracket expression. You might also compress the series of ascii letters into ranges.

4 characters must be quoted:

  • the single quote must be doubled
  • [ ] \ must be quoted with a backslash
answered Dec 14, 2022 at 20:58

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.