How to speed up string cleanup function?

Question 1

I need to cleanup a string, so that certain ASCII code characters are left out of the string, and others are replaced.

I am new to Postgres. My function ufn_cie_easy() performs way too slow:

DECLARE
 letter char = '';
 str_result TEXT = '';
 x integer;
 y integer;
 asc_code int;
BEGIN
 y:=1;
 x:=char_length(arg);
 LOOP
 letter=substring(arg from y for 1);
 asc_code=ascii(letter);
 IF (asc_code BETWEEN 47 and 58) or (asc_code BETWEEN 65 and 90) or (
 asc_code BETWEEN 97 and 122) THEN
 str_result := str_result || letter;
 ELSIF (asc_code BETWEEN 192 and 197) THEN
 str_result := str_result || 'A';
 ELSIF (asc_code BETWEEN 200 and 203) THEN
 str_result := str_result || 'E';
 ELSIF (asc_code BETWEEN 204 and 207) THEN
 str_result := str_result || 'I';
 ELSIF (asc_code BETWEEN 210 and 214) OR (asc_code=216) THEN
 str_result := str_result || 'O';
 ELSIF (asc_code BETWEEN 217 and 220) THEN
 str_result := str_result || 'U';
 ELSIF (asc_code BETWEEN 224 and 229) THEN
 str_result := str_result || 'a';
 ELSIF (asc_code BETWEEN 232 and 235) THEN
 str_result := str_result || 'e';
 ELSIF (asc_code BETWEEN 236 and 239) THEN
 str_result := str_result || 'i';
 ELSIF (asc_code BETWEEN 242 and 246) OR (asc_code=248) THEN
 str_result := str_result || 'o';
 ELSIF (asc_code BETWEEN 249 and 252) THEN
 str_result := str_result || 'u';
 ELSE
 CASE asc_code
 WHEN 352 THEN str_result := str_result || 'S';
 WHEN 338 THEN str_result := str_result || 'OE';
 WHEN 381 THEN str_result := str_result || 'Z';
 WHEN 353 THEN str_result := str_result || 's';
 WHEN 339 THEN str_result := str_result || 'oe';
 WHEN 382 THEN str_result := str_result || 'z';
 WHEN 162 THEN str_result := str_result || 'c';
 WHEN 198 THEN str_result := str_result || 'AE';
 WHEN 199 THEN str_result := str_result || 'C';
 WHEN 208 THEN str_result := str_result || 'D';
 WHEN 209 THEN str_result := str_result || 'N';
 WHEN 223 THEN str_result := str_result || 'ss';
 WHEN 230 THEN str_result := str_result || 'ae';
 WHEN 231 THEN str_result := str_result || 'c';
 WHEN 241 THEN str_result := str_result || 'n';
 WHEN 376 THEN str_result := str_result || 'Y';
 WHEN 221 THEN str_result := str_result || 'Y';
 WHEN 253 THEN str_result := str_result || 'y';
 WHEN 255 THEN str_result := str_result || 'y';
 ELSE str_result := str_result;
 END CASE;
 END IF; 
 y:=y+1;
 exit when y=x+1;
 END LOOP;
 return str_result;
END;

Question 2

Thx for your reply. Data is imported from MS Access and into a PG db with UTF-8 character set. A conversion from one char set to another (e.g. LATIN5 to UTF-8) will not help, because this is a user defined function in which certain characters are simply omitted from the string, and others replaced. (The asc_code values are not a problem I'm coping with btw, they have been tested). The function works fine for a few records, but I need to use it for multiple calculated fields on a few million records. Takes forever.

Question 3

Postgres 9.6 or later

After this update to unaccent:

Extend contrib/unaccent's standard unaccent.rules file to handle all diacritics known to Unicode, and expand ligatures correctly (Thomas Munro, Léonard Benedetti)

Bold emphasis mine. Using unaccent() like instructed in the linked answer above:

CREATE OR REPLACE FUNCTION f_cie_easy(text)
 RETURNS text
 LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
$func$
SELECT trim(regexp_replace(regexp_replace(public.unaccent('public.unaccent', 1ドル)
 , '[^a-zA-Z\d\s]', '', 'g')
 , '\s+', ' ', 'g'));
$func$;

PARALLEL SAFE to label compliance with parallel query plans - in Postgres 9.6 or later.

Almost, but not quite 100 % identical to what you had. Better IMHO:

Replace all accents and expands all ligatures with unaccent().
Remove all noise characters (everything except ASCII letters, digits and white space).
Fold whitespace to a single space (' ').
Trim leading and trailing spaces.

Detailed explanation and more related issues for an IMMUTABLE function with unaccent():

Does PostgreSQL support "accent insensitive" collations?

Postgres 9.5

The function is noticeably faster with fewer assignments (none, actually, in the updated version). Assignments are comparatively expensive in PL/pgSQL:

CREATE FUNCTION f_cie_easy(text)
 RETURNS text
 LANGUAGE plpgsql PARALLEL SAFE AS
$func$
BEGIN
 RETURN replace(replace(replace(replace(replace(replace(
 translate(1,ドル'ŠŽšžŸÝÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ!"#$%&()*+,-./:;<=>?@[\]^_`{|}~€‚ƒ„...†‡ˆ‰‹‘’""•–— ̃TM›¢£¤\¦§ ̈©a«¬® ̄°±23 ́μ¶· ×ばつÞð÷þÐ'
 ,'SZszYYAAAAAACEEEEIIIINOOOOOOUUUUaaaaaaceeeeiiiinoooooouuuuyy')
 ,'Œ','OE')
 ,'Æ','AE')
 ,'œ','oe')
 ,'æ','ae')
 ,'ß','ss')
 ,'''','');
END
$func$;

But I suspect you really want to remove all accents (diacritic signs). Postgres provides the function unaccent() with the additional module unaccent:

CREATE OR REPLACE FUNCTION f_cie_easy(text)
 RETURNS text
 LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
$func$
 SELECT translate(ext.unaccent(regdictionary 'ext.unaccent'
 , replace(replace(replace(replace(replace(
 1ドル
 ,'Œ','OE')
 ,'Æ','AE')
 ,'œ','oe')
 ,'æ','ae')
 ,'ß','ss')
 ), '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~€‚ƒ„...†‡ˆ‰‹‘’""•–— ̃TM›¢£¤\¦§ ̈©a«¬® ̄°±23 ́μ¶· ×ばつÞð÷þÐ''', '');
$func$;

Substantially faster, yet. And superior in every aspect to remove all diacritic signs.

In addition to unaccent():

Include an escaped ' ('') in translate() directly.
Make it a simple SQL function.
Make the function IMMUTABLE.

In Postgres 9.5 or older we need to expand ligatures like 'Œ' or 'ß' manually, since unaccent() always substitutes a single letter:

SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
E A e a S

Question 4

That's briliant help, Erwin, thx so much.

Question 5

@W.Smets: It gets better, yet. Consider the updates.

Question 6

From what I understand, you could use translate() to achieve what you want.

A small demonstration:

SELECT translate('Some ţext with ひ inţereşt平ing chăracters', 'ţşăひ平', 'tsa');
 translate 
────────────────────────────────────────
 Some text with interesting characters

So, first you put the characters you want to replace, each matching (in the given order!) their replacement in the third parameter to the function. Then just list all characters that has to be removed, with no match in the replacement list.

For those characters that need a multi-character replacement, you can still use replace() in a similar fashion, but one call per character:

SELECT replace(replace('Æ small blæblæ', 'Æ', 'AE'), 'æ', 'ae');
 replace 
───────────────────
 AE small blaeblae

Question 7

Right! That looks very promising. Especially if I could build up the second parameter of the replace function using code. In VB(A) you would have a static variable calculated once. Any suggestions as how to get PostgreSQL (using SQL Manager for PostgreSQL) to calculate a value once and reuse it as a variable in functions? Thanks a lot for the help!

Question 8

FYI, reworked version of the function (incoming: text_to_clean text). Doesn't look nice, but works:

DECLARE
result text;
BEGIN
result=translate(text_to_clean,'ŠŽšžŸÝÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ!"#$%&()*+,-./:;<=>?@[\]^_`{|}~€‚ƒ„...†‡ˆ‰‹‘’""•–— ̃TM›¢£¤\¦§ ̈©a«¬® ̄°±23 ́μ¶· ×ばつÞð÷þÐ','SZszYYAAAAAACEEEEIIIINOOOOOOUUUUaaaaaaceeeeiiiinoooooouuuuyy');
result=replace(result,'Œ','OE');
result=replace(result,'Æ','AE');
result=replace(result,'œ','oe');
result=replace(result,'æ','ae');
result=replace(result,'ß','ss');
result=replace(result,'''','');
return result;
END;

Thanks again for the help.

score 2 · Answer 1 · 2016-08-02 01:08:41Z

Postgres 9.6 or later

After this update to unaccent:

Extend contrib/unaccent's standard unaccent.rules file to handle all diacritics known to Unicode, and expand ligatures correctly (Thomas Munro, Léonard Benedetti)

Bold emphasis mine. Using unaccent() like instructed in the linked answer above:

CREATE OR REPLACE FUNCTION f_cie_easy(text)
 RETURNS text
 LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
$func$
SELECT trim(regexp_replace(regexp_replace(public.unaccent('public.unaccent', 1ドル)
 , '[^a-zA-Z\d\s]', '', 'g')
 , '\s+', ' ', 'g'));
$func$;

PARALLEL SAFE to label compliance with parallel query plans - in Postgres 9.6 or later.

Almost, but not quite 100 % identical to what you had. Better IMHO:

Replace all accents and expands all ligatures with unaccent().
Remove all noise characters (everything except ASCII letters, digits and white space).
Fold whitespace to a single space (' ').
Trim leading and trailing spaces.

Detailed explanation and more related issues for an IMMUTABLE function with unaccent():

Does PostgreSQL support "accent insensitive" collations?

Postgres 9.5

The function is noticeably faster with fewer assignments (none, actually, in the updated version). Assignments are comparatively expensive in PL/pgSQL:

CREATE FUNCTION f_cie_easy(text)
 RETURNS text
 LANGUAGE plpgsql PARALLEL SAFE AS
$func$
BEGIN
 RETURN replace(replace(replace(replace(replace(replace(
 translate(1,ドル'ŠŽšžŸÝÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ!"#$%&()*+,-./:;<=>?@[\]^_`{|}~€‚ƒ„...†‡ˆ‰‹‘’""•–— ̃TM›¢£¤\¦§ ̈©a«¬® ̄°±23 ́μ¶· ×ばつÞð÷þÐ'
 ,'SZszYYAAAAAACEEEEIIIINOOOOOOUUUUaaaaaaceeeeiiiinoooooouuuuyy')
 ,'Œ','OE')
 ,'Æ','AE')
 ,'œ','oe')
 ,'æ','ae')
 ,'ß','ss')
 ,'''','');
END
$func$;

But I suspect you really want to remove all accents (diacritic signs). Postgres provides the function unaccent() with the additional module unaccent:

CREATE OR REPLACE FUNCTION f_cie_easy(text)
 RETURNS text
 LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
$func$
 SELECT translate(ext.unaccent(regdictionary 'ext.unaccent'
 , replace(replace(replace(replace(replace(
 1ドル
 ,'Œ','OE')
 ,'Æ','AE')
 ,'œ','oe')
 ,'æ','ae')
 ,'ß','ss')
 ), '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~€‚ƒ„...†‡ˆ‰‹‘’""•–— ̃TM›¢£¤\¦§ ̈©a«¬® ̄°±23 ́μ¶· ×ばつÞð÷þÐ''', '');
$func$;

Substantially faster, yet. And superior in every aspect to remove all diacritic signs.

In addition to unaccent():

Include an escaped ' ('') in translate() directly.
Make it a simple SQL function.
Make the function IMMUTABLE.

In Postgres 9.5 or older we need to expand ligatures like 'Œ' or 'ß' manually, since unaccent() always substitutes a single letter:

SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
E A e a S

That's briliant help, Erwin, thx so much.

W. Smets
– W. Smets

2016年08月02日 09:53:13 +00:00
Commented Aug 2, 2016 at 9:53
@W.Smets: It gets better, yet. Consider the updates.

Erwin Brandstetter
– Erwin Brandstetter

2016年08月02日 12:41:06 +00:00
Commented Aug 2, 2016 at 12:41

score 1 · Answer 2 · 2016-07-28 11:20:44Z

From what I understand, you could use translate() to achieve what you want.

A small demonstration:

SELECT translate('Some ţext with ひ inţereşt平ing chăracters', 'ţşăひ平', 'tsa');
 translate 
────────────────────────────────────────
 Some text with interesting characters

So, first you put the characters you want to replace, each matching (in the given order!) their replacement in the third parameter to the function. Then just list all characters that has to be removed, with no match in the replacement list.

For those characters that need a multi-character replacement, you can still use replace() in a similar fashion, but one call per character:

SELECT replace(replace('Æ small blæblæ', 'Æ', 'AE'), 'æ', 'ae');
 replace 
───────────────────
 AE small blaeblae

Right! That looks very promising. Especially if I could build up the second parameter of the replace function using code. In VB(A) you would have a static variable calculated once. Any suggestions as how to get PostgreSQL (using SQL Manager for PostgreSQL) to calculate a value once and reuse it as a variable in functions? Thanks a lot for the help!

W. Smets W. Smets 113 bronze badges · Answer 3 · 2016-07-31 16:55:37Z

FYI, reworked version of the function (incoming: text_to_clean text). Doesn't look nice, but works:

DECLARE
result text;
BEGIN
result=translate(text_to_clean,'ŠŽšžŸÝÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ!"#$%&()*+,-./:;<=>?@[\]^_`{|}~€‚ƒ„...†‡ˆ‰‹‘’""•–— ̃TM›¢£¤\¦§ ̈©a«¬® ̄°±23 ́μ¶· ×ばつÞð÷þÐ','SZszYYAAAAAACEEEEIIIINOOOOOOUUUUaaaaaaceeeeiiiinoooooouuuuyy');
result=replace(result,'Œ','OE');
result=replace(result,'Æ','AE');
result=replace(result,'œ','oe');
result=replace(result,'æ','ae');
result=replace(result,'ß','ss');
result=replace(result,'''','');
return result;
END;

Thanks again for the help.

Stack Exchange Network

How to speed up string cleanup function?

3 Answers 3

Postgres 9.6 or later

Postgres 9.5

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to speed up string cleanup function?

3 Answers 3

Postgres 9.6 or later

Postgres 9.5

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions