I have a PostgreSQL server with an existing table which has two non-unique string columns (fixed width, though the size can vary), such as this:
| ID_STRING_A | ID_STRING_B |
| 'AAAA' | 'BBBB' |
| 'BBBB' | 'CCCC' |
| 'AAAA' | 'DDDD' |
Now I want to compute an integer representation for the elements of both columns and store it in additional columns, so that the same string maps to the same integer in either column. The result should look like this:
| ID_STRING_A | ID_STRING_B | ID_INT_A | ID_INT_B |
| 'AAAA' | 'BBBB' | 1 | 2 |
| 'BBBB' | 'CCCC' | 2 | 3 |
| 'AAAA' | 'DDDD' | 1 | 4 |
My first approach, based on the answers, is shown below. Unfortunately, the update part seems to be highly inefficient, although there are indices on ID_STRING_A/B. While the query itself is done in minutes, the update part does not seem to end. Here's the code:
ALTER TABLE mytable ADD COLUMN ID_INT_B integer;
ALTER TABLE mytable ADD COLUMN ID_INT_A integer;
UPDATE mytable SET ID_INT_A = g.ID_INT_A , ID_INT_B = g.ID_INT_B FROM
(
WITH T( n , s ) AS
(
SELECT ROW_NUMBER() OVER ( ORDER BY s ) , s
FROM
(
SELECT ID_STRING_A FROM mytable
UNION
SELECT ID_STRING_B FROM mytable
) AS X( s )
)
SELECT m.ctid AS id_ , m.ID_STRING_A AS ID_STRING_A , m.ID_STRING_B AS ID_STRING_B , T1.n AS ID_INT_A , T2.n AS ID_INT_B FROM mytable AS m
JOIN T AS T1 ON m.ID_STRING_A = T1.s
JOIN T AS T2 ON m.ID_STRING_B = T2.s
) AS g
WHERE mytable.ctid = g.id_
- What amount of rows are we talking about? – Lennart - Slava Ukraini, Jul 16, 2019 at 7:55
- @Lennart, roughly 13mins for 5.5M rows. – nali, Jul 16, 2019 at 7:58
- What are the columns ctid and g.id_ in your update query? – Lennart - Slava Ukraini, Jul 16, 2019 at 10:42
- @Lennart: Since this table does not have any primary key I'm using the postgresql internal ctid field. – nali, Jul 16, 2019 at 10:54
- I did a small test, and it appears to be working. I don't understand how the order between mytable and g is preserved though, what mechanism guarantees that n1 > n2 iff s1 > s2? – Lennart - Slava Ukraini, Jul 16, 2019 at 11:14
2 Answers
I guess you can use the ASCII function:
SELECT ID_STRING_A,ID_STRING_B
, ASCII(ID_STRING_A) - 64 AS ID_INT_A
, ASCII(ID_STRING_B) - 64 AS ID_INT_B
FROM ...
Perhaps the intention is clearer using:
, ASCII(ID_STRING_A) - ASCII('A') + 1 AS ID_INT_A
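To see what the ASCII approach can and cannot distinguish, here is a small sketch using literal strings rather than the real table. The column aliases are made up for illustration. ASCII only looks at the first character of its argument, so strings that share a first character get the same number.

```sql
-- ASCII() returns the code of the first character only.
SELECT ASCII('AAAA') - ASCII('A') + 1 AS aaaa_int   -- 1
     , ASCII('BBBB') - ASCII('A') + 1 AS bbbb_int   -- 2
     , ASCII('AB')   - ASCII('A') + 1 AS ab_int     -- 1
     , ASCII('AA')   - ASCII('A') + 1 AS aa_int;    -- 1, same as 'AB'
```

This works for the sample data only because each string starts with a different letter.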
EDIT: since the question was changed, something like this is possible:
WITH T (n, s) as (
SELECT row_number() over (order by s), s
FROM (
SELECT ID_STRING_A FROM mytable
UNION
SELECT ID_STRING_B FROM mytable
) as X (s)
)
SELECT m.ID_STRING_A, m.ID_STRING_B, T1.n, T2.n
FROM mytable as m
JOIN T as T1
ON m.ID_STRING_A = T1.s
JOIN T as T2
ON m.ID_STRING_B = T2.s
EDIT: updating the table
I have a gut feeling that this can be done in a simpler way, but I cross joined the cte with itself and filtered with WHERE to update both columns at once:
ALTER TABLE mytable
ADD ID_INT_A INT;
ALTER TABLE mytable
ADD ID_INT_B INT;
WITH cte (n, s) as (
SELECT row_number() over (order by s), s
FROM (
SELECT ID_STRING_A FROM mytable
UNION
SELECT ID_STRING_B FROM mytable
) as X (s)
), cte2 (n1,s1,n2,s2) as (
SELECT c1.n, c1.s, c2.n, c2.s
FROM cte c1
CROSS JOIN cte c2
)
UPDATE mytable
SET ID_INT_A = cte2.n1
, ID_INT_B = cte2.n2
FROM cte2
WHERE mytable.ID_STRING_A = cte2.s1
AND mytable.ID_STRING_B = cte2.s2
;
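It should be noted that this is a one-time operation. If you decide to add a new string such as 'AABB' later on, the enumeration will be wrong, because the numbers follow the sort order of the strings that existed when the query ran.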
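One possibly simpler variant, offered as a sketch rather than a tested replacement: skip cte2 and join cte twice directly in the FROM clause of the UPDATE, once per column. This avoids spelling out every (s1, s2) pair up front, which on PostgreSQL versions that always materialize CTEs could mean building a quadratic intermediate result. The aliases c1 and c2 are made up for this example. It assumes the two integer columns were already added as above and that neither string column contains NULLs (rows with a NULL would simply stay unset).

```sql
WITH cte (n, s) AS (
    -- Number every distinct string from either column in sort order.
    SELECT row_number() OVER (ORDER BY s), s
    FROM (
        SELECT ID_STRING_A FROM mytable
        UNION
        SELECT ID_STRING_B FROM mytable
    ) AS X (s)
)
UPDATE mytable
SET ID_INT_A = c1.n
  , ID_INT_B = c2.n
FROM cte AS c1      -- lookup for ID_STRING_A
   , cte AS c2      -- lookup for ID_STRING_B
WHERE mytable.ID_STRING_A = c1.s
  AND mytable.ID_STRING_B = c2.s
;
```

Each row of mytable matches exactly one c1 row and one c2 row, because UNION makes s unique within cte.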
- Will it work if the strings are like 'abcd-efgh-ijkl-mnop'? – nali, Jul 15, 2019 at 13:45
- No, ASCII will return the ASCII value for the first character in the string. How will you encode concatenated characters? I assume AB is different from BA? – Lennart - Slava Ukraini, Jul 15, 2019 at 13:47
- Yes, AB is different from BA. I was expecting AB = 1, BA = 2. I've asked a similar question for one column here: dba.stackexchange.com/questions/242718/…. It turned out I need both columns though :/ – nali, Jul 15, 2019 at 13:49
- Thanks, this looks really neat and does exactly what I've expected. Is it possible to modify the code to update both columns id_int_a, id_int_b directly with the same query? – nali, Jul 16, 2019 at 5:57
- I appreciate your help. The string columns are arbitrary uuid-strings with fixed size (36 characters). – nali, Jul 16, 2019 at 8:02
CREATE TEMP TABLE map (
id serial PRIMARY KEY,
str text NOT NULL
);
INSERT INTO map (str)
SELECT DISTINCT id_string_a
FROM mytab;
ALTER TABLE mytab ADD id_int_a integer;
UPDATE mytab
SET id_int_a = map.id
FROM map
WHERE mytab.id_string_a = map.str;
DROP TABLE map;
id_string_b is left as an exercise to the reader.
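For what it's worth, here is one way the exercise might be completed so that a string gets the same integer in both columns, as the question asks. It is a sketch, not the answerer's code: it fills a single map from the union of both columns and then looks each column up in that map. The aliases u, ma, and mb are made up for this example, the UNIQUE constraint and the ANALYZE step are additions, and ADD COLUMN IF NOT EXISTS assumes PostgreSQL 9.6 or later. The ids come out in no particular order; they are only consistent across the two columns.

```sql
CREATE TEMP TABLE map (
   id  serial PRIMARY KEY,
   str text NOT NULL UNIQUE   -- the unique index also serves the lookups below
);

-- One shared mapping: every distinct string from either column.
INSERT INTO map (str)
SELECT s
FROM (
   SELECT id_string_a FROM mytab
   UNION
   SELECT id_string_b FROM mytab
) AS u (s)
WHERE s IS NOT NULL;

ANALYZE map;   -- give the planner row estimates for the temp table

ALTER TABLE mytab ADD COLUMN IF NOT EXISTS id_int_a integer;
ALTER TABLE mytab ADD COLUMN IF NOT EXISTS id_int_b integer;

-- Fill both integer columns in a single pass over mytab.
UPDATE mytab
SET id_int_a = ma.id,
    id_int_b = mb.id
FROM map AS ma, map AS mb
WHERE mytab.id_string_a = ma.str
  AND mytab.id_string_b = mb.str;

DROP TABLE map;
```

Updating both columns in one statement rewrites each row once instead of twice, which may matter on a table with millions of rows.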