Create integer id columns from existing string columns (integer coding?)

Question 1

I have a PostgreSQL server with an existing table that has two fixed-width, non-unique string columns (variable size), such as this:

| ID_STRING_A | ID_STRING_B |
|-------------|-------------|
| 'AAAA'      | 'BBBB'      |
| 'BBBB'      | 'CCCC'      |
| 'AAAA'      | 'DDDD'      |

Now I want to compute an integer representation for the elements of both columns and store it in additional columns. The same string should get the same integer regardless of which column it appears in. The result should look like this:

| ID_STRING_A | ID_STRING_B | ID_INT_A | ID_INT_B |
|-------------|-------------|----------|----------|
| 'AAAA'      | 'BBBB'      | 1        | 2        |
| 'BBBB'      | 'CCCC'      | 2        | 3        |
| 'AAAA'      | 'DDDD'      | 1        | 4        |

My first approach, based on the answers, is shown below. Unfortunately, the update part seems to be highly inefficient, although there are indexes on ID_STRING_A/B. While the query itself finishes in minutes, the update never seems to end. Here's the code:

ALTER TABLE mytable ADD COLUMN ID_INT_B integer;
ALTER TABLE mytable ADD COLUMN ID_INT_A integer;

UPDATE mytable
SET ID_INT_A = g.ID_INT_A,
    ID_INT_B = g.ID_INT_B
FROM (
    WITH T(n, s) AS (
        SELECT ROW_NUMBER() OVER (ORDER BY s), s
        FROM (
            SELECT ID_STRING_A FROM mytable
            UNION
            SELECT ID_STRING_B FROM mytable
        ) AS X(s)
    )
    SELECT m.ctid AS id_,
           m.ID_STRING_A AS ID_STRING_A,
           m.ID_STRING_B AS ID_STRING_B,
           T1.n AS ID_INT_A,
           T2.n AS ID_INT_B
    FROM mytable AS m
    JOIN T AS T1 ON m.ID_STRING_A = T1.s
    JOIN T AS T2 ON m.ID_STRING_B = T2.s
) AS g
WHERE mytable.ctid = g.id_;

Question 2

How many rows are we talking about?

Question 3

@Lennart, roughly 13 minutes for 5.5M rows.

Question 4

What are the columns ctid and g.id_ in your update query?

Question 5

@Lennart: Since this table does not have a primary key, I'm using PostgreSQL's internal ctid column.
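For readers unfamiliar with `ctid`: it is a system column holding the physical location (block number, tuple index) of a row version, so it can change when a row is updated or the table is rewritten. The sketch below runs on a throwaway copy of the sample data inside a transaction that is rolled back. It shows the effect and one possible alternative, a generated surrogate key; the column name `row_id` is an assumption made for illustration only.

```sql
BEGIN;

-- Throwaway copy of the sample data; the temp table shadows any real
-- "mytable" only inside this transaction.
CREATE TEMP TABLE mytable (id_string_a text, id_string_b text) ON COMMIT DROP;
INSERT INTO mytable VALUES ('AAAA', 'BBBB'), ('BBBB', 'CCCC'), ('AAAA', 'DDDD');

-- Physical location of each row version.
SELECT ctid, id_string_a, id_string_b FROM mytable;

-- Even a no-op update writes new row versions ...
UPDATE mytable SET id_string_b = id_string_b;

-- ... so the ctid values here typically differ from the first listing.
SELECT ctid, id_string_a, id_string_b FROM mytable;

-- Possible alternative: a stable surrogate key (row_id is a hypothetical name).
ALTER TABLE mytable ADD COLUMN row_id bigserial PRIMARY KEY;
SELECT row_id, id_string_a, id_string_b FROM mytable ORDER BY row_id;

ROLLBACK;
```

Matching on `ctid` within a single statement, as the update in the question does, sees one consistent snapshot; the concern is only about storing or reusing `ctid` values across statements.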

Question 6

I did a small test, and it appears to be working. I don't understand how the order between mytable and g is preserved, though. What mechanism guarantees that n1 > n2 iff s1 > s2?
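On the ordering question: the numbers come from `row_number() OVER (ORDER BY s)` applied to the de-duplicated strings (`UNION` removes duplicates), so `n` follows the sort order of `s` under the database's collation. A minimal, self-contained way to check this on the sample strings is sketched below; it looks for any pair where the two orderings disagree.

```sql
-- Sample strings from the question, numbered the same way as in the thread.
WITH X(s) AS (
    VALUES ('AAAA'), ('BBBB'), ('CCCC'), ('DDDD')
),
T(n, s) AS (
    SELECT row_number() OVER (ORDER BY s), s
    FROM X
)
-- Report every pair whose string order and number order disagree.
SELECT t1.s AS s1, t1.n AS n1, t2.s AS s2, t2.n AS n2
FROM T AS t1
JOIN T AS t2
  ON (t1.s > t2.s) <> (t1.n > t2.n);
-- An empty result means n1 > n2 exactly when s1 > s2.
```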

Question 7

I guess you can use the ASCII function:

SELECT ID_STRING_A, ID_STRING_B
     , ASCII(ID_STRING_A) - 64 AS ID_INT_A
     , ASCII(ID_STRING_B) - 64 AS ID_INT_B
FROM ...

Perhaps the intention is clearer using:

     , ASCII(ID_STRING_A) - ASCII('A') + 1 AS ID_INT_A

EDIT: since the question was changed, something like this is possible:

WITH T (n, s) AS (
    SELECT row_number() OVER (ORDER BY s), s
    FROM (
        SELECT ID_STRING_A FROM mytable
        UNION
        SELECT ID_STRING_B FROM mytable
    ) AS X (s)
)
SELECT m.ID_STRING_A, m.ID_STRING_B, T1.n AS ID_INT_A, T2.n AS ID_INT_B
FROM mytable AS m
JOIN T AS T1
    ON m.ID_STRING_A = T1.s
JOIN T AS T2
    ON m.ID_STRING_B = T2.s;

EDIT: updating the table

I have a gut feeling that this can be done in a simpler way, but I cross joined the cte with itself and filtered with WHERE to update both columns at once:

ALTER TABLE mytable
    ADD ID_INT_A INT;
ALTER TABLE mytable
    ADD ID_INT_B INT;

WITH cte (n, s) AS (
    SELECT row_number() OVER (ORDER BY s), s
    FROM (
        SELECT ID_STRING_A FROM mytable
        UNION
        SELECT ID_STRING_B FROM mytable
    ) AS X (s)
), cte2 (n1, s1, n2, s2) AS (
    SELECT c1.n, c1.s, c2.n, c2.s
    FROM cte c1
    CROSS JOIN cte c2
)
UPDATE mytable
SET ID_INT_A = cte2.n1
  , ID_INT_B = cte2.n2
FROM cte2
WHERE mytable.ID_STRING_A = cte2.s1
  AND mytable.ID_STRING_B = cte2.s2;

Note that this is a one-time operation. If you later add a string such as 'AABB', the enumeration will be wrong: it sorts between 'AAAA' and 'BBBB', so rerunning the numbering would shift every id after it.
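One possible way around the one-time limitation is to keep the string-to-integer mapping in its own table and only ever append to it. The sketch below is an illustration under assumptions, not part of the answer: the table name `string_ids` is hypothetical, and the approach gives up the property that ids follow the sort order of the strings. It uses a temp table and a rolled-back transaction so that it can be tried safely; a real mapping table would be permanent.

```sql
BEGIN;

-- Hypothetical mapping table: one generated id per distinct string.
CREATE TEMP TABLE string_ids (
    id  serial PRIMARY KEY,
    str text NOT NULL UNIQUE
) ON COMMIT DROP;

-- Initial load with the strings from the question.
INSERT INTO string_ids (str)
VALUES ('AAAA'), ('BBBB'), ('CCCC'), ('DDDD');

-- Later: 'AABB' is new and receives the next free id; 'AAAA' already
-- exists and keeps its id because the conflicting insert is skipped.
INSERT INTO string_ids (str)
VALUES ('AABB'), ('AAAA')
ON CONFLICT (str) DO NOTHING;

-- Existing ids are unchanged; only the new string was appended.
SELECT id, str FROM string_ids ORDER BY id;

ROLLBACK;
```

Skipped inserts may still consume sequence values, so the ids can contain gaps.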

Question 8

Will it work if the strings are like 'abcd-efgh-ijkl-mnop'?

Question 9

No, ASCII will return the ASCII value for the first character in the string. How will you encode concatenated characters? I assume AB is different from BA?
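A quick illustration of that limitation, using only the built-in `ascii` function: any two strings that share their first character collapse to the same number, so the result cannot serve as an id for multi-character strings.

```sql
SELECT ascii('AB')                      AS ab,     -- 65, code of 'A'
       ascii('AA')                      AS aa,     -- 65 as well: collides with 'AB'
       ascii('BA')                      AS ba,     -- 66, code of 'B'
       ascii('AAAA') - ascii('A') + 1   AS aaaa,   -- 1
       ascii('abcd-efgh-ijkl-mnop')     AS dashed; -- 97, code of 'a'
```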

Question 10

Yes, AB is different from BA. I was expecting AB = 1, BA = 2. I've asked a similar question for one column here: dba.stackexchange.com/questions/242718/…. It turned out I need both columns though :/

Question 11

Thanks, this looks really neat and does exactly what I've expected. Is it possible to modify the code to update both columns id_int_a, id_int_b directly with the same query?
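As a sketch of one way to set both columns in a single statement, the numbering CTE can be joined to the table twice, once per string column. This avoids building every pair of distinct strings, which may matter because PostgreSQL materializes a CTE that is referenced more than once, and before version 12 it materialized every CTE. The sample data and the rolled-back transaction are only there to make the example runnable on its own.

```sql
BEGIN;

-- Throwaway copy of the sample data from the question.
CREATE TEMP TABLE mytable (id_string_a text, id_string_b text) ON COMMIT DROP;
INSERT INTO mytable VALUES ('AAAA', 'BBBB'), ('BBBB', 'CCCC'), ('AAAA', 'DDDD');

ALTER TABLE mytable ADD id_int_a integer, ADD id_int_b integer;

WITH cte (n, s) AS (
    -- One number per distinct string across both columns.
    SELECT row_number() OVER (ORDER BY s), s
    FROM (
        SELECT id_string_a FROM mytable
        UNION
        SELECT id_string_b FROM mytable
    ) AS x (s)
)
UPDATE mytable AS m
SET id_int_a = c1.n,
    id_int_b = c2.n
FROM cte AS c1, cte AS c2        -- same numbering, looked up once per column
WHERE m.id_string_a = c1.s
  AND m.id_string_b = c2.s;

-- Should reproduce the result table from the question.
SELECT * FROM mytable ORDER BY id_int_a, id_int_b;

ROLLBACK;
```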

Question 12

I appreciate your help. The string columns are arbitrary UUID strings with a fixed size (36 characters).

Question 13

-- Temporary lookup table: one generated id per distinct string
CREATE TEMP TABLE map (
    id  serial PRIMARY KEY,
    str text NOT NULL
);

INSERT INTO map (str)
SELECT DISTINCT id_string_a
FROM mytab;

ALTER TABLE mytab ADD id_int_a integer;

-- Copy the generated id onto every matching row
UPDATE mytab
SET id_int_a = map.id
FROM map
WHERE mytab.id_string_a = map.str;

DROP TABLE map;

id_string_b is left as an exercise to the reader.
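One possible way to do that exercise while keeping a single numbering shared by both columns is sketched below. It is an assumption about what the answer intends rather than the author's own code: the lookup table is filled from the union of both columns, and the update joins it twice. The sample data and the rolled-back transaction exist only so the example can be run on its own.

```sql
BEGIN;

-- Throwaway copy of the sample data from the question.
CREATE TEMP TABLE mytab (id_string_a text, id_string_b text) ON COMMIT DROP;
INSERT INTO mytab VALUES ('AAAA', 'BBBB'), ('BBBB', 'CCCC'), ('AAAA', 'DDDD');

-- Lookup table covering the strings of BOTH columns.
CREATE TEMP TABLE map (
    id  serial PRIMARY KEY,
    str text NOT NULL UNIQUE      -- the unique index also speeds up the join
) ON COMMIT DROP;

INSERT INTO map (str)
SELECT id_string_a FROM mytab
UNION                             -- UNION removes duplicates across columns
SELECT id_string_b FROM mytab;

ALTER TABLE mytab ADD id_int_a integer, ADD id_int_b integer;

-- Join the lookup table once per string column.
UPDATE mytab
SET id_int_a = ma.id,
    id_int_b = mb.id
FROM map AS ma, map AS mb
WHERE mytab.id_string_a = ma.str
  AND mytab.id_string_b = mb.str;

SELECT * FROM mytab ORDER BY id_string_a, id_string_b;

ROLLBACK;
```

Unlike the `row_number()` approach, the ids here are not guaranteed to follow the sort order of the strings, because the order in which `UNION` returns rows is unspecified.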

score 2 · Accepted Answer · 2019-07-15 13:43:00Z

