I have 2 tables that look like this:
Table A:
CREATE TEMP TABLE table_a (
Column_1 text,
ID_number int
);
INSERT INTO table_a VALUES
('foo,bar,baz', 123),
('qux,quux,quuz',456),
('corge,grault,garply',789),
('qux,bar,grault', 101);
Table B:
CREATE TEMP TABLE table_b (
Column_1 text,
Column_2 text,
ID_number int
);
INSERT INTO table_b VALUES
('foo','baz',null),
('qux','quzz',null),
('corge','garply',null);
I'm trying to copy values from the ID_number column of Table A into Table B, wherever the values in Column_1 and Column_2 of Table B can both be found in the same row of Column_1 in Table A.
This is the kind of thing I was thinking of:
UPDATE table_b AS B
SET id_number = A.id_number
FROM table_a AS A
WHERE A.column_1 LIKE B.column_1
AND A.column_1 LIKE B.column_2
... but obviously this doesn't work.
How can I translate this into a proper query?
Additional info
table_a.Column_1 contains UK addresses, for example:
'47 BOWERS PLACE, GREAT YARMOUTH, NORFOLK, NR20 4AN'
In table_b I have the first line of the address in Column_1 (so, '47 BOWERS PLACE') and the postcode ('NR20 4AN') in Column_2.
I thought it would be best to simplify things, but maybe the actual data has some relevance in this situation.
table_a has about 30 million addresses. table_b has around 60k rows.
Performance is relevant: the faster this runs the better, and it will likely be repeated in the future.
3 Answers
Assuming Postgres 9.6, performance is relevant, big tables, "words" composed of characters, no whitespace or punctuation, no stemming or stop words, no phrases, all columns NOT NULL.
Full Text search backed by an index should be among the fastest solutions:
UPDATE table_b b
SET id_number = a.id_number
FROM table_a a
WHERE to_tsvector('simple', a.column_1)
@@ plainto_tsquery('simple', concat_ws(' ', b.column_1, b.column_2))
AND b.id_number IS DISTINCT FROM a.id_number; -- prevent empty UPDATEs
With a matching expression index on a.column_1:
CREATE INDEX table_a_column_1_idx ON table_a USING GIN (to_tsvector('simple', column_1));
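Before running the UPDATE on 30 million rows, it may help to check the match expression on a single address. This is only a sketch using the sample address from the question; the column aliases addr, line_1 and postcode are made up for the illustration and are not part of the original schema.

```sql
-- Sanity check of the full-text match on the sample address from the question.
-- addr, line_1 and postcode are hypothetical aliases used only in this sketch.
SELECT to_tsvector('simple', addr)
       @@ plainto_tsquery('simple', concat_ws(' ', line_1, postcode)) AS is_match
FROM  (VALUES ('47 BOWERS PLACE, GREAT YARMOUTH, NORFOLK, NR20 4AN',
               '47 BOWERS PLACE',
               'NR20 4AN')) AS t(addr, line_1, postcode);
```

This should report a match, since every word of the first line and the postcode occurs in the full address. Note that plainto_tsquery only requires all words to be present; it does not check their order or adjacency, so an address containing the same words in a different arrangement would match as well.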
- Thank you for your answer @Erwin. It set me on the right path (I think!). I actually had a tsvector column already created, so used that instead of creating a new index. My query ended up looking like: UPDATE table_b b SET id_number = a.id_number FROM table_a a WHERE a.tsvector_column @@ plainto_tsquery ('simple', concat_ws(' ', b.column_1, b.column_1)); But this did leave me with some rows that didn't match up for any visible reason. Any ideas why that might have been? (Matt, Feb 1, 2017 at 16:04)
- @Matt. Many ideas. Depends on the questions and assumptions I mentioned. Plus, what exactly is in that tsvector column. My educated guess: your tsvector column might have been created with a different ts config (not 'simple'). Post a new question with exact information for the mismatch case. (Erwin Brandstetter, Feb 1, 2017 at 20:51)
- Thanks again Erwin, king of postgres! You were right on the ts config, it was set as 'english' not 'simple'. Changing this has increased the match rate to its highest possible level. Now I just have to fix some messy data... (Matt, Feb 2, 2017 at 10:38)
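To illustrate the config mismatch discussed in these comments, here is a small sketch. The address fragment 'THE OLD MILL' is a made-up example, not data from the question. The 'english' configuration drops stop words such as "the" and stems the remaining words when building the tsvector, while a query built with 'simple' still asks for every word as written.

```sql
-- 'THE OLD MILL' is a hypothetical value chosen because it contains a stop word.
SELECT to_tsvector('simple',  addr) @@ plainto_tsquery('simple', addr) AS simple_vs_simple,
       to_tsvector('english', addr) @@ plainto_tsquery('simple', addr) AS english_vs_simple
FROM  (VALUES ('THE OLD MILL')) AS t(addr);
```

The first column should be true and the second false: the 'english' tsvector no longer contains "the", so a 'simple' query that requires it cannot match. Keeping the same configuration on both sides of @@ avoids this kind of silent miss.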
The key here is that Column_1 represents three possible values for the JOIN. So what you want to use is string_to_array() (so long as those values are comma-separated and cannot themselves include a comma).
Run this query:
SELECT id_number, string_to_array(column_1, ',') AS column_1
FROM table_a;
id_number | column_1
-----------+-----------------------
123 | {foo,bar,baz}
456 | {qux,quux,quuz}
789 | {corge,grault,garply}
101 | {qux,bar,grault}
Now, we can run our UPDATE using = ANY():
UPDATE table_b
SET id_number = A.id_number
FROM (
SELECT id_number, string_to_array(column_1, ',') AS column_1
FROM table_a
) AS A
WHERE table_b.column_1 = ANY(A.column_1)
AND table_b.column_2 = ANY(A.column_1);
You can alternatively use <@ in the WHERE clause:
WHERE ARRAY[table_b.column_1, table_b.column_2] <@ A.column_1;
That even makes it a bit more compact:
UPDATE table_b
SET id_number = A.id_number
FROM table_a AS A
WHERE ARRAY[table_b.column_1, table_b.column_2] <@ string_to_array(A.column_1, ',');
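One caveat when moving from the simplified data to the real addresses in the question: those use ', ' (comma plus space) as the separator, so splitting on ',' alone leaves a leading space on every element after the first. A small sketch of the difference, using the sample address from the question (addr is a made-up alias):

```sql
-- addr is a hypothetical alias; the address is the sample from the question.
SELECT ARRAY['47 BOWERS PLACE', 'NR20 4AN']
         <@ string_to_array(addr, ',')             AS split_on_comma,
       ARRAY['47 BOWERS PLACE', 'NR20 4AN']
         <@ regexp_split_to_array(addr, '\s*,\s*') AS split_and_trimmed
FROM  (VALUES ('47 BOWERS PLACE, GREAT YARMOUTH, NORFOLK, NR20 4AN')) AS t(addr);
```

The first column is false because the array contains ' NR20 4AN' with a leading space; the second is true because the regular expression also swallows whitespace around each comma. If the separator is consistently ', ', then string_to_array(addr, ', ') would be a simpler option.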
- I came to this answer whilst trying to solve the same problem. Whereas the example has table_a.col1 as a string, my data (also UK addresses) is already split into a series of columns in table_a. ANY needs an array, as I understand it. Should I convert the columns to an array, or is there a better way? (mapping dom, Jul 17, 2017 at 10:40)
- Converting columns into an array for search is usually not a good idea, as the array won't use the underlying column's indexes. (Evan Carroll, Jul 17, 2017 at 15:24)
Try this:
update table_b
set id_number = (select id_number
from table_a
where table_a.Column_1 like '%' || table_b.Column_1 || '%'
AND table_a.Column_1 like '%' || table_b.Column_2 || '%'
limit 1)
;
Another solution would be to convert Column_1 into an array, but this one is clearer.
Notice I'm limiting the search to 1 record, just in case the text appears in more than one column_1 of Table_A.
As Evan Carroll has pointed out in the comments, note that this code updates every row of the table: rows in table_b without a match have their id_number set to NULL.
Check it here: http://rextester.com/MUL4593
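If touching every row is a concern, one possible variation is to repeat the condition in a WHERE EXISTS so that only rows with a match are updated. The sketch below requires both values to match, as the question describes, and runs on small scratch tables; demo_a and demo_b are made-up names so it does not interfere with the tables above.

```sql
-- Scratch tables (hypothetical names) so the sketch runs on its own.
CREATE TEMP TABLE demo_a (column_1 text, id_number int);
CREATE TEMP TABLE demo_b (column_1 text, column_2 text, id_number int);

INSERT INTO demo_a VALUES ('foo,bar,baz', 123), ('qux,quux,quuz', 456);
INSERT INTO demo_b VALUES
  ('foo', 'baz',  NULL),  -- both values occur in one demo_a row
  ('qux', 'quzz', 999);   -- no full match; the existing 999 should survive

UPDATE demo_b b
SET    id_number = (
         SELECT a.id_number
         FROM   demo_a a
         WHERE  a.column_1 LIKE '%' || b.column_1 || '%'
         AND    a.column_1 LIKE '%' || b.column_2 || '%'
         LIMIT  1)
WHERE  EXISTS (           -- skip rows that have no match at all
         SELECT 1
         FROM   demo_a a
         WHERE  a.column_1 LIKE '%' || b.column_1 || '%'
         AND    a.column_1 LIKE '%' || b.column_2 || '%');

SELECT * FROM demo_b;
```

Only the 'foo' row is updated here; the 'qux' row keeps its previous value because 'quzz' does not occur anywhere in demo_a. Bear in mind that LIKE treats any % or _ inside the table_b values as wildcards, and that a pattern with a leading '%' cannot use an ordinary B-tree index, so this approach may be slow against 30 million rows.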
- Writing the DDL in this case is always a solid contribution, I just wish you'd put it in the question. Anyway, have an upvote. (Evan Carroll, Jan 27, 2017 at 18:37)
- It may also be useful to note that this query updates the whole table. (Evan Carroll, Jan 27, 2017 at 20:05)
- Hi @EvanCarroll, I really appreciate your advice. I usually write the whole DDL just to avoid typos and ensure the correct answer. (McNets, Jan 27, 2017 at 20:29)
- You do a great job at it too. I would write them as CREATE TEMP TABLE AS instead, but that's a matter of style. If I can provide two hints to make your life easier, because we both do this very often: if you use an editor like Vim, learn visual-block edit and line edit mode, and also use Tim Pope's vim-surround plugin. It reduces the workload to less than a minute. ;) (Evan Carroll, Jan 27, 2017 at 20:31)