UPDATE with join condition on matching words in columns of another table

Question 1

I have 2 tables that look like this:

Table A:

CREATE TEMP TABLE table_a (
 Column_1 text,
 ID_number int
);
INSERT INTO table_a VALUES
 ('foo,bar,baz', 123),
 ('qux,quux,quuz',456),
 ('corge,grault,garply',789),
 ('qux,bar,grault', 101);

Table B:

CREATE TEMP TABLE table_b (
 Column_1 text,
 Column_2 text,
 ID_number int
);
INSERT INTO table_b VALUES
 ('foo','baz',null),
 ('qux','quzz',null),
 ('corge','garply',null);

I'm trying to copy values from the ID_number column in Table A into Table B, wherever the values in both Column_1 and Column_2 of a Table B row can be found in Column_1 of the same Table A row.

This is the kind of thing I was thinking of:

UPDATE table_b AS B 
SET id_number = A.id_number 
FROM table_a AS A 
WHERE A.column_1 LIKE B.column_1
 AND A.column_1 LIKE B.column_2

... but obviously this doesn't work.

How can I translate this into a proper query?

Additional info

table_a.Column_1 contains UK addresses, for example:

'47 BOWERS PLACE, GREAT YARMOUTH, NORFOLK, NR20 4AN'

In table_b I have the first line of the address in Column_1 (so, '47 BOWERS PLACE') and the postcode ('NR20 4AN') in Column_2.

I thought it would be best to simplify things, but maybe the actual data has some relevance in this situation.

table_a has about 30 million addresses. table_b has around 60k rows.

Performance matters: the faster this runs the better, and it will likely be repeated in the future.

Question 2

Assuming Postgres 9.6, performance is relevant, big tables, "words" composed of characters, no whitespace or punctuation, no stemming or stop words, no phrases, all columns NOT NULL.

Full text search backed by an index should be among the fastest solutions:

UPDATE table_b b
SET id_number = a.id_number 
FROM table_a a
WHERE to_tsvector('simple', a.column_1)
 @@ plainto_tsquery('simple', concat_ws(' ', b.column_1, b.column_2))
AND b.id_number IS DISTINCT FROM a.id_number; -- prevent empty UPDATEs

With a matching expression index on a.column_1:

CREATE INDEX table_a_column_1_idx ON table_a USING GIN (to_tsvector('simple', column_1));
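
As a quick sanity check, the match predicate can be tried on literal values from the question before running the full UPDATE. This is only a sketch, assuming the 'simple' configuration and the default parser, which treats the commas as separators; the column alias both_words_present is made up for illustration.

```sql
-- Sketch only: evaluates the predicate on literal sample values from the question.
SELECT to_tsvector('simple', 'foo,bar,baz')
       @@ plainto_tsquery('simple', concat_ws(' ', 'foo', 'baz'))
       AS both_words_present;   -- true: 'foo' and 'baz' are both lexemes of the row

SELECT to_tsvector('simple', 'qux,quux,quuz')
       @@ plainto_tsquery('simple', concat_ws(' ', 'qux', 'quzz'))
       AS both_words_present;   -- false: 'quzz' does not occur in the row

-- Inspect the plan to see whether the expression index is considered.
-- On tiny tables the planner may still prefer a sequential scan.
EXPLAIN
SELECT a.id_number
FROM   table_a a
WHERE  to_tsvector('simple', a.column_1)
       @@ plainto_tsquery('simple', 'foo baz');
```

Note that plainto_tsquery() combines the words with AND and ignores their order, so this is a "all words present" test, not a phrase match.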

Question 3

Thank you for your answer @Erwin. It set me on the right path (I think!). I actually had a tsvector column already created, so used that instead of creating a new index. My query ended up looking like: UPDATE table_b b SET id_number = a.id_number FROM table_a a WHERE a.tsvector_column @@ plainto_tsquery ('simple', concat_ws(' ', b.column_1, b.column_1)); - But this did leave me with some rows that didn't match up for any visible reason. Any ideas why that might have been?

Question 4

@Matt. Many ideas. Depends on the questions and assumptions I mentioned. Plus, what exactly is in that tsvector column. My educated guess: Your tsvector column might have been created with a different ts config (not 'simple'). Post a new question with exact information for the mismatch case.

Question 5

Thanks again Erwin, king of postgres! You were right on the ts config, it was set as 'english' not 'simple'. Changing this has increased the match rate to its highest possible level. Now I just have to fix some messy data...
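
The effect of a mismatched text search configuration can be reproduced with a small example. This is a sketch with made-up sample text (not the asker's data): 'english' removes stop words and stems the remaining words, while 'simple' only lowercases them, so a query built with one configuration may not match a tsvector built with the other.

```sql
-- Same input text, two configurations:
SELECT to_tsvector('simple',  'the quick foxes');  -- 'foxes':3 'quick':2 'the':1
SELECT to_tsvector('english', 'the quick foxes');  -- 'fox':3 'quick':2

-- A 'simple' query against an 'english' vector misses, because the
-- vector stores the stem 'fox' while the query looks for 'foxes':
SELECT to_tsvector('english', 'the quick foxes')
       @@ plainto_tsquery('simple', 'foxes') AS matches;  -- false

-- With the same configuration on both sides the row matches:
SELECT to_tsvector('simple', 'the quick foxes')
       @@ plainto_tsquery('simple', 'foxes') AS matches;  -- true
```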

Question 6

The key here is that Column_1 represents three possible values for the JOIN. So what you want to use is string_to_array() (so long as those values are comma-separated and cannot themselves include a comma).

Run this query:

SELECT id_number, string_to_array(column_1, ',') AS column_1
FROM table_a;

 id_number |       column_1
-----------+-----------------------
       123 | {foo,bar,baz}
       456 | {qux,quux,quuz}
       789 | {corge,grault,garply}
       101 | {qux,bar,grault}

Now we can run our UPDATE using = ANY():

UPDATE table_b
SET id_number = A.id_number
FROM (
 SELECT id_number, string_to_array(column_1, ',') AS column_1
 FROM table_a
) AS A
WHERE table_b.column_1 = ANY(A.column_1)
 AND table_b.column_2 = ANY(A.column_1);

Alternatively, you can use the array "is contained by" operator, <@:

WHERE ARRAY[table_b.column_1, table_b.column_2] <@ A.column_1;

That even makes it a bit more compact:

UPDATE table_b
 SET id_number = A.id_number
FROM table_a AS A 
 WHERE ARRAY[table_b.column_1, table_b.column_2] <@ string_to_array(A.column_1, ',');
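
With 30 million rows in table_a, the array approach would probably need index support as well. A minimal sketch, assuming the values are separated by a bare comma with no surrounding spaces (real addresses separated by ', ' would need trimming first); the index name table_a_column_1_arr_idx is made up, and the predicate is written with @> so that the indexed expression sits on the left-hand side:

```sql
-- Sketch: GIN expression index on the split array.
CREATE INDEX table_a_column_1_arr_idx ON table_a
USING gin (string_to_array(column_1, ','));

-- Same logic as the <@ variant above, written with the commutator @>.
UPDATE table_b b
SET    id_number = a.id_number
FROM   table_a a
WHERE  string_to_array(a.column_1, ',') @> ARRAY[b.column_1, b.column_2]
AND    b.id_number IS DISTINCT FROM a.id_number;  -- skip rows that already hold the value

-- Inspect the result.
SELECT * FROM table_b ORDER BY column_1;
```

On the sample data from the question, ('foo','baz') should end up with 123 and ('corge','garply') with 789, while ('qux','quzz') stays NULL because no row of table_a contains 'quzz'.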

Question 7

I came to this answer while trying to solve the same problem. Whereas the example has table_a.col1 as a single string, my data (also UK addresses) is already split into a series of columns in table_a. As I understand it, ANY needs an array. Should I convert the columns to an array, or is there a better way?

Question 8

Converting columns into an array for a search is usually not a good idea, because the array built on the fly won't use the indexes on the underlying columns.
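
When the address parts already live in separate columns, a plain equality join on those columns is likely the simpler route, since it can use an ordinary B-tree index. The sketch below is hypothetical throughout: table_a_split, its column names, and the index name are assumptions for illustration and do not come from the thread.

```sql
-- Hypothetical layout: address already split into columns.
CREATE TEMP TABLE table_a_split (
  address_line_1 text,
  town           text,
  postcode       text,
  id_number      int
);

-- Multicolumn B-tree index covering both join columns.
CREATE INDEX table_a_split_postcode_line1_idx
  ON table_a_split (postcode, address_line_1);

-- Equality join instead of array or pattern matching.
UPDATE table_b b
SET    id_number = a.id_number
FROM   table_a_split a
WHERE  a.postcode       = b.column_2
AND    a.address_line_1 = b.column_1
AND    b.id_number IS DISTINCT FROM a.id_number;  -- skip rows that already hold the value
```

This only works if the values in both tables are spelled identically (case, spacing), so some normalization may be needed first.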

Question 9

Try this:

update table_b
set id_number = (select id_number
 from table_a
 where table_a.Column_1 like '%' || table_b.Column_1 || '%'
 and table_a.Column_1 like '%' || table_b.Column_2 || '%' -- both values must appear in the same row
 limit 1)
;

Another solution would be to convert Column_1 into an array, but this version is clearer.

Note that I limit the subquery to 1 record, just in case the values appear in more than one row of table_a.

As Evan Carroll has pointed out in the comments, this statement updates every row of table_b: rows without a match are set to NULL.

Check it here: http://rextester.com/MUL4593
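
If touching every row is not wanted, the statement can be restricted to rows that actually have a match. A minimal sketch, assuming both values must appear in the same row of table_a, and adding an ORDER BY (my addition) so that LIMIT 1 picks a predictable row:

```sql
-- Sketch: only update rows of table_b that have at least one match.
UPDATE table_b b
SET    id_number = (
         SELECT a.id_number
         FROM   table_a a
         WHERE  a.column_1 LIKE '%' || b.column_1 || '%'
         AND    a.column_1 LIKE '%' || b.column_2 || '%'
         ORDER  BY a.id_number   -- make the choice deterministic
         LIMIT  1)
WHERE  EXISTS (
         SELECT 1
         FROM   table_a a
         WHERE  a.column_1 LIKE '%' || b.column_1 || '%'
         AND    a.column_1 LIKE '%' || b.column_2 || '%');
```

Keep in mind that LIKE '%...%' matches substrings rather than whole words ('bar' would also match 'barley'), and that any % or _ characters in table_b are interpreted as wildcards.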

Question 10

Writing out the DDL for the test case is always a solid contribution; I just wish you'd put it in the question. Anyway, have an upvote.

Question 11

It may also be useful to note that this query updates the whole table.

Question 12

Hi @EvanCarroll, I really appreciate your advice. I usually write out the whole DDL just to avoid typos and make sure the answer is correct.

Question 13

You do a great job at it too. I would write them as CREATE TEMP TABLE AS instead, but that's a matter of style. If I can offer two hints to make your life easier, since we both do this very often: if you use an editor like Vim, learn visual-block editing and line-edit mode, and also use Tim Pope's vim-surround plugin. It reduces the workload to less than a minute. ;)

score 3 · Accepted Answer · 2017-01-29 21:18:55Z

answered Jan 29, 2017 at 21:18

Erwin Brandstetter

Thank you for your answer @Erwin. It set me on the right path (I think!). I actually had a tsvector column already created, so used that instead of creating a new index. My query ended up looking like: UPDATE table_b b SET id_number = a.id_number FROM table_a a WHERE a.tsvector_column @@ plainto_tsquery ('simple', concat_ws(' ', b.column_1, b.column_1)); - But this did leave me with some rows that didn't match up for any visible reason. Any ideas why that might have been?
– Matt, commented Feb 1, 2017 at 16:04

@Matt. Many ideas. Depends on the questions and assumptions I mentioned. Plus, what exactly is in that tsvector column. My educated guess: Your tsvector column might have been created with a different ts config (not 'simple'). Post a new question with exact information for the mismatch case.
– Erwin Brandstetter, commented Feb 1, 2017 at 20:51

Thanks again Erwin, king of postgres! You were right on the ts config, it was set as 'english' not 'simple'. Changing this has increased the match rate to its highest possible level. Now I just have to fix some messy data...
– Matt, commented Feb 2, 2017 at 10:38
