Optimizing querying for any of multiple terms in string

Question 1

I have been pondering what would be the most efficient way to search for any of multiple urls contained in an unstructured text field.

Suppose I want to find posts containing either https://stackexchange.com/tour or https://stackoverflow.com/tour

The obvious query would be to do

WHERE body LIKE '%https://stackexchange.com/tour%' OR body LIKE '%https://stackoverflow.com/tour%';

However, I wonder if this would end up inefficiently doing two full searches of the values.

On MySQL family we could alternatively use REGEXP which would ensure it is checked on one pass:

WHERE body REGEXP 'https://stack(exchange|overflow)\.com/tour';

But the cost of matching a regex might be (much) higher than two simple string searches and thus counterproductive.

There are more options, such as using INSTR()/LOCATE() instead of LIKE, but they would still have to be OR-ed.

Which would be the more efficient way to perform such query? Would the optimizer collapse the multiple LIKEs into one "action"?

Note that a FULLTEXT index would be the preferred solution in many cases

WHERE MATCH (body) AGAINST ('https://stackexchange.com/tour https://stackoverflow.com/tour' IN BOOLEAN MODE);

but is not suitable here since urls are interpreted as multiple words. Quoting them would help, but still return non-urls:

WHERE MATCH (body) AGAINST ('"https://stackexchange.com/tour" "https://stackoverflow.com/tour"' IN BOOLEAN MODE);

(although a fulltext index could be combined to other approaches to filter down the results to search)

A toy table to play with could be

CREATE TABLE post (
 id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
 body TEXT,
 FULLTEXT (body)
) ENGINE=InnoDB;
INSERT INTO post (body) VALUES ('See https://stackexchange.com/tour for the kind of questions we like'), ('Look at https://stackoverflow.com/tour'), ('I have read https://stackexchange.com/tour'), ('https://example.com'), ('The HTTPS stackexchange.com tour explains all about TLS');

Question 2

The obvious query would be to do For long text valuess and short searching substring valuess the usage of LOCATE / INSTR function is preferred.

Question 3

Would the optimizer collapse the multiple LIKEs into one "action"? -- Have you looked at the query plan?

Question 4

If you can't use a fulltext index, then there is no way to optimize substring searches.

Whether you do LIKE, or REGEXP, or LOCATE()/INSTR() as the comment above suggests, all of these searches will do a table-scan.

So the only way to make that have better performance is to delete most of your data first, so the table has fewer rows to scan.

An alternative is to use an inverted index. That is, pre-scan your text and look for key words you intend to search. Populate another table with the key words, and a third many-to-many table to map the key words to your text documents where they occur.

CREATE TABLE KeywordInPost (
 keyword_id INT NOT NULL,
 post_id INT NOT NULL,
 PRIMARY KEY (keyword_id, post_id),
 FOREIGN KEY (keyword_id) REFERENCES Keywords(keyword_id),
 FOREIGN KEY (post_id) REFERENCES post(id)
);

Then you can query:

SELECT post.*
FROM Keywords
JOIN KeywordInPost ON Keywords.keyword_id = KeywordInPost.keyword_id
JOIN post ON KeywordInPost.post_id = post.id
WHERE Keywords.word = '<word you want>';

At least that way you don't have to do substring searches. You can store any string you want in the Keywords table and search it with =. Then use efficient indexed lookups in the joins to find posts where that keyword occurs.

Of course, it's up to you to populate the KeywordInPost table and keep it up to date.

Question 5

Your Question reads like a combination of a Question and a self-Answer. Your conclusion is buried in the middle: "fulltext index could be combined to other approaches to filter down the results to search" This avoids the point that Bill makes: "all of these searches [except the combined] will do a table-scan".

I believe that this is a rare case where the order of clauses in WHERE matters, and we can take advantage of it. (Or maybe the Optimizer does FULLTEXT first?)

WHERE MATCH (...) AGAINST ('...' IN BOOLEAN MODE) -- first
 AND ... REGEXP ... -- second

This uses FULLTEXT, avoiding a full table scan, then filters out the extra rows.

I think (without adequate proof) that REGEXP is a little slower than a single LIKE. Your example has two LIKEs per row, so it might be slower than a single REGEXP per row.

You may benefit from using +:

AGAINST("+https stackexchange stackoverflow +com +tour")

But a caution: Common words ('and' and 'the', plus possibly 'https' and 'com' in your dataset) are not indexed. That is the + will decide that no rows match. This would probably be safer and maybe faster:

AGAINST("stackexchange stackoverflow +tour")

(Unless this database is all about "tours".)

(And, yes, Bill's manually built inverted index may be even better.)

Bill Karwin Bill Karwin 17.2k3 gold badges32 silver badges45 bronze badges · Answer 1 · 2022-06-27 15:46:36Z

If you can't use a fulltext index, then there is no way to optimize substring searches.

Whether you do LIKE, or REGEXP, or LOCATE()/INSTR() as the comment above suggests, all of these searches will do a table-scan.

So the only way to make that have better performance is to delete most of your data first, so the table has fewer rows to scan.

An alternative is to use an inverted index. That is, pre-scan your text and look for key words you intend to search. Populate another table with the key words, and a third many-to-many table to map the key words to your text documents where they occur.

CREATE TABLE KeywordInPost (
 keyword_id INT NOT NULL,
 post_id INT NOT NULL,
 PRIMARY KEY (keyword_id, post_id),
 FOREIGN KEY (keyword_id) REFERENCES Keywords(keyword_id),
 FOREIGN KEY (post_id) REFERENCES post(id)
);

Then you can query:

SELECT post.*
FROM Keywords
JOIN KeywordInPost ON Keywords.keyword_id = KeywordInPost.keyword_id
JOIN post ON KeywordInPost.post_id = post.id
WHERE Keywords.word = '<word you want>';

At least that way you don't have to do substring searches. You can store any string you want in the Keywords table and search it with =. Then use efficient indexed lookups in the joins to find posts where that keyword occurs.

Of course, it's up to you to populate the KeywordInPost table and keep it up to date.

Rick James Rick James 80.7k5 gold badges52 silver badges119 bronze badges · Answer 2 · 2022-06-27 16:19:53Z

Your Question reads like a combination of a Question and a self-Answer. Your conclusion is buried in the middle: "fulltext index could be combined to other approaches to filter down the results to search" This avoids the point that Bill makes: "all of these searches [except the combined] will do a table-scan".

I believe that this is a rare case where the order of clauses in WHERE matters, and we can take advantage of it. (Or maybe the Optimizer does FULLTEXT first?)

WHERE MATCH (...) AGAINST ('...' IN BOOLEAN MODE) -- first
 AND ... REGEXP ... -- second

This uses FULLTEXT, avoiding a full table scan, then filters out the extra rows.

I think (without adequate proof) that REGEXP is a little slower than a single LIKE. Your example has two LIKEs per row, so it might be slower than a single REGEXP per row.

You may benefit from using +:

AGAINST("+https stackexchange stackoverflow +com +tour")

But a caution: Common words ('and' and 'the', plus possibly 'https' and 'com' in your dataset) are not indexed. That is the + will decide that no rows match. This would probably be safer and maybe faster:

AGAINST("stackexchange stackoverflow +tour")

(Unless this database is all about "tours".)

(And, yes, Bill's manually built inverted index may be even better.)

Stack Exchange Network

Optimizing querying for any of multiple terms in string

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Optimizing querying for any of multiple terms in string

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions