I have been pondering what would be the most efficient way to search for any of multiple urls contained in an unstructured text field.
Suppose I want to find posts containing either https://stackexchange.com/tour or https://stackoverflow.com/tour
The obvious query would be to do
WHERE body LIKE '%https://stackexchange.com/tour%' OR body LIKE '%https://stackoverflow.com/tour%';
However, I wonder if this would end up inefficiently doing two full searches of the values.
On MySQL family we could alternatively use REGEXP which would ensure it is checked on one pass:
WHERE body REGEXP 'https://stack(exchange|overflow)\.com/tour';
But the cost of matching a regex might be (much) higher than two simple string searches and thus counterproductive.
There are more options, such as using INSTR()
/LOCATE()
instead of LIKE
, but they would still have to be OR-ed.
Which would be the more efficient way to perform such query? Would the optimizer collapse the multiple LIKE
s into one "action"?
Note that a FULLTEXT index would be the preferred solution in many cases
WHERE MATCH (body) AGAINST ('https://stackexchange.com/tour https://stackoverflow.com/tour' IN BOOLEAN MODE);
but is not suitable here since urls are interpreted as multiple words. Quoting them would help, but still return non-urls:
WHERE MATCH (body) AGAINST ('"https://stackexchange.com/tour" "https://stackoverflow.com/tour"' IN BOOLEAN MODE);
(although a fulltext index could be combined to other approaches to filter down the results to search)
A toy table to play with could be
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
body TEXT,
FULLTEXT (body)
) ENGINE=InnoDB;
INSERT INTO post (body) VALUES ('See https://stackexchange.com/tour for the kind of questions we like'), ('Look at https://stackoverflow.com/tour'), ('I have read https://stackexchange.com/tour'), ('https://example.com'), ('The HTTPS stackexchange.com tour explains all about TLS');
-
1The obvious query would be to do For long text valuess and short searching substring valuess the usage of LOCATE / INSTR function is preferred.Akina– Akina2022年06月27日 06:31:39 +00:00Commented Jun 27, 2022 at 6:31
-
Would the optimizer collapse the multiple LIKEs into one "action"? -- Have you looked at the query plan?mustaccio– mustaccio2022年06月27日 16:30:39 +00:00Commented Jun 27, 2022 at 16:30
2 Answers 2
If you can't use a fulltext index, then there is no way to optimize substring searches.
Whether you do LIKE
, or REGEXP
, or LOCATE()
/INSTR()
as the comment above suggests, all of these searches will do a table-scan.
So the only way to make that have better performance is to delete most of your data first, so the table has fewer rows to scan.
An alternative is to use an inverted index. That is, pre-scan your text and look for key words you intend to search. Populate another table with the key words, and a third many-to-many table to map the key words to your text documents where they occur.
CREATE TABLE KeywordInPost (
keyword_id INT NOT NULL,
post_id INT NOT NULL,
PRIMARY KEY (keyword_id, post_id),
FOREIGN KEY (keyword_id) REFERENCES Keywords(keyword_id),
FOREIGN KEY (post_id) REFERENCES post(id)
);
Then you can query:
SELECT post.*
FROM Keywords
JOIN KeywordInPost ON Keywords.keyword_id = KeywordInPost.keyword_id
JOIN post ON KeywordInPost.post_id = post.id
WHERE Keywords.word = '<word you want>';
At least that way you don't have to do substring searches. You can store any string you want in the Keywords table and search it with =
. Then use efficient indexed lookups in the joins to find posts where that keyword occurs.
Of course, it's up to you to populate the KeywordInPost table and keep it up to date.
Your Question reads like a combination of a Question and a self-Answer. Your conclusion is buried in the middle: "fulltext index could be combined to other approaches to filter down the results to search" This avoids the point that Bill makes: "all of these searches [except the combined] will do a table-scan".
I believe that this is a rare case where the order of clauses in WHERE
matters, and we can take advantage of it. (Or maybe the Optimizer does FULLTEXT first?)
WHERE MATCH (...) AGAINST ('...' IN BOOLEAN MODE) -- first
AND ... REGEXP ... -- second
This uses FULLTEXT, avoiding a full table scan, then filters out the extra rows.
I think (without adequate proof) that REGEXP is a little slower than a single LIKE. Your example has two LIKEs per row, so it might be slower than a single REGEXP per row.
You may benefit from using +
:
AGAINST("+https stackexchange stackoverflow +com +tour")
But a caution: Common words ('and' and 'the', plus possibly 'https' and 'com' in your dataset) are not indexed. That is the +
will decide that no rows match. This would probably be safer and maybe faster:
AGAINST("stackexchange stackoverflow +tour")
(Unless this database is all about "tours".)
(And, yes, Bill's manually built inverted index may be even better.)