My query is this:
UPDATE `phrases`
SET `phrases`.`count`=(SELECT COUNT(*) FROM `strings` WHERE `string` LIKE CONCAT('%', `phrases`.`phrase`, '%'))
My tables look like this:
CREATE TABLE `phrases` (
`hash` varchar(32) NOT NULL,
`count` int DEFAULT 0,
`phrase` text NOT NULL,
PRIMARY KEY (`hash`),
KEY(`count`)
)
And
CREATE TABLE `strings` (
`string` text NOT NULL
)
phrases has 18,000 rows and strings has 1,500 rows.
It might be more efficient to have a separate table where you would store the counts per phrase, and then only update this table when a new string is added. Since the number of strings is low in comparison to the phrases, I figure this won't happen that often. So you would not perform the whole count again, just add 1 if the new string matches that phrase. – saratis, Dec 20, 2011 at 22:09
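A minimal sketch of that incremental approach, assuming new strings arrive via plain INSERTs (the trigger name is illustrative, and the `LIKE` mirrors the question's substring semantics):

```sql
-- Hypothetical trigger: when a string is inserted, bump the count of every
-- phrase that occurs in the new string, instead of recounting everything.
DELIMITER //
CREATE TRIGGER strings_after_insert AFTER INSERT ON `strings`
FOR EACH ROW
BEGIN
  UPDATE `phrases`
  SET `count` = `count` + 1
  WHERE NEW.`string` LIKE CONCAT('%', `phrases`.`phrase`, '%');
END//
DELIMITER ;
```

Each insert still scans all 18,000 phrases, but that is far cheaper than re-running the full 18,000 × 1,500 count.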
2 Answers
Since you're using a LIKE with wildcards, you're going to do a table scan against both tables, running a total of 18,000 × 1,500 = 27,000,000 substring comparisons.
To optimize this, you need to use some fulltext index technology. I suggest Sphinx Search or Apache Solr. If you do this, you don't need to keep a count of how many matches there are, because the search index makes it a lot less expensive to get a count on demand.
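For MySQL's built-in option, an on-demand count might look like this (a sketch; the index name is illustrative, and note that FULLTEXT matching is word-based, so it is not a drop-in replacement for the `'%...%'` arbitrary-substring semantics in the question):

```sql
-- Requires MyISAM on MySQL <= 5.5, as noted below.
ALTER TABLE `strings` ADD FULLTEXT INDEX ft_string (`string`);

-- Count matching strings for one phrase on demand,
-- instead of maintaining a stored count column.
SELECT COUNT(*)
FROM `strings`
WHERE MATCH(`string`) AGAINST ('some phrase' IN BOOLEAN MODE);
```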
MySQL also implements a FULLTEXT index type, but it is only supported in the MyISAM storage engine in current versions (up to 5.5). I don't recommend using MyISAM for important data.
MySQL 5.6 is developing a fulltext index for InnoDB.
You should drop the index, collect the counts, and then put the index back. Dropping the index will speed up the updating of the `count` column.
ALTER TABLE phrases DROP INDEX `count`;
UPDATE phrases SET `count` = 0;
UPDATE phrases INNER JOIN strings
ON ( LOCATE(phrases.phrase, strings.string) > 0 )
SET phrases.`count` = phrases.`count` + 1;
ALTER TABLE phrases ADD INDEX `count` (`count`);
This INNER JOIN is nothing more than a Cartesian product (as Bill Karwin's answer points out, 27,000,000 row combinations are examined).
If the time to process is something you can live with, all well and good.
If the time to process is disastrously slow, you must try Bill Karwin's answer.