My query is this:
UPDATE `phrases`
SET `phrases`.`count`=(SELECT COUNT(*) FROM `strings` WHERE `string` LIKE CONCAT('%', `phrases`.`phrase`, '%'))
My tables look like this:
CREATE TABLE `phrases` (
`hash` varchar(32) NOT NULL,
`count` int DEFAULT 0,
`phrase` text NOT NULL,
PRIMARY KEY (`hash`),
KEY(`count`)
)
And
CREATE TABLE `strings` (
`string` text NOT NULL
)
phrases has 18,000 rows and strings has 1,500 rows.
It might be more efficient to have a separate table where you would store the counts per phrase, and then only update this table when a new string is added. Since the number of strings is low in comparison to the phrases, I figure this won't happen that often. So you would not perform the whole count again, just add 1 if the new string matches that phrase. – saratis, Dec 20, 2011 at 22:09
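A minimal sketch of that incremental approach, assuming new strings arrive via plain INSERTs (the trigger name is illustrative, and the `LIKE` mirrors the question's substring semantics):

```sql
-- Hypothetical trigger: when a string is inserted, bump the count of every
-- phrase that occurs in the new string, instead of recounting everything.
DELIMITER //
CREATE TRIGGER strings_after_insert AFTER INSERT ON `strings`
FOR EACH ROW
BEGIN
  UPDATE `phrases`
  SET `count` = `count` + 1
  WHERE NEW.`string` LIKE CONCAT('%', `phrases`.`phrase`, '%');
END//
DELIMITER ;
```

Each insert still scans all 18,000 phrases, but that is far cheaper than re-running the full 18,000 × 1,500 count.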
2 Answers
Since you're using a LIKE with wildcards, you're going to do a table scan against both tables, running a total of 18,000 × 1,500 = 27,000,000 substring comparisons.
To optimize this, you need to use some fulltext index technology. I suggest Sphinx Search or Apache Solr. If you do this, you don't need to keep a count of how many matches there are, because the search index makes it a lot less expensive to get a count on demand.
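For MySQL's built-in option, an on-demand count might look like this (a sketch; the index name is illustrative, and note that FULLTEXT matching is word-based, so it is not a drop-in replacement for the `'%...%'` arbitrary-substring semantics in the question):

```sql
-- Requires MyISAM on MySQL <= 5.5, as noted below.
ALTER TABLE `strings` ADD FULLTEXT INDEX ft_string (`string`);

-- Count matching strings for one phrase on demand,
-- instead of maintaining a stored count column.
SELECT COUNT(*)
FROM `strings`
WHERE MATCH(`string`) AGAINST ('some phrase' IN BOOLEAN MODE);
```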
MySQL also implements a FULLTEXT index type, but it is only supported in the MyISAM storage engine in current versions (up to 5.5). I don't recommend using MyISAM for important data.
MySQL 5.6 is developing a fulltext index for InnoDB.
You should drop the index, collect the counts, and then put the index back. Dropping the index will speed up the updating of the `count` column.
ALTER TABLE phrases DROP INDEX `count`;
UPDATE phrases SET `count` = 0;
UPDATE phrases INNER JOIN strings
ON ( LOCATE(phrases.phrase, strings.string) > 0 )
SET phrases.`count` = phrases.`count` + 1;
ALTER TABLE phrases ADD INDEX `count` (`count`);
This INNER JOIN is nothing more than a Cartesian product (as Bill Karwin's answer points out, 27,000,000 row combinations are examined).
If the time to process is something you can live with, all well and good.
If the time to process is disastrously slow, you must try Bill Karwin's answer.