DB: MySQL 5.5.20 (WampServer, default configuration)
OS: Win 7
HDD: Western Digital 3TB Caviar Green, 3.5", IntelliPower, 64MB, Sata3 (WD30EZRX)
Memory: 8GB
MySQL my.ini: http://pastie.org/private/go9kaxlmlvirati2txbaa
Query in question:
SELECT name.id AS name_id, name.name, cast_info.id,
cast_info.role_id, cast_info.movie_id
FROM cast_info
LEFT JOIN name ON name.id = cast_info.person_id
WHERE cast_info.movie_id = 1000000
ORDER BY cast_info.movie_id ASC
It fetches all people who worked on a given movie. Problem is it can take anywhere from 0.1s to nearly 2.0s. That's too long. When the user needs to run it 10k times they might as well uninstall the application. Even I haven't had the patience to wait for it to finish.
edit: The time it takes to run the query is determined by the number of people who worked on it. Roughly 0.1s for each 10 people.
EXPLAIN:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cast_info
type: ref
possible_keys: idx_mid,mpi
key: idx_mid
key_len: 4
ref: const
rows: 15
Extra:
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: name
type: eq_ref
possible_keys: PRIMARY,id_name_idx
key: PRIMARY
key_len: 4
ref: imdb.cast_info.person_id
rows: 1
Extra:
Tables:
CREATE TABLE
cast_info
(
id
int(11) NOT NULL AUTO_INCREMENT,
person_id
int(11) NOT NULL,
movie_id
int(11) NOT NULL,
person_role_id
int(11) DEFAULT NULL,
note
text,
nr_order
int(11) DEFAULT NULL,
role_id
int(11) NOT NULL,
PRIMARY KEY (id
),
KEYidx_pid
(person_id
),
KEYidx_mid
(movie_id
),
KEYidx_cid
(person_role_id
),
KEYcast_info_role_id_exists
(role_id
),
KEYmpi
(movie_id
,person_id
,id
)
) ENGINE=MyISAM AUTO_INCREMENT=33261692 DEFAULT CHARSET=utf8CREATE TABLE
name
(
id
int(11) NOT NULL AUTO_INCREMENT,
name
varchar(110) NOT NULL,
imdb_index
varchar(12) DEFAULT NULL,
imdb_id
int(11) DEFAULT NULL,
gender
varchar(1) DEFAULT NULL,
name_pcode_cf
varchar(5) DEFAULT NULL,
name_pcode_nf
varchar(5) DEFAULT NULL,
surname_pcode
varchar(5) DEFAULT NULL,
md5sum
varchar(32) DEFAULT NULL,
PRIMARY KEY (id
),
KEYidx_name
(name
(6)),
KEYidx_imdb_id
(imdb_id
),
KEYidx_pcodecf
(name_pcode_cf
),
KEYidx_pcodenf
(name_pcode_nf
),
KEYidx_pcode
(surname_pcode
),
KEYidx_md5
(md5sum
),
KEYid_name_idx
(id
,name
)
) ENGINE=MyISAM AUTO_INCREMENT=4287972 DEFAULT CHARSET=utf8
Thanks!
edit: MyISAM is used because this is a local database, used by one local application, by one user. Only one query is executed at the same time. Also because IMDbPy takes at least a month to build the database with InnoDB...
edit: Query EXPLAIN after converting to InnoDB:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cast_info
type: ref
possible_keys: mpi
key: mpi
key_len: 3
ref: const
rows: 23
Extra:
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: name
type: eq_ref
possible_keys: PRIMARY,id_name_idx
key: PRIMARY
key_len: 4
ref: imdb.cast_info.person_id
rows: 1
Extra:
4 Answers 4
It would run faster with InnoDB,
The
PRIMARY KEY
is 'clustered'; this makes the lookup intoname
faster.A suitable setting of
innodb_buffer_pool_size
of, say, 5G (for your 8GB machine) would cache a lot of stuff, thereby minimizing the I/O for 10K operations.
You could also shrink the data (and decrease the I/O) by using MEDIUMINT UNSIGNED
or SMALLINT UNSIGNED
where appropriate.
There are redundant indexes that could (should) be removed; this would speed up the load. Also name(6)
is probably useless.
-
I converted all possible fields to MEDIUMINT UNSIGNED and one to TINYINT UNSIGNED. I don't think that changed anything. I could remove some indexes but not sure which. For instance the
mpi
index was recommended to me on freenode's #mysql channel for this exact query. Are indexes helpful if I only need a field for the SELECT x,y,z part and not JOINs or WHERE?TheMagician– TheMagician2015年12月21日 01:06:44 +00:00Commented Dec 21, 2015 at 1:06 -
Indexes mostly help with
WHERE
, but they can help withJOIN
andORDER BY
.mpi
is a "covering index", meaning that the index (a separate BTree) provides all the columns needed for the query. YourEXPLAIN
does not show it using that index. DROP INDEX idx_mid; to see if it helps.Rick James– Rick James2015年12月21日 02:13:14 +00:00Commented Dec 21, 2015 at 2:13 -
I dropped idx_mid. Converted all tables to InnoDB. I can't set innodb_buffer_pool_size above 2G so it's at 2G. And the query became slower :/TheMagician– TheMagician2015年12月21日 13:23:56 +00:00Commented Dec 21, 2015 at 13:23
-
Please provide the
EXPLAIN
after converting to InnoDB.Rick James– Rick James2015年12月22日 00:04:09 +00:00Commented Dec 22, 2015 at 0:04 -
Added to first post. Rows differs because I might've used another movie_id earlier.TheMagician– TheMagician2015年12月22日 00:40:43 +00:00Commented Dec 22, 2015 at 0:40
I saw query then explain and then table definition and first question came into my mind. Why are you still on MyISAM @ 5.5? Let's just change this to InnoDB and you should get the results. This 0.1 to 2 range could be caused by locks or other reasons but let's avoid that issue first?
Be-aware that below conversion will lock the tables and may cause downtime.
alter table cast_info engine=innodb;
alter table name engine=innodb;
-
Why MyISAM? 1) Building the database with InnoDB takes such a ridiculous amount of time nobody has time for it so MyISAM is the only option here. 2) This is a local database only ever used by a single local application by one user. Ever. So only one query is executed at a time. No two queries are ever executed at the same time. So I don't see how a lock could be causing this? 3) I've already tried converting the tables to InnoDB, nothing changed. In fact, the query took even longer to execute so I converted back to MyISAM.TheMagician– TheMagician2015年12月20日 17:24:24 +00:00Commented Dec 20, 2015 at 17:24
-
1@TheMagician ad 1) - using InnoDB instead of MyISAM for your tables means just naming a different engine in your create table statements, there is not much "design time" difference to make it into a "a ridiculous amount of time" - If you meant foreign keys and stuff, thats still optional (even when I would suggest to set them properly)jkavalik– jkavalik2015年12月20日 21:15:16 +00:00Commented Dec 20, 2015 at 21:15
-
2Sounds like IMDbPy needs some help with tuning, keys, etc.Rick James– Rick James2015年12月20日 23:24:08 +00:00Commented Dec 20, 2015 at 23:24
-
2Well I'd convert all to innodb after imdbpy finishes and back it up. at this point innodb will give you best performance boost and then some variable tuning.... Also better tune atleast buffers to start with before complaining performance. Convert innodb set bufferpool to atleast 4-6G and then see the perf. This may not be dedicated box as well!! +1 for imdbpy help @rick.....mysql_user– mysql_user2015年12月21日 04:12:04 +00:00Commented Dec 21, 2015 at 4:12
-
1@TheMagician, ah, so thats what you mean with "building", then the suggestion to "convert the tables later" is probably really a good one, use the alters as mysql_user suggests. The scripts is maybe not using transactions properly so thats why its slow..jkavalik– jkavalik2015年12月21日 10:02:51 +00:00Commented Dec 21, 2015 at 10:02
Performance is consistent with the hard drive, you shall consider getting an SSD or even better, a RAID system. Your drive is a slow desktop class hard driver, totally not fit for database work. Try testing with a 7200 rpm desktop, and you will already see improvements. If the People table is so large that it does not fit into RAM (cached by OS, not loaded in user memory), physical non-sequential access to the hard drive is the problem of your performance.
Since you don't run a "true" database, but you ship the data to the customer in a desktop application, a quick solution would be to de-normalize and add a redundant field "cast_data" on Movie, with all your needed data as a JSON. Updating such a structure will be difficult if performed in any other way than on movie (for example update the birthday of a cast), but data will be retrieved very fast, no matter the storage.
A few questions/observations:
- Why are you sorting the result on
cast_info.movie_id
when the WHERE clause only allows for a singlemovie_id
? - Likewise, why are you returning the
movie_id
in the result: it is going to be the same in every row.
Try creating a composite index on cast_info(movie_id, person_id)
.
I expect that would be much, much better.
key_buffer_size
andinnodb_buffer_pool_size
?