I have a query across two large tables. The first records the latest user activity at a location, the second is a dimension table with a natural primary key of location.
Table sizes here are roughly 100 millions rows in user_location_rating
and 10 million rows in dim_location
.
Most users have < 1000 records in user_location_rating
and for those users query performance is adequate.
For users with a lot of activity data, this query, even though it is two simple selects, can still be slow. I would like to improve the query performance. Can this by done by adding an additional index? As an alternative, is there a way to exploit an index to make a more limited query (such as below) more efficient than the full query?
SELECT d.create_time
FROM user_location_rating f
JOIN dim_location d using(location_id)
WHERE f.user_id=?
AND f.platform=?
AND d.category=?;
SELECT d.create_time
FROM user_location_rating f
JOIN dim_location d using(location_id)
WHERE f.user_id=?
AND f.platform=?
AND d.category=?
ORDER BY d.create_time DESC
LIMIT 1000;
EXPLAIN SELECT yields the following on these queries (for a user with 999 events, say)
+----+-------------+----------------------+--------+-------------------------------+-----------+---------+----------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+--------+-------------------------------+-----------+---------+----------------------------------+------+-------------+
| 1 | SIMPLE | user_location_rating | ref | k_userloc,k_locplat,k_usrplat | k_usrplat | 8 | const,const | 999 | Using index |
| 1 | SIMPLE | dim_location | eq_ref | PRIMARY | PRIMARY | 4 | user_location_rating.location_id | 1 | Using where |
+----+-------------+----------------------+--------+-------------------------------+-----------+---------+----------------------------------+------+-------------+
+----+-------------+----------------------+--------+-------------------------------+-----------+---------+----------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+--------+-------------------------------+-----------+---------+----------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | user_location_rating | ref | k_userloc,k_locplat,k_usrplat | k_usrplat | 8 | const,const | 999 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | dim_location | eq_ref | PRIMARY | PRIMARY | 4 | user_location_rating.location_id | 1 | Using where |
+----+-------------+----------------------+--------+-------------------------------+-----------+---------+----------------------------------+------+----------------------------------------------+
Table definitions
CREATE TABLE `user_location_rating` (
`user_activity_id` int(16) NOT NULL AUTO_INCREMENT,
`user_id` int(16) NOT NULL DEFAULT '0',
`location_id` int(16) NOT NULL DEFAULT '0',
`platform` int(2) NOT NULL DEFAULT '-1',
`rating` int(2) NOT NULL DEFAULT '-1'
PRIMARY KEY (`location_id`,`user_activity_id`),
UNIQUE KEY `k_userloc` (`user_id`,`location_id`),
KEY `k_locplat` (`location_id`,`platform`),
KEY `k_usrplat` (`user_id`,`platform`)
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC
CREATE TABLE `dim_location` (
`location_id` int(16) NOT NULL AUTO_INCREMENT,
`category` int(2) NOT NULL DEFAULT '0',
`create_time` datetime DEFAULT NULL,
PRIMARY KEY (`location_id`)
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC
(I've made some effort to clean up the actual definitions here to be close to the minimal relevant example so apologies if any typos have crept in.)
1 Answer 1
For the first table, an index on (platform, user_id, location_id)
or (user_id, platform, location_id)
would be best, for both queries. Good for you, the existing index k_usrplat
is equivalent to the second index (InnoDB indexes silently include the PK columns). You can see that the index is indeed used by both queries.
For the second table, it's more complex. You might get adequate performance with the existing index of the primary key, at least for the first query. A possible improvement is a (category, location_id, create_time)
index.
You could experiment with a (category, create_time, location_id)
index as well, which might be useful for the second query. A lot will depend on the distribution of data and efficiency may vary for different parameters.
The definitions of integer columns look weird. Why do you have int (16)
and int (2)
? If you think that that means any restrictions on the possible values in the columns, you are mistaken. The columns are of the same type, the numbers in parentheses is just a directive to user interfaces which is mostly ignored. If some of these columns, for example platform
can only hold small values (say 0-100 or 0-2000), then use an appropriately smaller type:
tiny int (-128 .. +127) : 1 byte
small int (-32768 .. + 32767) : 2 bytes
medium int (-2^23 .. + 2^23-1) : 3 bytes
tiny int unsigned (0 .. +255) : 1 byte
small int unsigned (0 .. +65535) : 2 bytes
medium int unsigned (0 .. +2^24-1) : 3 bytes
That will save space in the table and their indexes, both in disk and in memory usage.
Less space -> Less I/O -> faster queries
-
Excellent answer. However, I see no advantage with an index starting with
category
. (If experimenting shows it to be useful, I'm all ears.)Rick James– Rick James2018年03月03日 19:42:37 +00:00Commented Mar 3, 2018 at 19:42
Explore related questions
See similar questions with these tags.
SHOW CREATE TABLE
output for both tables and theEXPLAIN SELECT ...
output for your queries. How big are the tables?user_location_rating
. Also, do most queries involveWHERE user_id = ?
? If so, I may be able to improve even on ypercube's excellent answer.user_id = ?
are the ones of interest here (platform and category could be fixed to constants if that mattered). In terms of the keys already defined, these are more for the context (i.e. they exist and so aren't considered sufficient for this question) but I'm not sure what problem they were thought to solve.