I have some instagram API data sitting across two tables. One with post_data, and one with profile_data.
The profile_data table has a timestamp of when the account was checked through the API to get the number of followers the account has at a point in time.
The post_data table has the timestamp of a post.
I need to join these two together with a best fit timestamp to calculate an engagement rate (i.e. likes on a post / followers of account).
The profile_data table has just under 13k rows of data and the post_data table has just under 6k of rows.
There is a third relational table called brands which contains the brand featured in a post_id
My current query looks like this:
SELECT p.ent_id,
p.post_time,
p.record_updated,
p.last_checked,
p.post_id,
p.id,
p.image,
p.caption,
p.tags,
p.likes as 'likes',
p.comment_count as 'comment_count',
p.comments,
p.users_in_photo,
p.post_type,
p.link,
p.gender,
p.model,
p.content_type,
p.location,
p.campaign,
GROUP_CONCAT(b.brand SEPARATOR',') AS brands,
d.followers,
((p.likes+p.comment_count)/d.followers)*100 AS 'engagement_rate',
p.reach as 'reach',
p.impressions as 'impressions',
p.saved as 'saved',
((p.reach/d.followers)*100) as 'reach_rate'
FROM post_data p
LEFT JOIN featured_brands b ON p.post_id = b.post_id
LEFT JOIN profile_data d ON p.post_time BETWEEN SUBTIME(d.time_checked, '00:30:00')
AND ADDTIME(d.time_checked, '00:30:00')
WHERE p.post_time > '2019-01-02 00:00:00'
AND p.post_time < '2019-05-13 23:59:59'
GROUP BY p.post_id
ORDER BY likes DESC;
This query achieves the goal however for the date range above, the query takes almost 27seconds which is not acceptable. I have identified that changed the SUBTIME and ADDTIME to the equivalent of (d.time_checked - interval 30 minutes)
I can get this down to about 15 seconds but this is still too long.
I figure the BETWEEN call is my major problem but I'm unsure of how to achieve the same result more effectively.
Are there any other methods that I could achieve this result in a more efficient manner?
Edit:
Mysql version 5.5.42.
DDL:
CREATE TABLE `featured_brands` (
`post_id` varchar(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
`brand` varchar(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `post_data` (
`ent_id` int(11) NOT NULL AUTO_INCREMENT,
`post_time` timestamp NULL DEFAULT NULL,
`record_updated` timestamp NULL DEFAULT NULL,
`last_checked` timestamp NULL DEFAULT NULL,
`post_id` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT '',
`image` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`caption` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`tags` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`likes` int(7) NOT NULL,
`comment_count` int(7) NOT NULL,
`comments` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`users_in_photo` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`post_type` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`link` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`gender` varchar(1000) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
`model` varchar(1000) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
`content_type` varchar(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
`location` varchar(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
`campaign` varchar(255) CHARACTER SET utf8mb4 DEFAULT NULL,
`id` varchar(50) CHARACTER SET utf8mb4 DEFAULT NULL,
`ig_id` varchar(50) CHARACTER SET utf8mb4 DEFAULT NULL,
`engagement` int(11) DEFAULT NULL,
`reach` int(11) DEFAULT NULL,
`saved` int(11) DEFAULT NULL,
`impressions` int(11) DEFAULT NULL,
`no_advanced_data` int(11) DEFAULT NULL,
PRIMARY KEY (`ent_id`),
UNIQUE KEY `post_id` (`post_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `profile_data` (
`record_id` int(11) NOT NULL,
`time_checked` timestamp NULL DEFAULT NULL,
`profile_id` int(7) DEFAULT NULL,
`username` varchar(255) DEFAULT NULL,
`profile_picture` varchar(255) DEFAULT NULL,
`full_name` varchar(255) DEFAULT NULL,
`bio` text,
`website` varchar(255) DEFAULT NULL,
`is_business` int(1) DEFAULT NULL,
`media_count` int(11) DEFAULT NULL,
`follows` int(11) DEFAULT NULL,
`followers` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
1 Answer 1
Some optimizations in these areas have been made since 5.5. I recommend you upgrade.
Don't use LEFT
unless you need the semantics of it.
Avoid joining to b
in the following manner: Change GROUP_CONCAT(b.brand SEPARATOR',') AS brands
to
( SELECT GROUP_CONCAT(brand SEPARATOR',')
FROM featured_brands WHERE post_id = p.post_id
) AS brands
and get rid of the LEFT JOIN ...
This may help the "explode-implode" pattern of JOINing
, then collapsing via GROUP BY
.
You may be able to get rid of the GROUP BY
altogether by turning the other LEFT JOIN
into a "derived table":
SELECT ...
FROM ( derived-table (see below) ) AS x
JOIN post_data AS p ON p.post_id = x.post_id
The derived table would be a SELECT
that JOIN
post_data and profile_data
with the primary goal of finding which post_ids are interesting.
SELECT p1.post_id, d.followers
FROM post_id AS p1
JOIN profile_data AS d ON ...
WHERE post_time ...
Indexes...
profile_data: (time_checked, followers) -- in this order; covering
post_data: (post_time, post_id)
featured_brands (post_id, brand) -- unless you have PRIMARY KEY(post_id)
-
Thanks for the suggestions. I've followed the suggestions including creating the indexes but I'm a bit stuck on the derived table bit. How is this optimising or replacing the between BETWEEN part of the query that is taking the longest time to execute? Creating the indexes and adding your first suggested slowed it further to 36.4seconds.James– James2019年05月16日 11:41:12 +00:00Commented May 16, 2019 at 11:41
Explore related questions
See similar questions with these tags.
((p.likes+p.comment_count)/d.followers)*100 AS 'engagement_rate'
and similar forreach_rate
. But you useGROUP BY p.post_id
. In that case, when more than one record inprofile_data
matches the record inpost_data
, then thefollowers
field value will be taken from a random record within the matched group, making the result both random and non-deterministic...