I am trying to efficiently use the SEDE tool to get a list of tags a user has answered questions on, and their stats on each Tag.
My code works for my account in 3185 ms, but if I try to run it on @jon-skeet's, user ID (22656) then it times out.
Are there ways to make this more efficient / better code, and allow it to be used for any user ID?
-- The tags and scores of my posts
CREATE TABLE #MyPosts (
_tags nvarchar(250) COLLATE SQL_Latin1_General_CP1_CS_AS,
_score int
);
INSERT INTO #MyPosts
SELECT
Question.Tags,
Answers.Score
FROM
Posts as Answers
INNER JOIN
Posts AS Question
ON
Question.Id = Answers.ParentId
WHERE
Answers.PostTypeId = 2 AND
Answers.OwnerUserId = ##UserId:int?8041461##;
--------------------------------------------------------------------------
-- All of the tags with the total score and the amount of times used
CREATE TABLE #TagCounts (
_tagName nvarchar(35) COLLATE SQL_Latin1_General_CP1_CS_AS,
_score int,
_count int
);
INSERT INTO #TagCounts
SELECT
Tags.TagName,
sum(_score),
count(*)
FROM
#MyPosts
INNER JOIN Tags ON #MyPosts._tags LIKE '%<'+Tags.TagName +'>%'
GROUP BY Tags.Id,Tags.TagName;
--------------------------------------------------------------------------
-- Outputs the tags with links, total score, count and average score
SELECT
_tagName as [TagName],
_score as 'Total Score',
_count as 'Answer Count',
cast(_score as FLOAT) / _count as 'Average Score'
FROM
#TagCounts
WHERE
_score IS NOT NULL
ORDER BY
_score DESC
-
\$\begingroup\$ What flavour of SQL is this? Why are you creating temporary tables? \$\endgroup\$Reinderien– Reinderien2018年05月11日 20:44:50 +00:00Commented May 11, 2018 at 20:44
-
\$\begingroup\$ It's the SEDE version of TSQL, I am creating the temporary tables to pass results between the sections, filtering down to try to keep under the timeout limit. \$\endgroup\$jrtapsell– jrtapsell2018年05月11日 20:50:03 +00:00Commented May 11, 2018 at 20:50
-
\$\begingroup\$ Please show the results of an EXPLAIN on each of your queries. \$\endgroup\$Reinderien– Reinderien2018年05月11日 21:07:38 +00:00Commented May 11, 2018 at 21:07
1 Answer 1
This TagName
column holding all the tags is denormalized and the non-equi-join ON #MyPosts._tags LIKE '%<'+Tags.TagName +'>%'
kills performance. There's no efficient way for the optimizer to do this join, thus it's like a Cross-Join to the 50000+ rows in Tags
.
But the base data model is normalized, there's an associative table PostTags
between Posts
and Tags
and then it's simple equi-joins:
with MyPosts as
(
-- The tags and scores of my posts
SELECT
Question.Id AS PostId,
Answers.Score
FROM
Posts as Answers
INNER JOIN
Posts AS Question
ON
Question.Id = Answers.ParentId
WHERE
Answers.PostTypeId = 2 AND
Answers.OwnerUserId = ##UserId:int?8041461##
)
-- All of the tags with the total score and the amount of times used
SELECT
t.TagName,
sum(score) as "Total Score",
count(*) as "Answer Count",
cast(sum(score) as FLOAT) / count(*) as 'Average Score'
FROM
MyPosts AS p
JOIN PostTags AS pt
ON p.PostId = pt.PostId
JOIN Tags AS t
ON pt.TagId = t.Id
GROUP BY t.Id,t.TagName
ORDER BY "Total Score" DESC;
Your HAVING sum(score) IS NULL
is not needed, there's no NULL score.
See SEDE query which runs fine with John Skeet, too :-)