I recently challenged @Hosch520 with writing a query as such:
Find first questions with answers posted within 24 hours
And he did great. I figured I would have a go at it too. Turns out I learned valuable things about data subsets, so here I am seeking review.
I did not specify whether to return only the first answer within 24 hours, or all answers within 24 hours, so we went about it a bit differently. The remarks at the top of my query give more details.
In any case, mine returns all answers posted to a new user's first question within 24 hours.
/*
* Author: @Phrancis
* Title: 1st questions with answers within X hours
* Date: 2015年04月16日
* Purpose:
* To query a new user's first question under the condition that
* the question received one or more answers within a certain number of
* hours (default 24), as well as all qualifying answers, and both users'
* name and current reputation. It also calculates the interval between
* question and answers.
*/
DECLARE @HoursElapsed AS INT;
SET @HoursElapsed = ##NumberOfHours:float?24##;
-- NumberOfHours: Hours elapsed between OP 1st question and answers following it "Decimals are allowed"
DECLARE @QuestionType INT;
DECLARE @AnswerType INT;
SET @QuestionType = 1;
SET @AnswerType = 2;
WITH AllFirstQuestions AS (
SELECT
OwnerUserId
, Id
FROM Posts
WHERE CreationDate IN (SELECT MIN(CreationDate) FROM Posts GROUP BY OwnerUserId)
AND PostTypeId = @QuestionType
)
SELECT
qUser.Id AS [User Link]
, qUser.Reputation AS [Q OP Rep]
, firstQ.Id AS [Post Link]
, q.CreationDate AS [Q Date]
, CONVERT( DECIMAL(10,2), DATEDIFF(MINUTE, q.CreationDate, a.CreationDate) /60.0 ) AS [Hours Elapsed]
, aUser.Id AS [User Link]
, aUser.Reputation AS [A OP Rep]
, a.Id AS [Post Link]
, a.CreationDate AS [A Date]
FROM Posts AS q
INNER JOIN AllFirstQuestions AS firstQ
ON q.Id = firstQ.Id
INNER JOIN Posts AS a
ON q.Id = a.ParentId
AND q.PostTypeId = @QuestionType
AND a.PostTypeId = @AnswerType
INNER JOIN Users qUser
ON q.OwnerUserId = qUser.Id
AND q.OwnerUserId IS NOT NULL
INNER JOIN Users aUser
ON a.OwnerUserId = aUser.Id
AND a.OwnerUserId IS NOT NULL
WHERE a.CreationDate <= (DATEADD(HOUR, @HoursElapsed, a.CreationDate))
AND DATEDIFF(HOUR, q.CreationDate, a.CreationDate) <= @HoursElapsed
/* This condition is needed because some "fluke" questions
* have an answer creation time that is before the question's.
* Usually due to complicated migrations.
*/
AND DATEDIFF(HOUR, q.CreationDate, a.CreationDate) >= 0
ORDER BY
q.CreationDate DESC
, aUser.Reputation DESC;
I'm looking to see if this can be improved. Maybe mine is too much? I personally feel it is reasonable, and it executes pretty fast:
14662 rows returned in 194 ms
I scanned through the output for a while to make sure I did not have multiple first questions by the same user, and AFAIK, I do not. I'm open to any criticism to make this better!
2 Answers 2
a.CreationDate <= (DATEADD(HOUR, @HoursElapsed, a.CreationDate))
This first condition of your WHERE
clause either needs to be removed or needs some explanation.
For any non-negative value of @HoursElapsed
, this will always return true. Perhaps, the intent was to use q.CreationDate
as the third argument of the DATEADD
function?
Otherwise, this condition just says:
x <= x + y
The <=
can return true for any non-negative y
value. No matter the value of x
, x
plus some non-negative y
will always be some number larger than (or in the case of y=0
, equal to) x
.
I don't like your indention style. Specifically, in your CTE, I don't like that the AND
in your WHERE
clause gets the same indention level as the rest of the query. In your main query, I don't like that your INNER JOIN
have the same indention level as the rest of the query (and the WHERE
appears to have a space in front of it.
Importantly, any SQL query has up to eight primary clauses. WITH
, SELECT
, INTO
, FROM
, WHERE
, GROUP BY
, HAVING
, and ORDER BY
. Everything else is a part of one of these clauses. These 8 clauses should have zero indentation relative to the query as a whole, and everything else should be indented by at least one level (and I personally prefer four spaces, but I know SEDE default is two spaces).
Our major offender here is the massive list of INNER JOIN
subclauses (helper elves?) within the FROM
clause.
Keeping in mind the comments I already made about the first part of your WHERE
clause, we could simplify the entire WHERE
clause into simply just this:
WHERE a.CreationDate BETWEEN q.CreationDate AND DATEADD(HOUR,@HoursElapsed,q.CreationDate)
Our FROM
can be massively simplified.
First, there's no reason our CTE can't include the CreationDate
. That completely eliminates the need for us to have the Posts
table we labeled q
. And that can clean up some other nastiness, like q.PostTypeId = @QuestionType
(our CTE already narrows that dataset down to just questions).
Bearing in mind the recommended change to the CTE (just add CreationDate
to the select list), I recommend a FROM
that looks more like this:
FROM AllFirstQuestions q
INNER JOIN Posts as a ON a.PostTypeId = @AnswerType AND a.ParentId = q.Id
INNER JOIN Users qUser ON qUser.Id = q.OwnerUserId
INNER JOIN Users aUser ON aUser.Id = a.OwnerUserId
I'm guessing that a.PostTypeId = @AnswerType
is probably unnecessary, but I don't know enough about SEDE.
I'm going to wager though that the Id
column of the Users
table is an indexed, auto-incrementing, non-null primary key. We don't have to include a AND alias.OwnerUserId IS NOT NULL
on joining it. Any post with a NULL
owner user id will be eliminated by virtue of the fact that there are no NULL
s in the Id
column of the Users table (and even if there were, null wouldn't join to null... I don't believe).
Finally, this is SEDE specific, but it doesn't make sense to allow the user to enter a decimal number for hours elapsed. You're saving it as an INT
, so it's immediately converted to an INT
and the decimal part is lost before you even do anything with it. Moreover, SQL's DATEADD
(one place you use @HoursElapsed
) takes an INT
, so it doesn't make sense to even think about passing anything else here, and SQL's DATEDIFF
(you compare result of this to @HoursElapsed
) returns an INT
, so it also doesn't make a whole lot of sense to compare to a floating point number here. I think you just need to not allow the user to enter decimal places, since they're 100% ignored no matter what. Allowing the user to enter decimal places gives the false impression that your query is doing something it's not.
-
1\$\begingroup\$ Thanks for the answer! Just so you know,
Users.Id
actually can and does contain NULL values. In the instance of a user who deleted their account and things like that. \$\endgroup\$Phrancis– Phrancis2015年04月20日 16:42:42 +00:00Commented Apr 20, 2015 at 16:42
Race condition
If an established user asks a question at the exact same instant that a new user makes a first post, it's possible that the established user's question would be treated as a first post, because your AllFirstQuestions
is purely time-based.
One way to solve that problem is with window functions. The solution below also eliminates a JOIN
of q
and firstQ
.
Users
What is the purpose of this query? Does a self-answer count as receiving an answer? Would scores on the questions/answers be more relevant than their owners' reputation scores?
Posts from deleted users and not-yet-created users still count, right? I would use LEFT OUTER JOIN Users
instead.
The output could be presented more meaningfully. Some of the columns (the Post Link for the question and the Answer Date) seem a bit redundant.
Suggested solution
One more suggestion: you can SET
variables as you DECLARE
them.
DECLARE @HoursElapsed INT = ##NumberOfHours:int?24##;
-- NumberOfHours: Hours elapsed between OP 1st question and answers following it
DECLARE @QuestionType INT = 1;
DECLARE @AnswerType INT = 2;
WITH UserPosts AS (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY OwnerUserId ORDER BY CreationDate) AS UserNthPost
FROM Posts
)
SELECT
qUser.Id AS [User Link]
, qUser.Reputation AS [Q OP Rep]
, aUser.Id AS [User Link]
, aUser.Reputation AS [A OP Rep]
, q.CreationDate AS [Q Date]
, CONVERT( DECIMAL(10,2), DATEDIFF(MINUTE, q.CreationDate, a.CreationDate) /60.0 ) AS [Hours Elapsed]
, 'site://posts/' + CAST(q.Id AS NVARCHAR) + '|Q' AS [Q Link]
, a.Id AS [Post Link]
FROM UserPosts AS q
INNER JOIN Posts AS a
ON q.Id = a.ParentId
AND q.PostTypeId = @QuestionType
AND a.PostTypeId = @AnswerType
LEFT OUTER JOIN Users qUser
ON q.OwnerUserId = qUser.Id
LEFT OUTER JOIN Users aUser
ON a.OwnerUserId = aUser.Id
WHERE
UserNthPost = 1 AND
a.CreationDate BETWEEN q.CreationDate AND DATEADD(HOUR, @HoursElapsed, q.CreationDate)
ORDER BY
q.CreationDate DESC
, aUser.Reputation DESC;
-
\$\begingroup\$ @sᴉɔuɐɹɥԀ For the record, setting variables as you declare them works on MSSQLServer 2008 and up (which means it works on SEDE), but it will not work on MSSQLServer 2005 and earlier. \$\endgroup\$nhgrif– nhgrif2015年04月21日 22:19:39 +00:00Commented Apr 21, 2015 at 22:19