7
\$\begingroup\$

I recently challenged @Hosch520 with writing a query as such:

Find first questions with answers posted within 24 hours

And he did great. I figured I would have a go at it too. Turns out I learned valuable things about data subsets, so here I am seeking review.

I did not specify whether to return only the first answer within 24 hours, or all answers within 24 hours, so we went about it a bit differently. The remarks at the top of my query give more details.

In any case, mine returns all answers posted to a new user's first question within 24 hours.

Link to query on SEDE.

/*
 * Author: @Phrancis
 * Title: 1st questions with answers within X hours
 * Date: 2015年04月16日
 * Purpose: 
 * To query a new user's first question under the condition that
 * the question received one or more answers within a certain number of
 * hours (default 24), as well as all qualifying answers, and both users'
 * name and current reputation. It also calculates the interval between
 * question and answers.
*/
DECLARE @HoursElapsed AS INT;
SET @HoursElapsed = ##NumberOfHours:float?24##;
-- NumberOfHours: Hours elapsed between OP 1st question and answers following it "Decimals are allowed"
DECLARE @QuestionType INT;
DECLARE @AnswerType INT;
SET @QuestionType = 1;
SET @AnswerType = 2;
WITH AllFirstQuestions AS (
 SELECT
 OwnerUserId
 , Id
 FROM Posts
 WHERE CreationDate IN (SELECT MIN(CreationDate) FROM Posts GROUP BY OwnerUserId)
 AND PostTypeId = @QuestionType
)
SELECT 
 qUser.Id AS [User Link]
 , qUser.Reputation AS [Q OP Rep]
 , firstQ.Id AS [Post Link]
 , q.CreationDate AS [Q Date]
 , CONVERT( DECIMAL(10,2), DATEDIFF(MINUTE, q.CreationDate, a.CreationDate) /60.0 ) AS [Hours Elapsed]
 , aUser.Id AS [User Link]
 , aUser.Reputation AS [A OP Rep]
 , a.Id AS [Post Link]
 , a.CreationDate AS [A Date]
FROM Posts AS q
INNER JOIN AllFirstQuestions AS firstQ
 ON q.Id = firstQ.Id
INNER JOIN Posts AS a
 ON q.Id = a.ParentId
 AND q.PostTypeId = @QuestionType
 AND a.PostTypeId = @AnswerType
INNER JOIN Users qUser
 ON q.OwnerUserId = qUser.Id
 AND q.OwnerUserId IS NOT NULL
INNER JOIN Users aUser
 ON a.OwnerUserId = aUser.Id
 AND a.OwnerUserId IS NOT NULL
 WHERE a.CreationDate <= (DATEADD(HOUR, @HoursElapsed, a.CreationDate))
 AND DATEDIFF(HOUR, q.CreationDate, a.CreationDate) <= @HoursElapsed
/* This condition is needed because some "fluke" questions 
 * have an answer creation time that is before the question's. 
 * Usually due to complicated migrations.
 */
 AND DATEDIFF(HOUR, q.CreationDate, a.CreationDate) >= 0
ORDER BY 
 q.CreationDate DESC
 , aUser.Reputation DESC;

I'm looking to see if this can be improved. Maybe mine is too much? I personally feel it is reasonable, and it executes pretty fast:

14662 rows returned in 194 ms

I scanned through the output for a while to make sure I did not have multiple first questions by the same user, and AFAIK, I do not. I'm open to any criticism to make this better!

asked Apr 18, 2015 at 5:50
\$\endgroup\$

2 Answers 2

6
\$\begingroup\$
a.CreationDate <= (DATEADD(HOUR, @HoursElapsed, a.CreationDate))

This first condition of your WHERE clause either needs to be removed or needs some explanation.

For any non-negative value of @HoursElapsed, this will always return true. Perhaps, the intent was to use q.CreationDate as the third argument of the DATEADD function?

Otherwise, this condition just says:

x <= x + y

The <= can return true for any non-negative y value. No matter the value of x, x plus some non-negative y will always be some number larger than (or in the case of y=0, equal to) x.


I don't like your indention style. Specifically, in your CTE, I don't like that the AND in your WHERE clause gets the same indention level as the rest of the query. In your main query, I don't like that your INNER JOIN have the same indention level as the rest of the query (and the WHERE appears to have a space in front of it.

Importantly, any SQL query has up to eight primary clauses. WITH, SELECT, INTO, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. Everything else is a part of one of these clauses. These 8 clauses should have zero indentation relative to the query as a whole, and everything else should be indented by at least one level (and I personally prefer four spaces, but I know SEDE default is two spaces).

Our major offender here is the massive list of INNER JOIN subclauses (helper elves?) within the FROM clause.


Keeping in mind the comments I already made about the first part of your WHERE clause, we could simplify the entire WHERE clause into simply just this:

WHERE a.CreationDate BETWEEN q.CreationDate AND DATEADD(HOUR,@HoursElapsed,q.CreationDate)

Our FROM can be massively simplified.

First, there's no reason our CTE can't include the CreationDate. That completely eliminates the need for us to have the Posts table we labeled q. And that can clean up some other nastiness, like q.PostTypeId = @QuestionType (our CTE already narrows that dataset down to just questions).

Bearing in mind the recommended change to the CTE (just add CreationDate to the select list), I recommend a FROM that looks more like this:

FROM AllFirstQuestions q
 INNER JOIN Posts as a ON a.PostTypeId = @AnswerType AND a.ParentId = q.Id 
 INNER JOIN Users qUser ON qUser.Id = q.OwnerUserId
 INNER JOIN Users aUser ON aUser.Id = a.OwnerUserId

I'm guessing that a.PostTypeId = @AnswerType is probably unnecessary, but I don't know enough about SEDE.

I'm going to wager though that the Id column of the Users table is an indexed, auto-incrementing, non-null primary key. We don't have to include a AND alias.OwnerUserId IS NOT NULL on joining it. Any post with a NULL owner user id will be eliminated by virtue of the fact that there are no NULLs in the Id column of the Users table (and even if there were, null wouldn't join to null... I don't believe).


Finally, this is SEDE specific, but it doesn't make sense to allow the user to enter a decimal number for hours elapsed. You're saving it as an INT, so it's immediately converted to an INT and the decimal part is lost before you even do anything with it. Moreover, SQL's DATEADD (one place you use @HoursElapsed) takes an INT, so it doesn't make sense to even think about passing anything else here, and SQL's DATEDIFF (you compare result of this to @HoursElapsed) returns an INT, so it also doesn't make a whole lot of sense to compare to a floating point number here. I think you just need to not allow the user to enter decimal places, since they're 100% ignored no matter what. Allowing the user to enter decimal places gives the false impression that your query is doing something it's not.

answered Apr 20, 2015 at 1:39
\$\endgroup\$
1
  • 1
    \$\begingroup\$ Thanks for the answer! Just so you know, Users.Id actually can and does contain NULL values. In the instance of a user who deleted their account and things like that. \$\endgroup\$ Commented Apr 20, 2015 at 16:42
4
\$\begingroup\$

Race condition

If an established user asks a question at the exact same instant that a new user makes a first post, it's possible that the established user's question would be treated as a first post, because your AllFirstQuestions is purely time-based.

One way to solve that problem is with window functions. The solution below also eliminates a JOIN of q and firstQ.

Users

What is the purpose of this query? Does a self-answer count as receiving an answer? Would scores on the questions/answers be more relevant than their owners' reputation scores?

Posts from deleted users and not-yet-created users still count, right? I would use LEFT OUTER JOIN Users instead.

The output could be presented more meaningfully. Some of the columns (the Post Link for the question and the Answer Date) seem a bit redundant.

Suggested solution

One more suggestion: you can SET variables as you DECLARE them.

DECLARE @HoursElapsed INT = ##NumberOfHours:int?24##;
-- NumberOfHours: Hours elapsed between OP 1st question and answers following it
DECLARE @QuestionType INT = 1;
DECLARE @AnswerType INT = 2;
WITH UserPosts AS (
 SELECT *
 , ROW_NUMBER() OVER (PARTITION BY OwnerUserId ORDER BY CreationDate) AS UserNthPost
 FROM Posts
)
SELECT 
 qUser.Id AS [User Link]
 , qUser.Reputation AS [Q OP Rep]
 , aUser.Id AS [User Link]
 , aUser.Reputation AS [A OP Rep]
 , q.CreationDate AS [Q Date]
 , CONVERT( DECIMAL(10,2), DATEDIFF(MINUTE, q.CreationDate, a.CreationDate) /60.0 ) AS [Hours Elapsed]
 , 'site://posts/' + CAST(q.Id AS NVARCHAR) + '|Q' AS [Q Link]
 , a.Id AS [Post Link]
FROM UserPosts AS q
 INNER JOIN Posts AS a
 ON q.Id = a.ParentId
 AND q.PostTypeId = @QuestionType
 AND a.PostTypeId = @AnswerType
 LEFT OUTER JOIN Users qUser
 ON q.OwnerUserId = qUser.Id
 LEFT OUTER JOIN Users aUser
 ON a.OwnerUserId = aUser.Id
WHERE
 UserNthPost = 1 AND
 a.CreationDate BETWEEN q.CreationDate AND DATEADD(HOUR, @HoursElapsed, q.CreationDate)
ORDER BY 
 q.CreationDate DESC
 , aUser.Reputation DESC;
answered Apr 21, 2015 at 15:09
\$\endgroup\$
1
  • \$\begingroup\$ @sᴉɔuɐɹɥԀ For the record, setting variables as you declare them works on MSSQLServer 2008 and up (which means it works on SEDE), but it will not work on MSSQLServer 2005 and earlier. \$\endgroup\$ Commented Apr 21, 2015 at 22:19

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.