What this code does it explained in detail in the top portion of the query.
I have left out part of the list of language tags as I felt it was just noise. You can see the full collection here.
You can run this query on SEDE and play with it as you wish.
Note that I left plenty of wiggle room for others to modify and aggregate their own data starting from the simple query at the bottom.
1089 rows returned in 5886 ms
How could I improve this with regards to readability, performance, etc.?
/**
* The objective of this query is to gather data related to cross-posts from
* Stack Overflow (SO) to Code Review (CR). A cross-post as defined in this context is
* a question which has first been asked on SO and then a short time later asked
* again on CR (albeit often slightly modified in the way it is titled or phrased).
* Querying from 2 or more sites requires cross-database queries, and the following
* 2 databases are used here. All relevant tables are in the [dbo] schema.
* - Stack Overflow DB: [StackOverflow]
* - Code Review DB: [StackExchange.Codereview]
* 2 temporary tables are used in order to compensate for the physsical limitations
* of SEDE which otherwise will often time out before the query is completed.
* param @minutesFromSoPostToCrPost int not null : The number of minutes allowed between the original
* SO question and its cross-post on CR. Default 120 minutes.
* param @maximumCharacterCountDifferenceAllowed int not null : The maximum number
* of characters difference between the body of the question.
* NOTE: The higher the number, the more likely that it's not actually a cross-post.
*/
if object_id('tempdb..#LanguageTags') is not null
drop table #LanguageTags;
if object_id('tempdb..#CrossPosts') is not null
drop table #CrossPosts;
go
create table #LanguageTags (
TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
, constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageTags (TagName)
values
('applescript'),
('asp.net-mvc-3'),
('bash'),
('brainfuck'),
('c'),
('c#'),
('c++'),
/*SNIP...*/
('sql'),
('swift'),
('wolfram-mathematica'),
('xslt');
go
declare @questionPost int = 1;
declare @minutesFromSoPostToCrPost int = 120;
declare @maximumCharacterCountDifferenceAllowed int = 1000;
select
[Primary Stack] = case
when SoUsers.Reputation >= CrUsers.Reputation then
'Stack Overflow'
else
'Code Review'
end
, [Primary User] = case
when SoUsers.Reputation >= CrUsers.Reputation then
'http://stackoverflow.com/users/' + convert(varchar(10), SoUsers.Id) + '|' + SoUsers.DisplayName
else
'http://codereview.stackexchange.com/users/' + convert(varchar(10), CrUsers.Id) + '|' + CrUsers.DisplayName
end
, [SO Original] = 'http://stackoverflow.com/questions/' + convert(varchar(10), SoPosts.Id) + '|' + SoPosts.Title
, [CR Xpost] = 'http://codereview.stackexchange.com/questions/' + convert(varchar(10), CrPosts.Id) + '|' + CrPosts.Title
/*Calculate the character difference of the body of both questions.*/
, [CharCountDiff] = abs(len(CrPosts.Body) - len(SoPosts.Body))
, [SO Score] = SoPosts.Score
, [CR Score] = CrPosts.Score
, [SO Status] = case
when SoPosts.DeletionDate is not null then 'Deleted'
when SoPosts.ClosedDate is not null then 'Closed'
else 'OK' end
, [CR Status] = case
when CrPosts.DeletionDate is not null then 'Deleted'
when CrPosts.ClosedDate is not null then 'Closed'
else 'OK' end
/*Check in @Duga comments*/
, [DugaComments?] = case
when exists (
select 1 from [StackOverflow].dbo.Comments as SoComments
where SoPosts.Id = SoComments.PostId
and SoComments.Text like '%code%review%'
) then 'True' end
, [SO Answers] = SoPosts.AnswerCount
, [CR Answers] = CrPosts.AnswerCount
, [SO Accept?] = case
when SoPosts.AcceptedAnswerId is not null then 'True' end
, [CR Accept?] = case
when CrPosts.AcceptedAnswerId is not null then 'True' end
, [SO Created] = SoPosts.CreationDate
, [Minutes to Xpost] = datediff(minute, SoPosts.CreationDate, CrPosts.CreationDate)
, [Tags] = CrPosts.Tags
/*Adding results into temp table to avoid timeouts in `select distinct`*/
into #CrossPosts
from
/*Common users across CR and SO sites:*/
[StackExchange.Codereview].dbo.Users as CrUsers
inner join [StackOverflow].dbo.Users as SoUsers
/*AccountId is network-wide Id for each user, and
is distinct from the UserId which is for a specific site*/
on CrUsers.AccountId = SoUsers.AccountId
/*Questions by user on both sites:*/
inner join [StackExchange.Codereview].dbo.Posts as CrPosts
on CrUsers.Id = CrPosts.OwnerUserId
and CrPosts.PostTypeId = @questionPost
inner join [StackOverflow].dbo.Posts as SoPosts
on SoUsers.Id = SoPosts.OwnerUserId
and SoPosts.PostTypeId = @questionPost
/*Bring in tags so we can try to eliminate false matches
due to unrelated posts potentially being posted by the same
user on 2 different sites within our scoped time period.*/
inner join [StackExchange.Codereview].dbo.PostTags as CrPT
on CrPosts.Id = CrPT.PostId
inner join [StackExchange.Codereview].dbo.Tags as CrTags
on CrPT.TagId = CrTags.Id
inner join [StackOverflow].dbo.PostTags as SoPT
on SoPosts.Id = SoPT.PostId
inner join [StackOverflow].dbo.Tags as SoTags
on SoPT.TagId = SoTags.Id
where
/*Q was first posted on SO, then later on CR*/
SoPosts.CreationDate < CrPosts.CreationDate
/*Q was posted on CR within a certain number of minutes after being posted on SO*/
and datediff(minute, SoPosts.CreationDate, CrPosts.CreationDate) <= @minutesFromSoPostToCrPost
/*Match at least one language tag from CR->SO per post
Note: We use `select distinct` on the query against #CrossPosts
due to SEDE timing out if attempting to do it during this query.*/
and CrTags.TagName = SoTags.TagName
and exists (
select 1 from #LanguageTags as Langs
where CrTags.TagName = Langs.TagName
)
/*Apply filter based on character count difference of the body of both questions.*/
and abs(len(CrPosts.Body) - len(SoPosts.Body)) <= @maximumCharacterCountDifferenceAllowed
;
/*Use this query to view full result set, or modify it
according to your needs to aggregate from the #CrossPosts table.*/
select distinct *
from #CrossPosts
order by [SO Created] desc;
1 Answer 1
Style
I like your style. I don't mind reading keywords in lowercase. I think it's easier on the eyes, but then again, I'm fairly used to case-sensitive languages with lowercase keywords and syntax highlighting.
I'm also very fond of the prefix column aliases as opposed to the
postfix AS Alias
style that is mainstream.
Bug if I would have to nitpick: the hanging comma in the create table
statement messes with the alignment of the column names. There :)
Filter-join on lookup table
The joins on both Tags
tables is very deeply nested. Which means,
that for each and every Post
, SQL will join with each and every
PostTag
(which is unavoidable, because they are needed in an
unfortunate m-to-n match-and-filter), and then for each of those, it
will join with the Tags
table.
The Tags
table is actually not needed in this join. We can
pre-lookup all the tags that we want to use, by adding their
respective TagId
values to the #LanguageTags
temp table as
follows:
create table #LanguageTags (
TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
, CrTagId int
, SoTagId int
, constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageTags (TagName)
values
('applescript'),
('asp.net-mvc-3'),
('bash'),
('brainfuck'),
('c'),
('c#'),
('c++'),
/*SNIP...*/
('sql'),
('swift'),
('wolfram-mathematica'),
('xslt');
go
update Langs
set CrTagId = CrTags.Id
, SoTagId = SoTags.Id
from #LanguageTags Langs
inner join [StackExchange.CodeReview].dbo.Tags as CrTags
on CrTags.TagName = Langs.TagName
inner join [StackOverflow].dbo.Tags as SoTags
on SoTags.TagName = Langs.TagName;
Now we can use this table to filter tags that we want to see by
TagId
, and we can also join the posts from both sites to a single
table now.
SEDE and cross database queries
It seems that SEDE is not really optimized to join the large database
tables to temp tables. So when we put the join described above inside
the fetching query, we get a timeout. The solution: store the TagId
values in the temp table:
select
-- SNIP
, [Tags] = CrPosts.Tags
, [CrTagId] = CrPT.TagId
, [SoTagId] = SoPT.TagId
and then filter-and-distinct on the #LanguageTags
table in the final
query:
select distinct
CP.[Primary Stack]
, CP.[Primary User]
, CP.[SO Original]
, CP.[CR Xpost]
, CP.[CharCountDiff]
, CP.[SO Score]
, CP.[CR Score]
, CP.[SO Status]
, CP.[CR Status]
, CP.[DugaComments?]
, CP.[SO Answers]
, CP.[CR Answers]
, CP.[SO Accept?]
, CP.[CR Accept?]
, CP.[SO Created]
, CP.[Minutes to Xpost]
, CP.[Tags]
, CP.[CrTagId]
, CP.[SoTagId]
from #CrossPosts CP
/*Match at least one language tag from CR->SO per post.*/
inner join #LanguageTags as Langs
on Langs.SoTagId = CP.SoTagId
and Langs.CrTagId = CP.CrTagId
order by [SO Created] desc;
Turns out that this is quite a bit faster, as the query now returns
1130 rows returned in 4166 ms