I'm accumulating results for multiple contingency tables, one for each Test. As an example:
Contingency Table for TestA (one for each test, and for each word)
EventsWithWord EventsWithoutWord
TestA passed n11 n10
TestA failed n01 n00
For analysis, I need the number of events matching each of the following categories (this is how R expects the data results):
- TestA passed for each word:
n11
- TestA results available for each word:
n_1
(sum ofn11
andn01
) - TestA passed:
n1_
(sum ofn11
andn10
) - TestA results available:
n__
Table "TestResults": columns: CompanyID nvarchar(20) NOT NULL, TestName nvarchar(255) NOT NULL, CompanyResults bit NULL
Table "Data" (one row for each event): Columns: EventID int NOT NULL, CompanyID nvarchar(20) NOT NULL
Table "WordEventMap": Columns: EventID int NOT NULL, Word nvarchar(255) NOT NULL
Is there a way of combining some of the queries (e.g., with windowing functions)? Some of these tables are very large--40 million rows currently in WordEventMap and 1.3 million rows in Data, and I expect the results table will have about 4 million rows (40 different tests, 100,000 different words). The current execution plan does not consolidate any of the Common Table Expressions.
Current query:
WITH CTE1
AS (SELECT
COUNT(DISTINCT Data.EventID) AS n__,
TestResults.TestName
FROM TestResults
INNER JOIN Data
ON TestResults.CompanyID = Data.CompanyID
WHERE TestResults.CompanyResults IS NOT NULL
GROUP BY TestResults.TestName),
CTE2
AS (SELECT
COUNT(DISTINCT Data.[EventID]) AS n11,
TestResults.TestName,
WordEventMap.Word
FROM TestResults
INNER JOIN Data
ON TestResults.CompanyID = Data.CompanyID
INNER JOIN WordEventMap
ON Data.[EventID] = WordEventMap.EventID
WHERE TestResults.CompanyResults = 1
GROUP BY TestResults.TestName,
WordEventMap.Word),
CTE3
AS (SELECT
COUNT(DISTINCT Data.[EventID]) AS n1_,
TestResults.TestName
FROM TestResults
INNER JOIN Data
ON TestResults.CompanyID = Data.CompanyID
WHERE TestResults.CompanyResults = 1
GROUP BY TestResults.TestName),
CTE4
AS (SELECT
COUNT(DISTINCT Data.[EventID]) AS n_1,
TestResults.TestName,
WordEventMap.Word
FROM TestResults
INNER JOIN Data
ON TestResults.CompanyID = Data.CompanyID
INNER JOIN WordEventMap
ON Data.[EventID] = WordEventMap.EventID
WHERE TestResults.CompanyResults IS NOT NULL
GROUP BY TestResults.TestName,
WordEventMap.Word)
SELECT
CTE2.TestName,
CTE2.Word,
n11,
n1_,
n_1,
n__
FROM CTE2
INNER JOIN CTE4
ON CTE2.Word = CTE4.Word
AND CTE2.TestName = CTE4.TestName
INNER JOIN CTE1
ON CTE2.TestName = CTE1.TestName
INNER JOIN CTE3
ON CTE2.TestName = CTE3.TestName
Here's a SQL fiddle using the above structure and the below data. The above query is saved as the View [Results]. http://sqlfiddle.com/#!6/a8155e/1
Sample data:
Table TestResults
CompanyA TestA 1
CompanyA TestB 0
CompanyB TestA NULL
CompanyB TestB 1
CompanyC TestA 0
CompanyC TestB 1
Table Data
1 CompanyA
2 CompanyA
3 CompanyB
4 CompanyC
5 CompanyB
Table WordEventMap
1 airplane
1 tightrope
1 eggplant
2 eggplant
2 aardvark
2 eggbeater
3 airplane
3 aardvark
3 spaghetti
4 airplane
4 eggplant
4 wikipedia
5 eggplant
5 eggbeater
5 tightrope
5 licorice
Results Set
TestA aardvark 1 2 1 3
TestB aardvark 1 3 2 5
TestA airplane 1 2 2 3
TestB airplane 2 3 3 5
TestA eggbeater 1 2 1 3
TestA eggplant 1 2 2 3
TestB eggplant 2 3 3 5
TestB licorice 1 3 1 5
TestB spaghetti 1 3 1 5
TestA tightrope 1 2 1 3
TestB tightrope 1 3 2 5
TestB wikipedia 1 3 1 5
-
I would not expect the execution plan to ever "consolidate" CTEs. Primer: #BackToBasics : Common Table Expressions (CTEs). Can you provide a few rows of sample data (preferably at least two rows per / that fall into each CTE), desired results, and tag with the version of SQL Server you're using?Aaron Bertrand– Aaron Bertrand2017年07月18日 21:50:30 +00:00Commented Jul 18, 2017 at 21:50
-
By consolidate, I mean that a CTE is not a stand-alone query, and sql server can produce an execution plan in which the CTE is not its own branch. See scarydba.com/2015/01/14/common-table-expressions-not-tables . I'll work on getting the clarifying information you're asking for and will add it to the question's text.rsjaffe– rsjaffe2017年07月18日 23:12:29 +00:00Commented Jul 18, 2017 at 23:12
1 Answer 1
When you think that optimizer should be able to combine two CTEs into one why don't you write a single CTE, human beings are still smarter than optimizers :-)
As there's only a different WHERE-condition the CTEs 2&4 and 1&3 can be combined using conditional aggregation:
WITH CTE1
AS (SELECT
COUNT(DISTINCT Data.EventID) AS n__,
COUNT(DISTINCT CASE WHEN TestResults.CompanyResults = 1 THEN Data.[EventID] END) AS n1_,
TestResults.TestName
FROM TestResults
INNER JOIN Data
ON TestResults.CompanyID = Data.CompanyID
WHERE TestResults.CompanyResults IS NOT NULL
GROUP BY TestResults.TestName),
CTE2
AS (SELECT
COUNT(DISTINCT Data.[EventID]) AS n_1,
COUNT(DISTINCT CASE WHEN TestResults.CompanyResults = 1 THEN Data.[EventID] END) AS n11,
TestResults.TestName,
WordEventMap.Word
FROM TestResults
INNER JOIN Data
ON TestResults.CompanyID = Data.CompanyID
INNER JOIN WordEventMap
ON Data.[EventID] = WordEventMap.EventID
WHERE TestResults.CompanyResults IS NOT NULL
GROUP BY TestResults.TestName,
WordEventMap.Word)
SELECT
CTE2.TestName,
CTE2.Word,
n11,
n1_,
n_1,
n__
FROM CTE2
INNER JOIN CTE1
ON CTE2.TestName = CTE1.TestName
WHERE n11 > 0
;
But the main performance problem are those DISTINCTs, based on uniqueness in your actual data you might be able to remove some (for your example data no DISTINCT is needed).
-
Thanks. Removing the DISTINCT cut the execution time in half. Using your combination of CTEs cut the time by 2/3s.rsjaffe– rsjaffe2017年07月19日 15:29:20 +00:00Commented Jul 19, 2017 at 15:29
Explore related questions
See similar questions with these tags.