I would like to do a 'group by' on the following example table
Input: (example table posted as a screenshot)
Output: (desired result posted as a screenshot), where:
- [Low] = Min of [Low] column
- [High] = Max of [High] column
- [Symbol]
- [Epoch]
- [CuVol]
- [Dt10Min]
- [Close]
- [Open]
Columns 3-7 take their values from the row with the highest Epoch in the group (129 in this case), and column 8 takes the value of [Open] at the lowest Epoch of the [Dt10Min] group (127 in this case).
I am very new to SQL, so any suggestions would be appreciated.
- What version of SQL Server do you use? Please add a tag to the question. (Vladimir Baranov, Aug 28, 2017)
3 Answers
I'll borrow Joe's create table/insert example and, assuming Symbol is the effective PK for the final result, I've thrown in a second set of data for Symbol = 'B':
create table #input
(Symbol varchar(1) NOT NULL
,Epoch int NOT NULL
,[Open] numeric(9, 1) NOT NULL
,[Close] numeric(9, 1) NOT NULL
,High numeric(9, 1) NOT NULL
,Low numeric(9, 1) NOT NULL
,CuVol int NOT NULL
,Dt10Min datetime NOT NULL
);
insert into #input
values
('A', 127, 23.6, 24, 23.9, 22.8, 1600, '20100822'),
('A', 128, 24.6, 24.1, 24.8, 23.6, 3200, '20100822'),
('A', 129, 23.7, 24.6, 23.9, 23.5, 4800, '20100822'),
('B', 227, 33.6, 34, 33.9, 32.8, 1605, '20100821'),
('B', 228, 34.6, 34.1, 34.8, 33.6, 3205, '20100822'),
('B', 229, 33.7, 34.6, 33.9, 33.5, 4805, '20100823');
We'll create a derived table of the max()/min() values (grouped by Symbol), then use the derived table's min(Epoch)/max(Epoch) values to perform two joins back to the raw data to generate our final result set:
select dt.Symbol,
dt.highEpoch as Epoch,
lowE.[Open] as [Open],
highE.[Close] as [Close],
dt.maxHigh as High,
dt.minLow as Low,
highE.CuVol as CuVol,
highE.Dt10Min as Dt10Min
from
(select Symbol,
min(Epoch) as lowEpoch,
max(Epoch) as highEpoch,
max(High) as maxHigh,
min(Low) as minLow
from #input
group by Symbol) dt
join #input lowE
on lowE.Symbol = dt.Symbol
and lowE.Epoch = dt.lowEpoch
join #input highE
on highE.Symbol = dt.Symbol
and highE.Epoch = dt.highEpoch
order by dt.Symbol;
Symbol | Epoch | Open | Close | High | Low | CuVol | Dt10Min
------ | ----- | ---- | ----- | ---- | ---- | ----- | -------------------
A | 129 | 23.6 | 24.6 | 24.8 | 22.8 | 4800 | 22/08/2010 00:00:00
B | 229 | 33.6 | 34.6 | 34.8 | 32.8 | 4805 | 23/08/2010 00:00:00
Here's a dbfiddle for the above.
- Thanks markp, this definitely helps. With 300 million records (26GB disk space) and 40GB allocated to SQL Server, the process did not complete. I think I should process it using python/spark. (vkb, Aug 27, 2017)
- Unless you plan on loading 26GB of data into your python/spark app, you may want to spend some time checking for tuning opportunities with the proposed query (e.g., an index on (Symbol, Epoch) should speed up the joins to the lowE and highE tables; if you have multiple engines, aka free cpu cycles, run multiple copies of the above against subsets of the main table, i.e. parallelize the operation). (markp-fuso, Aug 27, 2017)
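A minimal sketch of the supporting index suggested in that comment (the index name is hypothetical):
-- Hypothetical index covering the join columns used by the lowE/highE joins
CREATE NONCLUSTERED INDEX IX_input_Symbol_Epoch
 ON #input (Symbol, Epoch);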
Let's start by putting your data into a temp table. For future questions you'll want to do this in the question itself so people who want to help you don't have to copy it by hand from a screenshot. I made some guesses about data types:
CREATE TABLE #input (
Symbol VARCHAR(1) NOT NULL,
Epoch INT NOT NULL,
[Open] NUMERIC(9, 1) NOT NULL,
[Close] NUMERIC(9, 1) NOT NULL,
[High] NUMERIC(9, 1) NOT NULL,
[Low] NUMERIC(9, 1) NOT NULL,
CuVol INT NOT NULL,
Dt10Min DATETIME NOT NULL
);
INSERT INTO #input
VALUES
('A', 127, 23.6, 24, 23.9, 22.8, 1600, '20100822')
, ('A', 128, 24.6, 24.1, 24.8, 23.6, 3200, '20100822')
, ('A', 129, 23.7, 24.6, 23.9, 23.5, 4800, '20100822');
Now let's write separate queries for all of the information that you want and combine them at the end. Getting the minimum value from a table is very straightforward:
SELECT MIN([Low])
FROM #input;
As is getting the maximum value from a table:
SELECT MAX([High])
FROM #input;
Carrying along column values based on the minimum or maximum of another column is more complex. Around here we call that a "greatest n per group" problem. If you just need a single row then you can use the TOP operator. SQL Server will do a TOP 1 sort, which requires only an extremely low memory grant. This can be helpful if your table is very large. The query below returns the row with the lowest Epoch:
SELECT TOP 1 [Open]
FROM #input
ORDER BY Epoch ASC;
We can write a very similar query to return data associated with the largest Epoch. It's also easy to select as many columns as needed:
SELECT TOP 1 Symbol, [Epoch], [CuVol], [Dt10Min], [Close]
FROM #input
ORDER BY Epoch DESC;
Now we need to combine all of the queries together to get a single result set. Each query is guaranteed to return a single row, so we might as well use CROSS JOIN. Here's the complete query:
SELECT
highest_epoch.symbol
, highest_epoch.Epoch
, lowest_epoch.[Open]
, highest_epoch.[Close]
, min_max.High
, min_max.Low
, highest_epoch.CuVol
, highest_epoch.Dt10Min
FROM
(
SELECT
MIN(i.[Low]) [Low]
, MAX(i.[High]) [High]
FROM #input i
) min_max
CROSS JOIN
(
SELECT TOP 1 Symbol, [Epoch], [CuVol], [Dt10Min], [Close]
FROM #input
ORDER BY Epoch DESC
) highest_epoch
CROSS JOIN
(
SELECT TOP 1 [Open]
FROM #input
ORDER BY Epoch ASC
) lowest_epoch;
This query returns the results that you're looking for:
╔════════╦═══════╦══════╦═══════╦══════╦══════╦═══════╦═════════════════════════╗
║ symbol ║ Epoch ║ Open ║ Close ║ High ║ Low ║ CuVol ║ Dt10Min ║
╠════════╬═══════╬══════╬═══════╬══════╬══════╬═══════╬═════════════════════════╣
║ A ║ 129 ║ 23.6 ║ 24.6 ║ 24.8 ║ 22.8 ║ 4800 ║ 2010年08月22日 00:00:00.000 ║
╚════════╩═══════╩══════╩═══════╩══════╩══════╩═══════╩═════════════════════════╝
Without any indexes, it'll do three scans of the table but it shouldn't require sorting. That could be okay depending on the size of the table and your response time requirements. However, if you need to speed the query up you could consider adding covering indexes for each derived table, or you could use a more advanced technique to do the "greatest n per group" calculations. For example, it's possible to use GROUP BY to carry along extra columns as I showed in this answer.
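As a rough sketch of that GROUP BY trick (not necessarily the exact approach from the linked answer), one common variant packs the ordering key and a payload column into a single binary value, aggregates, and then unpacks. The example below assumes Epoch and CuVol are non-negative ints, so that big-endian byte order matches numeric order:
SELECT
 Symbol
 , MAX(Epoch) AS Epoch
 -- pack (Epoch, CuVol); MAX() orders on the Epoch bytes first,
 -- then the trailing 4 bytes are unpacked to recover CuVol at the max Epoch
 , CAST(SUBSTRING(MAX(CAST(Epoch AS binary(4)) + CAST(CuVol AS binary(4))), 5, 4) AS int) AS CuVolAtMaxEpoch
FROM #input
GROUP BY Symbol;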
- Thank you Joe, this solution is great and you wonderfully explained it. However, my table has 300 million records and SQL Server says out of memory even with 40GB of RAM allocated to it. (vkb, Aug 27, 2017)
- @vkb I posted a better solution which shouldn't require anything more than a trivial memory grant. Give that a try. (Joe Obbish, Aug 27, 2017)
Overall, this problem is a top-n-per-group problem with a slight twist: you need to find both the first and the last values in each group.
If you use SQL Server 2012 or later, one way to do it is to use the ROW_NUMBER and FIRST_VALUE / LAST_VALUE functions.
Sample data (thank you @markp)
create table #input
(Symbol varchar(1) NOT NULL
,Epoch int NOT NULL
,[Open] numeric(9, 1) NOT NULL
,[Close] numeric(9, 1) NOT NULL
,High numeric(9, 1) NOT NULL
,Low numeric(9, 1) NOT NULL
,CuVol int NOT NULL
,Dt10Min datetime NOT NULL
);
insert into #input values
('A', 127, 23.6, 24, 23.9, 22.8, 1600, '20100822'),
('A', 128, 24.6, 24.1, 24.8, 23.6, 3200, '20100822'),
('A', 129, 23.7, 24.6, 23.9, 23.5, 4800, '20100822'),
('B', 227, 33.6, 34, 33.9, 32.8, 1605, '20100821'),
('B', 228, 34.6, 34.1, 34.8, 33.6, 3205, '20100822'),
('B', 229, 33.7, 34.6, 33.9, 33.5, 4805, '20100823');
Query
WITH
CTE
AS
(
SELECT
Symbol
,Epoch
,[Open]
,[Close]
,High
,Low
,CuVol
,Dt10Min
,MIN(Low) OVER (PARTITION BY Symbol) AS MinLow
,MAX(High) OVER (PARTITION BY Symbol) AS MaxHigh
,LAST_VALUE([Open])
OVER (PARTITION BY Symbol ORDER BY Epoch DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS FirstOpen
,ROW_NUMBER() OVER (PARTITION BY Symbol ORDER BY Epoch DESC) AS rnDesc
FROM
#input
)
SELECT
Symbol
,Epoch
,FirstOpen
,[Close]
,MaxHigh
,MinLow
,CuVol
,Dt10Min
FROM CTE
WHERE rnDesc = 1
;
Result
+--------+-------+-----------+-------+---------+--------+-------+-------------------------+
| Symbol | Epoch | FirstOpen | Close | MaxHigh | MinLow | CuVol | Dt10Min |
+--------+-------+-----------+-------+---------+--------+-------+-------------------------+
| A | 129 | 23.6 | 24.6 | 24.8 | 22.8 | 4800 | 2010年08月22日 00:00:00.000 |
| B | 229 | 33.6 | 34.6 | 34.8 | 32.8 | 4805 | 2010年08月23日 00:00:00.000 |
+--------+-------+-----------+-------+---------+--------+-------+-------------------------+
Without proper indexes, the engine would have to sort the table, which could be expensive.
Ideally, create a clustered index, like so:
CREATE CLUSTERED INDEX IX ON #input
(
Symbol ASC
,Epoch DESC
);
If you already have a clustered index, then add a non-clustered index and INCLUDE all other columns in it, like so:
CREATE NONCLUSTERED INDEX IX ON #input
(
Symbol ASC
,Epoch DESC
)
INCLUDE ([Open], [Close], High, Low, CuVol, Dt10Min)
;
If including all columns in an index sounds like too much, try adding an index on just the two columns (Symbol ASC, Epoch DESC), but the extra look-ups may be quite expensive.
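For reference, a minimal sketch of that narrower two-column index (the name IX_narrow is illustrative):
-- Hypothetical key-only index; queries needing other columns will pay
-- for key look-ups against the clustered index.
CREATE NONCLUSTERED INDEX IX_narrow ON #input
(
 Symbol ASC
 ,Epoch DESC
);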
I used LAST_VALUE([Open]) OVER (... ORDER BY Epoch DESC ...) instead of the more intuitive FIRST_VALUE([Open]) OVER (... ORDER BY Epoch ASC ...) to match the index definition. If you use FIRST_VALUE ... ORDER BY Epoch ASC, then the engine would do an extra sort even with the index (or two sorts without an index).
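To see the difference, here's a minimal standalone sketch of the more intuitive FIRST_VALUE variant; with the descending index above, its plan can be expected to contain an extra Sort operator:
SELECT DISTINCT
 Symbol
 -- the default window frame suffices here, since FIRST_VALUE only reads
 -- from UNBOUNDED PRECEDING up to the current row
 ,FIRST_VALUE([Open]) OVER (PARTITION BY Symbol ORDER BY Epoch ASC) AS FirstOpen
FROM #input;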