I would like to do a 'group by' on the following example table
Input: (example table posted as a screenshot)
Output: (desired result posted as a screenshot), where:
- [Low] = Min of [Low] column
- [High] = Max of [High] column
- [Symbol]
- [Epoch]
- [CuVol]
- [Dt10Min]
- [Close]
- [Open]
Columns 3-7 take their values from the row with the highest Epoch in the group (129 in this case), and column 8 takes the value of [Open] at the lowest Epoch of the [Dt10Min] group (127 in this case).
I am very new to SQL, so any suggestions would be appreciated.
- What version of SQL Server do you use? Please add a tag to the question. (Vladimir Baranov, Aug 28, 2017)
3 Answers
I'll borrow Joe's create table/insert example and, assuming Symbol is the effective PK for the final result, I've thrown in a second set of data for Symbol = 'B':
create table #input
(Symbol varchar(1) NOT NULL
,Epoch int NOT NULL
,[Open] numeric(9, 1) NOT NULL
,[Close] numeric(9, 1) NOT NULL
,High numeric(9, 1) NOT NULL
,Low numeric(9, 1) NOT NULL
,CuVol int NOT NULL
,Dt10Min datetime NOT NULL
);
insert into #input
values
('A', 127, 23.6, 24, 23.9, 22.8, 1600, '20100822'),
('A', 128, 24.6, 24.1, 24.8, 23.6, 3200, '20100822'),
('A', 129, 23.7, 24.6, 23.9, 23.5, 4800, '20100822'),
('B', 227, 33.6, 34, 33.9, 32.8, 1605, '20100821'),
('B', 228, 34.6, 34.1, 34.8, 33.6, 3205, '20100822'),
('B', 229, 33.7, 34.6, 33.9, 33.5, 4805, '20100823');
We'll create a derived table of the max()/min() values (grouped by Symbol), then use the derived table's min(Epoch)/max(Epoch) values to perform two joins back to the raw data to generate our final result set:
select dt.Symbol,
dt.highEpoch as Epoch,
lowE.[Open] as [Open],
highE.[Close] as [Close],
dt.maxHigh as High,
dt.minLow as Low,
highE.CuVol as CuVol,
highE.Dt10Min as Dt10Min
from
(select Symbol,
min(Epoch) as lowEpoch,
max(Epoch) as highEpoch,
max(High) as maxHigh,
min(Low) as minLow
from #input
group by Symbol) dt
join #input lowE
on lowE.Symbol = dt.Symbol
and lowE.Epoch = dt.lowEpoch
join #input highE
on highE.Symbol = dt.Symbol
and highE.Epoch = dt.highEpoch
order by dt.Symbol;
Symbol | Epoch | Open | Close | High | Low | CuVol | Dt10Min
------ | ----- | ---- | ----- | ---- | ---- | ----- | -------------------
A | 129 | 23.6 | 24.6 | 24.8 | 22.8 | 4800 | 22/08/2010 00:00:00
B | 229 | 33.6 | 34.6 | 34.8 | 32.8 | 4805 | 23/08/2010 00:00:00
Here's a dbfiddle for the above.
- Thanks markp, this definitely helps. With 300 million records (26GB disk space) and 40GB allocated to SQL Server, the process did not complete. I think I should process it using python/spark. (vkb, Aug 27, 2017)
- Unless you plan on loading 26GB of data into your python/spark app, you may want to spend some time checking for tuning opportunities with the proposed query (e.g., an index on (Symbol, Epoch) should speed up the joins to the lowE and highE tables; if you have multiple engines, aka free cpu cycles, run multiple copies of the above against subsets of the main table, i.e. parallelize the operation). (markp-fuso, Aug 27, 2017)
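A minimal sketch of the supporting index suggested in that comment (the index name is hypothetical):
-- Hypothetical index covering the join columns used by the lowE/highE joins
CREATE NONCLUSTERED INDEX IX_input_Symbol_Epoch
 ON #input (Symbol, Epoch);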
Let's start by putting your data into a temp table. For future questions you'll want to do this in the question itself so people who want to help you don't have to copy it by hand from a screenshot. I made some guesses about data types:
CREATE TABLE #input (
Symbol VARCHAR(1) NOT NULL,
Epoch INT NOT NULL,
[Open] NUMERIC(9, 1) NOT NULL,
[Close] NUMERIC(9, 1) NOT NULL,
[High] NUMERIC(9, 1) NOT NULL,
[Low] NUMERIC(9, 1) NOT NULL,
CuVol INT NOT NULL,
Dt10Min DATETIME NOT NULL
);
INSERT INTO #input
VALUES
('A', 127, 23.6, 24, 23.9, 22.8, 1600, '20100822')
, ('A', 128, 24.6, 24.1, 24.8, 23.6, 3200, '20100822')
, ('A', 129, 23.7, 24.6, 23.9, 23.5, 4800, '20100822');
Now let's write separate queries for all of the information that you want and combine them at the end. Getting the minimum value from a table is very straightforward:
SELECT MIN([Low])
FROM #input;
As is getting the maximum value from a table:
SELECT MAX([High])
FROM #input;
Carrying along column values based on the minimum or maximum of another column is more complex. Around here we call that a "greatest n per group" problem. If you just need a single row then you can use the TOP operator. SQL Server will do a TOP 1 sort, which requires only an extremely low memory grant. This can be helpful if your table is very large. The query below returns the row with the lowest Epoch:
SELECT TOP 1 [Open]
FROM #input
ORDER BY Epoch ASC;
We can write a very similar query to return data associated with the largest Epoch. It's also easy to select as many columns as needed:
SELECT TOP 1 Symbol, [Epoch], [CuVol], [Dt10Min], [Close]
FROM #input
ORDER BY Epoch DESC;
Now we need to combine all of the queries together to get a single result set. Each query is guaranteed to return a single row, so we might as well use CROSS JOIN. Here's the complete query:
SELECT
highest_epoch.symbol
, highest_epoch.Epoch
, lowest_epoch.[Open]
, highest_epoch.[Close]
, min_max.High
, min_max.Low
, highest_epoch.CuVol
, highest_epoch.Dt10Min
FROM
(
SELECT
MIN(i.[Low]) [Low]
, MAX(i.[High]) [High]
FROM #input i
) min_max
CROSS JOIN
(
SELECT TOP 1 Symbol, [Epoch], [CuVol], [Dt10Min], [Close]
FROM #input
ORDER BY Epoch DESC
) highest_epoch
CROSS JOIN
(
SELECT TOP 1 [Open]
FROM #input
ORDER BY Epoch ASC
) lowest_epoch;
This query returns the results that you're looking for:
╔════════╦═══════╦══════╦═══════╦══════╦══════╦═══════╦═════════════════════════╗
║ symbol ║ Epoch ║ Open ║ Close ║ High ║ Low ║ CuVol ║ Dt10Min ║
╠════════╬═══════╬══════╬═══════╬══════╬══════╬═══════╬═════════════════════════╣
║ A ║ 129 ║ 23.6 ║ 24.6 ║ 24.8 ║ 22.8 ║ 4800 ║ 2010年08月22日 00:00:00.000 ║
╚════════╩═══════╩══════╩═══════╩══════╩══════╩═══════╩═════════════════════════╝
Without any indexes, it'll do three scans of the table but it shouldn't require sorting. That could be okay depending on the size of the table and your response time requirements. However, if you need to speed the query up you could consider adding covering indexes for each derived table, or you could use a more advanced technique to do the "greatest n per group" calculations. For example, it's possible to use GROUP BY to carry along extra columns as I showed in this answer.
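As a rough sketch of that GROUP BY trick (not necessarily the exact approach from the linked answer), one common variant packs the ordering key and a payload column into a single binary value, aggregates, and then unpacks. The example below assumes Epoch and CuVol are non-negative ints, so that big-endian byte order matches numeric order:
SELECT
 Symbol
 , MAX(Epoch) AS Epoch
 -- pack (Epoch, CuVol); MAX() orders on the Epoch bytes first,
 -- then the trailing 4 bytes are unpacked to recover CuVol at the max Epoch
 , CAST(SUBSTRING(MAX(CAST(Epoch AS binary(4)) + CAST(CuVol AS binary(4))), 5, 4) AS int) AS CuVolAtMaxEpoch
FROM #input
GROUP BY Symbol;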
- Thank you Joe, this solution is great and you wonderfully explained it. However, my table has 300 million records and SQL Server says out of memory even with 40GB of RAM allocated to it. (vkb, Aug 27, 2017)
- @vkb I posted a better solution which shouldn't require anything more than a trivial memory grant. Give that a try. (Joe Obbish, Aug 27, 2017)
Overall, this problem is a top-n-per-group problem with a slight twist: you need to find both the first and the last values in each group.
If you use SQL Server 2012 or later, one way to do it is to use the ROW_NUMBER and FIRST_VALUE / LAST_VALUE functions.
Sample data (thank you @markp)
create table #input
(Symbol varchar(1) NOT NULL
,Epoch int NOT NULL
,[Open] numeric(9, 1) NOT NULL
,[Close] numeric(9, 1) NOT NULL
,High numeric(9, 1) NOT NULL
,Low numeric(9, 1) NOT NULL
,CuVol int NOT NULL
,Dt10Min datetime NOT NULL
);
insert into #input values
('A', 127, 23.6, 24, 23.9, 22.8, 1600, '20100822'),
('A', 128, 24.6, 24.1, 24.8, 23.6, 3200, '20100822'),
('A', 129, 23.7, 24.6, 23.9, 23.5, 4800, '20100822'),
('B', 227, 33.6, 34, 33.9, 32.8, 1605, '20100821'),
('B', 228, 34.6, 34.1, 34.8, 33.6, 3205, '20100822'),
('B', 229, 33.7, 34.6, 33.9, 33.5, 4805, '20100823');
Query
WITH
CTE
AS
(
SELECT
Symbol
,Epoch
,[Open]
,[Close]
,High
,Low
,CuVol
,Dt10Min
,MIN(Low) OVER (PARTITION BY Symbol) AS MinLow
,MAX(High) OVER (PARTITION BY Symbol) AS MaxHigh
,LAST_VALUE([Open])
OVER (PARTITION BY Symbol ORDER BY Epoch DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS FirstOpen
,ROW_NUMBER() OVER (PARTITION BY Symbol ORDER BY Epoch DESC) AS rnDesc
FROM
#input
)
SELECT
Symbol
,Epoch
,FirstOpen
,[Close]
,MaxHigh
,MinLow
,CuVol
,Dt10Min
FROM CTE
WHERE rnDesc = 1
;
Result
+--------+-------+-----------+-------+---------+--------+-------+-------------------------+
| Symbol | Epoch | FirstOpen | Close | MaxHigh | MinLow | CuVol | Dt10Min |
+--------+-------+-----------+-------+---------+--------+-------+-------------------------+
| A | 129 | 23.6 | 24.6 | 24.8 | 22.8 | 4800 | 2010年08月22日 00:00:00.000 |
| B | 229 | 33.6 | 34.6 | 34.8 | 32.8 | 4805 | 2010年08月23日 00:00:00.000 |
+--------+-------+-----------+-------+---------+--------+-------+-------------------------+
Without proper indexes, the engine would have to sort the table, which could be expensive.
Ideally, create a clustered index, like so:
CREATE CLUSTERED INDEX IX ON #input
(
Symbol ASC
,Epoch DESC
);
If you already have a clustered index, then add a non-clustered index and INCLUDE all other columns in it, like so:
CREATE NONCLUSTERED INDEX IX ON #input
(
Symbol ASC
,Epoch DESC
)
INCLUDE ([Open], [Close], High, Low, CuVol, Dt10Min)
;
If including all columns in an index sounds like too much, try adding an index on just the two columns (Symbol ASC, Epoch DESC), but the extra look-ups may be quite expensive.
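For reference, a minimal sketch of that narrower two-column index (the name IX_narrow is illustrative):
-- Hypothetical key-only index; queries needing other columns will pay
-- for key look-ups against the clustered index.
CREATE NONCLUSTERED INDEX IX_narrow ON #input
(
 Symbol ASC
 ,Epoch DESC
);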
I used LAST_VALUE([Open]) OVER (... ORDER BY Epoch DESC ...) instead of the more intuitive FIRST_VALUE([Open]) OVER (... ORDER BY Epoch ASC ...) to match the index definition. If you use FIRST_VALUE ... ORDER BY Epoch ASC, then the engine would do an extra sort even with the index (or two sorts without an index).
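To see the difference, here's a minimal standalone sketch of the more intuitive FIRST_VALUE variant; with the descending index above, its plan can be expected to contain an extra Sort operator:
SELECT DISTINCT
 Symbol
 -- the default window frame suffices here, since FIRST_VALUE only reads
 -- from UNBOUNDED PRECEDING up to the current row
 ,FIRST_VALUE([Open]) OVER (PARTITION BY Symbol ORDER BY Epoch ASC) AS FirstOpen
FROM #input;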