Reduce number of rows with same id repeated based on status

Question 1

I have the following table:

-- id, location_id, status, posted_year, posted_quarter
CREATE TABLE foo AS
SELECT * FROM ( VALUES
(1 ,12,'active' ,2014,3), 
(2 ,12,'inactive',2014,3),
(3 ,12,'active' ,2014,3),
(4 ,12,'active' ,2014,4),
(5 ,12,'inactive',2014,4),
(6 ,13,'active' ,2015,1),
(7 ,13,'active' ,2015,1),
(8 ,13,'inactive',2015,1),
(9 ,13,'active' ,2015,2),
(10,13,'active' ,2015,2),
(11,13,'inactive',2015,3),
(12,13,'active' ,2015,4),
(13,13,'active' ,2015,4),
(14,13,'inactive',2015,4),
(15,12,'active' ,2015,1),
(16,13,'active' ,2015,1),
(17,12,'inactive',2015,1),
(18,12,'active' ,2015,2)
) AS t(id,location_id,status,posted_year,posted_quarter);

I want to recreate this table so that it has only one row per year and quarter for each location.

There may be more than one record for the same location, year, and quarter. In that case the status of the single remaining row is decided as follows:

If at least one record for that year and quarter is active, the status is active; otherwise the status is inactive.

Examples:

location_id 12 for year 2014 and quarter 3 will have one record in the new table with status active.
location_id 12 for year 2015 and quarter 1 will have one record in the new table with status inactive.

How to write this query?

Comment 1

Is this table UNIQUE(location_id, id, posted_quarter)?
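
One way to answer a question like this is to check the data directly. A minimal sketch, assuming the table foo from the question has been created: it lists every (location_id, posted_year, posted_quarter) combination that occurs more than once, leaving id out of the check. An empty result would mean the combination is already unique. The alias n_rows is only illustrative.

```sql
-- List combinations that appear in more than one row.
SELECT location_id, posted_year, posted_quarter, count(*) AS n_rows
FROM   foo
GROUP  BY location_id, posted_year, posted_quarter
HAVING count(*) > 1
ORDER  BY location_id, posted_year, posted_quarter;
```

With the sample data the result is not empty, so the combination repeats and the rows have to be aggregated.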

Comment 2

I updated the table as requested. It now has a unique id, while location_id and posted_quarter repeat.

Comment 3

@Eyla, as another general suggestion: inactive/active should probably be a boolean column, is_active (if those are the only two states).
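
A sketch of that suggestion, assuming the table foo from the question and that 'active' and 'inactive' really are the only states. The column name is_active comes from the comment; adding it next to status, rather than replacing status, is just a cautious choice for illustration.

```sql
-- Add a boolean flag alongside the existing text column.
ALTER TABLE foo ADD COLUMN is_active boolean;

-- Anything other than 'active' becomes false; a NULL status stays NULL.
UPDATE foo SET is_active = (status = 'active');

-- Once every row is filled in, the flag can be made mandatory.
ALTER TABLE foo ALTER COLUMN is_active SET NOT NULL;
```

With a boolean column, the per-quarter rule reduces to bool_or(is_active) in an aggregate query.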

Answer 1

The only trick here is to create something you can group by that combines the year and the quarter. This isn't the only way to do it, but you can do this:

make_timestamp(posted_year,1,1,0,0,0)::date
+ posted_quarter*3*'1 month'::interval

Remember, there are three months in a quarter, so this maps each year and quarter to the first day after that quarter ends, which is unique per quarter. You could also use posted_year + posted_quarter*0.25; it works just as well.
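
To see what such a grouping key looks like, here is a small sketch against the table foo from the question. The aliases qtr_start and qtr_key are illustrative. Note that qtr_start uses make_date and (posted_quarter - 1), so it is the first day of the quarter itself: a variation on the expression above, not the same value.

```sql
-- One row per distinct year/quarter, with two possible grouping keys.
SELECT DISTINCT
       posted_year
     , posted_quarter
     , (make_date(posted_year, 1, 1)
        + (posted_quarter - 1) * 3 * interval '1 month')::date AS qtr_start
     , posted_year + posted_quarter * 0.25                      AS qtr_key
FROM   foo
ORDER  BY posted_year, posted_quarter;
```

For 2014 Q3, for example, qtr_start is 2014-07-01 and qtr_key is 2014.75. Either value identifies the quarter uniquely, which is all the grouping needs.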

From there, it's pretty basic. I'll use a CTE to separate formatting from calculation (feel free to ditch it for speed).

`DISTINCT ON()`

WITH t AS (
 SELECT (make_timestamp(posted_year,1,1,0,0,0)::date + posted_quarter*3*'1 month'::interval)::date AS qtr, *
 FROM foo
)
SELECT DISTINCT ON ( location_id, qtr ) location_id, qtr, status
FROM t
ORDER BY location_id, qtr, status='active' DESC;

`GROUP BY` ... `bool_or()`

Or, alternatively (and maybe faster),

WITH t AS (
 SELECT (make_timestamp(posted_year,1,1,0,0,0)::date + posted_quarter*3*'1 month'::interval)::date AS qtr, *
 FROM foo
)
SELECT location_id,
 qtr,
 CASE WHEN bool_or(status='active') THEN 'active' ELSE 'inactive' END AS status
FROM t
GROUP BY location_id, qtr
ORDER BY location_id, qtr;
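
If the combined date key is not needed in the output, the same rule can be applied by grouping on the two columns directly, which also drops the CTE. A minimal sketch against the table foo from the question; the output alias status is added for readability.

```sql
-- Group on the raw year and quarter columns; no derived key required.
SELECT location_id
     , posted_year
     , posted_quarter
     , CASE WHEN bool_or(status = 'active') THEN 'active' ELSE 'inactive' END AS status
FROM   foo
GROUP  BY location_id, posted_year, posted_quarter
ORDER  BY location_id, posted_year, posted_quarter;
```

With the sample data this yields eight rows, all active except location 13 in 2015, quarter 3.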

Answer 2

Assuming the id column is just a meaningless serial column, all you need is a simple aggregation.

Create the table first, with an actual serial column to auto-assign defaults.

CREATE TABLE foo (
 foo_id serial -- serial!
 , location_id int NOT NULL -- REFERENCES locations(location_id)
 , posted_year int NOT NULL -- might be smallint
 , posted_quarter int NOT NULL -- might be smallint
 , status_active boolean NOT NULL -- boolean!
);

Then insert aggregated data, without id:

INSERT INTO foo(location_id, posted_year, posted_quarter, status_active)
SELECT location_id
 , posted_year
 , posted_quarter
 , CASE WHEN min(status) = 'active' THEN true ELSE false END
FROM (
 VALUES
 (1 ,12,'active' ,2014,3), 
 (2 ,12,'inactive',2014,3),
 (3 ,12,'active' ,2014,3),
 (4 ,12,'active' ,2014,4),
 (5 ,12,'inactive',2014,4),
 (6 ,13,'active' ,2015,1),
 (7 ,13,'active' ,2015,1),
 (8 ,13,'inactive',2015,1),
 (9 ,13,'active' ,2015,2),
 (10,13,'active' ,2015,2),
 (11,13,'inactive',2015,3),
 (12,13,'active' ,2015,4),
 (13,13,'active' ,2015,4),
 (14,13,'inactive',2015,4),
 (15,12,'active' ,2015,1),
 (16,13,'active' ,2015,1),
 (17,12,'inactive',2015,1),
 (18,12,'active' ,2015,2)
 ) t(id, location_id, status, posted_year, posted_quarter)
GROUP BY posted_year, posted_quarter, location_id
ORDER BY posted_year, posted_quarter, location_id;

ALTER TABLE foo
 ADD PRIMARY KEY (foo_id)
, ADD UNIQUE (location_id, posted_year, posted_quarter);

Result:

 foo_id | location_id | posted_year | posted_quarter | status_active
--------+-------------+-------------+----------------+---------------
      1 |          12 |        2014 |              3 | t
      2 |          12 |        2014 |              4 | t
      3 |          12 |        2015 |              1 | t
      4 |          13 |        2015 |              1 | t
      5 |          12 |        2015 |              2 | t
      6 |          13 |        2015 |              2 | t
      7 |          13 |        2015 |              3 | f
      8 |          13 |        2015 |              4 | t

Since 'active' sorts before 'inactive', min(status) will return 'active' if any row in the group is active. Convert to boolean right away to fit the boolean column in the table.
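
The sort-order argument can be checked on a tiny inline group. This sketch also shows bool_or(), which gives the same answer without relying on how the status strings sort. The aliases g, min_status, via_min and any_active are illustrative.

```sql
-- A group with one active and two inactive rows.
SELECT min(status)                AS min_status   -- 'active'
     , min(status) = 'active'     AS via_min      -- true
     , bool_or(status = 'active') AS any_active   -- true
FROM  (VALUES ('inactive'), ('active'), ('inactive')) AS g(status);
```

The min() shortcut depends on 'active' sorting before every other status value. If a status such as 'abandoned' (a hypothetical value) were ever added, min(status) = 'active' would stop detecting active rows in mixed groups, while bool_or(status = 'active') would keep working.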

The added UNIQUE constraint prevents duplicates in the future. Alternatively, you might make (location_id, posted_year, posted_quarter) a multi-column PRIMARY KEY and drop foo_id altogether. That's a matter of taste and other requirements.

Either way, it's cheaper to add these constraints after you fill the table.
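
The multi-column primary key variant mentioned above might look like the following sketch. The table name foo_natural is hypothetical, the inline VALUES list is a three-row subset of the sample data so the block runs on its own, and bool_or() is used in place of the min() comparison.

```sql
-- Variant without a surrogate key; foo_natural is a hypothetical name.
CREATE TABLE foo_natural (
   location_id    int     NOT NULL
 , posted_year    int     NOT NULL
 , posted_quarter int     NOT NULL
 , status_active  boolean NOT NULL
);

-- Fill the table first ...
INSERT INTO foo_natural (location_id, posted_year, posted_quarter, status_active)
SELECT location_id, posted_year, posted_quarter, bool_or(status = 'active')
FROM  (
   VALUES
     (1 , 12, 'active'  , 2014, 3)
   , (2 , 12, 'inactive', 2014, 3)
   , (11, 13, 'inactive', 2015, 3)
   ) t(id, location_id, status, posted_year, posted_quarter)
GROUP  BY location_id, posted_year, posted_quarter;

-- ... then add the key, which also enforces uniqueness from here on.
ALTER TABLE foo_natural
   ADD PRIMARY KEY (location_id, posted_year, posted_quarter);
```

After this, inserting a second row for the same location, year, and quarter fails with a unique violation.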

The accepted answer (the `DISTINCT ON()` / `GROUP BY` ... `bool_or()` answer above) was posted by Evan Carroll on 2016-12-16 22:55:55Z.