I have legacy tables similar to the following:

employee

------------------------
| employee_id | name   |
------------------------
| 1           | David  |
| 2           | Mathew |
------------------------

payroll

------------------------
| employee_id | salary |
------------------------
| 2           | 200000 |
| 3           | 90000  |
------------------------

I want to get the following data, after joins and filters:

-------------------------------------------
| address_id | employee_id | address      |
-------------------------------------------
| 1          | 2           | street 1, NY |
| 2          | 2           | street 2, DC |
-------------------------------------------

I have the following query:

SELECT employee.employee_id, salary, address_arr 
FROM employee 
LEFT JOIN payroll on payroll.employee_id = employee.employee_id
INNER JOIN 
 (
 SELECT employee_id, ARRAY_AGG(address) as address_arr 
 FROM addresses 
 GROUP BY employee_id
 ) table_address ON table_address.employee_id = employee.employee_id
WHERE employee.employee_id < 1000000
LIMIT 100
OFFSET 0

The above query gives the desired output but is highly unoptimized, because the GROUP BY operation runs over the complete addresses table before its result is used in the JOIN of the outer query.

Kindly answer:

  1. How can we prevent the GROUP BY operation from running over the complete addresses table, by making use of the LIMIT and OFFSET of the outer query?
  2. Will the condition WHERE employee.employee_id < 1000000 be applied to the subquery before or after the GROUP BY operation in the inner query? If the condition is applied after the GROUP BY, how can we avoid that?

Note: There are multiple JOINs and subqueries in the actual query being used.

asked Feb 8, 2019 at 7:04
  • PS. LIMIT without ORDER BY gives you 100 arbitrary records from the whole data set. Do you really need that? Commented Feb 8, 2019 at 7:44
  • You should present us the simplest query you can that still has the issue. If you remove the left join on PAYROLL, does the problem go away? Commented Feb 9, 2019 at 15:18
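
To illustrate the first comment: a minimal sketch of the question's query with a deterministic row order, so that LIMIT/OFFSET pages are repeatable. Ordering by employee.employee_id is an assumption; any unique sort key would serve the same purpose.

```sql
-- Sketch only: the question's query plus an ORDER BY.
-- Assumption: paging by employee_id is acceptable for the use case.
SELECT employee.employee_id, payroll.salary, table_address.address_arr
FROM employee
LEFT JOIN payroll ON payroll.employee_id = employee.employee_id
INNER JOIN (
    SELECT employee_id, ARRAY_AGG(address) AS address_arr
    FROM addresses
    GROUP BY employee_id
) table_address ON table_address.employee_id = employee.employee_id
WHERE employee.employee_id < 1000000
ORDER BY employee.employee_id   -- makes each LIMIT/OFFSET page well defined
LIMIT 100
OFFSET 0;
```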

3 Answers

I am not sure if this is really more efficient, but you could try to join to a derived table that applies the limit.

select emp.employee_id, emp.salary, adr.address_arr 
from (
 SELECT employee.employee_id, payroll.salary
 FROM employee 
 LEFT JOIN payroll on payroll.employee_id = employee.employee_id
 WHERE employee.employee_id < 1000000
 LIMIT 100
 OFFSET 0
) as emp
 JOIN (
 SELECT a.employee_id, ARRAY_AGG(a.address) as address_arr 
 FROM addresses a
 GROUP BY employee_id
 ) as adr ON adr.employee_id = emp.employee_id;

The first derived table only selects 100 rows, and the join/group by should then only be done for those 100 employees.
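
Whether the aggregation really is restricted to those 100 employees can be checked by looking at the execution plan. A hedged sketch, assuming PostgreSQL and the table names from the question: if the aggregate node over addresses still reports row counts close to the size of the whole table, the limit was not pushed down.

```sql
-- Sketch only: inspect the plan of the derived-table variant.
-- EXPLAIN (ANALYZE) actually executes the query, so run it on test data first.
EXPLAIN (ANALYZE, BUFFERS)
SELECT emp.employee_id, emp.salary, adr.address_arr
FROM (
    SELECT employee.employee_id, payroll.salary
    FROM employee
    LEFT JOIN payroll ON payroll.employee_id = employee.employee_id
    WHERE employee.employee_id < 1000000
    LIMIT 100
    OFFSET 0
) AS emp
JOIN (
    SELECT a.employee_id, ARRAY_AGG(a.address) AS address_arr
    FROM addresses a
    GROUP BY a.employee_id
) AS adr ON adr.employee_id = emp.employee_id;
```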

If the optimizer doesn't push that down, you could try a lateral join instead to "force" a push down:

select emp.employee_id, emp.salary, adr.address_arr 
from (
 SELECT employee.employee_id, payroll.salary
 FROM employee 
 LEFT JOIN payroll on payroll.employee_id = employee.employee_id
 WHERE employee.employee_id < 1000000
 LIMIT 100
 OFFSET 0
) as emp
 JOIN LATERAL (
 SELECT a.employee_id, ARRAY_AGG(a.address) as address_arr 
 FROM addresses a
 WHERE a.employee_id = emp.employee_id
 GROUP BY employee_id
 ) as adr ON adr.employee_id = emp.employee_id;

The join condition isn't really needed, but it doesn't hurt either.
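
As a sketch of the same idea without the join condition (assuming PostgreSQL 9.3 or later, where LATERAL is available): the correlation lives entirely in the WHERE clause of the subquery, so a CROSS JOIN LATERAL is enough. Because the GROUP BY is kept, employees without any address produce no row in the subquery and are dropped, as with the inner join.

```sql
-- Sketch only: lateral variant with no ON clause.
SELECT emp.employee_id, emp.salary, adr.address_arr
FROM (
    SELECT employee.employee_id, payroll.salary
    FROM employee
    LEFT JOIN payroll ON payroll.employee_id = employee.employee_id
    WHERE employee.employee_id < 1000000
    LIMIT 100
    OFFSET 0
) AS emp
CROSS JOIN LATERAL (
    SELECT ARRAY_AGG(a.address) AS address_arr
    FROM addresses a
    WHERE a.employee_id = emp.employee_id   -- evaluated once per row of emp
    GROUP BY a.employee_id
) AS adr;
```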

answered Feb 8, 2019 at 7:41

Maybe

SELECT employee.employee_id, payroll.salary, ARRAY_AGG(addresses.address)
FROM employee 
INNER JOIN addresses ON addresses.employee_id = employee.employee_id
LEFT JOIN payroll on payroll.employee_id = employee.employee_id
WHERE employee.employee_id < 1000000
GROUP BY employee.employee_id
LIMIT 100
OFFSET 0

?

Also, do you really need the records that have no matching row in the payroll table, which leads to NULLs in payroll.salary? Maybe an INNER JOIN is enough?

answered Feb 8, 2019 at 7:15
  • The query is incorrect. payroll.salary needs to be included in the GROUP BY clause, which will make the query even less optimized. Commented Feb 8, 2019 at 7:48
  • @nimeshkiranverma "payroll.salary needs to be included in GROUP BY clause": alternatively, you may wrap it in any aggregate function that can be applied to this field's data type. Commented Feb 8, 2019 at 7:50
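
A minimal sketch of what the second comment suggests, using MAX as the wrapping aggregate. It assumes at most one payroll row per employee; with several payroll rows, the join would repeat each address in the array and MAX would pick the highest salary.

```sql
-- Sketch only: the answer's query with payroll.salary wrapped in an aggregate.
-- Assumption: payroll holds at most one row per employee_id.
SELECT employee.employee_id,
       MAX(payroll.salary) AS salary,
       ARRAY_AGG(addresses.address) AS address_arr
FROM employee
INNER JOIN addresses ON addresses.employee_id = employee.employee_id
LEFT JOIN payroll ON payroll.employee_id = employee.employee_id
WHERE employee.employee_id < 1000000
GROUP BY employee.employee_id
LIMIT 100
OFFSET 0;
```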

I am new to this and had some help. Here's what I came up with:

WITH t AS (
 SELECT employee.employee_id, salary
 FROM employee
 LEFT JOIN payroll ON payroll.employee_id = employee.employee_id
 WHERE employee.employee_id < 1000000
 LIMIT 100 OFFSET 0
)
SELECT t.employee_id, max(t.salary), ARRAY_AGG(address) AS address_arr
FROM address
LEFT JOIN t ON address.employee_id = t.employee_id
WHERE address.employee_id = t.employee_id
GROUP BY t.employee_id;

EXPLAIN ANALYZE yields:

HashAggregate  (cost=443931.83..443933.08 rows=100 width=44) (actual time=3173.705..3173.705 rows=1 loops=1)
  Group Key: t.employee_id
  CTE t
    ->  Limit  (cost=313.99..316.90 rows=100 width=12) (actual time=7.441..7.511 rows=100 loops=1)
          ->  Hash Right Join  (cost=313.99..145616.13 rows=4983381 width=12) (actual time=7.437..7.490 rows=100 loops=1)
                Hash Cond: (payroll.employee_id = employee.employee_id)
                ->  Seq Scan on payroll  (cost=0.00..76778.79 rows=4983879 width=12) (actual time=0.036..0.047 rows=100 loops=1)
                ->  Hash  (cost=189.00..189.00 rows=9999 width=4) (actual time=7.370..7.370 rows=10000 loops=1)
                      Buckets: 16384  Batches: 1  Memory Usage: 480kB
                      ->  Seq Scan on employee  (cost=0.00..189.00 rows=9999 width=4) (actual time=0.016..3.783 rows=10000 loops=1)
                            Filter: (employee_id < 1000000)
  ->  Hash Join  (cost=3.25..441981.29 rows=217817 width=37) (actual time=7.699..3119.096 rows=200000 loops=1)
        Hash Cond: (address.employee_id = t.employee_id)
        ->  Seq Scan on address  (cost=0.00..364792.00 rows=20002100 width=29) (actual time=0.067..1568.256 rows=20000000 loops=1)
        ->  Hash  (cost=2.00..2.00 rows=100 width=12) (actual time=7.609..7.609 rows=100 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 13kB
              ->  CTE Scan on t  (cost=0.00..2.00 rows=100 width=12) (actual time=7.449..7.580 rows=100 loops=1)
Planning time: 0.456 ms
Execution time: 3174.499 ms
(19 rows)
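
In this plan most of the time goes into the sequential scan over all 20,000,000 rows of address. One thing that might help, sketched here under the assumption that no index on address.employee_id exists yet (the index name is hypothetical), is to give the planner a way to fetch only the addresses of the 100 selected employees:

```sql
-- Sketch only: index name is made up; table name "address" as used in this answer.
CREATE INDEX address_employee_id_idx ON address (employee_id);

-- Refresh planner statistics so the new index is costed sensibly.
ANALYZE address;
```

Whether the planner then switches from the sequential scan to index lookups depends on its cost estimates, so it is worth re-running EXPLAIN ANALYZE afterwards.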

answered Feb 21, 2019 at 20:59
  • Could you give us the explain output in plain text, please? Commented Feb 22, 2019 at 7:37