I have a relatively complicated query with a subquery fetching an array like so:

...
ARRAY(SELECT category_id FROM category_schedule_con con
 WHERE s.id = con.schedule_id ORDER BY category_id) AS cats,
...

and would like to use the array 'cats' in a later WHERE condition like

...
WHERE 4 = ANY(cats)
...

But this does not work: Postgres reports that the column 'cats' doesn't exist. Copying and pasting the subquery into the ANY clause yields the expected result.

Erwin Brandstetter
asked Aug 12, 2015 at 1:05
  • Your version of Postgres is missing. Also, it's almost always better to post a complete (simplified) query, and the table definitions are almost always helpful, too. Commented Aug 12, 2015 at 16:07

2 Answers


Explanation

By definition in the SQL standard (which Postgres implements) you can reference output columns in ORDER BY or GROUP BY, but not in the WHERE or HAVING clause. The manual:

An output column's name can be used to refer to the column's value in ORDER BY and GROUP BY clauses, but not in the WHERE or HAVING clauses; there you must write out the expression instead.
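
As a minimal illustration of that rule (the VALUES list and the alias `doubled` are made up for this sketch and are not part of the question), an output column name resolves in ORDER BY but not in WHERE:

```sql
-- Works: the output column name is visible in ORDER BY.
SELECT x * 2 AS doubled
FROM  (VALUES (1), (2), (3)) AS t(x)
ORDER  BY doubled;

-- Fails with: column "doubled" does not exist
-- SELECT x * 2 AS doubled
-- FROM  (VALUES (1), (2), (3)) AS t(x)
-- WHERE  doubled > 2;

-- Works: write out the expression in WHERE instead.
SELECT x * 2 AS doubled
FROM  (VALUES (1), (2), (3)) AS t(x)
WHERE  x * 2 > 2;
```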

Obviously, your subquery is a correlated subquery expression in the SELECT list (which is hidden in the question due to over-simplification).

To avoid repeating lengthy or expensive expressions in the WHERE clause, you can use a subquery in the FROM list. A correlated subquery in the SELECT list cannot be referenced by alias in the WHERE clause; it is just an output column like any other.
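
A rough sketch of that idea, wrapping the original expression in a subquery in the FROM list. The table definitions and sample rows are hypothetical (the question does not show them), and some_table is only a guess at the outer table:

```sql
-- Hypothetical minimal tables; the real definitions are not shown in the question.
CREATE TEMP TABLE some_table (id int PRIMARY KEY);
CREATE TEMP TABLE category_schedule_con (schedule_id int, category_id int);

INSERT INTO some_table VALUES (1), (2);
INSERT INTO category_schedule_con VALUES (1, 3), (1, 4), (2, 5);

SELECT *
FROM  (
   SELECT s.*
        , ARRAY(SELECT category_id
                FROM   category_schedule_con con
                WHERE  con.schedule_id = s.id
                ORDER  BY category_id) AS cats
   FROM   some_table s
   ) sub
WHERE  4 = ANY (sub.cats);  -- cats is an input column of the outer query here
```

With the sample rows above, only the row with id 1 qualifies, because its array contains 4. The array is still built for every row of some_table before the filter applies, which is why this form does not scale well.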

Chances are, your query can be much more efficient ...

Better query

Applying the above, this query would work:

SELECT s.*, con.cats
FROM   some_table s  -- guessing the missing query
JOIN  (
   SELECT schedule_id
        , array_agg(category_id ORDER BY category_id) AS cats
   FROM   category_schedule_con
   GROUP  BY 1
   ) con ON con.schedule_id = s.id
WHERE  3 = ANY(cats);

Now you can reference the column alias cats in the WHERE clause. But this query is terribly inefficient for big tables for multiple reasons. Most importantly, the predicate is not sargable.

This uses GROUP BY and the aggregate function array_agg() instead of the ARRAY constructor because we are producing multiple arrays in a single query.

You can apply an ORDER BY clause to almost any aggregate function, but a sorted subquery typically performs better:

SELECT s.*, con.cats
FROM   some_table s
JOIN  (
   SELECT id, array_agg(category_id) AS cats
   FROM  (
      SELECT schedule_id AS id, category_id  -- alias id for convenience
      FROM   category_schedule_con
      ORDER  BY 1, 2  -- to get ordered list per schedule_id
      ) con
   GROUP  BY 1
   ) con USING (id)
WHERE  3 = ANY(cats);

More importantly, pull the predicate (the WHERE condition) down into the subquery to make it possible to use an index and exclude irrelevant rows early. Much faster with big tables:

SELECT s.*, con.cats
FROM   some_table s
JOIN  (
   SELECT id, array_agg(category_id) AS cats
   FROM  (
      SELECT schedule_id AS id, category_id
      FROM   category_schedule_con c
      WHERE  EXISTS (
         SELECT 1 FROM category_schedule_con
         WHERE  schedule_id = c.schedule_id
         AND    category_id = 3
         )
      ORDER  BY 1, 2
      ) con
   GROUP  BY 1
   ) con USING (id);
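
Whether the pulled-down predicate really uses an index depends on your data and indexes, so it may be worth inspecting the plan. A sketch, assuming the same guessed table names as above; note that EXPLAIN with ANALYZE actually executes the query:

```sql
-- Shows the chosen plan with actual row counts, timings, and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.*, con.cats
FROM   some_table s
JOIN  (
   SELECT id, array_agg(category_id) AS cats
   FROM  (
      SELECT schedule_id AS id, category_id
      FROM   category_schedule_con c
      WHERE  EXISTS (
         SELECT 1 FROM category_schedule_con
         WHERE  schedule_id = c.schedule_id
         AND    category_id = 3
         )
      ORDER  BY 1, 2
      ) con
   GROUP  BY 1
   ) con USING (id);
```

Running the same statement against the other variants lets you compare them on your own data rather than relying on general expectations.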

Depending on data distribution, a LATERAL join (requires Postgres 9.3+) may be more efficient:

SELECT s.*, con.cats
FROM   some_table s
     , LATERAL (
   SELECT ARRAY (
      SELECT category_id
      FROM   category_schedule_con
      WHERE  schedule_id = s.id
      ORDER  BY 1
      ) AS cats
   ) con
WHERE  EXISTS (
   SELECT 1 FROM category_schedule_con
   WHERE  schedule_id = s.id
   AND    category_id = 3
   );

But it should be the fastest to invert the logic: start by finding schedule_id that have category_id = 3, self-join to category_schedule_con and aggregate before joining to the other table:

SELECT s.*, con.cats
FROM  (
   SELECT x.id, array_agg(c.category_id) AS cats
   FROM  (
      SELECT schedule_id AS id
      FROM   category_schedule_con
      WHERE  category_id = 3
      ) x
   JOIN   category_schedule_con c ON c.schedule_id = x.id  -- the alias id only exists in x
   GROUP  BY x.id
   ) con
JOIN   some_table s USING (id);

Index

Be sure to have a multicolumn index like:

CREATE INDEX category_schedule_con_foo_idx ON category_schedule_con
(schedule_id, category_id);

For the last query, we'd need the inverted sequence of columns:

CREATE INDEX category_schedule_con_bar_idx ON category_schedule_con
(category_id, schedule_id);

plus the first index, with the columns in the original order, for the self-join. Two single-column indexes, on (category_id) and on (schedule_id), would work fast, too.

A PK or UNIQUE constraint on both columns serves as well.
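
For example, a primary key over both columns could look like the sketch below. The constraint name is hypothetical, and the statement assumes the table has no duplicate pairs and no NULL values in either column:

```sql
-- Hypothetical constraint name. Fails if duplicate (schedule_id, category_id)
-- pairs or NULL values already exist in the table.
ALTER TABLE category_schedule_con
   ADD CONSTRAINT category_schedule_con_pkey
   PRIMARY KEY (schedule_id, category_id);
```

The constraint is backed by a unique index on (schedule_id, category_id) in that column order, so it can stand in for the first index above but not for the inverted one.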

answered Aug 12, 2015 at 16:02

Preamble: Check out Erwin's answer on this for a really interesting and detailed explanation.

I think one way to get this to work appropriately is to name the subquery via a CTE, like

WITH cats AS(SELECT category_id FROM category_schedule_con con 
 WHERE s.id = con.schedule_id ORDER BY category_id)...

and then later when you need to apply your WHERE predicate, use

... WHERE 4 = ANY(ARRAY(SELECT * FROM cats)) ...

Give that a shot and see if it works for you.

Note: you can also write the WHERE predicate as

... WHERE 4 = ANY(SELECT * FROM cats) ...

and the query should still work, but may yield a different query plan, as the planner treats sets of records and arrays differently. If you are running into query run time issues, this could create a 'hack' fix to your problem.
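
A self-contained sketch of the two predicate forms, using a made-up VALUES list in place of the real table (the column aliases via_array and via_set are also just for illustration):

```sql
WITH cats AS (
   SELECT category_id
   FROM  (VALUES (1), (4), (7)) AS v(category_id)  -- stand-in data
   )
SELECT 4 = ANY (ARRAY(SELECT category_id FROM cats)) AS via_array  -- array form
     , 4 = ANY (SELECT category_id FROM cats)        AS via_set;   -- subquery form
```

Both expressions evaluate to true here. The first compares against an array value, the second against the rows of a subquery, which is why the planner can handle them differently.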

answered Aug 12, 2015 at 1:24
  • I may be wrong but I thought the OP meant that they want to reference cats in the WHERE clause of the same SELECT statement (on the same level). And you probably know that you can't reference a column alias in a WHERE clause. Commented Aug 12, 2015 at 16:43
  • Yeah, I think you're right. I think I misunderstood the OP, and I fumbled my response a bit. In my defense, I was trying to write the response at the end of a long work day. Not in my defense, I shouldn't have been so hasty! :P Either way, I have made an edit to reflect this, where simply using a CTE could be a fix. Commented Aug 12, 2015 at 17:36
  • Looking at it again, I'm actually still not sure I know what the OP is asking for... I thought they were trying to build the cats result as a single column set of returned records, to later be referenced as a table (or an array, if applying the ARRAY() condition). If they're trying to build it as a column, then I'm off base, but I don't think I know enough about their query. Commented Aug 12, 2015 at 17:46
