I have a relatively complicated query with a subquery fetching an array like so:
...
ARRAY(SELECT category_id FROM category_schedule_con con
WHERE s.id = con.schedule_id ORDER BY category_id) AS cats,
...
and would like to use the array 'cats' in a later WHERE condition like
...
WHERE 4 = ANY(cats)
...
But this does not work: Postgres reports that the column 'cats' doesn't exist. Copying and pasting the subquery into the ANY clause yields the expected result.
- Your version of Postgres is missing. Also, it's almost always better to post a complete (simplified) query, and the table definitions are almost always helpful, too. – Erwin Brandstetter, Aug 12, 2015 at 16:07
2 Answers
Explanation
By definition in the SQL standard (which Postgres implements) you can reference output columns in ORDER BY or GROUP BY, but not in the WHERE or HAVING clause. The manual:
An output column's name can be used to refer to the column's value in ORDER BY and GROUP BY clauses, but not in the WHERE or HAVING clauses; there you must write out the expression instead.
Related:
- PostgreSQL reusing computation result in select query
- PostgreSQL Where count condition
- GROUP BY + CASE statement
Obviously, your subquery is a correlated subquery expression in the SELECT list (which is hidden in the question due to over-simplification).
To avoid repeating lengthy / expensive expressions in the WHERE clause you can use a subquery in the FROM list. A correlated subquery in the SELECT list cannot be referenced by alias in the WHERE clause, that's just an output column like any other.
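A minimal sketch of that rule, using a hypothetical temporary table t (not part of the question) so it can be run in isolation:

```sql
-- Hypothetical table, for illustration only
CREATE TEMP TABLE t (a int, b int);
INSERT INTO t VALUES (1, 2), (7, 8);

-- Fails with: column "total" does not exist,
-- because "total" is only an output column at this query level:
-- SELECT a + b AS total FROM t WHERE total > 10;

-- Works: inside the outer query, "total" is an input column
-- delivered by the subquery in the FROM list
SELECT *
FROM  (SELECT a, b, a + b AS total FROM t) sub
WHERE  total > 10;
```

The expression is written only once, and the outer WHERE filters on its result.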
Chances are, your query can be much more efficient ...
Better query
Applying the above, this query would work:
SELECT s.*, con.cats
FROM some_table s -- guessing the missing query
JOIN (
SELECT schedule_id
, array_agg(category_id ORDER BY category_id) AS cats
FROM category_schedule_con
GROUP BY 1
) con ON con.schedule_id = s.id
WHERE 3 = ANY(cats);
Now you can reference the column alias cats in the WHERE clause. But this query is terribly inefficient for big tables for multiple reasons. Most importantly, the predicate is not sargable.
This uses GROUP BY and the aggregate function array_agg() instead of the ARRAY constructor because we are producing multiple arrays in a single query.
You can apply an ORDER BY clause to almost any aggregate function, but a sorted subquery typically performs better:
SELECT s.*, con.cats
FROM some_table s
JOIN (
SELECT id, array_agg(category_id) AS cats
FROM (
SELECT schedule_id AS id, category_id -- alias id for convenience
FROM category_schedule_con
ORDER BY 1, 2 -- to get ordered list per schedule_id
) con
GROUP BY 1
) con USING (id)
WHERE 3 = ANY(cats);
More importantly, pull the predicate (the WHERE condition) down into the subquery to make it possible to use an index and exclude irrelevant rows early. Much faster with big tables:
SELECT s.*, con.cats
FROM some_table s
JOIN (
SELECT id, array_agg(category_id) AS cats
FROM (
SELECT schedule_id AS id, category_id
FROM category_schedule_con c
WHERE EXISTS (
SELECT 1 FROM category_schedule_con
WHERE schedule_id = c.schedule_id
AND category_id = 3
)
ORDER BY 1, 2
) con
GROUP BY 1
) con USING (id);
Depending on data distribution, a LATERAL join (requires Postgres 9.3+) may be more efficient:
SELECT s.*, con.cats
FROM some_table s
, LATERAL (
SELECT ARRAY (
SELECT category_id
FROM category_schedule_con
WHERE schedule_id = s.id
ORDER BY 1
) AS cats
) con
WHERE EXISTS (
SELECT 1 FROM category_schedule_con
WHERE schedule_id = s.id
AND category_id = 3
);
About LATERAL:
But it should be the fastest to invert the logic: start by finding schedule_id that have category_id = 3, self-join to category_schedule_con and aggregate before joining to the other table:
SELECT s.*, con.cats
FROM (
SELECT schedule_id AS id, array_agg(c.category_id) AS cats
FROM (
SELECT schedule_id
FROM category_schedule_con
WHERE category_id = 3
) x
JOIN category_schedule_con c USING (schedule_id) -- self-join on the column the table actually has
GROUP BY 1
) con
JOIN some_table s USING (id);
Index
Be sure to have a multicolumn index like:
CREATE INDEX category_schedule_con_foo_idx ON category_schedule_con
(schedule_id, category_id);
For the last query, we'd need the inverted sequence of columns:
CREATE INDEX category_schedule_con_bar_idx ON category_schedule_con
(category_id, schedule_id);
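and another one with the two columns switched back again, (schedule_id, category_id), for the self-join. Two indexes on just (category_id) and (schedule_id) would work fast, too.
A PK or UNIQUE constraint on both columns serves as well, since it is implemented with a multicolumn index in the declared column order.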
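As a rough sketch of the constraint variant: the statements below assume that category_schedule_con holds no duplicate (schedule_id, category_id) pairs and no NULL values in either column. The constraint and index names are made up for illustration.

```sql
-- Assumption: (schedule_id, category_id) is unique and NOT NULL in this table.
-- The primary key creates a unique index on (schedule_id, category_id) implicitly.
ALTER TABLE category_schedule_con
   ADD CONSTRAINT category_schedule_con_pkey PRIMARY KEY (schedule_id, category_id);

-- Reversed column order for queries that start from category_id
CREATE INDEX category_schedule_con_cat_sched_idx
   ON category_schedule_con (category_id, schedule_id);
```

With the primary key in place, a separate index on (schedule_id, category_id) would be redundant.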
Preamble: Check out Erwin's answer on this for a really interesting and detailed explanation.
I think one way to get this to work appropriately is to name the subquery via a CTE, like
WITH cats AS(SELECT category_id FROM category_schedule_con con
WHERE s.id = con.schedule_id ORDER BY category_id)...
and then later when you need to apply your WHERE predicate, use
... WHERE 4 = ANY(ARRAY(SELECT * FROM cats)) ...
Give that a shot and see if it works for you.
Note: you can also write the WHERE predicate as
... WHERE 4 = ANY(SELECT * FROM cats) ...
and the query should still work, but may yield a different query plan, as the planner treats sets of records and arrays differently. If you are running into query run time issues, this could create a 'hack' fix to your problem.
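For the case where cats is meant to be a per-row array column, a CTE can also wrap the whole SELECT, so that cats becomes an input column of the outer query. This is only a sketch: the table name some_table is guessed, as in the other answer, and the array is still built for every row before filtering, so it is not expected to be fast on big tables.

```sql
WITH sched AS (
   SELECT s.*
        , ARRAY(SELECT category_id
                FROM   category_schedule_con con
                WHERE  con.schedule_id = s.id
                ORDER  BY category_id) AS cats
   FROM   some_table s   -- assumption: the outer table from the elided query
)
SELECT *
FROM   sched
WHERE  4 = ANY(cats);    -- cats is visible here because it comes from the CTE
```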
- I may be wrong but I thought the OP meant that they want to reference cats in the WHERE clause of the same SELECT statement (on the same level). And you probably know that you can't reference a column alias in a WHERE clause. – Andriy M, Aug 12, 2015 at 16:43
- Yeah, I think you're right. I think I misunderstood the OP, and I fumbled my response a bit. In my defense, I was trying to write the response at the end of a long work day. Not in my defense, I shouldn't have been so hasty! :P Either way, I have made an edit to reflect this, where simply using a CTE could be a fix. – Chris, Aug 12, 2015 at 17:36
- Looking at it again, I'm actually still not sure I know what the OP is asking for... I thought they were trying to build the cats result as a single column set of returned records, to later be referenced as a table (or an array, if applying the ARRAY() condition). If they're trying to build it as a column, then I'm off base, but I don't think I know enough about their query. – Chris, Aug 12, 2015 at 17:46