Managing the order of aggregation in Postgres

Question 1

Suppose I have the following merging function that merges two jsonb values overwriting duplicate keys with the values from the second

create or replace function jsonb_concat(a jsonb, b jsonb) returns jsonb
as
'select 1ドル || 2ドル'
 language sql
 immutable
 parallel safe
;

Since I want to use that function in aggregates I need to define an aggregate function like so

create or replace aggregate jsonb_merge_agg(jsonb)
(
 sfunc = jsonb_concat,
 stype = jsonb,
 initcond = '{}'
);

And suppose I have the following table

id (bigint)	username (text)	event (jsonb)
1	foo	{ "it": 1, "key": "bla" }
2	foo	{ "it": 2, "key" : "dah" }
3	bar	{}
4	zar	{}

When I want to aggregate on username column and merge the event column I would use the following query

select username, jsonb_merge_agg(event) from table group by username

And I expect to get the following results, where records with greater id overwrite keys value pairs of earlier records in the aggregate

username	event
foo	{ "it": 2, "key": "dah"}
bar	{}
zar	{}

The problem is that I am sometimes seeing that the foo aggregate contains combination { "it": 1, "key": "bla" } instead. I am aware that Postgres does not have natural row order in the table, so it might be different between transactions. How do I manage the order of the aggregation merge?

Question 2

Be aware that there is a built-in jsonb_concat() that hides your custom function, which goes unused. I added details to my answer.

Question 3

@ErwinBrandstetter that's a good point, but I'm not seeing it being referenced somewhere in the documentation (such as json functions page 9.5). I do see the comment on such function that comment on function jsonb_concat(unknown, unknown) is 'implementation of || operator';

Question 4

There are many undocumented internal funstions. Schema-qualify the function name to make sure that yours gets used. (Or don't, in this particular case, and use the superior built-in function.)

Question 5

In the Aggregate functions page (chapter 9, section 20 in postgresql 12 documentation "Aggregate Functions" (1)) there's a paragraph specifically addressing this

The aggregate functions array_agg, json_agg, jsonb_agg, json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as similar user-defined aggregate functions, produce meaningfully different result values depending on the order of the input values. This ordering is unspecified by default, but can be controlled by writing an ORDER BY clause within the aggregate call, as shown in Section 4.2.7.

And in the section 4.2.7 (2) Value expressions/Aggregate expressions the syntax is explained as following

aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]

as a result, to get consistent merge order the query must be modified to the following

select username, jsonb_merge_agg(event order by id)

1: https://www.postgresql.org/docs/12/functions-aggregate.html

2: https://www.postgresql.org/docs/12/sql-expressions.html#SYNTAX-AGGREGATES

Question 6

Same approach can be taken in newer versions than psql12.

Question 7

The manual goes on to say:

... Alternatively, supplying the input values from a sorted subquery will usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;

In fact, this is typically much faster:

SELECT username, jsonb_merge_agg(event) AS event
FROM (
 SELECT username, event
 FROM tbl
 ORDER BY username, id
 ) sub
GROUP BY username;

fiddle

Because a single sort operation is much cheaper than a separate sort per user.

As long as the outer query level does nothing to reorder rows before the aggregation, this is safe.

If the use case is as simple as your sample suggests - effectively taking the latest value per user - a plain DISTINCT ON query does it:

SELECT DISTINCT ON (username)
 username, COALESCE(event, '{}')
FROM tbl
ORDER BY username, event IS NULL, id;

fiddle

(I presume your real case actually merges values. Then this is not applicable.)

The 2nd ORDER BY term event IS NULL makes sure null values are ignored like in your aggregate function - by sorting them last. See:

Now, why would I claim that your aggregate function ignores null values, when your custom function jsonb_concat(jsonb, jsonb) isn't defined STRICT? Postgres already has a built-in function jsonb_concat(jsonb, jsonb), and that one is STRICT. Your CREATE AGGREGATE uses the unqualified function name, and in a sane setup the system schema pg_catalog takes precedence. So the aggregate actually uses the built-in (STRICT) version of the function. (Incidentally a good thing as that one is much faster!) Your demo is quite a trap. (Maybe trapping you already?)

Question 8

That's correct. I missed a point to specify that the event jsonb column shall never contain null values, and that the merged result shall have more keys than just the latest available value for that particular username record.

Dragas Dragas 1814 bronze badges · Answer 1 · 2024-01-17 09:22:11Z

In the Aggregate functions page (chapter 9, section 20 in postgresql 12 documentation "Aggregate Functions" (1)) there's a paragraph specifically addressing this

The aggregate functions array_agg, json_agg, jsonb_agg, json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as similar user-defined aggregate functions, produce meaningfully different result values depending on the order of the input values. This ordering is unspecified by default, but can be controlled by writing an ORDER BY clause within the aggregate call, as shown in Section 4.2.7.

And in the section 4.2.7 (2) Value expressions/Aggregate expressions the syntax is explained as following

aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]

as a result, to get consistent merge order the query must be modified to the following

select username, jsonb_merge_agg(event order by id)

1: https://www.postgresql.org/docs/12/functions-aggregate.html

2: https://www.postgresql.org/docs/12/sql-expressions.html#SYNTAX-AGGREGATES

Same approach can be taken in newer versions than psql12.

Dragas
– Dragas

2024年01月17日 09:22:51 +00:00
Commented Jan 17, 2024 at 9:22

score 1 · Answer 2 · 2024-01-17 23:05:26Z

The manual goes on to say:

... Alternatively, supplying the input values from a sorted subquery will usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;

In fact, this is typically much faster:

SELECT username, jsonb_merge_agg(event) AS event
FROM (
 SELECT username, event
 FROM tbl
 ORDER BY username, id
 ) sub
GROUP BY username;

fiddle

Because a single sort operation is much cheaper than a separate sort per user.

As long as the outer query level does nothing to reorder rows before the aggregation, this is safe.

If the use case is as simple as your sample suggests - effectively taking the latest value per user - a plain DISTINCT ON query does it:

SELECT DISTINCT ON (username)
 username, COALESCE(event, '{}')
FROM tbl
ORDER BY username, event IS NULL, id;

fiddle

(I presume your real case actually merges values. Then this is not applicable.)

The 2nd ORDER BY term event IS NULL makes sure null values are ignored like in your aggregate function - by sorting them last. See:

Now, why would I claim that your aggregate function ignores null values, when your custom function jsonb_concat(jsonb, jsonb) isn't defined STRICT? Postgres already has a built-in function jsonb_concat(jsonb, jsonb), and that one is STRICT. Your CREATE AGGREGATE uses the unqualified function name, and in a sane setup the system schema pg_catalog takes precedence. So the aggregate actually uses the built-in (STRICT) version of the function. (Incidentally a good thing as that one is much faster!) Your demo is quite a trap. (Maybe trapping you already?)

That's correct. I missed a point to specify that the event jsonb column shall never contain null values, and that the merged result shall have more keys than just the latest available value for that particular username record.

Stack Exchange Network

Managing the order of aggregation in Postgres

2 Answers 2

Aside

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Managing the order of aggregation in Postgres

2 Answers 2

Aside

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions