Parallel merge join with sorting on large temporary tables in PostgreSQL

Question 1

I have two tables that I want to join on a text column: the larger one has a billion rows and the smaller one has 100M rows. The tables don't fit into memory, so PostgreSQL reasonably chooses a merge join.

The problem is the sort stage: PostgreSQL sorts in a single process, which takes forever.

Is there any way to solve this? I imagine a parallel, multi-worker sort would scale better. Or are there other possible solutions? I think this should be a very common scenario.

Update: I found that the issue is reproducible only with temporary tables, which is known behavior: parallel scans are not allowed on temporary tables. See https://stackoverflow.com/questions/69533864/why-are-scans-of-ctes-and-temporary-tablest-parallel-restricted

Please provide the information needed for performance questions, as instructed here: dba.meta.stackexchange.com/a/3299/3684
Also include your version of Postgres.

Laurenz Albe · Answer 1 · 2022-03-26 04:04:41Z

A parallel sort in PostgreSQL would mean exchanging lots of rows between parallel worker processes, so it is questionable whether that would be a win.

If the speed of that query is very important, one possible solution is partitioning. You'd have to partition both tables on the column that is used in the join condition (I expect that the join will be on =) and use the same partition boundaries for both tables. Then set the parameter enable_partitionwise_join to on, and PostgreSQL can perform the join partition by partition. Not only can that be parallelized, but since the partitions are smaller, you may also end up with a faster hash join.
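
To make the idea concrete, here is a toy Python model of a partition-wise equi-join. It is only a sketch of the principle, not how PostgreSQL implements it: the function names are made up for illustration, everything lives in memory, and hash partitioning is assumed as one way to get matching partition boundaries on both sides.

```python
from collections import defaultdict


def hash_partition(rows, n_parts):
    """Split rows into n_parts buckets by the hash of the join key (column 0)."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[0]) % n_parts].append(row)
    return parts


def hash_join(left, right):
    """Equi-join on column 0: build a hash table on `right`, probe with `left`."""
    table = defaultdict(list)
    for r in right:
        table[r[0]].append(r)
    return [(l, r) for l in left for r in table.get(l[0], [])]


def partitionwise_join(left, right, n_parts):
    """Join partition i of `left` only with partition i of `right`.

    Both inputs use the same partitioning function, so equal keys always land
    in the same partition number and no cross-partition pairs can match.
    Each iteration is independent, so in principle the pairs could be joined
    by separate workers, and each build side is smaller than the whole table.
    """
    result = []
    left_parts = hash_partition(left, n_parts)
    right_parts = hash_partition(right, n_parts)
    for lp, rp in zip(left_parts, right_parts):
        result.extend(hash_join(lp, rp))
    return result
```

The per-partition results, taken together, contain the same rows as a join of the unpartitioned inputs; only the row order may differ.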

Parallel sort could work the following way: first, a local sort is performed on each of a bunch of partitions, then a single-threaded merge in linear time is performed. No exchange of rows between workers is needed.
There is no multithreading in PostgreSQL, only multiprocessing, so data has to be exchanged between processes via IPC (shared memory). And no, that won't change in the near future.
@RikuIki How do you merge without getting the data that needs merging from the workers?
Through materialization to disk, since the data doesn't fit into memory anyway.
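
The scheme discussed in these comments (sort chunks locally, materialize each sorted run to disk, then merge the runs in one process) can be sketched in Python as follows. This is an illustration under stated assumptions, not PostgreSQL code: the helper names are hypothetical, keys are strings without newlines, and the chunks are sorted one after another here, whereas the proposal would hand each chunk to a separate worker.

```python
import heapq
import os
import tempfile


def write_sorted_run(chunk, directory):
    """Sort one chunk in memory and write it to its own run file, one key per line."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        for key in sorted(chunk):
            f.write(key + "\n")
    return path


def read_run(path):
    """Stream the keys of one run file back in sorted order."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def external_sort(keys, chunk_size, directory):
    """Return an iterator over all keys in sorted order.

    Phase 1 produces one sorted run per chunk (the part that could be
    handed to workers). Phase 2 is a single k-way merge over the run
    files; with a heap this costs O(n log k) comparisons for k runs.
    """
    paths = [
        write_sorted_run(keys[i:i + chunk_size], directory)
        for i in range(0, len(keys), chunk_size)
    ]
    return heapq.merge(*(read_run(p) for p in paths))
```

Note that the merging process still reads every row from the run files, so the data does reach it; it simply travels through disk rather than through shared memory.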
