0

I have two tables that I want to join on a text column: the larger has a billion rows and the smaller has 100M rows. The tables don't fit into memory, so PostgreSQL reasonably uses a merge join.

The problem occurs in the sorting stage: PostgreSQL sorts in a single process, which takes forever.

Is there any way to solve this? I imagine a parallel multi-worker sort would scale. Or are there other possible solutions? I think this should be a very common scenario.

Update: I found that the issue is reproducible only for temporary tables, which is a known limitation; see https://stackoverflow.com/questions/69533864/why-are-scans-of-ctes-and-temporary-tablest-parallel-restricted Parallel scans are not allowed on temporary tables.

asked Mar 25, 2022 at 20:50
2
  • Please provide proper information for performance questions, as instructed here: dba.meta.stackexchange.com/a/3299/3684 Commented Mar 25, 2022 at 22:57
  • Also your version of Postgres. Commented Mar 26, 2022 at 0:05

1 Answer

-1

A parallel sort in PostgreSQL would mean exchanging lots of rows between parallel worker processes, so it is questionable whether that would be a win.

If the speed of that query is very important, one possible solution would be partitioning. You'd have to partition both tables on the column that is used in the join condition (I expect that the join will be on =) and use the same partition boundaries for both tables. Then set the parameter enable_partitionwise_join to on, and PostgreSQL will perform the join separately for each pair of matching partitions. Not only can that be parallelized, but since the partitions are smaller than the full tables, you may also end up with a faster hash join.
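To illustrate why identical partitioning on the join column helps, here is a minimal Python sketch of the idea, not PostgreSQL's implementation. It assumes rows are `(key, payload)` tuples with text keys; the function names (`partition_rows`, `hash_join`, `partitionwise_join`) and the sample data are hypothetical. Because both inputs are split with the same hash function and the same number of partitions, equal keys always land in the same partition number, so each pair of partitions can be joined independently.

```python
from collections import defaultdict
from zlib import crc32


def partition_rows(rows, n_parts):
    """Split (key, payload) rows into n_parts buckets by a stable hash of the key."""
    parts = [[] for _ in range(n_parts)]
    for key, payload in rows:
        parts[crc32(key.encode("utf-8")) % n_parts].append((key, payload))
    return parts


def hash_join(left, right):
    """In-memory equi-join on the key: build a hash table on right, probe with left."""
    table = defaultdict(list)
    for key, payload in right:
        table[key].append(payload)
    return [(key, lp, rp) for key, lp in left for rp in table.get(key, [])]


def partitionwise_join(left, right, n_parts):
    """Join partition i of left only with partition i of right.

    Each iteration touches a disjoint slice of the data, so the iterations
    could be handed to separate workers (run sequentially here for simplicity).
    """
    result = []
    left_parts = partition_rows(left, n_parts)
    right_parts = partition_rows(right, n_parts)
    for lpart, rpart in zip(left_parts, right_parts):
        result.extend(hash_join(lpart, rpart))
    return result


# Hypothetical sample data standing in for the large and the small table.
big = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]
small = [("a", "x"), ("c", "y"), ("d", "z")]

# The partition-wise result contains the same rows as a single full join.
assert sorted(partitionwise_join(big, small, 4)) == sorted(hash_join(big, small))
```

Each per-partition hash table only has to hold one partition of the smaller input, which is why a hash join may become feasible where it was not for the full tables.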

answered Mar 26, 2022 at 4:04
4
  • A parallel sort could work the following way: first, a local sort is performed on each of a bunch of partitions, then a single-threaded merge in linear time is performed. No exchange of rows between workers is needed. Commented Mar 26, 2022 at 14:47
  • There is no multithreading in PostgreSQL, only multiprocessing, so data have to be exchanged between processes via IPC (shared memory). And no, that won't change in the near future. Commented Mar 26, 2022 at 15:57
  • @RikuIki How do you merge without getting the data that needs merging from the workers? Commented Mar 26, 2022 at 16:49
  • Through materialization to disk, since the data doesn't fit into memory anyway. Commented Mar 26, 2022 at 19:03
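The scheme discussed in the comments above (sort chunks locally, materialize them to disk, then merge in one pass) can be sketched in Python as follows. This is only an illustration under stated assumptions, not a description of how PostgreSQL sorts: the function names are hypothetical, values are assumed to be strings without newlines, and the per-chunk sorts run sequentially here although they are the step that separate worker processes could perform.

```python
import heapq
import os
import tempfile


def write_sorted_run(chunk, directory):
    """Sort one chunk in memory and spill it to a temp file, one value per line."""
    fd, path = tempfile.mkstemp(dir=directory, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for value in sorted(chunk):
            f.write(value + "\n")
    return path


def read_run(path):
    """Stream a sorted run back from disk without loading it whole."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def external_sort(chunks):
    """Yield all values in sorted order via sorted runs on disk plus a k-way merge."""
    with tempfile.TemporaryDirectory() as directory:
        # Phase 1: independent local sorts, each materialized to disk.
        paths = [write_sorted_run(chunk, directory) for chunk in chunks]
        # Phase 2: single merge pass that keeps one current value per run in memory.
        yield from heapq.merge(*(read_run(path) for path in paths))


# Hypothetical input: three chunks that each fit into memory on their own.
chunks = [["pear", "apple"], ["fig", "banana"], ["cherry"]]
assert list(external_sort(chunks)) == ["apple", "banana", "cherry", "fig", "pear"]
```

The merging process still has to read every row of every run, which is the data movement the answer refers to; in this sketch it happens through files rather than shared memory.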
