I used to update a very large table with UPDATE queries, but they were taking too long to execute. To improve performance, I switched to a CREATE TABLE AS approach: I recreate the table from a query and add the indexes afterwards. This has significantly improved execution speed, but I want to understand its scalability and limitations.
Server Specifications:
- PostgreSQL version: 15.6
- RAM: 32 GB
- Cores: 16
- Disk Space: SSD 250 GB (50% free)
- OS: Linux Ubuntu 22.04
PostgreSQL Configuration:
max_connections = 200
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 5242kB
huge_pages = try
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 16
max_parallel_workers_per_gather = 4
max_parallel_workers = 16
max_parallel_maintenance_workers = 4
Table Details:
| Table Name | Row Count | Size |
| --- | --- | --- |
| source_switchdata_tmp_details | 60 Million | 30 GB |
| source_npcidata_tmp_details | 60 Million | 30 GB |
| source_aepscbsdata_tmp_details | 60 Million | 30 GB |
Query:
BEGIN;
ALTER TABLE source_switchdata_tmp_details RENAME TO source_switchdata_tmp_details_og;
CREATE TABLE source_switchdata_tmp_details AS
SELECT DISTINCT ON (A.uniqueid) A.transactiondate,
A.cycles,
A.transactionamount,
A.bcid,
A.bcname,
A.username,
A.terminalid,
A.uidauthcode,
A.itc,
A.transactiondetails,
A.deststan,
A.sourcestan,
A.hostresponsecode,
A.institutionid,
A.acquirer,
A.bcrefid,
A.cardno,
A.rrn,
A.transactiontype,
A.filename,
A.cardnotrim,
A.uniqueid,
A.transactiondatetime,
A.transactionstatus,
A.overall_probable_status,
A.recon_created_date,
A.priority_no,
A.recon_key_priority_1_1_to_2,
A.recon_key_priority_1_1_to_3,
A.recon_key_priority_2_1_to_2,
A.recon_key_priority_2_1_to_3,
A.process_status,
A.reconciliation_date_time,
CURRENT_TIMESTAMP AS recon_updated_date,
CASE
WHEN C.recon_key_priority_1_2_to_1 IS NOT NULL THEN 'Reconciled'
ELSE 'Not Reconciled'
END AS recon_status_1_to_2,
CASE
WHEN D.recon_key_priority_1_3_to_1 IS NOT NULL THEN 'Reconciled'
WHEN D.recon_key_priority_2_3_to_1 IS NOT NULL THEN 'Reconciled'
ELSE 'Not Reconciled'
END AS recon_status_1_to_3,
CASE
WHEN (C.recon_key_priority_1_2_to_1 IS NOT NULL AND D.recon_key_priority_1_3_to_1 IS NOT NULL) THEN 'Reconciled'
WHEN (D.recon_key_priority_2_3_to_1 IS NOT NULL) THEN 'Reconciled'
ELSE 'Not Reconciled'
END AS overall_recon_status
FROM source_switchdata_tmp_details_og A
LEFT JOIN source_aepscbsdata_tmp_details C ON (A.recon_key_priority_1_1_to_2 = C.recon_key_priority_1_2_to_1)
LEFT JOIN source_npcidata_tmp_details D
ON (A.recon_key_priority_1_1_to_3 = D.recon_key_priority_1_3_to_1)
OR (A.recon_key_priority_2_1_to_3 = D.recon_key_priority_2_3_to_1);
DROP TABLE source_switchdata_tmp_details_og;
COMMIT;
Unique Constraints and Indexes:
A.uniqueid = Primary key and Index
A.recon_key_priority_1_1_to_3 = Index
A.recon_key_priority_1_1_to_2 = Index
D.recon_key_priority_1_3_to_1 = Index
A.recon_key_priority_2_1_to_3 = Index
D.recon_key_priority_2_3_to_1 = Index
Questions:
- Currently, I am running the above query for 180 million rows (60M + 60M + 60M). In the future, I may need to run it for 1 billion rows. Will this approach scale to that size? We can increase the server specifications if needed, but will recreating the table for 300 million or even 1 billion rows still be practical?
- My team suggests updating the data in chunks of 1 million rows. Is this approach better than the current one?
- The query currently takes around 20 minutes, which is acceptable. If the data size increases, which bottlenecks (I/O, for example) should I watch for so that the query time scales proportionally instead of getting stuck?
- What are the limitations of the current approach? And what can I do to avoid such limitations?
Any insights or optimizations would be greatly appreciated. Thank you!
1 Answer
Your statement will become slower as the tables get bigger, but I guess that's what you expect. The slowdown won't be linear, though; I expect it to grow with the square of the number of rows, because of the OR in the join condition with source_npcidata_tmp_details. That OR forces PostgreSQL to perform a nested loop join, which will become very slow with big tables. Keep your join conditions to simple = comparisons if you want your queries to scale.
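One way to do that here is to join source_npcidata_tmp_details twice, once per key, so that every join condition is a plain equality and the planner can pick a hash or merge join. The following is only a sketch (the D1/D2 aliases are mine and most output columns are omitted), not a drop-in replacement, but since the query already collapses to one row per uniqueid with DISTINCT ON, the computed status comes out the same:

-- Sketch: two equality joins instead of one OR join.
SELECT DISTINCT ON (A.uniqueid)
       A.uniqueid,
       CASE
           WHEN D1.recon_key_priority_1_3_to_1 IS NOT NULL
             OR D2.recon_key_priority_2_3_to_1 IS NOT NULL THEN 'Reconciled'
           ELSE 'Not Reconciled'
       END AS recon_status_1_to_3   -- other columns omitted for brevity
FROM source_switchdata_tmp_details_og A
LEFT JOIN source_npcidata_tmp_details D1
       ON A.recon_key_priority_1_1_to_3 = D1.recon_key_priority_1_3_to_1
LEFT JOIN source_npcidata_tmp_details D2
       ON A.recon_key_priority_2_1_to_3 = D2.recon_key_priority_2_3_to_1;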
Another potential problem is the DISTINCT ON, which requires a sort with computational complexity O(n log n), so the execution time will increase more than linearly. Consider carefully whether your data can actually produce duplicate uniqueids in the query result, and only use DISTINCT ON if you really have to.
The bottleneck here is CPU speed, and you won't be able to scale that.
Updating the table instead of creating a new copy is a good idea if most rows will remain unchanged. In that case, you should add WHERE conditions so that the rows are only modified if the values change.
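A minimal sketch of what that could look like for one status column, assuming the status columns from your CREATE TABLE ... AS already exist on the table being maintained (the IS DISTINCT FROM predicate is the important part, because it keeps PostgreSQL from rewriting rows that would not change):

-- Only rows whose status would actually change are touched.
UPDATE source_switchdata_tmp_details A
SET    recon_status_1_to_2 = 'Reconciled',
       recon_updated_date  = CURRENT_TIMESTAMP
FROM   source_aepscbsdata_tmp_details C
WHERE  A.recon_key_priority_1_1_to_2 = C.recon_key_priority_1_2_to_1
  AND  A.recon_status_1_to_2 IS DISTINCT FROM 'Reconciled';
-- For the chunked variant your team suggests, add a range predicate on the
-- primary key (e.g. AND A.uniqueid >= $1 AND A.uniqueid < $2, placeholders
-- illustrative) and run the statement once per range in its own transaction.

Whether chunking helps then depends mostly on how many rows actually change; if only a small fraction does, the filtered UPDATE already avoids most of the write amplification that made your original UPDATE slow.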