We have several websites from which we download CSV files (around 100 of them). We have no control over their schedules, so we download files from them every few hours. Each file can be assumed to hold about 20 GB of data, with an average row size of roughly 4 KB, so around 5 million rows per file.
The first time this was done, we read, transformed and normalised the data from all these files into 10 tables in a PostgreSQL database. Now, every time we get a new file from one of these websites, we want to update our database: add any new rows, update existing rows whose columns changed, and flag rows in the tables whose corresponding rows are suddenly absent from the source(s).
Is the best way to do this to use some diffing tool (maybe UNIX `diff`) to find out what got added, what got removed and what got changed, and then create bulk INSERT/UPDATE queries only for those rows and fire them at the database? Is there a recommended file format or tool for this, or some out-of-the-box solution if, say, the files are downloaded to AWS S3?
OR are there faster ways, like creating temporary normalised tables from the new versions of the files and then doing some sort of compare-and-update into the main customer-facing tables?
OR should we just create a temporary, non-normalised table, do the diffing there against the new version of the data, and then transform and update the main normalised tables?
I'm new to databases and have also read that we can write server-side functions which the database can execute to do a few things for us, i.e. bake the transform + normalise logic into such server-side functions and create a trigger to execute them, etc.
What is the recommendation?
1 Answer
It would probably be best to do as much as possible outside the database, as databases add a lot of overhead. But Unix `diff` might not be suitable, as it wants all the data in RAM, at least when there is more than a handful of differences. I find `sort` and `uniq` to be a more reliable way to deal with very large text data files than `diff`. This does depend on every line being a record, i.e. no embedded newlines.
For example, to get all new lines not identical to an old line:
sort new old old | uniq -u
Since you included `old` twice, every old line will be present at least twice and so will not survive the `-u`. And every new line identical to an old line will be present three times as a group, so it also will not survive `-u`. That leaves only the genuinely new lines.
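To make that concrete, here is a toy run of the same pipeline (the file names and row contents are placeholders, not from the question):

```bash
# old has rows a, b, c; new has a (unchanged), b (changed) and d (brand new)
printf 'a,1\nb,2\nc,3\n' > old
printf 'a,1\nb,9\nd,4\n' > new

sort new old old | uniq -u
# b,9   <- changed row, present only in its new version
# d,4   <- genuinely new row
# (a,1 appears 3 times and c,3 twice, so both are dropped by -u)
```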
If you do this on the entire data file, though, the results mix together new lines and changed lines. If you instead isolated just the primary key columns, it would give you all the new primary keys. But if you use `INSERT ... ON CONFLICT ... DO UPDATE`, you don't need to distinguish those two situations: the database will resolve them, without having to bother with all the provably identical data.
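As a minimal sketch of that upsert path, assuming a target table `items(pk, col_a, col_b)` and `psql` as the client (the table, column and environment-variable names are made up for illustration):

```bash
# Keep only the lines that are new or changed, then upsert just those.
sort new old old | uniq -u > changed_or_new.csv

psql "$DATABASE_URL" <<'SQL'
CREATE TEMP TABLE staging (LIKE items INCLUDING DEFAULTS);

-- \copy is psql's client-side COPY; it reads the file from the client machine
\copy staging FROM 'changed_or_new.csv' WITH (FORMAT csv)

-- requires a primary key or unique constraint on items.pk
INSERT INTO items (pk, col_a, col_b)
SELECT pk, col_a, col_b FROM staging
ON CONFLICT (pk) DO UPDATE
    SET col_a = EXCLUDED.col_a,
        col_b = EXCLUDED.col_b;
SQL
```

This way only the rows that actually differ are shipped to the database, so the write load on the live tables is proportional to the real churn rather than to the 5 million rows per file.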
To get the data that needs to be marked as deleted:
extract_pk new new old | sort | uniq -u
This would give you the list of primary keys from the old data that are no longer in the new data, which would then have to be flagged as deleted.
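And a matching sketch for the delete-flag side; `cut -d, -f1` stands in for `extract_pk` here (it assumes the key is the first CSV column with no embedded commas), and the table and column names are again illustrative:

```bash
# Primary keys present in old but absent from new
cut -d, -f1 new new old | sort | uniq -u > gone_pks.txt

psql "$DATABASE_URL" <<'SQL'
CREATE TEMP TABLE gone (pk text);

\copy gone FROM 'gone_pks.txt'

UPDATE items
   SET is_deleted = true
 WHERE pk IN (SELECT pk FROM gone);
SQL
```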
- Thanks! I did actually do this and meant to update the results here soon, but you're right! `diff`/`comm`/`sort` etc. finish crunching the files in around 40 secs, whereas the method with an SQL `JOIN` takes around 23 mins 30 secs, and using `EXCEPT` did not return even after 35 mins so I killed it. For `diff` there is actually a `--speed-large-files` option in GNU diff, though I'm not sure whether it's meant to stop it loading everything into RAM. – KnowSQL, Feb 5, 2022 at 5:08
- E.g. for `EXCEPT`, I created table `t0` using `COPY` from flat-file version 1 and then table `t1` from version 2, which has a few rows different. Then `select * from t0 except select * from t1;`. For joins: `select * from t0 full outer join t1 on row(t0) = row(t1) where t0.primary_key_col is null or t1.primary_key_col is null;`. These were orders of magnitude slower on two flat-file versions, each of size 10G. – KnowSQL, Feb 5, 2022 at 5:14
- Regarding `INSERT ... ON CONFLICT ... DO UPDATE`: if we diff the `csv`s beforehand, collect only those rows that need to be changed, and create SQL queries just for those, will that be less taxing on the tables (which are live and serving the customers' (GET) queries)? `INSERT .. ON CONFLICT .. DO UPDATE` might be wasteful, since it would be done for every record in the new file. So I wanted to know if there is a precedent for such cases, assuming someone has had to do something similar.