We have several websites from which we download CSV files (around 100 of them). We have no control over their schedules, so we download files from them every few hours. Each file can be assumed to hold about 20 GB of data, with an average row size of roughly 4 KB, so around 5 million rows per file.
The first time this was done, we read, transformed and normalised the data from all these files into 10 tables in a PostgreSQL database. Now, every time we get a new file from one of these websites, we want to update our database: add any new rows, update existing rows whose columns changed, and flag rows in the tables whose corresponding rows are suddenly absent from the source(s).
Is the best way to do this to use some diffing tool (maybe UNIX `diff`) to find out what got added, what got removed and what got changed, and then create bulk INSERT/UPDATE queries only for those rows and fire them at the database? Is there a recommended file format or tool for this, or some out-of-the-box solution if, say, the files are downloaded to AWS S3?
OR are there faster ways, like creating temporary normalised tables from the new versions of the files and then doing some sort of compare-and-update into the main customer-facing tables?
OR should we just create a temporary, non-normalised table, do the diffing there against the new version of the data, and then transform and update the main normalised tables?
I'm new to databases and have also read that we can write server-side functions which the database can execute to do a few things for us, i.e. bake the transform + normalise logic into such server-side functions and create a trigger to execute them, etc.
What is the recommendation?
1 Answer
It would probably be best to do as much as possible outside the database, as databases add a lot of overhead. But Unix `diff` might not be suitable, as it wants all the data in RAM, at least when there is more than a handful of differences. I find `sort` and `uniq` to be a more reliable way to deal with very large text data files than `diff`. This does depend on every line being a record, i.e. no embedded newlines.
For example, to get all new lines not identical to an old line:
sort new old old | uniq -u
Since you included `old` twice, every old line will be present at least twice and so will not survive the `-u`. And every new line identical to an old line will be present three times as a group, so it also will not survive `-u`. That leaves only the genuinely new lines.
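To make that concrete, here is a toy run of the same pipeline (the file names and row contents are placeholders, not from the question):

```bash
# old has rows a, b, c; new has a (unchanged), b (changed) and d (brand new)
printf 'a,1\nb,2\nc,3\n' > old
printf 'a,1\nb,9\nd,4\n' > new

sort new old old | uniq -u
# b,9   <- changed row, present only in its new version
# d,4   <- genuinely new row
# (a,1 appears 3 times and c,3 twice, so both are dropped by -u)
```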
If you do this on the entire data file, though, the results mix together new lines and changed lines. If you instead isolated just the primary key columns, it would give you all the new primary keys. But if you use `INSERT ... ON CONFLICT ... DO UPDATE`, you don't need to distinguish those two situations: the database will resolve them, without having to bother with all the provably identical data.
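As a minimal sketch of that upsert path, assuming a target table `items(pk, col_a, col_b)` and `psql` as the client (the table, column and environment-variable names are made up for illustration):

```bash
# Keep only the lines that are new or changed, then upsert just those.
sort new old old | uniq -u > changed_or_new.csv

psql "$DATABASE_URL" <<'SQL'
CREATE TEMP TABLE staging (LIKE items INCLUDING DEFAULTS);

-- \copy is psql's client-side COPY; it reads the file from the client machine
\copy staging FROM 'changed_or_new.csv' WITH (FORMAT csv)

-- requires a primary key or unique constraint on items.pk
INSERT INTO items (pk, col_a, col_b)
SELECT pk, col_a, col_b FROM staging
ON CONFLICT (pk) DO UPDATE
    SET col_a = EXCLUDED.col_a,
        col_b = EXCLUDED.col_b;
SQL
```

This way only the rows that actually differ are shipped to the database, so the write load on the live tables is proportional to the real churn rather than to the 5 million rows per file.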
To get the data that needs to be marked as deleted:
extract_pk new new old | sort | uniq -u
This would give you the list of primary keys from the old data that are no longer in the new data, which would then have to be flagged as deleted.
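And a matching sketch for the delete-flag side; `cut -d, -f1` stands in for `extract_pk` here (it assumes the key is the first CSV column with no embedded commas), and the table and column names are again illustrative:

```bash
# Primary keys present in old but absent from new
cut -d, -f1 new new old | sort | uniq -u > gone_pks.txt

psql "$DATABASE_URL" <<'SQL'
CREATE TEMP TABLE gone (pk text);

\copy gone FROM 'gone_pks.txt'

UPDATE items
   SET is_deleted = true
 WHERE pk IN (SELECT pk FROM gone);
SQL
```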
- Thanks! I did actually do this and meant to update the results here soon, but you're right! `diff`/`comm`/`sort` etc. finish crunching the files in around 40 secs, whereas the method with an SQL `JOIN` takes around 23 mins 30 secs, and using `EXCEPT` did not return even after 35 mins so I killed it. For `diff` there is actually a `--speed-large-files` option in GNU diff, though I'm not sure whether it's meant to stop it loading everything into RAM. – KnowSQL, Feb 5, 2022 at 5:08
- E.g. for `EXCEPT`, I created table `t0` using `COPY` from flat-file version 1 and then table `t1` from version 2, which has a few rows different. Then `select * from t0 except select * from t1;`. For joins: `select * from t0 full outer join t1 on row(t0) = row(t1) where t0.primary_key_col is null or t1.primary_key_col is null;`. These were orders of magnitude slower on two flat-file versions, each of size 10G. – KnowSQL, Feb 5, 2022 at 5:14
- Regarding `INSERT ... ON CONFLICT ... DO UPDATE`: if we diff the `csv`s beforehand, collect only those rows that need to be changed, and create SQL queries just for those, will that be less taxing on the tables (which are live and serving the customers' (GET) queries)? `INSERT .. ON CONFLICT .. DO UPDATE` might be wasteful, since it would be done for every record in the new file. So I wanted to know if there is a precedent for such cases, assuming someone has had to do something similar.