I have a large (5-10 GB) binary file on AWS S3 that will require custom parsing, probably in Python. It is essentially a sequential set of millions of dataframes, all having the same structure. What is the best way for me to get this data into a serverless/hosted AWS Aurora PostgreSQL instance? So far I have thought of:

1. I could write to a CSV file and use COPY, but the size would be astronomical.
2. I could send it over the wire in batches of rows.
3. Use AWS Glue, though I'm still learning about that.
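For what it's worth, the parsing side can be done as a streaming generator so the whole 5-10 GB file never sits in memory at once. Here is a minimal sketch assuming a hypothetical fixed-size record layout of (int64, float64) — the real structure of the file will differ, and with boto3 the stream would come from `s3.get_object(...)["Body"]` rather than an in-memory buffer:

```python
import io
import struct

# Hypothetical record layout: each row is a fixed-size struct of
# (int64 id, float64 value). This is an assumption for illustration;
# the real file format will define its own layout.
RECORD = struct.Struct(">qd")  # big-endian int64 + float64

def iter_records(stream, record=RECORD):
    """Yield tuples from a binary stream of fixed-size records."""
    while True:
        chunk = stream.read(record.size)
        if len(chunk) < record.size:
            break
        yield record.unpack(chunk)

# Simulate the S3 object with an in-memory stream for demonstration.
data = b"".join(RECORD.pack(i, i * 0.5) for i in range(3))
rows = list(iter_records(io.BytesIO(data)))
```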
2 Answers
I could write to a CSV file and use COPY, but the size would be astronomical
You could write the CSV data stream to a pipe rather than a file:
generate_csv | psql -c '\copy tablename from stdin'
or
\copy tablename from program 'generate_csv'
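The same streaming idea works directly from Python without shelling out to psql: generate CSV text incrementally and feed it to COPY ... FROM STDIN. A minimal sketch — the psycopg2 call at the end is illustrative only and assumes a reachable Aurora instance with real connection parameters:

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize an iterable of row tuples to CSV text, one chunk per row,
    so the full dataset never has to be materialized on disk."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
        yield buf.getvalue()
        buf.seek(0)
        buf.truncate(0)

csv_text = "".join(rows_to_csv([(1, "a"), (2, "b")]))

# With psycopg2 (not run here), the stream could be sent as:
#   import psycopg2
#   conn = psycopg2.connect(host=..., dbname=..., user=..., password=...)
#   with conn.cursor() as cur:
#       cur.copy_expert("COPY tablename FROM STDIN WITH CSV",
#                       io.StringIO(csv_text))
```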
-
Would this work over the wire to an AWS serverless/hosted instance? I don't own the hosting machine; just edited the question to reflect that. – dnb, Sep 3, 2019 at 14:02
-
It should work over the wire (I haven't used Aurora, but I doubt they gutted the COPY protocol). Of course, psql needs to be given connection parameters that let it connect non-locally (host, port, etc.). And you do need some server which can execute your 'generate_csv' program and psql, but it could be a temporary one stood up just for that purpose. – jjanes, Sep 3, 2019 at 14:19
-
Think I'll end up doing something like this. Interesting performance breakdown here: hakibenita.com/fast-load-data-python-postgresql. – dnb, Sep 5, 2019 at 3:39
Not something I would recommend as a general solution, but I wrote a similar thing that converted data on the fly and wrote it out using the wire format (i.e. the same format that COPY uses). It was in Java and used the internal PGWriter class, so you'd need to find a way to do the same thing in Python.

It's incredibly fast, though: over an order of magnitude faster than inserting with batches. Although I'm not sure whether reWriteBatchedInserts would have made normal batch insertion fast enough.
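For reference, the COPY binary wire format is simple enough to emit from Python with struct alone: an 11-byte signature, a flags field, a header-extension length, then per-tuple field counts and length-prefixed field values, ending with a -1 trailer. This sketch builds a payload for a single int4 column — the single-column schema is an assumption for illustration:

```python
import struct

# COPY BINARY file layout: signature, int32 flags, int32 extension length.
HEADER = b"PGCOPY\n\xff\r\n\x00" + struct.pack(">ii", 0, 0)
TRAILER = struct.pack(">h", -1)  # int16 -1 marks end of data

def copy_binary(values):
    """Build a COPY ... WITH (FORMAT binary) payload for rows of one int4 column."""
    out = [HEADER]
    for v in values:
        out.append(struct.pack(">h", 1))  # number of fields in this tuple
        out.append(struct.pack(">i", 4))  # field length in bytes (-1 would mean NULL)
        out.append(struct.pack(">i", v))  # int4 value, big-endian
    out.append(TRAILER)
    return b"".join(out)

payload = copy_binary([1, 2, 3])
```

A payload like this could then be streamed to the server via psycopg2's `copy_expert("COPY tablename FROM STDIN WITH (FORMAT binary)", ...)`, avoiding CSV serialization entirely.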