I currently have a script that fires a request to an API endpoint which returns a csv.gzip file containing roughly 75,000 rows and 15 columns. I download this file to the web server's disk storage, unzip it to .csv,
then loop through every row and add the data into my database, and finally delete the files saved to disk. The process currently takes between 5 and 10 minutes to complete.
I'm sure there are areas of improvement, but I'm not sure how to implement them. Some of them are:
- Save the CSV data to a variable rather than to disk.
- Bulk import the data into my database.
I'm sure there are other improvements to be made, so any advice would be appreciated.
import base64
import csv
import gzip
import os
import shutil
import xml.etree.ElementTree as ET
from datetime import datetime
from decimal import Decimal

# oauth, realm and the pricing model are defined elsewhere in the project.
response = oauth.get(realm)
content = ET.fromstring(response.content)
coded_string = content.find('.//pricefile')
decoded_string = base64.b64decode(coded_string.text)

# Write the gzipped payload to disk, then decompress it to a plain CSV file.
with open('test.csv.gzip', 'wb') as f:
    f.write(decoded_string)

with gzip.open('test.csv.gzip', 'rb') as f_in:
    with open('test.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Read the CSV back and upsert one row at a time.
with open('test.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        pricing.objects.update_or_create(
            product_id=row[0],
            date=datetime.now(),
            defaults={
                'average_price': Decimal(row[1] or 0),
                'low_price': Decimal(row[2] or 0),
                'high_price': Decimal(row[3] or 0),
                ...
            })

# Clean up the temporary files.
os.remove('test.csv')
os.remove('test.csv.gzip')
1 Answer
Do not send anything to disk (especially not the root of the server; that should have been /tmp).
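If a scratch file really is needed at some point, a minimal sketch using the standard tempfile module keeps it under the system temp directory rather than the server root (the suffix is only illustrative):

import tempfile

# The file lands under the system temp directory (e.g. /tmp) and is
# deleted automatically when the with-block exits.
with tempfile.NamedTemporaryFile(suffix='.csv.gz') as tmp:
    tmp.write(decoded_string)
    tmp.flush()
    # work with tmp.name here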
Once you have your decoded_string, you could use gzip.decompress, which produces the whole decompressed payload in memory as bytes.
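For instance, a minimal sketch, assuming decoded_string holds the gzipped bytes from the base64 step and the CSV is UTF-8 encoded:

import csv
import gzip
import io

# Decompress entirely in memory; gzip.decompress returns bytes.
csv_bytes = gzip.decompress(decoded_string)

# Wrap the decoded text in a file-like object so csv.reader never touches disk.
reader = csv.reader(io.StringIO(csv_bytes.decode('utf-8')))
next(reader)  # skip the header row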
Alternatively, if you have too much memory pressure, wrap it in a BytesIO file-like object and put it through gzip.GzipFile, reading from it incrementally. This is probably what I would do. Ensure that you do not iterate line by line; instead, iterate in blobs on the order of your cache size (try 4 MB and fine-tune from there).
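A sketch of that incremental variant, again assuming decoded_string and starting from a 4 MB block size:

import gzip
import io

BLOCK_SIZE = 4 * 1024 * 1024  # roughly 4 MB; fine-tune against your cache size

# Treat the in-memory gzip payload as a file and stream decompressed bytes
# out in fixed-size blocks rather than line by line.
with gzip.GzipFile(fileobj=io.BytesIO(decoded_string)) as gz:
    while True:
        block = gz.read(BLOCK_SIZE)
        if not block:
            break
        # hand `block` to whatever consumes the CSV stream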
Once you either have a monolithic blob or an incremental iterator, pipe it to COPY FROM STDIN. PostgreSQL natively understands CSV and should do the decoding instead of Python. You may or may not want to drop the results into a temporary table for SQL processing. Psycopg supports this.
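A hedged sketch of the COPY route using psycopg2's copy_expert; the staging table pricing_staging, its column names, and the connection string are placeholders, not names from the question:

import gzip
import io

import psycopg2

# decoded_string is the gzipped payload from the question, decompressed in memory.
csv_text = gzip.decompress(decoded_string).decode('utf-8')

conn = psycopg2.connect('dbname=mydb')  # placeholder connection settings
with conn, conn.cursor() as cur:
    # Let PostgreSQL parse the CSV itself; HEADER makes it skip the first row.
    cur.copy_expert(
        'COPY pricing_staging (product_id, average_price, low_price, high_price) '
        'FROM STDIN WITH (FORMAT csv, HEADER)',
        io.StringIO(csv_text),
    )
    # Merge/upsert from pricing_staging into the real pricing table here if needed.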
Comments:
What database is behind pricing? Does it not have direct CSV import?
PostgreSQL.