I currently have a script that fires a request to an API endpoint which returns a csv.gzip file containing roughly 75,000 rows and 15 columns. I download this file to the web server's disk storage, unzip it to .csv,
then loop through every row and add the data into my database, and finally delete the files saved to disk. The process currently takes between 5 and 10 minutes to complete.
I'm sure there are areas of improvement, but I'm not sure how to implement them. Some of them are:
- Save the CSV data to a variable rather than to disk.
- Bulk import the data into my database.
I'm sure there are other improvements to be made, so any advice would be appreciated.
import base64
import csv
import gzip
import os
import shutil
import xml.etree.ElementTree as ET
from datetime import datetime
from decimal import Decimal

# oauth, realm and the pricing model are defined elsewhere in the project.
response = oauth.get(realm)
content = ET.fromstring(response.content)
coded_string = content.find('.//pricefile')
decoded_string = base64.b64decode(coded_string.text)

# Write the gzipped payload to disk, then decompress it to a plain CSV file.
with open('test.csv.gzip', 'wb') as f:
    f.write(decoded_string)

with gzip.open('test.csv.gzip', 'rb') as f_in:
    with open('test.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Read the CSV back and upsert one row at a time.
with open('test.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        pricing.objects.update_or_create(
            product_id=row[0],
            date=datetime.now(),
            defaults={
                'average_price': Decimal(row[1] or 0),
                'low_price': Decimal(row[2] or 0),
                'high_price': Decimal(row[3] or 0),
                ...
            })

# Clean up the temporary files.
os.remove('test.csv')
os.remove('test.csv.gzip')
1 Answer
Do not send anything to disk (especially not the root of the server; that should have been /tmp).
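If a scratch file really is needed at some point, a minimal sketch using the standard tempfile module keeps it under the system temp directory rather than the server root (the suffix is only illustrative):

import tempfile

# The file lands under the system temp directory (e.g. /tmp) and is
# deleted automatically when the with-block exits.
with tempfile.NamedTemporaryFile(suffix='.csv.gz') as tmp:
    tmp.write(decoded_string)
    tmp.flush()
    # work with tmp.name here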
Once you have your decoded_string, you could use gzip.decompress, which produces the whole decompressed payload in memory as bytes.
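For instance, a minimal sketch, assuming decoded_string holds the gzipped bytes from the base64 step and the CSV is UTF-8 encoded:

import csv
import gzip
import io

# Decompress entirely in memory; gzip.decompress returns bytes.
csv_bytes = gzip.decompress(decoded_string)

# Wrap the decoded text in a file-like object so csv.reader never touches disk.
reader = csv.reader(io.StringIO(csv_bytes.decode('utf-8')))
next(reader)  # skip the header row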
Alternatively, if you have too much memory pressure, wrap it in a BytesIO file-like object and put it through gzip.GzipFile, reading from it incrementally. This is probably what I would do. Ensure that you do not iterate line by line; instead, iterate in blobs on the order of your cache size (try 4 MB and fine-tune from there).
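A sketch of that incremental variant, again assuming decoded_string and starting from a 4 MB block size:

import gzip
import io

BLOCK_SIZE = 4 * 1024 * 1024  # roughly 4 MB; fine-tune against your cache size

# Treat the in-memory gzip payload as a file and stream decompressed bytes
# out in fixed-size blocks rather than line by line.
with gzip.GzipFile(fileobj=io.BytesIO(decoded_string)) as gz:
    while True:
        block = gz.read(BLOCK_SIZE)
        if not block:
            break
        # hand `block` to whatever consumes the CSV stream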
Once you either have a monolithic blob or an incremental iterator, pipe it to COPY FROM STDIN. PostgreSQL natively understands CSV and should do the decoding instead of Python. You may or may not want to drop the results into a temporary table for SQL processing. Psycopg supports this.
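A hedged sketch of the COPY route using psycopg2's copy_expert; the staging table pricing_staging, its column names, and the connection string are placeholders, not names from the question:

import gzip
import io

import psycopg2

# decoded_string is the gzipped payload from the question, decompressed in memory.
csv_text = gzip.decompress(decoded_string).decode('utf-8')

conn = psycopg2.connect('dbname=mydb')  # placeholder connection settings
with conn, conn.cursor() as cur:
    # Let PostgreSQL parse the CSV itself; HEADER makes it skip the first row.
    cur.copy_expert(
        'COPY pricing_staging (product_id, average_price, low_price, high_price) '
        'FROM STDIN WITH (FORMAT csv, HEADER)',
        io.StringIO(csv_text),
    )
    # Merge/upsert from pricing_staging into the real pricing table here if needed.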
Comments:
What database is behind pricing? Does it not have direct CSV import?
PostgreSQL.