Every week I have to upload a CSV file to SQL Server, and I do the job with Python 3. The problem is that the upload takes too long (around 30 minutes), and the table has 49,000 rows and 80 columns.
Here is the relevant piece of the code, where I also have to transform the date format and strip quotes. I have already tried it with pandas, but that took even longer.
import csv
import os
import pyodbc
import time

srv = 'server_name'
db = 'database'
tb = 'table'
conn = pyodbc.connect('Trusted_Connection=yes', DRIVER='{SQL Server}', SERVER=srv, DATABASE=db)
c = conn.cursor()

csvfile = 'file.csv'
with open(csvfile, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    cnt = 0
    for row in reader:
        if cnt > 0:
            for r in range(0, len(row)):
                # this is the part where I transform the date format from dd/mm/yyyy to yyyy-mm-dd
                if (len(row[r]) == 10 or len(row[r]) == 19) and row[r][2] == '/' and row[r][5] == '/':
                    row[r] = row[r][6:10] + '-' + row[r][3:5] + '-' + row[r][0:2]
                # here I replace the quote with nothing, since it is not important for the report
                if row[r].find("'") > 0:
                    row[r] = row[r].replace("'", "")
        # at this part I query the index to increment by 1 on the table
        qcnt = "select count(1) from " + tb
        resq = c.execute(qcnt)
        rq = c.fetchone()
        rq = str(rq[0])
        # here I insert each row into the table that already exists
        insrt = ("insert into " + tb + " values(" + rq + ",'" + ("', '".join(row)) + "')")
        if cnt > 0:
            res = c.execute(insrt)
            conn.commit()
        cnt += 1
conn.close()
Any help will be appreciated. Thanks!
1 Answer
First of all, when in doubt, profile.
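For example, a minimal way to locate the hot spot is to run the script under the profiler (upload.py stands in for the actual script name):

python -m cProfile -s cumulative upload.py

or to bracket the suspect phase with a timer:

import time

start = time.perf_counter()
# ... the CSV loop ...
print(f"csv loop took {time.perf_counter() - start:.1f} s")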
Now a not-so-wild guess. Most of the time is wasted in

qcnt = "select count(1) from " + tb
resq = c.execute(qcnt)
rq = c.fetchone()
rq = str(rq[0])

Re-running count(1) against a growing table once per row means roughly 49,000 full counts. In fact, rq grows by exactly one on each successful insert. Better fetch it once, and increment it locally:
qcnt="select count(1) from "+tb
resq=c.execute(qcnt)
rq=c.fetchone()
for row in csvfile:
....
insert = ....
c.execute(insert)
rq += 1
....
Another guess is that committing each insert separately also does not help with performance. Do it once, after the loop. In any case, you must check the success of each commit.
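A minimal sketch of that shape, reusing the names from the question (the bare rollback is just an illustration; adapt the error handling to your needs):

try:
    for row in reader:
        insrt = ....              # build the statement as before
        c.execute(insrt)
    conn.commit()                 # one commit for the whole file
except pyodbc.Error:
    conn.rollback()               # nothing half-written on failure
    raise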
Notice that if there is more than one client updating the table simultaneously, there is a data race (clients fetching the same rq), both with the original design and with my suggestion. Moving rq into a column of its own may help; I don't know your DB design and requirements.
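If you control the schema, one option (hypothetical; the column name id is my invention) is to let SQL Server hand out the number itself via an identity column, so no client ever fetches rq:

c.execute("alter table " + tb + " add id int identity(1, 1)")
conn.commit()
# afterwards, inserts list the remaining columns explicitly and omit id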
Consider a single insert values, wrapped in a transaction, instead of multiple independent inserts.
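With pyodbc that could look like the sketch below: collect the cleaned rows, then send them all through one parameterized statement (fast_executemany requires pyodbc 4.0.19 or later). As a bonus, parameters make the manual quote handling and string concatenation unnecessary:

rows = []
for row in reader:
    # ... date clean-up as before ...
    rows.append([rq] + row)       # prepend the locally tracked index
    rq += 1

placeholders = ", ".join(["?"] * len(rows[0]))
c.fast_executemany = True         # batches the parameters on the wire
c.executemany("insert into " + tb + " values(" + placeholders + ")", rows)
conn.commit()                     # a single transaction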
Testing for cnt > 0 is also wasteful. Read and discard the first line; then loop over the remaining rows.
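Something like (reusing reader from the question):

next(reader, None)                # discard the header row
for row in reader:
    ....                          # only data rows reach the loop body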
Figuring out at run time whether a field represents a date seems strange. You should know that in advance, from the file's layout.
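For example, if the date-bearing columns are known up front (the indexes below are made up for illustration), the per-field sniffing disappears:

from datetime import datetime

DATE_COLS = (2, 7)                # hypothetical: the columns that hold dates
for i in DATE_COLS:
    # dd/mm/yyyy -> yyyy-mm-dd; the [:10] slice drops a trailing time part,
    # matching what the original transformation did
    row[i] = datetime.strptime(row[i][:10], "%d/%m/%Y").strftime("%Y-%m-%d")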
Thanks man, made two modifications, and elapsed time was reduced by half (mainly on the increment part). Awesome!!! – Rafael Polara, Feb 8, 2019 at 17:17