Every week I have to upload a CSV file to SQL Server, and I do the job with Python 3. The problem is that the upload takes too long (around 30 minutes), and the table has 49,000 rows and 80 columns.
Here is the relevant piece of the code, where I also have to transform the date format and strip quotes. I have already tried it with pandas, but that took even longer.
import csv
import os
import pyodbc
import time

srv = 'server_name'
db = 'database'
tb = 'table'
conn = pyodbc.connect('Trusted_Connection=yes', DRIVER='{SQL Server}', SERVER=srv, DATABASE=db)
c = conn.cursor()

csvfile = 'file.csv'
with open(csvfile, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    cnt = 0
    for row in reader:
        if cnt > 0:
            for r in range(0, len(row)):
                # this is the part where I transform the date format from dd/mm/yyyy to yyyy-mm-dd
                if (len(row[r]) == 10 or len(row[r]) == 19) and row[r][2] == '/' and row[r][5] == '/':
                    row[r] = row[r][6:10] + '-' + row[r][3:5] + '-' + row[r][0:2]
                # here I replace the quote with nothing, since it is not important for the report
                if row[r].find("'") > 0:
                    row[r] = row[r].replace("'", "")
        # at this part I query the index to increment by 1 on the table
        qcnt = "select count(1) from " + tb
        resq = c.execute(qcnt)
        rq = c.fetchone()
        rq = str(rq[0])
        # here I insert each row into the table that already exists
        insrt = ("insert into " + tb + " values(" + rq + ",'" + ("', '".join(row)) + "')")
        if cnt > 0:
            res = c.execute(insrt)
            conn.commit()
        cnt += 1
conn.close()
Any help will be appreciated. Thanks!
1 Answer
First of all, when in doubt, profile.
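For example, a minimal way to locate the hot spot is to run the script under the profiler (upload.py stands in for the actual script name):

python -m cProfile -s cumulative upload.py

or to bracket the suspect phase with a timer:

import time

start = time.perf_counter()
# ... the CSV loop ...
print(f"csv loop took {time.perf_counter() - start:.1f} s")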
Now a not-so-wild guess. Most of the time is wasted in

qcnt = "select count(1) from " + tb
resq = c.execute(qcnt)
rq = c.fetchone()
rq = str(rq[0])

Re-running count(1) against a growing table once per row means roughly 49,000 full counts. In fact, rq grows by exactly one on each successful insert. Better fetch it once, and increment it locally:
qcnt="select count(1) from "+tb
resq=c.execute(qcnt)
rq=c.fetchone()
for row in csvfile:
....
insert = ....
c.execute(insert)
rq += 1
....
Another guess is that committing each insert separately also does not help with performance. Do it once, after the loop. In any case, you must check the success of each commit.
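A minimal sketch of that shape, reusing the names from the question (the bare rollback is just an illustration; adapt the error handling to your needs):

try:
    for row in reader:
        insrt = ....              # build the statement as before
        c.execute(insrt)
    conn.commit()                 # one commit for the whole file
except pyodbc.Error:
    conn.rollback()               # nothing half-written on failure
    raise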
Notice that if there is more than one client updating the table simultaneously, there is a data race (clients fetching the same rq), both with the original design and with my suggestion. Moving rq into a column of its own may help; I don't know your DB design and requirements.
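If you control the schema, one option (hypothetical; the column name id is my invention) is to let SQL Server hand out the number itself via an identity column, so no client ever fetches rq:

c.execute("alter table " + tb + " add id int identity(1, 1)")
conn.commit()
# afterwards, inserts list the remaining columns explicitly and omit id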
Consider a single insert values, wrapped in a transaction, instead of multiple independent inserts.
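With pyodbc that could look like the sketch below: collect the cleaned rows, then send them all through one parameterized statement (fast_executemany requires pyodbc 4.0.19 or later). As a bonus, parameters make the manual quote handling and string concatenation unnecessary:

rows = []
for row in reader:
    # ... date clean-up as before ...
    rows.append([rq] + row)       # prepend the locally tracked index
    rq += 1

placeholders = ", ".join(["?"] * len(rows[0]))
c.fast_executemany = True         # batches the parameters on the wire
c.executemany("insert into " + tb + " values(" + placeholders + ")", rows)
conn.commit()                     # a single transaction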
Testing for cnt > 0 is also wasteful. Read and discard the first line; then loop over the remaining rows.
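Something like (reusing reader from the question):

next(reader, None)                # discard the header row
for row in reader:
    ....                          # only data rows reach the loop body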
Figuring out at run time whether a field represents a date seems strange. You should know that in advance, from the file's layout.
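For example, if the date-bearing columns are known up front (the indexes below are made up for illustration), the per-field sniffing disappears:

from datetime import datetime

DATE_COLS = (2, 7)                # hypothetical: the columns that hold dates
for i in DATE_COLS:
    # dd/mm/yyyy -> yyyy-mm-dd; the [:10] slice drops a trailing time part,
    # matching what the original transformation did
    row[i] = datetime.strptime(row[i][:10], "%d/%m/%Y").strftime("%Y-%m-%d")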
Thanks man, made two modifications, and elapsed time was reduced by half (mainly on the increment part). Awesome!!! – Rafael Polara, Feb 8, 2019 at 17:17