I am trying to speed up the runtime of my program, since the API calls can be made roughly once per second. However, after adding the inserts to a database, 3 minutes turn into 7 minutes over 180 API calls. How can I further improve the inserts into the DB?
I am using a MySQL database, but I could potentially change it if that would increase performance.
url = getSignedUrl(params)
resp = requests.get(url)
responseSoup=BeautifulSoup(resp.text)
#EXTRACT FROM XML
quantity = ['' if product.amount is None else product.amount.text for product in responseSoup.findAll("offersummary")]
price = ['' if product.lowestnewprice is None else product.lowestnewprice.formattedprice.text for product in responseSoup.findAll("offersummary")]
prime = ['' if product.iseligibleforprime is None else product.iseligibleforprime.text for product in responseSoup("offer")]
#UPDATE DB
for zz in zip(asins.split(","), price, quantity, prime):
    cursor = conn.cursor()
    if zz[1] == "Too low to display":
        print "Scraping..."
        zz = scrapeTooLow(zz[0])
    dat = [zz[0], zz[1], zz[2], zz[3]]
    if zz[1] != "":
        priceFormat = float(zz[1].replace("$", "").replace(",", ""))
        mep = round(priceFormat*1.35, 2)
        dp = round(priceFormat*1.38, 2)
        fp = round(priceFormat*1.17, 2)
    else:
        priceFormat = ""
        mep = ""
        dp = ""
        fp = ""
    dat.extend([mep, dp, fp])
    print dat
    try:
        cursor.execute("""INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)""", dat)
        conn.commit()
    except:
        conn.rollback()
    print count, i
1 Answer
For performance questions you should really take a profiler to it and make sure that the Python code isn't accidentally the culprit first.
The numbers you quote for this sound way too high, though of course we can't see the actual data.
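For instance, a quick way to see where the time actually goes is the standard cProfile module; do_update() below is a hypothetical wrapper around your fetch/parse/insert loop, not a function from your code:

# Sketch only: do_update() stands in for a function wrapping the loop from the question.
import cProfile
import pstats

cProfile.run("do_update()", "update.prof")      # write raw timing data to a file

stats = pstats.Stats("update.prof")
stats.sort_stats("cumulative").print_stats(20)  # show the 20 most expensive calls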
Now for the code.
Even though this is running Python 2 it should still be made compatible with Python 3, in particular (a small combined sketch follows this list):
- Use print as a function, that is print(...), for consistency.
- Avoid zip for big lists; instead use izip from itertools.
- Use consistent syntax, and read PEP 8 regarding naming: Python names should be consistently lower case with underscores. Since your names are at least consistent, that is less of an issue.
- A program should have an if __name__ == "__main__": ... block, assuming this is not part of a bigger script where such a block is already in place.
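Put together, those points could look roughly like this (a sketch with placeholder data; izip only exists on Python 2, on Python 3 the built-in zip is already lazy):

from __future__ import print_function   # makes print(...) a real function on Python 2
from itertools import izip              # lazy pairing on Python 2; use plain zip on Python 3

def main():
    asins = ["B000001", "B000002"]       # placeholder data, not your real input
    prices = ["12ドル.50", "8ドル.99"]
    for asin, price in izip(asins, prices):
        print(asin, price)

if __name__ == "__main__":
    main()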
Apart from that, more generally:
- Use more descriptive names. zz, dp, fp, mep and dat tell the reader absolutely nothing about their meaning.
- Random float literals like 1.35, 1.38 and 1.17 should be given names, or moved into a separate helper (a small sketch follows this list).
- Use functions to reuse functionality and to decompose the whole program into clear steps.
- Related things should be placed together, e.g. the cursor call can be moved to the end of the loop where the database functionality is located, instead of at the start. By the way, are you sure that you acquire a new transaction automatically? It would help to know which particular database library is being used.
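For example, the markup factors could become named constants and the derived prices a small helper; the names below are only guesses, since the original code doesn't say what the numbers mean:

# Illustrative only -- what mep/dp/fp and the factors 1.35/1.38/1.17 actually
# represent is known only to the author of the original code.
MEP_FACTOR = 1.35
DP_FACTOR = 1.38
FP_FACTOR = 1.17

def derived_prices(base_price):
    """Return the mep/dp/fp values for a numeric base price."""
    return (round(base_price * MEP_FACTOR, 2),
            round(base_price * DP_FACTOR, 2),
            round(base_price * FP_FACTOR, 2))

print(derived_prices(10.0))   # (13.5, 13.8, 11.7)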
Also:
- The two replace calls that strip "$" and "," can be reduced to one step, e.g. with a regular expression such as re.sub(r"[$,]", "", zz[1]).
- The XML extraction looks pretty verbose. Unfortunately I'm not so sure about the types, though I'd imagine this could be written a bit more clearly with at least a helper function for two of the steps, quantity and prime: def empty_or_text(thing): return '' if thing is None else thing.text.
- It also makes sense to cache things instead of recomputing them all over again; in this case that applies to the findAll calls for "offersummary".
- dat is simply the first four elements of zz, i.e. list(zz[0:4]).
- Since strings are immutable, the assignment of the empty string can be done in one line, e.g. a = b = "".
And finally as has been said, it might make sense to group inserts into larger batches unless you require the transactional behaviour for some reason, or you can't simply rerun a batch.
Also take a look at this Stack Overflow post on handling the insert statement a bit better.
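As a rough sketch of what batching could look like with the connection from your code (assuming a MySQLdb/PyMySQL-style DB-API driver), collect the rows first and hand them to executemany, which these drivers typically turn into a single multi-row INSERT:

# Sketch only: `conn` is the same database connection used in the question,
# and `rows` stands in for the 7-value tuples built in the parsing loop.
rows = [
    ("B000001", "12ドル.50", "3", "1", 16.88, 17.25, 14.63),
    ("B000002", "8ドル.99", "5", "0", 12.14, 12.41, 10.52),
]

cursor = conn.cursor()
try:
    cursor.executemany(
        "INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)",
        rows)
    conn.commit()          # one commit for the whole batch instead of one per row
except Exception:
    conn.rollback()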
That's it. I'm afraid that without a bit more information about the data structures it's hard to infer how the program should be structured instead.
At this point I have the following for the modified code:
from __future__ import print_function
from itertools import izip
import re

def empty_or_text(thing):
    return '' if thing is None else thing.text

url = getSignedUrl(params)
resp = requests.get(url)
responseSoup = BeautifulSoup(resp.text)

# EXTRACT FROM XML
offerSummaries = responseSoup.findAll("offersummary")
quantity = [empty_or_text(product.amount)
            for product in offerSummaries]
prime = [empty_or_text(product.iseligibleforprime)
         for product in responseSoup("offer")]
price = ['' if product.lowestnewprice is None
         else product.lowestnewprice.formattedprice.text
         for product in offerSummaries]

# UPDATE DB
for zz in izip(asins.split(","), price, quantity, prime):
    if zz[1] == "Too low to display":
        print("Scraping...")
        zz = scrapeTooLow(zz[0])
    priceFormat = mep = dp = fp = ""
    if zz[1] != "":
        priceFormat = float(re.sub(r"[$,]", "", zz[1]))
        mep = round(priceFormat * 1.35, 2)
        dp = round(priceFormat * 1.38, 2)
        fp = round(priceFormat * 1.17, 2)
    dat = list(zz[0:4]) + [mep, dp, fp]
    print(dat)
    try:
        conn.cursor().execute("INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)", dat)
        conn.commit()
    except:
        conn.rollback()
    print(count, i)
Comments:
- Could you add the missing pieces (params, url, asins, getSignedUrl, scrapeTooLow) and the relevant imports?
- You can group rows into a single statement, INSERT INTO table VALUES (row1val, row1val), (row2val, row2val), etc., but first you need to actually find out where the bottleneck is.