I am trying to speed up the runtime of my program, since the API calls can be made roughly once per second. However, after adding the inserts to a database, 3 minutes turn into 7 minutes over 180 API calls. How can I further improve the inserts into the DB?
I am using a MySQL database, but I could potentially change it if that would increase performance.
url = getSignedUrl(params)
resp = requests.get(url)
responseSoup=BeautifulSoup(resp.text)
#EXTRACT FROM XML
quantity = ['' if product.amount is None else product.amount.text for product in responseSoup.findAll("offersummary")]
price = ['' if product.lowestnewprice is None else product.lowestnewprice.formattedprice.text for product in responseSoup.findAll("offersummary")]
prime = ['' if product.iseligibleforprime is None else product.iseligibleforprime.text for product in responseSoup("offer")]
#UPDATE DB
for zz in zip(asins.split(","), price, quantity, prime):
    cursor = conn.cursor()
    if zz[1] == "Too low to display":
        print "Scraping..."
        zz = scrapeTooLow(zz[0])
    dat = [zz[0], zz[1], zz[2], zz[3]]
    if zz[1] != "":
        priceFormat = float(zz[1].replace("$", "").replace(",", ""))
        mep = round(priceFormat*1.35, 2)
        dp = round(priceFormat*1.38, 2)
        fp = round(priceFormat*1.17, 2)
    else:
        priceFormat = ""
        mep = ""
        dp = ""
        fp = ""
    dat.extend([mep, dp, fp])
    print dat
    try:
        cursor.execute("""INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)""", dat)
        conn.commit()
    except:
        conn.rollback()
    print count, i
1 Answer
For performance questions you should really take a profiler to it and make sure that the Python code isn't accidentally the culprit first.
The numbers you quote for this sound way too high, though of course we can't see the actual data.
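For instance, a quick way to see where the time actually goes is the standard cProfile module; do_update() below is a hypothetical wrapper around your fetch/parse/insert loop, not a function from your code:

# Sketch only: do_update() stands in for a function wrapping the loop from the question.
import cProfile
import pstats

cProfile.run("do_update()", "update.prof")      # write raw timing data to a file

stats = pstats.Stats("update.prof")
stats.sort_stats("cumulative").print_stats(20)  # show the 20 most expensive calls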
Now for the code.
Even though this is running Python 2 it should still be made compatible with Python 3, in particular (a small combined sketch follows this list):
- Use print as a function, that is print(...), for consistency.
- Avoid zip for big lists; instead use izip from itertools.
- Use consistent syntax, and read PEP 8 regarding naming: Python names should be consistently lower case with underscores. Since your names are at least consistent, that is less of an issue.
- A program should have an if __name__ == "__main__": ... block, assuming this is not part of a bigger script where such a block is already in place.
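Put together, those points could look roughly like this (a sketch with placeholder data; izip only exists on Python 2, on Python 3 the built-in zip is already lazy):

from __future__ import print_function   # makes print(...) a real function on Python 2
from itertools import izip              # lazy pairing on Python 2; use plain zip on Python 3

def main():
    asins = ["B000001", "B000002"]       # placeholder data, not your real input
    prices = ["12ドル.50", "8ドル.99"]
    for asin, price in izip(asins, prices):
        print(asin, price)

if __name__ == "__main__":
    main()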
Apart from that, more generally:
- Use more descriptive names. zz, dp, fp, mep and dat tell the reader absolutely nothing about their meaning.
- Random float literals like 1.35, 1.38 and 1.17 should be given names, or moved into a separate helper (a small sketch follows this list).
- Use functions to reuse functionality and to decompose the whole program into clear steps.
- Related things should be placed together, e.g. the cursor call can be moved to the end of the loop where the database functionality is located, instead of at the start. By the way, are you sure that you acquire a new transaction automatically? It would help to know which particular database library is being used.
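For example, the markup factors could become named constants and the derived prices a small helper; the names below are only guesses, since the original code doesn't say what the numbers mean:

# Illustrative only -- what mep/dp/fp and the factors 1.35/1.38/1.17 actually
# represent is known only to the author of the original code.
MEP_FACTOR = 1.35
DP_FACTOR = 1.38
FP_FACTOR = 1.17

def derived_prices(base_price):
    """Return the mep/dp/fp values for a numeric base price."""
    return (round(base_price * MEP_FACTOR, 2),
            round(base_price * DP_FACTOR, 2),
            round(base_price * FP_FACTOR, 2))

print(derived_prices(10.0))   # (13.5, 13.8, 11.7)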
Also:
- The two replace calls that strip "$" and "," can be reduced to one step, e.g. with a regular expression such as re.sub(r"[$,]", "", zz[1]).
- The XML extraction looks pretty verbose. Unfortunately I'm not so sure about the types, though I'd imagine this could be written a bit more clearly with at least a helper function for two of the steps, quantity and prime: def empty_or_text(thing): return '' if thing is None else thing.text.
- It also makes sense to cache things instead of recomputing them all over again; in this case that applies to the findAll calls for "offersummary".
- dat is simply the first four elements of zz, i.e. list(zz[0:4]).
- Since strings are immutable, the assignment of the empty string can be done in one line, e.g. a = b = "".
And finally as has been said, it might make sense to group inserts into larger batches unless you require the transactional behaviour for some reason, or you can't simply rerun a batch.
Also take a look at this Stack Overflow post on handling the insert statement a bit better.
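As a rough sketch of what batching could look like with the connection from your code (assuming a MySQLdb/PyMySQL-style DB-API driver), collect the rows first and hand them to executemany, which these drivers typically turn into a single multi-row INSERT:

# Sketch only: `conn` is the same database connection used in the question,
# and `rows` stands in for the 7-value tuples built in the parsing loop.
rows = [
    ("B000001", "12ドル.50", "3", "1", 16.88, 17.25, 14.63),
    ("B000002", "8ドル.99", "5", "0", 12.14, 12.41, 10.52),
]

cursor = conn.cursor()
try:
    cursor.executemany(
        "INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)",
        rows)
    conn.commit()          # one commit for the whole batch instead of one per row
except Exception:
    conn.rollback()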
That's it. I'm afraid that without a bit more information about the data structures it's hard to infer how the program should be structured instead.
At this point I have the following for the modified code:
from __future__ import print_function
from itertools import izip
import re

def empty_or_text(thing):
    return '' if thing is None else thing.text

url = getSignedUrl(params)
resp = requests.get(url)
responseSoup = BeautifulSoup(resp.text)

# EXTRACT FROM XML
offerSummaries = responseSoup.findAll("offersummary")
quantity = [empty_or_text(product.amount)
            for product in offerSummaries]
prime = [empty_or_text(product.iseligibleforprime)
         for product in responseSoup("offer")]
price = ['' if product.lowestnewprice is None
         else product.lowestnewprice.formattedprice.text
         for product in offerSummaries]

# UPDATE DB
for zz in izip(asins.split(","), price, quantity, prime):
    if zz[1] == "Too low to display":
        print("Scraping...")
        zz = scrapeTooLow(zz[0])
    priceFormat = mep = dp = fp = ""
    if zz[1] != "":
        priceFormat = float(re.sub(r"[$,]", "", zz[1]))
        mep = round(priceFormat * 1.35, 2)
        dp = round(priceFormat * 1.38, 2)
        fp = round(priceFormat * 1.17, 2)
    dat = list(zz[0:4]) + [mep, dp, fp]
    print(dat)
    try:
        conn.cursor().execute("INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)", dat)
        conn.commit()
    except:
        conn.rollback()
    print(count, i)
Comments:
- Could you add the missing pieces (params, url, asins, getSignedUrl, scrapeTooLow) and the relevant imports?
- You can group rows into a single statement, INSERT INTO table VALUES (row1val, row1val), (row2val, row2val), etc., but first you need to actually find out where the bottleneck is.