I was wondering if the following code could be altered to speed up the scraping process. I know XRP has many transactions, but it takes an hour to get through half a day of transactions (at the busiest times, 2016-2018). My thinking is that the `try`/`except` statements are slowing it down, but I am not sure. I would prefer to get through at least a week of transactions per hour.
The if
and else
statements perform important filtering; that behavior needs to be preserved.
def get_data(last_date): #Date has to be inputted like so: "2018-08-03"
    import requests
    import json
    import os
    import time
    os.chdir("") #Enter in directory
    headers = "TransactionType,Ledger,Tx,TimeStamp,ToAddress,FromAddress,Amount,Currency\n"
    f = open("data.csv","w")
    f.write(headers)
    success = False
    marker = ""
    counter = 1
    while success == False:
        try:
            r = requests.get("https://data.ripple.com/v2/transactions?end=" + last_date + "&descending=true" + "&limit=100" + marker)
            counter += 1
            page = r.text
            jsonPg = json.loads(page)
            transactions = jsonPg["transactions"]
            print(transactions[0]["date"])
            for item in transactions:
                type_of = item["tx"]["TransactionType"]
                tx = item["hash"]
                ledger = item["ledger_index"]
                timestamp = item["date"]
                if type_of == "Payment" and item["meta"]["TransactionResult"] == "tesSUCCESS":
                    if "RippleState" not in str(item):
                        to_address = item["tx"]["Destination"]
                        from_address = item["tx"]["Account"]
                        try:
                            amount = float(item["meta"]["delivered_amount"])/1000000.0
                            currency = "XRP"
                        except:
                            amount = item["meta"]["delivered_amount"]["value"]
                            currency = item["meta"]["delivered_amount"]["currency"]
                        f.write(type_of + "," + str(ledger) + "," + tx + "," + timestamp + "," + to_address + "," +
                                from_address + "," + str(amount) + "," + currency + "\n")
                    elif "RippleState" in str(item):
                        meta = item["meta"]["AffectedNodes"]
                        meta = [i for i in meta if "DeletedNode" not in i]
                        meta = [i for i in meta if "DirectoryNode" not in str(i)]
                        meta = [i for i in meta if "Offer" not in str(i)]
                        for p in meta:
                            if "RippleState" in str(p):
                                try:
                                    amount = float(p["ModifiedNode"]["FinalFields"]["Balance"]["value"]) - float(p["ModifiedNode"]["PreviousFields"]["Balance"]["value"])
                                    currency = p["ModifiedNode"]["FinalFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        to_address = p["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                        from_address = p["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = p["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                        from_address = p["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                except:
                                    amount = float(p["CreatedNode"]["NewFields"]["Balance"]["value"])
                                    currency = p["CreatedNode"]["NewFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        to_address = p["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                                        from_address = p["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = p["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                        from_address = p["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                            elif "RippleState" not in str(p) and "FinalFields" in str(p) and "PreviousFields" in str(p):
                                try:
                                    amount = (float(p["ModifiedNode"]["FinalFields"]["Balance"]) - float(p["ModifiedNode"]["PreviousFields"]["Balance"]))/1000000.0
                                    currency = "XRP"
                                    if amount > 0:
                                        to_address = p["ModifiedNode"]["FinalFields"]["Account"]
                                        from_address = "NEED TO ADD!!"
                                    else:
                                        to_address = "NEED TO ADD!!"
                                        from_address = p["ModifiedNode"]["FinalFields"]["Account"]
                                except:
                                    continue
                            else:
                                continue
                            f.write(type_of + "," + str(ledger) + "," + tx + "," + timestamp + "," + to_address + "," +
                                    from_address + "," + str(amount) + "," + currency + "\n")
                    else:
                        continue
                elif type_of == "OfferCreate" and item["meta"]["TransactionResult"] == "tesSUCCESS":
                    if "RippleState" in str(item) and len(item["meta"]["AffectedNodes"]) >= 5:
                        #print(tx)
                        metaT = item["meta"]["AffectedNodes"]
                        metaT = [i for i in metaT if "DeletedNode" not in i]
                        metaT = [i for i in metaT if "DirectoryNode" not in str(i)]
                        metaT = [i for i in metaT if "Offer" not in str(i)]
                        for q in metaT:
                            if "RippleState" in str(q):
                                try:
                                    amount = float(q["ModifiedNode"]["FinalFields"]["Balance"]["value"]) - float(q["ModifiedNode"]["PreviousFields"]["Balance"]["value"])
                                    currency = q["ModifiedNode"]["FinalFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        #print(tx)
                                        to_address = q["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                        from_address = q["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = q["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                        from_address = q["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                except:
                                    amount = float(q["CreatedNode"]["NewFields"]["Balance"]["value"])
                                    currency = q["CreatedNode"]["NewFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        to_address = q["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                                        from_address = q["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = q["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                        from_address = q["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                            elif "RippleState" not in str(q) and "PreviousFields" in str(q) and "FinalFields" in str(q):
                                try:
                                    amount = (float(q["ModifiedNode"]["FinalFields"]["Balance"]) - float(q["ModifiedNode"]["PreviousFields"]["Balance"]))/1000000.0
                                    currency = "XRP"
                                    if amount > 0:
                                        to_address = q["ModifiedNode"]["FinalFields"]["Account"]
                                        from_address = "NEED TO ADD!!"
                                    else:
                                        to_address = "NEED TO ADD!!"
                                        from_address = q["ModifiedNode"]["FinalFields"]["Account"]
                                except:
                                    continue
                            else:
                                continue
                            f.write(type_of + "," + str(ledger) + "," + tx + "," + timestamp + "," + to_address + "," +
                                    from_address + "," + str(amount) + "," + currency + "\n")
                    else:
                        continue
                else:
                    continue
            if page.find("marker") == -1:
                success = True
                #print("YES!")
            else:
                marker = "&marker=" + jsonPg["marker"]
                #print("marker")
        except:
            print("Slept!")
            time.sleep(3)
    print("Worked!")
    f.close()
get_data("2016-05-01")
2 Answers
Creating a requests `Session` instead of calling `requests.get` each time should speed it up: a `Session` reuses the underlying TCP connection (HTTP keep-alive), so the connection-setup cost is paid once rather than on every request. For example, instead of:

    while ...:
        r = requests.get('http://...')

use:

    s = requests.Session()
    while ...:
        r = s.get('http://...')
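Applied to the question's endpoint, that might look like the sketch below. The helper names and the use of the `params` argument are my additions, not part of the original code; `params` lets requests build and encode the query string instead of concatenating it by hand.

```python
import requests

# A Session reuses the underlying TCP connection (HTTP keep-alive), so
# repeated requests to the same host skip the per-request handshake cost.
session = requests.Session()
BASE_URL = "https://data.ripple.com/v2/transactions"  # endpoint from the question

def build_params(end_date, marker=None):
    # Hypothetical helper: build the query parameters as a dict and let
    # requests encode them, instead of string concatenation.
    params = {"end": end_date, "descending": "true", "limit": 100}
    if marker:
        params["marker"] = marker
    return params

def fetch_page(end_date, marker=None):
    r = session.get(BASE_URL, params=build_params(end_date, marker))
    r.raise_for_status()  # surface HTTP errors instead of silently retrying
    return r.json()
```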
This is a very daunting piece of code, with a lot of places for improvement. I wouldn't even start on speeding up the scraping process until the readability of the code is fixed.
Separate
Separate the different tasks the code needs to do into different pieces of code, or methods.
Your code:
- downloads a file
- parses it
- saves the result to a csv file
- starts again with the next file
If needed, each of those parts can be split even further (the different transaction types for example)
If you separate the program like this, it is also easier to work on the pieces that take the longest. At the moment it is impossible to tell whether the downloading, the parsing, or the writing is what slows everything down. Once separated, you can profile each stage, and even start parallelizing things.
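As a sketch of what the parsing and saving stages could look like once separated (function names are illustrative, and only the simplest case from the question, a direct XRP payment, is handled here):

```python
import csv

FIELDNAMES = ["TransactionType", "Ledger", "Tx", "TimeStamp",
              "ToAddress", "FromAddress", "Amount", "Currency"]

def parse_payment(item):
    # Illustrative parser for the simplest case in the question: a direct
    # XRP payment, where delivered_amount is a string of drops.
    tx, meta = item["tx"], item["meta"]
    return {
        "TransactionType": tx["TransactionType"],
        "Ledger": item["ledger_index"],
        "Tx": item["hash"],
        "TimeStamp": item["date"],
        "ToAddress": tx["Destination"],
        "FromAddress": tx["Account"],
        "Amount": float(meta["delivered_amount"]) / 1000000.0,  # drops -> XRP
        "Currency": "XRP",
    }

def save_rows(rows, path):
    # csv.DictWriter handles the header and quoting, replacing the
    # hand-built comma-joined strings.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

With functions like these, you can save one API response to disk and feed it to `parse_payment` directly, without re-downloading anything while you test.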
Another advantage is that you can test individual pieces. For example, save a response to a file on your PC and parse that instead of downloading it each time, to check whether the program behaves as expected.
Code quality
Apart from that, there are a lot of other things to improve:
- Open files with a `with` block.
- Never use a blank `except:`; always be specific about what kind of exception you want to catch.
- Try to limit line length to 80-120 characters. Putting pieces of code into methods will help tremendously here.
- Don't hand-build the URL; let requests do that for you.
- Why all those conversions to `str`?
If you start by addressing these issues, you can then begin working on the performance.
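Putting several of these points together, the download loop might be restructured roughly like the sketch below. The function names are illustrative, and the retry-after-a-pause behavior mirrors the original `time.sleep(3)`; the parsing step is deliberately left out.

```python
import json
import time

import requests

def next_marker(payload):
    # The data API includes a "marker" key only when more pages remain,
    # so a missing key means we have reached the last page.
    return payload.get("marker")

def download_all(end_date, out_path):
    url = "https://data.ripple.com/v2/transactions"
    # with-blocks close both the session and the file, even on errors
    with requests.Session() as session, open(out_path, "w") as f:
        marker = None
        while True:
            params = {"end": end_date, "descending": "true", "limit": 100}
            if marker:
                params["marker"] = marker
            try:
                payload = session.get(url, params=params, timeout=30).json()
            except (requests.RequestException, ValueError):
                # Catch only the failures we expect (network errors,
                # malformed JSON), then retry after a pause.
                time.sleep(3)
                continue
            for item in payload.get("transactions", []):
                f.write(json.dumps(item) + "\n")  # parsing step omitted here
            marker = next_marker(payload)
            if marker is None:
                break
```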