I was wondering if the following code could be altered to speed up the scraping process. I know XRP has many transactions, but it takes an hour to get through half a day of transactions (at the busiest times, 2016-2018). My thinking is that the `try`/`except` statements are slowing it down, but I am not sure. I would prefer to get through at least a week of transactions per hour.
The if
and else
statements perform important filtering; that behavior needs to be preserved.
def get_data(last_date): #Date has to be inputted like so: "2018-08-03"
    import requests
    import json
    import os
    import time
    os.chdir("") #Enter in directory
    headers = "TransactionType,Ledger,Tx,TimeStamp,ToAddress,FromAddress,Amount,Currency\n"
    f = open("data.csv","w")
    f.write(headers)
    success = False
    marker = ""
    counter = 1
    while success == False:
        try:
            r = requests.get("https://data.ripple.com/v2/transactions?end=" + last_date + "&descending=true" + "&limit=100" + marker)
            counter += 1
            page = r.text
            jsonPg = json.loads(page)
            transactions = jsonPg["transactions"]
            print(transactions[0]["date"])
            for item in transactions:
                type_of = item["tx"]["TransactionType"]
                tx = item["hash"]
                ledger = item["ledger_index"]
                timestamp = item["date"]
                if type_of == "Payment" and item["meta"]["TransactionResult"] == "tesSUCCESS":
                    if "RippleState" not in str(item):
                        to_address = item["tx"]["Destination"]
                        from_address = item["tx"]["Account"]
                        try:
                            amount = float(item["meta"]["delivered_amount"])/1000000.0
                            currency = "XRP"
                        except:
                            amount = item["meta"]["delivered_amount"]["value"]
                            currency = item["meta"]["delivered_amount"]["currency"]
                        f.write(type_of + "," + str(ledger) + "," + tx + "," + timestamp + "," + to_address + "," +
                                from_address + "," + str(amount) + "," + currency + "\n")
                    elif "RippleState" in str(item):
                        meta = item["meta"]["AffectedNodes"]
                        meta = [i for i in meta if "DeletedNode" not in i]
                        meta = [i for i in meta if "DirectoryNode" not in str(i)]
                        meta = [i for i in meta if "Offer" not in str(i)]
                        for p in meta:
                            if "RippleState" in str(p):
                                try:
                                    amount = float(p["ModifiedNode"]["FinalFields"]["Balance"]["value"]) - float(p["ModifiedNode"]["PreviousFields"]["Balance"]["value"])
                                    currency = p["ModifiedNode"]["FinalFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        to_address = p["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                        from_address = p["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = p["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                        from_address = p["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                except:
                                    amount = float(p["CreatedNode"]["NewFields"]["Balance"]["value"])
                                    currency = p["CreatedNode"]["NewFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        to_address = p["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                                        from_address = p["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = p["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                        from_address = p["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                            elif "RippleState" not in str(p) and "FinalFields" in str(p) and "PreviousFields" in str(p):
                                try:
                                    amount = (float(p["ModifiedNode"]["FinalFields"]["Balance"]) - float(p["ModifiedNode"]["PreviousFields"]["Balance"]))/1000000.0
                                    currency = "XRP"
                                    if amount > 0:
                                        to_address = p["ModifiedNode"]["FinalFields"]["Account"]
                                        from_address = "NEED TO ADD!!"
                                    else:
                                        to_address = "NEED TO ADD!!"
                                        from_address = p["ModifiedNode"]["FinalFields"]["Account"]
                                except:
                                    continue
                            else:
                                continue
                            f.write(type_of + "," + str(ledger) + "," + tx + "," + timestamp + "," + to_address + "," +
                                    from_address + "," + str(amount) + "," + currency + "\n")
                    else:
                        continue
                elif type_of == "OfferCreate" and item["meta"]["TransactionResult"] == "tesSUCCESS":
                    if "RippleState" in str(item) and len(item["meta"]["AffectedNodes"]) >= 5:
                        #print(tx)
                        metaT = item["meta"]["AffectedNodes"]
                        metaT = [i for i in metaT if "DeletedNode" not in i]
                        metaT = [i for i in metaT if "DirectoryNode" not in str(i)]
                        metaT = [i for i in metaT if "Offer" not in str(i)]
                        for q in metaT:
                            if "RippleState" in str(q):
                                try:
                                    amount = float(q["ModifiedNode"]["FinalFields"]["Balance"]["value"]) - float(q["ModifiedNode"]["PreviousFields"]["Balance"]["value"])
                                    currency = q["ModifiedNode"]["FinalFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        #print(tx)
                                        to_address = q["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                        from_address = q["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = q["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                        from_address = q["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                except:
                                    amount = float(q["CreatedNode"]["NewFields"]["Balance"]["value"])
                                    currency = q["CreatedNode"]["NewFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        to_address = q["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                                        from_address = q["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = q["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                        from_address = q["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                            elif "RippleState" not in str(q) and "PreviousFields" in str(q) and "FinalFields" in str(q):
                                try:
                                    amount = (float(q["ModifiedNode"]["FinalFields"]["Balance"]) - float(q["ModifiedNode"]["PreviousFields"]["Balance"]))/1000000.0
                                    currency = "XRP"
                                    if amount > 0:
                                        to_address = q["ModifiedNode"]["FinalFields"]["Account"]
                                        from_address = "NEED TO ADD!!"
                                    else:
                                        to_address = "NEED TO ADD!!"
                                        from_address = q["ModifiedNode"]["FinalFields"]["Account"]
                                except:
                                    continue
                            else:
                                continue
                            f.write(type_of + "," + str(ledger) + "," + tx + "," + timestamp + "," + to_address + "," +
                                    from_address + "," + str(amount) + "," + currency + "\n")
                    else:
                        continue
                else:
                    continue
            if page.find("marker") == -1:
                success = True
                #print("YES!")
            else:
                marker = "&marker=" + jsonPg["marker"]
                #print("marker")
        except:
            print("Slept!")
            time.sleep(3)
    print("Worked!")
    f.close()
get_data("2016-05-01")
2 Answers
Creating a requests `Session` instead of calling `requests.get` each time should speed it up: a `Session` reuses the underlying TCP connection (HTTP keep-alive), so the connection-setup cost is paid once rather than on every request. For example, instead of:

    while ...:
        r = requests.get('http://...')

use:

    s = requests.Session()
    while ...:
        r = s.get('http://...')
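Applied to the question's endpoint, that might look like the sketch below. The helper names and the use of the `params` argument are my additions, not part of the original code; `params` lets requests build and encode the query string instead of concatenating it by hand.

```python
import requests

# A Session reuses the underlying TCP connection (HTTP keep-alive), so
# repeated requests to the same host skip the per-request handshake cost.
session = requests.Session()
BASE_URL = "https://data.ripple.com/v2/transactions"  # endpoint from the question

def build_params(end_date, marker=None):
    # Hypothetical helper: build the query parameters as a dict and let
    # requests encode them, instead of string concatenation.
    params = {"end": end_date, "descending": "true", "limit": 100}
    if marker:
        params["marker"] = marker
    return params

def fetch_page(end_date, marker=None):
    r = session.get(BASE_URL, params=build_params(end_date, marker))
    r.raise_for_status()  # surface HTTP errors instead of silently retrying
    return r.json()
```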
This is a very daunting piece of code, with a lot of places for improvement. I wouldn't even start on speeding up the scraping process until the readability of the code is fixed.
Separate
Separate the different tasks the code needs to do into different pieces of code, or methods.
Your code:
- downloads a file
- parses it
- saves the result to a csv file
- starts again with the next file
If needed, each of those parts can be split even further (the different transaction types for example)
If you separate the program like this, it is also easier to work on the pieces that take the longest. At the moment it is impossible to tell whether the downloading, the parsing, or the writing is what slows everything down. Once separated, you can profile each stage, and even start parallelizing things.
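As a sketch of what the parsing and saving stages could look like once separated (function names are illustrative, and only the simplest case from the question, a direct XRP payment, is handled here):

```python
import csv

FIELDNAMES = ["TransactionType", "Ledger", "Tx", "TimeStamp",
              "ToAddress", "FromAddress", "Amount", "Currency"]

def parse_payment(item):
    # Illustrative parser for the simplest case in the question: a direct
    # XRP payment, where delivered_amount is a string of drops.
    tx, meta = item["tx"], item["meta"]
    return {
        "TransactionType": tx["TransactionType"],
        "Ledger": item["ledger_index"],
        "Tx": item["hash"],
        "TimeStamp": item["date"],
        "ToAddress": tx["Destination"],
        "FromAddress": tx["Account"],
        "Amount": float(meta["delivered_amount"]) / 1000000.0,  # drops -> XRP
        "Currency": "XRP",
    }

def save_rows(rows, path):
    # csv.DictWriter handles the header and quoting, replacing the
    # hand-built comma-joined strings.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

With functions like these, you can save one API response to disk and feed it to `parse_payment` directly, without re-downloading anything while you test.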
Another advantage is that you can test individual pieces. For example, save a response to a file on your PC and parse that instead of downloading it each time, to check whether the program behaves as expected.
Code quality
Apart from that, there are a lot of other things to improve:
- Open files with a `with` block.
- Never use a blank `except:`; always be specific about what kind of exception you want to catch.
- Try to limit line length to 80-120 characters. Putting pieces of code into methods will help tremendously here.
- Don't hand-build the URL; let requests do that for you.
- Why all those conversions to `str`?
If you start by addressing these issues, you can then begin working on the performance.
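Putting several of these points together, the download loop might be restructured roughly like the sketch below. The function names are illustrative, and the retry-after-a-pause behavior mirrors the original `time.sleep(3)`; the parsing step is deliberately left out.

```python
import json
import time

import requests

def next_marker(payload):
    # The data API includes a "marker" key only when more pages remain,
    # so a missing key means we have reached the last page.
    return payload.get("marker")

def download_all(end_date, out_path):
    url = "https://data.ripple.com/v2/transactions"
    # with-blocks close both the session and the file, even on errors
    with requests.Session() as session, open(out_path, "w") as f:
        marker = None
        while True:
            params = {"end": end_date, "descending": "true", "limit": 100}
            if marker:
                params["marker"] = marker
            try:
                payload = session.get(url, params=params, timeout=30).json()
            except (requests.RequestException, ValueError):
                # Catch only the failures we expect (network errors,
                # malformed JSON), then retry after a pause.
                time.sleep(3)
                continue
            for item in payload.get("transactions", []):
                f.write(json.dumps(item) + "\n")  # parsing step omitted here
            marker = next_marker(payload)
            if marker is None:
                break
```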