\$\begingroup\$

I was wondering if the following code could be altered to speed up the scraping process. I know XRP has many transactions, but it takes an hour to go through half a day of transactions (at the busiest times, 2016-2018). My thinking is that the try/except statements are slowing it down, but I am not sure. I would like to get through at least a week of transactions per hour.

The if and else statements perform important filtering; that behavior needs to be preserved.

def get_data(last_date): #Date has to be inputted like so: "2018-08-03"
    import requests
    import json
    import os
    import time
    os.chdir("") #Enter in directory
    headers = "TransactionType,Ledger,Tx,TimeStamp,ToAddress,FromAddress,Amount,Currency\n"
    f = open("data.csv","w")
    f.write(headers)
    success = False
    marker = ""
    counter = 1
    while success == False:
        try:
            r = requests.get("https://data.ripple.com/v2/transactions?end=" + last_date + "&descending=true" + "&limit=100" + marker)
            counter += 1
            page = r.text
            jsonPg = json.loads(page)
            transactions = jsonPg["transactions"]
            print(transactions[0]["date"])
            for item in transactions:
                type_of = item["tx"]["TransactionType"]
                tx = item["hash"]
                ledger = item["ledger_index"]
                timestamp = item["date"]
                if type_of == "Payment" and item["meta"]["TransactionResult"] == "tesSUCCESS":
                    if "RippleState" not in str(item):
                        to_address = item["tx"]["Destination"]
                        from_address = item["tx"]["Account"]
                        try:
                            amount = float(item["meta"]["delivered_amount"])/1000000.0
                            currency = "XRP"
                        except:
                            amount = item["meta"]["delivered_amount"]["value"]
                            currency = item["meta"]["delivered_amount"]["currency"]
                        f.write(type_of + "," + str(ledger) + "," + tx + "," + timestamp + "," + to_address + "," +
                                from_address + "," + str(amount) + "," + currency + "\n")
                    elif "RippleState" in str(item):
                        meta = item["meta"]["AffectedNodes"]
                        meta = [i for i in meta if "DeletedNode" not in i]
                        meta = [i for i in meta if "DirectoryNode" not in str(i)]
                        meta = [i for i in meta if "Offer" not in str(i)]
                        for p in meta:
                            if "RippleState" in str(p):
                                try:
                                    amount = float(p["ModifiedNode"]["FinalFields"]["Balance"]["value"]) - float(p["ModifiedNode"]["PreviousFields"]["Balance"]["value"])
                                    currency = p["ModifiedNode"]["FinalFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        to_address = p["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                        from_address = p["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = p["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                        from_address = p["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                except:
                                    amount = float(p["CreatedNode"]["NewFields"]["Balance"]["value"])
                                    currency = p["CreatedNode"]["NewFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        to_address = p["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                                        from_address = p["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = p["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                        from_address = p["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                            elif "RippleState" not in str(p) and "FinalFields" in str(p) and "PreviousFields" in str(p):
                                try:
                                    amount = (float(p["ModifiedNode"]["FinalFields"]["Balance"]) - float(p["ModifiedNode"]["PreviousFields"]["Balance"]))/1000000.0
                                    currency = "XRP"
                                    if amount > 0:
                                        to_address = p["ModifiedNode"]["FinalFields"]["Account"]
                                        from_address = "NEED TO ADD!!"
                                    else:
                                        to_address = "NEED TO ADD!!"
                                        from_address = p["ModifiedNode"]["FinalFields"]["Account"]
                                except:
                                    continue
                            else:
                                continue
                            f.write(type_of + "," + str(ledger) + "," + tx + "," + timestamp + "," + to_address + "," +
                                    from_address + "," + str(amount) + "," + currency + "\n")
                    else:
                        continue
                elif type_of == "OfferCreate" and item["meta"]["TransactionResult"] == "tesSUCCESS":
                    if "RippleState" in str(item) and len(item["meta"]["AffectedNodes"]) >= 5:
                        #print(tx)
                        metaT = item["meta"]["AffectedNodes"]
                        metaT = [i for i in metaT if "DeletedNode" not in i]
                        metaT = [i for i in metaT if "DirectoryNode" not in str(i)]
                        metaT = [i for i in metaT if "Offer" not in str(i)]
                        for q in metaT:
                            if "RippleState" in str(q):
                                try:
                                    amount = float(q["ModifiedNode"]["FinalFields"]["Balance"]["value"]) - float(q["ModifiedNode"]["PreviousFields"]["Balance"]["value"])
                                    currency = q["ModifiedNode"]["FinalFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        #print(tx)
                                        to_address = q["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                        from_address = q["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = q["ModifiedNode"]["FinalFields"]["HighLimit"]["issuer"]
                                        from_address = q["ModifiedNode"]["FinalFields"]["LowLimit"]["issuer"]
                                except:
                                    amount = float(q["CreatedNode"]["NewFields"]["Balance"]["value"])
                                    currency = q["CreatedNode"]["NewFields"]["Balance"]["currency"]
                                    if amount > 0:
                                        to_address = q["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                                        from_address = q["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                    else:
                                        to_address = q["CreatedNode"]["NewFields"]["HighLimit"]["issuer"]
                                        from_address = q["CreatedNode"]["NewFields"]["LowLimit"]["issuer"]
                            elif "RippleState" not in str(q) and "PreviousFields" in str(q) and "FinalFields" in str(q):
                                try:
                                    amount = (float(q["ModifiedNode"]["FinalFields"]["Balance"]) - float(q["ModifiedNode"]["PreviousFields"]["Balance"]))/1000000.0
                                    currency = "XRP"
                                    if amount > 0:
                                        to_address = q["ModifiedNode"]["FinalFields"]["Account"]
                                        from_address = "NEED TO ADD!!"
                                    else:
                                        to_address = "NEED TO ADD!!"
                                        from_address = q["ModifiedNode"]["FinalFields"]["Account"]
                                except:
                                    continue
                            else:
                                continue
                            f.write(type_of + "," + str(ledger) + "," + tx + "," + timestamp + "," + to_address + "," +
                                    from_address + "," + str(amount) + "," + currency + "\n")
                    else:
                        continue
                else:
                    continue
            if page.find("marker") == -1:
                success = True
                #print("YES!")
            else:
                marker = "&marker=" + jsonPg["marker"]
                #print("marker
        except:
            print("Slept!")
            time.sleep(3)
    print("Worked!")
    f.close()
get_data("2016-05-01")
asked Nov 1, 2018 at 19:51
\$\endgroup\$
  • Can you explain the time.sleep(3)? Commented Nov 2, 2018 at 17:13
  • Welcome to Code Review! Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. Commented Nov 2, 2018 at 19:22

2 Answers

\$\begingroup\$

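Creating a requests.Session instead of calling requests.get directly should speed this up, because a session reuses the underlying connection instead of opening a new one for every request. For example, instead of: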

while ...:
    r = requests.get('http://...')

use:

s = requests.Session()
while ...:
    r = s.get('http://...')
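
A slightly fuller sketch of the same idea, assuming the endpoint and query parameters shown in the question. The names `iter_pages` and `main` are hypothetical, the 30-second timeout is an arbitrary choice, and the retry-and-sleep behaviour of the original loop is left out for brevity:

```python
API_URL = "https://data.ripple.com/v2/transactions"  # endpoint from the question


def iter_pages(session, end_date, limit=100):
    """Yield each decoded JSON page, following the "marker" field."""
    params = {"end": end_date, "descending": "true", "limit": limit}
    while True:
        # Every request goes through the same session, so the connection
        # can be reused instead of being set up again for each page.
        response = session.get(API_URL, params=params, timeout=30)
        response.raise_for_status()
        page = response.json()
        yield page
        if "marker" not in page:
            break  # no marker means this was the last page
        params["marker"] = page["marker"]


def main():
    # Imported here so iter_pages can also be exercised with a stub session.
    import requests

    with requests.Session() as session:
        for page in iter_pages(session, "2016-05-01"):
            print(page["transactions"][0]["date"])
```

Because `iter_pages` only needs an object with a `get` method, it can be tried out with a stand-in session before pointing it at the real API.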
Vogel612
answered Nov 1, 2018 at 20:33
\$\endgroup\$
\$\begingroup\$

This is a very daunting piece of code, with a lot of room for improvement. I wouldn't even start on speeding up the scraping process until the readability of the code is fixed.

Separate

Split the different tasks the code performs into separate functions.

Your code:

  1. downloads a file
  2. parses it
  3. saves the result to a csv file
  4. starts again with the next file

If needed, each of those parts can be split even further (by transaction type, for example).

If you separate the program like this, it is also easier to work on the pieces that take the longest. At the moment it is impossible to tell whether the downloading, the parsing or the writing is what slows everything down. If you separate things well, you can even start parallelizing them.
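
Once the stages are separate calls, one possible way to see which of them dominates is to accumulate wall-clock time per stage. This is only a sketch; `timed` and `timings` are hypothetical names, not part of any library:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Total seconds spent in each named stage, summed over all pages.
timings = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the time spent inside the with-block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start
```

Wrapping each stage, for instance `with timed("download"):` around the request and `with timed("parse"):` around the loop over transactions, and printing `dict(timings)` after a few hundred pages should show where the hour actually goes.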

Another advantage is that you can test individual pieces. For example, you can save a response to a file on your PC and parse that instead of downloading it each time, to check whether the program behaves as expected.
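
As an illustration of testing the parsing step offline, here is a minimal sketch. It assumes a page from the API was saved to disk as JSON earlier; `load_saved_page` and `successful_payments` are hypothetical helpers that cover only the first filter from the question, not the full logic:

```python
import json


def load_saved_page(path):
    """Read one API response that was saved to disk earlier."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def successful_payments(page):
    """Return the hashes of the successful Payment transactions in a page."""
    return [
        item["hash"]
        for item in page["transactions"]
        if item["tx"]["TransactionType"] == "Payment"
        and item["meta"]["TransactionResult"] == "tesSUCCESS"
    ]
```

A function like this takes a plain dict and returns plain data, so it can be checked against a handful of saved pages without touching the network or the CSV file.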

Code quality

Apart from that, there are a lot of other things to improve:

  • open files with a with block
  • never use a bare except: clause; always be specific about which exception you want to catch
  • try to limit the line length to 80-120 characters. Moving pieces of code into functions will help tremendously here
  • don't hand-code the URL; let requests build it for you
  • why all those conversions to str?
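
A few of these points can be sketched together for the writing step. This assumes, as the question's code implies, that `delivered_amount` is either a string of drops for XRP or a dict with `value` and `currency` keys. `delivered` and `write_rows` are hypothetical names:

```python
import csv

FIELDS = ["TransactionType", "Ledger", "Tx", "TimeStamp",
          "ToAddress", "FromAddress", "Amount", "Currency"]


def delivered(meta):
    """Return (amount, currency) from a transaction's meta block."""
    value = meta["delivered_amount"]
    if isinstance(value, dict):
        # Issued currency: keep the value as reported, as the original does.
        return value["value"], value["currency"]
    # Plain XRP amounts are given in drops (millionths of an XRP).
    return float(value) / 1_000_000, "XRP"


def write_rows(path, rows):
    """Write the header and all rows; the with-block closes the file."""
    with open(path, "w", newline="") as handle:
        # lineterminator="\n" matches the line endings the original wrote.
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(FIELDS)
        writer.writerows(rows)
```

The explicit `isinstance` check replaces the bare try/except around `float(...)`, and `csv.writer` converts numbers to text itself, so the manual `str()` calls and string concatenation are no longer needed. Where a try block is still wanted, naming the exception (for example `KeyError` for a missing field) keeps unrelated bugs from being silently swallowed.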

Once you have addressed these first issues, you can start working on the performance.

answered Nov 6, 2018 at 14:01
\$\endgroup\$
