I wrote a script which is supposed to make requests to an API to gather a lot of data. There is a limit on the requests I can make: about 30 000 in 8 hours; otherwise I will get banned for a significant time. Each object I get from the API is uniquely identified by a hash. Each API call returns the data I need plus the hash of the next object to fetch. So I start with a hash I already have, make a request, parse the result, and obtain the hash of the next object. Repeat. I also log the time of every 20th request I make, so I can keep track of how many requests I made in the last 8 hours.
Here is my code:
```python
import ujson
import requests
import time
import os
import cProfile

# 'logs' has all the requests in the last 8 hours

# the snippet of code which does all of the hard work
for i in range(len(logs) * requests_per_log, maximum_requests):  # to stay in max requests range
    r = requests.get(url + current_hash)
    block = ujson.loads(r.text)  # use ujson because it is faster
    block_timestamp_str = format_timestamp(block['time'])
    block_index_str = str(block['block_index'])

    # only log each 20th request
    if i % 20 == 0:
        f_logs.write(str(time.time()) + '\n')  # log the time when the request was made

    f = open('data/' + block_index_str + "__" + block_timestamp_str, "w+")
    block_string = parse_transactions_in_block(block)
    current_hash = block['prev_block']
    f.write(block_string)
    f.close()

# record the hash the script stopped at
f_current_hash.write(current_hash)
```
Some of the functions it uses:

```python
def parse_transactions_in_block(block):
    block_string = ''
    for transaction in block['tx']:
        block_string += str(transaction['time']) + ',' + str(transaction['tx_index']) \
                        + ',' + str(calc_total_input(transaction)) + '\n'
    return block_string


def calc_total_input(transaction):
    total_input = 0
    for input in transaction['out']:
        total_input += input['value']
    return total_input
```
```python
# this is the time format I was asked to keep my data in
def format_timestamp(unix_time):
    return time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime(unix_time))
```
There is a lot of data to go through, so I want it to be as fast as possible; my previous iterations took a while to run. I am running it on Google Compute Engine using a Linux distribution. Any ideas how I can make this work faster? I don't have enough experience with concurrent computing in Python, so I am just looking for a way to optimize what I have, without concurrency.
1 Answer
Here are some of the improvements to the current "synchronous" approach:
- maintain an instance of `requests.Session()` - this would improve performance:

  > if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase

  ```python
  with requests.Session() as session:
      r = session.get(url + current_hash)
  ```

  (a fuller sketch with the session wrapping the whole loop follows after this list)
- use the `.json()` method to get the JSON object directly out of a response (you would first need to adjust the "json model" to use `ujson`, source):

  ```python
  requests.models.json = ujson

  # ...
  block = r.json()
  ```

  (shown in context in the second sketch after the list)
- don't open/close files in the main loop - collect the data into memory and dump it afterwards. If the data does not fit in memory, use pagination - write to the output file(s) in chunks (see the buffered-write sketch after this list)
- `parse_transactions_in_block()` may be rewritten using `str.join()` and `str.format()`:

  ```python
  def parse_transactions_in_block(block):
      return ''.join("{time},{index},{total}\n".format(time=transaction['time'],
                                                       index=transaction['tx_index'],
                                                       total=calc_total_input(transaction))
                     for transaction in block['tx'])
  ```
- `calc_total_input()` can be rewritten using `sum()`:

  ```python
  def calc_total_input(transaction):
      return sum(input['value'] for input in transaction['out'])
  ```
- try the PyPy interpreter - it can give you a performance boost with no changes to the code (well, I doubt `ujson` will still work, but `simplejson` might be a good alternative in this case); see the import-fallback sketch after this list
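
To make the first point concrete: the session only pays off if it wraps the whole loop, not a single call. Here is a minimal sketch reusing the question's hash-chain layout; the `fetch_chain()` wrapper and its arguments are hypothetical names, not part of the original script:

```python
import requests
import ujson

def fetch_chain(url, start_hash, max_requests):
    """Walk the hash chain, reusing one TCP connection for every request."""
    blocks = []
    current_hash = start_hash
    with requests.Session() as session:  # created once, outside the loop
        for _ in range(max_requests):
            r = session.get(url + current_hash)
            block = ujson.loads(r.text)
            blocks.append(block)
            current_hash = block['prev_block']
    return blocks
```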
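
For the second point, the patch would be applied once, next to the imports. The `fetch_block()` helper and the `raise_for_status()` call below are my additions (the original loop has no error handling), and whether the patched attribute is the one requests consults internally depends on the installed version, so it is worth verifying:

```python
import ujson
import requests

requests.models.json = ujson  # apply the patch once, right after the imports

def fetch_block(session, url, block_hash):
    """Fetch one block and decode it through the (patched) .json() call."""
    r = session.get(url + block_hash)
    r.raise_for_status()  # fail fast on an HTTP error instead of parsing an error page
    return r.json()
```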
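
For the buffered-write point, a minimal sketch assuming the per-block strings fit comfortably in memory; `write_blocks_in_chunks()`, `_flush()` and the `chunk_size` knob are hypothetical names used only to illustrate the idea:

```python
import os

def write_blocks_in_chunks(blocks, out_dir='data', chunk_size=500):
    """Write (filename, contents) pairs to disk in batches instead of
    opening and closing one file per API request inside the main loop."""
    chunk = []
    for name, contents in blocks:
        chunk.append((name, contents))
        if len(chunk) >= chunk_size:
            _flush(chunk, out_dir)
            chunk = []
    if chunk:  # write whatever is left over
        _flush(chunk, out_dir)

def _flush(chunk, out_dir):
    for name, contents in chunk:
        with open(os.path.join(out_dir, name), 'w') as f:
            f.write(contents)
```

In the main loop this replaces the open/write/close triple: collect `(block_index_str + "__" + block_timestamp_str, block_string)` pairs into a list (or yield them) and hand the collection to the helper.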
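
For the PyPy point, a common way to keep a single code path working under both interpreters is an import fallback; the `json_lib` alias is a hypothetical name:

```python
try:
    import ujson as json_lib  # fast C extension, fine under CPython
except ImportError:
    try:
        import simplejson as json_lib  # has a pure-Python implementation, so it also works under PyPy
    except ImportError:
        import json as json_lib  # standard library as a last resort

block = json_lib.loads('{"block_index": 123}')  # the loads() call is the same either way
```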
That said, given the information provided in the question, the bottleneck still feels to be the blocking nature of the script. See if you can switch to, for example, the Scrapy web-scraping framework, or use the `grequests` library.