I wrote a script which is supposed to make requests to an API to gather a lot of data. There is a limit on the requests I can make: about 30 000 in 8 hours; otherwise I will get banned for a significant time. Each object I get from the API is uniquely identified by a hash. Each API call returns the data I need plus the hash of the next object to fetch. So I start with a hash I already have, make a request, parse the result, and obtain the hash of the next object. Repeat. I also log the time of every 20th request I make, so I can keep track of how many requests I made in the last 8 hours.
Here is my code:
```python
import ujson
import requests
import time
import os
import cProfile

# 'logs' has all the requests in the last 8 hours

# the snippet of code which does all of the hard work
for i in range(len(logs) * requests_per_log, maximum_requests):  # to stay in max requests range
    r = requests.get(url + current_hash)
    block = ujson.loads(r.text)  # use ujson because it is faster
    block_timestamp_str = format_timestamp(block['time'])
    block_index_str = str(block['block_index'])

    # only log each 20th request
    if i % 20 == 0:
        f_logs.write(str(time.time()) + '\n')  # log the time when the request was made

    f = open('data/' + block_index_str + "__" + block_timestamp_str, "w+")
    block_string = parse_transactions_in_block(block)
    current_hash = block['prev_block']
    f.write(block_string)
    f.close()

# record the hash the script stopped at
f_current_hash.write(current_hash)
```
Some of the functions it uses:

```python
def parse_transactions_in_block(block):
    block_string = ''
    for transaction in block['tx']:
        block_string += str(transaction['time']) + ',' + str(transaction['tx_index']) \
                        + ',' + str(calc_total_input(transaction)) + '\n'
    return block_string


def calc_total_input(transaction):
    total_input = 0
    for input in transaction['out']:
        total_input += input['value']
    return total_input
```
```python
# this is the time format I was asked to keep my data in
def format_timestamp(unix_time):
    return time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime(unix_time))
```
There is a lot of data to go through, so I want it to be as fast as possible; my previous iterations took a while to run. I am running it on Google Compute Engine using a Linux distribution. Any ideas how I can make this work faster? I don't have enough experience with concurrent computing in Python, so I am just looking for a way to optimize what I have, without concurrency.
1 Answer
Here are some of the improvements to the current "synchronous" approach:
- maintain an instance of `requests.Session()` - this would improve performance:

  > if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase

  ```python
  with requests.Session() as session:
      r = session.get(url + current_hash)
  ```

  (a fuller sketch with the session wrapping the whole loop follows after this list)
- use the `.json()` method to get the JSON object directly out of a response (you would first need to adjust the "json model" to use `ujson`, source):

  ```python
  requests.models.json = ujson

  # ...
  block = r.json()
  ```

  (shown in context in the second sketch after the list)
- don't open/close files in the main loop - collect the data into memory and dump it afterwards. If the data does not fit in memory, use pagination - write to the output file(s) in chunks (see the buffered-write sketch after this list)
- `parse_transactions_in_block()` may be rewritten using `str.join()` and `str.format()`:

  ```python
  def parse_transactions_in_block(block):
      return ''.join("{time},{index},{total}\n".format(time=transaction['time'],
                                                       index=transaction['tx_index'],
                                                       total=calc_total_input(transaction))
                     for transaction in block['tx'])
  ```
- `calc_total_input()` can be rewritten using `sum()`:

  ```python
  def calc_total_input(transaction):
      return sum(input['value'] for input in transaction['out'])
  ```
- try the PyPy interpreter - it can give you a performance boost with no changes to the code (well, I doubt `ujson` will still work, but `simplejson` might be a good alternative in this case); see the import-fallback sketch after this list
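
To make the first point concrete: the session only pays off if it wraps the whole loop, not a single call. Here is a minimal sketch reusing the question's hash-chain layout; the `fetch_chain()` wrapper and its arguments are hypothetical names, not part of the original script:

```python
import requests
import ujson

def fetch_chain(url, start_hash, max_requests):
    """Walk the hash chain, reusing one TCP connection for every request."""
    blocks = []
    current_hash = start_hash
    with requests.Session() as session:  # created once, outside the loop
        for _ in range(max_requests):
            r = session.get(url + current_hash)
            block = ujson.loads(r.text)
            blocks.append(block)
            current_hash = block['prev_block']
    return blocks
```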
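
For the second point, the patch would be applied once, next to the imports. The `fetch_block()` helper and the `raise_for_status()` call below are my additions (the original loop has no error handling), and whether the patched attribute is the one requests consults internally depends on the installed version, so it is worth verifying:

```python
import ujson
import requests

requests.models.json = ujson  # apply the patch once, right after the imports

def fetch_block(session, url, block_hash):
    """Fetch one block and decode it through the (patched) .json() call."""
    r = session.get(url + block_hash)
    r.raise_for_status()  # fail fast on an HTTP error instead of parsing an error page
    return r.json()
```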
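
For the buffered-write point, a minimal sketch assuming the per-block strings fit comfortably in memory; `write_blocks_in_chunks()`, `_flush()` and the `chunk_size` knob are hypothetical names used only to illustrate the idea:

```python
import os

def write_blocks_in_chunks(blocks, out_dir='data', chunk_size=500):
    """Write (filename, contents) pairs to disk in batches instead of
    opening and closing one file per API request inside the main loop."""
    chunk = []
    for name, contents in blocks:
        chunk.append((name, contents))
        if len(chunk) >= chunk_size:
            _flush(chunk, out_dir)
            chunk = []
    if chunk:  # write whatever is left over
        _flush(chunk, out_dir)

def _flush(chunk, out_dir):
    for name, contents in chunk:
        with open(os.path.join(out_dir, name), 'w') as f:
            f.write(contents)
```

In the main loop this replaces the open/write/close triple: collect `(block_index_str + "__" + block_timestamp_str, block_string)` pairs into a list (or yield them) and hand the collection to the helper.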
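
For the PyPy point, a common way to keep a single code path working under both interpreters is an import fallback; the `json_lib` alias is a hypothetical name:

```python
try:
    import ujson as json_lib  # fast C extension, fine under CPython
except ImportError:
    try:
        import simplejson as json_lib  # has a pure-Python implementation, so it also works under PyPy
    except ImportError:
        import json as json_lib  # standard library as a last resort

block = json_lib.loads('{"block_index": 123}')  # the loads() call is the same either way
```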
That said, given the information provided in the question, the bottleneck still feels to be the blocking nature of the script. See if you can switch to, for example, the Scrapy web-scraping framework, or use the `grequests` library.