The web-scraping process currently takes quite a bit of time, and I wonder whether I could structure the code differently or improve it in any other way. The code looks like this:
import numpy as np
import pandas as pd
import requests
import json
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

results_2017 = []
results_2018 = []

for game_id in range(2017020001, 2017021271, 1):
    url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
    r_2017 = requests.get(url)
    game_data_2017 = r_2017.json()
    for homeaway in ['home', 'away']:
        game_dict_2017 = game_data_2017.get('teams').get(homeaway).get('teamStats').get('teamSkaterStats')
        game_dict_2017['team'] = game_data_2017.get('teams').get(homeaway).get('team').get('name')
        game_dict_2017['homeaway'] = homeaway
        game_dict_2017['game_id'] = game_id
        results_2017.append(game_dict_2017)

df_2017 = pd.DataFrame(results_2017)

for game_id in range(2018020001, 2018020667, 1):
    url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
    r_2018 = requests.get(url)
    game_data_2018 = r_2018.json()
    for homeaway in ['home', 'away']:
        game_dict_2018 = game_data_2018.get('teams').get(homeaway).get('teamStats').get('teamSkaterStats')
        game_dict_2018['team'] = game_data_2018.get('teams').get(homeaway).get('team').get('name')
        game_dict_2018['homeaway'] = homeaway
        game_dict_2018['game_id'] = game_id
        results_2018.append(game_dict_2018)

df_2018 = pd.DataFrame(results_2018)

df = df_2017.append(df_2018)
1 Answer
Assuming you have a multi-core processor, you can use the multiprocessing module to do your scraping in parallel. Here's a good article that explains how to use it: An introduction to parallel programming using Python's multiprocessing module (you only need to read the first half of the article).
The simplest way to parallelize this would be to run your 2017 loop in a separate process from your 2018 loop, as in the sketch below. If you need it faster than that, you could further subdivide the 2017 and 2018 ranges. The gain also depends on how many cores you have: with only 2 cores, you won't benefit from dividing the work into more than 2 processes (or 4 processes for 4 cores).
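A minimal sketch of that two-process split, assuming Python 3 — the fetch_season helper, the Queue used to pass results back, and the __main__ guard are my additions for illustration, not part of your original code:

import multiprocessing as mp

import pandas as pd
import requests

def fetch_season(id_range, queue):
    # Collect one row per team per game, exactly as in the original loop body.
    results = []
    for game_id in id_range:
        url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
        game_data = requests.get(url).json()
        for homeaway in ['home', 'away']:
            row = game_data['teams'][homeaway]['teamStats']['teamSkaterStats']
            row['team'] = game_data['teams'][homeaway]['team']['name']
            row['homeaway'] = homeaway
            row['game_id'] = game_id
            results.append(row)
    queue.put(results)

if __name__ == '__main__':
    queue = mp.Queue()
    p_2017 = mp.Process(target=fetch_season, args=(range(2017020001, 2017021271), queue))
    p_2018 = mp.Process(target=fetch_season, args=(range(2018020001, 2018020667), queue))
    p_2017.start()
    p_2018.start()
    # Drain the queue before joining: a process that has put a large result
    # on a Queue won't terminate until that data has been consumed.
    rows = queue.get() + queue.get()
    p_2017.join()
    p_2018.join()
    df = pd.DataFrame(rows)

The order of the two queue.get() calls doesn't matter here, since both seasons' rows end up concatenated into a single DataFrame either way.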
Other than that, I can't see anything in your code structure that you could change to speed it up.
Update: I'm not familiar with the API you're using, but if there's a way to make one API call that returns a list of games with their box scores instead of making a separate request for each game, that would be the best way to speed it up.
Also, according to Mathias in the comments, multiprocessing.Pool.map would be particularly helpful in this case. I'm not familiar with it, but it does look like it would be convenient for this.
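I can't vouch for it being the best fit, but a sketch of the Pool.map approach might look like the following — the fetch_game helper and the pool size of 4 are assumptions for illustration:

import multiprocessing

import pandas as pd
import requests

def fetch_game(game_id):
    # One request per game; returns the home and away rows for that game.
    url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
    game_data = requests.get(url).json()
    rows = []
    for homeaway in ['home', 'away']:
        row = game_data['teams'][homeaway]['teamStats']['teamSkaterStats']
        row['team'] = game_data['teams'][homeaway]['team']['name']
        row['homeaway'] = homeaway
        row['game_id'] = game_id
        rows.append(row)
    return rows

if __name__ == '__main__':
    game_ids = list(range(2017020001, 2017021271)) + list(range(2018020001, 2018020667))
    with multiprocessing.Pool(processes=4) as pool:
        per_game_rows = pool.map(fetch_game, game_ids)
    # Flatten the list of per-game row lists into a single DataFrame.
    df = pd.DataFrame([row for rows in per_game_rows for row in rows])

One nicety of Pool.map is that it preserves the order of its input, so the resulting rows come back in game-ID order even though the requests run in parallel.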
Thank you for the article and the feedback, really helpful! – MisterButter, Jan 9, 2019 at 18:55
You could mention multiprocessing.Pool.map, which is particularly helpful in this case. – 301_Moved_Permanently, Jan 9, 2019 at 22:03