The web-scraping process currently takes quite a bit of time, and I wonder whether I could structure the code differently or improve it in any other way. The code looks like this:
import numpy as np
import pandas as pd
import requests
import json
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

results_2017 = []
results_2018 = []

for game_id in range(2017020001, 2017021271, 1):
    url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
    r_2017 = requests.get(url)
    game_data_2017 = r_2017.json()
    for homeaway in ['home', 'away']:
        game_dict_2017 = game_data_2017.get('teams').get(homeaway).get('teamStats').get('teamSkaterStats')
        game_dict_2017['team'] = game_data_2017.get('teams').get(homeaway).get('team').get('name')
        game_dict_2017['homeaway'] = homeaway
        game_dict_2017['game_id'] = game_id
        results_2017.append(game_dict_2017)

df_2017 = pd.DataFrame(results_2017)

for game_id in range(2018020001, 2018020667, 1):
    url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
    r_2018 = requests.get(url)
    game_data_2018 = r_2018.json()
    for homeaway in ['home', 'away']:
        game_dict_2018 = game_data_2018.get('teams').get(homeaway).get('teamStats').get('teamSkaterStats')
        game_dict_2018['team'] = game_data_2018.get('teams').get(homeaway).get('team').get('name')
        game_dict_2018['homeaway'] = homeaway
        game_dict_2018['game_id'] = game_id
        results_2018.append(game_dict_2018)

df_2018 = pd.DataFrame(results_2018)

df = df_2017.append(df_2018)
1 Answer
Assuming you have a multi-core processor, you can use the multiprocessing module to do your scraping in parallel. Here's a good article that explains how to use it: An introduction to parallel programming using Python's multiprocessing module (you only need to read the first half of the article).
The simplest way to parallelize this would be to run your 2017 loop in a separate process from your 2018 loop, as in the sketch below. If you need it faster than that, you could further subdivide the 2017 and 2018 ranges. The gain also depends on how many cores you have: with only 2 cores, you won't benefit from dividing the work into more than 2 processes (or 4 processes for 4 cores).
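A minimal sketch of that two-process split, assuming Python 3 — the fetch_season helper, the Queue used to pass results back, and the __main__ guard are my additions for illustration, not part of your original code:

import multiprocessing as mp

import pandas as pd
import requests

def fetch_season(id_range, queue):
    # Collect one row per team per game, exactly as in the original loop body.
    results = []
    for game_id in id_range:
        url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
        game_data = requests.get(url).json()
        for homeaway in ['home', 'away']:
            row = game_data['teams'][homeaway]['teamStats']['teamSkaterStats']
            row['team'] = game_data['teams'][homeaway]['team']['name']
            row['homeaway'] = homeaway
            row['game_id'] = game_id
            results.append(row)
    queue.put(results)

if __name__ == '__main__':
    queue = mp.Queue()
    p_2017 = mp.Process(target=fetch_season, args=(range(2017020001, 2017021271), queue))
    p_2018 = mp.Process(target=fetch_season, args=(range(2018020001, 2018020667), queue))
    p_2017.start()
    p_2018.start()
    # Drain the queue before joining: a process that has put a large result
    # on a Queue won't terminate until that data has been consumed.
    rows = queue.get() + queue.get()
    p_2017.join()
    p_2018.join()
    df = pd.DataFrame(rows)

The order of the two queue.get() calls doesn't matter here, since both seasons' rows end up concatenated into a single DataFrame either way.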
Other than that, I can't see anything in your code structure that you could change to speed it up.
Update: I'm not familiar with the API you're using, but if there's a way to make one API call that returns a list of games with their box scores instead of making a separate request for each game, that would be the best way to speed it up.
Also, according to Mathias in the comments, multiprocessing.Pool.map would be particularly helpful in this case. I'm not familiar with it, but it does look like it would be convenient for this.
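I can't vouch for it being the best fit, but a sketch of the Pool.map approach might look like the following — the fetch_game helper and the pool size of 4 are assumptions for illustration:

import multiprocessing

import pandas as pd
import requests

def fetch_game(game_id):
    # One request per game; returns the home and away rows for that game.
    url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
    game_data = requests.get(url).json()
    rows = []
    for homeaway in ['home', 'away']:
        row = game_data['teams'][homeaway]['teamStats']['teamSkaterStats']
        row['team'] = game_data['teams'][homeaway]['team']['name']
        row['homeaway'] = homeaway
        row['game_id'] = game_id
        rows.append(row)
    return rows

if __name__ == '__main__':
    game_ids = list(range(2017020001, 2017021271)) + list(range(2018020001, 2018020667))
    with multiprocessing.Pool(processes=4) as pool:
        per_game_rows = pool.map(fetch_game, game_ids)
    # Flatten the list of per-game row lists into a single DataFrame.
    df = pd.DataFrame([row for rows in per_game_rows for row in rows])

One nicety of Pool.map is that it preserves the order of its input, so the resulting rows come back in game-ID order even though the requests run in parallel.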
Thank you for the article and the feedback, really helpful! – MisterButter, Jan 9, 2019 at 18:55
You could mention multiprocessing.Pool.map, which is particularly helpful in this case. – 301_Moved_Permanently, Jan 9, 2019 at 22:03