This is more of an exercise for me to get used to Pandas and its dataframes. For those who haven't heard of it:
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive
I'll make this sound like an exercise:
Given some link http://ABCD.abc/some_date.html, take the necessary information from the table on the page.
Say the information looks like this:
Team  | Another Team | Col2 | Current  | Col4 | Halftime  | Scores
Team1 | TeamX        | info | Current1 | Col4 | Halftime1 | Scores1
Team2 | TeamY        | info | Current2 | Col4 | Halftime2 | Scores2
Team3 | TeamW        | info | Current3 | Col4 | Halftime3 | Scores3
Team4 | TeamZ        | info | Current4 | Col4 | Halftime4 | Scores4
From fileA (data from the file is pickled - yeah, I know pickling isn't the best option, but let's stick with it for the sake of the exercise), add the info at the end of the dataframe in another 3 new columns: Current, Halftime and Scores.
Let's suppose the data in the dataframe looks like this:
  | Team  | Opponent | Col2 | Col3 | Col4 | Col5 | Col6 | Date
0 | Team1 | TeamX    | info | info | info | info | info | some_date1  <-- see the link; the date goes there in the link
1 | TeamX | Team1    | info | info | info | info | info | some_date2  <-- see the link; the date goes there in the link
2 | Team3 | TeamW    | info | info | info | info | info | some_date3  <-- see the link; the date goes there in the link
3 | TeamW | Team3    | info | info | info | info | info | some_date4  <-- see the link; the date goes there in the link
...and so on.
Now, the task:
- Parse each row from the dataframe (access the link using the date from the Date column of that row), and check if the team from this row can be found in the HTML table.
- If you find it, take Current, Halftime and Scores from the table and add the info into the newly created dataframe columns.
- Do this for each row of the dataframe.
Now, I did solve this fairly easily, but it takes up to a minute to process 137 rows in the dataframe. I'd like some ideas on how I can optimise it, make better use of the pandas module, and whether there's something wrong with the logic.
import pickle
import requests
import pandas as pd
from bs4 import BeautifulSoup
def get_df_from_file(pickle_filename):
    objects = []
    with open(pickle_filename, "rb") as openfile:
        objects.append(pickle.load(openfile))
    return objects


def add_new_df_columns():
    # Load the pickled dataframe and add the three empty result columns.
    return get_df_from_file('CFB_15_living-2.p')[0].join(pd.DataFrame(columns=['Currents', 'Halftimes', 'Scores']))
def get_html_data_from_url(custom_date):
    url = 'http://www.scoresandodds.com/grid_{}.html'.format(custom_date)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'lxml')
    rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
    teams, currents, halftimes, scores = [], [], [], []
    for row in rows:
        cells = row.find_all("td")
        teams.append(cells[0].get_text().encode('utf-8'))
        currents.append(cells[3].get_text().encode('utf-8'))
        halftimes.append(cells[5].get_text().encode('utf-8'))
        scores.append(cells[6].get_text().encode('utf-8'))
    data = {
        'teams': teams,
        'currents': currents,
        'halftimes': halftimes,
        'scores': scores
    }
    return data
def process_data():
    df_objects = add_new_df_columns()  # data from file
    for index, row in df_objects.iterrows():
        html_data = get_html_data_from_url(row['Date'])  # dict from html
        for index_1, item in enumerate(html_data['teams']):
            if row['Team'] in item:
                # print('True: {} -> {}; Index: {}'.format(row['Team'], item, index))
                df_objects.set_value(index, 'Currents', html_data['currents'][index_1])
                df_objects.set_value(index, 'Halftimes', html_data['halftimes'][index_1])
                df_objects.set_value(index, 'Scores', html_data['scores'][index_1])
    print(df_objects)


if __name__ == '__main__':
    process_data()
After some tests, it looks like add_new_df_columns() is the function that takes the most time to execute, and that's because I always take the date from the row I'm at and make a request using it.
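For reference, here is a minimal timing sketch (my own addition, reusing the functions above, not part of the original measurements) that separates the pickle load and column join from a single page fetch:

import time

# Time the pickle load plus the join of the three empty columns.
start = time.time()
df_objects = add_new_df_columns()
print('pickle load + column join: {:.2f}s'.format(time.time() - start))

# Time one HTTP request plus the BeautifulSoup parse, using a date
# that actually appears in the dataframe.
start = time.time()
get_html_data_from_url(df_objects.iloc[0]['Date'])
print('one page fetch + parse: {:.2f}s'.format(time.time() - start))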
3 Answers
Consider avoiding row iteration and simply use pandas.DataFrame.merge() on the Team and Date columns. In pandas and numpy, vectorized processing is generally the recommended course: you pass a serialized object (vector, list, array, dataframe) to run a bulk operation in one call instead of operating on individual elements.
To follow this approach, first compile the HTML data for all unique dates found in your file dataframe (pulled from the pickle). Also, there is no need to create the empty columns (Currents, Halftimes, Scores), as the merge will bring them over.
The first two methods below each return a dataframe, and the final function simply merges the two together. The HTML dataframe build may take some time, since you have to parse a web page for every unique date. For this, try pandas.read_html (a sketch follows after the code below).
def get_df_from_file(pickle_filename):
    with open(pickle_filename, "rb") as openfile:
        return pickle.load(openfile)


def get_html_data_from_url(df):
    # LIST OF DATAFRAMES
    dfList = []
    # ITERATE ON UNIQUE DATES
    for dt in set(df['Date'].tolist()):
        url = 'http://www.scoresandodds.com/grid_{}.html'.format(dt)
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'lxml')
        rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
        dates, teams, currents, halftimes, scores = [], [], [], [], []
        for row in rows:
            cells = row.find_all("td")
            dates.append(dt)
            teams.append(cells[0].get_text().encode('utf-8'))
            currents.append(cells[3].get_text().encode('utf-8'))
            halftimes.append(cells[5].get_text().encode('utf-8'))
            scores.append(cells[6].get_text().encode('utf-8'))
        data = {
            'Date': dates,
            'Team': teams,
            'Currents': currents,
            'Halftimes': halftimes,
            'Scores': scores
        }
        # APPEND DATAFRAME CREATED FROM EACH DICTIONARY
        dfList.append(pd.DataFrame(data))
    # CONCATENATE DATAFRAME LIST
    finaldf = pd.concat(dfList)
    return finaldf


def process_data():
    filedf = get_df_from_file('CFB_15_living-2.p')
    filedf['Team'] = filedf['Team'].str.lower()
    htmldf = get_html_data_from_url(filedf)
    htmldf['Team'] = htmldf['Team'].str.replace('[0-9]', '').str.strip().str.lower()
    # LEFT JOIN MERGE
    mergedf = pd.merge(filedf, htmldf, on=['Date', 'Team'], how='left')
    mergedf.to_csv('results.csv', sep='\t')
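As for the pandas.read_html suggestion above, here is a rough sketch (my illustration, not the answer author's code) of what a per-date table pull could look like; the attrs filter and the column positions are assumptions carried over from the BeautifulSoup version and may need adjusting to the real page layout:

def get_html_table(custom_date):
    url = 'http://www.scoresandodds.com/grid_{}.html'.format(custom_date)
    # read_html returns a list with one DataFrame per matching <table> on the page.
    table = pd.read_html(url, attrs={'class': 'data'})[0]
    # Keep only the columns the merge needs; the positions mirror the scraper above.
    table = table.iloc[:, [0, 3, 5, 6]]
    table.columns = ['Team', 'Currents', 'Halftimes', 'Scores']
    table['Date'] = custom_date
    return table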
Comments:
- Thanks for the solution, but the last three new columns return NaN. (Grajdeanu Alex, Oct 10, 2016 at 19:58)
- I also can't seem to figure out where you made the comparison between the team in the HTML and the team in the dataframe. As a side note, the two data sources (HTML data and dataframe) are not in the same order. (Grajdeanu Alex, Oct 10, 2016 at 20:10)
- The comparison is in merge(), and order does not matter in merging. Check the output of htmldf: do the Team and Date column values and dtypes align with filedf? Is Team in one df part of Team in the other df? Merge requires exact equality of values, including case and spaces. (Parfait, Oct 10, 2016 at 20:12)
- The output of htmldf is ok. Unfortunately, that's why I did if row['Team'] in item: row['Team'] might be Barcelona while item might be 123 Barcelona. ): (Grajdeanu Alex, Oct 10, 2016 at 20:16)
- Is there any workaround for this? Something that mimics for a in b in the pandas module? (Grajdeanu Alex, Oct 10, 2016 at 20:36)
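One way to work around the substring problem from the comments (this is my own sketch, not part of the answer) is to normalise the scraped team names before merging, so that exact equality holds; for example, by stripping the leading numbers:

import pandas as pd

# Hypothetical example data illustrating the mismatch described above.
htmldf = pd.DataFrame({'Team': ['123 Barcelona', '456 Real Madrid']})
filedf = pd.DataFrame({'Team': ['Barcelona', 'Real Madrid']})

# Strip the leading digits and whitespace from the scraped names, then lowercase both sides.
htmldf['Team'] = htmldf['Team'].str.replace(r'^\d+\s*', '').str.strip().str.lower()
filedf['Team'] = filedf['Team'].str.lower()

# With both columns normalised, a plain left merge matches on exact equality again.
merged = filedf.merge(htmldf, on='Team', how='left')
print(merged)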
Ok, so far so good: I found a way to reduce the execution time by storing the dates in a list.
The process would be:
- as we know, the link is formed as follows: http://link.com/grid_ + row['Date']
- so, if the same date appears everywhere, there's no need to request the same page each time.
That being said, I've got the following snippet:
import pickle
import requests
import pandas as pd
from bs4 import BeautifulSoup
FILE_TO_PROCESS = 'pickle_file.txt'
def get_df_from_file():
    with open(FILE_TO_PROCESS, "rb") as openfile:
        return pickle.load(openfile).join(pd.DataFrame(columns=['Currents', 'Halftimes', 'Scores']))
def get_html_data_from_url(custom_date):
    url = 'http://www.scoresandodds.com/grid_{}.html'.format(custom_date)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'lxml')
    rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
    teams, currents, halftimes, scores = [], [], [], []
    for row in rows:
        cells = row.find_all("td")
        teams.append(cells[0].get_text().encode('utf-8'))
        currents.append(cells[3].get_text().encode('utf-8'))
        halftimes.append(cells[5].get_text().encode('utf-8'))
        scores.append(cells[6].get_text().encode('utf-8'))
    data = {
        'teams': teams,
        'currents': currents,
        'halftimes': halftimes,
        'scores': scores
    }
    return data
def process_data():
    df_objects = get_df_from_file()
    dates = []
    first_date = df_objects.iloc[0]['Date']
    main_html_data = get_html_data_from_url(first_date)
    for index, row in df_objects.iterrows():
        if index < 1:
            html_data = main_html_data
            dates.append(first_date)
        else:
            if index >= 1 and row['Date'] in dates:
                html_data = main_html_data
            elif index >= 1 and row['Date'] not in dates:
                html_data = get_html_data_from_url(row['Date'])
                dates.append(row['Date'])
        for index_1, item in enumerate(html_data['teams']):
            if row['Team'] in item:
                # print('True: {} -> {}; Index: {}'.format(row['Team'], item, index))
                df_objects.set_value(index, 'Currents', html_data['currents'][index_1])
                df_objects.set_value(index, 'Halftimes', html_data['halftimes'][index_1])
                df_objects.set_value(index, 'Scores', html_data['scores'][index_1])
    # print('--------------------------')
    df_objects.to_csv('results.csv', sep='\t')


if __name__ == '__main__':
    process_data()
I also realized that there's no need to store the dataframe objects in a list when I could just return the dataframe and join the needed extra columns, all in the same function.
If you have any other suggestions, please go ahead and share them.
LE: And that could also fail for the following test case:
- I'm always storing the first date's data in main_html_data.
- Then, if there's a new date, I'll add it to my list, so now my list would look like: dates = ['date_1', 'date_2'].
- Now, if the date on the third row is again date_1, I'll get the HTML of the date_2 link, as that's the last one I've checked.
No idea how to resolve this. Yet.
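One way around that failure case (my own sketch, reusing the functions above) is to cache the parsed data per date in a dict keyed by the date, so every repeated date gets back exactly the page that was fetched for it:

def process_data():
    df_objects = get_df_from_file()
    html_cache = {}  # maps each date string to its already-parsed HTML data

    for index, row in df_objects.iterrows():
        date = row['Date']
        # Fetch and parse the page only the first time this date is seen.
        if date not in html_cache:
            html_cache[date] = get_html_data_from_url(date)
        html_data = html_cache[date]

        for index_1, item in enumerate(html_data['teams']):
            if row['Team'] in item:
                df_objects.set_value(index, 'Currents', html_data['currents'][index_1])
                df_objects.set_value(index, 'Halftimes', html_data['halftimes'][index_1])
                df_objects.set_value(index, 'Scores', html_data['scores'][index_1])

    df_objects.to_csv('results.csv', sep='\t')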
In get_html_data_from_url you could use a collections.defaultdict to append directly to the lists in the dict, without having to worry about the first iteration. Then you can assign to the data dict right away.
In addition, I would make a helper function for the .get_text().encode('utf-8') of a cell, and a dictionary mapping the positions of the cells to the keys in the data dict:
from collections import defaultdict


def _encode(cell):
    return cell.get_text().encode('utf-8')


def get_html_data_from_url(custom_date):
    ...
    mapping = {0: 'teams', 3: 'currents', 5: 'halftimes', 6: 'scores'}
    data = defaultdict(list)
    for row in rows:
        cells = row.find_all("td")
        for pos, key in mapping.items():
            data[key].append(_encode(cells[pos]))
    return data
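One thing to keep in mind (my own note, not from the answer): with this change get_html_data_from_url returns a defaultdict rather than a plain dict, so looking up a missing key silently yields an empty list instead of raising KeyError; wrap the return value in dict(data) if the original behaviour matters downstream.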