This is more of an exercise for me to get used to Pandas and its dataframes. For those who haven't heard of it:
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive
I'll make this sound like an exercise:
Given some link http://ABCD.abc/some_date.html, take the necessary information from the table on the page.
Say the information looks like this:
Team  | Another Team | Col2 | Current  | Col4 | Halftime  | Scores
Team1 | TeamX        | info | Current1 | Col4 | Halftime1 | Scores1
Team2 | TeamY        | info | Current2 | Col4 | Halftime2 | Scores2
Team3 | TeamW        | info | Current3 | Col4 | Halftime3 | Scores3
Team4 | TeamZ        | info | Current4 | Col4 | Halftime4 | Scores4
From fileA (data from the file is pickled - yeah, I know pickling isn't the best option, but let's stick with it for the sake of the exercise), add the info at the end of the dataframe in another 3 new columns: Current, Halftime and Scores.
Let's suppose the data in the dataframe looks like this:
  | Team  | Opponent | Col2 | Col3 | Col4 | Col5 | Col6 | Date
0 | Team1 | TeamX    | info | info | info | info | info | some_date1  <-- see the link; the date goes there in the link
1 | TeamX | Team1    | info | info | info | info | info | some_date2  <-- see the link; the date goes there in the link
2 | Team3 | TeamW    | info | info | info | info | info | some_date3  <-- see the link; the date goes there in the link
3 | TeamW | Team3    | info | info | info | info | info | some_date4  <-- see the link; the date goes there in the link
...and so on.
Now, the task:
- Parse each row from the dataframe (access the link using the date from the Date column of that row), and check if the team from this row can be found in the HTML table.
- If you find it, take Current, Halftime and Scores from the table and add the info into the newly created dataframe columns.
- Do this for each row of the dataframe.
Now, I did solve this fairly easily, but it takes up to a minute to process 137 rows in the dataframe. I'd like some ideas on how I can optimise it, make better use of the pandas module, and whether there's something wrong with the logic.
import pickle
import requests
import pandas as pd
from bs4 import BeautifulSoup
def get_df_from_file(pickle_filename):
    objects = []
    with open(pickle_filename, "rb") as openfile:
        objects.append(pickle.load(openfile))
    return objects


def add_new_df_columns():
    # Load the pickled dataframe and add the three empty result columns.
    return get_df_from_file('CFB_15_living-2.p')[0].join(pd.DataFrame(columns=['Currents', 'Halftimes', 'Scores']))
def get_html_data_from_url(custom_date):
    url = 'http://www.scoresandodds.com/grid_{}.html'.format(custom_date)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'lxml')
    rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
    teams, currents, halftimes, scores = [], [], [], []
    for row in rows:
        cells = row.find_all("td")
        teams.append(cells[0].get_text().encode('utf-8'))
        currents.append(cells[3].get_text().encode('utf-8'))
        halftimes.append(cells[5].get_text().encode('utf-8'))
        scores.append(cells[6].get_text().encode('utf-8'))
    data = {
        'teams': teams,
        'currents': currents,
        'halftimes': halftimes,
        'scores': scores
    }
    return data
def process_data():
    df_objects = add_new_df_columns()  # data from file
    for index, row in df_objects.iterrows():
        html_data = get_html_data_from_url(row['Date'])  # dict from html
        for index_1, item in enumerate(html_data['teams']):
            if row['Team'] in item:
                # print('True: {} -> {}; Index: {}'.format(row['Team'], item, index))
                df_objects.set_value(index, 'Currents', html_data['currents'][index_1])
                df_objects.set_value(index, 'Halftimes', html_data['halftimes'][index_1])
                df_objects.set_value(index, 'Scores', html_data['scores'][index_1])
    print(df_objects)


if __name__ == '__main__':
    process_data()
After some tests, it looks like add_new_df_columns() is the function that takes the most time to execute, and that's because I always take the date from the row I'm at and make a request using it.
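For reference, here is a minimal timing sketch (my own addition, reusing the functions above, not part of the original measurements) that separates the pickle load and column join from a single page fetch:

import time

# Time the pickle load plus the join of the three empty columns.
start = time.time()
df_objects = add_new_df_columns()
print('pickle load + column join: {:.2f}s'.format(time.time() - start))

# Time one HTTP request plus the BeautifulSoup parse, using a date
# that actually appears in the dataframe.
start = time.time()
get_html_data_from_url(df_objects.iloc[0]['Date'])
print('one page fetch + parse: {:.2f}s'.format(time.time() - start))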
3 Answers
Consider avoiding row iteration and simply use pandas.DataFrame.merge() on the Team and Date columns. In pandas and numpy, vectorized processing is generally the recommended course: you pass a serialized object (vector, list, array, dataframe) to run a bulk operation in one call instead of operating on individual elements.
To follow this approach, first compile the HTML data for all unique dates found in your file dataframe (pulled from the pickle). Also, there is no need to create the empty columns (Currents, Halftimes, Scores), as the merge will bring them over.
The first two methods below each return a dataframe, and the final function simply merges the two together. The HTML dataframe build may take some time, since you have to parse a web page for every unique date. For this, try pandas.read_html (a sketch follows after the code below).
def get_df_from_file(pickle_filename):
    with open(pickle_filename, "rb") as openfile:
        return pickle.load(openfile)


def get_html_data_from_url(df):
    # LIST OF DATAFRAMES
    dfList = []
    # ITERATE ON UNIQUE DATES
    for dt in set(df['Date'].tolist()):
        url = 'http://www.scoresandodds.com/grid_{}.html'.format(dt)
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'lxml')
        rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
        dates, teams, currents, halftimes, scores = [], [], [], [], []
        for row in rows:
            cells = row.find_all("td")
            dates.append(dt)
            teams.append(cells[0].get_text().encode('utf-8'))
            currents.append(cells[3].get_text().encode('utf-8'))
            halftimes.append(cells[5].get_text().encode('utf-8'))
            scores.append(cells[6].get_text().encode('utf-8'))
        data = {
            'Date': dates,
            'Team': teams,
            'Currents': currents,
            'Halftimes': halftimes,
            'Scores': scores
        }
        # APPEND DATAFRAME CREATED FROM EACH DICTIONARY
        dfList.append(pd.DataFrame(data))
    # CONCATENATE DATAFRAME LIST
    finaldf = pd.concat(dfList)
    return finaldf


def process_data():
    filedf = get_df_from_file('CFB_15_living-2.p')
    filedf['Team'] = filedf['Team'].str.lower()
    htmldf = get_html_data_from_url(filedf)
    htmldf['Team'] = htmldf['Team'].str.replace('[0-9]', '').str.strip().str.lower()
    # LEFT JOIN MERGE
    mergedf = pd.merge(filedf, htmldf, on=['Date', 'Team'], how='left')
    mergedf.to_csv('results.csv', sep='\t')
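As for the pandas.read_html suggestion above, here is a rough sketch (my illustration, not the answer author's code) of what a per-date table pull could look like; the attrs filter and the column positions are assumptions carried over from the BeautifulSoup version and may need adjusting to the real page layout:

def get_html_table(custom_date):
    url = 'http://www.scoresandodds.com/grid_{}.html'.format(custom_date)
    # read_html returns a list with one DataFrame per matching <table> on the page.
    table = pd.read_html(url, attrs={'class': 'data'})[0]
    # Keep only the columns the merge needs; the positions mirror the scraper above.
    table = table.iloc[:, [0, 3, 5, 6]]
    table.columns = ['Team', 'Currents', 'Halftimes', 'Scores']
    table['Date'] = custom_date
    return table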
Comments:
- Thanks for the solution, but the last three new columns return NaN. (Grajdeanu Alex, Oct 10, 2016 at 19:58)
- I also can't seem to figure out where you made the comparison between the team in the HTML and the team in the dataframe. As a side note, the two data sources (HTML data and dataframe) are not in the same order. (Grajdeanu Alex, Oct 10, 2016 at 20:10)
- The comparison is in merge(), and order does not matter in merging. Check the output of htmldf: do the Team and Date column values and dtypes align with filedf? Is Team in one df part of Team in the other df? Merge requires exact equality of values, including case and spaces. (Parfait, Oct 10, 2016 at 20:12)
- The output of htmldf is ok. Unfortunately, that's why I did if row['Team'] in item: row['Team'] might be Barcelona while item might be 123 Barcelona. ): (Grajdeanu Alex, Oct 10, 2016 at 20:16)
- Is there any workaround for this? Something that mimics for a in b in the pandas module? (Grajdeanu Alex, Oct 10, 2016 at 20:36)
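One way to work around the substring problem from the comments (this is my own sketch, not part of the answer) is to normalise the scraped team names before merging, so that exact equality holds; for example, by stripping the leading numbers:

import pandas as pd

# Hypothetical example data illustrating the mismatch described above.
htmldf = pd.DataFrame({'Team': ['123 Barcelona', '456 Real Madrid']})
filedf = pd.DataFrame({'Team': ['Barcelona', 'Real Madrid']})

# Strip the leading digits and whitespace from the scraped names, then lowercase both sides.
htmldf['Team'] = htmldf['Team'].str.replace(r'^\d+\s*', '').str.strip().str.lower()
filedf['Team'] = filedf['Team'].str.lower()

# With both columns normalised, a plain left merge matches on exact equality again.
merged = filedf.merge(htmldf, on='Team', how='left')
print(merged)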
Ok, so far so good: I found a way to reduce the execution time by storing the dates in a list.
The process would be:
- as we know, the link is formed as follows: http://link.com/grid_ + row['Date']
- so, if the same date appears everywhere, there's no need to request the same page each time.
That being said, I've got the following snippet:
import pickle
import requests
import pandas as pd
from bs4 import BeautifulSoup
FILE_TO_PROCESS = 'pickle_file.txt'
def get_df_from_file():
    with open(FILE_TO_PROCESS, "rb") as openfile:
        return pickle.load(openfile).join(pd.DataFrame(columns=['Currents', 'Halftimes', 'Scores']))
def get_html_data_from_url(custom_date):
    url = 'http://www.scoresandodds.com/grid_{}.html'.format(custom_date)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'lxml')
    rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
    teams, currents, halftimes, scores = [], [], [], []
    for row in rows:
        cells = row.find_all("td")
        teams.append(cells[0].get_text().encode('utf-8'))
        currents.append(cells[3].get_text().encode('utf-8'))
        halftimes.append(cells[5].get_text().encode('utf-8'))
        scores.append(cells[6].get_text().encode('utf-8'))
    data = {
        'teams': teams,
        'currents': currents,
        'halftimes': halftimes,
        'scores': scores
    }
    return data
def process_data():
    df_objects = get_df_from_file()
    dates = []
    first_date = df_objects.iloc[0]['Date']
    main_html_data = get_html_data_from_url(first_date)
    for index, row in df_objects.iterrows():
        if index < 1:
            html_data = main_html_data
            dates.append(first_date)
        else:
            if index >= 1 and row['Date'] in dates:
                html_data = main_html_data
            elif index >= 1 and row['Date'] not in dates:
                html_data = get_html_data_from_url(row['Date'])
                dates.append(row['Date'])
        for index_1, item in enumerate(html_data['teams']):
            if row['Team'] in item:
                # print('True: {} -> {}; Index: {}'.format(row['Team'], item, index))
                df_objects.set_value(index, 'Currents', html_data['currents'][index_1])
                df_objects.set_value(index, 'Halftimes', html_data['halftimes'][index_1])
                df_objects.set_value(index, 'Scores', html_data['scores'][index_1])
    # print('--------------------------')
    df_objects.to_csv('results.csv', sep='\t')


if __name__ == '__main__':
    process_data()
I also realized that there's no need to store the dataframe objects in a list when I could just return the dataframe and join the needed extra columns, all in the same function.
If you have any other suggestions, please go ahead and share them.
LE: And that could also fail for the following test case:
- I'm always storing the first date's data in main_html_data.
- Then, if there's a new date, I'll add it to my list, so now my list would look like: dates = ['date_1', 'date_2'].
- Now, if the date on the third row is again date_1, I'll get the HTML of the date_2 link, as that's the last one I've checked.
No idea how to resolve this. Yet.
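One way around that failure case (my own sketch, reusing the functions above) is to cache the parsed data per date in a dict keyed by the date, so every repeated date gets back exactly the page that was fetched for it:

def process_data():
    df_objects = get_df_from_file()
    html_cache = {}  # maps each date string to its already-parsed HTML data

    for index, row in df_objects.iterrows():
        date = row['Date']
        # Fetch and parse the page only the first time this date is seen.
        if date not in html_cache:
            html_cache[date] = get_html_data_from_url(date)
        html_data = html_cache[date]

        for index_1, item in enumerate(html_data['teams']):
            if row['Team'] in item:
                df_objects.set_value(index, 'Currents', html_data['currents'][index_1])
                df_objects.set_value(index, 'Halftimes', html_data['halftimes'][index_1])
                df_objects.set_value(index, 'Scores', html_data['scores'][index_1])

    df_objects.to_csv('results.csv', sep='\t')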
In get_html_data_from_url you could use a collections.defaultdict to append directly to the lists in the dict, without having to worry about the first iteration. Then you can assign to the data dict right away.
In addition, I would make a helper function for the .get_text().encode('utf-8') of a cell, and a dictionary mapping the positions of the cells to the keys in the data dict:
from collections import defaultdict


def _encode(cell):
    return cell.get_text().encode('utf-8')


def get_html_data_from_url(custom_date):
    ...
    mapping = {0: 'teams', 3: 'currents', 5: 'halftimes', 6: 'scores'}
    data = defaultdict(list)
    for row in rows:
        cells = row.find_all("td")
        for pos, key in mapping.items():
            data[key].append(_encode(cells[pos]))
    return data
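One thing to keep in mind (my own note, not from the answer): with this change get_html_data_from_url returns a defaultdict rather than a plain dict, so looking up a missing key silently yields an empty list instead of raising KeyError; wrap the return value in dict(data) if the original behaviour matters downstream.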