I am just starting out using Python for data analysis. I am a total beginner and have figured out how to brute-force some things, but I know it is inefficient, and I don't know how to improve it without breaking what I have.
I have multiple web pages to scrape and then store the data in a data frame. The code is identical for all of the pages. How do I set it up as a routine instead of repeating the same code over and over again?
As an example, the two URLs are https://etfdb.com/etf/IWD/ and https://etfdb.com/etf/IWF/.
The HTML is identical, so the scraping works exactly the same for both.
Once scraped, I want the results in a single data frame.
The code below works, but it seems like the least sophisticated approach, since I know so little. The code is not pretty, but it works.
Any help on how this should be improved is appreciated.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np
from IPython.display import display
iwd_url = 'https://etfdb.com/etf/IWD/'
uClient = uReq(iwd_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
#Isolate header to get name and symbol
h1 = page_soup.h1
#Isolate stock symbol
title = h1.findAll("span",{"class":"label-primary"})
titlet = title[0].text
#print(titlet)
#strip space and line break
strip1 = h1.text.strip()
#strip stock symbol
strip2 = strip1.strip(titlet)
#strip remaining line break
strip3 = strip2.strip()
#print(strip3)
IWD = page_soup.findAll("table",{"class":"chart base-table"})[1]
#Create lists to fill
sectordata=[]
sectorname=[]
sectorweight=[]
for row in IWD.findAll("td"):
    sectordata.append(row.text)
#list created
#Assign every other value to proper list to get 2 columns
sectorname = sectordata[::2]
sectorweight = sectordata[1::2]
#Insert name/symbol for clarification/validation
sectorweight.insert(0,titlet)
sectorname.insert(0,strip3)
# create empty data frame in pandas
df = pd.DataFrame()
#Add the first column to the empty dataframe.
df['Sector'] = sectorname
#Now add the second column.
df['Weight'] = sectorweight
#display(df)
### NEXT
iwf_url = 'https://etfdb.com/etf/IWF/'
uClient = uReq(iwf_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
#Isolate header to get name and symbol
h1 = page_soup.h1
#Isolate stock symbol
title = h1.findAll("span",{"class":"label-primary"})
titlet = title[0].text
#print(titlet)
#strip space and line break
strip1 = h1.text.strip()
#strip stock symbol
strip2 = strip1.strip(titlet)
#strip remaining line break
strip3 = strip2.strip()
#print(strip3)
IWD = page_soup.findAll("table",{"class":"chart base-table"})[1]
#Create lists to fill
sectordata=[]
sectorname=[]
sectorweight=[]
for row in IWD.findAll("td"):
    sectordata.append(row.text)
#list created
#Assign every other value to proper list to get 2 columns
sectorname = sectordata[::2]
sectorweight = sectordata[1::2]
#Insert name/symbol for clarification/validation
sectorweight.insert(0,titlet)
sectorname.insert(0,strip3)
# create empty data frame in pandas
df2 = pd.DataFrame()
#Add the first column to the empty dataframe.
df2['Sector'] = sectorname
#Now add the second column.
df2['Weight'] = sectorweight
#display(df2)
results = df.merge(df2, on = "Sector")
results.columns = ['Sector', 'IWD', 'IWF']
display(results)
Like I said, this works, but it isn't automated, and it's a ham-handed way of getting there. Please help me get better!
2 Answers
Pandas read_html allows us to "Read HTML tables into a list of DataFrame objects."
Using this, we can store the URLs in a list:
l=['https://etfdb.com/etf/IWD/','https://etfdb.com/etf/IWF/']
Then we read each URL and store the resulting dataframes in a list:
dfs=[pd.read_html(i)[5].rename(columns={'Percentage':i.split('/')[-2]}) for i in l]
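The rename in the comprehension derives each column name from its URL: because the URLs end with a slash, splitting on '/' leaves an empty last element, so the ticker is second from the end. A quick sketch of that piece on its own:

```python
url = 'https://etfdb.com/etf/IWD/'
# split('/') gives ['https:', '', 'etfdb.com', 'etf', 'IWD', '']
# the trailing slash produces the empty last element, so the ticker is at [-2]
ticker = url.split('/')[-2]
```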
Once we have this list of dataframes, we can use a reduce merge to merge all the dataframes in the list:
from functools import reduce
df_final = reduce(lambda left,right: pd.merge(left,right,on='Sector'), dfs)
print(df_final)
Output
Sector IWD IWF
0 Financials 23.02% 3.21%
1 Healthcare 12.08% 14.04%
2 Industrials 9.27% 9.39%
3 Energy 8.98% 0.35%
4 Consumer, Non-Cyclical 8.85% 4.69%
5 Communications 7.7% 11.27%
6 Technology 6.13% 36.46%
7 Consumer, Cyclical 5.86% 14.24%
8 Real Estate 5.15% 2.31%
9 Other 3.54% 2.55%
10 Basic Materials 2.74% 1.34%
11 ETF Cash Component 0.33% 0.14%
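A side note on how the reduce works: it folds pd.merge over the list pairwise, left to right, so the same line scales to any number of URLs. A toy sketch with hand-made frames (hypothetical data, not the real ETF pages):

```python
from functools import reduce
import pandas as pd

# hypothetical stand-ins for the frames read_html would return
dfs = [
    pd.DataFrame({'Sector': ['Financials', 'Energy'], 'A': [1, 2]}),
    pd.DataFrame({'Sector': ['Financials', 'Energy'], 'B': [3, 4]}),
    pd.DataFrame({'Sector': ['Financials', 'Energy'], 'C': [5, 6]}),
]
# equivalent to pd.merge(pd.merge(dfs[0], dfs[1], on='Sector'), dfs[2], on='Sector')
df_final = reduce(lambda left, right: pd.merge(left, right, on='Sector'), dfs)
```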
I created the function get_soup because this part of the code is used in many scripts (so it can be reused elsewhere). In this script I could have put it all in get_data.
get_data takes a url and uses get_soup with this url to get the html. Then it scrapes the data from the html, creates a DataFrame, and returns it.
The main part uses get_data with the two urls and gets two dataframes.
I put some other comments in the code.
# <-- removed unused modules
# <-- use more popular names
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
import pandas as pd
from IPython.display import display

# --- functions ---
# <-- all functions before main part

def get_soup(url):
    # <-- more readable (and more popular) names
    response = urlopen(url)
    html = response.read()
    response.close()
    soup = BS(html, "html.parser")
    return soup

def get_data(url):
    soup = get_soup(url)

    #Isolate header to get name and symbol
    h1 = soup.h1

    #Isolate stock symbol
    # <-- find() to get only the first item
    title = h1.find("span", {"class": "label-primary"}).text
    #print(title)

    #strip space and line break
    # <-- use the same variable instead of strip1, strip2, strip3
    # <-- maybe too many comments
    header = h1.text.strip()
    #strip stock symbol
    header = header.strip(title)
    #strip remaining line break
    header = header.strip()
    #print(header)

    # <-- use better name 'table'
    table = soup.find_all("table", {"class": "chart base-table"})[1]

    #Create list to fill
    #sector_data = [row.text for row in table.find_all("td")]
    sector_data = []
    # <-- removed lists which will be created later
    for row in table.find_all("td"):
        sector_data.append(row.text)

    #Assign every other value to proper list to get 2 columns
    sector_name = sector_data[::2]
    sector_weight = sector_data[1::2]

    #Insert name/symbol for clarification/validation
    sector_weight.insert(0, title)
    sector_name.insert(0, header)

    # create dataframe in pandas
    # <-- create DF directly with data
    df = pd.DataFrame({
        'Sector': sector_name,
        'Weight': sector_weight,
    })
    #display(df)

    return df

# --- main ---

# <-- the same variable `url` because it keeps the same type of data
#     and I will not need this value later - so I can reuse the name
url = 'https://etfdb.com/etf/IWD/'
df1 = get_data(url)

url = 'https://etfdb.com/etf/IWF/'
df2 = get_data(url)

results = df1.merge(df2, on="Sector")
results.columns = ['Sector', 'IWD', 'IWF']

display(results)
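The final merge/rename step can be checked on toy frames of the same shape as get_data's output (hypothetical numbers): merging on 'Sector' puts the two Weight columns side by side with _x/_y suffixes, and the rename labels them by ticker.

```python
import pandas as pd

# hypothetical stand-ins for get_data('...IWD/') and get_data('...IWF/')
df1 = pd.DataFrame({'Sector': ['Financials', 'Energy'], 'Weight': ['23.02%', '8.98%']})
df2 = pd.DataFrame({'Sector': ['Financials', 'Energy'], 'Weight': ['3.21%', '0.35%']})

results = df1.merge(df2, on='Sector')   # columns: Sector, Weight_x, Weight_y
results.columns = ['Sector', 'IWD', 'IWF']
```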