
I am just starting out using Python for data analysis. I am a total beginner and have figured out how to brute-force some things, but I know my approach is inefficient and I don't know how to improve it without breaking what I have.

I have multiple webpages to scrape, and then I store the data in a data frame. The code is identical for all of the pages. How do I set it up as a routine instead of just repeating the same code over and over again?

As an example, the two URLs are:

https://etfdb.com/etf/IWD/
https://etfdb.com/etf/IWF/

The HTML is identical, so the web scraping works exactly the same for both.

Once scraped, I want them in a single data frame.

The code below works, but it seems like the least sophisticated approach, since I know so little. The actual code is likely not pretty, but it works.

Any help on how this should be improved is appreciated.

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np
from IPython.display import display
iwd_url = 'https://etfdb.com/etf/IWD/'
uClient = uReq(iwd_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
#Isolate header to get name and symbol
h1 = page_soup.h1
#Isolate stock symbol
title = h1.findAll("span",{"class":"label-primary"})
titlet = title[0].text
#print(titlet)
#strip space and line break
strip1 = h1.text.strip()
#strip stock symbol
strip2 = strip1.strip(titlet)
#strip remaining line break
strip3 = strip2.strip()
#print(strip3)
IWD = page_soup.findAll("table",{"class":"chart base-table"})[1]
#Create lists to fill
sectordata=[]
sectorname=[]
sectorweight=[]
for row in IWD.findAll("td"):
    sectordata.append(row.text)
#list created 
#Assign every other value to proper list to get 2 columns
sectorname = sectordata[::2]
sectorweight = sectordata[1::2]
#Insert name/symbol for clarification/validation
sectorweight.insert(0,titlet)
sectorname.insert(0,strip3)
# create empty data frame in pandas
df = pd.DataFrame()
#Add the first column to the empty dataframe.
df['Sector'] = sectorname
#Now add the second column.
df['Weight'] = sectorweight
##display(df)
### NEXT
iwf_url = 'https://etfdb.com/etf/IWF/'
uClient = uReq(iwf_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
#Isolate header to get name and symbol
h1 = page_soup.h1
#Isolate stock symbol
title = h1.findAll("span",{"class":"label-primary"})
titlet = title[0].text
#print(titlet)
#strip space and line break
strip1 = h1.text.strip()
#strip stock symbol
strip2 = strip1.strip(titlet)
#strip remaining line break
strip3 = strip2.strip()
#print(strip3)
IWD = page_soup.findAll("table",{"class":"chart base-table"})[1]
#Create lists to fill
sectordata=[]
sectorname=[]
sectorweight=[]
for row in IWD.findAll("td"):
    sectordata.append(row.text)
#list created 
#Assign every other value to proper list to get 2 columns
sectorname = sectordata[::2]
sectorweight = sectordata[1::2]
#Insert name/symbol for clarification/validation
sectorweight.insert(0,titlet)
sectorname.insert(0,strip3)
# create empty data frame in pandas
df2 = pd.DataFrame()
#Add the first column to the empty dataframe.
df2['Sector'] = sectorname
#Now add the second column.
df2['Weight'] = sectorweight
#display(df2)
results = df.merge(df2, on = "Sector")
results.columns = ['Sector', 'IWD', 'IWF']
display(results)

Like I said, this works, but it isn't automated and it's a ham-handed way of getting there. Please help me get better!

asked Jul 23, 2019 at 16:14

2 Answers


Pandas read_html allows you to

Read HTML tables into a list of DataFrame objects.

Using this, we can store the URLs in a list:

urls = ['https://etfdb.com/etf/IWD/', 'https://etfdb.com/etf/IWF/']

Then we read each URL and collect the resulting DataFrames in a list (the sector table is the table at index 5 on the page, and the Percentage column is renamed to the ticker taken from the URL):

dfs = [pd.read_html(u)[5].rename(columns={'Percentage': u.split('/')[-2]}) for u in urls]

Once we have this list of DataFrames, we can use functools.reduce to merge them all pairwise:

from functools import reduce
df_final = reduce(lambda left, right: pd.merge(left, right, on='Sector'), dfs)
print(df_final)

Output

                    Sector     IWD     IWF
0               Financials  23.02%   3.21%
1               Healthcare  12.08%  14.04%
2              Industrials   9.27%   9.39%
3                   Energy   8.98%   0.35%
4   Consumer, Non-Cyclical   8.85%   4.69%
5           Communications    7.7%  11.27%
6               Technology   6.13%  36.46%
7       Consumer, Cyclical   5.86%  14.24%
8              Real Estate   5.15%   2.31%
9                    Other   3.54%   2.55%
10         Basic Materials   2.74%   1.34%
11      ETF Cash Component   0.33%   0.14%
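The reduce pattern above scales to any number of frames. A minimal offline sketch, using toy frames with two rows copied from the output above (no network needed, so the read_html step is replaced by hand-built DataFrames):

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the frames that pd.read_html would return,
# using two sectors from the output above
dfs = [
    pd.DataFrame({'Sector': ['Financials', 'Technology'], 'IWD': ['23.02%', '6.13%']}),
    pd.DataFrame({'Sector': ['Financials', 'Technology'], 'IWF': ['3.21%', '36.46%']}),
]

# Pairwise merge on 'Sector' collapses the whole list into one frame
df_final = reduce(lambda left, right: pd.merge(left, right, on='Sector'), dfs)
print(df_final.columns.tolist())  # ['Sector', 'IWD', 'IWF']
```

Adding a third ticker only means appending one more frame to the list; the reduce call does not change.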
answered Jul 24, 2019 at 5:34

I created the function get_soup because this part of the code is used in many scripts (so it can be reused elsewhere), but in this script it could all go inside get_data.

get_data takes a URL and uses get_soup to fetch the HTML. It then scrapes the data, creates a DataFrame, and returns it.

The main part calls get_data with the two URLs and gets two DataFrames.

I put some other comments in the code.

# <-- remove unused modules
# <-- use more popular names
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
import pandas as pd
from IPython.display import display


# --- functions ---
# <-- all functions before the main part

def get_soup(url):
    # <-- more readable names (and more popular)
    response = urlopen(url)
    html = response.read()
    response.close()
    soup = BS(html, "html.parser")
    return soup


def get_data(url):
    soup = get_soup(url)

    #Isolate header to get name and symbol
    h1 = soup.h1

    #Isolate stock symbol
    # <-- find() to get only the first item
    title = h1.find("span", {"class": "label-primary"}).text
    #print(title)

    #strip space and line break
    # <-- use the same variable instead of strip1, strip2, strip3
    # <-- maybe too many comments
    header = h1.text.strip()
    #strip stock symbol
    header = header.strip(title)
    #strip remaining line break
    header = header.strip()
    #print(header)

    # <-- use the better name 'table'
    table = soup.find_all("table", {"class": "chart base-table"})[1]

    #Create list to fill
    #sector_data = [row.text for row in table.find_all("td")]
    sector_data = []
    # <-- remove lists which will be created later
    for row in table.find_all("td"):
        sector_data.append(row.text)

    #Assign every other value to the proper list to get 2 columns
    sector_name = sector_data[::2]
    sector_weight = sector_data[1::2]

    #Insert name/symbol for clarification/validation
    sector_weight.insert(0, title)
    sector_name.insert(0, header)

    # create the dataframe in pandas
    # <-- create the DataFrame directly with data
    df = pd.DataFrame({
        'Sector': sector_name,
        'Weight': sector_weight,
    })
    #display(df)
    return df


# --- main ---
# <-- the same variable `url` because it keeps the same type of data
# and I won't need this value later - so I can reuse the name.
url = 'https://etfdb.com/etf/IWD/'
df1 = get_data(url)

url = 'https://etfdb.com/etf/IWF/'
df2 = get_data(url)

results = df1.merge(df2, on="Sector")
results.columns = ['Sector', 'IWD', 'IWF']
display(results)
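The two get_data calls could also be driven by a loop over tickers, which answers the "routine instead of repetition" part of the question directly. A sketch, with a hypothetical get_data stub standing in for the real scraper above so it runs offline:

```python
import pandas as pd


def get_data(ticker):
    # Hypothetical stub for the real get_data(url) above, which would
    # scrape https://etfdb.com/etf/<ticker>/ - here we return fixed toy rows
    return pd.DataFrame({'Sector': ['Financials', 'Technology'],
                         'Weight': ['1%', '2%']})


tickers = ['IWD', 'IWF']
results = None
for ticker in tickers:
    # Rename the generic 'Weight' column to the ticker before merging
    df = get_data(ticker).rename(columns={'Weight': ticker})
    # The first frame seeds the result; later frames merge in on 'Sector'
    results = df if results is None else results.merge(df, on='Sector')

print(results.columns.tolist())  # ['Sector', 'IWD', 'IWF']
```

Renaming the column per ticker before merging also removes the need to overwrite results.columns at the end.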
answered Jul 24, 2019 at 0:14
