I am a beginner in Python and have just written a simple web scraper for a webpage article, outputting to a text file, using BeautifulSoup and a list.
The code works fine, but I'm wondering if anybody knows a more efficient way to achieve the same result.
import requests

page = requests.get('https://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

# 2. Parsing the page using BeautifulSoup
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

# 3. Write the content to a text file
all_p_tags = soup.findAll('p')    # Put all <p> and their text into a list
number_of_tags = len(all_p_tags)  # No of <p>?
x = 0

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the <header>
    file.write(title)
    file.write('\n')
    for x in range(number_of_tags):
        word = all_p_tags[x].get_text()   # Write the content by referencing each item in the list
        file.write(word)
        file.write('\n')
    file.close()
2 Answers
# libraries always at the top, at least if they are not conditionally imported
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.msn.com/en-sg/money/topstories/\
10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp'

page = requests.get(base_url)
content = page.content

# 2. Parsing the page using BeautifulSoup
# removed pandas as you are not using it here
soup = BeautifulSoup(content, 'html.parser')

# 3. Write the content to a text file
all_p_tags = soup.findAll('p')  # Put all <p> and their text into a list
# you don't need to count them
# no initializer needed; x = 0 is removed

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the <header>
    file.write(title + '\n')
    for p in all_p_tags:
        file.write(p.get_text() + '\n')

# files opened with a 'with' statement don't have to be manually closed
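If you want to shorten this even further, the whole loop can collapse into a single write call with str.join and a list comprehension. A minimal sketch, assuming soup has already been built as above:

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()
    # one write: the title plus every <p> text, newline-separated
    file.write('\n'.join([title] + [p.get_text() for p in soup.findAll('p')]))
    file.write('\n')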
There are at least three things that may help to make the code more efficient:
- switch to lxml instead of html.parser (requires lxml to be installed)
- use a SoupStrainer to parse only the relevant part of the document
- you can switch to http instead of https. While this would bring the security aspect down, you would avoid the overhead of SSL handshaking, encryption etc. I've noticed the execution time difference locally - try it out (a timing sketch follows after the improved code below)
Improved code:
import requests
from bs4 import BeautifulSoup, SoupStrainer

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

parse_only = SoupStrainer("body")
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()
    file.write(title + '\n')
    for p_tag in soup.select('p'):
        file.write(p_tag.get_text() + '\n')
Note that I've also removed the unused variables and imports.
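If you want to measure the parser difference yourself, here is a minimal timing sketch using timeit; it assumes lxml is installed and reuses the already-downloaded bytes, so only the parsing step is measured:

import timeit
import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp').content

for parser in ('html.parser', 'lxml'):
    # parse the same document 20 times with each parser and print the total time
    seconds = timeit.timeit(lambda: BeautifulSoup(html, parser), number=20)
    print(parser, round(seconds, 2))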
Btw, if it weren't for the title, we could've pinpointed the SoupStrainer to p elements only - might've improved performance even more.
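That said, SoupStrainer accepts the same name filters as find_all, including a list of tag names, so a minimal sketch that strains the document down to just the two elements actually used might look like this (the h1 lookup and the p loop above would still work on the resulting soup):

from bs4 import BeautifulSoup, SoupStrainer
import requests

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

# only <h1> and <p> elements survive parsing; everything else is skipped
parse_only = SoupStrainer(['h1', 'p'])
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)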
file.close() is unnecessary.