I am a beginner in Python and have just written a simple web scraper for a webpage article, outputting to a text file, using BeautifulSoup and a list.
The code works fine, but I'm wondering if anybody knows a more efficient way to achieve the same result.
import requests

page = requests.get('https://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

# 2. Parsing the page using BeautifulSoup
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

# 3. Write the content to a text file
all_p_tags = soup.findAll('p')    # Put all <p> and their text into a list
number_of_tags = len(all_p_tags)  # No of <p>?
x = 0

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the <header>
    file.write(title)
    file.write('\n')
    for x in range(number_of_tags):
        word = all_p_tags[x].get_text()   # Write the content by referencing each item in the list
        file.write(word)
        file.write('\n')
    file.close()
2 Answers
# libraries always at the top, at least if they are not conditionally imported
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.msn.com/en-sg/money/topstories/\
10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp'

page = requests.get(base_url)
content = page.content

# 2. Parsing the page using BeautifulSoup
# removed pandas as you are not using it here
soup = BeautifulSoup(content, 'html.parser')

# 3. Write the content to a text file
all_p_tags = soup.findAll('p')  # Put all <p> and their text into a list
# you don't need to count them
# no initializer needed; x = 0 is removed

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the <header>
    file.write(title + '\n')
    for p in all_p_tags:
        file.write(p.get_text() + '\n')

# files opened with a 'with' statement don't have to be manually closed
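If you want to shorten this even further, the whole loop can collapse into a single write call with str.join and a list comprehension. A minimal sketch, assuming soup has already been built as above:

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()
    # one write: the title plus every <p> text, newline-separated
    file.write('\n'.join([title] + [p.get_text() for p in soup.findAll('p')]))
    file.write('\n')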
There are at least three things that may help to make the code more efficient:
- switch to lxml instead of html.parser (requires lxml to be installed)
- use a SoupStrainer to parse only the relevant part of the document
- you can switch to http instead of https. While this would bring the security aspect down, you would avoid the overhead of SSL handshaking, encryption etc. I've noticed the execution time difference locally - try it out (a timing sketch follows after the improved code below)
Improved code:
import requests
from bs4 import BeautifulSoup, SoupStrainer

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

parse_only = SoupStrainer("body")
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()
    file.write(title + '\n')
    for p_tag in soup.select('p'):
        file.write(p_tag.get_text() + '\n')
Note that I've also removed the unused variables and imports.
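If you want to measure the parser difference yourself, here is a minimal timing sketch using timeit; it assumes lxml is installed and reuses the already-downloaded bytes, so only the parsing step is measured:

import timeit
import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp').content

for parser in ('html.parser', 'lxml'):
    # parse the same document 20 times with each parser and print the total time
    seconds = timeit.timeit(lambda: BeautifulSoup(html, parser), number=20)
    print(parser, round(seconds, 2))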
Btw, if it weren't for the title, we could've pinpointed the SoupStrainer to p elements only - might've improved performance even more.
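That said, SoupStrainer accepts the same name filters as find_all, including a list of tag names, so a minimal sketch that strains the document down to just the two elements actually used might look like this (the h1 lookup and the p loop above would still work on the resulting soup):

from bs4 import BeautifulSoup, SoupStrainer
import requests

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

# only <h1> and <p> elements survive parsing; everything else is skipped
parse_only = SoupStrainer(['h1', 'p'])
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)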
file.close() is unnecessary.