
I am a beginner in Python and have just coded a simple web scraper for a webpage article, writing the output to a text file, using BeautifulSoup and a list.

The code is working fine, but I'm wondering if anybody knows a more efficient way to achieve the same result.

import requests
page = requests.get('https://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')
# 2. Parsing the page using BeautifulSoup
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
# 3. Write the content to a text file
all_p_tags = soup.findAll('p')  # Put all <p> tags and their text into a list
number_of_tags = len(all_p_tags)  # Number of <p> tags
x = 0
with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the <h1> title
    file.write(title)
    file.write('\n')
    for x in range(number_of_tags):
        word = all_p_tags[x].get_text()  # Write the content by referencing each item in the list
        file.write(word)
        file.write('\n')
    file.close()
asked Dec 6, 2017 at 4:31
  • Is there a reason you want to make this "efficient"? And that file.close() is unnecessary, just saying. Commented Dec 6, 2017 at 4:33
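To make the commenter's point concrete, here is a minimal sketch, independent of the question's code, showing that a with statement closes the file by itself, so an explicit file.close() inside the block is redundant:

with open('demo.txt', mode='wt', encoding='utf-8') as file:
    file.write('hello\n')
# The context manager has already closed the file at this point
print(file.closed)  # True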

2 Answers

# Libraries always at the top, at least if they are not conditionally imported
import requests
from bs4 import BeautifulSoup

base_url = ('https://www.msn.com/en-sg/money/topstories/'
            '10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')
page = requests.get(base_url)
content = page.content
# 2. Parsing the page using BeautifulSoup
# Removed pandas, as you are not using it here
soup = BeautifulSoup(content, 'html.parser')
# 3. Write the content to a text file
all_p_tags = soup.findAll('p')  # Put all <p> tags and their text into a list
# You don't need to count them, and no initializer is needed (x = 0 removed)
with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the title
    file.write(title + '\n')
    for p in all_p_tags:
        file.write(p.get_text() + '\n')
    # Files opened with a 'with' statement don't have to be closed manually
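One further variation, not from the answer above but a minimal sketch of the same idea: collect the title and paragraph texts into a list and write them with a single call, avoiding the explicit loop.

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    lines = [soup.find('h1').text.strip()] + [p.get_text() for p in all_p_tags]
    file.write('\n'.join(lines) + '\n')  # One write call instead of one per paragraph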
answered Dec 6, 2017 at 4:42

There are at least three things that may help to make the code more efficient:

  • switch to lxml instead of html.parser (requires lxml to be installed)
  • use a SoupStrainer to parse only the relevant part of the document
  • you can switch to http instead of https. While this weakens security, you avoid the overhead of the SSL handshake, encryption, etc. I noticed a measurable execution-time difference locally; try it out

Improved code:

import requests
from bs4 import BeautifulSoup, SoupStrainer

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')
parse_only = SoupStrainer("body")
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)
with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()
    file.write(title + '\n')
    for p_tag in soup.select('p'):
        file.write(p_tag.get_text() + '\n')

Note that I've also removed the unused variables and imports.

By the way, if it weren't for the title, we could have pinpointed SoupStrainer to p elements only, which might have improved performance even more, as sketched below.
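A minimal sketch of that idea, assuming the title is not needed (filename.txt is just a placeholder):

import requests
from bs4 import BeautifulSoup, SoupStrainer

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')
parse_only = SoupStrainer('p')  # Parse only <p> elements, skipping everything else
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)
with open('filename.txt', mode='wt', encoding='utf-8') as file:
    for p_tag in soup.select('p'):
        file.write(p_tag.get_text() + '\n')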

answered Dec 6, 2017 at 4:43
