Code Review

There are at least three things that may help make the code more efficient:

  • switch to lxml instead of html.parser (requires lxml to be installed);
  • use a SoupStrainer to parse only the relevant part of the document;
  • switch to http instead of https. While this reduces security, it avoids the overhead of the SSL handshake, encryption etc. I've noticed the execution time difference locally; try it out.
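The SoupStrainer speed-up can be measured on a synthetic page. This is a minimal sketch, not a benchmark of the actual article page: the HTML and the repetition counts are made up, and html.parser is used so it runs even without lxml installed.

```python
import timeit
from bs4 import BeautifulSoup, SoupStrainer

# Synthetic page with a large <head> that a body-only strainer can skip
# entirely (hypothetical content, used only for measurement)
html = ("<html><head>" + "<meta name='k' content='v'/>" * 500
        + "</head><body><h1>Title</h1>" + "<p>text</p>" * 200
        + "</body></html>")

def parse_full():
    # Parse the whole document
    return BeautifulSoup(html, "html.parser")

def parse_strained():
    # Parse only <body> and its descendants
    return BeautifulSoup(html, "html.parser",
                         parse_only=SoupStrainer("body"))

# Both parses see the same body content...
assert parse_full().h1.text == parse_strained().h1.text

# ...but the strained one does less work
print("full:    ", timeit.timeit(parse_full, number=20))
print("strained:", timeit.timeit(parse_strained, number=20))
```

On my machine the strained parse is noticeably faster; your numbers will vary with page size and how much of it falls outside the strainer.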

Improved code:

import requests
from bs4 import BeautifulSoup, SoupStrainer

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

# Parse only the <body> of the document
parse_only = SoupStrainer("body")
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()
    file.write(title + '\n')

    for p_tag in soup.select('p'):
        file.write(p_tag.get_text() + '\n')

Note that I've also removed the unused variables and imports.

By the way, if it weren't for the title, we could have pinpointed the SoupStrainer to p elements only, which might have improved performance even more.
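For illustration, here is a minimal sketch of that p-only strainer on a made-up snippet; note how the h1 (and hence the title) is lost, which is exactly why the answer's code strains on body instead:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up snippet standing in for the article page
html = "<html><body><h1>Headline</h1><p>One</p><p>Two</p></body></html>"

# Parse nothing but the <p> elements
soup = BeautifulSoup(html, "html.parser",
                     parse_only=SoupStrainer("p"))

print([p.get_text() for p in soup.find_all("p")])  # the paragraphs survive
print(soup.h1)  # None - the <h1> was never parsed
```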

alecxe
