Code Review

There are at least three things that may help to make the code more efficient:

  • switch to [lxml instead of html.parser][1] (requires lxml to be installed)
  • use a [SoupStrainer][2] to parse only the relevant part of the document
  • you can switch to http instead of https. While this weakens security, you would avoid the overhead of the SSL handshake, encryption etc. - I've noticed the execution time difference locally, try it out
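
As a rough sketch of the first point, you can time both parsers on a synthetic document (assumes bs4 and lxml are installed; the absolute numbers and the size of the gap will vary by machine and page):

```python
from timeit import timeit

from bs4 import BeautifulSoup

# Synthetic document; real pages are larger and tend to show a bigger gap.
html = "<html><body>" + "<p>text</p>" * 2000 + "</body></html>"

# Time building the soup with each parser.
t_builtin = timeit(lambda: BeautifulSoup(html, "html.parser"), number=10)
t_lxml = timeit(lambda: BeautifulSoup(html, "lxml"), number=10)

print(f"html.parser: {t_builtin:.3f}s  lxml: {t_lxml:.3f}s")
```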

Improved code:

import requests
from bs4 import BeautifulSoup, SoupStrainer

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

parse_only = SoupStrainer("body")
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()
    file.write(title + '\n')

    for p_tag in soup.select('p'):
        file.write(p_tag.get_text() + '\n')

Note that I've also removed the unused variables and imports.

Btw, if it weren't for the title, we could've pinpointed the SoupStrainer to p elements only - that might've improved performance even more.

  [1]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
  [2]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document
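
A SoupStrainer also accepts a list of tag names, so the strainer can be narrowed to just the h1 and the p elements rather than the whole body. A minimal sketch with an illustrative document (assumes lxml is installed):

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<html><body>
  <h1> Title </h1>
  <p>first</p>
  <div><p>second</p></div>
  <table><tr><td>never parsed</td></tr></table>
</body></html>
"""

# Keep only the tags we actually read; everything else is skipped at parse time.
only_needed = SoupStrainer(["h1", "p"])
soup = BeautifulSoup(html, "lxml", parse_only=only_needed)

title = soup.find("h1").get_text(strip=True)
paragraphs = [p.get_text() for p in soup.find_all("p")]
```

Note that matching tags are kept even when their parents (like the div above) are not part of the strainer.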

Post Migrated Here from stackoverflow.com
alecxe

