There are at least three things that may help to make the code more efficient:

- switch to [`lxml` instead of `html.parser`][1] (requires `lxml` to be installed)
- use a [`SoupStrainer`][2] to parse only the relevant part of the document
- you can switch to `http` instead of `https`. While this weakens the security aspect, you avoid the overhead of the SSL handshake, encryption etc. I've noticed the execution time difference locally - try it out
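As a rough sketch of how to check the parser difference yourself, you can time both parsers with `timeit` on a throwaway document (the HTML string and repetition count below are made up for illustration; real pages are larger, so the gap there tends to be bigger):

```python
import timeit

from bs4 import BeautifulSoup

# A made-up document, repeated so the timing difference is visible
html = "<html><body>" + "<p>some text</p>" * 1000 + "</body></html>"

for parser in ("html.parser", "lxml"):
    elapsed = timeit.timeit(lambda: BeautifulSoup(html, parser), number=20)
    print(f"{parser}: {elapsed:.3f}s")
```

Both parsers build the same tree here; only the parsing speed differs.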
Improved code:
    import requests
    from bs4 import BeautifulSoup, SoupStrainer

    page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

    parse_only = SoupStrainer("body")
    soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)

    with open('filename.txt', mode='wt', encoding='utf-8') as file:
        title = soup.find('h1').text.strip()
        file.write(title + '\n')

        for p_tag in soup.select('p'):
            file.write(p_tag.get_text() + '\n')
Note that I've also removed the unused variables and imports.
Btw, if it weren't for the title, we could've pinpointed `SoupStrainer` to `p` elements only - that might've improved performance even more.
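For illustration, here is a minimal sketch of that idea on a toy document (using the built-in `html.parser` so it runs even without `lxml` installed) - elements outside the strainer are never parsed at all:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = "<html><body><h1>Title</h1><p>one</p><p>two</p></body></html>"

# Only `p` elements are handed to the tree builder
only_p = SoupStrainer("p")
soup = BeautifulSoup(html, "html.parser", parse_only=only_p)

print([p.get_text() for p in soup.find_all("p")])  # ['one', 'two']
print(soup.find("h1"))  # None - the h1 was never parsed
```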
[1]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
[2]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document