There are at least three things that may help to make the code more efficient:

- switch to [`lxml` instead of `html.parser`][1] (requires `lxml` to be installed)
- use a [`SoupStrainer`][2] to parse only the relevant part of the document
- you can switch to `http` instead of `https`. While this weakens the security aspect, you avoid the overhead of the SSL handshake, encryption etc. I've noticed the execution time difference locally - try it out
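As a rough sketch of how to check the parser difference yourself, you can time both parsers with `timeit` on a throwaway document (the HTML string and repetition count below are made up for illustration; real pages are larger, so the gap there tends to be bigger):

```python
import timeit

from bs4 import BeautifulSoup

# A made-up document, repeated so the timing difference is visible
html = "<html><body>" + "<p>some text</p>" * 1000 + "</body></html>"

for parser in ("html.parser", "lxml"):
    elapsed = timeit.timeit(lambda: BeautifulSoup(html, parser), number=20)
    print(f"{parser}: {elapsed:.3f}s")
```

Both parsers build the same tree here; only the parsing speed differs.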
Improved code:
    import requests
    from bs4 import BeautifulSoup, SoupStrainer

    page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

    parse_only = SoupStrainer("body")
    soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)

    with open('filename.txt', mode='wt', encoding='utf-8') as file:
        title = soup.find('h1').text.strip()
        file.write(title + '\n')

        for p_tag in soup.select('p'):
            file.write(p_tag.get_text() + '\n')
Note that I've also removed the unused variables and imports.
Btw, if it weren't for the title, we could've pinpointed `SoupStrainer` to `p` elements only - that might've improved performance even more.
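For illustration, here is a minimal sketch of that idea on a toy document (using the built-in `html.parser` so it runs even without `lxml` installed) - elements outside the strainer are never parsed at all:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = "<html><body><h1>Title</h1><p>one</p><p>two</p></body></html>"

# Only `p` elements are handed to the tree builder
only_p = SoupStrainer("p")
soup = BeautifulSoup(html, "html.parser", parse_only=only_p)

print([p.get_text() for p in soup.find_all("p")])  # ['one', 'two']
print(soup.find("h1"))  # None - the h1 was never parsed
```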
[1]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
[2]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document