Scraping HTML using Beautiful Soup

Question 1

I have written a script using Beautiful Soup to scrape some HTML, process it, and produce HTML again. However, I am not convinced by my code and am looking for improvements.

Structure of my source HTML file:

<!DOCTYPE html>
<html>
...
<body>
...
<section id="article-section-1">
 <div id="article-section-1-icon" class="icon">
 <img src="../images/introduction.jpg" />
 </div> 
 <div id="article-section-1-heading" 
 class="heading">
 Some Heading 1
 </div>
 <div id="article-section-1-content" 
 class="content"> 
 This section can have p, img, or even div tags
 </div>
</section>
...
...
<section id="article-section-8">
 <div id="article-section-8-icon" class="icon">
 <img src="../images/introduction.jpg" />
 </div> 
 <div id="article-section-8-heading" 
 class="heading">
 Some Heading
 </div>
 <div id="article-section-8-content" 
 class="content"> 
 This section can have p, img, or even div tags
 </div>
</section>
...
</body>
</html>

My code:

import re
from bs4 import BeautifulSoup

# myhtml holds the source HTML shown above
soup = BeautifulSoup(myhtml)
all_sections = soup.find_all('section', id=re.compile("article-section-[0-9]"))
for section in all_sections:
    heading = str(section.find_all('div', class_="heading")[0].text).strip()
    contents_list = section.find_all('div', class_="content")[0].contents
    content = ''
    for i in contents_list:
        if i != '\n':
            content = content + str(i)
    print('<html><body><h1>' + heading + '</h1><hr>' + content + '</body></html>')

My code has worked without any issues so far; however, it does not feel Pythonic to me, and I believe it could be done in a much better or simpler way.

1. contents_list is a list that includes items like '\n'. I remove them by looping over the list. Is there a better way?
2. I am not interested in the article icon, so I ignore it in my script.
3. I use the strip method to remove extra whitespace from the heading. Is there a better way?
4. Apart from newlines, the content div can contain anything, even nested divs. I have run my script over the few pages I have and it seems to work. Is there anything here I need to watch out for?
5. Lastly, is there a better way to generate HTML files? Once I have scraped the data, I will generate HTML files from it. These files will share the same structure (CSS, JavaScript, etc.), and all I have to do is put the scraped data into them. Can the method I used above (building a string from the content and headings) be improved in any way?

I am not looking for full code in answers; just give me subtle hints or point me in some direction.

Answer 1

Regarding 1, you can filter the list with a comprehension:

new_content = [c for c in old_content if c != '\n']

or, if old_content is already a single string rather than a list (a list has no replace method), simply:

new_content = old_content.replace('\n', '')
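As a sketch of how the first suggestion could be applied to a Beautiful Soup `.contents` list (a list of nodes, not a string), the comprehension below drops whitespace-only text nodes and joins the rest. The sample HTML and the names `sample_html`, `kept`, and `content_html` are made up for illustration, and the code assumes Python 3 with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample input, modelled on the structure in the question.
sample_html = """
<section id="article-section-1">
  <div class="content">
    <p>First paragraph</p>
    <div><img src="a.jpg"/></div>
  </div>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")
content_div = soup.find("div", class_="content")

# Keep every child except text nodes that contain only whitespace.
kept = [
    child for child in content_div.contents
    if not (isinstance(child, NavigableString) and not child.strip())
]

# str() on a node serialises it back to HTML, nested tags included.
content_html = "".join(str(child) for child in kept)
print(content_html)
```

Testing for whitespace-only text nodes, rather than comparing against exactly `'\n'`, also covers indentation such as `'\n    '` between tags.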

Regarding 5, if you are generating anything nontrivial, it will pay off to learn a template engine such as Jinja2. If that is too much, you can write a simple template in a text file and use a regex, or even replace(), to substitute the generic parts:

# template
<div class="some value">%FOO%</div>
<div class="some value">%BAR%</div>

and on the python side:

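values = {"%FOO%": "the foos", "%BAR%": "the bars"}
# Read the template file and close it once it has been read
with open('template') as f:
    template = f.read()
# Substitute every placeholder with its value
for k, v in values.items():
    template = template.replace(k, v)
print(template)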
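Between plain `replace()` and a full engine such as Jinja2, Python's standard library also offers `string.Template`, which does the same kind of placeholder substitution. This is a swapped-in alternative, not what the answer above uses; the `$heading` and `$content` placeholder names and the `render_page` helper are hypothetical.

```python
from string import Template

# Hypothetical page skeleton; $heading and $content are placeholders.
PAGE = Template(
    "<html><body><h1>$heading</h1><hr>$content</body></html>"
)


def render_page(heading, content):
    """Fill the skeleton with one scraped heading and its content HTML."""
    # substitute() raises KeyError if a placeholder is left unfilled,
    # which catches typos that a chain of replace() calls would miss.
    return PAGE.substitute(heading=heading, content=content)


page = render_page("Some Heading 1", "<p>Body text</p>")
print(page)
```

Like `replace()`, `string.Template` does no HTML escaping, so any plain text inserted this way would need escaping separately if it can contain `<` or `&`.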

Answer 2

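You can use find instead of find_all (findAll in older code). find_all searches the whole subtree and collects every match, whereas find stops at the first one. Since only the first element matters here, find(...) can replace find_all(...)[0] and avoids the unnecessary work.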
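To make the difference concrete, here is a small sketch showing that `find(...)` returns the same element as `find_all(...)[0]`, and that `get_text(strip=True)` can stand in for `.text` followed by `.strip()`. The sample HTML is invented for illustration, and the code assumes Python 3 with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, modelled on the structure in the question.
sample_html = """
<section id="article-section-1">
  <div class="heading">
    Some Heading 1
  </div>
  <div class="content"><p>Body</p></div>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")
section = soup.find("section", id="article-section-1")

first_via_find_all = section.find_all("div", class_="heading")[0]
first_via_find = section.find("div", class_="heading")

# Both approaches reach the same node; find() just stops at the first match.
same_node = first_via_find is first_via_find_all

# get_text(strip=True) trims the surrounding whitespace in one call.
heading = first_via_find.get_text(strip=True)
print(same_node, heading)
```

One behavioural difference to keep in mind: when nothing matches, `find()` returns `None`, while `find_all(...)[0]` raises an `IndexError`, so a `None` check may be needed before using the result.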

Accepted answer (the list comprehension and template answer above): Jakub M. · 2013-09-09 12:10:09Z

