I have written a script using Beautiful Soup that scrapes some HTML, processes it, and produces HTML back. However, I am not satisfied with my code and am looking for improvements.
Structure of my source HTML file:
<!DOCTYPE html>
<html>
...
<body>
...
<section id="article-section-1">
<div id="article-section-1-icon" class="icon">
<img src="../images/introduction.jpg" />
</div>
<div id="article-section-1-heading"
class="heading">
Some Heading 1
</div>
<div id="article-section-1-content"
class="content">
This section can have p, img, or even div tags
</div>
</section>
...
...
<section id="article-section-8">
<div id="article-section-8-icon" class="icon">
<img src="../images/introduction.jpg" />
</div>
<div id="article-section-8-heading"
class="heading">
Some Heading
</div>
<div id="article-section-8-content"
class="content">
This section can have p, img, or even div tags
</div>
</section>
...
</body>
</html>
My code:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(myhtml)
all_sections = soup.find_all('section',id=re.compile("article-section-[0-9]"))
for section in all_sections:
    heading = str(section.find_all('div',class_="heading")[0].text).strip()
    contents_list = section.find_all('div',class_="content")[0].contents
    content = ''
    for i in contents_list:
        if i != '\n':
            content = content+str(i)
    print '<html><body><h1>'+heading+'</h1><hr>'+content+'</body></html>'
My code has worked without any issues so far; however, I don't find it Pythonic. I believe it could be done in a much better/simpler way.
1. contents_list is a list which has items like '\n'. I am removing them with a loop over the list. Is there any better way?
2. I am not interested in the article icon, so I am ignoring it in my script.
3. I am using the strip method to remove extra whitespace from the heading. Is there any better way?
4. Other than newlines, the div element within content can have anything, even nested divs. So far, I have run my script over a few pages and it seems to work. Is there anything here I need to take care of?
5. Lastly, is there any better way to generate HTML files? Once I have scraped the data, I will work on generating HTML files. These files will have the same structure (CSS, JavaScript, etc.) and all I have to do is put the scraped data into them. Can the method I used above (building a string and inserting the content and headings) be improved in any way?
I am not looking for full code in answers; just give me subtle hints or point me in some direction.
2 Answers
Regarding 1, you can use a list comprehension:
new_content = [c for c in old_content if c != '\n']
or, if old_content is already a single string rather than a list, simply
new_content = old_content.replace('\n', '')
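As a rough sketch of the same idea, the filter and the string concatenation can be folded into a single join. The list below is only a hypothetical stand-in for what .contents returns (in Beautiful Soup the items would be NavigableString and Tag objects), and the code assumes Python 3. Testing str(item).strip() instead of comparing with '\n' is a swapped-in variant that also drops items consisting of a newline followed by indentation.

```python
# Hypothetical stand-in for Tag.contents: with Beautiful Soup the items would be
# NavigableString and Tag objects; plain strings are used here so the sketch
# runs without third-party packages.
contents_list = ["\n", "<p>First paragraph</p>", "\n   ", "<img src='a.jpg'/>", "\n"]

# str(item).strip() is empty for any whitespace-only item, so this also
# drops items such as "\n   " that a comparison with "\n" would keep.
content = "".join(str(item) for item in contents_list if str(item).strip())

print(content)
```

Note that this only skips whitespace-only items at the top level; whitespace inside nested elements is left untouched.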
Regarding 5, if you are generating anything nontrivial, it will pay off to learn a template engine such as Jinja2. If that is too much, you can make a simple template in a text file and use a regex or even replace() to substitute the generic parts:
# template
<div class="some value">%FOO%</div>
<div class="some value">%BAR%</div>
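And on the Python side:
values = {"%FOO%": "the foos", "%BAR%": "the bars"}
# Read the template and close the file when done
with open('template') as f:
    template = f.read()
# Substitute each placeholder in turn; items() works on Python 2 and 3
for k, v in values.items():
    template = template.replace(k, v)
print(template)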
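If a full template engine feels like too much, a middle ground between Jinja2 and hand-rolled replace() calls might be string.Template from the standard library. The sketch below assumes Python 3; the page skeleton, the render_page helper, and the sample values are made up for illustration and are not taken from the question.

```python
from html import escape
from string import Template

# Hypothetical page skeleton; a real one would carry the shared CSS and JavaScript.
PAGE = Template(
    "<html><body><h1>$heading</h1><hr>$content</body></html>"
)


def render_page(heading, content):
    """Fill the skeleton; heading is escaped, content is assumed to be trusted HTML."""
    return PAGE.substitute(heading=escape(heading), content=content)


page = render_page("Tips & Tricks", "<p>Some text</p>")
print(page)
```

One caveat with this approach: a literal dollar sign in the skeleton itself (for example in inline JavaScript) has to be written as $$, which is one reason a dedicated engine such as Jinja2 may be more comfortable for larger pages.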
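You can use find instead of findAll (spelled find_all in your code). findAll searches the whole subtree and collects every match, whereas find stops at the first match. If only the first element matters, using find can save time and also removes the need for the [0] index.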
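A small sketch of the difference, assuming Python 3 with the bs4 package installed and the built-in "html.parser" backend. The HTML snippet is a trimmed-down, made-up version of the structure in the question. The get_text(strip=True) call is shown as one possible alternative to str(...).strip() for the heading.

```python
from bs4 import BeautifulSoup

# Made-up sample modelled on the structure in the question.
html = """
<section id="article-section-1">
  <div class="icon"><img src="../images/introduction.jpg" /></div>
  <div class="heading">
    Some Heading 1
  </div>
  <div class="content"><p>First paragraph</p></div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("section", id="article-section-1")

# find_all collects every match, so [0] is needed to get the first one.
first_via_find_all = section.find_all("div", class_="heading")[0]

# find returns the first match directly, or None if nothing matches.
first_via_find = section.find("div", class_="heading")

# get_text(strip=True) strips surrounding whitespace from the text.
heading = first_via_find.get_text(strip=True)
print(heading)
```

Because find returns None when nothing matches, a page that lacks a heading div would need a check before calling get_text, whereas the find_all(...)[0] form would raise an IndexError instead.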