I am working on an HTML document in which I need to add certain classes to some elements. In the following code, I am adding the class img-responsive.
from bs4 import BeautifulSoup

def add_img_class1(img_tag):
    # EAFP: try the lookup and handle the missing attribute
    try:
        img_tag['class'] = img_tag['class']+' img-responsive'
    except KeyError:
        img_tag['class'] = 'img-responsive'
    return img_tag

def add_img_class2(img_tag):
    # LBYL: check for the attribute first
    if img_tag.has_attr('class'):
        img_tag['class'] = img_tag['class']+' img-responsive'
    else:
        img_tag['class'] = 'img-responsive'
    return img_tag

soup = BeautifulSoup(myhtml)
for img_tag in soup.find_all('img'):
    img_tag = add_img_class1(img_tag) #or img_tag = add_img_class2(img_tag)

html = soup.prettify(soup.original_encoding)
with open("edited.html","wb") as file:
    file.write(html)
- Both functions do the same thing, but one uses exceptions and the other uses has_attr from BS4. Which is better, and why?
- Am I writing the result back to HTML the right way? Or should I convert the entire soup to UTF-8 (with string.encode('UTF-8')) and write that?
1 Answer
The second option is better, because the possible error is explicit. That said, in many cases in Python you should follow EAFP ("easier to ask forgiveness than permission") and go for the try statement. Either way, we can do better.
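For readers unfamiliar with the terms, the two styles can be sketched on a plain dictionary. The dict is only a stand-in for a tag's attributes, the function names `add_class_eafp` and `add_class_lbyl` are made up for this illustration, and the values are plain strings as in the question's code.

```python
def add_class_eafp(attrs, name):
    """EAFP: attempt the lookup and handle the failure if the key is missing."""
    try:
        attrs['class'] = attrs['class'] + ' ' + name
    except KeyError:
        attrs['class'] = name
    return attrs


def add_class_lbyl(attrs, name):
    """LBYL ("look before you leap"): check for the key before using it."""
    if 'class' in attrs:
        attrs['class'] = attrs['class'] + ' ' + name
    else:
        attrs['class'] = name
    return attrs


# Both styles give the same result, with or without an existing class.
with_class_eafp = add_class_eafp({'class': 'thumb'}, 'img-responsive')
with_class_lbyl = add_class_lbyl({'class': 'thumb'}, 'img-responsive')
no_class_eafp = add_class_eafp({}, 'img-responsive')
no_class_lbyl = add_class_lbyl({}, 'img-responsive')

print(with_class_eafp)
print(no_class_eafp)
```

The behavior is identical; the difference is only in whether the missing-key case is handled before or after the lookup.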
get(key, default)
In BeautifulSoup, a tag's attributes behave like a dictionary. This means you can write img_tag.get('class', '') to get the class if it exists, or the empty string if it doesn't.
def add_img_class(img_tag):
    # assign to the attribute, not to the local name img_tag
    img_tag['class'] = img_tag.get('class', '') + ' img-responsive'
You don't need to return img_tag: the function modifies the tag object in place, and the caller already holds a reference to that same object. Now that your function is a one-liner, you might as well use the one-liner directly.
Multi-valued attributes
Note that the above code doesn't work! class is a multi-valued attribute in HTML4 and HTML5, so BeautifulSoup 4, at least, returns a list instead of a string. The correct code becomes:
img_tag['class'] = img_tag.get('class', []) + ['img-responsive']
Which is nicer, as you don't have to worry about the extra space between the two values.
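To see the list form in action, here is a small sketch that uses a plain dict in place of the tag's attributes, so it runs without BeautifulSoup installed. The helper name `add_class` and the duplicate check are my own additions for illustration, not part of the BeautifulSoup API.

```python
def add_class(attrs, name):
    """Append `name` to a list-valued 'class' entry, creating the list if needed."""
    classes = attrs.get('class', [])      # empty list when there is no class yet
    if name not in classes:               # avoid adding the same class twice
        attrs['class'] = classes + [name]
    return attrs


has_class = add_class({'class': ['thumb']}, 'img-responsive')
no_class = add_class({'src': 'a.png'}, 'img-responsive')
repeated = add_class({'class': ['img-responsive']}, 'img-responsive')

print(has_class)
print(no_class)
print(repeated)
```

Because the values are list items, there is no separator to manage, and checking for an existing class is a simple membership test.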
Encoding
You don't need to convert to UTF-8 before writing the file back. What's wrong with the way you are writing the file now?
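As a rough illustration of why no extra conversion step is needed, the sketch below writes already-encoded bytes to a file opened in binary mode and reads them back unchanged. It assumes Python 3; the `markup` string is a hypothetical stand-in for the text of the document, and the file is written to a temporary directory rather than the working directory.

```python
import os
import tempfile

markup = '<p>caf\u00e9</p>\n'        # stand-in for the document text
encoded = markup.encode('utf-8')     # bytes, as expected from prettify() given an encoding

path = os.path.join(tempfile.mkdtemp(), 'edited.html')
with open(path, 'wb') as f:          # binary mode writes the bytes as they are
    f.write(encoded)

with open(path, 'rb') as f:
    round_trip = f.read()

print(round_trip == encoded)
print(round_trip.decode('utf-8') == markup)
```

Once the data is bytes in the encoding you want, a file opened with "wb" stores it verbatim, so a second encode step would add nothing.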
- Using img['class'] = img.get('class', []) + ['img-responsive'] results in TypeError: coercing to Unicode: need string or buffer, list found, but img['class'] = img.get('class', []) + ' img-responsive' does the trick. (Fred Campos, Apr 29, 2015 at 9:11)
- FredCampos, did you use BeautifulSoup4? Did you parse your document as HTML? The BeautifulSoup 4 docs mention that img[class] should always return a list: crummy.com/software/BeautifulSoup/bs4/doc/… (Quentin Pradet, Apr 29, 2015 at 13:16)