Remove all inline styles using BeautifulSoup

Question 1

I'm doing some HTML cleaning with BeautifulSoup, and I'm new to both Python and BeautifulSoup. Tags are being removed correctly as follows, based on an answer I found elsewhere on Stack Overflow:

[s.extract() for s in soup('script')]

But how to remove inline styles? For instance the following:

<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">

Should become:

<p>Text</p>
<img href="somewhere.com">

How can I delete the class, id, name & style attributes of all elements?

Answers to other similar questions I could find all mentioned using a CSS parser to handle this, rather than BeautifulSoup, but as the task is simply to remove rather than manipulate the attributes, and is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.

Question 2

You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:

for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():

[tag.decompose() for tag in soup("script")]

Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
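For anyone who wants to try this end to end, here is a minimal sketch, assuming BeautifulSoup 4 with the built-in html.parser. The sample markup and the variable names (html, cleaned) are illustrative only. It uses tag.attrs.pop(name, None) instead of del, which is one way to be sure nothing is raised when a tag lacks the attribute.

```python
from bs4 import BeautifulSoup

# Illustrative input, modelled on the markup in the question.
html = (
    '<p class="author" id="author_id" name="author_name" '
    'style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
    '<script>alert(1)</script>'
)
soup = BeautifulSoup(html, "html.parser")

# Drop <script> tags and their contents entirely.
for script in soup("script"):
    script.decompose()

# Remove the four attributes from every remaining tag.
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        tag.attrs.pop(attribute, None)  # no error if the attribute is missing

cleaned = str(soup)
```

After this runs, the p tag has no attributes left and the img tag keeps only href.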

Question 3

I was using extract() in case I decided to generate a list of removed code at any point, but decompose() works just as well for completely removing & destroying tags & content. Thanks for the attribute-delete snippet, works like a charm!
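A small sketch of that use case, assuming BeautifulSoup 4 and made-up sample markup: because extract() returns the detached tag, the removed code can be collected in a list while the tree is being cleaned.

```python
from bs4 import BeautifulSoup

# Illustrative input only.
soup = BeautifulSoup(
    "<div><script>alert(1)</script><p>Keep me</p></div>", "html.parser"
)

# extract() detaches each tag from the tree and returns it,
# so the removed markup can be logged or inspected later.
removed = [str(script.extract()) for script in soup("script")]
remaining = str(soup)
```

With decompose() there would be nothing to collect, since it destroys the tag and returns None.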

Question 4

Makes sense. I'll leave the note about decompose() for anyone else who might stumble across this.

Question 5

I wouldn't do this in BeautifulSoup - you'll spend a lot of time trying, testing, and working around edge cases.

Bleach does exactly this for you. http://pypi.python.org/pypi/bleach

If you were to do this in BeautifulSoup, I'd suggest you go with the "whitelist" approach, like Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.
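A rough sketch of what that whitelist approach could look like in BeautifulSoup 4. The ALLOWED table and the sanitize name are hypothetical, chosen for illustration, and this is not a security-grade sanitizer (it does not validate URLs or attribute values, for example). Tags missing from the table are unwrapped, so their text stays but the tag itself goes.

```python
from bs4 import BeautifulSoup

# Hypothetical policy: tag name -> attributes that tag may keep.
ALLOWED = {
    "p": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}

def sanitize(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, so the tree can be
    # modified safely while looping over it.
    for tag in soup.find_all(True):
        allowed = ALLOWED.get(tag.name)
        if allowed is None:
            tag.unwrap()  # drop the tag itself, keep its children
            continue
        for attr in list(tag.attrs):  # copy the keys before deleting
            if attr not in allowed:
                del tag.attrs[attr]
    return str(soup)

result = sanitize(
    '<div id="x"><p style="color:red" class="c">Hi '
    '<a href="/u" onclick="e()">link</a></p></div>'
)
```

Here result keeps the p and a tags, with only href surviving on the link.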

Question 6

Cool, I didn't know about Bleach. I wasn't thinking of the use case, but if the goal is to sanitize untrusted HTML, then this definitely seems like a better approach. You get my upvote!

Question 7

Bleach is pretty great. I really like it.

Question 8

Here's my solution for Python3 and BeautifulSoup4:

def remove_attrs(soup, whitelist=tuple()):
    for tag in soup.findAll(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup

It supports a whitelist of attributes which should be kept. :) If no whitelist is supplied all the attributes get removed.

Question 9

What about lxml's Cleaner?

from lxml.html.clean import Cleaner
content_without_styles = Cleaner(style=True).clean_html(content)

Question 10

Based on jmk's function, I use this function to remove attributes based on a whitelist:

Works in Python 2 with BeautifulSoup 3.

def clean(tag, whitelist=[]):
    tag.attrs = None
    for e in tag.findAll(True):
        # Iterate over a copy, since deleting mutates e.attrs
        for attribute in list(e.attrs):
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        # e.attrs = None  # delete all attributes
    return tag

# Example: keep only title and href
clean(soup, ["title", "href"])

Question 11

You shouldn't be passing mutable structures as default function parameter values. As seen here.
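To illustrate the point in plain Python (the function names here are made up for the demonstration): a default list is created once, when the function is defined, and is then shared by every call that omits the argument.

```python
def append_bad(item, bucket=[]):
    # The same list object is reused across calls.
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):
    # A fresh list is created on each call that omits the argument.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

bad_first = append_bad("a")
bad_second = append_bad("b")    # same list as bad_first, now ["a", "b"]

good_first = append_good("a")
good_second = append_good("b")  # independent list, ["b"]
```

A function that only reads its whitelist is not affected in practice, but an immutable default such as an empty tuple avoids the trap altogether.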

Question 12

Not perfect but short:

' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)]);

Question 13

I achieved this using re and regex.

import re

def removeStyle(html):
    # Raw string avoids invalid-escape warnings; matches ` style="..."`
    style = re.compile(r' style=.*?".*?"')
    html = re.sub(style, '', html)
    return html

html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
removeStyle(html)

Output: <p class="author" id="author_id" name="author_name">Text</p>

You can use this to strip any inline attribute by replacing "style" in the regex with the attribute's name.
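As a hedged sketch of that generalisation (strip_attribute is a hypothetical helper, not part of any library), the attribute name can be passed in as a parameter. The example also shows a limitation of the regex approach: it cannot tell markup from ordinary text, so a matching pattern inside the text content is removed too.

```python
import re

def strip_attribute(html, name):
    # Matches whitespace, the attribute name, "=", and a value in
    # double or single quotes.
    pattern = re.compile(
        r'\s+' + re.escape(name) + r'\s*=\s*("[^"]*"|\'[^\']*\')'
    )
    return pattern.sub("", html)

tag_result = strip_attribute(
    '<p class="author" style="color:red;">Text</p>', "style"
)

# Caveat: text that merely looks like the attribute is stripped as well.
text_result = strip_attribute('<p>use style="x" here</p>', "style")
```

In the second call the visible text loses its style="x" fragment, which is why a parser-based approach is usually the safer choice for arbitrary HTML.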

Accepted answer by jmk · 2012-10-18 16:41:09Z

