So, I wrote a minimal script to scrape all the text from a webpage:
import requests
from bs4 import BeautifulSoup

url = 'http://www.brainpickings.org'
request = requests.get(url)
soup_data = BeautifulSoup(request.content)
texts = soup_data.findAll(text=True)

def visible(element):
    # Skip text nodes that live inside non-rendered containers.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    return True

print filter(visible, texts)
But it doesn't work that smoothly: there are still unnecessary tags in the output. Also, if I try to do a regex removal of various characters that I don't want, I get an error:

elif re.match('<!--.*-->', str(element)):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 209: ordinal not in range(128)
So, how can I improve this further?
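For reference, here is a minimal sketch of one way around both issues, assuming BeautifulSoup 4 (bs3 exposes the same Comment class from its own package). The parser turns HTML comments into a dedicated Comment type, so a type check can replace the regex and never calls str() on unicode text, which is what raises the UnicodeEncodeError above:

from bs4 import Comment

def visible(element):
    # Text inside non-rendered containers is never visible.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # HTML comments are parsed as Comment instances; an isinstance check
    # avoids str(element) and the ascii codec error it triggers.
    if isinstance(element, Comment):
        return False
    return True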
asked Jun 5, 2012 at 8:52 by Hick
- English meaning nazi: "scrapping" is the process of turning a car into scrap metal; to discard or remove from service. You probably meant "scraping, to scrape", to remove an outer layer with a tool. Corrected your post for you :-) – Martijn Pieters, Jun 5, 2012 at 9:06
- Do not use regex for HTML parsing; see why. – schlamar, Jun 5, 2012 at 9:14
- Use splinter zope. Easy to use. – Priyank Patel, Jun 5, 2012 at 9:19
1 Answer
With lxml this is pretty easy:
from lxml import html

# content is the raw HTML, e.g. request.content from the question
doc = html.fromstring(content)
print doc.text_content()
Edit: Filtering the head could be done as follows:
print doc.body.text_content()
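One caveat: text_content() also returns whatever sits inside script and style elements. A sketch that drops those subtrees first with lxml's drop_tree(), reusing the requests call and URL from the question:

import requests
from lxml import html

response = requests.get('http://www.brainpickings.org')
doc = html.fromstring(response.content)

# Remove script/style subtrees so their contents don't end up in the text.
for element in doc.xpath('//script | //style'):
    element.drop_tree()

print doc.body.text_content()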
answered Jun 5, 2012 at 9:31 by schlamar