So, I wrote a minimal script to scrape all the text from a webpage:
import requests
from bs4 import BeautifulSoup

url = 'http://www.brainpickings.org'
request = requests.get(url)
soup_data = BeautifulSoup(request.content)
texts = soup_data.findAll(text=True)

def visible(element):
    # Skip text nodes that live inside non-rendered containers.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    return True

print filter(visible, texts)
But it doesn't work that smoothly: there are still unnecessary tags in the output. Also, if I try to do a regex removal of various characters that I don't want, I get an error:

elif re.match('<!--.*-->', str(element)):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 209: ordinal not in range(128)
So, how can I improve this further?
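For reference, here is a minimal sketch of one way around both issues, assuming BeautifulSoup 4 (bs3 exposes the same Comment class from its own package). The parser turns HTML comments into a dedicated Comment type, so a type check can replace the regex and never calls str() on unicode text, which is what raises the UnicodeEncodeError above:

from bs4 import Comment

def visible(element):
    # Text inside non-rendered containers is never visible.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # HTML comments are parsed as Comment instances; an isinstance check
    # avoids str(element) and the ascii codec error it triggers.
    if isinstance(element, Comment):
        return False
    return True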
asked Jun 5, 2012 at 8:52 by Hick
- English meaning nazi: "scrapping" is the process of turning a car into scrap metal; to discard or remove from service. You probably meant "scraping, to scrape", to remove an outer layer with a tool. Corrected your post for you :-) – Martijn Pieters, Jun 5, 2012 at 9:06
- Do not use regex for HTML parsing; see why. – schlamar, Jun 5, 2012 at 9:14
- Use splinter zope. Easy to use. – Priyank Patel, Jun 5, 2012 at 9:19
1 Answer
With lxml this is pretty easy:
from lxml import html

# content is the raw HTML, e.g. request.content from the question
doc = html.fromstring(content)
print doc.text_content()
Edit: Filtering the head could be done as follows:
print doc.body.text_content()
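One caveat: text_content() also returns whatever sits inside script and style elements. A sketch that drops those subtrees first with lxml's drop_tree(), reusing the requests call and URL from the question:

import requests
from lxml import html

response = requests.get('http://www.brainpickings.org')
doc = html.fromstring(response.content)

# Remove script/style subtrees so their contents don't end up in the text.
for element in doc.xpath('//script | //style'):
    element.drop_tree()

print doc.body.text_content()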
answered Jun 5, 2012 at 9:31 by schlamar