
I am trying to fetch text data from a website, but this code raises an error. Where is the mistake?

import requests
from bs4 import BeautifulSoup

def getportions(soup):
    for p in soup.find_all("p", {"class": ""}):
        yield p.text

def readpage(address):
    page = requests.get(address)
    soup = BeautifulSoup(page.text, "html.parser")
    output_text = ''
    for s in getportions(soup):
        output_text += s.encode("utf8")
        output_text += "\n"
    print (output_text)
    print ("End of article")
    fp = open("content.txt", "w")
    fp.write(output_text)

if __name__ == "__main__":
    readpage("http://yahoo.com")

The error is shown below:

output_text += s.encode("utf8")
TypeError: Can't convert 'bytes' object to str implicitly
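
The failure can be reproduced without the network request. The following is a minimal sketch, assuming Python 3; the sample string is hypothetical and stands in for one paragraph of the page. It shows that `str.encode` returns `bytes` and that appending `bytes` to a `str` raises `TypeError` (the exact wording of the message varies between Python 3 versions).

```python
# Hypothetical sample text standing in for one paragraph of the page.
paragraph = "Sigma: \u03a3"

encoded = paragraph.encode("utf8")  # str.encode returns a bytes object

output_text = ""
try:
    output_text += encoded          # str += bytes is not allowed in Python 3
    failed = False
except TypeError as exc:
    failed = True
    print("TypeError:", exc)

output_text += paragraph            # appending the str itself works
```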

Morgan Thrapp
asked Nov 4, 2016 at 14:52
  • .encode returns a bytes object. What are you trying to do? Commented Nov 4, 2016 at 14:54
  • @MorganThrapp I am trying to write the contents to a file. Commented Nov 4, 2016 at 14:55
  • Do you maybe mean decode? Why do you think you need to do anything with utf-8? Commented Nov 4, 2016 at 14:56
  • @MorganThrapp If I convert the object to a string, it contains unnecessary characters. Commented Nov 4, 2016 at 14:56

1 Answer


In Python 3, every str is a Unicode string, so you do not need to encode the text yourself; you can specify the encoding when opening the file instead. Your code could become:

def readpage(address):
    ...
    output_text = ''
    for s in getportions(soup):
        output_text += s
        output_text += "\n"
    print (output_text)
    print ("End of article")
    # The file object encodes the text as UTF-8 on write; "with" closes it
    with open("content.txt", "w", encoding='utf8') as fp:
        fp.write(output_text)
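
To check that the file-level encoding is enough, here is a small round-trip sketch. The helper names `save_text` and `load_text` and the sample string are assumptions made for illustration, and a temporary directory is used so that no existing content.txt is overwritten.

```python
import os
import tempfile

def save_text(path, text):
    # The file object encodes str to UTF-8 bytes when writing.
    with open(path, "w", encoding="utf8") as fp:
        fp.write(text)

def load_text(path):
    # Reading with the same encoding decodes the bytes back to str.
    with open(path, "r", encoding="utf8") as fp:
        return fp.read()

# Hypothetical sample containing a non-ASCII character (Greek capital sigma).
sample = "Sigma: \u03a3\n"
path = os.path.join(tempfile.mkdtemp(), "content.txt")

save_text(path, sample)
round_trip = load_text(path)
```

No call to `.encode()` appears anywhere: the text stays a `str` until the file object writes it.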

If you simply want to sanitize the text by replacing every non-ASCII character with a ?, open the file this way:

 fp = open("content.txt", "w", encoding='ascii', errors='replace')
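
The effect of errors='replace' can be previewed in memory, because `str.encode` accepts the same `errors` argument. This is a sketch only; the helper name `to_ascii` and the sample string are hypothetical.

```python
def to_ascii(text):
    # Characters that ASCII cannot represent are replaced with "?".
    return text.encode("ascii", errors="replace").decode("ascii")

# Hypothetical sample: e with acute accent and Greek capital sigma.
sample = "caf\u00e9 \u03a3"
print(to_ascii(sample))
```

Note that the replacement is lossy: the original characters cannot be recovered from the file afterwards.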
answered Nov 4, 2016 at 14:59

3 Comments

It shows an error again: return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u03a3' in position 350: character maps to <undefined>
@NARAYANCHANGDER: Cannot reproduce. Show the code that produces the error and the stacktrace. Utf8 is meant to be able to encode any unicode character...
@NARAYANCHANGDER: ... and I can confirm that I could successfully process u03a3 (Σ)
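
One possible explanation, not confirmed in this thread: the 'charmap' wording in the traceback is what Python's single-byte code page codecs (for example cp1252) report, which would suggest that UTF-8 was not the codec in use at the failing line. That can happen if the file is opened without encoding='utf8', or if print writes to a console that uses such a code page. The sketch below assumes cp1252 purely for illustration; the helper `can_encode` is hypothetical.

```python
sigma = "\u03a3"  # the character named in the error message

def can_encode(text, encoding):
    # True if every character in text is representable in the given codec.
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print("utf8:  ", can_encode(sigma, "utf8"))
print("cp1252:", can_encode(sigma, "cp1252"))
```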
