I am trying to fetch text data from a website, but this code raises an error. Could you tell me where the error is?
import requests
from bs4 import BeautifulSoup

def getportions(soup):
    for p in soup.find_all("p", {"class": ""}):
        yield p.text

def readpage(address):
    page = requests.get(address)
    soup = BeautifulSoup(page.text, "html.parser")
    output_text = ''
    for s in getportions(soup):
        output_text += s.encode("utf8")
        output_text += "\n"
    print (output_text)
    print ("End of article")
    fp = open("content.txt", "w")
    fp.write(output_text)

if __name__ == "__main__":
    readpage("http://yahoo.com")
The error is shown below:
output_text += s.encode("utf8")
TypeError: Can't convert 'bytes' object to str implicitly
Morgan Thrapp
asked Nov 4, 2016 at 14:52
NARAYAN CHANGDER
1 Answer
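If you use Python 3, all strings are natively Unicode, and you can specify the encoding when opening the file. There is no need to call .encode() yourself: it returns a bytes object, which cannot be appended to a str. Your code could become: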
def readpage(address):
    ...
    output_text = ''
    for s in getportions(soup):
        output_text += s
        output_text += "\n"
    print (output_text)
    print ("End of article")
    fp = open("content.txt", "w", encoding='utf8')
    fp.write(output_text)
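To illustrate the point about choosing the encoding when the file is opened, here is a minimal self-contained sketch. The helper name `save_text`, the sample string and the temporary file location are assumptions made for illustration; they are not part of the original code.

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path as UTF-8; the file object does the encoding."""
    # With encoding= given, str objects are written directly,
    # so no manual .encode() call is needed.
    with open(path, "w", encoding="utf8") as fp:
        fp.write(text)


# Hypothetical sample containing a non-ASCII character (Greek capital sigma).
sample = "Sum: \u03a3\n"
path = os.path.join(tempfile.mkdtemp(), "content.txt")
save_text(path, sample)

# Reading the file back with the same encoding returns the original text.
with open(path, encoding="utf8") as fp:
    round_trip = fp.read()
```

Using a `with` block here also closes the file as soon as the write finishes, which the shorter version above leaves to the interpreter.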
If you simply want to sanitize the text by replacing every non-ASCII character with a ?, open the file this way:
fp = open("content.txt", "w", encoding='ascii', errors='replace')
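The effect of errors='replace' can be previewed without touching a file, because `str.encode` accepts the same error handler names as `open`. This is only a sketch, and the sample string is an assumption:

```python
# Hypothetical sample: "café Σ" contains two non-ASCII characters.
text = "caf\u00e9 \u03a3"

# errors="replace" substitutes "?" for each character ASCII cannot represent.
sanitized = text.encode("ascii", errors="replace").decode("ascii")
print(sanitized)  # caf? ?
```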
answered Nov 4, 2016 at 14:59
Serge Ballesta
3 Comments
NARAYAN CHANGDER
It shows an error again:
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u03a3' in position 350: character maps to <undefined>
Serge Ballesta
@NARAYANCHANGDER: Cannot reproduce. Show the code that produces the error and the stack trace. UTF-8 is meant to be able to encode any Unicode character...
Serge Ballesta
@NARAYANCHANGDER: ... and I can confirm that I could successfully process \u03a3 (Σ).
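One possible explanation for the 'charmap' error, offered as an assumption because the failing code was never shown: the text was being encoded with a legacy single-byte code page (for example by print on a Windows console, or by open without encoding=) and not with UTF-8. The sketch below uses cp1252 as a stand-in for such a code page.

```python
sigma = "\u03a3"  # Greek capital letter sigma

# UTF-8 has an encoding for sigma (two bytes).
utf8_bytes = sigma.encode("utf8")

# cp1252 is a charmap-based codec with no entry for sigma, so encoding fails.
try:
    sigma.encode("cp1252")
    charmap_failed = False
except UnicodeEncodeError as exc:
    charmap_failed = True
    print(exc)
```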
.encode returns a bytes object. What are you trying to do? decode? Why do you think you need to do anything with utf-8?
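A short sketch of the str/bytes distinction this comment refers to, using a made-up sample string. The exact wording of the TypeError differs between Python 3 versions, so the code only records it and does not rely on it.

```python
text = "caf\u00e9"  # hypothetical sample string

# .encode() turns a str into bytes.
encoded = text.encode("utf8")

# In Python 3, bytes cannot be appended to a str.
error_message = None
try:
    combined = "" + encoded
except TypeError as exc:
    error_message = str(exc)

# .decode() goes the other way and recovers the original str.
decoded = encoded.decode("utf8")
```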