I'm trying to retrieve the charset from a web page (the page will change all the time). At the moment I'm using BeautifulSoup to parse the page and then extract the charset from the header. This was working fine until I ran into a site that had:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

My code up until now, which was working with other pages, is:

    def get_encoding(soup):
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                encod = soup.meta.get('content')
        return encod

Does anyone have a good idea for how to extend this code so it retrieves the charset from the example above? Would tokenizing the value and pulling the charset out that way work, and how would you do it without rewriting the whole function? Right now the code returns "text/html; charset=utf-8", which causes a LookupError because that string is not a known encoding.

Thanks

The final code that I ended up using:

    import re
    import chardet

    def get_encoding(soup):
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                content = soup.meta.get('content')
                match = re.search(r'charset=(.*)', content)
                if match:
                    encod = match.group(1)
                else:
                    # No charset in the meta tag: fall back to chardet's guess.
                    # unicode() is the Python 2 built-in.
                    dic_of_possible_encodings = chardet.detect(unicode(soup))
                    encod = dic_of_possible_encodings['encoding']
        return encod
asked Aug 21, 2013 at 13:39
  • I have used chardet but I wanted to be 100% accurate and so want to try and grab the encoding from the page itself. Commented Aug 21, 2013 at 13:41

2 Answers

    import re

    def get_encoding(soup):
        if soup and soup.meta:
            # HTML5 style: <meta charset="...">
            encod = soup.meta.get('charset')
            if encod is None:
                encod = soup.meta.get('content-type')
                if encod is None:
                    # HTML4 style: the charset sits inside the content attribute.
                    # "or ''" avoids a TypeError when the attribute is missing.
                    content = soup.meta.get('content') or ''
                    match = re.search(r'charset=(.*)', content)
                    if match:
                        encod = match.group(1)
                    else:
                        raise ValueError('unable to find encoding')
        else:
            raise ValueError('unable to find encoding')
        return encod
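
One caveat about the pattern `charset=(.*)`: it captures everything after `charset=`, so a value with quotes or trailing parameters would come back with extra characters and could still trigger the LookupError from the question. Below is a minimal sketch of a tighter extraction that also checks the result with `codecs.lookup`. The helper name `charset_from_content` and the exact character class are my own assumptions, not part of the answer above.

```python
import codecs
import re

# Assumed pattern: optional whitespace and an optional quote after "charset=",
# then a run of characters that commonly appear in encoding names.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def charset_from_content(content):
    """Hypothetical helper: return a codec name Python knows, or None."""
    if not content:
        return None
    match = CHARSET_RE.search(content)
    if not match:
        return None
    try:
        # codecs.lookup raises LookupError for names Python cannot resolve.
        return codecs.lookup(match.group(1)).name
    except LookupError:
        return None


print(charset_from_content("text/html; charset=utf-8"))
```

Returning None instead of raising leaves room for a fallback such as chardet, which is what the asker ended up doing.
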
林果皞
answered Aug 21, 2013 at 13:48
1 Comment

Brilliant. Thank you. Really need to learn myself some regex.

In my case, soup.meta only returns the first meta tag found in the soup. Here is @Fruit's answer, extended to find the charset in any meta tag within the given HTML.

    from bs4 import BeautifulSoup
    import re

    def get_encoding(soup):
        encoding = None
        if soup:
            # Check every meta tag, not just the first one.
            for meta_tag in soup.find_all("meta"):
                encoding = meta_tag.get('charset')
                if encoding:
                    break
                encoding = meta_tag.get('content-type')
                if encoding:
                    break
                content = meta_tag.get('content')
                if content:
                    match = re.search(r'charset=(.*)', content)
                    if match:
                        encoding = match.group(1)
                        break
        if encoding:
            # cast to str if type(encoding) == bs4.element.ContentMetaAttributeValue
            return str(encoding).lower()

    # Sample markup taken from the question; replace with the downloaded page.
    html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    soup = BeautifulSoup(html, "html.parser")
    print(get_encoding(soup))
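
If BeautifulSoup is not available, the same idea of scanning every meta tag can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not what the answer above does, and the names `MetaCharsetParser` and `find_meta_charset` are hypothetical.

```python
import re
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared by any <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Skip non-meta tags, and stop looking once a charset was found.
        if tag != "meta" or self.charset:
            return
        attrs = dict(attrs)  # attribute names arrive lowercased
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip().lower()
            return
        # Fall back to "...; charset=xyz" inside the content attribute.
        match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)",
                          attrs.get("content") or "", re.IGNORECASE)
        if match:
            self.charset = match.group(1).lower()


def find_meta_charset(html_text):
    """Hypothetical helper: return the declared charset, or None."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset
```

Called on a page whose first meta tag is unrelated (a viewport tag, say), it keeps going until it reaches a tag that actually declares a charset.
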
answered Apr 20, 2021 at 12:22
