Python 2.7 - Pandas UnicodeEncodeError with data from pyodbc

Question 1

I'm trying to pull data from SQL Server using pyodbc and load it into a dataframe, then export it to an HTML file, except I keep receiving the following Unicode error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 15500: ordinal not in range(128)

Here is my current setup (encoding instructions per docs):

import pyodbc
import pandas as pd

cnxn = pyodbc.connect('DSN=Planning;UID=USER;PWD=PASSWORD;')
# Encoding/decoding settings copied from the pyodbc docs
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='cp1252', to=unicode)
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='cp1252', to=unicode)
cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='cp1252', to=unicode)
cnxn.setencoding(str, encoding='utf-8')
cnxn.setencoding(unicode, encoding='utf-8')
cursor = cnxn.cursor()
# Read the query text from a file and run it
with open('Initial Dataset.sql') as f:
    initial_query = f.read()
cursor.execute(initial_query)
# Column names come from the cursor metadata
columns = [column[0] for column in cursor.description]
initial_data = cursor.fetchall()
i_df = pd.DataFrame.from_records(initial_data, columns=columns)
i_df.to_html('initial.html')  # this call raises the UnicodeEncodeError

An odd but useful point to note is that when I try to export a CSV:

i_df.to_csv('initial.csv')

I get the same error. However, when I add an explicit encoding:

i_df.to_csv('initial.csv', encoding='utf-8')

it works. Can someone help me understand this encoding issue?

Side note: I've also tried using a sqlalchemy connection and pandas.read_sql() and the same error persists.

Question 2

The error means you are trying to encode a Unicode character that cannot be represented in ASCII. I'm just guessing, but the data frame returned by pandas is probably encoded in UTF-8. I suspect the to=unicode is wrong, but that's just a shot in the dark.
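The failure described above can be reproduced without pyodbc or pandas. The following is a minimal sketch in plain Python; the sample string is made up for illustration, and only the character U+2019 comes from the traceback in the question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical sample value; U+2019 is the RIGHT SINGLE QUOTATION MARK
# named in the traceback.
text = u"it\u2019s"

try:
    text.encode("ascii")  # ASCII only covers code points 0-127
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ascii failed:", exc.reason)

# UTF-8 can represent the character, so this succeeds.
utf8_bytes = text.encode("utf-8")
print(repr(utf8_bytes))
```

The same text encodes cleanly once a codec that covers U+2019 is chosen, which matches the observation that to_csv works with encoding='utf-8'.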

Question 3

I understand what the error means; I just don't understand why it's occurring. The dataframe is UTF-8 encoded. The docs for pandas.to_html are rather scant. Why would it try to convert to ASCII when generating the HTML?

Question 4

I'm not sure, but I would check the pandas.to_html source code to see what's happening there (maybe the encoding defaults to ASCII, I don't know).

Question 5

You shouldn't need any setencoding/setdecoding calls at all when working with SQL Server, especially not encoding to UTF-8, which SQL Server ODBC does not use (it uses UTF-16, and that is the default encoding for pyodbc).
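To illustrate why a mismatched decoding setting is risky, here is a small sketch in plain Python that does not touch pyodbc at all. It assumes the bytes on the wire are UTF-16 little-endian (the byte order is an assumption made for this example) and decodes them with cp1252, as the setdecoding calls in the question would request for wide characters.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

original = u"it\u2019s"  # hypothetical sample value

# Assumption for this sketch: wide-character data arrives as UTF-16-LE bytes.
wire_bytes = original.encode("utf-16-le")

# Decoding with the wrong codec does not raise here; it silently
# produces different text (one character per byte).
wrong = wire_bytes.decode("cp1252")

# Decoding with the matching codec recovers the original text.
right = wire_bytes.decode("utf-16-le")

print(repr(wrong))
print(repr(right))
```

Whether pyodbc actually applies the setting this way depends on the driver and column type, so treat this only as a picture of what a codec mismatch does to text.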

Question 6

From here: "SQL Server's recent drivers match the specification, so no configuration is necessary. Using the pyodbc defaults is recommended."

Question 7

The second answer on this question seems to be an acceptable workaround, except that Python 2.x users must use io.open to get an encoding argument:

import io
html = df.to_html()
# io.open accepts an encoding argument on Python 2; the built-in open does not
with io.open("mypage.html", "w", encoding="utf-8") as f:
    f.write(html)

An encoding option was not included in the latest release, but it looks like the next version of pandas will have one for to_html(); see docs (line 2228).

Question 8

Yes, that's correct. The encoding should be applied to the output file, not the communications between pyodbc and the SQL Server.

Question 9

The problem ultimately lies in pandas, as to_html() seems to enforce ASCII encoding. It appears they will be fixing that issue in an upcoming release.

Question 10

"to_html() seems to enforce ASCII encoding" - No, more likely to_html uses the default encoding for the file when you only pass it a string (file path) for buf=, and the default string encoding for Python 2 is ASCII.

Question 11

@GordThompson okay, but with Python 2 and no way to tell the function to use a different encoding, that is practically the same thing, no?

Question 12

The way to "tell the function" is to pass it a buf argument that is a StringIO-like object instead of just a (string) path.
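The pattern suggested here can be sketched as follows. To keep the example runnable without pandas, write_table is a hypothetical stand-in for any function that, like to_html, accepts a file-like buf; the rows and the temporary path are also made up for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_table(rows, buf):
    """Hypothetical stand-in for a writer that takes a file-like buf."""
    buf.write(u"<table>\n")
    for row in rows:
        cells = u"".join(u"<td>{0}</td>".format(cell) for cell in row)
        buf.write(u"  <tr>{0}</tr>\n".format(cells))
    buf.write(u"</table>\n")


rows = [[u"it\u2019s", u"ok"]]  # sample data containing U+2019
path = os.path.join(tempfile.mkdtemp(), "mypage.html")

# The caller opens the file, so the caller chooses the encoding.
with io.open(path, "w", encoding="utf-8") as f:
    write_table(rows, f)

# Read the file back to confirm the character survived.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

With pandas, the analogous call would presumably be i_df.to_html(buf=f) inside the same with block, although that has not been run against the data in the question.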

Jon Behnken · Accepted Answer · 2019-11-01 15:27:15Z



