
I am trying to import multiple CSVs into a MySQL database at one time using pandas to_sql. After creating the engine, I am running the following:

folder_path = (file_path)
os.chdir(folder_path)
for file in os.listdir(folder_path):
    if '.csv' in file:
        df = pd.read_csv(file, low_memory=False)
        table_name = str(file.strip('.csv'))
        df.to_sql(table_name, con=engine, if_exists='replace')
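(As an aside, unrelated to the error: `str.strip('.csv')` treats its argument as a set of characters and removes any leading or trailing `.`, `c`, `s`, or `v`, which can mangle table names. A safer sketch, using `os.path.splitext` instead:)

```python
import os

def table_name_from(filename):
    # str.strip('.csv') strips any combination of '.', 'c', 's', 'v'
    # from both ends of the string, so it can eat characters that are
    # part of the real name. splitext removes only the extension.
    return os.path.splitext(os.path.basename(filename))[0]
```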

However, when I run the code, I get the following error: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-7: character maps to "

Even when I try using the import wizard to upload the specific table the error appears on, it only imports 50 of the 42,000 records.

Any help is appreciated!

asked Jul 8, 2021 at 10:52
  • Can you share a sample of your csv data? My assumption, without much information, is that it could be related to the way your CSV data is in the file. Check lines 49, 50 and 51 of the CSV Commented Jul 8, 2021 at 11:03
  • The data is from this kaggle dataset: kaggle.com/mrmorj/dataset-of-songs-in-spotify . The error is appearing from the first file, genres-v2. There are definitely lines which I see that don't contain UTF-8, however they aren't around 50. Any advice on how to quickly delete all rows which contain non-utf-8 characters, before import? Commented Jul 8, 2021 at 11:56
  • Always specify the encoding (e.g. in read_csv). Do not trust Python will find it for you (and unfortunately WIndows still use unpredictable defaults) Commented Jul 8, 2021 at 11:59
  • @shuaf98 which file are you using? You need to give me a bit more than that for me to be able to help :) Commented Jul 8, 2021 at 11:59
  • Rui, the file is the genres_v2 file that is on the kaggle link I posted. Giacoma, the encoding is specified currently as UTF-8. Is there a different encoding that would work instead? Commented Jul 8, 2021 at 12:24

1 Answer


I am not sure if this is the "correct" way of doing it, but I found a regex which keeps only ASCII characters (the range \x00-\x7F), removing the rest, for each field in the dataframe:

df.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)

Ideally, though, I would like to keep the non-ASCII characters, if there are any other solutions.
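One way to keep those characters is to address the connection rather than the data: the UnicodeEncodeError happens when the rows are encoded for MySQL, so requesting a full-Unicode charset on the engine URL often avoids it. A sketch (the credentials and database name are placeholders; the relevant part is `charset=utf8mb4`):

```python
from sqlalchemy import create_engine

# Placeholder connection details. charset=utf8mb4 asks the MySQL
# driver to transmit the full Unicode range instead of a narrower
# default codec that fails on non-Latin characters.
engine = create_engine(
    'mysql+pymysql://user:password@localhost/mydb?charset=utf8mb4'
)
```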

answered Jul 8, 2021 at 16:52