
I am trying to import multiple CSVs into a MySQL database at one time using pandas to_sql. After creating the engine, I am running the following:

folder_path = (file_path)
os.chdir(folder_path)
for file in os.listdir(folder_path):
    if '.csv' in file:
        df = pd.read_csv(file, low_memory=False)
        table_name = str(file.strip('.csv'))
        df.to_sql(table_name, con=engine, if_exists='replace')
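(As an aside, unrelated to the error: `str.strip('.csv')` treats its argument as a set of characters and removes any leading or trailing `.`, `c`, `s`, or `v`, which can mangle table names. A safer sketch, using `os.path.splitext` instead:)

```python
import os

def table_name_from(filename):
    # str.strip('.csv') strips any combination of '.', 'c', 's', 'v'
    # from both ends of the string, so it can eat characters that are
    # part of the real name. splitext removes only the extension.
    return os.path.splitext(os.path.basename(filename))[0]
```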

However, when I run the code, I get the following error: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-7: character maps to "

Even when I try using the import wizard to upload the specific table the error appears on, it only imports 50 of the 42,000 records.

Any help is appreciated!

asked Jul 8, 2021 at 10:52
  • Can you share a sample of your csv data? My assumption, without much information, is that it could be related to the way your CSV data is in the file. Check lines 49, 50 and 51 of the CSV Commented Jul 8, 2021 at 11:03
  • The data is from this kaggle dataset: kaggle.com/mrmorj/dataset-of-songs-in-spotify . The error is appearing from the first file, genres-v2. There are definitely lines which I see that don't contain UTF-8, however they aren't around 50. Any advice on how to quickly delete all rows which contain non-utf-8 characters, before import? Commented Jul 8, 2021 at 11:56
  • Always specify the encoding (e.g. in read_csv). Do not trust Python will find it for you (and unfortunately WIndows still use unpredictable defaults) Commented Jul 8, 2021 at 11:59
  • @shuaf98 which file are you using? You need to give me a bit more than that for me to be able to help :) Commented Jul 8, 2021 at 11:59
  • Rui, the file is the genres_v2 file that is on the kaggle link I posted. Giacoma, the encoding is specified currently as UTF-8. Is there a different encoding that would work instead? Commented Jul 8, 2021 at 12:24

1 Answer


I am not sure if this is the "correct" way of doing it, but I found a regex which keeps only ASCII characters (the range \x00-\x7F), removing the rest, for each field in the dataframe:

df.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)

Ideally, though, I would like to keep the non-ASCII characters, if there are any other solutions.
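One way to keep those characters is to address the connection rather than the data: the UnicodeEncodeError happens when the rows are encoded for MySQL, so requesting a full-Unicode charset on the engine URL often avoids it. A sketch (the credentials and database name are placeholders; the relevant part is `charset=utf8mb4`):

```python
from sqlalchemy import create_engine

# Placeholder connection details. charset=utf8mb4 asks the MySQL
# driver to transmit the full Unicode range instead of a narrower
# default codec that fails on non-Latin characters.
engine = create_engine(
    'mysql+pymysql://user:password@localhost/mydb?charset=utf8mb4'
)
```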

answered Jul 8, 2021 at 16:52