I have two folders, 'good' and 'bad', each containing words in various .txt files. I want to write a Python script that imports all the data into a dataframe with an 'Id' column, a 'word' column and a 'label' column. The label will be either 'good' or 'bad', based on the folder name.

I have written the following Python script, but I seem to be having issues with the file encoding. I installed the 'chardet' library to detect each file's encoding, but I still get this error:

UnicodeDecodeError: 'cp949' codec can't decode byte 0xb7 in position 1400: illegal multibyte sequence
import os

import chardet
import pandas as pd

good_path = "myfolder/good"
bad_path = "myfolder/bad"
ids = []
words = []
labels = []
for filename in os.listdir(good_path):
    # Read the raw bytes first so chardet can guess the encoding
    with open(os.path.join(good_path, filename), "rb") as f:
        result = chardet.detect(f.read())
    encoding = result["encoding"]
    # Re-open in text mode using the guessed encoding
    with open(os.path.join(good_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("good")
for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "rb") as f:
        result = chardet.detect(f.read())
    encoding = result["encoding"]
    with open(os.path.join(bad_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("bad")
# Create a dataframe from the lists
df = pd.DataFrame({"Id": ids, "words": words, "label": labels})
asked Mar 25, 2023 at 4:36
  • Where have these text files come from? Why isn't the encoding known beforehand? Commented Mar 25, 2023 at 5:41

2 Answers

You can try setting the encoding to UTF-8 directly, which Python 3 fully supports:


for filename in os.listdir(good_path):
    with open(os.path.join(good_path, filename), "r", encoding="utf-8") as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("good")
for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "r", encoding="utf-8") as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("bad")
answered Mar 25, 2023 at 5:09

3 Comments

Thanks @chrisfang, I did, but I got this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 1054: invalid start byte
@highclef You get the encoding from chardet, but its guess is not necessarily correct. The detect logic creates a UniversalDetector object and feeds the data to it. 'feed' first checks whether the start of the data matches one of the common UTF family markers; if not, the data is matched against other encodings. The resulting encoding type may also be empty.
@highclef You can pass errors='ignore' if you just want the program to keep running and read the files. Of course, this is not the best way. The best way is to determine where your .txt files come from and which encoding they were originally saved in.
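
The trade-off raised in these comments can be sketched roughly as follows: try strict decoding with a few candidate encodings first, and only fall back to a lossy decode when none of them fits. This is a minimal sketch using only the standard library. The helper name `read_text_with_fallback` and the candidate list are assumptions for illustration and are not part of the original code.

```python
from pathlib import Path


def read_text_with_fallback(path, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used) for the file at `path`.

    Hypothetical helper: the candidate encodings are an assumption and
    should be replaced by whatever the data source is likely to use.
    """
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid bytes.
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes with U+FFFD
    # instead of silently dropping them as errors="ignore" would.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Returning the encoding alongside the text makes it possible to see afterwards which files needed the lossy fallback, which may help when tracking down how the files were originally saved.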
Thank you all. I was able to filter out the text files with the wrong character encoding and exclude them, using a try-except block to catch the UnicodeDecodeError exception.
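
The answer does not include its code, so the following is only a sketch of what such a try-except filter might look like. The function name `load_labeled_words`, the UTF-8 default, and the `.txt` filter are assumptions; the folder layout (one sub-folder per label) follows the question.

```python
import os


def load_labeled_words(root, labels=("good", "bad"), encoding="utf-8"):
    """Read every .txt file under root/<label> and return (rows, skipped).

    Hypothetical helper: UTF-8 is an assumed encoding, not something
    the original answer states.
    """
    rows, skipped = [], []
    for label in labels:
        folder = os.path.join(root, label)
        for filename in sorted(os.listdir(folder)):
            if not filename.endswith(".txt"):
                continue
            path = os.path.join(folder, filename)
            try:
                with open(path, "r", encoding=encoding) as f:
                    content = f.read()
            except UnicodeDecodeError:
                # Wrong encoding: record the file and move on.
                skipped.append(path)
                continue
            rows.append({"Id": filename, "words": content, "label": label})
    return rows, skipped
```

Calling `rows, skipped = load_labeled_words("myfolder")` and then `pd.DataFrame(rows)` should give a dataframe with the 'Id', 'words' and 'label' columns, while `skipped` lists the files that were excluded.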

answered Mar 25, 2023 at 7:09
