I have two folders, 'good' and 'bad', each containing words in various .txt files. I want to write a Python script that imports all the data into a dataframe with an 'Id' column, a 'word' column and a 'label' column. The label will be either 'good' or 'bad', based on the folder name.
I have written the following Python script, but I seem to be having issues with file encodings. I installed the 'chardet' library to detect each file's encoding, but I still get this error:
UnicodeDecodeError: 'cp949' codec can't decode byte 0xb7 in position 1400: illegal multibyte sequence
import os

import chardet
import pandas as pd

good_path = "myfolder/good"
bad_path = "myfolder/bad"
ids = []
words = []
labels = []

for filename in os.listdir(good_path):
    # Detect the encoding from the raw bytes, then reopen the file as text
    with open(os.path.join(good_path, filename), "rb") as f:
        result = chardet.detect(f.read())
    encoding = result["encoding"]
    with open(os.path.join(good_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("good")

for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "rb") as f:
        result = chardet.detect(f.read())
    encoding = result["encoding"]
    with open(os.path.join(bad_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("bad")

# Create a dataframe from the lists
df = pd.DataFrame({"Id": ids, "words": words, "label": labels})
Where have these text files come from? Why isn't the encoding known beforehand? – GordonAitchJay, Mar 25, 2023 at 5:41
2 Answers
You can try setting the encoding to utf-8 directly. Python 3 fully supports it:
for filename in os.listdir(good_path):
    with open(os.path.join(good_path, filename), "r", encoding="utf-8") as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("good")

for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "r", encoding="utf-8") as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("bad")
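If you would rather keep the detection step than hard-code utf-8, note that chardet.detect returns a dict whose "encoding" value can be None, and open(..., encoding=None) then falls back to the platform default codec, which may be where the cp949 in the error comes from. A minimal sketch of a fallback, assuming utf-8 is a sensible default for your files; pick_encoding and min_confidence are hypothetical names introduced here, not part of chardet:

```python
def pick_encoding(result, default="utf-8", min_confidence=0.5):
    """Choose an encoding from a chardet.detect()-style result dict.

    Falls back to `default` when no encoding was detected or when the
    reported confidence is below `min_confidence` (an arbitrary cutoff).
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        return default
    return encoding


# Hand-written dicts shaped like chardet.detect() output:
print(pick_encoding({"encoding": None, "confidence": 0.0}))      # utf-8
print(pick_encoding({"encoding": "EUC-KR", "confidence": 0.99}))  # EUC-KR
```

In the question's loops this would be used as encoding = pick_encoding(result). It only avoids the silent fallback to the platform codec; a file that is not actually utf-8 will still raise UnicodeDecodeError.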
3 Comments
How detect works: it creates a UniversalDetector object and feeds it the data. feed first checks whether the header of the data matches one of the common UTF encodings; if not, the data is matched again against other encodings, and the resulting encoding may still be empty. You can pass errors='ignore' if you want to make sure the program keeps running and the files stay readable. Of course, this is not the best way. The best way is still to find out where your txt files come from and in which encoding they were originally saved.
Thank you all. I was able to filter out the text files with the wrong (character) encoding and exclude them with a try-except block that catches the UnicodeDecodeError exception.
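The try-except approach mentioned here could look roughly like the following. This is only a sketch, not the poster's actual code: it assumes the files are expected to be utf-8, and load_labeled_files and the skipped list are hypothetical names introduced for illustration.

```python
import os


def load_labeled_files(base_dir, labels=("good", "bad"), encoding="utf-8"):
    """Read every file under base_dir/<label>, skipping undecodable ones.

    Returns (rows, skipped): rows is a list of dicts with the keys
    "Id", "words" and "label"; skipped lists the paths that raised
    UnicodeDecodeError.
    """
    rows, skipped = [], []
    for label in labels:
        folder = os.path.join(base_dir, label)
        for filename in sorted(os.listdir(folder)):
            path = os.path.join(folder, filename)
            try:
                with open(path, "r", encoding=encoding) as f:
                    content = f.read()
            except UnicodeDecodeError:
                # Wrong encoding for this file: record it and move on.
                skipped.append(path)
                continue
            rows.append({"Id": filename, "words": content, "label": label})
    return rows, skipped
```

With pandas installed, rows, skipped = load_labeled_files("myfolder") followed by df = pd.DataFrame(rows, columns=["Id", "words", "label"]) gives the same columns as the original script. Keeping the skipped paths makes it easy to inspect the excluded files later. Passing errors="ignore" to open would instead keep every file but silently drop the bytes that cannot be decoded.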