Read content of several txt files into python

Question 1

I have two folders, each containing words in various .txt files. One folder is named 'good' and the other 'bad'. I want to write a Python script that imports all the data into a dataframe with an 'Id' column, a 'word' column and a 'label' column. The label will be either 'good' or 'bad', based on the folder name.

I have written the following Python script, but I seem to be having issues with the file encoding. I installed the 'chardet' library to detect each file's encoding, but I still get this error:

UnicodeDecodeError: 'cp949' codec can't decode byte 0xb7 in position 1400: illegal multibyte sequence

import os

import chardet
import pandas as pd

good_path = "myfolder/good"
bad_path = "myfolder/bad"
ids = []
words = []
labels = []
for filename in os.listdir(good_path):
    # Read the raw bytes first so chardet can guess the encoding
    with open(os.path.join(good_path, filename), "rb") as f:
        result = chardet.detect(f.read())
    encoding = result["encoding"]
    # Re-open in text mode with the guessed encoding
    with open(os.path.join(good_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("good")
for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "rb") as f:
        result = chardet.detect(f.read())
    encoding = result["encoding"]
    with open(os.path.join(bad_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("bad")
# Create a dataframe from the lists
df = pd.DataFrame({"Id": ids, "words": words, "label": labels})

Question 2

Where have these text files come from? Why isn't the encoding known beforehand?

Question 3

You can try setting the encoding to UTF-8 directly; Python 3 fully supports it.

# Reuses good_path, bad_path, ids, words and labels from the question
for filename in os.listdir(good_path):
    with open(os.path.join(good_path, filename), "r", encoding="utf-8") as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("good")
for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "r", encoding="utf-8") as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("bad")

Question 4

Thanks @chrisfang, I did, but I got this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 1054: invalid start byte

Question 5

@highclef You get the encoding from chardet, but its guess is not necessarily correct. Internally, detect creates a UniversalDetector object and feeds the data to it. feed first checks whether the start of the data matches one of the common UTF encodings; if not, the data is matched again against other encodings. Of course, the detected encoding may also be empty (None).
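
To illustrate that the detection result is only a guess, here is a minimal sketch of treating it defensively. It assumes the dictionary shape returned by chardet.detect (the 'encoding' and 'confidence' keys). The helper name pick_encoding, the 0.5 threshold and the UTF-8 fallback are hypothetical choices for illustration, not part of chardet.

```python
from typing import Optional


def pick_encoding(result: dict, fallback: str = "utf-8", min_confidence: float = 0.5) -> str:
    """Choose an encoding from a chardet-style detection result.

    `result` is assumed to look like what chardet.detect() returns,
    e.g. {"encoding": "EUC-KR", "confidence": 0.99, "language": "Korean"}.
    """
    encoding: Optional[str] = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # The encoding is None when the detector could not decide at all.
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# Hand-written sample results (not real chardet output).
print(pick_encoding({"encoding": None, "confidence": 0.0}))
print(pick_encoding({"encoding": "EUC-KR", "confidence": 0.99}))
print(pick_encoding({"encoding": "Windows-1252", "confidence": 0.3}))
```

With chardet installed, the call would be pick_encoding(chardet.detect(raw_bytes)). Even a confident guess can be wrong, so the following open() may still raise UnicodeDecodeError.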

Question 6

@highclef You can pass errors='ignore' to open() if you only want to make sure the program can read the files without raising an error. Of course, this is not the best way. The best way is to find out where your .txt files come from and which encoding they were originally saved in.
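
A small sketch of what that trade-off looks like in practice, using only the standard library. The sample bytes and the assumption that the file was really saved as cp1252 are made up for the demonstration; your files may use a different encoding.

```python
import os
import tempfile

# Bytes that are not valid UTF-8: 0xa9 is the copyright sign in cp1252,
# but it cannot start a UTF-8 sequence (the same kind of error as above).
raw = b"good word \xa9 2023"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # errors="ignore" silently drops the undecodable byte.
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        ignored = f.read()

    # errors="replace" keeps a visible U+FFFD marker instead.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        replaced = f.read()

    # Decoding with the encoding the sample was (by assumption) saved in
    # loses nothing.
    with open(path, "r", encoding="cp1252") as f:
        correct = f.read()

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Both error handlers let the script run, but only the correct encoding recovers the original text, which is why knowing the data source matters.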

Question 7

Thank you all. I was able to filter out the text files with the wrong character encoding, excluding them with a try-except block that catches the UnicodeDecodeError exception.
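
The exact code was not posted, so the following is only a sketch of how such a try-except filter might look. The function name load_words, the returned list of skipped paths and the choice of UTF-8 as the expected encoding are assumptions. It returns plain rows so that it runs without pandas.

```python
import os
import tempfile


def load_words(base_dir, labels=("good", "bad"), encoding="utf-8"):
    """Read every file under base_dir/<label>, skipping undecodable files.

    Returns (rows, skipped); rows can be passed to pandas.DataFrame(rows).
    """
    rows, skipped = [], []
    for label in labels:
        folder = os.path.join(base_dir, label)
        for filename in sorted(os.listdir(folder)):
            path = os.path.join(folder, filename)
            try:
                with open(path, "r", encoding=encoding) as f:
                    content = f.read()
            except UnicodeDecodeError:
                # The file is not valid in the assumed encoding: exclude it.
                skipped.append(path)
                continue
            rows.append({"Id": filename, "words": content, "label": label})
    return rows, skipped


# Tiny demonstration with throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as base:
    for label in ("good", "bad"):
        os.makedirs(os.path.join(base, label))
    with open(os.path.join(base, "good", "a.txt"), "w", encoding="utf-8") as f:
        f.write("kind")
    with open(os.path.join(base, "bad", "b.txt"), "wb") as f:
        f.write(b"rude \xa9")  # invalid UTF-8, will be skipped
    rows, skipped = load_words(base)

print(rows)
print([os.path.basename(p) for p in skipped])
```

Keeping the list of skipped paths makes it easy to check later how much data was excluded and whether those files need a different encoding.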

chrisfang (350 reputation, 1 silver badge, 6 bronze badges) · Answer 1 · 2023-03-25 05:09:51Z

highclef (189 reputation, 1 silver badge, 7 bronze badges) · Answer 2 · 2023-03-25 07:09:04Z

