My end goal is to reduce a text to plain words, reading from one text file and writing to a new one. However, the text is in French and uses accented Latin characters like é, ç or ù. My code works for ASCII letters, but it turns the accented ones into a space.
For example, it takes "Messieurs les Présidents," and changes it to "messieurs les pr sidents".
def convert():
    for i in files_names:
        f1 = open(f"speeches/{i}","r")
        L = f1.readlines()
        cleaned_text=" "
        for j in L:
            for k in j :
                if ord(k)>=65 and ord(k)<=90: #Changing to lower case
                    f=chr(ord(k)+32)
                    cleaned_text+=f
                elif (ord(k)>=97 and ord(k)<=122): #keeping lower case letters
                    cleaned_text+=k
                elif (ord(k)>=224 and ord(k)<=254): #keeping lower case latins
                    cleaned_text+=k
                    print(k)
                else:
                    if cleaned_text[-1]!=" ":
                        cleaned_text+=" "
        f1.close()
        f2 = open(f"./cleaned/{i}","w")
        for i in cleaned_text[1:]:
            f2.write(i)
        f2.close()
This is what my code looks like. I added a separate print statement in the Latin branch to show any characters that land there, but there are none.
- This has nothing to do with "latin characters", it has to do with Unicode. If you don't handle combining characters, Python won't do that for you. – Masklinn, Nov 18, 2023 at 10:53
- You will find the characters you want here: lookuptables.com/text/extended-ascii-table – darren, Nov 18, 2023 at 11:45
1 Answer
In the end the problem wasn't with Python itself, whose strings are Unicode. It was with open(): in text mode it doesn't decode files as UTF-8 automatically. It falls back to the platform's default encoding unless you pass encoding explicitly. Here's the fixed and cleaned-up code for anyone who's looking to have a little fun.
def convert():
    for i in files_names: #here the file names were extracted outside of the function
        f1 = open(f"............../{i}","r", encoding="utf-8") #enter the name of your directory
        start_text = f1.readlines()
        cleaned_text=" "
        for lines in start_text:
            for letters in lines.lower() :
                if letters in "abcdefghijklmnopqrstuvwxyzüéâäåçêëèïîìôöòûùÿáíóúñà":
                    cleaned_text+=letters
                else:
                    if cleaned_text[-1]!=" ": #only adding a space if there isn't one already
                        cleaned_text+=" "
        f1.close()
        f2 = open(f"./................./{i}","w", encoding='utf-8') #enter the name of your new directory
        f2.write(cleaned_text)
        f2.close()
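To see why the encoding argument matters, here is a small sketch that reproduces the symptom from the question without touching any files. It assumes the platform default was a legacy single-byte encoding such as cp1252 (common on Windows, but only a guess about the original machine). The helper clean_text and the constant ALLOWED are hypothetical names that mirror the filtering loop above.

```python
# Same set of letters the function above keeps.
ALLOWED = "abcdefghijklmnopqrstuvwxyzüéâäåçêëèïîìôöòûùÿáíóúñà"

def clean_text(text):
    """Lowercase text, keep allowed letters, collapse the rest into single spaces."""
    cleaned = " "
    for ch in text.lower():
        if ch in ALLOWED:
            cleaned += ch
        elif cleaned[-1] != " ":  # only add a space if there isn't one already
            cleaned += " "
    return cleaned.strip()

# The bytes as they would sit on disk in a UTF-8 file.
raw = "Messieurs les Présidents,".encode("utf-8")

# Assumption: the default encoding was cp1252. The two bytes of "é"
# then decode as two unrelated characters, neither of which is kept.
wrong_result = clean_text(raw.decode("cp1252"))

# Decoding as UTF-8 gives back a single "é", which is kept.
right_result = clean_text(raw.decode("utf-8"))

print(wrong_result)  # messieurs les pr sidents
print(right_result)  # messieurs les présidents
```

In UTF-8, "é" is stored as two bytes. Read as cp1252 they become "Ã" and "©", so the loop never sees an "é" at all and writes one space for the pair, which matches the "pr sidents" output in the question.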