My end goal is to reduce a text to just its words, reading from one text file and writing to a new one. However, the text is in French and uses accented Latin characters like é, ç or ù. My code works for ASCII characters but turns the accented ones into a space.

For example, it takes "Messieurs les Présidents," and changes it to "messieurs les pr sidents".

def convert():
    for i in files_names:
        f1 = open(f"speeches/{i}","r")
        L = f1.readlines()
        cleaned_text=" "
        for j in L:
            for k in j :
                if ord(k)>=65 and ord(k)<=90: #Changing to lower case
                    f=chr(ord(k)+32)
                    cleaned_text+=f
                elif (ord(k)>=97 and ord(k)<=122): #keeping lower case letters
                    cleaned_text+=k
                elif (ord(k)>=224 and ord(k)<=254): #keeping lower case latins
                    cleaned_text+=k
                    print(k)
                else:
                    if cleaned_text[-1]!=" ":
                        cleaned_text+=" "
        f1.close()
        f2 = open(f"./cleaned/{i}","w")
        for i in cleaned_text[1:]:
            f2.write(i)
        f2.close()

This is what my code looks like. I added a separate branch that prints any character falling in the accented Latin range, but nothing is ever printed.

asked Nov 18, 2023 at 10:44
  • This has nothing to do with "latin characters"; it has to do with Unicode. If you don't handle combining characters, Python won't do that for you. Commented Nov 18, 2023 at 10:53
  • You will find the characters you want here: lookuptables.com/text/extended-ascii-table Commented Nov 18, 2023 at 11:45
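
The combining-character point in the first comment can be illustrated with a minimal sketch using the standard-library unicodedata module. The thread does not say whether the speech files actually contain decomposed characters, so this is only an illustration of the general issue: the same visible "é" can be stored as one code point or as two, and a check on a single code point range would miss the second form.

```python
import unicodedata

# Escapes are used so the two forms stay distinct in the source file.
composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both render the same, but they are different code point sequences.
lengths = (len(composed), len(decomposed))
equal_before = composed == decomposed

# NFC normalization recomposes the pair into the single code point.
normalized = unicodedata.normalize("NFC", decomposed)
equal_after = normalized == composed

print(lengths, equal_before, equal_after)
```

Normalizing each line to NFC before filtering characters would make both spellings behave the same way.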

1 Answer

In the end the problem wasn't with Python itself, since its strings are Unicode. It was that open() doesn't read files as UTF-8 automatically: without an encoding argument it falls back to the platform's default encoding, so it had to be told. Here's the fixed and cleaned-up code for anyone who's looking to have a little fun.

def convert():
    for i in files_names: #here the file names were extracted outside of the function
        f1 = open(f"............../{i}","r", encoding="utf-8") #enter the name of your directory
        start_text = f1.readlines()
        cleaned_text=" "
        for lines in start_text:
            for letters in lines.lower() :
                if letters in "abcdefghijklmnopqrstuvwxyzüéâäåçêëèïîìôöòûùÿáíóúñà":
                    cleaned_text+=letters
                else:
                    if cleaned_text[-1]!=" ": #only adding a space if there isn't one already
                        cleaned_text+=" "
        f1.close()
        f2 = open(f"./................./{i}","w", encoding='utf-8') #enter the name of your new directory
        f2.write(cleaned_text)
        f2.close()
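
To see why the missing encoding turned "é" into a single space, here is a small sketch that replays the misreading in memory. It assumes the platform default was a single-byte code page such as cp1252; the thread does not say which encoding was actually in effect, so treat that choice as an assumption.

```python
# "é" as stored in a UTF-8 file is two bytes.
utf8_bytes = "é".encode("utf-8")

# Decoding those bytes as cp1252 (assumed here; the real default encoding
# on the asker's machine is unknown) yields two unrelated characters.
misread = utf8_bytes.decode("cp1252")
code_points = [ord(ch) for ch in misread]

# Neither code point is in the 224..254 range tested in the question,
# so both hit the "else" branch and collapse into one space.
in_range = [224 <= cp <= 254 for cp in code_points]

# Decoding with the matching codec recovers the original character.
correct = utf8_bytes.decode("utf-8")

print(len(utf8_bytes), code_points, in_range, ord(correct))
```

Under that assumption, this also explains why the print(k) in the question never fired: the accented letters never reached the loop as single characters in the expected range.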
answered Nov 19, 2023 at 16:25