I am trying to read a .txt file and make a list of the words in it. I want the words to be strings, but the output makes them look like lists.
import csv, math, os
os.chdir(r'C:\Users\jmela\canopy')
f=open("romeo.txt")
words = []
for row in csv.reader(f):
    line = str(row)
    for word in line.split():
        if word not in words:
            print(word)
            words.append(word)
words.sort()
print(words)
Does anyone know what I am doing wrong?
3 Answers
Based on your latest comment, it doesn't look like you really need the csv reader. Just try this:
words = []
for line in open("romeo.txt", "r"):
    for word in line.split():
        if word not in words:
            words.append(word)
words.sort()
print(words)
And, as Kevin suggested, use set() instead of a list.
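A minimal sketch of the set() suggestion, in case it helps. The helper name unique_words and the sample lines are made up for illustration; they are not from the question or from romeo.txt.

```python
def unique_words(lines):
    """Return the distinct whitespace-separated words in lines, sorted."""
    seen = set()                   # a set silently ignores duplicates
    for line in lines:
        seen.update(line.split())  # add every word on this line
    return sorted(seen)            # sorted() returns a new list of strings


# Hypothetical sample lines standing in for the contents of the file.
sample = ["the sun and the moon", "the east and the sun"]
print(unique_words(sample))  # ['and', 'east', 'moon', 'sun', 'the']
```

An open file is itself an iterable of lines, so the same function should also accept open("romeo.txt") directly. Membership checks on a set stay fast as the word count grows, whereas "word not in words" on a list scans the whole list every time.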
Don't read the text file as csv then. Simply remove all punctuation and non-letter/non-space characters like this:
def replacePunct(string):
    alphabets = " abcdefghijklmnopqrstuvwxyz"
    string = string.lower()  # alphabets only holds lowercase letters
    for s in string:
        if s not in alphabets:
            string = string.replace(s, " ")  # punctuation, digits, newlines
    words = string.split()  # split() already drops empty pieces
    return sorted(set(words))  # unique words, as a sorted list of strings
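To see why reading the file as csv causes the problem in the question, here is a small sketch. The single sample line is a stand-in for one line of the file (an assumption, since the file contents are not shown). csv.reader yields each row as a list of fields, and calling str() on that list bakes the brackets and quotes into the text before it is split.

```python
import csv

# Hypothetical one-line input; csv.reader accepts any iterable of strings.
sample_lines = ["But soft what light"]

row = next(csv.reader(sample_lines))
print(row)      # ['But soft what light']  (a list with one field, no commas)

as_text = str(row)        # the list's printed form, brackets and quotes included
tokens = as_text.split()
print(tokens)   # ["['But", 'soft', 'what', "light']"]
```

The first and last tokens carry the list punctuation, which is why the words in the question's output look like pieces of a list instead of plain strings.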
You could use a set to hold your words, which gives you a unique word list. Any non-alphabetic characters are converted to spaces. Each line is then split into words, which are lowercased to make sure they match.
import re

word_set = set()
re_nonalpha = re.compile('[^a-zA-Z ]+')

with open(r"romeo.txt", "r") as f_input:
    for line in f_input:
        line = re_nonalpha.sub(' ', line)  # Convert all non a-z to spaces
        for word in line.split():
            word_set.add(word.lower())

word_list = list(word_set)
word_list.sort()
print(word_list)
This would give you a list holding the following words:
['already', 'and', 'arise', 'breaks', 'but', 'east', 'envious', 'fair', 'grief', 'is', 'it', 'juliet', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'who', 'window', 'with', 'yonder']
Updated to also remove any punctuation.