Differences in word counts in two files

Question 1

How can I change my two functions to be more readable functions (i.e. in steps), because I am using pycharm.

import collections
def compute_word_importance(fpath1,fpath2):
 mylist1 = []
 mylist2 = []
 with open(fpath1, 'rt', encoding='UTF-8') as f:
 for line1 in f:
 for word in line1.split(" "):
 if word.strip('\n'):
 mylist1.append(word.strip())
 with open(fpath2, 'rt', encoding='UTF-8') as x:
 for line2 in x:
 for word in line2.split(" "):
 if word.strip('\n'):
 mylist2.append(word.strip())
 if mylist1==0 and mylist2==0:
 return None
 c=collections.Counter(mylist1)
 c.subtract(mylist2)
 return c

Question 2

I think your for loops can be changed to:

with open(fpath1, 'rt', encoding='UTF-8') as f:
 for line in f:
 mylist1.append(word.strip() for word in line.split())
with open(fpath2, 'rt', encoding='UTF-8') as f:
 for line in f:
 mylist2.append(word.strip() for word in line.split())

At this point in time one realises that these loops are a little to similar, so let us continue with joining them into something like:

for filepath in (fpath1, fpath2):
 with open(filepath, 'rt', encoding='UTF-8') as inputfile:
 for line in inputfile:
 mylistX.append(word.strip() for word in line.split())

But then we see that we can't append everything into mylist1 or mylist2, which already in the names are indicating with the addition of 1 and 2 that they should really be an array already. So lets do that:

lists = []
for filepath in (fpath1, fpath2):
 tmp_list = []
 with open(filepath, 'rt', encoding='UTF-8') as inputfile:
 for line in inputfile:
 tmp_list.append(word.strip() for word in line.split())
 lists.append(tmp_list)

And then to wrap it up we need some changes to the end of the code to adjust to the newly created list of lists:

if len(lists[0]) == 0 or len(lists[1]) == 0:
 return None
counter = collections.Counter(lists[0])
 counter.subtract(lists[1])
 return counter

It is left as an exercise for the user to extend the parameters into a list of filepaths, and extend the logic to return a counter with all remaining files subtracted if that is a wanted extension.

Question 3

Change lines like this:

 for word in line1.split(" "):

To lines like:

 for word in (w.strip() for w in line1.split(" ")):

UPDATE:

Here are two ways you could tighten up your function:

1:

 def compute_word_importance(fpath1,fpath2):
 count = collections.Counter()
 with open(fpath1, 'rt', encoding='UTF-8') as fp:
 for line in f:
 count.update([w.strip() for word in line.split()])
 with open(fpath2, 'rt', encoding='UTF-8') as fp:
 for line in fp:
 count.subtract([w.strip() for word in line.split()])
 return len(counter) and counter or none

2:

 def compute_word_importance(fpath1,fpath2):
 def get_words(path):
 words = []
 with open(path, 'rt', encoding='UTF-8') as fp:
 for line1 in f:
 words.extend([w.strip() for word in line.split()])
 counter = collections.Counter(get_words(fpath1))
 counter.subtract(get_words(fpath2))
 return len(counter) and counter or none

Question 4

+one. better than my suggestion in the comment above. And pythonic. And uses a generator expression.

Question 5

is there any possible way to make my function more readable because it looks complicated to me

Question 6

i updated my post , Can someone please edit my function to simpler way

Question 7

Instead of (w.strip() for w in line1.split(" ")) you could use map: map(str.strip, line1.split()). You also don't need to pass " " to split since it will automatically default to splitting based on whitespace.

holroy holroy 11.7k1 gold badge27 silver badges59 bronze badges · Answer 1 · 2015-12-10 03:21:49Z

I think your for loops can be changed to:

with open(fpath1, 'rt', encoding='UTF-8') as f:
 for line in f:
 mylist1.append(word.strip() for word in line.split())
with open(fpath2, 'rt', encoding='UTF-8') as f:
 for line in f:
 mylist2.append(word.strip() for word in line.split())

At this point in time one realises that these loops are a little to similar, so let us continue with joining them into something like:

for filepath in (fpath1, fpath2):
 with open(filepath, 'rt', encoding='UTF-8') as inputfile:
 for line in inputfile:
 mylistX.append(word.strip() for word in line.split())

But then we see that we can't append everything into mylist1 or mylist2, which already in the names are indicating with the addition of 1 and 2 that they should really be an array already. So lets do that:

lists = []
for filepath in (fpath1, fpath2):
 tmp_list = []
 with open(filepath, 'rt', encoding='UTF-8') as inputfile:
 for line in inputfile:
 tmp_list.append(word.strip() for word in line.split())
 lists.append(tmp_list)

And then to wrap it up we need some changes to the end of the code to adjust to the newly created list of lists:

if len(lists[0]) == 0 or len(lists[1]) == 0:
 return None
counter = collections.Counter(lists[0])
 counter.subtract(lists[1])
 return counter

It is left as an exercise for the user to extend the parameters into a list of filepaths, and extend the logic to return a counter with all remaining files subtracted if that is a wanted extension.

Turn Turn 1213 bronze badges · Answer 2 · 2015-12-10 02:47:27Z

Change lines like this:

 for word in line1.split(" "):

To lines like:

 for word in (w.strip() for w in line1.split(" ")):

UPDATE:

Here are two ways you could tighten up your function:

1:

 def compute_word_importance(fpath1,fpath2):
 count = collections.Counter()
 with open(fpath1, 'rt', encoding='UTF-8') as fp:
 for line in f:
 count.update([w.strip() for word in line.split()])
 with open(fpath2, 'rt', encoding='UTF-8') as fp:
 for line in fp:
 count.subtract([w.strip() for word in line.split()])
 return len(counter) and counter or none

2:

 def compute_word_importance(fpath1,fpath2):
 def get_words(path):
 words = []
 with open(path, 'rt', encoding='UTF-8') as fp:
 for line1 in f:
 words.extend([w.strip() for word in line.split()])
 counter = collections.Counter(get_words(fpath1))
 counter.subtract(get_words(fpath2))
 return len(counter) and counter or none

+one. better than my suggestion in the comment above. And pythonic. And uses a generator expression.
is there any possible way to make my function more readable because it looks complicated to me
i updated my post , Can someone please edit my function to simpler way
Instead of (w.strip() for w in line1.split(" ")) you could use map: map(str.strip, line1.split()). You also don't need to pass " " to split since it will automatically default to splitting based on whitespace.

Stack Exchange Network

Differences in word counts in two files

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Differences in word counts in two files

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions