
For the MATLAB class I'm taking, I was given the task of writing a function ReadAndCountWords that takes the name of a text file (specifically one from this zip file) as an input argument and prints the words contained in that file, ordered by how many times each word occurs. The function doesn't have to return anything through output arguments. A call to the function might produce a result like this:

>> ReadAndCountWords('Speeches/Abraham_Lincoln_The_Gettysburg_Address.txt');
All words:
word: that count: 13
word: the count: 11
word: we count: 10
word: here count: 8
word: to count: 8
word: a count: 7
word: and count: 6
word: for count: 5
word: have count: 5
word: it count: 5
word: nation count: 5
word: of count: 5
word: dedicated count: 4
word: in count: 4
word: this count: 4
word: are count: 3
word: cannot count: 3
word: dead count: 3
word: great count: 3
word: is count: 3
word: people count: 3
word: shall count: 3
word: so count: 3
word: they count: 3
word: us count: 3
word: who count: 3
word: be count: 2
word: but count: 2
word: can count: 2
word: conceived count: 2
word: dedicate count: 2
word: devotion count: 2
word: far count: 2
word: from count: 2
word: gave count: 2
word: living count: 2
word: long count: 2
word: men count: 2
word: new count: 2
word: not count: 2
word: on count: 2
word: or count: 2
word: our count: 2
word: rather count: 2
word: these count: 2
word: war count: 2
word: what count: 2
word: which count: 2
word: above count: 1
word: add count: 1
word: advanced count: 1
word: ago count: 1
word: all count: 1
word: altogether count: 1
word: any count: 1
word: as count: 1
word: battlefield count: 1
word: before count: 1
word: birth count: 1
word: brave count: 1
word: brought count: 1
word: by count: 1
word: cause count: 1
word: civil count: 1
word: come count: 1
word: consecrate count: 1
word: consecrated count: 1
word: continent count: 1
word: created count: 1
word: detract count: 1
word: did count: 1
word: died count: 1
word: do count: 1
word: earth count: 1
word: endure count: 1
word: engaged count: 1
word: equal count: 1
word: fathers count: 1
word: field count: 1
word: final count: 1
word: fitting count: 1
word: forget count: 1
word: forth count: 1
word: fought count: 1
word: four count: 1
word: freedom count: 1
word: full count: 1
word: god count: 1
word: government count: 1
word: ground count: 1
word: hallow count: 1
word: highly count: 1
word: honored count: 1
word: increased count: 1
word: larger count: 1
word: last count: 1
word: liberty count: 1
word: little count: 1
word: live count: 1
word: lives count: 1
word: measure count: 1
word: met count: 1
word: might count: 1
word: never count: 1
word: nobly count: 1
word: nor count: 1
word: note count: 1
word: now count: 1
word: perish count: 1
word: place count: 1
word: poor count: 1
word: portion count: 1
word: power count: 1
word: proper count: 1
word: proposition count: 1
word: remaining count: 1
word: remember count: 1
word: resolve count: 1
word: resting count: 1
word: say count: 1
word: score count: 1
word: sense count: 1
word: seven count: 1
word: should count: 1
word: struggled count: 1
word: take count: 1
word: task count: 1
word: testing count: 1
word: their count: 1
word: those count: 1
word: thus count: 1
word: under count: 1
word: unfinished count: 1
word: vain count: 1
word: whether count: 1
word: will count: 1
word: work count: 1
word: world count: 1
word: years count: 1

Some guidelines I was given:

  • The code should drop all punctuation, except for ' (contraction) marks. For example, "don't" should be considered one word.
  • Once the code has divided things into words, it should eliminate ' marks (contractions) from the interior of words (so "don't" should be listed as "dont").
  • All words should be converted to lower case.
  • A word is considered to match only if it is a precise match using the strcmp routine ("discovered" and "discover" are different words).
  • Searching for a word in a cell array can be useful.
  • The code will need to print out the words ordered by the number of occurrences, from most to least. Words that have the same number of occurrences should be sorted alphabetically, so if two words both occur 2 times, the word that comes earlier alphabetically should appear first in the listed output (see the sketch after this list).
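
For reference, here is a minimal sketch of those steps. The variable names (raw, tokens, counts) are hypothetical, and this is just one way to follow the guidelines, not necessarily the required approach:

% raw: cell array of whitespace-separated tokens read from the file (hypothetical name)
tokens = lower(raw);
tokens = regexprep(tokens, '[^a-z'']', '');   % drop punctuation, but keep ' while splitting words
tokens = regexprep(tokens, '''', '');         % then remove the ' marks ("don't" -> "dont")
tokens = tokens(~cellfun(@isempty, tokens));  % discard tokens that were pure punctuation

[words, ~, labels] = unique(tokens);          % words come back alphabetically sorted
counts = accumarray(labels, 1);               % occurrences of each unique word

% Two-key ordering: count descending, alphabetical as the tie-break.
% The index 1:n works as the secondary key because unique already
% returned the words in alphabetical order.
[~, order] = sortrows([-counts(:), (1:numel(counts)).']);
words  = words(order);
counts = counts(order);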

And for extra credit:

The Speeches folder contains one more file: stop_words.txt. In text processing, it is often considered useful to eliminate words that we expect to occur with extremely high frequency because they are filler words (and carry little to no actual information). For example, the words "a", "the" and "this" occur with high frequency and carry no useful information about the file itself because virtually all files will contain many of these words. Such words are often referred to as stop words. The file stop_words.txt contains an example of such a list of words.

For extra credit, add a process to your code that reads in the set of stop words, and when you print out the words that occur in a speech, you should exclude all stop words.
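
A sketch of that filtering step might look like this (it assumes the file sits at Speeches/stop_words.txt as described above, and that words and counts are the variables from the counting sketch):

stopid    = fopen('Speeches/stop_words.txt');
stopScan  = textscan(stopid, '%s');
fclose(stopid);
stopWords = lower(stopScan{1});         % one stop word per cell

keep   = ~ismember(words, stopWords);   % logical mask over the unique words
words  = words(keep);
counts = counts(keep);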

Note:

You must be able to call your code both ways:

  • Showing the results as indicated in the project write-up above
  • And showing the results with stop words excluded

Example output for extra credit:

Without stop words:
word: nation count: 5
word: dedicated count: 4
word: dead count: 3
word: great count: 3
word: people count: 3
word: shall count: 3
word: conceived count: 2
word: dedicate count: 2
word: devotion count: 2
word: far count: 2
word: gave count: 2
word: living count: 2
word: long count: 2
word: men count: 2
word: new count: 2
word: war count: 2
word: add count: 1
word: advanced count: 1
word: ago count: 1
word: altogether count: 1
word: battlefield count: 1
word: birth count: 1
word: brave count: 1
word: brought count: 1
word: cause count: 1
word: civil count: 1
word: come count: 1
word: consecrate count: 1
word: consecrated count: 1
word: continent count: 1
word: created count: 1
word: detract count: 1
word: did count: 1
word: died count: 1
word: earth count: 1
word: endure count: 1
word: engaged count: 1
word: equal count: 1
word: fathers count: 1
word: field count: 1
word: final count: 1
word: fitting count: 1
word: forget count: 1
word: forth count: 1
word: fought count: 1
word: freedom count: 1
word: god count: 1
word: government count: 1
word: ground count: 1
word: hallow count: 1
word: highly count: 1
word: honored count: 1
word: increased count: 1
word: larger count: 1
word: liberty count: 1
word: little count: 1
word: live count: 1
word: lives count: 1
word: measure count: 1
word: met count: 1
word: nobly count: 1
word: note count: 1
word: perish count: 1
word: place count: 1
word: poor count: 1
word: portion count: 1
word: power count: 1
word: proper count: 1
word: proposition count: 1
word: remaining count: 1
word: remember count: 1
word: resolve count: 1
word: resting count: 1
word: say count: 1
word: score count: 1
word: sense count: 1
word: seven count: 1
word: struggled count: 1
word: task count: 1
word: testing count: 1
word: unfinished count: 1
word: vain count: 1
word: work count: 1
word: world count: 1
word: years count: 1

My implementation (what I'm looking to have reviewed):

ReadAndCountWords.m:

function ReadAndCountWords(fileName, stopFile)
if (exist('stopFile', 'var'))
    stopid = fopen(stopFile);
    stopData = textscan(stopid, '%s');
    stopData = lower(stopData{1});
else
    stopData = [];
end
fileid = fopen(fileName);
data = textscan(fileid, '%s');
data = regexprep(lower(data{1}), '[^a-z]', '');
[words, ~, labels] = unique(data);
count = histc(labels, 1:max(labels));
[count, indices] = sort(count, 'descend');
words = words(indices);
if (isempty(stopData))
    fprintf('All words:\n');
else
    fprintf('Without stop words:\n');
end
for i = 1:length(count)
    if(~isempty(words{i}) && ~any(strcmp(stopData, words{i})))
        fprintf('word: %-20s count %5d\n', words{i}, count(i));
    end
end
fclose('all');
end

And the driver (don't review this please):

Word_Count_Speeches.m:

diaryFile = 'project3Results.txt';
if exist(diaryFile)
 delete(diaryFile);
end
% Count for all speeches
diary(diaryFile);
fileName = 'Speeches/Abraham_Lincoln_The_Gettysburg_Address.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
fileName = 'Speeches/Abraham_Lincoln_First_Inaugural.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
fileName = 'Speeches/Abraham_Lincoln_Second_Inaugural.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
fileName = 'Speeches/Franklin_Delano_Roosevelt_First_Inaugural.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
fileName = 'Speeches/Franklin_Delano_Roosevelt_Pearl_Harbor_Address.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
fileName = 'Speeches/John_F_Kennedy_Inaugural.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
fileName = 'Speeches/Malcolm_X_The_Ballot_Or_The_Bullet.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
fileName = 'Speeches/Martin_Luther_King_I_Have_A_Dream.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
fileName = 'Speeches/Susan_B_Anthony_On_Women_s_Right_To_Vote.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
fileName = 'Speeches/Theodore_Roosevelt_The_Duties_Of_American_Citizenship.txt';
fprintf('For file %s:\n',fileName);
ReadAndCountWords(fileName);
diary off;
clear diaryFile fileName;

Timing the script with tic; Word_Count_Speeches; toc;, it ran in 3.047776 seconds.

Are there ways I can clean up my function further? Can I get rid of the for loop and use vectorization instead? Can I make the code faster and more efficient?

asked Nov 12, 2014 at 2:11

1 Answer


Rather than going through all of labels looking for the biggest value in this line:

count = histc(labels, 1:max(labels))

you can pick that number off directly: it is just the number of unique words, numel(words):

count = histc(labels, 1:numel(words))

Alternatively, you can use accumarray:

count = accumarray(labels,1);

On this line in the loop

if(~isempty(words{i}) && ~any(strcmp(stopData, words{i})))

scanning through the stopData list on every iteration is expensive. Instead, you could filter the stop words out of words once, before the print loop (intersect or ismember both work for this).
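
A sketch of that pre-filtering, using ismember and the same variable names as your function (it assumes stopData is a cell array, e.g. {} when no stop file was given):

% remove stop words and empty strings once, before printing
keep  = ~ismember(words, stopData) & ~cellfun(@isempty, words);
words = words(keep);
count = count(keep);
for i = 1:length(count)
    fprintf('word: %-20s count %5d\n', words{i}, count(i));
end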


Rather than using exist to check whether a variable was passed in,

if (exist('stopFile', 'var')) 

I prefer to use nargin:

if (nargin < 2)
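
A sketch of the argument handling rewritten that way, defaulting stopData to an empty cell array so the later string comparisons stay valid:

if (nargin < 2)
    stopData = {};                  % no stop words supplied
else
    stopid   = fopen(stopFile);
    stopScan = textscan(stopid, '%s');
    fclose(stopid);
    stopData = lower(stopScan{1});
end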
syb0rg
answered Nov 13, 2014 at 15:30
