Let's go through a couple of things: the concept first, then the missing functionality (and you do have some).
#Concept
An n-gram is a sequence of n words found in a span of text. You can have 1-grams, 2-grams, 3-grams, and so on up to n-grams. You identify these n-grams by finding every possible n-word-wide span of the text, and storing it. In the sentence `the quick brown fox`, there are three 2-grams: 'the quick', 'quick brown' and 'brown fox'. There are two 3-grams: 'the quick brown' and 'quick brown fox'.
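To make the sliding window concrete, here is a minimal sketch of n-gram extraction. The class name `NGramExtractor` and the method `extract` are hypothetical names chosen for illustration, and the sketch assumes the text has already been split into a list of words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not taken from the code under review.
final class NGramExtractor {

    /** Returns every contiguous run of n words, in order of appearance. */
    static List<List<String>> extract(List<String> words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<List<String>> grams = new ArrayList<>();
        // The last valid window starts at index words.size() - n.
        for (int start = 0; start + n <= words.size(); start++) {
            // Copy the window so the result does not alias the input list.
            grams.add(new ArrayList<>(words.subList(start, start + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the quick brown fox".split(" "));
        System.out.println(extract(words, 2)); // [[the, quick], [quick, brown], [brown, fox]]
        System.out.println(extract(words, 3)); // [[the, quick, brown], [quick, brown, fox]]
    }
}
```

A text with fewer than n words simply yields no n-grams, which is why the loop condition is written as `start + n <= words.size()`.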
When processing natural languages, it is often statistically convenient to weigh the likelihood of a particular word happening 'next' after an existing sequence of words.
That is what this problem is about. Given a value of n, find all the (n+1)-grams in the text. Then, taking any sequence of n words, look for all the (n+1)-grams that start with those words, and randomly choose one of them to select the 'next' word. Repeat the process until you run out of (n+1)-grams that match, or you hit the 'sentence' limit. You have just built a sentence that is statistically 'likely'. A smarter system will 'weight' the next word based on the frequency of the (n+1)-grams that were found in the text. I.e., if the original text has 'the white house' 10 times, and 'the white swan' just once, then it will 'randomly' choose 'house' 10 times more often than 'swan'.
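The weighting idea can be sketched as follows. This is only one way to do it, under the assumption that you keep a count of how often each candidate 'next' word followed the current words. The names `WeightedChooser` and `choose` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not taken from the code under review.
final class WeightedChooser {

    /** Picks a key with probability proportional to its count. */
    static String choose(Map<String, Integer> counts, Random random) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("no candidates to choose from");
        }
        // Pick a position in [0, total) and walk the entries until it is covered.
        int target = random.nextInt(total);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            target -= entry.getValue();
            if (target < 0) {
                return entry.getKey();
            }
        }
        throw new IllegalStateException("unreachable when all counts are non-negative");
    }

    public static void main(String[] args) {
        // Counts of the words seen after 'the white' in the example above.
        Map<String, Integer> afterTheWhite = new LinkedHashMap<>();
        afterTheWhite.put("house", 10);
        afterTheWhite.put("swan", 1);
        System.out.println(choose(afterTheWhite, new Random()));
    }
}
```

A simpler alternative is to store every follower word in a list, duplicates included, and pick a uniformly random element. The duplicates then provide the weighting for free, at the cost of more memory.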
OK, that gives you some context for the problem.
#Functionality
The challenge/requirement was to take the `n` value as an input. You have hard-coded it as 2. In other words, you have 2-grams and 3-grams, when you are supposed to have n-grams and (n+1)-grams.

    for(int listIndex = 0 ;listIndex < givenList.size() - 2; listIndex++)

That `2` should be an `n`, and the rest of the logic needs to change to match.
Similarly, the `x1` and `x2` variables should be a list, or an array, because there could be more than 2.
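As a rough sketch of what the generalised loop could look like, the version below replaces the fixed pair of variables with a `List<String>` key of length `n`. The class `NGramModel`, the method `build`, and the choice of a `Map<List<String>, List<String>>` are my assumptions for illustration, not a description of your existing map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the code under review.
final class NGramModel {

    /** Maps each n-word prefix to the words seen immediately after it. */
    static Map<List<String>, List<String>> build(List<String> words, int n) {
        Map<List<String>, List<String>> followers = new HashMap<>();
        // Stop n short of the end so that words.get(i + n) is always valid.
        for (int i = 0; i < words.size() - n; i++) {
            // The prefix is copied so the key never changes after insertion.
            List<String> prefix = new ArrayList<>(words.subList(i, i + n));
            // Duplicates are kept on purpose: they record how often each word followed.
            followers.computeIfAbsent(prefix, key -> new ArrayList<>())
                     .add(words.get(i + n));
        }
        return followers;
    }
}
```

Lists compare and hash by their contents, so a list of words works as a map key as long as it is not modified afterwards.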
You have already commented that the length of the output sentence is supposed to be user-input. You should make it user-input, as well as the first `n` words.
You have missed the point on generating the sentence as well. You assume that there will always be a valid, matching random word to add to the sentence until you run out of words. This is not true. You may be partway through a sentence when you discover that the last `n` words in your sentence do not match any available (n+1)-grams. There is then nothing you can add to the sentence, and you have to stop short.
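The early-stop case can be sketched like this. It assumes a map from each n-word prefix to the list of words that followed it in the source text. The names `SentenceGenerator`, `generate`, `seed` and `maxWords` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not taken from the code under review.
final class SentenceGenerator {

    /**
     * Extends the seed words one word at a time, stopping at maxWords
     * or as soon as the last n words have no known follower.
     */
    static List<String> generate(Map<List<String>, List<String>> followers,
                                 List<String> seed, int maxWords, Random random) {
        int n = seed.size();
        List<String> sentence = new ArrayList<>(seed);
        while (sentence.size() < maxWords) {
            // Look up the last n words of the sentence built so far.
            List<String> lastN = new ArrayList<>(
                    sentence.subList(sentence.size() - n, sentence.size()));
            List<String> candidates = followers.get(lastN);
            if (candidates == null || candidates.isEmpty()) {
                break; // Dead end: no (n+1)-gram starts with these words.
            }
            sentence.add(candidates.get(random.nextInt(candidates.size())));
        }
        return sentence;
    }
}
```

The important part is the `break`: the result may be shorter than `maxWords`, and the caller has to accept that.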
##Code Style
Your code should be broken out into more functions. You currently have just one, which is used to get the next random word. You should have another to read the input file, and probably another that populates the map, etc.
Your main method is very heavy-weight, and should have function-extraction applied.
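As an example of the kind of extraction I mean, the file reading could live in its own small class. This is a sketch only: the names `WordReader`, `tokenize` and `readWords` are hypothetical, and splitting on whitespace is an assumption about how you want to define a 'word'.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the code under review.
final class WordReader {

    /** Splits one line of text into words on runs of whitespace. */
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            // A blank line splits into a single empty string, which is skipped.
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Reads every word of a UTF-8 text file, in order. */
    static List<String> readWords(Path file) {
        List<String> words = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(file)) {
                words.addAll(tokenize(line));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

With helpers like this, the main method shrinks to reading the input, building the map, and generating the sentence, each as a single call.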
You also have indentation that is all over the place, which makes things hard to spot. It took me a while just to see that `randomWord` was a function.
##Conclusion
Your code is only partially working, and some core functionality is missing. You are a good way along to getting a working solution. Hopefully the background on the problem will help you to understand what it is trying to solve. Basically: based on statistics from existing texts, randomly generate a new sentence that uses those statistics to predict what the next words in the sentence will be.