Creating a Dictionary<string,string> to generate the most frequent continuations (without Linq)

Question 1

Main is added to compare the expected result with the result of the function, the war field is GetMostFrequentNextWords.

What data types can be better instead of Dictionary<string, Dictionary<string, int>>?
Are there any problems with the usage of this data type?
Is my code easy to read?

In Particular, the creation of Dictionaries for new keys takes a lot of code volume.

Are keyValuePairInternal, helpDictionary good names for variables?
Is the use of the variable keyTwoWords justified?

I am searching for all my problems

The task:

N-gram it's N adjacent words in one sentence. for example, from the text:

"She stood up. Then she left."

you can take the next 2-grams

"she stood",
"stood up",
"then she"
and "she left",
but not the next case "up then".

and 3-grams

"she stood up"
and "then she left",
but not the case "stood up then".

with a list of sentences (the list is consist of words, that are assembled in the list of sentences) create a vocabulary of the most frequent continuations of 2-grams and 3-grams. It's the vocabulary, in which all possible beginnings of 2-grams and 3-grams are Keys, and the Values are their most frequent continuations. If there are few continuations with equal frequency, you should store a string that is smaller (with help of String.CompareOrdinal).

Example Text: "a b c d. b c d. e b c a d." You should get the next vocabulary:

1 "a": "b"

2 "b": "c"

3 "c": "d"

4 "e": "b"

5 "a b": "c"

6 "b c": "d"

7 "e b": "c"

8 "c a": "d"

From 2-grams "a b" and "a d" that met once, there is only one pair in Dictionary "a": "b", because it's the smaller one. From the next two 2-grams "c d" и "c a" There is only one most frequent pair "c": "d". And from 3-grams "b c d" and "b c a" Next pair is in vocabulary as the most frequent one "b c": "d".

using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApp11
{
 public static class SentencesParserTask
 {
 public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
 {
 var result = new Dictionary<string, string>();
 var helpDictionary = new Dictionary<string, Dictionary<string, int>>();
 for (int i = 0; i < text.Count; i++)
 {
 for (int j = 0; j < text[i].Count; j++)
 {
 if (text[i].Count - j >= 3)
 {
 string keyTwoWords = text[i][j] + " " + text[i][j + 1];
 if (!helpDictionary.ContainsKey(keyTwoWords))
 helpDictionary[keyTwoWords] = new Dictionary<string, int>();
 if (!helpDictionary[keyTwoWords].ContainsKey(text[i][j + 2]))
 helpDictionary[keyTwoWords][text[i][j + 2]] = 0;
 helpDictionary[keyTwoWords][text[i][j + 2]]++;
 }
 if (text[i].Count - j >= 2)
 {
 if (!helpDictionary.ContainsKey(text[i][j]))
 helpDictionary[text[i][j]] = new Dictionary<string, int>();
 if (!helpDictionary[text[i][j]].ContainsKey(text[i][j + 1]))
 helpDictionary[text[i][j]][text[i][j + 1]] = 0;
 helpDictionary[text[i][j]][text[i][j + 1]]++;
 }
 }
 }
 foreach (KeyValuePair<string, Dictionary<string, int>> keyValuePair in helpDictionary)
 {
 if (!result.ContainsKey(keyValuePair.Key))
 result[keyValuePair.Key] = "";
 int maxFrequencyCount = 0;
 string maxFrequencyString = "";
 foreach (KeyValuePair<string, int> keyValuePairInternal in keyValuePair.Value)
 {
 if ((keyValuePairInternal.Value > maxFrequencyCount) || ((keyValuePairInternal.Value == maxFrequencyCount) &&
 (string.CompareOrdinal(maxFrequencyString, keyValuePairInternal.Key) > 0)))
 {
 result[keyValuePair.Key] = keyValuePairInternal.Key;
 maxFrequencyCount = keyValuePairInternal.Value;
 maxFrequencyString = keyValuePairInternal.Key;
 }
 }
 }
 return result;
 }
 
 static void Main(string[] args)
 {
 //a b c d. b c d. e b c a d.
 var list = new List<List<string>>();
 list.Add(new List<string>());
 list[0].Add("a");
 list[0].Add("b");
 list[0].Add("c");
 list[0].Add("d");
 list.Add(new List<string>());
 list[1].Add("b");
 list[1].Add("c");
 list[1].Add("d");
 list.Add(new List<string>());
 list[2].Add("e");
 list[2].Add("b");
 list[2].Add("c");
 list[2].Add("a");
 list[2].Add("d");
 Console.WriteLine("The origin text : \n" );
 foreach (var sentence in list)
 {
 foreach(var word in sentence)
 {
 Console.Write(" " + word);
 }
 Console.Write(".");
 }
 Console.Write("\n\n ");
 Console.WriteLine("Expected output: \n1 a: b \n2 b: c \n3 c: d \n4 e: b \n5 a b: c \n6 b c: d \n7 e b: c \n8 c a: d \n\n ");
 var dictionary = new Dictionary<string, string>(GetMostFrequentNextWords(list));
 int index = 1;
 Console.WriteLine("Your output (THE ORDER ISNT IMPORTANT) :");
 foreach (var keyValuePair in dictionary)
 {
 Console.WriteLine(index++ + " " + keyValuePair.Key + ": " + keyValuePair.Value);
 }
 Console.ReadLine();
 }
 }
}

Question 2

Have you tried to split your GetMostFrequentNextWords into small, manageable functions?

Question 3

@PeterCsala I lack such experience. Should I create methods for every foreach or should I somehow simplify them? If the second i cant imagine anything now with my skills

Question 4

Lets do some refactoring together

`GetMostFrequentNextWords`

If you look at this function then you can easily distinguish two parts
- Populate the helpDictionary
- Convert the helpDictionary to the desired format
So, lets extract these codes into their own methods
With these in our hand the only responsibility of the GetMostFrequentNextWords method is that it needs to call the methods in the right order

public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
{
 var intermediateCollection = GetIntermediateCollection(text);
 return TransformIntermediateToFinal(intermediateCollection);
}

or in short

public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
 => TransformIntermediateToFinal(GetIntermediateCollection(text));

`GetIntermediateCollection`

If you look at the body of the nested for loop then you can spot that both if branches do the same thing but on different indexes/indices
So, the common parts could (and should) be extracted and only the unique part should be provided as input

private static Dictionary<string, Dictionary<string, int>> GetIntermediateCollection(List<List<string>> text)
{
 var intermediateCollection = new Dictionary<string, Dictionary<string, int>>();
 foreach (var sentence in text)
 {
 for (int word = 0; word < sentence.Count; word++)
 {
 if (sentence.Count - word >= 3)
 UpdateCollection(intermediateCollection, sentence[word] + " " + sentence[word + 1], sentence[word + 2]);
 if (sentence.Count - word >= 2)
 UpdateCollection(intermediateCollection, sentence[word], sentence[word + 1]);
 }
 }
 return intermediateCollection;
}

I've tired to use more meaningful iterator names than i and j

`UpdateCollection`

This is the common part of the two if branches
I've used innerKey and outerKey as parameter names to indicate their intended usage

private static void UpdateCollection(Dictionary<string, Dictionary<string, int>> twoLevelCollection, string outerKey, string innerKey)
{
 if (!twoLevelCollection.ContainsKey(outerKey))
 twoLevelCollection[outerKey] = new Dictionary<string, int>();
 if (!twoLevelCollection[outerKey].ContainsKey(innerKey))
 twoLevelCollection[outerKey][innerKey] = 0;
 twoLevelCollection[outerKey][innerKey]++;
}

`TransformIntermediateToFinal`

Here I took advantage of C#'s deconstruction feature in the foreach loops

private static Dictionary<string, string> TransformIntermediateToFinal(Dictionary<string, Dictionary<string, int>> twoLevelCollection)
{
 var result = new Dictionary<string, string>();
 foreach (var (grams, innerGrams) in twoLevelCollection)
 {
 if (!result.ContainsKey(grams))
 result[grams] = "";
 int maxFrequencyCount = 0;
 string maxFrequencyString = "";
 foreach (var (word, frequency) in innerGrams)
 {
 if (frequency > maxFrequencyCount
 || (frequency == maxFrequencyCount
 && string.CompareOrdinal(maxFrequencyString, word) > 0))
 (result[grams], maxFrequencyString, maxFrequencyCount) = (word, word, frequency);
 }
 }
 return result;
}

Just like you in your original code I also had problem to find good naming for the iterator variables
- Maybe it make sense to spend some time on it to find better names than these

`Main`

And last but not least the caller side

var input = new List<List<string>>
{
 new List<string>() { "a", "b", "c", "d"},
 new List<string>() { "b", "c", "d"},
 new List<string>() { "e", "b", "c", "a", "d"},
};
Console.WriteLine("The origin text : \n");
foreach (var sentence in input)
{
 Console.Write(string.Join(" ", sentence));
 Console.Write(".");
}
Console.Write("\n\n ");
Console.WriteLine("Expected output: \n1 a: b \n2 b: c \n3 c: d \n4 e: b \n5 a b: c \n6 b c: d \n7 e b: c \n8 c a: d \n\n ");
Console.WriteLine("Actual output (the ordering doesn't matter):");
var mostFreuqentOnes = new SentencesParser().GetMostFrequentNextWords(input);
foreach (var (idx, lhs, rhs) in mostFreuqentOnes.Select((kv, idx) => (idx, kv.Key, kv.Value)))
 Console.WriteLine(idx+1 + " " + lhs + ": " + rhs);

Here you can find a working example

Question 5

Thanks a lot. I appreciate your help. When you were solving my question did you think about my choice of Dictionary<string,Dictionry>> data type. Was my way to solve it good?

Question 6

@Sheeba334 No, I was not thinking about your data type choice. If you wish I can update my post on Monday.

Question 7

You don't have to write all code. If it's not difficult just think about my choice and others that can be better. And a short plan of how to do it with a new data type will be enough. Don't want to waste your time. I have time so I can wait. Appreciate your help

Question 8

@Sheeba334 Iterating through a Dictionary is considered slow compared to iterating through a List or Array (you do this only once). On the other hand the ContainsKey has quite good performance (you do this several times). Since there is no built-in structure for multi-dimensional string keyed arrays in .NET that's why Dictionary<string, Dictionary<string, int>> seems to me a good choice for this problem.

Peter Csala Peter Csala 10.7k1 gold badge16 silver badges36 bronze badges · Accepted Answer · 2022-11-25 12:39:54Z

Lets do some refactoring together

`GetMostFrequentNextWords`

If you look at this function then you can easily distinguish two parts
- Populate the helpDictionary
- Convert the helpDictionary to the desired format
So, lets extract these codes into their own methods
With these in our hand the only responsibility of the GetMostFrequentNextWords method is that it needs to call the methods in the right order

public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
{
 var intermediateCollection = GetIntermediateCollection(text);
 return TransformIntermediateToFinal(intermediateCollection);
}

or in short

public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
 => TransformIntermediateToFinal(GetIntermediateCollection(text));

`GetIntermediateCollection`

If you look at the body of the nested for loop then you can spot that both if branches do the same thing but on different indexes/indices
So, the common parts could (and should) be extracted and only the unique part should be provided as input

private static Dictionary<string, Dictionary<string, int>> GetIntermediateCollection(List<List<string>> text)
{
 var intermediateCollection = new Dictionary<string, Dictionary<string, int>>();
 foreach (var sentence in text)
 {
 for (int word = 0; word < sentence.Count; word++)
 {
 if (sentence.Count - word >= 3)
 UpdateCollection(intermediateCollection, sentence[word] + " " + sentence[word + 1], sentence[word + 2]);
 if (sentence.Count - word >= 2)
 UpdateCollection(intermediateCollection, sentence[word], sentence[word + 1]);
 }
 }
 return intermediateCollection;
}

I've tired to use more meaningful iterator names than i and j

`UpdateCollection`

This is the common part of the two if branches
I've used innerKey and outerKey as parameter names to indicate their intended usage

private static void UpdateCollection(Dictionary<string, Dictionary<string, int>> twoLevelCollection, string outerKey, string innerKey)
{
 if (!twoLevelCollection.ContainsKey(outerKey))
 twoLevelCollection[outerKey] = new Dictionary<string, int>();
 if (!twoLevelCollection[outerKey].ContainsKey(innerKey))
 twoLevelCollection[outerKey][innerKey] = 0;
 twoLevelCollection[outerKey][innerKey]++;
}

`TransformIntermediateToFinal`

Here I took advantage of C#'s deconstruction feature in the foreach loops

private static Dictionary<string, string> TransformIntermediateToFinal(Dictionary<string, Dictionary<string, int>> twoLevelCollection)
{
 var result = new Dictionary<string, string>();
 foreach (var (grams, innerGrams) in twoLevelCollection)
 {
 if (!result.ContainsKey(grams))
 result[grams] = "";
 int maxFrequencyCount = 0;
 string maxFrequencyString = "";
 foreach (var (word, frequency) in innerGrams)
 {
 if (frequency > maxFrequencyCount
 || (frequency == maxFrequencyCount
 && string.CompareOrdinal(maxFrequencyString, word) > 0))
 (result[grams], maxFrequencyString, maxFrequencyCount) = (word, word, frequency);
 }
 }
 return result;
}

Just like you in your original code I also had problem to find good naming for the iterator variables
- Maybe it make sense to spend some time on it to find better names than these

`Main`

And last but not least the caller side

var input = new List<List<string>>
{
 new List<string>() { "a", "b", "c", "d"},
 new List<string>() { "b", "c", "d"},
 new List<string>() { "e", "b", "c", "a", "d"},
};
Console.WriteLine("The origin text : \n");
foreach (var sentence in input)
{
 Console.Write(string.Join(" ", sentence));
 Console.Write(".");
}
Console.Write("\n\n ");
Console.WriteLine("Expected output: \n1 a: b \n2 b: c \n3 c: d \n4 e: b \n5 a b: c \n6 b c: d \n7 e b: c \n8 c a: d \n\n ");
Console.WriteLine("Actual output (the ordering doesn't matter):");
var mostFreuqentOnes = new SentencesParser().GetMostFrequentNextWords(input);
foreach (var (idx, lhs, rhs) in mostFreuqentOnes.Select((kv, idx) => (idx, kv.Key, kv.Value)))
 Console.WriteLine(idx+1 + " " + lhs + ": " + rhs);

Here you can find a working example

Thanks a lot. I appreciate your help. When you were solving my question did you think about my choice of Dictionary<string,Dictionry>> data type. Was my way to solve it good?
@Sheeba334 No, I was not thinking about your data type choice. If you wish I can update my post on Monday.
You don't have to write all code. If it's not difficult just think about my choice and others that can be better. And a short plan of how to do it with a new data type will be enough. Don't want to waste your time. I have time so I can wait. Appreciate your help
@Sheeba334 Iterating through a Dictionary is considered slow compared to iterating through a List or Array (you do this only once). On the other hand the ContainsKey has quite good performance (you do this several times). Since there is no built-in structure for multi-dimensional string keyed arrays in .NET that's why Dictionary<string, Dictionary<string, int>> seems to me a good choice for this problem.

Stack Exchange Network

Creating a Dictionary<string,string> to generate the most frequent continuations (without Linq)

The task:

1 Answer 1

`GetMostFrequentNextWords`

`GetIntermediateCollection`

`UpdateCollection`

`TransformIntermediateToFinal`

`Main`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Creating a Dictionary<string,string> to generate the most frequent continuations (without Linq)

The task:

1 Answer 1

GetMostFrequentNextWords

GetIntermediateCollection

UpdateCollection

TransformIntermediateToFinal

Main

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`GetMostFrequentNextWords`

`GetIntermediateCollection`

`UpdateCollection`

`TransformIntermediateToFinal`

`Main`