6
\$\begingroup\$

Main is added to compare the expected result with the result of the function, the war field is GetMostFrequentNextWords.

  • What data types can be better instead of Dictionary<string, Dictionary<string, int>>?
  • Are there any problems with the usage of this data type?
  • Is my code easy to read?

In Particular, the creation of Dictionaries for new keys takes a lot of code volume.

  • Are keyValuePairInternal, helpDictionary good names for variables?
  • Is the use of the variable keyTwoWords justified?

I am searching for all my problems

The task:

N-gram it's N adjacent words in one sentence. for example, from the text:

"She stood up. Then she left."

you can take the next 2-grams

  • "she stood",
  • "stood up",
  • "then she"
  • and "she left",
  • but not the next case "up then".

and 3-grams

  • "she stood up"
  • and "then she left",
  • but not the case "stood up then".

with a list of sentences (the list is consist of words, that are assembled in the list of sentences) create a vocabulary of the most frequent continuations of 2-grams and 3-grams. It's the vocabulary, in which all possible beginnings of 2-grams and 3-grams are Keys, and the Values are their most frequent continuations. If there are few continuations with equal frequency, you should store a string that is smaller (with help of String.CompareOrdinal).

Example Text: "a b c d. b c d. e b c a d." You should get the next vocabulary:

1 "a": "b"

2 "b": "c"

3 "c": "d"

4 "e": "b"

5 "a b": "c"

6 "b c": "d"

7 "e b": "c"

8 "c a": "d"

From 2-grams "a b" and "a d" that met once, there is only one pair in Dictionary "a": "b", because it's the smaller one. From the next two 2-grams "c d" и "c a" There is only one most frequent pair "c": "d". And from 3-grams "b c d" and "b c a" Next pair is in vocabulary as the most frequent one "b c": "d".

using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApp11
{
 public static class SentencesParserTask
 {
 public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
 {
 var result = new Dictionary<string, string>();
 var helpDictionary = new Dictionary<string, Dictionary<string, int>>();
 for (int i = 0; i < text.Count; i++)
 {
 for (int j = 0; j < text[i].Count; j++)
 {
 if (text[i].Count - j >= 3)
 {
 string keyTwoWords = text[i][j] + " " + text[i][j + 1];
 if (!helpDictionary.ContainsKey(keyTwoWords))
 helpDictionary[keyTwoWords] = new Dictionary<string, int>();
 if (!helpDictionary[keyTwoWords].ContainsKey(text[i][j + 2]))
 helpDictionary[keyTwoWords][text[i][j + 2]] = 0;
 helpDictionary[keyTwoWords][text[i][j + 2]]++;
 }
 if (text[i].Count - j >= 2)
 {
 if (!helpDictionary.ContainsKey(text[i][j]))
 helpDictionary[text[i][j]] = new Dictionary<string, int>();
 if (!helpDictionary[text[i][j]].ContainsKey(text[i][j + 1]))
 helpDictionary[text[i][j]][text[i][j + 1]] = 0;
 helpDictionary[text[i][j]][text[i][j + 1]]++;
 }
 }
 }
 foreach (KeyValuePair<string, Dictionary<string, int>> keyValuePair in helpDictionary)
 {
 if (!result.ContainsKey(keyValuePair.Key))
 result[keyValuePair.Key] = "";
 int maxFrequencyCount = 0;
 string maxFrequencyString = "";
 foreach (KeyValuePair<string, int> keyValuePairInternal in keyValuePair.Value)
 {
 if ((keyValuePairInternal.Value > maxFrequencyCount) || ((keyValuePairInternal.Value == maxFrequencyCount) &&
 (string.CompareOrdinal(maxFrequencyString, keyValuePairInternal.Key) > 0)))
 {
 result[keyValuePair.Key] = keyValuePairInternal.Key;
 maxFrequencyCount = keyValuePairInternal.Value;
 maxFrequencyString = keyValuePairInternal.Key;
 }
 }
 }
 return result;
 }
 
 static void Main(string[] args)
 {
 //a b c d. b c d. e b c a d.
 var list = new List<List<string>>();
 list.Add(new List<string>());
 list[0].Add("a");
 list[0].Add("b");
 list[0].Add("c");
 list[0].Add("d");
 list.Add(new List<string>());
 list[1].Add("b");
 list[1].Add("c");
 list[1].Add("d");
 list.Add(new List<string>());
 list[2].Add("e");
 list[2].Add("b");
 list[2].Add("c");
 list[2].Add("a");
 list[2].Add("d");
 Console.WriteLine("The origin text : \n" );
 foreach (var sentence in list)
 {
 foreach(var word in sentence)
 {
 Console.Write(" " + word);
 }
 Console.Write(".");
 }
 Console.Write("\n\n ");
 Console.WriteLine("Expected output: \n1 a: b \n2 b: c \n3 c: d \n4 e: b \n5 a b: c \n6 b c: d \n7 e b: c \n8 c a: d \n\n ");
 var dictionary = new Dictionary<string, string>(GetMostFrequentNextWords(list));
 int index = 1;
 Console.WriteLine("Your output (THE ORDER ISNT IMPORTANT) :");
 foreach (var keyValuePair in dictionary)
 {
 Console.WriteLine(index++ + " " + keyValuePair.Key + ": " + keyValuePair.Value);
 }
 Console.ReadLine();
 }
 }
}
Peter Csala
10.7k1 gold badge16 silver badges36 bronze badges
asked Nov 23, 2022 at 12:01
\$\endgroup\$
2
  • \$\begingroup\$ Have you tried to split your GetMostFrequentNextWords into small, manageable functions? \$\endgroup\$ Commented Nov 24, 2022 at 15:21
  • \$\begingroup\$ @PeterCsala I lack such experience. Should I create methods for every foreach or should I somehow simplify them? If the second i cant imagine anything now with my skills \$\endgroup\$ Commented Nov 25, 2022 at 4:24

1 Answer 1

3
\$\begingroup\$

Lets do some refactoring together

GetMostFrequentNextWords

  • If you look at this function then you can easily distinguish two parts
    • Populate the helpDictionary
    • Convert the helpDictionary to the desired format
  • So, lets extract these codes into their own methods
  • With these in our hand the only responsibility of the GetMostFrequentNextWords method is that it needs to call the methods in the right order
public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
{
 var intermediateCollection = GetIntermediateCollection(text);
 return TransformIntermediateToFinal(intermediateCollection);
}

or in short

public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
 => TransformIntermediateToFinal(GetIntermediateCollection(text));

GetIntermediateCollection

  • If you look at the body of the nested for loop then you can spot that both if branches do the same thing but on different indexes/indices
  • So, the common parts could (and should) be extracted and only the unique part should be provided as input
private static Dictionary<string, Dictionary<string, int>> GetIntermediateCollection(List<List<string>> text)
{
 var intermediateCollection = new Dictionary<string, Dictionary<string, int>>();
 foreach (var sentence in text)
 {
 for (int word = 0; word < sentence.Count; word++)
 {
 if (sentence.Count - word >= 3)
 UpdateCollection(intermediateCollection, sentence[word] + " " + sentence[word + 1], sentence[word + 2]);
 if (sentence.Count - word >= 2)
 UpdateCollection(intermediateCollection, sentence[word], sentence[word + 1]);
 }
 }
 return intermediateCollection;
}
  • I've tired to use more meaningful iterator names than i and j

UpdateCollection

  • This is the common part of the two if branches
  • I've used innerKey and outerKey as parameter names to indicate their intended usage
private static void UpdateCollection(Dictionary<string, Dictionary<string, int>> twoLevelCollection, string outerKey, string innerKey)
{
 if (!twoLevelCollection.ContainsKey(outerKey))
 twoLevelCollection[outerKey] = new Dictionary<string, int>();
 if (!twoLevelCollection[outerKey].ContainsKey(innerKey))
 twoLevelCollection[outerKey][innerKey] = 0;
 twoLevelCollection[outerKey][innerKey]++;
}

TransformIntermediateToFinal

private static Dictionary<string, string> TransformIntermediateToFinal(Dictionary<string, Dictionary<string, int>> twoLevelCollection)
{
 var result = new Dictionary<string, string>();
 foreach (var (grams, innerGrams) in twoLevelCollection)
 {
 if (!result.ContainsKey(grams))
 result[grams] = "";
 int maxFrequencyCount = 0;
 string maxFrequencyString = "";
 foreach (var (word, frequency) in innerGrams)
 {
 if (frequency > maxFrequencyCount
 || (frequency == maxFrequencyCount
 && string.CompareOrdinal(maxFrequencyString, word) > 0))
 (result[grams], maxFrequencyString, maxFrequencyCount) = (word, word, frequency);
 }
 }
 return result;
}
  • Just like you in your original code I also had problem to find good naming for the iterator variables
    • Maybe it make sense to spend some time on it to find better names than these

Main

  • And last but not least the caller side
var input = new List<List<string>>
{
 new List<string>() { "a", "b", "c", "d"},
 new List<string>() { "b", "c", "d"},
 new List<string>() { "e", "b", "c", "a", "d"},
};
Console.WriteLine("The origin text : \n");
foreach (var sentence in input)
{
 Console.Write(string.Join(" ", sentence));
 Console.Write(".");
}
Console.Write("\n\n ");
Console.WriteLine("Expected output: \n1 a: b \n2 b: c \n3 c: d \n4 e: b \n5 a b: c \n6 b c: d \n7 e b: c \n8 c a: d \n\n ");
Console.WriteLine("Actual output (the ordering doesn't matter):");
var mostFreuqentOnes = new SentencesParser().GetMostFrequentNextWords(input);
foreach (var (idx, lhs, rhs) in mostFreuqentOnes.Select((kv, idx) => (idx, kv.Key, kv.Value)))
 Console.WriteLine(idx+1 + " " + lhs + ": " + rhs);

Here you can find a working example

answered Nov 25, 2022 at 12:39
\$\endgroup\$
4
  • \$\begingroup\$ Thanks a lot. I appreciate your help. When you were solving my question did you think about my choice of Dictionary<string,Dictionry>> data type. Was my way to solve it good? \$\endgroup\$ Commented Nov 26, 2022 at 14:19
  • \$\begingroup\$ @Sheeba334 No, I was not thinking about your data type choice. If you wish I can update my post on Monday. \$\endgroup\$ Commented Nov 27, 2022 at 6:27
  • 1
    \$\begingroup\$ You don't have to write all code. If it's not difficult just think about my choice and others that can be better. And a short plan of how to do it with a new data type will be enough. Don't want to waste your time. I have time so I can wait. Appreciate your help \$\endgroup\$ Commented Nov 27, 2022 at 6:44
  • 1
    \$\begingroup\$ @Sheeba334 Iterating through a Dictionary is considered slow compared to iterating through a List or Array (you do this only once). On the other hand the ContainsKey has quite good performance (you do this several times). Since there is no built-in structure for multi-dimensional string keyed arrays in .NET that's why Dictionary<string, Dictionary<string, int>> seems to me a good choice for this problem. \$\endgroup\$ Commented Nov 28, 2022 at 8:07

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.