Main
is added to compare the expected result with the result of the function, the war field is GetMostFrequentNextWords
.
- What data types can be better instead of
Dictionary<string, Dictionary<string, int>>
? - Are there any problems with the usage of this data type?
- Is my code easy to read?
In Particular, the creation of Dictionaries for new keys takes a lot of code volume.
- Are
keyValuePairInternal
,helpDictionary
good names for variables? - Is the use of the variable
keyTwoWords
justified?
I am searching for all my problems
The task:
N-gram it's N adjacent words in one sentence. for example, from the text:
"She stood up. Then she left."
you can take the next 2-grams
- "she stood",
- "stood up",
- "then she"
- and "she left",
- but not the next case "up then".
and 3-grams
- "she stood up"
- and "then she left",
- but not the case "stood up then".
with a list of sentences (the list is consist of words, that are assembled in the list of sentences) create a vocabulary of the most frequent continuations of 2-grams and 3-grams. It's the vocabulary, in which all possible beginnings of 2-grams and 3-grams are Keys, and the Values are their most frequent continuations. If there are few continuations with equal frequency, you should store a string that is smaller (with help of String.CompareOrdinal
).
Example Text: "a b c d. b c d. e b c a d." You should get the next vocabulary:
1 "a": "b"
2 "b": "c"
3 "c": "d"
4 "e": "b"
5 "a b": "c"
6 "b c": "d"
7 "e b": "c"
8 "c a": "d"
From 2-grams "a b" and "a d" that met once, there is only one pair in Dictionary "a": "b", because it's the smaller one. From the next two 2-grams "c d" и "c a" There is only one most frequent pair "c": "d". And from 3-grams "b c d" and "b c a" Next pair is in vocabulary as the most frequent one "b c": "d".
using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApp11
{
public static class SentencesParserTask
{
public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
{
var result = new Dictionary<string, string>();
var helpDictionary = new Dictionary<string, Dictionary<string, int>>();
for (int i = 0; i < text.Count; i++)
{
for (int j = 0; j < text[i].Count; j++)
{
if (text[i].Count - j >= 3)
{
string keyTwoWords = text[i][j] + " " + text[i][j + 1];
if (!helpDictionary.ContainsKey(keyTwoWords))
helpDictionary[keyTwoWords] = new Dictionary<string, int>();
if (!helpDictionary[keyTwoWords].ContainsKey(text[i][j + 2]))
helpDictionary[keyTwoWords][text[i][j + 2]] = 0;
helpDictionary[keyTwoWords][text[i][j + 2]]++;
}
if (text[i].Count - j >= 2)
{
if (!helpDictionary.ContainsKey(text[i][j]))
helpDictionary[text[i][j]] = new Dictionary<string, int>();
if (!helpDictionary[text[i][j]].ContainsKey(text[i][j + 1]))
helpDictionary[text[i][j]][text[i][j + 1]] = 0;
helpDictionary[text[i][j]][text[i][j + 1]]++;
}
}
}
foreach (KeyValuePair<string, Dictionary<string, int>> keyValuePair in helpDictionary)
{
if (!result.ContainsKey(keyValuePair.Key))
result[keyValuePair.Key] = "";
int maxFrequencyCount = 0;
string maxFrequencyString = "";
foreach (KeyValuePair<string, int> keyValuePairInternal in keyValuePair.Value)
{
if ((keyValuePairInternal.Value > maxFrequencyCount) || ((keyValuePairInternal.Value == maxFrequencyCount) &&
(string.CompareOrdinal(maxFrequencyString, keyValuePairInternal.Key) > 0)))
{
result[keyValuePair.Key] = keyValuePairInternal.Key;
maxFrequencyCount = keyValuePairInternal.Value;
maxFrequencyString = keyValuePairInternal.Key;
}
}
}
return result;
}
static void Main(string[] args)
{
//a b c d. b c d. e b c a d.
var list = new List<List<string>>();
list.Add(new List<string>());
list[0].Add("a");
list[0].Add("b");
list[0].Add("c");
list[0].Add("d");
list.Add(new List<string>());
list[1].Add("b");
list[1].Add("c");
list[1].Add("d");
list.Add(new List<string>());
list[2].Add("e");
list[2].Add("b");
list[2].Add("c");
list[2].Add("a");
list[2].Add("d");
Console.WriteLine("The origin text : \n" );
foreach (var sentence in list)
{
foreach(var word in sentence)
{
Console.Write(" " + word);
}
Console.Write(".");
}
Console.Write("\n\n ");
Console.WriteLine("Expected output: \n1 a: b \n2 b: c \n3 c: d \n4 e: b \n5 a b: c \n6 b c: d \n7 e b: c \n8 c a: d \n\n ");
var dictionary = new Dictionary<string, string>(GetMostFrequentNextWords(list));
int index = 1;
Console.WriteLine("Your output (THE ORDER ISNT IMPORTANT) :");
foreach (var keyValuePair in dictionary)
{
Console.WriteLine(index++ + " " + keyValuePair.Key + ": " + keyValuePair.Value);
}
Console.ReadLine();
}
}
}
1 Answer 1
Lets do some refactoring together
GetMostFrequentNextWords
- If you look at this function then you can easily distinguish two parts
- Populate the
helpDictionary
- Convert the
helpDictionary
to the desired format
- Populate the
- So, lets extract these codes into their own methods
- With these in our hand the only responsibility of the
GetMostFrequentNextWords
method is that it needs to call the methods in the right order
public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
{
var intermediateCollection = GetIntermediateCollection(text);
return TransformIntermediateToFinal(intermediateCollection);
}
or in short
public static Dictionary<string, string> GetMostFrequentNextWords(List<List<string>> text)
=> TransformIntermediateToFinal(GetIntermediateCollection(text));
GetIntermediateCollection
- If you look at the body of the nested for loop then you can spot that both
if
branches do the same thing but on different indexes/indices - So, the common parts could (and should) be extracted and only the unique part should be provided as input
private static Dictionary<string, Dictionary<string, int>> GetIntermediateCollection(List<List<string>> text)
{
var intermediateCollection = new Dictionary<string, Dictionary<string, int>>();
foreach (var sentence in text)
{
for (int word = 0; word < sentence.Count; word++)
{
if (sentence.Count - word >= 3)
UpdateCollection(intermediateCollection, sentence[word] + " " + sentence[word + 1], sentence[word + 2]);
if (sentence.Count - word >= 2)
UpdateCollection(intermediateCollection, sentence[word], sentence[word + 1]);
}
}
return intermediateCollection;
}
- I've tired to use more meaningful iterator names than
i
andj
UpdateCollection
- This is the common part of the two if branches
- I've used
innerKey
andouterKey
as parameter names to indicate their intended usage
private static void UpdateCollection(Dictionary<string, Dictionary<string, int>> twoLevelCollection, string outerKey, string innerKey)
{
if (!twoLevelCollection.ContainsKey(outerKey))
twoLevelCollection[outerKey] = new Dictionary<string, int>();
if (!twoLevelCollection[outerKey].ContainsKey(innerKey))
twoLevelCollection[outerKey][innerKey] = 0;
twoLevelCollection[outerKey][innerKey]++;
}
TransformIntermediateToFinal
- Here I took advantage of C#'s deconstruction feature in the
foreach
loops
private static Dictionary<string, string> TransformIntermediateToFinal(Dictionary<string, Dictionary<string, int>> twoLevelCollection)
{
var result = new Dictionary<string, string>();
foreach (var (grams, innerGrams) in twoLevelCollection)
{
if (!result.ContainsKey(grams))
result[grams] = "";
int maxFrequencyCount = 0;
string maxFrequencyString = "";
foreach (var (word, frequency) in innerGrams)
{
if (frequency > maxFrequencyCount
|| (frequency == maxFrequencyCount
&& string.CompareOrdinal(maxFrequencyString, word) > 0))
(result[grams], maxFrequencyString, maxFrequencyCount) = (word, word, frequency);
}
}
return result;
}
- Just like you in your original code I also had problem to find good naming for the iterator variables
- Maybe it make sense to spend some time on it to find better names than these
Main
- And last but not least the caller side
var input = new List<List<string>>
{
new List<string>() { "a", "b", "c", "d"},
new List<string>() { "b", "c", "d"},
new List<string>() { "e", "b", "c", "a", "d"},
};
Console.WriteLine("The origin text : \n");
foreach (var sentence in input)
{
Console.Write(string.Join(" ", sentence));
Console.Write(".");
}
Console.Write("\n\n ");
Console.WriteLine("Expected output: \n1 a: b \n2 b: c \n3 c: d \n4 e: b \n5 a b: c \n6 b c: d \n7 e b: c \n8 c a: d \n\n ");
Console.WriteLine("Actual output (the ordering doesn't matter):");
var mostFreuqentOnes = new SentencesParser().GetMostFrequentNextWords(input);
foreach (var (idx, lhs, rhs) in mostFreuqentOnes.Select((kv, idx) => (idx, kv.Key, kv.Value)))
Console.WriteLine(idx+1 + " " + lhs + ": " + rhs);
-
\$\begingroup\$ Thanks a lot. I appreciate your help. When you were solving my question did you think about my choice of Dictionary<string,Dictionry>> data type. Was my way to solve it good? \$\endgroup\$Sheeba334– Sheeba3342022年11月26日 14:19:27 +00:00Commented Nov 26, 2022 at 14:19
-
\$\begingroup\$ @Sheeba334 No, I was not thinking about your data type choice. If you wish I can update my post on Monday. \$\endgroup\$Peter Csala– Peter Csala2022年11月27日 06:27:17 +00:00Commented Nov 27, 2022 at 6:27
-
1\$\begingroup\$ You don't have to write all code. If it's not difficult just think about my choice and others that can be better. And a short plan of how to do it with a new data type will be enough. Don't want to waste your time. I have time so I can wait. Appreciate your help \$\endgroup\$Sheeba334– Sheeba3342022年11月27日 06:44:22 +00:00Commented Nov 27, 2022 at 6:44
-
1\$\begingroup\$ @Sheeba334 Iterating through a Dictionary is considered slow compared to iterating through a List or Array (you do this only once). On the other hand the
ContainsKey
has quite good performance (you do this several times). Since there is no built-in structure for multi-dimensional string keyed arrays in .NET that's whyDictionary<string, Dictionary<string, int>>
seems to me a good choice for this problem. \$\endgroup\$Peter Csala– Peter Csala2022年11月28日 08:07:28 +00:00Commented Nov 28, 2022 at 8:07
GetMostFrequentNextWords
into small, manageable functions? \$\endgroup\$foreach
or should I somehow simplify them? If the second i cant imagine anything now with my skills \$\endgroup\$