This program is based on as old retired Google interview question where you're supposed to create a program that will split a string into space-separated words, if the string is composed of valid words.
I've implemented two approaches, one that uses a normal array as the dictionary, making the implementation pretty simple, and one that uses a trie to store the dictionary, which will save memory but make the algorithm a bit more complicated. That being said, both implementations follow pretty much the same approach. Also, I'm not interested in a comparison of the two implementations i've written, I just wanted to try my hands on two different approaches.
Let's for fun's sake imagine that performance matters, so I'm looking for feedback on both performance, readability and whatever else you think should be different. Thanks!
using System;
using System.Collections.Generic;
using System.Linq;
class MainClass {
// The dictionary to check the words against. Normally this would be loaded from a file.
// If the dictionary is too big, loading it directly from the file into a trie would be better.
private static string[] dictArray = new string[] {
"all",
"base",
"ball",
"baseball",
"game",
"ballgame" // Not a real word, used for testing.
};
public static void Main () {
// Print the result of a couple of tests. If the test returns null, print "null" instead.
Console.WriteLine("Array dictionary:");
Console.WriteLine(SplitWordWithArray("baseballbase", dictArray) ?? "null"); // "baseball base"
Console.WriteLine(SplitWordWithArray("gamebaseball", dictArray) ?? "null"); // "game baseball"
Console.WriteLine(SplitWordWithArray("basketballgame", dictArray) ?? "null"); // null
Console.WriteLine(SplitWordWithArray("baseballgame", dictArray) ?? "null"); // This is a corner case. Will currently split into "baseball game", but since "base ballgame" could also be valid in this instance, I wouldn't consider the case handled.
// Same as above, but this time using the trie-based implementation.
Console.WriteLine("\nTrie dictionary:");
Node dictTrie = CreateDictTrie(dictArray);
Console.WriteLine(SplitWordWithTrie("baseballbase", dictTrie) ?? "null"); // "baseball base"
Console.WriteLine(SplitWordWithTrie("gamebaseball", dictTrie) ?? "null"); // "game baseball"
Console.WriteLine(SplitWordWithTrie("basketballgame", dictTrie) ?? "null"); // null
Console.WriteLine(SplitWordWithTrie("baseballgame", dictTrie) ?? "null"); // This is a corner case. Will currently split into "baseball game", but since "base ballgame" could also be valid in this instance, I wouldn't consider the case handled.
}
// Create the trie to use for spell checking, from an array of valid words.
private static Node CreateDictTrie(string[] dictArray) {
// Create the root node of the trie.
Node root = new Node();
// For every word in the dictionary, go through all the character in the word.
for (int i = 0; i < dictArray.Length; i++) {
string word = dictArray[i];
// Set the current node as the currently empty root node of the trie.
Node currentNode = root;
// Set the current node to the node with the key corresponding to the character in the word.
// If no node with this key exists, create a new one and add it under this key.
for (int j = 0; j < word.Length; j++) {
string character = word[j].ToString();
if (!currentNode.ContainsKey(character)) {
currentNode[character] = new Node();
}
currentNode = currentNode[character];
}
// Once the end of the word has been reached, set the last node as the end of the word.
currentNode.IsWord = true;
}
return root;
}
// Below you'll find two implementations of the word splitter. They both do the same thing,
// but one of them uses an array as the dictionary while the other one uses a trie.
// The methods will split an input string into multiple space-seperated words from a dictionary.
// If the input word is in the dictionary return it.
// If the input word doesn't contain any dictionary words, return null.
// So "applepieplate" will split into "apple pie plate".
// This implementation uses a trie as a dictionary.
private static string SplitWordWithTrie(string word, Node dictTrie) {
word = word.ToLower();
// If the complete input string is already a word, just return it without splitting it.
if (dictTrie.ContainsWord(word)) {
return word;
}
// Set the current node to the root of the dictionary trie.
Node currentNode = dictTrie;
string output = null;
for (int i = 0; i < word.Length; i++) {
string character = word[i].ToString();
// If the character isn't in the current node's keys, the word isn't in the dictionary.
if (!currentNode.ContainsKey(character)) {
break;
}
// Set the current node to the child node with the current character as the key.
currentNode = currentNode[character];
if (currentNode.IsWord) {
// Get the part of the string that has been determined to be a word.
string firstPart = word.Substring(0, i + 1);
// Then take the rest of the input string and run it through this method as the input string.
// This way the secondPart variable will either end up being null, meaning that the rest
// of the string isn't a valid word, or the space-seperated version of the input word.
string secondPart = SplitWordWithTrie(word.Substring(i + 1), dictTrie);
if (secondPart != null) {
// Both parts are valid and can be set as candidates for the final output.
// The reason the output is not just returned here, is because we would rather return compound
// words, if any exists. So instead of returning "base ball base", we return "baseball base".
output = firstPart + " " + secondPart;
}
}
}
return output;
}
// This implementation uses an array as a dictionary.
private static string SplitWordWithArray(string word, string[] dictList) {
word = word.ToLower();
// If the complete input string is already a word, just return it without splitting it.
if (dictList.Contains(word)) {
return word;
}
string output = null;
for (int i = 1; i < word.Length; i++) {
// Get the fist i characters of the string and check if they're a valid word.
string firstPart = word.Substring(0, i);
if (dictList.Contains(firstPart)) {
// Then take the rest of the input string and run it through this method as the input string.
// This way the secondPart variable will either end up being null, meaning that the rest
// of the string isn't a valid word, or the space-seperated version of the input word.
string secondPart = SplitWordWithArray(word.Substring(i), dictList);
if (secondPart != null) {
// Both parts are valid and can be set as candidates for the final output.
// The reason the output is not just returned here, is because we would rather return compound
// words, if any exists. So instead of returning "base ball base", we return "baseball base".
output = firstPart + " " + secondPart;
}
}
}
return output;
}
}
// Node class used for the trie structure.
public class Node : Dictionary<string, Node> {
// Use OrdinalIgnoreCase to ensure the character used as the key is case insensitive.
public Node() : base(StringComparer.OrdinalIgnoreCase) {}
// Whether the string terminating at this node is a valid word.
public bool IsWord { get; set; }
// Check whether a word is contained within the node's children.
public bool ContainsWord(string word) {
word = word.ToLower();
Node currentNode = this;
for (int i = 0; i < word.Length; i++) {
string character = word[i].ToString();
if (!currentNode.ContainsKey(character)) {
return false;
}
currentNode = currentNode[character];
}
return currentNode.IsWord;
}
}
2 Answers 2
A couple things I would consider:
In the case of an algorithm like this, you usually want to return a list of all candidates instead of just one of them. (You return the last candidate found which should be all the compound words.)
I would consider (at the cost of significantly complicating the algorithm) returning a
List<string>
from each, as this will allow the user to pick which they like the best. (Basic auto-correct guidelines.)Next, in the case of your
Trie
work, it looks like you could replace thestring
key with achar
key, which should have a performance advantage. (Thestring
type is always a reference type,char
is a value type, this can help coerce more performance out of thechar
type in a case like this.)Without rigorous testing I can't guarantee the accuracy of this next bit, but a concept to test would be to make
output
aStringBuilder
in each case, give it a length ofword.Length
, and then build strings like that instead of creating new strings constantly in your loops.Lastly, you could consider a case when you are provided with an invalid word in one portion of it (like your
basketballgame
test) and if the first part is not valid, but the second part is valid then return your output as well.
Otherwise this is good code, and it definitely solves the problem as efficiently as I can think of.
First of all, I think the code is well written and the algorithms are implemented pragmatically and simple.
Performace
SplitWordWithArray
As suggested by @Paparazzi, it would be more performant to use a HashSet instead of the list.
When using a HashSet with
StringComparer.OrdinalIgnoreCase
,word.ToLower()
becomes needless.
SplitWordWithTrie
- There is no need for
word.ToLower()
(neither in theNode
class nor in the methodSplitWordWithTrie
) because theNode
class usesStringComparer.OrdinalIgnoreCase
General Code Structure
Because the there are 2 approaches of the same functionality, IMHO it makes sense to implement an interface (or abstract base class) 2 times. That has the following advantages:
- Infrastructure code (e.g. Unit tests or performance tests) can be written against the abstraction (just iterate over all implementations)
- It is possible to add other approaches (or compare slightly different version) without changing infrastructure code
HashSet
could very easily cancel out looping through a 6 item array. \$\endgroup\$