Split string to space-separated words based on dictionary

Question 1

This program is based on as old retired Google interview question where you're supposed to create a program that will split a string into space-separated words, if the string is composed of valid words.

I've implemented two approaches, one that uses a normal array as the dictionary, making the implementation pretty simple, and one that uses a trie to store the dictionary, which will save memory but make the algorithm a bit more complicated. That being said, both implementations follow pretty much the same approach. Also, I'm not interested in a comparison of the two implementations i've written, I just wanted to try my hands on two different approaches.

Let's for fun's sake imagine that performance matters, so I'm looking for feedback on both performance, readability and whatever else you think should be different. Thanks!

using System;
using System.Collections.Generic;
using System.Linq;
class MainClass {
 // The dictionary to check the words against. Normally this would be loaded from a file.
 // If the dictionary is too big, loading it directly from the file into a trie would be better.
 private static string[] dictArray = new string[] {
 "all",
 "base",
 "ball",
 "baseball",
 "game",
 "ballgame" // Not a real word, used for testing.
 };
 public static void Main () {
 // Print the result of a couple of tests. If the test returns null, print "null" instead.
 Console.WriteLine("Array dictionary:");
 Console.WriteLine(SplitWordWithArray("baseballbase", dictArray) ?? "null"); // "baseball base"
 Console.WriteLine(SplitWordWithArray("gamebaseball", dictArray) ?? "null"); // "game baseball"
 Console.WriteLine(SplitWordWithArray("basketballgame", dictArray) ?? "null"); // null
 Console.WriteLine(SplitWordWithArray("baseballgame", dictArray) ?? "null"); // This is a corner case. Will currently split into "baseball game", but since "base ballgame" could also be valid in this instance, I wouldn't consider the case handled.
 // Same as above, but this time using the trie-based implementation.
 Console.WriteLine("\nTrie dictionary:");
 Node dictTrie = CreateDictTrie(dictArray);
 Console.WriteLine(SplitWordWithTrie("baseballbase", dictTrie) ?? "null"); // "baseball base"
 Console.WriteLine(SplitWordWithTrie("gamebaseball", dictTrie) ?? "null"); // "game baseball"
 Console.WriteLine(SplitWordWithTrie("basketballgame", dictTrie) ?? "null"); // null
 Console.WriteLine(SplitWordWithTrie("baseballgame", dictTrie) ?? "null"); // This is a corner case. Will currently split into "baseball game", but since "base ballgame" could also be valid in this instance, I wouldn't consider the case handled.
 }
 // Create the trie to use for spell checking, from an array of valid words.
 private static Node CreateDictTrie(string[] dictArray) {
 // Create the root node of the trie.
 Node root = new Node();
 // For every word in the dictionary, go through all the character in the word.
 for (int i = 0; i < dictArray.Length; i++) {
 string word = dictArray[i];
 // Set the current node as the currently empty root node of the trie.
 Node currentNode = root;
 // Set the current node to the node with the key corresponding to the character in the word.
 // If no node with this key exists, create a new one and add it under this key.
 for (int j = 0; j < word.Length; j++) {
 string character = word[j].ToString();
 if (!currentNode.ContainsKey(character)) {
 currentNode[character] = new Node();
 }
 currentNode = currentNode[character];
 }
 // Once the end of the word has been reached, set the last node as the end of the word.
 currentNode.IsWord = true;
 }
 return root;
 }
 // Below you'll find two implementations of the word splitter. They both do the same thing,
 // but one of them uses an array as the dictionary while the other one uses a trie.
 // The methods will split an input string into multiple space-seperated words from a dictionary.
 // If the input word is in the dictionary return it.
 // If the input word doesn't contain any dictionary words, return null.
 // So "applepieplate" will split into "apple pie plate".
 // This implementation uses a trie as a dictionary.
 private static string SplitWordWithTrie(string word, Node dictTrie) {
 word = word.ToLower();
 // If the complete input string is already a word, just return it without splitting it.
 if (dictTrie.ContainsWord(word)) {
 return word;
 }
 // Set the current node to the root of the dictionary trie.
 Node currentNode = dictTrie;
 string output = null;
 for (int i = 0; i < word.Length; i++) {
 string character = word[i].ToString();
 // If the character isn't in the current node's keys, the word isn't in the dictionary.
 if (!currentNode.ContainsKey(character)) {
 break;
 }
 // Set the current node to the child node with the current character as the key.
 currentNode = currentNode[character];
 if (currentNode.IsWord) {
 // Get the part of the string that has been determined to be a word.
 string firstPart = word.Substring(0, i + 1);
 // Then take the rest of the input string and run it through this method as the input string.
 // This way the secondPart variable will either end up being null, meaning that the rest
 // of the string isn't a valid word, or the space-seperated version of the input word.
 string secondPart = SplitWordWithTrie(word.Substring(i + 1), dictTrie);
 if (secondPart != null) {
 // Both parts are valid and can be set as candidates for the final output.
 // The reason the output is not just returned here, is because we would rather return compound 
 // words, if any exists. So instead of returning "base ball base", we return "baseball base".
 output = firstPart + " " + secondPart;
 }
 }
 }
 return output;
 }
 // This implementation uses an array as a dictionary.
 private static string SplitWordWithArray(string word, string[] dictList) {
 word = word.ToLower();
 // If the complete input string is already a word, just return it without splitting it.
 if (dictList.Contains(word)) {
 return word;
 }
 string output = null;
 for (int i = 1; i < word.Length; i++) {
 // Get the fist i characters of the string and check if they're a valid word.
 string firstPart = word.Substring(0, i);
 if (dictList.Contains(firstPart)) {
 // Then take the rest of the input string and run it through this method as the input string.
 // This way the secondPart variable will either end up being null, meaning that the rest
 // of the string isn't a valid word, or the space-seperated version of the input word.
 string secondPart = SplitWordWithArray(word.Substring(i), dictList);
 if (secondPart != null) {
 // Both parts are valid and can be set as candidates for the final output.
 // The reason the output is not just returned here, is because we would rather return compound 
 // words, if any exists. So instead of returning "base ball base", we return "baseball base".
 output = firstPart + " " + secondPart;
 }
 }
 }
 return output;
 }
}
// Node class used for the trie structure.
public class Node : Dictionary<string, Node> {
 // Use OrdinalIgnoreCase to ensure the character used as the key is case insensitive.
 public Node() : base(StringComparer.OrdinalIgnoreCase) {}
 // Whether the string terminating at this node is a valid word.
 public bool IsWord { get; set; }
 // Check whether a word is contained within the node's children.
 public bool ContainsWord(string word) {
 word = word.ToLower();
 Node currentNode = this;
 for (int i = 0; i < word.Length; i++) {
 string character = word[i].ToString();
 if (!currentNode.ContainsKey(character)) {
 return false;
 }
 currentNode = currentNode[character];
 }
 return currentNode.IsWord;
 }
}

Question 2

Use HashSet and StringBuilder. basenot returns null

Question 3

@Paparazzi Could you please elaborate a bit on what you've said? Why would I want to use HashSets and StringBuilders? Better memory performance? Execution time seems to be just a bit slower when using both of those, so I'm thinking that's not it. Also, what do you mean by "basenot returns null". The input string "basenot" is supposed to return null, since the word "not" is not in the test dictionary.

Question 4

Array contains is O(n)

Question 5

I see. I think repl.it messed up my tests then, as the HashSet version was a bit slower, which it shouldn't be, as HashSet.Contains is O(1), right?

Question 6

@tobloef - with such a small dictionary the overhead of the HashSet could very easily cancel out looping through a 6 item array.

Question 7

A couple things I would consider:

In the case of an algorithm like this, you usually want to return a list of all candidates instead of just one of them. (You return the last candidate found which should be all the compound words.)

I would consider (at the cost of significantly complicating the algorithm) returning a List<string> from each, as this will allow the user to pick which they like the best. (Basic auto-correct guidelines.)
Next, in the case of your Trie work, it looks like you could replace the string key with a char key, which should have a performance advantage. (The string type is always a reference type, char is a value type, this can help coerce more performance out of the char type in a case like this.)
Without rigorous testing I can't guarantee the accuracy of this next bit, but a concept to test would be to make output a StringBuilder in each case, give it a length of word.Length, and then build strings like that instead of creating new strings constantly in your loops.
Lastly, you could consider a case when you are provided with an invalid word in one portion of it (like your basketballgame test) and if the first part is not valid, but the second part is valid then return your output as well.

Otherwise this is good code, and it definitely solves the problem as efficiently as I can think of.

Question 8

First of all, I think the code is well written and the algorithms are implemented pragmatically and simple.

Performace

SplitWordWithArray

As suggested by @Paparazzi, it would be more performant to use a HashSet instead of the list.
When using a HashSet with StringComparer.OrdinalIgnoreCase, word.ToLower() becomes needless.

SplitWordWithTrie

There is no need for word.ToLower() (neither in the Node class nor in the method SplitWordWithTrie) because the Node class uses StringComparer.OrdinalIgnoreCase

General Code Structure

Because the there are 2 approaches of the same functionality, IMHO it makes sense to implement an interface (or abstract base class) 2 times. That has the following advantages:

Infrastructure code (e.g. Unit tests or performance tests) can be written against the abstraction (just iterate over all implementations)
It is possible to add other approaches (or compare slightly different version) without changing infrastructure code

Der Kommissar Der Kommissar 20.3k4 gold badges70 silver badges158 bronze badges · Accepted Answer · 2017-01-04 05:43:00Z

A couple things I would consider:

In the case of an algorithm like this, you usually want to return a list of all candidates instead of just one of them. (You return the last candidate found which should be all the compound words.)

I would consider (at the cost of significantly complicating the algorithm) returning a List<string> from each, as this will allow the user to pick which they like the best. (Basic auto-correct guidelines.)
Next, in the case of your Trie work, it looks like you could replace the string key with a char key, which should have a performance advantage. (The string type is always a reference type, char is a value type, this can help coerce more performance out of the char type in a case like this.)
Without rigorous testing I can't guarantee the accuracy of this next bit, but a concept to test would be to make output a StringBuilder in each case, give it a length of word.Length, and then build strings like that instead of creating new strings constantly in your loops.
Lastly, you could consider a case when you are provided with an invalid word in one portion of it (like your basketballgame test) and if the first part is not valid, but the second part is valid then return your output as well.

Otherwise this is good code, and it definitely solves the problem as efficiently as I can think of.

Stack Exchange Network

Split string to space-separated words based on dictionary

2 Answers 2

Performace

General Code Structure

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Split string to space-separated words based on dictionary

2 Answers 2

Performace

General Code Structure

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions