1

Backstory (You can skip)

I am writing a pronunciation library for irregular words. Take something like the following:

T1E1s // tee one E one es | tee one E ones
1994-1995// 1994 (minus|dash|to|) 1995 
19-mm // 19 dash mmmmmmmmmmmmmm | 19 dash millimeter | 19 dash em em
4x4 // 4 ex 4 | 4 times 4 | 4 by 4

What you see for every word, are the different possible interpretations. Tackling the issue is pretty taxing, but is honestly pretty straight forward. Basically, I parse the word into various types denoted by this enumerator, as such:

 enum StringType
 { // Heirarchical
 Acronymn , // "mm","мм"
 Measurement , // pound
 Number , // 0-9
 Character_Times , // x ×ばつ X
 Character_Dash , // -
 Character_ForwardSlash , // /
 Character_Latin_Lower , // a
 Character_Latin_Upper , // A
 Character_Latin_Plural , // s
 Consanants_Latin_Proper , // Fff
 Consanants_Latin_Lower , // fff
 Consanants_Latin_Mixed , // fFF
 Consanants_Latin_Upper , // FFF
 Word_Latin_Proper , // Foo
 Word_Latin_Lower , // foo
 Word_Latin_Mixed , // fOO
 Word_Latin_Upper , // FOO
 Consanants_Cyrillic_Proper, // Fff
 Consanants_Cyrillic_Lower , // fff
 Consanants_Cyrillic_Mixed , // fFF
 Consanants_Cyrillic_Upper , // FFF
 Word_Cyrillic_Proper , // Фоо
 Word_Cyrillic_Lower , // фоо
 Word_Cyrillic_Mixed , // фОО
 Word_Cyrillic_Upper , // ФОО
 };

thus, a word like 19-mm is parsed like so:

Halt: void mapper(QString) /home/akiva/Programming/Blanket/main.cpp:206
[QStringList] sp.parsedStrings()
QStringList:: QStringList
0 : 19
1 : -
2 : mm
QList<QCD::StringType>:: QList<QCD::StringType>
0 : Number
1 : Character_Dash
2 : Acronymn
QString:: "19-mm"

The taxing part is where I have to tackle each case with its own implementation, and by the end of this, I imagine I will have something like 500 different combinations I will need to program functions for. This is where things get messy, because I do not want this:

 if (string.types() == QList<StringType>() << StringType::Number << StringType::Dash << StringType::Acronymn) { Number_Dash_Acronymn(); }
else if (string.types() == QList<StringType>() << StringType::Number << StringType::Dash << StringType::Number) { Number_Dash_Number(); }
else if (string.types() == QList<StringType>() << StringType::Number << StringType::Times << StringType::Number) { Number_Times_Number(); }
// + 500 additional if else statements

500 if else statements is not acceptable. An enumerator with 500 different values is also disgusting for all the reasons that you can imagine. I had floated the idea of using bitflags, but that ended up being far too limited (32 bits = only 32 parameters). Thus I think I have the best possible approach detailed below:

Actual Issue:

I want something like this:

switch (stringTypes) 
case Number + Dash + Word_Latin_Lower: {
 /* code */
 break;
} case Word_Latin_Lower + Dash + Number: { 
 /* code */
 break;
} default:
 ct_Error("Failed to account for the combination: ", stringTypes);
 break;
}

The obvious issue being that the first two cases have the same value, despite being in a different order. Regardless, if the code was functional, it would be foldable, readable, efficient, and easy to sort. Not to mention, I won't have to touch my header file to add new enumerators or functions, thus drastically helping my compile times.

Thus, how should I give combined enumerators guaranteed unique values, so much so that the order it is given also guarantee a unique value, while still maintaining readability?

Thanks.

asked May 13, 2019 at 1:53
14
  • 3
    7 components with 20 states: you could make a bitfield with 7 groups of 5 bits. 4 if the component only has 16 states. If you had 4 components with 20 states and 3 with 16 states it would be exactly 32 bits. But also you'd have 655,360,000 combinations which is why I don't think you will be handling them all individually. Commented May 13, 2019 at 3:35
  • 1
    How about hashing the string concatenation using a 64-bit hash with good distribution (e.g. FNV-1a)? That should make any combination unique with a very, very high probability (and you can still check by stuffing them into a dictionary). Commented May 13, 2019 at 5:19
  • 1
    @Akiva I elaborated on the thought in an answer. Commented May 13, 2019 at 21:37
  • 1
    Have you considered using separate bit fields (perhaps in a struct) instead of a single value? Commented May 14, 2019 at 2:13
  • 2
    You will have to compute the size based on your actual needs, but I doubt it. If you have n bits of information, it requires n bits of storage; no getting around that. Personally I find switch awkward. A series of if blocks (with early return of course) is a much clearer way to express flow of control, and it saves you a level of indent. Commented May 14, 2019 at 7:55

4 Answers 4

1

It is not that straightforward but doable nonetheless. First you can create a new enum for your 500 or so string type sequences. To make this readable, eliminate all underscores from your current StringType enum members and use these underscores to separate the parts in your new StringTypeSequence enum. So you could get something like this:

StringTypeSequence
{
 Number_Dash_WordLatinLower,
 WordLatinLower_Dash_Number,
 ...
 et cetera, all the way up to your 500 or so
}

Then you write a parser that picks up the member names from your code file and turns them into hash values. You use these hash values to update your enum:

StringTypeSequence
{
 Number_Dash_WordLatinLower = 0x7C34012B,
 WordLatinLower_Dash_Number = 0x058AAE67,
 ...
 et cetera, all the way up to your 500 or so
}

I just made up some values, the real ones will be different. Anyway...

You can now create your switch statement, casing on the StringTypeSequence members.

You then parse your text and you find a "Number + Dash + WordLatinLower" sequence. You concatenate the enum member names into the string "Number_Dash_WordLatinLower", calculate the hash for it and switch on it. You will end up at the case for enum member Number_Dash_WordLatinLower. If you end up in the default clause you found a new sequence you had not anticipated yet and you throw an exception, presenting the hash for the enum value and case to be added.

I am not sure if your version of C++ allows you to obtain the member name string of an enum value, you may need a CLR type language with reflection capability for that. If you cannot do that, just add an array of strings to your StringType enum that matches the enum members so you can use the enum values as indexes into the array to get to the names.

answered May 13, 2019 at 21:34
7
  • I am not sure if your version of C++ allows you to obtain the member name string of an enum value, I use Qt, so yes. QMetaEnum and Q_ENUM. First you can create a new enum for your 500 or so string type sequences. If I have 500 cases to account for, and I manually make 500 enumerators for these, which is a bad idea because my compile times will shoot through the roof every time I have to add a new instance (given they can't be forward declared), why would I then need to go through the trouble of giving each a unique value? They are all already enumerated. Commented May 13, 2019 at 23:07
  • A big enum is not likely to result in a noticeible compile time increase. You need the fixed values because you need to map string type sequences you find, that are unknown at compile time, to enum members at runtime. Commented May 13, 2019 at 23:17
  • 1
    @MartinMaat a huge number of values in enum won't increase compile times but doing it step by step: add several values, implement them, add several more, repeat is a royal PITA. Also if you use better enums (a library) and your hashing function is constexpr it would be trivial to generate two maps (one each way) at compile time in several lines of code. Although better enums require a wrapper in Qt since they (by design) don't have a default constructor. Commented May 13, 2019 at 23:52
  • Also your answer would be improved by adding some information on combining hashes. Commented May 13, 2019 at 23:53
  • @Jan Dorniak It do not think it would be a real PITA, more like a gentle glowy feeling every time you find a new sequence. Besides, this is not different from any solution, it is independent of the science behind it (the possible combinations you choose to handle are a given, what you do with unanticipated sequences is a different problem, how often you want to update is a free choice). Uniqueness of hashes can be easily guaranteed by having the generator stuff them in a dictionary that will bark on the first collision, in which extremely unlikely event you would change a member name slightly. Commented May 14, 2019 at 5:40
1

Summary

  1. Use a class to store the fields of StringType
  2. Refactor your 500 methods into command classes and use the chain of responsibility pattern.

Eliminate those 500 ad-hoc checks

If you have 500 functions that could be called, I'd say a bigger problem is the massive amount of code to check for each condition and invoke each function. Even with a switch statement, it's likely to be a bear to read and maintain.

Instead, I'd like to suggest you use the chain of responsibility design pattern so that the list of functions is abstract and you can iterate over it. The next sections show you how.

Define a class to contain the StringType information

First, define a class that can contain all the information you currently require in StringType. Yes, a class, not an integer (which does not have enough storage capacity to handle the possible combinations). For example:

class StringType
{
public:
 StringType() : Acronymn(), Dash(), Number() {}
 bool Acronymn;
 bool Dash;
 bool Number;
};

To use this class, you'd set individual fields instead of setting bits in an enum.

If you are concerned this will take up too much space, you can use bit fields to compact the data structure. I doubt this is really necessary though.

Define an abstraction for checking and handling different string types

Second, write a base class that exposes functions to (1) check if the class can handle a particular StringType, and (2) to handle it.

class StringHandler
{
public:
 virtual bool CanHandle(StringType* stringType) = 0;
 virtual void Execute() = 0;
};

Implement the abstraction

You then implement this class for each possible method call that results from a combination of flags. Here are two examples:

class NumberAcronymnHandler : public StringHandler
{
public:
 bool CanHandle(StringType* stringType) override
 {
 return (stringType->Number && stringType->Acronymn);
 };
 void Execute() override
 {
 Number_Acronymn(); //Here's where you do the work
 }
};
class NumberDashAndAcronymnHandler : public StringHandler
{
public:
 bool CanHandle(StringType* stringType) override
 {
 return (stringType->Number && stringType->Acronymn && stringType->Dash);
 };
 void Execute() override
 {
 Number_Dash_Acronymn(); //Here's where you do the work
 }
};

Eliminate the switch statement by iterating over the command classes

To figure out which class can handle a StringType, iterate over an array of these classes and check each one. This example supports 2 types, but if there were 500, the code would be more or less the same; it's just a loop over an array.

void Example(StringType* stringType)
{
 StringHandler* map[2] = 
 {
 new NumberAcronymnHandler(), 
 new NumberDashAndAcronymnHandler() 
 };
 for (int i=0; i<2; i++)
 {
 StringHandler* handler = map[i];
 if (handler->CanHandle(stringType))
 {
 handler->Execute();
 }
 }
};

This solves the issue with your enumerated type, and also adds maintainability and structure to your 500+ methods.

answered May 15, 2019 at 3:01
0

You are right in thinking that 500 if-else conditions is unreadable. Same thing for 500 switch cases.

The first thing to consider is whether having 500 different functions actually makes sense. Are those functions all very different or will they contain a lot of duplication? If so, how can you factor out this duplication? You should aim to find a canonical representation of these functions, i.e. only write the data/code that differentiates it from all other functions.

To structure actually picking the right function to call, you simply build a lookup table (hash table, if possible) linking the sequence to a callback function.

answered May 13, 2019 at 19:15
1
  • The first thing to consider is whether having 500 different functions actually makes sense. Considered, and yes. They will all be very different, and trying to deduplicate them will require more assumptions upon my part, and has led to many embarrassing errors. world war eye, 1993 minus 94, 37 dash millimeter. Its language, and its so fickle, that even a human reader is often tripped up by how people write things. Commented May 13, 2019 at 22:56
0

Inspired by Martin Maat's answer but using number hashing instead of string hashing.

Disclaimer: There is one downside to all of this solutions: they use hashes. And as is usual with hashes: while improbable collisions are a possibility. So it might end up that you will and up with a collision where two combinations resolve to the same hash and you have to treat it as a special case. This is a nasty bug waiting to happen so document this behaviour well if you do end up using this.

Solution 1:

// you need qHash() overload for QCD::StringType
uint qHash(QCD::StringType st)
{
 return qHash(static_cast<std::underlying_type<QCD::StringType>::type>(st));
}
// use like this
switch(qHash(stringTypes): {
case qHash(QList<StringType>{ Number, Dash, Word_Latin_Lower }): {
 /* code */
 break;
}
case qHash(QList<StringType>{ Word_Latin_Lower, Dash, Number }): {
 /* code */
 break;
}
default:
 ct_Error("Failed to account for the combination: ", stringTypes);
 break;
}

Solution 2:

This one has a bit more syntactic sugar

// you need qHash() overload for QCD::StringType
uint qHash(QCD::StringType st)
{
 return qHash(static_cast<std::underlying_type<QCD::StringType>::type>(st));
}
// this is pureliy syntactic sugar
uint StringTypeListHash(const QList<QCD::StringType>& list)
{
 return qHash(list);
}
// for this usage
switch(StringTypeListHash(stringTypes): {
case StringTypeListHash({ Number, Dash, Word_Latin_Lower }): {
 /* code */
 break;
}
case StringTypeListHash({ Word_Latin_Lower, Dash, Number }): {
 /* code */
 break;
}
default:
 ct_Error("Failed to account for the combination: ", stringTypes);
 break;
}

Solution 3:

Operator overloading. Very pretty code, should work. I chose the xor operator because it is most commonly associated with combining hashes.

I'm not 100% sure this won't have unintended sideffects if QCD::StringType is an enum instead of enum class. As usual - take caution with operator overloading and don't blindly use code someone on the internet whipped up without even trying to compile it.

// you still need qHash() overload for QCD::StringType
uint qHash(QCD::StringType st)
{
 return qHash(static_cast<std::underlying_type<QCD::StringType>::type>(st));
}
// here comes the magic
namespace QCD {
 // this is basically the implementation of boost::hash_combine
 // taken from: https://stackoverflow.com/a/27952689/6178613
 uint hash_combine( uint lhs, uint rhs ) {
 lhs^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
 return lhs;
 }
 size_t operator^(StringType lhs, StringType rhs) {
 using enum_int_type = std::underlying_type<StringType>::type
 std::size_t hash = hash_combine(qHash(static_cast<enum_int_type>(lhs)), qHash(static_cast<enum_int_type>(rhs)));
 }
 size_t operator^(StringType lhs, size_t rhs) {
 using enum_int_type = std::underlying_type<StringType>::type
 std::size_t hash = hash_combine(qHash(static_cast<enum_int_type>(lhs)), rhs);
 }
 size_t operator^(size_t lhs, StringType rhs) {
 using enum_int_type = std::underlying_type<StringType>::type
 std::size_t hash = hash_combine(lhs, qHash(static_cast<enum_int_type>(rhs)));
 }
}
// usage
switch(qHash(stringTypes): {
case Number ^ Dash ^ Word_Latin_Lower: {
 /* code */
 break;
}
case Word_Latin_Lower ^ Number ^ Dash: {
 /* code */
 break;
}
default:
 ct_Error("Failed to account for the combination: ", stringTypes);
 break;
}
answered May 14, 2019 at 1:49

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.