
A recent CR exchange led me to dust off and adapt some code I'd written a few years ago. Long, long ago, when core meant ferro-magnetic memory cells, I spent an hour formulating a "simple" hash function, combined with a sparse LUT, to convert month names to month ordinals. More recently, to pass a rainy afternoon, I developed the following code to search out alternatives (perhaps simpler mangling of the characters of lexicon words; perhaps discovering a "Perfect Hash" for that lexicon).

This has been a "personal toy coding project" up until the recent revitalisation of its (possible) utility. "Recalled to Life!" for any Dickens fans in the audience.

This is presented here both as a humble offering to any who might find it useful, AND to solicit comments and suggestions as to how it might be improved. (The nested iterations are somewhat embarrassing, but I've lost the motivation to revise and refactor it to use recursion when permuting the mangling functions.)

Welcome would be any/all significant commentary (please, no "should use const" comments). Especially welcome would be the code of a revised version that does all this better than the current version does! For your pleasure...

For posting here, the code below exercises the "Month to Ordinal" function, then the "Weekday to Ordinal" function, then goes wild reporting hundreds of alternative hashing operations for the latter that also work. (It scrolls the screen somewhat, but does conclude.) Adding another quest using a different lexicon should be self-evident.

// SeekHash:
// Permute a bunch of operations looking for a combination
// that results in no collisions
// when applied to a small lexicon of strings.
// Report those hashing operations and the "string LUT" for each.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
static char *mkCodeStr( int ph[], int nHashes ) {
    // Given an array of small value integers and the useful element count,
    // make a string containing 'A', 'B', 'C', ... each at meaningful offsets.
    // Set unimportant bytes to '@' to achieve internal padding.
    // String used as LUT; maximum len is 32 (to suit SMALL lexicon only)
    // Yes, only 26 letters in alphabet, but...
    static char buf[32 + 1];
    memset( buf, '@', 32 ); // init "background"
    int maxHash = 0;
    for( int i = 0; i < nHashes; i++ ) {
        buf[ ph[i] ] = 'A' + i;                      // assign alpha representative into applicable element
        maxHash = ph[i] > maxHash ? ph[i] : maxHash; // remember max element assigned
    }
    buf[maxHash+1] = '0円'; // trim excess string length
    return buf;
}
typedef char (*fnc_t)( char );
// There are many 'trivial' bit operations possible to 'alter' a single byte/char.
static char bit2( char x ) { return (x>>1)&0x01; }
static char add1( char x ) { return x+1; }
static char sub1( char x ) { return x-1; }
static char invr( char x ) { return ~x; }
static char shR1( char x ) { return x>>1; }
static char shR2( char x ) { return x>>2; }
static char shL1( char x ) { return x<<1; }
static char shL2( char x ) { return x<<2; }
static char same( char x ) { return x; }
static char null( char x ) { return 0; }
typedef int (*op_t)( char, char, char );
// There are many operations to combine 3 bytes. These are a few of them.
static int sum_sum( char x, char y, char z ) { return x + y + z; }
static int xor_sum( char x, char y, char z ) { return x ^ y + z; }
static int sum_xor( char x, char y, char z ) { return x + y ^ z; }
static int lft_fld( char x, char y, char z ) { return sub1(x) + y ^ z; }
#define ENTRY(x) { #x, x, } // String-ize and the function address (pointer)
void seekHash3chars( char *wordList[], int nWords ) {
    // Each of leftmost 3 chars of every word will be subjected to a bit manipulation function.
    // 3 funcPtrs and an array of the "byte's bit mangling" operations available
    struct {
        char *name;
        fnc_t fnc;
    } *f0, *f1, *f2, fncs[] = {
        ENTRY( null ),
        ENTRY( same ),
        ENTRY( shL1 ),
        ENTRY( shL2 ),
        ENTRY( shR1 ),
        ENTRY( shR2 ),
        ENTRY( invr ),
        ENTRY( sub1 ),
        ENTRY( add1 ),
        ENTRY( bit2 ),
    };
    const int nFncs = sizeof fncs / sizeof fncs[0];

    // List of "byte combining" operations (in same style as above)
    struct {
        char *name;
        op_t fnc;
    } *op, ops[] = {
        ENTRY( sum_sum ),
        ENTRY( sum_xor ),
        ENTRY( xor_sum ),
        ENTRY( lft_fld ),
    };
    const int nOps = sizeof ops / sizeof ops[0];

    int hashes[32];           // Up to 32 distinct hash values, one for each lexicon entry
    unsigned int results = 0; // Up to 32 bit flags set as result of hash (to quickly detect/reject collisions)
    bool foundSome = false;   // flag remains clear if no 4 bit operations work. Try 5 bits...

    for( int mask = 0xF; !foundSome && mask <= 0x1F; mask |= mask<<1 ) {
        // Try hashes between 0-15, then try larger hashes 0-31
        for( op = ops; op < &ops[nOps]; op++ ) {
            // Try each of the "byte combining" operations
            for( f0 = fncs; f0 < &fncs[nFncs]; f0++ ) {
                // Use each byte mangle on 1st character
                for( f1 = fncs; f1 < &fncs[nFncs]; f1++ ) {
                    // Use each byte mangle on 2nd character
                    for( f2 = fncs; f2 < &fncs[nFncs]; f2++ ) {
                        // Use each byte mangle on 3rd character
                        results = 0;
                        for( int w = 0; w < nWords; w++ ) {
                            // Go thru entire lexicon (eg 12 month names)
                            // Mangle first 3 characters
                            char x = f0->fnc( wordList[w][0] & 0x1F ); // low 5 bits of each char
                            char y = f1->fnc( wordList[w][1] & 0x1F );
                            char z = f2->fnc( wordList[w][2] & 0x1F );
                            int possHash = op->fnc( x, y, z ) & mask;  // Combine 3 results and mask to 4/5 bits
                            hashes[w] = possHash; // remember this hash value of this lexicon "word"
                            // If collision detected, stop trying immediately and try something else
                            if( (0x01 << possHash) & results )
                                break;
                            results |= (0x01 << possHash); // mark as used and continue with more words
                        }
                        if( w >= nWords ) {
                            // WOW! This combination of operations on this lexicon results in NO collisions!!
                            // Report operations, some view of results and statistics,
                            // then formulate LUT as a string of '@' and 'A' to 'Z-ish'
                            printf( "%s (%s %s %s) hashes -", op->name, f0->name, f1->name, f2->name );
                            int min = hashes[0], max = hashes[0];
                            for( w = 0; w < nWords; w++ ) {
                                printf( " %2d", hashes[w] );
                                if( min > hashes[w] ) min = hashes[w];
                                if( max < hashes[w] ) max = hashes[w];
                            }
                            printf( " min(%d) max(%d) range(%d) '%s'\n", min, max, max - min + 1, mkCodeStr( hashes, w ) );
                            foundSome = true;
                        }
                    }
                }
            }
        }
    }
}
static void demo( char *lst[], int num, int (*fnc)(char*) ) {
    // Output lexicon strings and each derived 'index' in two columns
    for( int i = 0; i < num; i++ )
        printf( "%11s - %02d # %-5s - %02d\n", lst[ i ], fnc( lst[i] ), lst[ i+num ], fnc( lst[i+num] ) );
    printf( "\n" );
}
static int monthOrd( char *mn ) { return "DIE@CB@LJF@HAG@K"[ mn[1]/4&7 ^ mn[2]*2 &0xF ] &0xF; }
static void demo_MonthOrd( void ) {
    // Convert a month name (or 3 letter abbrev.) to index (1-12). Beware false positives!
    char *ms[] = {
        "January", "February", "March", "April", "May", "June", "July", "August", "Sept", "October", "Nov", "December",
        "JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC",
    };
    const int num = sizeof(ms)/sizeof(ms[0]);
    demo( ms, num/2, monthOrd );
//  seekHash3chars( ms, num/2 ); // disabled for CR posting...
}
static int wkdayOrd( char cp[] ) { return "65013427"[*cp/2 + ~cp[1] & 0x7] & 0x7; }
static void demo_WkDayOrd( void ) {
    // Convert a day name (or 3 letter abbrev.) to index (1-7, Sun=1). Beware false positives!
    char *ds[] = {
        "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
        "SU", "MO", "TU", "WE", "TH", "FR", "SA"
    };
    const int num = sizeof(ds)/sizeof(ds[0]);
    demo( ds, num/2, wkdayOrd );

    // Show searching for candidate operation combinations
    puts( "\n\nPress enter for demo of search in action" ); getchar();
    seekHash3chars( ds, num/2 ); // enabled for CR posting...
}
int main( void ) {
    printf( "demo_MonthOrd()\n" ); demo_MonthOrd();
    puts( "\n\nPress enter for demo of weekday lexicon" ); getchar();
    printf( "demo_WkDayOrd()\n" ); demo_WkDayOrd();
    return 0;
}

NB: In this version of the code, mkCodeStr() trims the LUT string to a minimum length, effectively discarding contiguous '@' characters to the right of the highest valid mapping alpha character. This works fine in a scenario in which the candidate being hashed and mapped is definitely a member of the lexicon (but which member?). For instance, when dealing with tokens like month names that appear at a known location in a row of a computer generated table of records. No "caller verification" would be required.

In uses where the candidate token could be any random word, its hash could land anywhere in the masked range (0-15 or 0-31), so this truncation of the LUT string must be manually undone before use.

Example: this program may propose a 'trimmed' LUT string like "@@BADCE@FG" (10 chars). This would be fine if the candidates were guaranteed to be one of the lexicon's words. For dealing with any 'random' word, the LUT string needs to be padded on the right to "@@BADCE@FG@@@@@@" (16 chars), as the candidate string has the potential, after the bit masking operations, to produce any hash value from 0 to 15.
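
A minimal sketch of that padded lookup, using the 4-bit example above (the padded string is the one shown; the function and variable names are mine, added purely for illustration):

// Hypothetical 4-bit case: the trimmed proposal above, padded back out to 16 slots.
static const char lut[16 + 1] = "@@BADCE@FG@@@@@@";

// 'hash' is assumed to be the masked 4-bit value produced by the chosen
// mangle/combine operations; '@' & 0x1F yields 0, i.e. "no lexicon member here".
static int lookupOrd( int hash ) {
    return lut[ hash & 0xF ] & 0x1F;   // 0 = not found, else 1-based ordinal
}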


I acknowledge that 'gperf' will write the code to achieve a speed-efficient "perfect hashing result". The examples I've seen, however, seem to treat memory as a plentiful resource, which is perhaps unsuitable for small microcontroller code...


Restating a caveat: These functions blindly deal with the first 3 characters in an array (a "string"). The result is a usable ordinal based on the lexicon provided, BUT false positives are unavoidable. The calling code must, if need be, verify (strcmp()) against only ONE member of the entire lexicon. The intent is to speed up the translation of candidate strings to useful ordinal values without recourse to 'groping' as performed by, for instance, a binary search of a list.
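
For instance, a caller handling untrusted text might wrap monthOrd() like this (a sketch only: the monthNames table and the wrapper name are assumptions of mine, the token is assumed to be a full month name of at least 3 characters, and <string.h> is already included by the posted code):

// Hypothetical wrapper: confirm the ONE lexicon member the hash points at.
static const char *monthNames[12] = {
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
};

static int monthOrdChecked( char *tok ) {
    int ord = monthOrd( tok );                        // may be a false positive
    if( ord >= 1 && ord <= 12 && strcmp( tok, monthNames[ord - 1] ) == 0 )
        return ord;                                   // verified: 1-12
    return 0;                                         // reject false positive
}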

asked Apr 17 at 21:55
  • Additional note: I've always been deluded, thinking "the smaller the LUT, the better". Only in the last few days have I come to realise that a larger LUT (with more 'holes' that result in returning 0 (not found)) would run faster. The caller would receive fewer false positive indicators that need to be verified to be rejected, and more true negative results... Live and learn... :-) Commented Apr 17 at 22:30
  • I'm not sure about the differences between C and C++; are you able to use #include <functional> for std::hash? Commented Apr 20 at 4:35
  • @CharlesHenington I'm sorry. I cannot answer that. It's a long story. I got used to this IDE about 30 years ago, writing an app in C++ after years of C. Eventually, contracts led me back to C, then early retirement doing what I want with the familiar IDE. Still use that antiquated compiler (C++) but favour DIY projects (like this) using only C stuff. While I'm confident the world moves on and paradigms change, I'm happy mucking on my own, not trying to build apps out of "packaged-up" modules. No fun in that... Cheers!... Commented Apr 20 at 8:16

2 Answers


Make sure the code compiles

Your code is missing a #include <stdbool.h>, and there is an error compiling the line if( w >= nWords ), since the w from the previous for-loop is no longer in scope.

Performance

You are trying to create a perfect hash function given a fixed set of possible inputs. There are many ways to do this, but of course a good implementation would:

  • Find a perfect hash function in something close to \$O(N)\$ time
  • Use a LUT that is close to \$\log_2(N)\$ bits in size

Your code generates a small LUT, so that's great. And technically, it also has \$O(N)\$ time complexity: despite the six-level nested for-loop, only one of the loops depends on \$N\$, and the rest all use a fixed number of iterations. But of course the constant factor is huge.

Limitations

While the hash function might be running on a microcontroller, the hash function generator most likely is not. So the generator can use as much memory as it wants, and it should also be able to look beyond the first three characters. For a large enough input set, there is also no guarantee that using three bytes, combined with the unary and ternary operations you defined, will actually let you find a perfect hash function. Certainly not if you restrict yourself to only the first three bytes.

To make your code usable in general, get rid of these limitations and make it always generate a perfect hash function regardless of the input set.

Don't hardcode the input

Do you really want to recompile the program every time you want to generate a hash function for a different set of inputs? I would change the program such that it reads the input from stdin and writes the output to stdout. The output could be in the form of a small piece of C code that implements the perfect hash function that was found.
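
One possible shape for that front end, sketched under the assumption that it replaces the demo main() and is compiled together with the posted seekHash3chars(); the word-count and word-length limits are arbitrary placeholders, and emitting the found function as C text is left out:

// Hypothetical stdin-driven driver for the posted search routine.
#include <stdio.h>
#include <stdlib.h>

void seekHash3chars( char *wordList[], int nWords );   // from the posted code

int main( void ) {
    static char store[32][16];   // up to 32 words of up to 15 chars each
    char *words[32];
    int n = 0;
    while( n < 32 && scanf( "%15s", store[n] ) == 1 ) {
        words[n] = store[n];
        n++;
    }
    if( n > 0 )
        seekHash3chars( words, n );
    return EXIT_SUCCESS;
}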

answered Apr 18 at 13:47
  • stdbool.h isn't needed in C23 and later, where true and false are predefined constants (en.cppreference.com/w/c/language/bool_constant). But yes, before that you'd need it, and the question didn't say which version of C it was written in. Commented Apr 18 at 18:16
  • Great response! I'm again caught-in-the-act of passing off C++ code as C. Apologies to all. (Easy to correct). Agree with you that a more capable generator would be better. But, to compete with Gperf?? The power of "first to market" with a quality product... Thanks for looking over this posting... Commented Apr 18 at 19:50
  • \$\begingroup\$ Re: "Limitations" You're right! I acknowledge the limited domain of usefulness. I am somewhat 'attached' to the 'simplicity' of this approach. For a while I considered tackling the 'lexicon' of C's original keywords (perhaps simply including a 4th character). Went no further after thinking that dealing with if0円 and do0円 would necessitate "special consideration", either lex-ing the source, or in the hashing function. Gave up before even considering newer keywords like restrict, etc... You're right in noting this approach, as presented, has limited applicability... Cheers! :-) \$\endgroup\$ Commented Apr 18 at 23:23
  • Again, appreciate your taking the time to respond. Thank you. IF some reader might find use or amusement in this code, they're encouraged to add "sophistication" writing their own UI, perhaps autogenerate the possible return blah; statement, and even add more 'mangle/merge' possibilities. Hoping they'd share their enhancements with the world back here in an answer to this posting. Cheers... :-) Commented Apr 20 at 20:15
  • "Perfect hashing" I'm sure you know this; posting for the novice reader... The LUT for a "perfect hash" would be "minimal", containing no 'gaps' representing "true negative" calculated values. For 'random' input, every candidate must be subsequently verified to differentiate "true" from "false" positives. OTOH, a "good enough" hashing function with an oversized LUT would result in many more "true negative" look-ups, saving execution time (running faster). (A "good enough" function may even involve fewer shenanigans calculating any hash.) Should be acknowledged... Cheers! Commented Apr 27 at 0:53

Posting this as a (self-)answer to avoid modifying the original "question".


@G.Sliepen rightly points out that this 'technique' has severe limitations [emphasis mine]. One such case (using only 3 letters) would be the seeming impossibility of uniquely distinguishing between "there", "their" and "they're" (a shortcoming shared with some humans.)

Depending on the 'use case', there's nothing preventing the coder from (perhaps 'cleverly') dividing a large lexicon into multiple smaller lexicons, and using multiple instances of this 'technique' to zero in on whether the candidate string can be 'identified'. A "not found" return, or a rejected "false positive" (one strcmp()) could lead to trying the same candidate with a second, alternate hashing function appropriate to the second sub-region of what is a larger lexicon. Cascading in this way, the candidate could be successively(!) evaluated against sizeable 'chunks' of a very large lexicon with minimal effort. (Although this would eventually become less efficient than simply using a conventional binary search of a sorted list.)
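
As a concrete (if contrived) stand-in for that cascade, the question's monthOrd() and wkdayOrd() could serve as the hash functions for two 'chunks' of one combined lexicon. The tables, the wrapper name and the combined numbering below are my own assumptions; tokens are assumed to be at least 3 characters long, and <string.h> is already included by the posted code:

// Hypothetical cascade: months occupy ordinals 1-12 of a combined list,
// weekdays 13-19.  At most one strcmp() per chunk confirms or rejects a hit.
static const char *months[12] = {
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
};
static const char *days[7] = {
    "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
};

static int combinedOrd( char *tok ) {
    int i = monthOrd( tok );                        // try the first chunk
    if( i >= 1 && i <= 12 && strcmp( tok, months[i - 1] ) == 0 )
        return i;                                   // 1..12
    i = wkdayOrd( tok );                            // fall through to the second chunk
    if( i >= 1 && i <= 7 && strcmp( tok, days[i - 1] ) == 0 )
        return 12 + i;                              // 13..19
    return 0;                                       // in neither chunk
}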

For any who might want to try their hand, here are a few suggestions (although I cannot think of where these would need to be quickly reduced to an integer value); a minimal driver for the first of them is sketched after the list:

  • the names of the seven dwarves
  • the seven "Wonders of the World" (Modern? or Ancient?)
  • the planets (although the 1st letter is all one needs, but for "Mercury" and "Mars")
  • the old "Resistor Colour Code" ("Bad Boys..." for those who know that mnemonic)
  • the 12 zodiac "star signs" (a version presented here two years ago)
  • similar lexicons but for months/weekdays written in a "foreign" language
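
For instance, the first suggestion could be wired into the posted code with a driver as small as this (demo_Dwarves and the spellings are my assumptions; it simply feeds the lexicon to the posted search):

// Hypothetical driver: hunt for a collision-free hash over the seven dwarves.
static void demo_Dwarves( void ) {
    char *dw[] = {
        "Doc", "Grumpy", "Happy", "Sleepy", "Bashful", "Sneezy", "Dopey",
    };
    const int num = sizeof(dw)/sizeof(dw[0]);
    seekHash3chars( dw, num );   // reports every workable operation combination
}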

While this offering is not a panacea, I hope it is enlightening to those who've given it some thought. There is no intention, nor claim, that this, in any way, approaches a complex analysis of an arbitrary collection of "words". But, it may be a useful tool to toss into the toolbox for use (or mere amusement) on a rainy day.

(I was pleased to uncover a hashing function for 'weekday names' that used only the first 2 letters.)


Final words

Mucking-about with this concept and code has been a lot of fun. There's nothing intrinsically difficult, once one 'groks' what it is trying to achieve. I do hope it finds use by one-or-two readers, even just to enumerate Happy, Bashful, Doc and the others...

Reporting an update to this answer: for the first time, I've just tried a dumb, partial implementation of this suggested 'cascade' of "well, try looking in this subsection of a large list". My goal was enumerating each of the names of the 50 US states...

The first hurdles were the two "North ???", two "South ???" and four "New ???" states, plus the two "Miss????" states. In addition, a theoretical 'full house' of unique hashes (in this version) would max out at only 32.

Being 'clever', I divided the list into two, putting one "North ???" and one "South ???" into each group, and doing likewise for the two "Miss??" states. To account for the four "New ???" states, I added a 'settable' offset (either 0 or 4) so that the attempts to determine non-colliding hash values could use word[w][ 0, 1, 2 ] or word[w][ 4, 5, 6 ]. (Obviously, "Iowa" and "Ohio" were not amongst the second group of names.)
Result: NO attempted permutation of these operations avoided collisions; tough luck!

Undaunted, I tried using subsection populations of 17 + 17 + 16. Each of these three DID succeed, offering up several 'workable' hashing possibilities. Just to note, the original sorted list of 50 state names had, by this time, become quite a jumbled collection. A further translation LUT, integer to integer (not implemented), would be needed if it were important that "Alaska" be indexed as 1, and "Wyoming" as 50.

Having wrestled that bear into submission, and in view of both "big O" considerations and the desire for the best (fastest) return from this function, the clouds parted to reveal a further potential improvement that I will leave alone...

Based on the lexicon of 50 state names, a binary search may involve up to 5 iterations (strcmp()) to arrive at ~1/2 of its matches (or, worse, rejections of a candidate string). Using these chunks of 17+17+16 would, in the worst case, reach its conclusion in three steps. Note: if the candidate string is definitely a correctly spelled state name, no verifying strcmp() would be required; just some "light arithmetic or bitwise operations" on the values of 3 bytes of the candidate. And there's a 1-in-3 chance that the candidate is found in the first subsection tried.

This last idea led to another variation that could be interesting to explore. Instead of my simple-minded guesswork assigning states to groups, one with enough resources could seek to stuff as many "compatible" states into the first and second groups as possible (a 'knapsack'-style problem). Supposing, properly tested, the lexicon were such that some combination of ca. 25 of the 50 'keywords' could be uniquely "hashed" into the same (first) subsection, then ca. 50% (not "big O"; not "worst case") of 'in use' searches would find the target with just a few "light bitwise" operations and one strcmp() (only if required).

This all has been a bit of fun. It originated with a novice C programmer who wanted to find a way to speed up a rather long-running program that 'felt as if' it was clogging up an already strained multi-user minicomputer ('suboptimal' meant additional lost cycles given over to unproductive process 'swapping'). I hope this all has been in some way interesting to the reader.

answered Apr 19 at 4:33
  • Comment instead of edit change -> 'Recently Active'... Yech!! ... Overlooked "Alabama" and "Alaska" (1st 3 chars) and stated "Alaska" == 1 by mistake. Subtle collision if using word[4, 5, 6] as sample chars: hashes of "Washington" and "Wyoming" will always collide!! Not as 'simple' as it appeared to be at 1:30AM!! Cheers! :-) Commented Apr 21 at 1:20
  • Proof of concept (ie: using "cascades" of sizeable subsections) with a lexicon of 50 US "state names" showed this had possibilities. Allocation of 'words' to subsections, and devising 'workarounds', was part of the "game". FWIW, I've done something similar with the 102 elements named in Tom Lehrer's 1959 song "The Elements". The reader is encouraged to do their own explorations of discovery. Commented Apr 25 at 0:55
