I'm looking for a (possibly cryptographic strong) hash function in Bash with the following properties:
- The input is only a string with 4 lower-case characters: aaaa to zzzz
- The output should be a decimal number (with leading zeros) in range: 0000 to 9999
- Low number of collisions and hash values distributed as evenly as possible
- Surjective (each output should appear)
- Chaos (should have a good diffusion of hash values)
- Optional: Confusing (it should not be possible to draw any conclusions about the input value from the hash value)
I have not yet found a (simple) algorithm that would fulfil these requirements.
This is my approach. How would you rate it?
#!/bin/bash
# ord() - converts ASCII character to its decimal value
ord() {
printf '%d' "'1ドル"
}
my_hash_4 () {
s=1ドル
while (( ${#s} < 4 ));
do
s="${s}a"
done
hash=0
multiplier=1
for i in $(seq 0 3);
do
j=$(ord "${s:$i:1}")
hash=$((hash+(multiplier*(j-97)*961)))
multiplier=$((multiplier*26))
done
# (int) (10_000 / Math.pow(26, 4) * hash) % 10_000
hash=$((((10000000*hash)/456976000)%10000))
printf '%04d\n' "$hash"
}
my_hash_4 "1ドル"
Example output:
bash my_hash.sh aaaa
0000
bash my_hash.sh baaa
0021
bash my_hash.sh abaa
0546
bash my_hash.sh bbaa
0567
bash my_hash.sh zzzz
9978
bash my_hash.sh hallo
2292
5 Answers 5
I see no evidence that this is a cryptographically-strong hash function. You should include a comment that links to a reputable reference for the algorithm you are implementing, so that reviewers can compare your code against the specification and confirm that it matches. Otherwise, we have to assume that no cryptanalysis has been performed at all.
I would recommend using an existing implementation if possible, and reduce its output to your desired range (0-9999). Then only the last part needs to be verified as the operation of your script and all the specialist crypto work is done by those with the expertise.
That said, with such a small input space (264 values), "cryptographically strong" is probably meaningless - even on quite ordinary consumer hardware, it's not hard to generate all possible hashes for lookup or comparison.
There seems to be a baked-in assumption that input is ASCII. Yet that is not a given on all platforms where Bash can run.
Loops like this are wasteful:
s=1ドル while (( ${#s} < 4 )); do s="${s}a" done
Simpler just to append 4 chars (削除) then crop (削除ここまで):
s=${1}aaaa
Although we could then crop s
down to 4 characters (s=${s:0:4}
), there's no need because the rest of the script only indexes the first 4.
Invoking seq
to generate 0 1 2 3
seems wasteful. Bash brace-expansion {0..3}
is more efficient.
-
2\$\begingroup\$ I don't believe even using an existing hash function and concatenating it will make the hash function any more secure. There are simply too few values for this to ever be secure. \$\endgroup\$Dair– Dair2025年04月14日 14:31:41 +00:00Commented Apr 14 at 14:31
-
\$\begingroup\$ @Dair, you're absolutely right. One can make a tiny rainbow-table for this. \$\endgroup\$Toby Speight– Toby Speight2025年04月15日 08:15:46 +00:00Commented Apr 15 at 8:15
How would you rate it?
Obscure.
This does not impress me as a supportable codebase.
hash=$((hash+(multiplier*(j-97)*961)))
...
hash=$((((10_000_000*hash) / 456_976_000) % 10_000))
There's a lot of magic constants in the OP code. I imagine there might be some paper in the open literature that this code is based on. Cite your reference. What .PDF or other work does this source code rely upon?
That 312 (961) left shift does not seem well motivated. Including unit tests which demonstrate how bash arithmetic behaves during deliberate overflow would be useful. One day bash might change such behavior, and you'll want to know how to diagnose it.
The constant 45.6976 seems to have come from nowhere.
(And no, I didn't downvote. I did say the code seemed harder to maintain than necessary, and explained why.)
hash = ... % 10_000))
The modulo makes it clear that we shall obey the 4-digit contract. It would be easier to reason about the loop if a modulo happened within it. From the Review Context my understanding is that we expect a "long" input string to deliberately overflow, producing mod 264 results on each iteration. Coding an explicit modulo call would make Author's Intent more transparent, and reduce coupling between this high level source and low level details like bash version or native processor word width.
-
\$\begingroup\$ "I imagine there might be some paper in the open literature that this code is based on. Cite your reference." I thought about the constants myself. Thanks for downvoting my question. \$\endgroup\$Tobias Grothe– Tobias Grothe2025年04月13日 07:43:23 +00:00Commented Apr 13 at 7:43
-
7\$\begingroup\$ @TobiasGrothe Even I as a moderator can't see who down votes questions or answers. You can't assume an answer that provides valid observations that are critical of the code means that the user that provided the answer down voted the question. Most of the experienced users that answer a question upvote the question, at least I do. I down vote questions that are not worthy of answers. \$\endgroup\$2025年04月13日 13:05:34 +00:00Commented Apr 13 at 13:05
-
\$\begingroup\$ @J_H Perhaps you mean the FNV-1a hash function. And 26^4 (number of possible input values) is 456976, that's in the comment one line up. \$\endgroup\$Tobias Grothe– Tobias Grothe2025年04月13日 17:49:57 +00:00Commented Apr 13 at 17:49
-
2\$\begingroup\$ Sorry, I didn't read the comment one line up. It looked like "commented out code" rather than advice to a human. Typically we seek to delete such lines prior to PR review or merging to
main
. Its format differs enough from the code that actually does run that I didn't immediately view them as identical. Assigning \26ドル^4\$ to a symbol would be a good first step in assisting some poor maintenance engineer a year down the road who wants to understand what's happening and its motivation. It's unclear how FNV-1a is relevant to OP, absent XOR or a FNV prime. Do I understand the code? No, not well. \$\endgroup\$J_H– J_H2025年04月13日 21:39:12 +00:00Commented Apr 13 at 21:39
Validate input
The script expects the input is a string with 4 lower-case characters: aaaa to zzzz.
It would be good to enforce this in the script:
if ! [[ 1ドル =~ ^[a-z]{4}$ ]]; then
echo >&2 "Input must be a 4-digit string in the range aaaa..zzzz; got: 1ドル"
exit 1
fi
Simplify using ((...))
Instead of x=$((x+...))
you can write simpler as ((x += ...))
.
Use more spaces in arithmetic expressions for readability
In arithmetic expressions it's possible to use spaces similar to as in C.
Instead of:
hash=$((hash+(multiplier*(j-97)*961))) multiplier=$((multiplier*26)) hash=$((((10000000*hash)/456976000)%10000))
Combined with the previous point about ((...))
and using more spaces, this is easier to read:
((hash += multiplier * (j - 97) * 961))
((multiplier *= 26))
((hash = ((10000000 * hash) / 456976000) % 10000))
Bash is a shell. A shell is a tool to manipulate (create/destroy) files and processes and sequence calls to other tools. It is not a tool to manipulate text, including performing math operations. The people who invented shell also invented awk for shells to call when they need to manipulate text. Here's your shell script with the text manipulation part rewritten in awk, using GNU awk for its ordchr
library:
$ cat my_hash2.sh
#!/usr/bin/env bash
my_hash_4() {
awk -l 'ordchr' -v val="1ドル" '
BEGIN {
s = sprintf("%4s", val)
gsub(/ /, "a", s)
hash = 0
multiplier = 1
for ( i=1; i<=4; i++ ) {
j = ord( substr(s,i,1) )
hash += ( multiplier * (j-97) * 961 )
multiplier *= 26
}
# (int) (10_000 / Math.pow(26, 4) * hash) % 10_000
hash = ( (10000000 * hash) / 456976000 ) % 10000
printf "%04d\n", hash
}
'
}
my_hash_4 "1ドル"
$ for i in aaaa baaa abaaa bbaaa; do ./my_hash2.sh "$i"; done
0000
0021
0546
0567
You can write it a bit more briefly and clearly than a shell script (e.g. by including white space between values/variables) and if you only need to calculate 1 hash at a time the above would run about twice as fast as your original shell script but if you wanted to calculate multiple values in one execution then it'd run orders of magnitude faster, after being tweaked to handle multiple input strings, e.g.:
$ cat ./my_hash2.sh
#!/usr/bin/env bash
my_hash_4() {
awk -l 'ordchr' -v val="1ドル" '
{
s = sprintf("%4s", 1ドル)
gsub(/ /, "a", s)
hash = 0
multiplier = 1
for ( i=1; i<=4; i++ ) {
j = ord( substr(s,i,1) )
hash += ( multiplier * (j-97) * 961 )
multiplier *= 26
}
# (int) (10_000 / Math.pow(26, 4) * hash) % 10_000
hash = ( (10000000 * hash) / 456976000 ) % 10000
printf "%s\t%04d\n", 1,ドル hash
}
' < <(printf '%s\n' "$@")
}
my_hash_4 "$@"
$ ./my_hash2.sh aaaa baaa abaaa bbaaa
aaaa 0000
baaa 0021
abaaa 0546
bbaaa 0567
You should discard your shell script and use the above as a starting point for whatever it is you need to do.
If you don't have GNU awk then remove -l 'ordchr'
and add this function to the bottom of the awk script, before the last '
, and then it'll work in any awk:
function ord(char, num) {
if ( !(char in _ord) ) {
for ( num=0; num<=255; num++ ) {
_ord[sprintf("%c", num)] = num
}
}
return _ord[char]
}
Of course you could tweak that loop to go from -97
to 158
and then you wouldn't need to do j-97
in the loop on the input chars.
-
\$\begingroup\$ Tiny:
s = sprintf("%saaaa", 1ドル)
maybe? As noted in my comment, and another answer, following code only interested in leftmost 4 characters, regardless of length ofs
. Length doesn't strictly need to equal 4. \$\endgroup\$Fe2O3– Fe2O32025年04月19日 23:06:03 +00:00Commented Apr 19 at 23:06 -
2\$\begingroup\$ @Fe2O3 I was deliberately showing the realistic awk equivalent of the OPs original shell syntax so it was most obvious how to do the same in awk as they were doing in shell. There are other possible enhancements too. \$\endgroup\$Ed Morton– Ed Morton2025年04月19日 23:35:13 +00:00Commented Apr 19 at 23:35
Based on the previous reviews of J_H and Toby Speight (thanks for that), I now have this version I want to share with you:
#!/bin/bash
my_hash () {
s=${1}aaaa # ensure that there are at least 4 chars
a=$(printf '%d' "'a") # ordinal of 'a'
hash=0
multiplier=1
for i in {0..3};
do
j=$(printf '%d' "'${s:$i:1}") # ordinal of ith char
hash=$((hash+(multiplier*(j-a)*961)))
multiplier=$((multiplier*26))
done
hash=$((((10000*hash)/456976)%10000))
printf '%04d\n' "$hash"
}
my_hash "1ドル"
I think it is more readable and performs better and has less magic constants.
The thing I'm still not happy with is the constant 961 (31*31
).
In my view, this means that "similar" or "neighboring" input values are not distributed well enough, so that there is not enough chaos or diffusion. But this would have to be analyzed mathematically.
-
\$\begingroup\$ Letting
printf
populatea
itself withprintf -v a '%d' "'a"
would be more efficient than spawning a subshell witha=$(printf '%d' "'a")
and ditto forj=...
. But you shouldn't use shell for this anyway. \$\endgroup\$Ed Morton– Ed Morton2025年04月19日 11:43:33 +00:00Commented Apr 19 at 11:43
Explore related questions
See similar questions with these tags.
1+4=5
but2+3=5
, too... Given just5
, there's no way to reverse thing to determine the two input values again.) \$\endgroup\$aaa
") to input (sometimes making length 7), then slice off leftmost 4 characters for processing. Step back to see the whole forest. That firstwhile()
evaporates away. \$\endgroup\$