The following code compresses a 60-byte BCrypt MCF into 40 bytes.
Does anyone see room for performance or other optimizations?
For example, I should replace Guava's ArrayBasedUnicodeEscaper
with a more efficient algorithm that deals only with char
s.
import com.google.common.escape.ArrayBasedUnicodeEscaper;
import com.google.common.escape.UnicodeEscaper;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
public final class BcryptCompressor {
private static final String BCRYPT_CHARACTERS = "./ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
private static final String BASE64_CHARACTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
private static final UnicodeEscaper BCRYPT_TO_BASE64 = new BcryptToBase64();
private static final UnicodeEscaper BASE64_TO_BCRYPT = new Base64ToBcrypt();
private static final Map<String, Byte> SCHEME_TO_BYTE_MAP = Map.ofEntries(
Map.entry("2", (byte)0x20),
Map.entry("2a", (byte)0x40),
Map.entry("2x", (byte)0x60),
Map.entry("2y", (byte)0x80),
Map.entry("2b", (byte)0xA0)
);
private static final Map<Byte, String> BYTE_TO_SCHEME_MAP = Map.ofEntries(
Map.entry((byte)0x20, "2"),
Map.entry((byte)0x40, "2a"),
Map.entry((byte)0x60, "2x"),
Map.entry((byte)0x80, "2y"),
Map.entry((byte)0xA0, "2b")
);
public static String encode(final byte[] bmcf) {
final byte header = bmcf[0];
return ('$' + BYTE_TO_SCHEME_MAP.get((byte)(header & 0xE0)) + '$' + (header & 0x1F) + '$' + bcrypt64Encode(bmcf, 1, 16) + bcrypt64Encode(bmcf, 17, 23));
}
public static byte[] decode(final String mcf) {
final int secondDollarIndex = mcf.indexOf('$', 1);
final int thirdDollarIndex = mcf.indexOf('$', (secondDollarIndex + 1));
final String schema = mcf.substring(1, secondDollarIndex);
final String cost = mcf.substring((secondDollarIndex + 1), thirdDollarIndex);
final String saltAndHash = mcf.substring((thirdDollarIndex + 1));
final ByteBuffer buffer = ByteBuffer.allocate(40);
final byte header = (byte)(SCHEME_TO_BYTE_MAP.get(schema) | (Integer.parseInt(cost) & 0x1F));
buffer.put(header);
final String salt = saltAndHash.substring(0, 22);
final String hash = saltAndHash.substring(22);
buffer.put(bcrypt64Decode(salt));
buffer.put(bcrypt64Decode(hash));
return buffer.array();
}
private static String bcrypt64Encode(final byte[] data, final int offset, final int length) {
return BASE64_TO_BCRYPT.escape(new String(Base64.getEncoder()
.withoutPadding()
.encode(Arrays.copyOfRange(data, offset, offset + length))));
}
private static byte[] bcrypt64Decode(final String data) {
return Base64.getDecoder()
.decode(BCRYPT_TO_BASE64.escape(data));
}
private static Map<Character, String> createMap(final String from, final String to) {
final Map<Character, String> result = new HashMap<>();
for(int i = 0; i < from.length(); i++) {
result.put(from.charAt(i), to.substring(i, (i + 1)));
}
return result;
}
private static final class BcryptToBase64 extends ArrayBasedUnicodeEscaper {
public BcryptToBase64() {
super(createMap(BCRYPT_CHARACTERS, BASE64_CHARACTERS), 0, 0xFFFF, null);
}
@Override
protected char[] escapeUnsafe(final int cp) {
throw new UnsupportedOperationException();
}
}
private static final class Base64ToBcrypt extends ArrayBasedUnicodeEscaper {
public Base64ToBcrypt() {
super(createMap(BASE64_CHARACTERS, BCRYPT_CHARACTERS), 0, 0xFFFF, null);
}
@Override
protected char[] escapeUnsafe(final int cp) {
throw new UnsupportedOperationException();
}
}
}
1 Answer 1
This library is not well motivated; it does not document a use case where squishing 60-byte ASCII down to 40-byte binary provides meaningful app-level improvement.
room for performance ... ?
This is tagged as a performance
submission,
yet it contains no observed timings or profiling measurements.
If you don't
tell us
how to generate an example workload of interest,
then we're left with hand waving.
generating constants
Ok, maybe it was infeasible to pull
BCRYPT_CHARACTERS
and BASE64_CHARACTERS
from some existing library, fine.
But listing the contents of both
SCHEME_TO_BYTE_MAP
and BYTE_TO_SCHEME_MAP
is just distracting.
DRY.
Generate the one from the other.
Consider using enums.
Also, you apparently allocated the top three MSBs
to represent the scheme, but neglected to explicitly write that down,
and didn't cite a scheme reference.
Bit operations that use << 5
would improve clarity.
Ok, enough on that topic, I won't examine masks like 0x1F or 0xE0.
clear names
final byte header = bmcf[0];
Consider introducing an explanatory type, BinaryModularCryptFormat, which is simply a byte array.
I thought "header" here is a "cost" parameter. Again, cite your reference. And then use the same names that the reference uses.
input checking
decode
makes no effort to verify that mcf
is well formed,
and will always return some answer, possibly gibberish.
Or it throws IndexOutOfBoundsException.
Its /** javadoc */ does not touch on this topic,
nor indeed on any topic.
Consider splitting on '$'
, or perhaps using a regex would be convenient.
use meaningful names
The identifier informs me that createMap
is creating something,
good, that's helpful.
But what?
Yeah, I know it's a map, that much is in the signature already.
Tell me what kind of map, please.
Name it.
Kudos for using it to build base64 <=> bcrypt mappings, rather than including them as literals in the source code. But perhaps we could initialize them the first time through and retain them, rather than building a fresh (constant) map for each input hash?
DRY, consider defining a class that offers escapeUnsafe
,
which those two classes extend.
automated tests
I just flat out don't trust this code to produce correct results. Maybe it does, but how would I know?
Cite a reference. And then copy one or more test vectors (plaintext + hash) from the reference. Add them to a test suite that demonstrates same behavior for 60-byte text and 40-byte binary hashing. Maybe compare your test suite with some existing library's test suite. At that point I would have much greater confidence that I could integrate this code into a production setting.
This library achieves a subset of its design goals.
I would be willing to delegate or accept maintenance tasks on it.
-
\$\begingroup\$ Thanks for your response. This has a meaningful use in optimizing DB storage. I definitely do not want to split with regex, as that is slow. I know my algorithm works and was hoping to get feedback on how to speed it up. Please tell me how I could have made that more clear, so that I can write better questions in the future. \$\endgroup\$Oliver– Oliver2023年09月25日 20:41:56 +00:00Commented Sep 25, 2023 at 20:41
-
\$\begingroup\$ Two things would make it more clear: (1) cite the reference of the spec you're trying to conform to (intent of what we should do), and (2) include the spec's test vectors in a test suite (demonstration that we did it). BTW, I suggested regex for "correctness" rather than for "speed" -- it's easy to convince someone reading the code that malformed input won't be accepted by a regex. As written, this code is too tolerant of poorly placed dollar signs. If speed is a concern, then /** document */ that, and make validation against a certain spec the caller's responsibility. \$\endgroup\$J_H– J_H2023年09月25日 20:45:48 +00:00Commented Sep 25, 2023 at 20:45