Performance of this BCrypt lossless compressor

Question 1

The following code compresses a 60-byte BCrypt MCF into 40 bytes.

Does anyone see room for performance or other optimizations?
For example, I should replace Guava's ArrayBasedUnicodeEscaper with a more efficient algorithm that deals only with chars.

import com.google.common.escape.ArrayBasedUnicodeEscaper;
import com.google.common.escape.UnicodeEscaper;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
public final class BcryptCompressor {
 private static final String BCRYPT_CHARACTERS = "./ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
 private static final String BASE64_CHARACTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
 private static final UnicodeEscaper BCRYPT_TO_BASE64 = new BcryptToBase64();
 private static final UnicodeEscaper BASE64_TO_BCRYPT = new Base64ToBcrypt();
 private static final Map<String, Byte> SCHEME_TO_BYTE_MAP = Map.ofEntries(
 Map.entry("2", (byte)0x20),
 Map.entry("2a", (byte)0x40),
 Map.entry("2x", (byte)0x60),
 Map.entry("2y", (byte)0x80),
 Map.entry("2b", (byte)0xA0)
 );
 private static final Map<Byte, String> BYTE_TO_SCHEME_MAP = Map.ofEntries(
 Map.entry((byte)0x20, "2"),
 Map.entry((byte)0x40, "2a"),
 Map.entry((byte)0x60, "2x"),
 Map.entry((byte)0x80, "2y"),
 Map.entry((byte)0xA0, "2b")
 );
 public static String encode(final byte[] bmcf) {
 final byte header = bmcf[0];
 return ('$' + BYTE_TO_SCHEME_MAP.get((byte)(header & 0xE0)) + '$' + (header & 0x1F) + '$' + bcrypt64Encode(bmcf, 1, 16) + bcrypt64Encode(bmcf, 17, 23));
 }
 public static byte[] decode(final String mcf) {
 final int secondDollarIndex = mcf.indexOf('$', 1);
 final int thirdDollarIndex = mcf.indexOf('$', (secondDollarIndex + 1));
 final String schema = mcf.substring(1, secondDollarIndex);
 final String cost = mcf.substring((secondDollarIndex + 1), thirdDollarIndex);
 final String saltAndHash = mcf.substring((thirdDollarIndex + 1));
 final ByteBuffer buffer = ByteBuffer.allocate(40);
 final byte header = (byte)(SCHEME_TO_BYTE_MAP.get(schema) | (Integer.parseInt(cost) & 0x1F));
 buffer.put(header);
 final String salt = saltAndHash.substring(0, 22);
 final String hash = saltAndHash.substring(22);
 buffer.put(bcrypt64Decode(salt));
 buffer.put(bcrypt64Decode(hash));
 return buffer.array();
 }
 private static String bcrypt64Encode(final byte[] data, final int offset, final int length) {
 return BASE64_TO_BCRYPT.escape(new String(Base64.getEncoder()
 .withoutPadding()
 .encode(Arrays.copyOfRange(data, offset, offset + length))));
 }
 private static byte[] bcrypt64Decode(final String data) {
 return Base64.getDecoder()
 .decode(BCRYPT_TO_BASE64.escape(data));
 }
 private static Map<Character, String> createMap(final String from, final String to) {
 final Map<Character, String> result = new HashMap<>();
 for(int i = 0; i < from.length(); i++) {
 result.put(from.charAt(i), to.substring(i, (i + 1)));
 }
 return result;
 }
 private static final class BcryptToBase64 extends ArrayBasedUnicodeEscaper {
 public BcryptToBase64() {
 super(createMap(BCRYPT_CHARACTERS, BASE64_CHARACTERS), 0, 0xFFFF, null);
 }
 @Override
 protected char[] escapeUnsafe(final int cp) {
 throw new UnsupportedOperationException();
 }
 }
 private static final class Base64ToBcrypt extends ArrayBasedUnicodeEscaper {
 public Base64ToBcrypt() {
 super(createMap(BASE64_CHARACTERS, BCRYPT_CHARACTERS), 0, 0xFFFF, null);
 }
 @Override
 protected char[] escapeUnsafe(final int cp) {
 throw new UnsupportedOperationException();
 }
 }
}

Question 2

This library is not well motivated; it does not document a use case where squishing 60-byte ASCII down to 40-byte binary provides meaningful app-level improvement.

room for performance ... ?

This is tagged as a performance submission, yet it contains no observed timings or profiling measurements. If you don't tell us how to generate an example workload of interest, then we're left with hand waving.

generating constants

Ok, maybe it was infeasible to pull BCRYPT_CHARACTERS and BASE64_CHARACTERS from some existing library, fine. But listing the contents of both SCHEME_TO_BYTE_MAP and BYTE_TO_SCHEME_MAP is just distracting. DRY. Generate the one from the other. Consider using enums.

Also, you apparently allocated the top three MSBs to represent the scheme, but neglected to explicitly write that down, and didn't cite a scheme reference. Bit operations that use << 5 would improve clarity. Ok, enough on that topic, I won't examine masks like 0x1F or 0xE0.

clear names

 final byte header = bmcf[0];

Consider introducing an explanatory type, BinaryModularCryptFormat, which is simply a byte array.

I thought "header" here is a "cost" parameter. Again, cite your reference. And then use the same names that the reference uses.

input checking

decode makes no effort to verify that mcf is well formed, and will always return some answer, possibly gibberish. Or it throws IndexOutOfBoundsException. Its /** javadoc */ does not touch on this topic, nor indeed on any topic.

Consider splitting on '$', or perhaps using a regex would be convenient.

use meaningful names

The identifier informs me that createMap is creating something, good, that's helpful. But what? Yeah, I know it's a map, that much is in the signature already. Tell me what kind of map, please. Name it.

Kudos for using it to build base64 <=> bcrypt mappings, rather than including them as literals in the source code. But perhaps we could initialize them the first time through and retain them, rather than building a fresh (constant) map for each input hash?

DRY, consider defining a class that offers escapeUnsafe, which those two classes extend.

automated tests

I just flat out don't trust this code to produce correct results. Maybe it does, but how would I know?

Cite a reference. And then copy one or more test vectors (plaintext + hash) from the reference. Add them to a test suite that demonstrates same behavior for 60-byte text and 40-byte binary hashing. Maybe compare your test suite with some existing library's test suite. At that point I would have much greater confidence that I could integrate this code into a production setting.

This library achieves a subset of its design goals.

I would be willing to delegate or accept maintenance tasks on it.

Question 3

Thanks for your response. This has a meaningful use in optimizing DB storage. I definitely do not want to split with regex, as that is slow. I know my algorithm works and was hoping to get feedback on how to speed it up. Please tell me how I could have made that more clear, so that I can write better questions in the future.

Question 4

Two things would make it more clear: (1) cite the reference of the spec you're trying to conform to (intent of what we should do), and (2) include the spec's test vectors in a test suite (demonstration that we did it). BTW, I suggested regex for "correctness" rather than for "speed" -- it's easy to convince someone reading the code that malformed input won't be accepted by a regex. As written, this code is too tolerant of poorly placed dollar signs. If speed is a concern, then /** document */ that, and make validation against a certain spec the caller's responsibility.

J_H J_H 41.4k3 gold badges38 silver badges157 bronze badges · Answer 1 · 2023-09-25 15:51:00Z

This library is not well motivated; it does not document a use case where squishing 60-byte ASCII down to 40-byte binary provides meaningful app-level improvement.

room for performance ... ?

This is tagged as a performance submission, yet it contains no observed timings or profiling measurements. If you don't tell us how to generate an example workload of interest, then we're left with hand waving.

generating constants

Ok, maybe it was infeasible to pull BCRYPT_CHARACTERS and BASE64_CHARACTERS from some existing library, fine. But listing the contents of both SCHEME_TO_BYTE_MAP and BYTE_TO_SCHEME_MAP is just distracting. DRY. Generate the one from the other. Consider using enums.

Also, you apparently allocated the top three MSBs to represent the scheme, but neglected to explicitly write that down, and didn't cite a scheme reference. Bit operations that use << 5 would improve clarity. Ok, enough on that topic, I won't examine masks like 0x1F or 0xE0.

clear names

 final byte header = bmcf[0];

Consider introducing an explanatory type, BinaryModularCryptFormat, which is simply a byte array.

I thought "header" here is a "cost" parameter. Again, cite your reference. And then use the same names that the reference uses.

input checking

decode makes no effort to verify that mcf is well formed, and will always return some answer, possibly gibberish. Or it throws IndexOutOfBoundsException. Its /** javadoc */ does not touch on this topic, nor indeed on any topic.

Consider splitting on '$', or perhaps using a regex would be convenient.

use meaningful names

The identifier informs me that createMap is creating something, good, that's helpful. But what? Yeah, I know it's a map, that much is in the signature already. Tell me what kind of map, please. Name it.

Kudos for using it to build base64 <=> bcrypt mappings, rather than including them as literals in the source code. But perhaps we could initialize them the first time through and retain them, rather than building a fresh (constant) map for each input hash?

DRY, consider defining a class that offers escapeUnsafe, which those two classes extend.

automated tests

I just flat out don't trust this code to produce correct results. Maybe it does, but how would I know?

Cite a reference. And then copy one or more test vectors (plaintext + hash) from the reference. Add them to a test suite that demonstrates same behavior for 60-byte text and 40-byte binary hashing. Maybe compare your test suite with some existing library's test suite. At that point I would have much greater confidence that I could integrate this code into a production setting.

This library achieves a subset of its design goals.

I would be willing to delegate or accept maintenance tasks on it.

Thanks for your response. This has a meaningful use in optimizing DB storage. I definitely do not want to split with regex, as that is slow. I know my algorithm works and was hoping to get feedback on how to speed it up. Please tell me how I could have made that more clear, so that I can write better questions in the future.
Two things would make it more clear: (1) cite the reference of the spec you're trying to conform to (intent of what we should do), and (2) include the spec's test vectors in a test suite (demonstration that we did it). BTW, I suggested regex for "correctness" rather than for "speed" -- it's easy to convince someone reading the code that malformed input won't be accepted by a regex. As written, this code is too tolerant of poorly placed dollar signs. If speed is a concern, then /** document */ that, and make validation against a certain spec the caller's responsibility.

Stack Exchange Network

Performance of this BCrypt lossless compressor

1 Answer 1

generating constants

clear names

input checking

use meaningful names

automated tests

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Performance of this BCrypt lossless compressor

1 Answer 1

generating constants

clear names

input checking

use meaningful names

automated tests

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions