Java class to compute and get a MD5 hash as a string

Question 1

Here is my class file to compute MD5 hashes on a file. I wanted to be able to easily calculate them without having to create a lot of overhead in the code using the class.

Can anyone tell me a list of improvements to make this class acceptable?

package Md5Generator;
/**
 *
 * @author jacob
 */
 import java.io.IOException;
 import java.security.*;
 import java.nio.file.Files;
 import java.nio.file.Paths;
 import java.nio.file.Path;
 /**
 * The point of this class it to compute the MD5 sum of a file
 */
 public class Md5 {
// two instance variables, one to store the file path and one to store the MD5 sum
private String path;
private String md5Sum;
/**
 * Constructor that takes a file path and and calcs the MD5 sum
 *
 * @param filePath the string that contains the full path to the file
 */
public Md5(String filePath) {
 path = filePath;
 calcMd5(path);
}
private void calcMd5(String filePath) {
 //create a messagedigest object to compute an MD5 sum
 try {
 MessageDigest md = MessageDigest.getInstance("MD5");
 //create a input stream to get the bytes of the file
 Path path = Paths.get(filePath);
 //read the bytes from the file and put them in the message digest
 md.update(Files.readAllBytes(path));
 //digest the bytes and generate an MD5 sum thats stored in an array of bytes
 byte[] hash = md.digest();
 //convert the byte array to its hex counter parts and store it as a string
 md5Sum = toHexString(hash);
 } catch (IOException | NoSuchAlgorithmException ex) {
 ex.printStackTrace();
 }
}
private String toHexString(byte[] bytes) {
 char[] hexArray = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
 char[] hexChars = new char[bytes.length * 2];
 int v;
 for (int j = 0; j < bytes.length; j++) {
 v = bytes[j] & 0xFF;
 hexChars[j * 2] = hexArray[v / 16];
 hexChars[j * 2 + 1] = hexArray[v % 16];
 }
 return new String(hexChars);
}
/**
 * Returns the MD5 sum as a String
 *
 * @return the string that contains the MD5 sum
 */
public String getMd5Sum() {
 return md5Sum;
}
}

Question 2

What do you mean acceptable? Does it work as intended? If not, your question is off-topic here.

Question 3

MD5 is considered to be no longer reliable as a one-way hash. Is there some reason you are not using at least SHA1?

Question 4

You convert the MD5 hash to a hex string. You may want to use a Base64 string, which will be shorter, and Java 8 includes the ready-made converter java.util.Base64 which does the work for you.

Question 5

@200_success the reason for not using SHA1 is that I am using this class in a backup program I am developing and I am only checking for unintentional tampering, thus MD5 should be sufficient

Question 6

THe most striking issue I can see in your code is that you're severely limited in the size of files you can compute the hash on. This line of code is particularly jarring:

md.update(Files.readAllBytes(path));

You read the entire file in to memory... which limits you to two things:

the size of your file is limited to the size of your memory ... unless
you have lots of memory, but the file is>2GB at which point you can't create a larger array of bytes.

The reason that the MessageDigest methods contain the update method, is to deal with larger files, and allow you to hash them in parts.

I would use this as an opportunity to learn the somewhat older NIO features: Channels, and Buffers.

Consider the code:

try (MessageDigest md = MessageDigest.getInstance("MD5")) {
 try (FileChannel fc = FileChannel.open(Paths.get(filePath))) {
 long meg = 1024 * 1024;
 long len = fc.size();
 long pos = 0;
 while (pos < len) {
 long size = Math.min(meg, len - pos);
 MappedByteBuffer mbb = fc.map(MapMode.READ_ONLY, pos, size);
 md.update(mbb);
 pos += size;
 }
 } catch (IOException ioe) {
 ....
 }
 byte[] hash = md.digest();
 .....
} catch ( NoSuchAlgorithmException nsae) {
 ....
}

The advantage of the above code is that is plays much closer to the operating system. In theory, it should be the fastest possible way for a Java program to read the file, and it should never transfer the actual file content in to the Java memory space. It removes at least one data copy between the file represented in the OS on disk, and the copy in memory.

There is no limit to the size of file you compute the hash on.

Question 7

Thanks for including the example and explanation of FileChannel. I had not encountered it. Perhaps the md.digest() should occur inside the try block. A partially computed digest seems misleading. EDIT: I guess it depends on the elided code. Perhaps the exception is raised on catch.

Question 8

I just wanted to comment on these lines, and the toHexString() method:

//digest the bytes and generate an MD5 sum thats stored in an array of bytes
byte[] hash = md.digest();
//convert the byte array to its hex counter parts and store it as a string
md5Sum = toHexString(hash);

I would use this, instead:

// digest the bytes, generate an MD5 hash and store it as a hex string
md5sum = String.format("%032x", new BigInteger(1, md.digest()));

Question 9

Reading the file in all at once may limit your ability to deal with large files. The MessageDigest allows incremental updates so you could chunk the file. Assuming you choose to read it all at once, I would suggest explicitly stating the limitation in your code comments. Then if you ever have a problem you will know right where to look.

If the class is to be used heavily you might consider whether to cache the MessageDigest in a class variable. My understanding is that construction is relatively expensive. If performance is a consideration in your application you may wish to measure it.

I like that the variable names seem clearly chosen. The comments are generally helpful, but sometimes slip into purely descriptive.

// two instance variables, one to store the file path and one to store the MD5 sum
private String path;
private String md5Sum;

First consider that the instance variable path is never referenced outside the constructor and can be omitted.

Second, note that the comment simply restates the code in more words. I would suggest omitting it or making it that clarifies the code. In this case, the purpose of the md5Sum might be relevant

// Cache for the computed sum
private String md5Sum;

I think the masking of the byte is one case where a comment would be helpful to head off confusion you might anticipate. Since a Java bytes is 8-bits it would appear to be unneeded.

// Mask on int conversion to prevent sign extension
v = bytes[j] & 0xFF;`

Question 10

Using a class with fields is overkill here. This should be a utility function, like the one provided by Apache Commons Digest.

public class DigestUtils {
 public static md5Hex(Path path) { ... }
}

Do not use e.printStackTrace, even when your IDE generates that code for you. It is wrong. You must report any exceptions to the calling method, or alternatively log them and provide a sensible fallback value. An empty string (like in your code) is not sensible.

Question 11

The performance loss of using this plus the fact that it requires dependencies is why I don't want to do it this way

Question 12

There is no performance loss. I did not say you should use Apache Commons Digest, just model your API like theirs.

rolfl rolfl 98.1k17 gold badges219 silver badges419 bronze badges · Accepted Answer · 2017-01-20 20:40:17Z

THe most striking issue I can see in your code is that you're severely limited in the size of files you can compute the hash on. This line of code is particularly jarring:

md.update(Files.readAllBytes(path));

You read the entire file in to memory... which limits you to two things:

the size of your file is limited to the size of your memory ... unless
you have lots of memory, but the file is>2GB at which point you can't create a larger array of bytes.

The reason that the MessageDigest methods contain the update method, is to deal with larger files, and allow you to hash them in parts.

I would use this as an opportunity to learn the somewhat older NIO features: Channels, and Buffers.

Consider the code:

try (MessageDigest md = MessageDigest.getInstance("MD5")) {
 try (FileChannel fc = FileChannel.open(Paths.get(filePath))) {
 long meg = 1024 * 1024;
 long len = fc.size();
 long pos = 0;
 while (pos < len) {
 long size = Math.min(meg, len - pos);
 MappedByteBuffer mbb = fc.map(MapMode.READ_ONLY, pos, size);
 md.update(mbb);
 pos += size;
 }
 } catch (IOException ioe) {
 ....
 }
 byte[] hash = md.digest();
 .....
} catch ( NoSuchAlgorithmException nsae) {
 ....
}

The advantage of the above code is that is plays much closer to the operating system. In theory, it should be the fastest possible way for a Java program to read the file, and it should never transfer the actual file content in to the Java memory space. It removes at least one data copy between the file represented in the OS on disk, and the copy in memory.

There is no limit to the size of file you compute the hash on.

Thanks for including the example and explanation of FileChannel. I had not encountered it. Perhaps the md.digest() should occur inside the try block. A partially computed digest seems misleading. EDIT: I guess it depends on the elided code. Perhaps the exception is raised on catch.

Stack Exchange Network

Java class to compute and get a MD5 hash as a string

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Java class to compute and get a MD5 hash as a string

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions