Here is my class file to compute MD5 hashes on a file. I wanted to be able to easily calculate them without having to create a lot of overhead in the code using the class.
Can anyone tell me a list of improvements to make this class acceptable?
package Md5Generator;
/**
*
* @author jacob
*/
import java.io.IOException;
import java.security.*;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.Path;
/**
* The point of this class it to compute the MD5 sum of a file
*/
public class Md5 {
// two instance variables, one to store the file path and one to store the MD5 sum
private String path;
private String md5Sum;
/**
* Constructor that takes a file path and and calcs the MD5 sum
*
* @param filePath the string that contains the full path to the file
*/
public Md5(String filePath) {
path = filePath;
calcMd5(path);
}
private void calcMd5(String filePath) {
//create a messagedigest object to compute an MD5 sum
try {
MessageDigest md = MessageDigest.getInstance("MD5");
//create a input stream to get the bytes of the file
Path path = Paths.get(filePath);
//read the bytes from the file and put them in the message digest
md.update(Files.readAllBytes(path));
//digest the bytes and generate an MD5 sum thats stored in an array of bytes
byte[] hash = md.digest();
//convert the byte array to its hex counter parts and store it as a string
md5Sum = toHexString(hash);
} catch (IOException | NoSuchAlgorithmException ex) {
ex.printStackTrace();
}
}
private String toHexString(byte[] bytes) {
char[] hexArray = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
char[] hexChars = new char[bytes.length * 2];
int v;
for (int j = 0; j < bytes.length; j++) {
v = bytes[j] & 0xFF;
hexChars[j * 2] = hexArray[v / 16];
hexChars[j * 2 + 1] = hexArray[v % 16];
}
return new String(hexChars);
}
/**
* Returns the MD5 sum as a String
*
* @return the string that contains the MD5 sum
*/
public String getMd5Sum() {
return md5Sum;
}
}
4 Answers 4
THe most striking issue I can see in your code is that you're severely limited in the size of files you can compute the hash on. This line of code is particularly jarring:
md.update(Files.readAllBytes(path));
You read the entire file in to memory... which limits you to two things:
- the size of your file is limited to the size of your memory ... unless
- you have lots of memory, but the file is>2GB at which point you can't create a larger array of bytes.
The reason that the MessageDigest methods contain the update
method, is to deal with larger files, and allow you to hash them in parts.
I would use this as an opportunity to learn the somewhat older NIO features: Channels, and Buffers.
Consider the code:
try (MessageDigest md = MessageDigest.getInstance("MD5")) {
try (FileChannel fc = FileChannel.open(Paths.get(filePath))) {
long meg = 1024 * 1024;
long len = fc.size();
long pos = 0;
while (pos < len) {
long size = Math.min(meg, len - pos);
MappedByteBuffer mbb = fc.map(MapMode.READ_ONLY, pos, size);
md.update(mbb);
pos += size;
}
} catch (IOException ioe) {
....
}
byte[] hash = md.digest();
.....
} catch ( NoSuchAlgorithmException nsae) {
....
}
The advantage of the above code is that is plays much closer to the operating system. In theory, it should be the fastest possible way for a Java program to read the file, and it should never transfer the actual file content in to the Java memory space. It removes at least one data copy between the file represented in the OS on disk, and the copy in memory.
There is no limit to the size of file you compute the hash on.
-
\$\begingroup\$ Thanks for including the example and explanation of FileChannel. I had not encountered it. Perhaps the
md.digest()
should occur inside thetry
block. A partially computed digest seems misleading. EDIT: I guess it depends on the elided code. Perhaps the exception is raised on catch. \$\endgroup\$user650881– user6508812017年01月20日 21:27:13 +00:00Commented Jan 20, 2017 at 21:27
I just wanted to comment on these lines, and the toHexString()
method:
//digest the bytes and generate an MD5 sum thats stored in an array of bytes
byte[] hash = md.digest();
//convert the byte array to its hex counter parts and store it as a string
md5Sum = toHexString(hash);
I would use this, instead:
// digest the bytes, generate an MD5 hash and store it as a hex string
md5sum = String.format("%032x", new BigInteger(1, md.digest()));
Reading the file in all at once may limit your ability to deal with large files. The MessageDigest
allows incremental updates so you could chunk the file. Assuming you choose to read it all at once, I would suggest explicitly stating the limitation in your code comments. Then if you ever have a problem you will know right where to look.
If the class is to be used heavily you might consider whether to cache the
MessageDigest
in a class variable. My understanding is that construction is
relatively expensive. If performance is a consideration in your application you
may wish to measure it.
I like that the variable names seem clearly chosen. The comments are generally helpful, but sometimes slip into purely descriptive.
// two instance variables, one to store the file path and one to store the MD5 sum
private String path;
private String md5Sum;
First consider that the instance variable path
is never referenced outside the
constructor and can be omitted.
Second, note that the comment simply restates the code in more words. I would
suggest omitting it or making it that clarifies the code. In this case, the
purpose of the md5Sum
might be relevant
// Cache for the computed sum
private String md5Sum;
I think the masking of the byte is one case where a comment would be helpful to head off confusion you might anticipate. Since a Java bytes is 8-bits it would appear to be unneeded.
// Mask on int conversion to prevent sign extension
v = bytes[j] & 0xFF;`
Using a class with fields is overkill here. This should be a utility function, like the one provided by Apache Commons Digest.
public class DigestUtils {
public static md5Hex(Path path) { ... }
}
Do not use e.printStackTrace
, even when your IDE generates that code for you. It is wrong. You must report any exceptions to the calling method, or alternatively log them and provide a sensible fallback value. An empty string (like in your code) is not sensible.
-
\$\begingroup\$ The performance loss of using this plus the fact that it requires dependencies is why I don't want to do it this way \$\endgroup\$SteelToe– SteelToe2017年01月22日 00:28:28 +00:00Commented Jan 22, 2017 at 0:28
-
\$\begingroup\$ There is no performance loss. I did not say you should use Apache Commons Digest, just model your API like theirs. \$\endgroup\$Roland Illig– Roland Illig2017年01月22日 09:59:59 +00:00Commented Jan 22, 2017 at 9:59
java.util.Base64
which does the work for you. \$\endgroup\$