I've written a small program that will accept a directory path and recursively calculate the MD5 hash of each file.
use std::{env, fs};
use md5::Digest;
fn main() {
let args: Vec<String> = env::args().collect();
let mut entries = get_files(&args[1]);
}
#[derive(Debug)]
struct FileInfo {
full_path: String,
hash: Option<Digest>,
}
fn get_file_info(file_path: &str) -> FileInfo {
FileInfo {
full_path: file_path.to_owned(),
hash: match fs::read(file_path) {
Ok(d) => Some(md5::compute(d)),
Err(_) => None,
},
}
}
fn get_files(dir_path: &str) -> Vec<FileInfo> {
fs::read_dir(dir_path)
.expect("Couldnt read path")
.filter(|path| path.is_ok())
.flat_map(|path| {
let path_buf = path
.as_ref()
.and_then(|p| Ok(p.path().to_owned()))
.expect("Unable to read path");
let file_path = path_buf.to_str().expect("msg");
let is_file = path
.and_then(|p| p.file_type())
.and_then(|ft| Ok(ft.is_file()))
.expect("Unable to read path");
if is_file {
vec![get_file_info(file_path)]
} else {
get_files(file_path)
}
})
.collect()
}
I'm mainly concerned about writing more idiomatic rust and performance. I think most improvements could be made to get_files( dir_path: &str ) -> Vec<FileInfo>
but I'm not sure what to change.
3 Answers 3
This is some good code for a (presumably) beginner.
Use (more) libraries
You can make your life easier, by using additional libraries like walkdir
to traverse the directories.
Additionally, you can use clap
to let it parse your command line arguments.
In your current code, further arguments to your program would be silently swallowed, which is not what a user may expect.
Implement useful traits
You can implement the instatiation of the FileInfo
, which really constitutes a HashedFile
by implementing TryFrom
.
Avoid panics
Panicking is not a good way to handle errors, especially in an iterator.
If you're not interested in paths that failed to traverse, you can just filter them out or map them to respective errors.
Panicking would result in unwinding the stack and terminating the thread, thus ending your program, with zero chances for the user to process further, possibly valid items of the iterator.
Use generic types
Currently you limit the user to pass a &str
as path to the base directory. Consider using impl AsRef<Path>
intead, which will still allow to pass &str
but also allow users to pass specialized Path
objects.
Suggested change
Cargo.toml
[package]
name = "file_hashes"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
clap = { version = "4.3.19", features = ["derive"] }
md5 = "0.7.0"
walkdir = "2.3.3"
src/lib.rs
use md5::{compute, Digest};
use std::fmt::{Display, Formatter, LowerHex};
use std::fs::read;
use std::path::{Path, PathBuf};
use walkdir::WalkDir;
pub struct HashedFile {
path_buf: PathBuf,
hash: Digest,
}
impl Display for HashedFile {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
self.path_buf
.to_str()
.ok_or_else(std::fmt::Error::default)
.and_then(|path| write!(f, "{path}: "))?;
LowerHex::fmt(&self.hash, f)
}
}
impl TryFrom<PathBuf> for HashedFile {
type Error = std::io::Error;
fn try_from(path_buf: PathBuf) -> Result<Self, Self::Error> {
Ok(Self {
hash: compute(read(path_buf.as_path())?),
path_buf,
})
}
}
pub fn hash_dir(dir: impl AsRef<Path>) -> impl Iterator<Item = std::io::Result<HashedFile>> {
WalkDir::new(dir.as_ref())
.into_iter()
.filter_map(Result::ok)
.filter(|entry| entry.file_type().is_file())
.map(|entry| HashedFile::try_from(entry.into_path()))
}
src/main.rs
use clap::Parser;
use file_hashes::hash_dir;
#[derive(Debug, Parser)]
struct Args {
directory: String,
}
fn main() {
hash_dir(Args::parse().directory).for_each(|result| match result {
Ok(file_hash) => println!("{file_hash}"),
Err(error) => eprintln!("{error}"),
});
}
-
\$\begingroup\$ I am a Rust beginner - I wish it weren't so noticeable! Thanks for outlining these more idiomatic patterns. I will be refactoring my code to follow these conventions. \$\endgroup\$etchesketch– etchesketch2023年08月03日 03:02:10 +00:00Commented Aug 3, 2023 at 3:02
Nice little utility.
Two big remarks:
- Prefer a lazy iterator, instead of allocating vec storage for each pathname.
- Carefully distinguish file / dir from special files.
The main()
assignment of entries
could be
fleshed out a little, perhaps displaying pathnames and hashes.
Notice that it cannot begin to do that until the entire
subtree has been read in.
naming
fn get_file_info(
...
Ok(d) => Some(md5::compute(d)),
I know it's a local variable and all,
so documentation burden is low.
But still, please name it data
instead of cryptic d
.
(Or at least, that's how I was mentally pronouncing
your identifier.)
diagnostic messages
.expect("Couldnt read path")
nit, typo: missing '
apostrophe in "Couldn't".
When any one of these four expect
s trigger,
the first question a maintenance engineer will
have is: "Which path?"
Is it easy in Rust to include such diagnostic details?
The .expect("msg")
text is unique,
so I could find the relevant line of source code.
But maybe it could be a little more chatty?
special files
if is_file {
vec![get_file_info(file_path)]
} else {
get_files(file_path)
}
The is_file
test chases symlinks,
which I imagine is a Good Thing for your use case.
The else
clause suggests that there are ordinary files
and there are directories in the world, that's it.
I'm concerned about hitting a named FIFO,
a character special /dev/zero, a block special,
things like that.
I would feel better about this code if we
explicitly asked if the pathname is a directory.
Though I suppose that strictly speaking,
with the current code the user deserves what he asked for.
Out-of-scope feature request: Some recursive descent
utilities
include a -x
or -xdev
switch,
to avoid crossing over mount points.
You might choose to implement something similar.
Alternatively, you might farm out such details to /usr/bin/find,
and simply accept a stream of pathnames which you compute md5's for.
I am no Rust expert, and I know little about idiomatic usage.
But the is_file
processing felt a bit awkward.
Maybe it could be structured similarly
to the .filter(|path| path.is_ok())
filtering?
This code achieves its design goals.
I would be willing to delegate or accept maintenance tasks on it.
-
\$\begingroup\$
data
would also be a suboptimal variable name. It's aVec<u8>
, so I'd call itbytes
. Nice review, though. \$\endgroup\$Richard Neumann– Richard Neumann2023年08月01日 20:59:26 +00:00Commented Aug 1, 2023 at 20:59 -
\$\begingroup\$ Thanks for your feedback. My error message are woefully vague. I will definitely improve on those. \$\endgroup\$etchesketch– etchesketch2023年08月03日 03:00:34 +00:00Commented Aug 3, 2023 at 3:00
Consider Refactoring Long Closures into Helper Functions
This closure would be more readable as a helper function, perhaps expand_path
:
.flat_map(|path| {
let path_buf = path
.as_ref()
.and_then(|p| Ok(p.path().to_owned()))
.expect("Unable to read path");
let file_path = path_buf.to_str().expect("msg");
let is_file = path
.and_then(|p| p.file_type())
.and_then(|ft| Ok(ft.is_file()))
.expect("Unable to read path");
if is_file {
vec![get_file_info(file_path)]
} else {
get_files(file_path)
}
})
Which would then become .flat_map(expand_path)
. (I won’t repeat the advice to flatten to lazy iterators rather than Vec
, which would make the return type a slightly more complicated impl
, but I endorse it.)
In this helper function:
fn get_file_info(file_path: &str) -> FileInfo {
FileInfo {
full_path: file_path.to_owned(),
hash: match fs::read(file_path) {
Ok(d) => Some(md5::compute(d)),
Err(_) => None,
},
}
}
It’s fine, but consider as an alternative:
hash: fs::read(file_path)
.ok()
.map(md5::compute)
That’s if you want to ignore the errors silently. You might want to report them with eprintln!
and continue, instead.
Panic on Logic Errors, not Runtime Errors
Panic messages are a great way to help developers diagnose bugs. You can even set an environment variable so that they cough up a stack trace at the point of error, along with a source line. They are an appropriate way to detect bugs in the program that indicate something has gone wrong, instead of waiting for them to show up as a bug in the visible behavior later and then trying to figure out how those could have happened.
Ideally, though, an end user should never see a panic. If we know how an error can occur, we should either report that in a user-friendly way, or recover. Things like a file not being readable aren’t logic errors that indicate a bug in the program; they’re things that happen normally.
Consider skipping unreadable paths, possibly with an error message to eprintln!
. For example, you might call .filter_map
on a function that prints an error message and returns None
for an unreadable file, or Some(hash)
on success.
-
\$\begingroup\$ I had considered moving some functionality into another function. Your simplified
get_file_info
is helping me see that I have some gaps in my understanding how Result and Option works. Thanks for your suggestsions! \$\endgroup\$etchesketch– etchesketch2023年08月03日 03:05:46 +00:00Commented Aug 3, 2023 at 3:05
cargo fmt
on your code to adhere to the one any only true code style handed down to us from the heavens ;-). But seriously, this would make your code more easy to read for others if it'd conform to a standard code style (-> egoless programming) \$\endgroup\$cargo fmt
was a thing. \$\endgroup\$