I'm trying to get the diff between two directories, e.g.
dir1
  - changed.txt
  - deleted.txt
  - index.txt
  - nested
    - changed.txt
    - deleted.txt
    - index.txt
dir2
  - changed.txt
  - added.txt
  - index.txt
  - nested
    - changed.txt
    - added.txt
    - index.txt
Expected result:
added: ["added.txt", "nested/added.txt"]
deleted: ["deleted.txt", "nested/deleted.txt"]
changed: ["changed.txt", "nested/changed.txt"]
I was looking at the source code of the dir-diff crate and came up with the following solution. As I'm new to Rust, I'm looking for a better or cleaner way to achieve this, or some general feedback on my solution.
The code gets a directory walker for each of the two directories, sorted by file path. I then loop over both iterators. If the path of a is smaller than the path of b, I know the element exists only in the first directory, so it was removed. Added elements are detected in a similar fashion. In both cases I only advance one of the iterators. If the paths are equal, the file is in both directories and I simply compare the lengths (in a future version I'd like to check the hash as well). If either iterator ends, I have to loop through the remaining entries of the other one to get the last added / removed files.
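The merge step can be sketched in isolation, independent of walkdir and of the full program below. This is only an illustration under stated assumptions: both inputs are already sorted lists of relative paths, and `diff_sorted` is a hypothetical name that does not appear in the original code.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: diff two already-sorted lists of relative paths.
/// Returns (added, removed, common).
fn diff_sorted<'a>(
    from: &[&'a str],
    to: &[&'a str],
) -> (Vec<&'a str>, Vec<&'a str>, Vec<&'a str>) {
    let (mut added, mut removed, mut common) = (Vec::new(), Vec::new(), Vec::new());
    let (mut i, mut j) = (0, 0);
    while i < from.len() && j < to.len() {
        match from[i].cmp(to[j]) {
            // Only in `from`: advance the left side.
            Ordering::Less => {
                removed.push(from[i]);
                i += 1;
            }
            // Only in `to`: advance the right side.
            Ordering::Greater => {
                added.push(to[j]);
                j += 1;
            }
            // In both: advance both sides.
            Ordering::Equal => {
                common.push(from[i]);
                i += 1;
                j += 1;
            }
        }
    }
    // Whatever is left on either side has no counterpart on the other.
    removed.extend_from_slice(&from[i..]);
    added.extend_from_slice(&to[j..]);
    (added, removed, common)
}

fn main() {
    let from = ["changed.txt", "deleted.txt", "index.txt"];
    let to = ["added.txt", "changed.txt", "index.txt"];
    let (added, removed, common) = diff_sorted(&from, &to);
    println!("added: {:?}", added);
    println!("removed: {:?}", removed);
    println!("common: {:?}", common);
}
```

Index-based slices make the "drain the remainder" step explicit: once one side is exhausted, everything left on the other side is either added or removed.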
use error_chain::error_chain;
use std::cmp::Ordering;
use std::path::{Path, PathBuf};
use walkdir::{DirEntry, WalkDir};

error_chain! {
    foreign_links {
        Io(std::io::Error);
        Walkdir(walkdir::Error);
    }
}

fn compare_by_file_path(a: &DirEntry, b: &DirEntry) -> Ordering {
    a.path().cmp(b.path())
}

fn walk_dir<P: AsRef<Path>>(path: P) -> Result<walkdir::IntoIter> {
    let mut walkdir = WalkDir::new(path).sort_by(compare_by_file_path).into_iter();
    // Consume the first entry (the root directory itself) and surface its error, if any.
    if let Some(Err(e)) = walkdir.next() {
        Err(e.into())
    } else {
        Ok(walkdir)
    }
}

// Returns (added, removed, changed).
pub fn get_diff<U: AsRef<Path>, V: AsRef<Path>>(
    from: U,
    to: V,
) -> Result<(Vec<PathBuf>, Vec<PathBuf>, Vec<PathBuf>)> {
    let mut a_walker = walk_dir(from)?;
    let mut b_walker = walk_dir(to)?;
    let mut added: Vec<PathBuf> = Vec::new();
    let mut removed: Vec<PathBuf> = Vec::new();
    let mut changed: Vec<PathBuf> = Vec::new();

    // Note: unwrap() panics if either directory is empty.
    let mut a = a_walker.next().unwrap()?;
    let mut b = b_walker.next().unwrap()?;

    loop {
        match a.file_name().cmp(b.file_name()) {
            Ordering::Less => {
                removed.push(a.path().into());
                a = match a_walker.next() {
                    Some(entry) => entry?,
                    None => {
                        // The pending `b` has no counterpart left in `a`.
                        added.push(b.path().into());
                        break;
                    }
                };
            }
            Ordering::Greater => {
                added.push(b.path().into());
                b = match b_walker.next() {
                    Some(entry) => entry?,
                    None => {
                        // The pending `a` has no counterpart left in `b`.
                        removed.push(a.path().into());
                        break;
                    }
                };
            }
            Ordering::Equal => {
                if a.metadata()?.len() != b.metadata()?.len() {
                    changed.push(b.path().into());
                }
                a = match a_walker.next() {
                    Some(entry) => entry?,
                    None => break,
                };
                b = match b_walker.next() {
                    Some(entry) => entry?,
                    None => {
                        // `a` was already advanced, so record it before leaving.
                        removed.push(a.path().into());
                        break;
                    }
                };
            }
        }
    }

    // Drain whichever walker still has entries.
    for a in a_walker {
        removed.push(a?.path().into());
    }
    for b in b_walker {
        added.push(b?.path().into());
    }

    let output = (added, removed, changed);
    Ok(output)
}
Output:
(
["./tests/test2/nested/new.ts", "./tests/test2/new.ts"],
["./tests/test1/deleted.ts", "./tests/test1/nested/deleted.ts"],
["./tests/test2/changed.ts", "./tests/test2/nested/changed.ts"]
)
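The length comparison cannot detect an edit that leaves the file size unchanged, which is presumably why a hash check is planned. A minimal sketch of that check follows, assuming std's `DefaultHasher` as a stand-in for a proper content hash (a real tool would more likely use a cryptographic or otherwise stable hash from a dedicated crate). The names `file_hash` and `files_differ` are hypothetical and not part of the code above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical helper: hash a file's contents in fixed-size chunks,
/// so the whole file never has to be held in memory.
fn file_hash(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut hasher = DefaultHasher::new();
    let mut buf = [0u8; 8192];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}

/// Hypothetical helper: cheap length check first, hash only when lengths match.
fn files_differ(a: &Path, b: &Path) -> io::Result<bool> {
    if a.metadata()?.len() != b.metadata()?.len() {
        return Ok(true);
    }
    Ok(file_hash(a)? != file_hash(b)?)
}

fn main() -> io::Result<()> {
    // Two demo files with the same length but different bytes.
    let dir = std::env::temp_dir();
    let (a, b) = (dir.join("diff_demo_a.txt"), dir.join("diff_demo_b.txt"));
    std::fs::write(&a, "hello")?;
    std::fs::write(&b, "jello")?;
    println!("differ: {}", files_differ(&a, &b)?);
    Ok(())
}
```

Keeping the length check in front means the hash is only computed for files that cannot be told apart by size alone. The `DefaultHasher` output is not guaranteed to be stable across Rust releases, so it is only suitable for comparisons within a single run.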
1 Answer
I played around with the problem a bit, just for fun.
Here's a completely different approach:
use error_chain::error_chain;
use std::cmp::Ordering;
use std::path::{Path, PathBuf};
use walkdir::{DirEntry, WalkDir};

error_chain! {
    foreign_links {
        Io(std::io::Error);
        Walkdir(walkdir::Error);
    }
}

fn compare_by_file_path(a: &DirEntry, b: &DirEntry) -> Ordering {
    a.path().cmp(b.path())
}

fn walk_dir<P: AsRef<Path>>(path: P) -> Result<walkdir::IntoIter> {
    let mut walkdir = WalkDir::new(path).sort_by(compare_by_file_path).into_iter();
    if let Some(Err(e)) = walkdir.next() {
        Err(e.into())
    } else {
        Ok(walkdir)
    }
}

#[derive(Debug)]
pub enum Difference {
    Added(PathBuf),
    Removed(PathBuf),
    Changed(PathBuf),
}

pub fn get_diff<U: AsRef<Path>, V: AsRef<Path>>(
    from: U,
    to: V,
) -> Result<impl Iterator<Item = Result<Difference>>> {
    let mut a_walker = walk_dir(from)?;
    let mut b_walker = walk_dir(to)?;

    let mut a = None;
    let mut b = None;

    let mut get_next_change = move || {
        Ok(loop {
            if a.is_none() {
                a = a_walker.next().transpose()?;
            }
            if b.is_none() {
                b = b_walker.next().transpose()?;
            }
            if a.is_none() {
                break b.take().map(|b| Difference::Added(b.path().into()));
            } else if b.is_none() {
                break a.take().map(|a| Difference::Removed(a.path().into()));
            } else {
                match a
                    .as_ref()
                    .unwrap()
                    .file_name()
                    .cmp(b.as_ref().unwrap().file_name())
                {
                    Ordering::Less => {
                        break a.take().map(|a| Difference::Removed(a.path().into()));
                    }
                    Ordering::Greater => {
                        break b.take().map(|b| Difference::Added(b.path().into()));
                    }
                    Ordering::Equal => {
                        let a = a.take().unwrap();
                        let b = b.take().unwrap();
                        if a.metadata()?.len() != b.metadata()?.len() {
                            break Some(Difference::Changed(b.path().into()));
                        }
                    }
                }
            };
        })
    };

    Ok(std::iter::from_fn(move || get_next_change().transpose()))
}

fn main() {
    println!(
        "{:#?}",
        get_diff("dir1", "dir2")
            .unwrap()
            .collect::<Result<Vec<_>>>()
            .unwrap()
    );
}
[
    Added(
        "dir2/added.txt",
    ),
    Changed(
        "dir2/changed.txt",
    ),
    Removed(
        "dir1/deleted.txt",
    ),
    Added(
        "dir2/nested/added.txt",
    ),
    Changed(
        "dir2/nested/changed.txt",
    ),
    Removed(
        "dir1/nested/deleted.txt",
    ),
]
Although I think there isn't really any point to using iterators as long as we are using WalkDir::new(path).sort_by. It would only make sense if we used folder-wise iterators. But WalkDir.sort_by loads all files first, then sorts them, and then iterates over them. Meaning: the entire file list is already loaded into memory, and using an iterator gives no further benefit.
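A folder-wise variant along those lines could look roughly like the sketch below. It uses only `std::fs::read_dir` rather than walkdir, and compares one directory level at a time. The names `child_names` and `diff_dirs` are hypothetical, and the sketch assumes that a directory present on only one side is reported as a single entry rather than being descended into, which differs from the output shown above.

```rust
use std::collections::BTreeSet;
use std::ffi::OsString;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

#[derive(Debug, PartialEq)]
pub enum Difference {
    Added(PathBuf),
    Removed(PathBuf),
    Changed(PathBuf),
}

/// Hypothetical helper: sorted names of the direct children of `dir`.
fn child_names(dir: &Path) -> io::Result<BTreeSet<OsString>> {
    fs::read_dir(dir)?
        .map(|entry| entry.map(|e| e.file_name()))
        .collect()
}

/// Hypothetical folder-wise diff: each call loads the names of one
/// directory per side, then recurses into subdirectories found on both sides.
fn diff_dirs(from: &Path, to: &Path, out: &mut Vec<Difference>) -> io::Result<()> {
    let a = child_names(from)?;
    let b = child_names(to)?;
    for name in a.difference(&b) {
        out.push(Difference::Removed(from.join(name)));
    }
    for name in b.difference(&a) {
        out.push(Difference::Added(to.join(name)));
    }
    for name in a.intersection(&b) {
        let (pa, pb) = (from.join(name), to.join(name));
        let (ma, mb) = (fs::metadata(&pa)?, fs::metadata(&pb)?);
        if ma.is_dir() && mb.is_dir() {
            diff_dirs(&pa, &pb, out)?;
        } else if ma.is_dir() != mb.is_dir() || ma.len() != mb.len() {
            // Same length-only comparison as in the code above.
            out.push(Difference::Changed(pb));
        }
    }
    Ok(())
}

fn main() {
    let mut out = Vec::new();
    match diff_dirs(Path::new("dir1"), Path::new("dir2"), &mut out) {
        Ok(()) => println!("{:#?}", out),
        Err(e) => eprintln!("diff failed: {}", e),
    }
}
```

The trade-off is that results are collected into a `Vec` instead of being yielded lazily; turning this into a true iterator would need an explicit stack of pending directory pairs.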
What is the purpose of if let Some(Err(e)) = walkdir.next() in walk_dir?
walkdir.next() removes the . entry, the self-referential directory.