Get the diff between two directories in Rust

Question 1

I'm trying to get the diff between two directories, e.g.

dir1
 - changed.txt
 - deleted.txt
 - index.txt
 - nested
    - changed.txt
    - deleted.txt
    - index.txt
dir2
 - changed.txt
 - added.txt
 - index.txt
 - nested
    - changed.txt
    - added.txt
    - index.txt

added: ["added.txt", "nested/added.txt"]
deleted: ["deleted.txt", "nested/deleted.txt"]
changed: ["changed.txt", "nested/changed.txt"]

I was looking at the source code of the dir-diff crate and came up with the following solution. As I'm new to Rust, I'm looking for a better or cleaner way to achieve this, or some general feedback on my solution.

The code gets a directory walker for each of the two directories, sorted by file path. I then loop over both iterators. If the path of a is smaller than the path of b, the entry exists only in the first directory, so it was removed. If it is greater, the entry exists only in the second directory, so it was added. In both cases I advance only one of the iterators. If the paths are equal, the file is in both directories and I simply compare the lengths (in a future version I'd like to compare hashes as well). If either iterator ends, I have to loop through the remaining entries of the other one to collect the last added / removed files.

use error_chain::error_chain;
use std::cmp::Ordering;
use std::path::{Path, PathBuf};
use walkdir::{DirEntry, WalkDir};

error_chain! {
    foreign_links {
        Io(std::io::Error);
        Walkdir(walkdir::Error);
    }
}

fn compare_by_file_path(a: &DirEntry, b: &DirEntry) -> Ordering {
    a.path().cmp(b.path())
}

// Returns a sorted walker with its first entry already consumed.
fn walk_dir<P: AsRef<Path>>(path: P) -> Result<walkdir::IntoIter> {
    let mut walkdir = WalkDir::new(path).sort_by(compare_by_file_path).into_iter();
    if let Some(Err(e)) = walkdir.next() {
        Err(e.into())
    } else {
        Ok(walkdir)
    }
}

// Returns (added, removed, changed).
pub fn get_diff<U: AsRef<Path>, V: AsRef<Path>>(
    from: U,
    to: V,
) -> Result<(Vec<PathBuf>, Vec<PathBuf>, Vec<PathBuf>)> {
    let mut a_walker = walk_dir(from)?;
    let mut b_walker = walk_dir(to)?;
    let mut added: Vec<PathBuf> = Vec::new();
    let mut removed: Vec<PathBuf> = Vec::new();
    let mut changed: Vec<PathBuf> = Vec::new();
    let mut a = a_walker.next().unwrap()?;
    let mut b = b_walker.next().unwrap()?;
    loop {
        match a.file_name().cmp(b.file_name()) {
            // Only in `from`: removed.
            Ordering::Less => {
                removed.push(a.path().into());
                a = match a_walker.next() {
                    Some(entry) => entry?,
                    None => break,
                };
            }
            // Only in `to`: added.
            Ordering::Greater => {
                added.push(b.path().into());
                b = match b_walker.next() {
                    Some(entry) => entry?,
                    None => break,
                };
            }
            // In both: changed if the lengths differ.
            Ordering::Equal => {
                if a.metadata()?.len() != b.metadata()?.len() {
                    changed.push(b.path().into());
                }
                a = match a_walker.next() {
                    Some(entry) => entry?,
                    None => break,
                };
                b = match b_walker.next() {
                    Some(entry) => entry?,
                    None => break,
                };
            }
        }
    }
    // Drain whichever walker still has entries.
    for a in a_walker {
        removed.push(a?.path().into());
    }
    for b in b_walker {
        added.push(b?.path().into());
    }
    Ok((added, removed, changed))
}

Output:

(
 ["./tests/test2/nested/new.ts", "./tests/test2/new.ts"],
 ["./tests/test1/deleted.ts", "./tests/test1/nested/deleted.ts"],
 ["./tests/test2/changed.ts", "./tests/test2/nested/changed.ts"]
)
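
The merge loop described in the question can also be sketched on in-memory data, which makes it easier to test without touching the filesystem. This is only a rough illustration: `Diff` and `diff_sorted` are made-up names, and the sketch assumes that both listings hold paths relative to their own root and are sorted by the same ordering that the comparison uses.

```rust
use std::cmp::Ordering;

/// Paths that differ between two listings (hypothetical type for this sketch).
#[derive(Debug, Default, PartialEq)]
pub struct Diff {
    pub added: Vec<String>,
    pub removed: Vec<String>,
    pub changed: Vec<String>,
}

/// Merge-style comparison of two listings of (relative path, file length).
/// Assumption: both slices are sorted by path.
pub fn diff_sorted(old: &[(&str, u64)], new: &[(&str, u64)]) -> Diff {
    let mut diff = Diff::default();
    let (mut i, mut j) = (0, 0);
    while i < old.len() && j < new.len() {
        let (old_path, old_len) = old[i];
        let (new_path, new_len) = new[j];
        match old_path.cmp(new_path) {
            // Present only in `old`: removed. Advance the old side only.
            Ordering::Less => {
                diff.removed.push(old_path.to_string());
                i += 1;
            }
            // Present only in `new`: added. Advance the new side only.
            Ordering::Greater => {
                diff.added.push(new_path.to_string());
                j += 1;
            }
            // Present in both: changed if the lengths differ. Advance both.
            Ordering::Equal => {
                if old_len != new_len {
                    diff.changed.push(old_path.to_string());
                }
                i += 1;
                j += 1;
            }
        }
    }
    // Whatever is left on either side has no counterpart on the other.
    diff.removed.extend(old[i..].iter().map(|(p, _)| p.to_string()));
    diff.added.extend(new[j..].iter().map(|(p, _)| p.to_string()));
    diff
}
```

Because the indices are advanced independently, the current element of one side is never dropped when the other side runs out first.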

Question 2

The first thing that comes to my mind: for larger filesystems, an iterator-based approach would probably be beneficial, to avoid large vector copies.

Question 3

What is the purpose of if let Some(Err(e)) = walkdir.next() in walk_dir?

Question 4

@Finomnis, it is because of this: docs.rs/walkdir/latest/walkdir/struct.IntoIter.html#method.next. The errors are returned wrapped in Some.
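
To make the three possible results of next() on an iterator of Results concrete, here is a small sketch with a stand-in iterator instead of walkdir. `first_item` is a hypothetical helper written for illustration; only `Option::transpose` is a real std method.

```rust
/// Pulls the first item from an iterator of `Result`s, similar to how
/// `walk_dir` pulls the first entry. Hypothetical helper for illustration.
pub fn first_item<I, T, E>(iter: &mut I) -> Result<Option<T>, E>
where
    I: Iterator<Item = Result<T, E>>,
{
    // `next()` yields Option<Result<T, E>>:
    //   None         -> the iterator is exhausted
    //   Some(Ok(v))  -> an item was produced
    //   Some(Err(e)) -> producing the item failed
    // `transpose()` turns that into Result<Option<T>, E>, so the caller
    // can use `?` on the error and still see whether an item existed.
    iter.next().transpose()
}
```

With this shape, `if let Some(Err(e)) = iter.next()` and `iter.next().transpose()?` handle the same error case; the second form also keeps the item instead of discarding it.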

Question 5

@Finomnis I would agree, but I have no clue how to implement it with iterators efficiently. The iterator would not guarantee that the files in both directories are visited in the same order afaik. So I would have to go through the iterators multiple times to detect if a file is missing or was added / changed. This would increase the complexity but reduce the memory footprint I guess. I actually tried it this way first but had problems with the lifetimes of the nested iterators.
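
If the two listings cannot be assumed to arrive in the same order, one possible middle ground is to buffer only one side in a map keyed by relative path and stream the other side against it, so each directory is visited once. The following is a sketch under that assumption, working on in-memory (path, length) pairs; `Difference` and `diff_unordered` are illustrative names, not part of any crate.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
pub enum Difference {
    Added(String),
    Removed(String),
    Changed(String),
}

/// Compares two unordered listings of (relative path, file length).
/// Only the `old` side is buffered; the `new` side is consumed as a stream.
pub fn diff_unordered<I, J>(old: I, new: J) -> Vec<Difference>
where
    I: IntoIterator<Item = (String, u64)>,
    J: IntoIterator<Item = (String, u64)>,
{
    let mut remaining: BTreeMap<String, u64> = old.into_iter().collect();
    let mut out = Vec::new();
    for (path, len) in new {
        // Removing the entry marks it as matched.
        match remaining.remove(&path) {
            None => out.push(Difference::Added(path)),
            Some(old_len) if old_len != len => out.push(Difference::Changed(path)),
            Some(_) => {}
        }
    }
    // Whatever was never matched exists only on the old side.
    out.extend(remaining.into_keys().map(Difference::Removed));
    out
}
```

The trade-off is memory proportional to one directory tree rather than to a single directory level, in exchange for not needing any ordering guarantee.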

Question 6

It seems the first walkdir.next() also removes the . entry, the self-referential directory.

Question 7

I played around with the problem a bit, just for fun.

Here's a completely different approach:

use error_chain::error_chain;
use std::cmp::Ordering;
use std::path::{Path, PathBuf};
use walkdir::{DirEntry, WalkDir};

error_chain! {
    foreign_links {
        Io(std::io::Error);
        Walkdir(walkdir::Error);
    }
}

fn compare_by_file_path(a: &DirEntry, b: &DirEntry) -> Ordering {
    a.path().cmp(b.path())
}

fn walk_dir<P: AsRef<Path>>(path: P) -> Result<walkdir::IntoIter> {
    let mut walkdir = WalkDir::new(path).sort_by(compare_by_file_path).into_iter();
    if let Some(Err(e)) = walkdir.next() {
        Err(e.into())
    } else {
        Ok(walkdir)
    }
}

#[derive(Debug)]
pub enum Difference {
    Added(PathBuf),
    Removed(PathBuf),
    Changed(PathBuf),
}

pub fn get_diff<U: AsRef<Path>, V: AsRef<Path>>(
    from: U,
    to: V,
) -> Result<impl Iterator<Item = Result<Difference>>> {
    let mut a_walker = walk_dir(from)?;
    let mut b_walker = walk_dir(to)?;
    let mut a = None;
    let mut b = None;
    let mut get_next_change = move || {
        Ok(loop {
            if a.is_none() {
                a = a_walker.next().transpose()?;
            }
            if b.is_none() {
                b = b_walker.next().transpose()?;
            }
            if a.is_none() {
                break b.take().map(|b| Difference::Added(b.path().into()));
            } else if b.is_none() {
                break a.take().map(|a| Difference::Removed(a.path().into()));
            } else {
                match a
                    .as_ref()
                    .unwrap()
                    .file_name()
                    .cmp(b.as_ref().unwrap().file_name())
                {
                    Ordering::Less => {
                        break a.take().map(|a| Difference::Removed(a.path().into()));
                    }
                    Ordering::Greater => {
                        break b.take().map(|b| Difference::Added(b.path().into()));
                    }
                    Ordering::Equal => {
                        let a = a.take().unwrap();
                        let b = b.take().unwrap();
                        if a.metadata()?.len() != b.metadata()?.len() {
                            break Some(Difference::Changed(b.path().into()));
                        }
                    }
                }
            };
        })
    };
    Ok(std::iter::from_fn(move || get_next_change().transpose()))
}

fn main() {
    println!(
        "{:#?}",
        get_diff("dir1", "dir2")
            .unwrap()
            .collect::<Result<Vec<_>>>()
            .unwrap()
    );
}

[
 Added(
 "dir2/added.txt",
 ),
 Changed(
 "dir2/changed.txt",
 ),
 Removed(
 "dir1/deleted.txt",
 ),
 Added(
 "dir2/nested/added.txt",
 ),
 Changed(
 "dir2/nested/changed.txt",
 ),
 Removed(
 "dir1/nested/deleted.txt",
 ),
]

Although I think there isn't really any point in using iterators as long as we are using WalkDir::new(path).sort_by. It would only make sense if we used folder-wise iterators. But sort_by has to load entries into memory before it can sort them and iterate over them (as far as I can tell from the walkdir source, it buffers each directory's listing). Meaning: those entries are already in memory, and using an iterator gives little further benefit.
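
To illustrate what a folder-wise comparison could look like, here is a rough sketch that uses only std::fs::read_dir and holds one directory level per side in memory at a time. The names `list_level` and `diff_level` are made up, lengths are compared as in the code above, and a directory that exists on only one side is reported as a single entry rather than being walked.

```rust
use std::collections::BTreeMap;
use std::ffi::OsString;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

#[derive(Debug, PartialEq)]
pub enum Difference {
    Added(PathBuf),
    Removed(PathBuf),
    Changed(PathBuf),
}

/// Lists one directory level as name -> (is_dir, len), ordered by name.
fn list_level(dir: &Path) -> io::Result<BTreeMap<OsString, (bool, u64)>> {
    let mut level = BTreeMap::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        level.insert(entry.file_name(), (meta.is_dir(), meta.len()));
    }
    Ok(level)
}

/// Compares `from` and `to` one folder at a time.
/// `rel` is the current folder, relative to both roots.
pub fn diff_level(
    from: &Path,
    to: &Path,
    rel: &Path,
    out: &mut Vec<Difference>,
) -> io::Result<()> {
    let old = list_level(&from.join(rel))?;
    let mut new = list_level(&to.join(rel))?;
    for (name, (old_is_dir, old_len)) in old {
        let path = rel.join(&name);
        match new.remove(&name) {
            None => out.push(Difference::Removed(path)),
            // A directory on both sides: descend into it.
            Some((true, _)) if old_is_dir => diff_level(from, to, &path, out)?,
            Some((new_is_dir, new_len)) => {
                if new_is_dir != old_is_dir || new_len != old_len {
                    out.push(Difference::Changed(path));
                }
            }
        }
    }
    // Names never matched exist only in `to`.
    out.extend(new.into_keys().map(|name| Difference::Added(rel.join(name))));
    Ok(())
}
```

A call would start at the roots with an empty relative path, e.g. `diff_level(Path::new("dir1"), Path::new("dir2"), Path::new(""), &mut out)`. The reported paths are relative, so the two roots do not need to share a prefix.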

Finomnis · Answer 1 · 2022-07-14 19:31:10Z

