rust sort numbered directories (non-lexicographical)

Question 1

I am very new to rust and have been reading up and playing around to get a better understanding. I was trying to replicate a small task that I have done via bash scripting before, as a way to continue learning. The task involves getting the sub-directories of some base directory, where all sub-directories will be numbered. Then sort the sub-directories via their number in non-lexicographic order.

The directory structure would be expected to be something like:

base_dir/
 1/
 2/
 5/
 100/
 10256/
 7/
 23/

I have come up with 4 sorting methods that all work - experimentally - to sort a Vec<std::fs::DirEntry> in-place, but use differing approaches / styles. I am trying to get an understanding of which approach would be closest to general best practice using rust.

#1:

fn sort_sub_dirs1(dirs: &mut Vec<std::fs::DirEntry>) {
 dirs.sort_by(|d1, d2| {
 let d1_path = d1.path()
 let d2_path = d2.path()
 let default_order = d1_path.cmp(&d2_path);
 let Some(d1_str) = d1.path().to_str() else { return default_order; };
 let Some(d2_str) = d2.path().to_str() else { return default_order; };
 let Some(d1_name) = d1_str.split('/').last() else { return default_order; };
 let Some(d2_name) = d2_str.split('/').last() else { return default_order; };
 let Ok(d1_num) = d1_name.parse::<i32>() else { return default_order; };
 let Ok(d2_num) = d2_name.parse::<i32>() else { return default_order; };
 return d1_num.cmp(&d2_num);
 });
}

#2:

fn sort_sub_dirs2(dirs: &mut Vec<std::fs::DirEntry>) {
 dirs.sort_by(|d1, d2| {
 let d1_num = d1.file_name().to_str().unwrap_or("").parse::<i32>();
 let d2_num = d2.file_name().to_str().unwrap_or("").parse::<i32>();
 match (d1_num, d2_num) {
 (Ok(d1), Ok(d2)) => return d1.cmp(&d2),
 _ => return d1.path().cmp(&d2.path()),
 }
 });
}

#3:

fn sort_sub_dirs3(dirs: &mut Vec<std::fs::DirEntry>) {
 dirs.sort_by(|d1, d2| {
 let Ok(d1_num) = d1.file_name().to_str().unwrap_or("").parse::<i32>() else {
 return d1.path().cmp(&d2.path());
 };
 let Ok(d2_num) = d2.file_name().to_str().unwrap_or("").parse::<i32>() else {
 return d1.path().cmp(&d2.path());
 };
 return d1_num.cmp(&d2_num);
 });
}

#4

fn sort_sub_dirs4(dirs: &mut Vec<std::fs::DirEntry>) {
 dirs.sort_by(|d1, d2| {
 let d1_num = d1.file_name().to_str().map(|n| n.parse::<i32>());
 let d2_num = d2.file_name().to_str().map(|n| n.parse::<i32>());
 match (d1_num, d2_num) {
 (Some(Ok(d1)), Some(Ok(d2))) => return d1.cmp(&d2),
 _ => return d1.path().cmp(&d2.path()),
 }
 });
}

#2 and #4 seem the cleanest looking to me, but I am not a fan of the nested Some(Ok(...)) in the match statement with #4.

These all fallback to the existing path sorting (lexicographic) if we find a directory whose name is not parseable to i32

I may be way off the mark here as well...

Question 2

What, no regex? 😀

Question 3

Note: you may want to look at Natural Sorts, which will split a filename into series of digits and non-digits, and compare two digits blocks by interpreting them as integer.

First of all, let's get a clear program statement:

Sort a list of DirEntry base on the file names.
If two file names are numeric, compare them as integers.
Otherwise, compare them as paths.

Discrepancies

Why are your parsing the full path yourself in #1 while using the file_name() method in #2, #3, and #4.

In general, I would recommend always using file_name() (on stable):

While both path() and file_name() allocate, the latter allocates a smaller buffer.
file_name() already bakes in the logic to extract the last component of the path.

If you are willing to use nightly, a specialization for Unix using file_name_ref() would be even better, skipping the memory allocation altogether.

Least capable arguments

You should pass an &mut [...] rather than an &mut Vec<...> for reasons:

In your case, there is no reason for sort_sub_dirs to modify the length of dirs, adding or removing elements, reserving extra memory, etc...
&mut [...] is more flexible, it may come from a Vec, from an array, from a VecDeque, ...

Thus, it's better to generalize the method by taking the "least capable" argument type possible.

Check the relevance of your arguments

Is DirEntry even the appropriate type, here?

If your goal is to simply display the sorted list of names, then passing the filename, rather than the full DirEntry, would be more appropriate.

Once again, the least capable arguments are generally better, and in this particular case, it would avoid allocating again and again with file_name every time a comparison is necessary.

On the other hand, if you then intend to traverse the DirEntry itself -- for example to display a tree of the filesystem structure -- then keeping the DirEntry is definitely the right call.

Import Your Types

It's tiring to type std::fs::DirEntry every time, and thus more idiomatic to have a use std::fs::DirEntry at the top of the module so as to only need to type DirEntry elsewhere.

I will admit I tend to prefer NOT importing free functions the same way, and instead would import std::fs to invoke fs::canonicalize; not sure whether that's idiomatic, or if I'm just weird, however.

DRY

In each and every of your attempts, you duplicate the logic to extract the (potential) integer out of the DirEntry for both entries to be compared.

Rust allows defining local functions, take advantage of it!

Pick more visually distinguishable names

Your two variables, d1 and d2 are visually very similar, requiring more effort to distinguish from one another.

A simple left and right would be better:

It's clear which was the left argument, and which was the right one, when the time comes to compare them.
It's hard to misread them, what with their different starting letter and different length.

Don't perform unnecessary work

#1 is strictly worse than the others due to defining default_order ahead of time, which requires comparing the two entries, even if you don't end up using the default_order.

This could be salvaged by defining default_order as a closure, so it's only called as necessary, and thus only compares as necessary.

Don't unwrap

In general, it's considered preferable to avoid unwrap_or to provide a dummy¹.

It is better to use monadic operators to chain the operations, instead.

¹ If the logic dictates a default, that's fine, but in your case one has to reason about the behavior of "".parse(), and whether it'll error out or return 0 because "" is a dummy.

Flatten

When using monadic operations, you end up with nested Option<Result<...>> which is awkward to manipulate.

Instead, you should use flattening operations. In this case:

.ok() will turn a Result<T, E> into an Option<T>.
.and_then(...) will turn an Option<T> into an Option<U> from a FnOnce(T) -> Option<U> closure.

Combined, you get:

let x: Option<i32> = d1.file_name().to_str().and_then(|s| s.parse().ok());

Turbo fish

I must admit I personally prefer to avoid using the turbo-fish "operator" (ie, the ::< ... > in parse::<i32>() and instead favor annotating the type of the result.

I find that it helps focusing the expression on the exact transformation taking place, especially with longer, more complicated types.

if let else

If you can use modern versions of Rust, I would tend towards either if let or let else as a lighter alternative to match.

To avoid repeating yourself, you should match on the tuple of the two elements -- matching both at once -- rather than match on one, executing the bail out action, then match on the other, repeating the bail out action.

Which one to use depends on the situation:

let else is better suited as a "guard" statement: binding a variable after which further logic is applied, with the else case being a quick "exit" path.
if let is better suited when both if and else are simple.

You were correct in picking if else for #1 and #3, but should have favored if let instead of match for #2 and #4 -- and in general, whenever you have a 2-cases match and bind variables only in 1 case.

Putting it altogether

And thus, I find all 4 of your proposals flawed for one reason or another, and would therefore submit this one instead:

fn sort_sub_dirs(dirs: &mut [DirEntry]) {
 fn try_parse(entry: &DirEntry) -> Option<i32> {
 entry.file_name().to_str().and_then(|name| name.parse().ok())
 }
 dirs.sort_by(|left, right| {
 if let (Some(left), Some(right)) = (try_parse(left), try_parse(right)) {
 left.cmp(&right)
 } else {
 left.cmp(&right)
 }
 });
}

Better performance

The problem with this solution is that there's a lot of repeated work. sort will on average do O(N log N) comparisons, which means in our case the logic to extract the file name and possibly view as an integer will be done more than twice per entry.

That's a lot, ain't it?

If performance matters -- and it may not! -- we are better off applying a pre-processing stage to only extract/parse each entry one, then do the comparison, and finally sort the initial slice.

And fortunately for us, the standard library delivers with sort_by_cached_key!

fn sort_sub_dirs(dirs: &mut [DirEntry]) {
 #[derive(Eq, PartialEq)]
 struct Key {
 file_name: PathBuf,
 numeric: Option<i32>,
 }
 impl Ord for Key {
 fn cmp(&self, other: &Self) -> Ordering {
 if let (Some(left), Some(right)) = (self.numeric, other.numeric) {
 left.cmp(&right)
 } else {
 self.file_name.cmp(&self.file_name)
 }
 }
 }
 impl PartialOrd for Key {
 fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
 Some(self.cmp(other))
 }
 }
 dirs.sort_by_cached_key(|entry| {
 let file_name: PathBuf = entry.file_name().into();
 let numeric = file_name.to_str().and_then(|s| s.parse().ok());
 
 Key { file_name, numeric }
 });
}

Is `i32` the right type?

I do wonder how you picked i32, and I do think that a justification for using this type -- as a comment -- would be appropriate if any.

I must admit that the idea of a negative integer as a name is pretty strange to me, to start with, so I would have tended towards u.

Then, there's a question of range. Is 4-ish billions an acceptable upper-bound (u32), or should you go towards large types such as u64 or u128?

This will depend on your usecase, so I'll leave that up to you, but whichever choice you pick, it'd be nice to document the rationale.

Question 4

First and foremost, thank you for the thorough and informative review! There are many points you have brought up which will be helpful in improving my learning/skills. The i32 would not be a great type here. The sub-dirs will always be positive numbers in this case, so u32 would be a better type.

Question 5

To the point of favoring if let or let else over match, would the Ord impl of Key be better to use let (Some(left), Some(right)) = (self.numeric, other.numeric) else {... in place of the match?

Question 6

@KylePrice: Indeed you're right, the Ord impl of Key would be better expressed with an if let. I would reserve let else to "short-circuit" cases. I'll fix my code.

Matthieu M. Matthieu M. 6,3891 gold badge20 silver badges34 bronze badges · Answer 1 · 2023-02-04 16:20:16Z

Note: you may want to look at Natural Sorts, which will split a filename into series of digits and non-digits, and compare two digits blocks by interpreting them as integer.

First of all, let's get a clear program statement:

Sort a list of DirEntry base on the file names.
If two file names are numeric, compare them as integers.
Otherwise, compare them as paths.

Discrepancies

Why are your parsing the full path yourself in #1 while using the file_name() method in #2, #3, and #4.

In general, I would recommend always using file_name() (on stable):

While both path() and file_name() allocate, the latter allocates a smaller buffer.
file_name() already bakes in the logic to extract the last component of the path.

If you are willing to use nightly, a specialization for Unix using file_name_ref() would be even better, skipping the memory allocation altogether.

Least capable arguments

You should pass an &mut [...] rather than an &mut Vec<...> for reasons:

In your case, there is no reason for sort_sub_dirs to modify the length of dirs, adding or removing elements, reserving extra memory, etc...
&mut [...] is more flexible, it may come from a Vec, from an array, from a VecDeque, ...

Thus, it's better to generalize the method by taking the "least capable" argument type possible.

Check the relevance of your arguments

Is DirEntry even the appropriate type, here?

If your goal is to simply display the sorted list of names, then passing the filename, rather than the full DirEntry, would be more appropriate.

Once again, the least capable arguments are generally better, and in this particular case, it would avoid allocating again and again with file_name every time a comparison is necessary.

On the other hand, if you then intend to traverse the DirEntry itself -- for example to display a tree of the filesystem structure -- then keeping the DirEntry is definitely the right call.

Import Your Types

It's tiring to type std::fs::DirEntry every time, and thus more idiomatic to have a use std::fs::DirEntry at the top of the module so as to only need to type DirEntry elsewhere.

I will admit I tend to prefer NOT importing free functions the same way, and instead would import std::fs to invoke fs::canonicalize; not sure whether that's idiomatic, or if I'm just weird, however.

DRY

In each and every of your attempts, you duplicate the logic to extract the (potential) integer out of the DirEntry for both entries to be compared.

Rust allows defining local functions, take advantage of it!

Pick more visually distinguishable names

Your two variables, d1 and d2 are visually very similar, requiring more effort to distinguish from one another.

A simple left and right would be better:

It's clear which was the left argument, and which was the right one, when the time comes to compare them.
It's hard to misread them, what with their different starting letter and different length.

Don't perform unnecessary work

#1 is strictly worse than the others due to defining default_order ahead of time, which requires comparing the two entries, even if you don't end up using the default_order.

This could be salvaged by defining default_order as a closure, so it's only called as necessary, and thus only compares as necessary.

Don't unwrap

In general, it's considered preferable to avoid unwrap_or to provide a dummy¹.

It is better to use monadic operators to chain the operations, instead.

¹ If the logic dictates a default, that's fine, but in your case one has to reason about the behavior of "".parse(), and whether it'll error out or return 0 because "" is a dummy.

Flatten

When using monadic operations, you end up with nested Option<Result<...>> which is awkward to manipulate.

Instead, you should use flattening operations. In this case:

.ok() will turn a Result<T, E> into an Option<T>.
.and_then(...) will turn an Option<T> into an Option<U> from a FnOnce(T) -> Option<U> closure.

Combined, you get:

let x: Option<i32> = d1.file_name().to_str().and_then(|s| s.parse().ok());

Turbo fish

I must admit I personally prefer to avoid using the turbo-fish "operator" (ie, the ::< ... > in parse::<i32>() and instead favor annotating the type of the result.

I find that it helps focusing the expression on the exact transformation taking place, especially with longer, more complicated types.

if let else

If you can use modern versions of Rust, I would tend towards either if let or let else as a lighter alternative to match.

To avoid repeating yourself, you should match on the tuple of the two elements -- matching both at once -- rather than match on one, executing the bail out action, then match on the other, repeating the bail out action.

Which one to use depends on the situation:

let else is better suited as a "guard" statement: binding a variable after which further logic is applied, with the else case being a quick "exit" path.
if let is better suited when both if and else are simple.

You were correct in picking if else for #1 and #3, but should have favored if let instead of match for #2 and #4 -- and in general, whenever you have a 2-cases match and bind variables only in 1 case.

Putting it altogether

And thus, I find all 4 of your proposals flawed for one reason or another, and would therefore submit this one instead:

fn sort_sub_dirs(dirs: &mut [DirEntry]) {
 fn try_parse(entry: &DirEntry) -> Option<i32> {
 entry.file_name().to_str().and_then(|name| name.parse().ok())
 }
 dirs.sort_by(|left, right| {
 if let (Some(left), Some(right)) = (try_parse(left), try_parse(right)) {
 left.cmp(&right)
 } else {
 left.cmp(&right)
 }
 });
}

Better performance

The problem with this solution is that there's a lot of repeated work. sort will on average do O(N log N) comparisons, which means in our case the logic to extract the file name and possibly view as an integer will be done more than twice per entry.

That's a lot, ain't it?

If performance matters -- and it may not! -- we are better off applying a pre-processing stage to only extract/parse each entry one, then do the comparison, and finally sort the initial slice.

And fortunately for us, the standard library delivers with sort_by_cached_key!

fn sort_sub_dirs(dirs: &mut [DirEntry]) {
 #[derive(Eq, PartialEq)]
 struct Key {
 file_name: PathBuf,
 numeric: Option<i32>,
 }
 impl Ord for Key {
 fn cmp(&self, other: &Self) -> Ordering {
 if let (Some(left), Some(right)) = (self.numeric, other.numeric) {
 left.cmp(&right)
 } else {
 self.file_name.cmp(&self.file_name)
 }
 }
 }
 impl PartialOrd for Key {
 fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
 Some(self.cmp(other))
 }
 }
 dirs.sort_by_cached_key(|entry| {
 let file_name: PathBuf = entry.file_name().into();
 let numeric = file_name.to_str().and_then(|s| s.parse().ok());
 
 Key { file_name, numeric }
 });
}

Is `i32` the right type?

I do wonder how you picked i32, and I do think that a justification for using this type -- as a comment -- would be appropriate if any.

I must admit that the idea of a negative integer as a name is pretty strange to me, to start with, so I would have tended towards u.

Then, there's a question of range. Is 4-ish billions an acceptable upper-bound (u32), or should you go towards large types such as u64 or u128?

This will depend on your usecase, so I'll leave that up to you, but whichever choice you pick, it'd be nice to document the rationale.

First and foremost, thank you for the thorough and informative review! There are many points you have brought up which will be helpful in improving my learning/skills. The i32 would not be a great type here. The sub-dirs will always be positive numbers in this case, so u32 would be a better type.
To the point of favoring if let or let else over match, would the Ord impl of Key be better to use let (Some(left), Some(right)) = (self.numeric, other.numeric) else {... in place of the match?
@KylePrice: Indeed you're right, the Ord impl of Key would be better expressed with an if let. I would reserve let else to "short-circuit" cases. I'll fix my code.

Stack Exchange Network

rust sort numbered directories (non-lexicographical)

1 Answer 1

Discrepancies

Least capable arguments

Check the relevance of your arguments

Import Your Types

DRY

Pick more visually distinguishable names

Don't perform unnecessary work

Don't unwrap

Flatten

Turbo fish

if let else

Putting it altogether

Better performance

Is `i32` the right type?

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

rust sort numbered directories (non-lexicographical)

1 Answer 1

Discrepancies

Least capable arguments

Check the relevance of your arguments

Import Your Types

DRY

Pick more visually distinguishable names

Don't perform unnecessary work

Don't unwrap

Flatten

Turbo fish

if let else

Putting it altogether

Better performance

Is i32 the right type?

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Is `i32` the right type?