I am very new to rust and have been reading up and playing around to get a better understanding. I was trying to replicate a small task that I have done via bash scripting before, as a way to continue learning. The task involves getting the sub-directories of some base directory, where all sub-directories will be numbered. Then sort the sub-directories via their number in non-lexicographic order.
The directory structure would be expected to be something like:
base_dir/
1/
2/
5/
100/
10256/
7/
23/
I have come up with 4 sorting methods that all work - experimentally - to sort a Vec<std::fs::DirEntry>
in-place, but use differing approaches / styles. I am trying to get an understanding of which approach would be closest to general best practice using rust.
#1:
fn sort_sub_dirs1(dirs: &mut Vec<std::fs::DirEntry>) {
dirs.sort_by(|d1, d2| {
let d1_path = d1.path()
let d2_path = d2.path()
let default_order = d1_path.cmp(&d2_path);
let Some(d1_str) = d1.path().to_str() else { return default_order; };
let Some(d2_str) = d2.path().to_str() else { return default_order; };
let Some(d1_name) = d1_str.split('/').last() else { return default_order; };
let Some(d2_name) = d2_str.split('/').last() else { return default_order; };
let Ok(d1_num) = d1_name.parse::<i32>() else { return default_order; };
let Ok(d2_num) = d2_name.parse::<i32>() else { return default_order; };
return d1_num.cmp(&d2_num);
});
}
#2:
fn sort_sub_dirs2(dirs: &mut Vec<std::fs::DirEntry>) {
dirs.sort_by(|d1, d2| {
let d1_num = d1.file_name().to_str().unwrap_or("").parse::<i32>();
let d2_num = d2.file_name().to_str().unwrap_or("").parse::<i32>();
match (d1_num, d2_num) {
(Ok(d1), Ok(d2)) => return d1.cmp(&d2),
_ => return d1.path().cmp(&d2.path()),
}
});
}
#3:
fn sort_sub_dirs3(dirs: &mut Vec<std::fs::DirEntry>) {
dirs.sort_by(|d1, d2| {
let Ok(d1_num) = d1.file_name().to_str().unwrap_or("").parse::<i32>() else {
return d1.path().cmp(&d2.path());
};
let Ok(d2_num) = d2.file_name().to_str().unwrap_or("").parse::<i32>() else {
return d1.path().cmp(&d2.path());
};
return d1_num.cmp(&d2_num);
});
}
#4
fn sort_sub_dirs4(dirs: &mut Vec<std::fs::DirEntry>) {
dirs.sort_by(|d1, d2| {
let d1_num = d1.file_name().to_str().map(|n| n.parse::<i32>());
let d2_num = d2.file_name().to_str().map(|n| n.parse::<i32>());
match (d1_num, d2_num) {
(Some(Ok(d1)), Some(Ok(d2))) => return d1.cmp(&d2),
_ => return d1.path().cmp(&d2.path()),
}
});
}
#2 and #4 seem the cleanest looking to me, but I am not a fan of the nested Some(Ok(...))
in the match
statement with #4.
These all fallback to the existing path sorting (lexicographic) if we find a directory whose name is not parseable to i32
I may be way off the mark here as well...
-
\$\begingroup\$ What, no regex? π \$\endgroup\$J_H– J_H2023εΉ΄01ζ29ζ₯ 19:49:47 +00:00Commented Jan 29, 2023 at 19:49
1 Answer 1
Note: you may want to look at Natural Sorts, which will split a filename into series of digits and non-digits, and compare two digits blocks by interpreting them as integer.
First of all, let's get a clear program statement:
- Sort a list of
DirEntry
base on the file names. - If two file names are numeric, compare them as integers.
- Otherwise, compare them as paths.
Discrepancies
Why are your parsing the full path yourself in #1 while using the file_name()
method in #2, #3, and #4.
In general, I would recommend always using file_name()
(on stable):
- While both
path()
andfile_name()
allocate, the latter allocates a smaller buffer. file_name()
already bakes in the logic to extract the last component of the path.
If you are willing to use nightly, a specialization for Unix using file_name_ref()
would be even better, skipping the memory allocation altogether.
Least capable arguments
You should pass an &mut [...]
rather than an &mut Vec<...>
for reasons:
- In your case, there is no reason for
sort_sub_dirs
to modify the length ofdirs
, adding or removing elements, reserving extra memory, etc... &mut [...]
is more flexible, it may come from aVec
, from an array, from aVecDeque
, ...
Thus, it's better to generalize the method by taking the "least capable" argument type possible.
Check the relevance of your arguments
Is DirEntry
even the appropriate type, here?
If your goal is to simply display the sorted list of names, then passing the filename, rather than the full DirEntry
, would be more appropriate.
Once again, the least capable arguments are generally better, and in this particular case, it would avoid allocating again and again with file_name
every time a comparison is necessary.
On the other hand, if you then intend to traverse the DirEntry
itself -- for example to display a tree of the filesystem structure -- then keeping the DirEntry
is definitely the right call.
Import Your Types
It's tiring to type std::fs::DirEntry
every time, and thus more idiomatic to have a use std::fs::DirEntry
at the top of the module so as to only need to type DirEntry
elsewhere.
I will admit I tend to prefer NOT importing free functions the same way, and instead would import std::fs
to invoke fs::canonicalize
; not sure whether that's idiomatic, or if I'm just weird, however.
DRY
In each and every of your attempts, you duplicate the logic to extract the (potential) integer out of the DirEntry
for both entries to be compared.
Rust allows defining local functions, take advantage of it!
Pick more visually distinguishable names
Your two variables, d1
and d2
are visually very similar, requiring more effort to distinguish from one another.
A simple left
and right
would be better:
- It's clear which was the left argument, and which was the right one, when the time comes to compare them.
- It's hard to misread them, what with their different starting letter and different length.
Don't perform unnecessary work
#1 is strictly worse than the others due to defining default_order
ahead of time, which requires comparing the two entries, even if you don't end up using the default_order
.
This could be salvaged by defining default_order
as a closure, so it's only called as necessary, and thus only compares as necessary.
Don't unwrap
In general, it's considered preferable to avoid unwrap_or
to provide a dummy1.
It is better to use monadic operators to chain the operations, instead.
1 If the logic dictates a default, that's fine, but in your case one has to reason about the behavior of "".parse()
, and whether it'll error out or return 0 because ""
is a dummy.
Flatten
When using monadic operations, you end up with nested Option<Result<...>>
which is awkward to manipulate.
Instead, you should use flattening operations. In this case:
.ok()
will turn aResult<T, E>
into anOption<T>
..and_then(...)
will turn anOption<T>
into anOption<U>
from aFnOnce(T) -> Option<U>
closure.
Combined, you get:
let x: Option<i32> = d1.file_name().to_str().and_then(|s| s.parse().ok());
Turbo fish
I must admit I personally prefer to avoid using the turbo-fish "operator" (ie, the ::< ... >
in parse::<i32>()
and instead favor annotating the type of the result.
I find that it helps focusing the expression on the exact transformation taking place, especially with longer, more complicated types.
if let else
If you can use modern versions of Rust, I would tend towards either if let
or let else
as a lighter alternative to match
.
To avoid repeating yourself, you should match on the tuple of the two elements -- matching both at once -- rather than match on one, executing the bail out action, then match on the other, repeating the bail out action.
Which one to use depends on the situation:
let else
is better suited as a "guard" statement: binding a variable after which further logic is applied, with theelse
case being a quick "exit" path.if let
is better suited when bothif
andelse
are simple.
You were correct in picking if else
for #1 and #3, but should have favored if let
instead of match
for #2 and #4 -- and in general, whenever you have a 2-cases match and bind variables only in 1 case.
Putting it altogether
And thus, I find all 4 of your proposals flawed for one reason or another, and would therefore submit this one instead:
fn sort_sub_dirs(dirs: &mut [DirEntry]) {
fn try_parse(entry: &DirEntry) -> Option<i32> {
entry.file_name().to_str().and_then(|name| name.parse().ok())
}
dirs.sort_by(|left, right| {
if let (Some(left), Some(right)) = (try_parse(left), try_parse(right)) {
left.cmp(&right)
} else {
left.cmp(&right)
}
});
}
Better performance
The problem with this solution is that there's a lot of repeated work. sort
will on average do O(N log N)
comparisons, which means in our case the logic to extract the file name and possibly view as an integer will be done more than twice per entry.
That's a lot, ain't it?
If performance matters -- and it may not! -- we are better off applying a pre-processing stage to only extract/parse each entry one, then do the comparison, and finally sort the initial slice.
And fortunately for us, the standard library delivers with sort_by_cached_key
!
fn sort_sub_dirs(dirs: &mut [DirEntry]) {
#[derive(Eq, PartialEq)]
struct Key {
file_name: PathBuf,
numeric: Option<i32>,
}
impl Ord for Key {
fn cmp(&self, other: &Self) -> Ordering {
if let (Some(left), Some(right)) = (self.numeric, other.numeric) {
left.cmp(&right)
} else {
self.file_name.cmp(&self.file_name)
}
}
}
impl PartialOrd for Key {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
dirs.sort_by_cached_key(|entry| {
let file_name: PathBuf = entry.file_name().into();
let numeric = file_name.to_str().and_then(|s| s.parse().ok());
Key { file_name, numeric }
});
}
Is i32
the right type?
I do wonder how you picked i32
, and I do think that a justification for using this type -- as a comment -- would be appropriate if any.
I must admit that the idea of a negative integer as a name is pretty strange to me, to start with, so I would have tended towards u
.
Then, there's a question of range. Is 4-ish billions an acceptable upper-bound (u32
), or should you go towards large types such as u64
or u128
?
This will depend on your usecase, so I'll leave that up to you, but whichever choice you pick, it'd be nice to document the rationale.
-
\$\begingroup\$ First and foremost, thank you for the thorough and informative review! There are many points you have brought up which will be helpful in improving my learning/skills. The
i32
would not be a great type here. The sub-dirs will always be positive numbers in this case, sou32
would be a better type. \$\endgroup\$Kyle Price– Kyle Price2023εΉ΄02ζ09ζ₯ 19:09:58 +00:00Commented Feb 9, 2023 at 19:09 -
\$\begingroup\$ To the point of favoring
if let
orlet else
over match, would theOrd
impl ofKey
be better to uselet (Some(left), Some(right)) = (self.numeric, other.numeric) else {...
in place of the match? \$\endgroup\$Kyle Price– Kyle Price2023εΉ΄02ζ09ζ₯ 19:14:34 +00:00Commented Feb 9, 2023 at 19:14 -
\$\begingroup\$ @KylePrice: Indeed you're right, the
Ord
impl ofKey
would be better expressed with anif let
. I would reservelet else
to "short-circuit" cases. I'll fix my code. \$\endgroup\$Matthieu M.– Matthieu M.2023εΉ΄02ζ10ζ₯ 08:01:44 +00:00Commented Feb 10, 2023 at 8:01