LINQ Extension
I present a LINQ extension method ArgBy that returns the item of a sequence selected by the specified predicate. Two additional methods, MinBy and MaxBy, are provided as special cases. Its null-handling specification matches that of the existing LINQ methods (see link).
The purpose is to replace a two-pass function:
var max = collection.Max(item => item.Value);
var argMax = collection.First(item => item.Value == max);
by a single-pass function:
var argMax = collection.MaxBy(x => x.Value);
Questions
- Is this method usable in practice?
- Is this correct LINQ style?
- Am I compliant to the existing LINQ specification of equivalent methods?
- Are there edge cases or optimisations that require more thought?
Source Code
ArgBy iterates the source sequence and tests each item against a predicate. Any match on the predicate overwrites the reference that the remaining items are tested against, so the last match is returned. As with the existing LINQ implementation (see the reference source of Min, for instance), there is slightly different behavior for value types versus null-assignable types. There is additional complexity over the existing methods, since we include a key selector, and a key itself can also be either a value type or null-assignable.
public static class LinqExtension
{
public static TSource ArgBy<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
Func<(TKey Current, TKey Previous), bool> predicate)
{
if (source == null) throw new ArgumentNullException(nameof(source));
if (keySelector == null) throw new ArgumentNullException(nameof(keySelector));
if (predicate == null) throw new ArgumentNullException(nameof(predicate));
var value = default(TSource);
var key = default(TKey);
if (value == null)
{
foreach (var other in source)
{
if (other == null) continue;
var otherKey = keySelector(other);
if (otherKey == null) continue;
if (value == null || predicate((otherKey, key)))
{
value = other;
key = otherKey;
}
}
return value;
}
else
{
bool hasValue = false;
foreach (var other in source)
{
var otherKey = keySelector(other);
if (otherKey == null) continue;
if (hasValue)
{
if (predicate((otherKey, key)))
{
value = other;
key = otherKey;
}
}
else
{
value = other;
key = otherKey;
hasValue = true;
}
}
if (hasValue) return value;
throw new InvalidOperationException("Sequence contains no elements");
}
}
}
Two additional methods are provided to get the item with, respectively, the minimum or maximum key given a key selector.
public static TSource MinBy<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer = null)
{
if (comparer == null) comparer = Comparer<TKey>.Default;
return source.ArgBy(keySelector, lag => comparer.Compare(lag.Current, lag.Previous) < 0);
}
public static TSource MaxBy<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer = null)
{
if (comparer == null) comparer = Comparer<TKey>.Default;
return source.ArgBy(keySelector, lag => comparer.Compare(lag.Current, lag.Previous) > 0);
}
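As a quick usage sketch (my own illustration, not part of the original post), calling these with an explicit comparer might look like this:

```csharp
// Sketch, not from the original post: MinBy/MaxBy with an explicit comparer.
var names = new[] { "alpha", "Bravo", "charlie" };

// OrdinalIgnoreCase compares the uppercased forms, so "alpha" orders first
// and "charlie" orders last regardless of the mixed casing.
var first = names.MinBy(s => s, StringComparer.OrdinalIgnoreCase); // "alpha"
var last = names.MaxBy(s => s, StringComparer.OrdinalIgnoreCase); // "charlie"
```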
Unit Tests
Let's create both a struct and a class.
struct Tag
{
public readonly ValueProvider ValueProvider;
public readonly string Description;
public Tag(ValueProvider valueProvider, string description)
=> (ValueProvider, Description) = (valueProvider, description);
}
class ValueProvider : IComparable<ValueProvider>
{
public readonly int Value;
public ValueProvider(int value) => Value = value;
public int CompareTo(ValueProvider other)
{
return Value.CompareTo(other.Value);
}
}
And perform some basic tests:
[TestMethod]
public void MinByClass()
{
var sequence = new[]
{
null,
new ValueProvider(1),
new ValueProvider(0),
null,
new ValueProvider(2),
};
var min = sequence.MinBy(x => x.Value);
var max = sequence.MaxBy(x => x.Value);
Assert.AreEqual(0, min.Value);
Assert.AreEqual(2, max.Value);
}
[TestMethod]
public void MinByStruct()
{
var sequence = new[]
{
new Tag(new ValueProvider(1), "B"),
new Tag(new ValueProvider(0), "A"),
new Tag(null, "D"),
new Tag(new ValueProvider(2), "C"),
};
var min = sequence.MinBy(x => x.ValueProvider);
var max = sequence.MaxBy(x => x.ValueProvider);
Assert.AreEqual("A", min.Description);
Assert.AreEqual("C", max.Description);
}
3 Answers
public static TSource ArgBy<TSource, TKey>(
MinBy and MaxBy are very intelligible names, but ArgBy doesn't say much to me. Really this is a specialised Aggregate (and I wonder whether an implementation using Aggregate might be simpler - or, indeed, whether the existence of Aggregate makes ArgBy unnecessary).
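To make the Aggregate point concrete, a MaxBy built directly on it might look like the following (my sketch, not code from the question or the answer):

```csharp
// Sketch: MaxBy expressed via Aggregate. Like Aggregate itself, this throws
// InvalidOperationException on an empty sequence and does no null filtering.
public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer = null)
{
    comparer = comparer ?? Comparer<TKey>.Default;
    return source.Aggregate(
        (best, next) => comparer.Compare(keySelector(next), keySelector(best)) > 0
            ? next
            : best);
}
```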
Func<(TKey Current, TKey Previous), bool> predicate)
I presume that you chose Func<(TKey Current, TKey Previous), bool> instead of Func<TKey, TKey, bool> because it lets you name the arguments? Whatever the reason, it would be worth documenting.
if (value == null)
{
    foreach (var other in source)
    {
        if (other == null) continue;
        var otherKey = keySelector(other);
        if (otherKey == null) continue;

Those two continues would also be worth documenting. I might want null to sort before everything else rather than be discarded.
if (value == null) { ... return value; }
else
{
    bool hasValue = false;
    ...
    if (hasValue) return value;
    throw new InvalidOperationException("Sequence contains no elements");
}

That inconsistency is a big gotcha. I see from your reference that System.Linq does the same, but I wouldn't expect it and I definitely think it should be documented. Also, the exception message might not be true: maybe the sequence contained elements, but their keys were null.
Note that LINQ predates some of the type constraints we have now. Perhaps it would be worth handling the two cases by having two implementations, one with where TSource : class and one with where TSource : struct. That way the method-level comments don't have to case-split on the type parameter (<exception cref="InvalidOperationException">Thrown when the sequence is empty and the element type is a struct.</exception>).
-
\$\begingroup\$ (1) Func<(TKey Current, TKey Previous), bool> yes this is my way of documenting the tuple. (2) the exception message might not be true you are right, it should be consistent with LINQ and be "Sequence contains no matching elements" (3) two separate methods for struct and class, I considered this and perhaps it would be more clean code indeed (4) it does require documentation in general, as does the existing LINQ API :p \$\endgroup\$dfhwze– dfhwze2019年08月31日 08:48:33 +00:00Commented Aug 31, 2019 at 8:48
-
2\$\begingroup\$ @dfhwze .net is probably not the best example of consistency. If everyone would try to mimic their conventions of KeyNotFoundException without naming the key (which immutables now do) we would just throw exceptions like you know, there's an error, good luck finding it, I won't tell you more about it :-] \$\endgroup\$t3chb0t– t3chb0t2019年08月31日 08:51:54 +00:00Commented Aug 31, 2019 at 8:51
-
\$\begingroup\$ @t3chb0t Yeah I also feel, because I tried to be as consistent as possible with LINQ's reference source, it hurts readability and lacks documentation of specification. \$\endgroup\$dfhwze– dfhwze2019年08月31日 08:53:45 +00:00Commented Aug 31, 2019 at 8:53
-
1\$\begingroup\$ Thinking about all the possible null-handling preferences (source items null allowed, null preferred, keys null allowed, null preferred), perhaps there is a good reason LINQ does not provide a MinBy. This method would become really cluttered. \$\endgroup\$dfhwze– dfhwze2019年08月31日 09:08:22 +00:00Commented Aug 31, 2019 at 9:08
-
1\$\begingroup\$ @dfhwze, IMO the way to handle that is to include no special casing for null at all: if the caller needs to include a Where(elt => elt != null) in the chain, they can. \$\endgroup\$Peter Taylor– Peter Taylor2019年09月02日 08:23:55 +00:00Commented Sep 2, 2019 at 8:23
I think in general MinBy and MaxBy are useful, but handling all the null cases might be tricky, and indeed it is, seeing the repeated logic and two similar loops. For this reason I also think it'd be easier not to handle them at all, but let the user decide what he wants to do with them, and implement both simply with OrderBy(Descending).First():
public static TSource MinBy<TSource, TKey>
(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer = null
)
{
return
source
.OrderBy(s => keySelector(s), comparer ?? Comparer<TKey>.Default)
//.OrderByDescending() // for MaxBy
.First();
}
If the user wants nulls to be sorted, then he'd just call it with ?.:
sequence.MinBy(x => x?.Value);
and if not, then he can filter them out:
sequence.Where(x => x != null).MinBy(x => x.Value);
Now, he has all the freedom and we no longer need to decide for him or document any possible tricks or hidden features. If he doesn't filter nulls out and doesn't use ?., it'll just throw.
Don't use SortedSet
It turns out that it isn't the right tool for this job. OrderBy[Descending], however, performs quite well (even much better than expected) and is only marginally behind the single-pass approach.
(Tested with extended benchmarks with 100k shuffled items where the filter value was a random number between 0 and 100.)
obsolete
(削除) In performance-critical scenarios I'd use a SortedSet with a custom wrapper-comparer (also because I'm lazy). Its complexity is O(log n) according to Comparative Analysis of List, HashSet and SortedSet, which is better than O(n) according to What does O(log n) mean exactly?. The disadvantage might be that the set needs additional memory to hold the items, but maybe this can be neglected and the faster search outweighs the memory usage. (削除ここまで)
private class MinComparer<TSource, TKey> : IComparer<TSource>
{
public IComparer<TKey> Comparer { get; set; }
public Func<TSource, TKey> KeySelector { get; set; }
public int Compare(TSource x, TSource y)
{
return Comparer.Compare(KeySelector(x), KeySelector(y));
}
}
public static TSource MinBy<TSource, TKey>
(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer = null
)
{
var c = new MinComparer<TSource, TKey>
{
KeySelector = keySelector,
Comparer = comparer ?? Comparer<TKey>.Default,
};
return new SortedSet<TSource>(source, c).First();
}
Another MaxComparer could order the items in descending order so that we could still use First with it.
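Such a MaxComparer might simply swap the arguments relative to MinComparer (a sketch of the idea, not code from the answer):

```csharp
// Sketch: descending counterpart to MinComparer. Swapping x and y inverts
// the ordering, so the set's First() yields the element with the maximum key.
private class MaxComparer<TSource, TKey> : IComparer<TSource>
{
    public IComparer<TKey> Comparer { get; set; }
    public Func<TSource, TKey> KeySelector { get; set; }

    public int Compare(TSource x, TSource y)
    {
        return Comparer.Compare(KeySelector(y), KeySelector(x));
    }
}
```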
-
\$\begingroup\$ @dfhwze there is one more possibility with a SortedSet, this one is O(log n) \$\endgroup\$t3chb0t– t3chb0t2019年08月31日 10:02:56 +00:00Commented Aug 31, 2019 at 10:02
-
\$\begingroup\$ I noticed your edit, so I removed my obsolete comment :p I could re-iterate that I find your solution compact and elegant. \$\endgroup\$dfhwze– dfhwze2019年08月31日 10:09:09 +00:00Commented Aug 31, 2019 at 10:09
-
1\$\begingroup\$ @HenrikHansen I call this oops ;-] fixed \$\endgroup\$t3chb0t– t3chb0t2019年08月31日 11:09:10 +00:00Commented Aug 31, 2019 at 11:09
-
2\$\begingroup\$ SortedSet is not a high performance alternative: it's \$\Theta(\lg n)\$ per Add, which means that new SortedSet<TSource>(source, c) takes \$\Theta(n \lg n)\$. \$\endgroup\$Peter Taylor– Peter Taylor2019年08月31日 13:19:04 +00:00Commented Aug 31, 2019 at 13:19
-
1\$\begingroup\$ @dfhwze I tuned the posted benchmarks and there is virtually no difference between a single run and OrderBy[Descending]. The difference averages to ~0.030 ms for 100k items in random order (1.209 ms vs 1.245 ms). SortedSet really sucks haha. I'll have to update the question but it's not the worst one :P MinByA and MaxByA had to be excluded from the tests because I wanted them to complete today lol \$\endgroup\$t3chb0t– t3chb0t2019年09月01日 06:26:21 +00:00Commented Sep 1, 2019 at 6:26
- AggByA: baseline two-pass method presented by the OP.
- AggByB: one-pass method using Aggregate, suggested by @PeterTaylor.
- AggByC: one-pass method implemented by hand.
I left most error handling to the reader since I personally don't believe it is necessary, or even helpful, in something this abstracted. If the caller wants extra safety then they can easily add whatever checks they desire through the selector and predicate arguments.
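For instance (my own usage sketch, reusing the ValueProvider sequence from the question), a caller who wants the original null filtering can add it to the chain explicitly:

```csharp
// Sketch: the caller supplies the null handling instead of the extension.
var min = Program.MinByC(
    sequence.Where(x => x != null), // drop null elements up front
    x => x.Value);                  // the selector can now assume non-null
```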
Q&A:
- Is this method usable in practice? Of course, it just is not ideal.
- Is this correct LINQ style? Opinion-based, but I'd say 'No', primarily because ArgBy seems to be unnecessary (as pointed out by @PeterTaylor). That said, relying on the Aggregate method ends up being slower than the naive alternative.
Benchmark Results:
Benchmark Code:
public readonly struct SomeStructure
{
private readonly int m_count;
private readonly string m_name;
public int Count => m_count;
public string Name => m_name;
public SomeStructure(int count, string name) {
m_count = count;
m_name = name;
}
}
public class Race
{
private static SomeStructure[] m_data;
[GlobalSetup]
public void Setup() {
m_data = new[] {
new SomeStructure(0, "A"),
new SomeStructure(1, "B"),
new SomeStructure(2, "C"),
};
}
[Benchmark]
public SomeStructure MinByA() => Program.MinByA(m_data, a => a.Count);
[Benchmark]
public SomeStructure MinByB() => Program.MinByB(m_data, a => a.Count);
[Benchmark]
public SomeStructure MinByC() => Program.MinByC(m_data, a => a.Count);
[Benchmark]
public SomeStructure MaxByA() => Program.MaxByA(m_data, a => a.Count);
[Benchmark]
public SomeStructure MaxByB() => Program.MaxByB(m_data, a => a.Count);
[Benchmark]
public SomeStructure MaxByC() => Program.MaxByC(m_data, a => a.Count);
}
class Program
{
static void Main(string[] args) {
var summary = BenchmarkRunner.Run<Race>();
Console.ReadKey();
}
public static TElement AggByA<TElement, TSelector>(IEnumerable<TElement> items, Func<TElement, TSelector> selector, Func<TSelector, TSelector, bool> predicate, Func<Func<TElement, TSelector>, TSelector> aggregator) =>
items.First(acc => predicate(selector(acc), aggregator(selector)));
public static TElement AggByB<TElement, TSelector>(IEnumerable<TElement> items, Func<TElement, TSelector> selector, Func<TSelector, TSelector, bool> predicate) =>
items.Aggregate((acc, cur) => (predicate(selector(acc), selector(cur)) ? cur : acc));
public static TElement AggByC<TElement, TSelector>(IEnumerable<TElement> items, Func<TElement, TSelector> selector, Func<TSelector, TSelector, bool> predicate) {
TElement accumulator;
var enumerator = items.GetEnumerator();
try {
if (!enumerator.MoveNext()) {
throw new InvalidOperationException("no elements to enumerate");
}
accumulator = enumerator.Current;
while (enumerator.MoveNext()) {
var current = enumerator.Current;
if (predicate(selector(accumulator), selector(current))) {
accumulator = current;
}
}
}
finally {
enumerator.Dispose();
}
return accumulator;
}
public static TElement MaxByA<TElement, TSelector>(IEnumerable<TElement> items, Func<TElement, TSelector> selector) where TSelector : IComparable<TSelector> =>
AggByA(items, selector, (acc, cur) => (0 == acc.CompareTo(cur)), items.Max);
public static TElement MaxByB<TElement, TSelector>(IEnumerable<TElement> items, Func<TElement, TSelector> selector) where TSelector : IComparable<TSelector> =>
AggByB(items, selector, (acc, cur) => (0 > acc.CompareTo(cur)));
public static TElement MaxByC<TElement, TSelector>(IEnumerable<TElement> items, Func<TElement, TSelector> selector) where TSelector : IComparable<TSelector> =>
AggByC(items, selector, (acc, cur) => (0 > acc.CompareTo(cur)));
public static TElement MinByA<TElement, TSelector>(IEnumerable<TElement> items, Func<TElement, TSelector> selector) where TSelector : IComparable<TSelector> =>
AggByA(items, selector, (acc, cur) => (0 == acc.CompareTo(cur)), items.Min);
public static TElement MinByB<TElement, TSelector>(IEnumerable<TElement> items, Func<TElement, TSelector> selector) where TSelector : IComparable<TSelector> =>
AggByB(items, selector, (acc, cur) => (0 < acc.CompareTo(cur)));
public static TElement MinByC<TElement, TSelector>(IEnumerable<TElement> items, Func<TElement, TSelector> selector) where TSelector : IComparable<TSelector> =>
AggByC(items, selector, (acc, cur) => (0 < acc.CompareTo(cur)));
}
-
\$\begingroup\$ These tests are broken. AggByC doesn't dispose the enumerator; MinByA and MaxByA both enumerate items twice. Also, these tests only test three already sorted items. Run them with 100_000 shuffled items and OrderBy[Descending] wins them all or is so close to MinByC that the results are barely comparable. MinByA and MaxByA have to be excluded because they take so long. SortedSet is not the slowest one... MinByA and MaxByA are. \$\endgroup\$t3chb0t– t3chb0t2019年09月01日 05:59:09 +00:00Commented Sep 1, 2019 at 5:59
\$\begingroup\$ +1 for creating these tests but -1 for broken ones, non-realistic conditions and not testing all approaches, so the result is 0. \$\endgroup\$t3chb0t– t3chb0t2019年09月01日 06:02:47 +00:00Commented Sep 1, 2019 at 6:02
-
\$\begingroup\$ @t3chb0t I explicitly mentioned that AggByA is two passes; it is the baseline that the OP is trying to beat. Of course it's slowest! The SortedSet implementation wastes memory and doesn't play nice with large inputs; why bother using it at all? As for the lack of dispose, a simple oversight that I have corrected. \$\endgroup\$Kittoes0124– Kittoes01242019年09月01日 07:02:05 +00:00Commented Sep 1, 2019 at 7:02
1\$\begingroup\$ why bother using it at all - for the sake of science and proof :P \$\endgroup\$t3chb0t– t3chb0t2019年09月01日 07:56:28 +00:00Commented Sep 1, 2019 at 7:56
-
\$\begingroup\$ @t3chb0t We don't even need to perform a benchmark in order to prove that any solution that involves extra memory is going to be inherently slower than any reasonably intelligent algorithm that accumulates "in-place." Science tells us that such implementations are objectively awful before we even write a single line of code. \$\endgroup\$Kittoes0124– Kittoes01242019年09月01日 13:13:50 +00:00Commented Sep 1, 2019 at 13:13