I needed to group the same dataset in several ways. So instead of repeatedly querying the dataset, I made an extension method that can do it in one pass. The caveat is that the result is materialized in dictionaries, because I didn't manage to find a way to avoid that. Maybe you can?
public static IDictionary<string, Dictionary<object, HashSet<T>>> MultiGroupBy<T>(this IEnumerable<T> source, params (string Label, Func<T, object> Getter)[] groupers)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupers == null) throw new ArgumentNullException(nameof(groupers));

    IDictionary<string, Dictionary<object, HashSet<T>>> results = new Dictionary<string, Dictionary<object, HashSet<T>>>();
    using (var enumer = source.GetEnumerator())
    {
        while (enumer.MoveNext())
        {
            foreach ((var label, var func) in groupers)
            {
                if (!results.TryGetValue(label, out var dict))
                {
                    dict = new Dictionary<object, HashSet<T>>();
                    results[label] = dict;
                }
                var key = func(enumer.Current);
                if (!dict.TryGetValue(key, out var set))
                {
                    set = new HashSet<T>();
                    dict[key] = set;
                }
                set.Add(enumer.Current);
            }
        }
    }
    return results;
}
Use case:
static void TestMultiGrouping()
{
    string[] data =
    {
        "Black",
        "White",
        "Yellow",
        "green",
        "Red",
        "blue",
        "cyan",
        "Magenta",
        "Orange"
    };

    foreach (var result in data.MultiGroupBy(
        ("First UCase", s => s.Length > 0 && char.IsUpper(s[0])),
        ("Length", s => s.Length),
        ("Length Four", s => s.Length == 4),
        ("Contains 'e'", s => s.Contains('e')),
        ("Num n's", s => s.Count(c => c == 'n'))))
    {
        Console.WriteLine($"Results for {result.Key}:");
        foreach (var dict in result.Value)
        {
            Console.WriteLine($"{dict.Key}: {dict.Value.Count} [{(string.Join(", ", dict.Value))}]");
        }
        Console.WriteLine();
    }
}
If you only want to enumerate `source` once, then you'll have to cache it somehow: either by materializing it immediately, as you do, or whenever the first group is enumerated, but that's more complicated.
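For illustration, that lazier variant could be sketched roughly like this (untested; `MultiGroupByLazy` and `Deferred` are illustrative names). `ToArray` runs at most once, and only when some group is first enumerated:

public static IDictionary<string, IEnumerable<IGrouping<object, T>>> MultiGroupByLazy<T>(
    this IEnumerable<T> source,
    params (string Label, Func<T, object> KeySelector)[] groupings)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupings == null) throw new ArgumentNullException(nameof(groupings));

    // Shared cache: the source is enumerated at most once, lazily.
    var cache = new Lazy<T[]>(() => source.ToArray());

    return groupings.ToDictionary(
        g => g.Label,
        g => Deferred(cache, g.KeySelector));

    // Local iterator so cache.Value isn't touched until enumeration starts.
    static IEnumerable<IGrouping<object, T>> Deferred(
        Lazy<T[]> cache, Func<T, object> keySelector)
    {
        foreach (var group in cache.Value.GroupBy(keySelector))
            yield return group;
    }
}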
If you don't mind duplicate entries in your groups and throwing on duplicate labels, then your code can be simplified to the following:
public static IDictionary<string, IEnumerable<IGrouping<object, T>>> MultiGroupBy<T>(
    this IEnumerable<T> source,
    params (string label, Func<T, object> keySelector)[] groupings)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupings == null) throw new ArgumentNullException(nameof(groupings));

    var materializedSource = source.ToArray();
    return groupings.ToDictionary(
        grouping => grouping.label,
        grouping => materializedSource.GroupBy(grouping.keySelector));
}
This materializes `source` up-front, but each grouping is lazily evaluated. Some quick enumeration tests with randomly generated strings show a roughly 40% speed improvement. I haven't measured memory consumption, but I expect it to be a bit higher due to the extra references/values stored in `materializedSource`.
I suspect the main reason for the speed difference is that your code performs a lookup into `results` for every item/grouping combination, something that separate `GroupBy` calls don't need to do.
Other notes:
- That `using GetEnumerator`/`while MoveNext` construction can be simplified to a `foreach` loop.
- You do not guard against duplicate labels, so you can end up with mixed results (and even mixed key types); a possible guard is sketched below.
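A minimal sketch of such a guard, in terms of the original signature (the exception message is illustrative):

// Fail fast on duplicate labels instead of silently merging two
// groupings' results into one dictionary.
var duplicate = groupers
    .GroupBy(g => g.Label)
    .FirstOrDefault(g => g.Count() > 1);
if (duplicate != null)
    throw new ArgumentException($"Duplicate label '{duplicate.Key}'.", nameof(groupers));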
- Nifty. But I can't get the same result as you when it comes to performance: if the source is a materialized array itself, yes, but if not, then mine is about twice as fast as yours, and for large sets even more. But one thing yours showed me is that I can't use `HashSet<T>` as the inner collection, because it handles duplicates wrongly for value types. `HashSet<T>` made sense in the original context though. – user73941, Jul 18, 2019 at 13:46
- About `foreach` vs. `GetEnumerator()`: it is my experience that the latter is slightly (sometimes much) faster, which is strange because `foreach` calls `GetEnumerator()`. Duplicate labels: thanks, that has to be dealt with. – user73941, Jul 18, 2019 at 13:46
- It's getting into micro-optimisation, but there's an argument for using `ToList()` in preference to `ToArray()` for temporary private eager-evaluation results: the implementation is effectively almost the same, but `ToArray()` does a final copy to a shorter array for most inputs. – Peter Taylor, Jul 18, 2019 at 14:17
public static IDictionary<string, Dictionary<object, HashSet<T>>> MultiGroupBy<T>(this IEnumerable<T> source, params (string Label, Func<T, object> Getter)[] groupers)
I don't understand the mixture of interfaces (`IDictionary`) and implementations (`Dictionary`, `HashSet`), nor the mixture of generics (`<T>`) and non-generics (`object`). Why is it not
public static IDictionary<string, IDictionary<K, ISet<T>>> MultiGroupBy<T, K>(this IEnumerable<T> source, params (string Label, Func<T, K> Getter)[] groupers)
?
IDictionary<string, Dictionary<object, HashSet<T>>> results = new Dictionary<string, Dictionary<object, HashSet<T>>>();
using (var enumer = source.GetEnumerator())
{
    while (enumer.MoveNext())
    {
        foreach ((var label, var func) in groupers)
        {
            if (!results.TryGetValue(label, out var dict))
            {
                dict = new Dictionary<object, HashSet<T>>();
                results[label] = dict;
            }
            ...
I'm not quite sure why you want to return an empty dictionary if the source is empty. As a caller of your library, I'd probably rather get a dictionary mapping the grouper names to empty dictionaries.
That also simplifies the initialisation:
var results = groupers.ToDictionary(grouper => grouper.Item1, _ => new Dictionary<object, HashSet<T>>());
using (var enumer = source.GetEnumerator())
{
    while (enumer.MoveNext())
    {
        ...
    }
}
KISS. `foreach` is much kinder on the maintenance programmer, who doesn't have to check for correct usage patterns of the unsugared API. Using `MoveNext()` / `Current` for speed is the epitome of premature optimisation unless benchmarking shows that it's a bottleneck, in which case there should be a comment explaining the bottleneck to justify the more complex code.
Moreover, if this is a bottleneck then it seems likely that the dictionary lookups in `results` for every single element in the source will be slower than the overhead of `foreach`, so you could start by replacing `results` with a `List<(string Label, Func<T, K> Getter, IDictionary<K, ISet<T>> Groups)>` and just convert it to a dictionary after the loop.
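Roughly like this (an illustrative sketch, not benchmarked):

// Per-grouper state lives in a list parallel to groupers, so the hot
// loop never hashes a label; labels are reattached once, at the end.
var state = groupers
    .Select(g => (g.Label, g.Getter, Groups: (IDictionary<K, ISet<T>>)new Dictionary<K, ISet<T>>()))
    .ToList();

foreach (var elt in source)
{
    foreach (var (_, getter, groups) in state)
    {
        var key = getter(elt);
        if (!groups.TryGetValue(key, out var set))
        {
            groups[key] = set = new HashSet<T>();
        }
        set.Add(elt);
    }
}

return state.ToDictionary(s => s.Label, s => s.Groups);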
foreach ((var label, var func) in groupers)
`var (label, func)` saves the repetition.
After my proposed refactors and some minor tidying of whitespace, I get
public static IDictionary<string, IDictionary<K, ISet<T>>> MultiGroupBy<T, K>(this IEnumerable<T> source, params (string Label, Func<T, K> Getter)[] groupers)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupers == null) throw new ArgumentNullException(nameof(groupers));

    var results = groupers.ToDictionary(grouper => grouper.Item1, _ => (IDictionary<K, ISet<T>>)new Dictionary<K, ISet<T>>());
    foreach (var elt in source)
    {
        foreach (var (label, func) in groupers)
        {
            var dict = results[label];
            var key = func(elt);
            if (!dict.TryGetValue(key, out var set))
            {
                set = new HashSet<T>();
                dict[key] = set;
            }
            set.Add(elt);
        }
    }
    return results;
}
- Fine solution. You're right about the interface/class mix. One thing though: I can't use the type parameter `K` because the keys may not be of the same type; that's why I use `object`. The instantiation of the `results` dictionary the way you do it is definitely smarter and cleaner code, and I'll use that, but apparently I can't measure any significant difference from mine. About `ISet`/`HashSet`, see my comment on Pieter Witvoet's answer. Thanks for interesting points, as always. – user73941, Jul 18, 2019 at 14:47
- LINQ provides `public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)`. Why not use that one? – dfhwze, Jul 18, 2019 at 14:49
- @HenrikHansen, nothing stops you calling it with `object` for `K`. I admit that the downside is that the inference is less likely to work, so it's usually going to be necessary to use explicit type arguments. – Peter Taylor, Jul 18, 2019 at 15:25
- @dfhwze, was that intended for me or someone else? – Peter Taylor, Jul 18, 2019 at 15:25
- It was intended for Pieter's answer. See my answer for an explanation. – dfhwze, Jul 18, 2019 at 15:26
GroupBy vs ToLookup
From the reference source (Linq Enumerable):
`Dictionary<object, HashSet<T>>` can be replaced with an `ILookup<object, T>`.
public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
    // impl ..
}
There is also an overload in case you require behaviour compliant with `HashSet<T>`.
public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    // impl ..
}
This is much faster than `GroupBy`. Have a look at the implementation of the latter.
public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
    return new GroupedEnumerable<TSource, TKey, TSource>(source, keySelector, IdentityFunction<TSource>.Instance, null);
}
And `GroupedEnumerable` wraps `Lookup`.
public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator()
{
    return Lookup<TKey, TElement>.Create<TSource>(source, keySelector, elementSelector, comparer).GetEnumerator();
}
Refactored Code
Pieter's answer could be updated for a performance boost by substituting `GroupBy` with `ToLookup`, also including Peter's micro-optimized `ToList`.
public static IDictionary<string, ILookup<object, T>> MultiLookupBy<T>(
    this IEnumerable<T> source, params (string Label, Func<T, object> Getter)[] groupings)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupings == null) throw new ArgumentNullException(nameof(groupings));

    var materializedSource = source.ToList();
    return groupings.ToDictionary(
        grouping => grouping.Label,
        grouping => materializedSource.ToLookup(grouping.Getter));
}
And your test code would change a bit.
sb.AppendLine($"Results for {result.Key}:");
foreach (var dict in result.Value)
{
sb.AppendLine($"{dict.Key}: {dict.Count()} [{(string.Join(", ", dict))}]");
}
sb.AppendLine();
I get performance close to the initial OP with this refactored code.
- This seems to be the "test winner", both in elegance and in performance (for large datasets, about 20%). My problem is now to accept an answer, because I think you have contributed evenly to the best solution. I have written a number [1,99] in Notepad. Comment a number, and the closest will be accepted :-) – user73941, Jul 18, 2019 at 15:55
- @HenrikHansen you should usually accept the answer which is the most useful to you and to future readers who stumble upon this problem. People landing here via search will often take the accepted answer as providing a final "best" solution, so the accepted answer should include all relevant facts and not just be a partial answer. If you are worried about fame, the accepted answer could be edited into a community answer, with credits to both posters. – Falco, Jul 19, 2019 at 11:24
Consistency with LINQ
I find that, in order to be consistent with other LINQ APIs and to make the usage of this extension more intuitive, you should slightly adjust the parameter names and the return value, and rename it to `ToLookups`.
`ToLookup` calls the `Func` a `keySelector`, and since this extension accepts a collection of them, I suggest the name `keySelectors`.
As far as the return value is concerned, I would use `ILookup` twice here so that the behaviour of the result is consistent.
Unexpected behaviour due to HashSet
If you require unique elements, then the source should be prefiltered. Ignoring duplicates here is not something I would expect from a grouping; on the contrary, it should group them together, because that is what grouping is for. A `HashSet` could also change the order of elements, which the built-in grouping wouldn't, so that's another surprise here.
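If set semantics are genuinely wanted, a caller can make that explicit up front, for example (using the `ToLookups` extension suggested below):

// Deduplicate first, then group: uniqueness is a visible, deliberate
// step rather than a side effect of HashSet storage.
var lookups = data.Distinct().ToLookups(("Length", s => s.Length));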
Suggested code
This is how I think it should look:
public static ILookup<string, ILookup<object, T>> ToLookups<T>
(
    this IEnumerable<T> source,
    params (string Name, Func<T, object> KeySelector)[] keySelectors
)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (keySelectors == null) throw new ArgumentNullException(nameof(keySelectors));

    var materializedSource = source.ToList();
    return
        keySelectors
            .Select(t => (t.Name, Lookup: materializedSource.ToLookup(t.KeySelector)))
            .ToLookup(t => t.Name, t => t.Lookup);
}
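For illustration, usage of `ToLookups` with the question's test data might look like this (note that indexing an `ILookup` with an unknown name yields an empty sequence rather than throwing):

var lookups = data.ToLookups(
    ("Length", s => s.Length),
    ("Contains 'e'", s => s.Contains('e')));

foreach (var lookup in lookups["Length"])
{
    foreach (var group in lookup)
    {
        Console.WriteLine($"{group.Key}: {string.Join(", ", group)}");
    }
}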
- Check my answer on the `HashSet` behavior. You can use `ToLookup` with an overload that takes an `IEqualityComparer<TKey> comparer`. – dfhwze, Jul 20, 2019 at 6:48
- @dfhwze I'm not quite sure what that has to do with the `HashSet`? The `IEqualityComparer<T>` is for grouping, whereas `HashSet` is for throwing away duplicates, and a grouping shouldn't be doing that. – t3chb0t, Jul 20, 2019 at 6:52
- You are right, it's on the Key. Doh! – dfhwze, Jul 20, 2019 at 6:56
- Thanks for the answer. Regarding `HashSet`, you're absolutely right; see my first comment on Pieter Witvoet's answer. I'll consider the renaming, but I think I prefer my name for the method itself. I look forward to benchmarking your solution a little later. – user73941, Jul 20, 2019 at 7:06
FYI. I've now tested the different versions of the algorithm from the different answers, and the result is as follows:
Data Size: 10
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 0.38341 0.05530 16.09750 19.17070 2.24480 [Milliseconds]
dfhwze : 50 0.09890 0.01250 3.96660 4.94510 0.55250 [Milliseconds]
Peter Tayl: 50 0.14559 0.01500 6.16400 7.27940 0.85970 [Milliseconds]
T3chb0t : 50 0.18089 0.01240 8.06260 9.04470 1.12590 [Milliseconds]
Original : 50 0.11584 0.01640 4.54850 5.79220 0.63330 [Milliseconds]
Data Size: 100
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 0.52665 0.48760 0.78700 26.33230 0.05190 [Milliseconds]
dfhwze : 50 0.14118 0.11800 0.24010 7.05920 0.02070 [Milliseconds]
Peter Tayl: 50 0.15725 0.14010 0.35670 7.86250 0.03030 [Milliseconds]
T3chb0t : 50 0.13385 0.11880 0.18680 6.69250 0.01470 [Milliseconds]
Original : 50 0.15542 0.14090 0.32780 7.77100 0.02600 [Milliseconds]
Data Size: 1000
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 4.86897 4.56660 5.49500 243.44840 0.19180 [Milliseconds]
dfhwze : 50 1.22802 1.14460 1.55030 61.40110 0.10070 [Milliseconds]
Peter Tayl: 50 1.51039 1.41420 1.83450 75.51970 0.10540 [Milliseconds]
T3chb0t : 50 1.33878 1.13730 2.61480 66.93920 0.21000 [Milliseconds]
Original : 50 1.53352 1.39930 1.93510 76.67620 0.12120 [Milliseconds]
Data Size: 10000
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 53.30435 48.53940 59.39360 2665.21760 2.12420 [Milliseconds]
dfhwze : 50 13.29163 11.58010 17.93610 664.58150 1.42940 [Milliseconds]
Peter Tayl: 50 15.99885 13.73030 19.87350 799.94260 1.62800 [Milliseconds]
T3chb0t : 50 13.35479 11.60260 17.27620 667.73940 1.33350 [Milliseconds]
Original : 50 16.06655 14.10760 21.15530 803.32750 1.57870 [Milliseconds]
Data Size: 100000
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 759.18213 671.44490 972.02490 37959.10640 106.57280 [Milliseconds]
dfhwze : 50 184.68625 157.19610 240.79290 9234.31240 27.82440 [Milliseconds]
Peter Tayl: 50 247.55367 207.27300 296.28640 12377.68350 38.71610 [Milliseconds]
T3chb0t : 50 200.40129 159.78880 241.07520 10020.06430 31.49570 [Milliseconds]
Original : 50 250.01759 208.41280 324.99400 12500.87940 39.78020 [Milliseconds]
Data Size: 500000
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 4241.30253 3572.39540 4887.39420 212065.12660 382.99050 [Milliseconds]
dfhwze : 50 1009.33538 798.42660 1143.81710 50466.76910 124.30220 [Milliseconds]
Peter Tayl: 50 1344.13312 1085.37460 1562.34310 67206.65590 185.08020 [Milliseconds]
T3chb0t : 50 1002.87650 784.16660 1195.38060 50143.82510 136.03740 [Milliseconds]
Original : 50 1354.36220 1072.92070 1536.09860 67718.10980 171.94550 [Milliseconds]
Test data: randomly generated strings of length [0, 20); the test case was:
foreach (var result in data.MultiGroupBy(
    ("First UCase", s => s.Length > 0 && char.IsUpper(s[0])),
    ("Length", s => s.Length),
    ("Length Four", s => s.Length == 4),
    ("Contains 'e'", s => s.Contains('e')),
    ("Num 'n's", s => s.Count(c => c == 'n'))))
{
    foreach (var dict in result.Value)
    {
        sum += dict.Value.Count;
    }
}
In order to get equivalent results, I changed the `HashSet` in the original to a `List`.
It's somehow a little disappointing that my efforts to do it in one iteration didn't pay off.
- I'm too lazy to try it myself, but what happens when you replace the `results` dictionaries with an array in the original and Peter Taylor's answers? There's really no reason to use a dictionary over the labels in the loop, when the groupers are already in an ordered list. – VisualMelon, Jul 21, 2019 at 9:17
- @VisualMelon: Good suggestion. I've tried it with the original, and it seems to improve it by about 7-10 percent. – user73941, Jul 21, 2019 at 14:12
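For reference, the array-based variant of the original hot loop might look roughly like this (an illustrative sketch, not the benchmarked code):

// One dictionary per grouper, indexed by position instead of label,
// so the inner loop performs no label lookup at all.
var dicts = new Dictionary<object, HashSet<T>>[groupers.Length];
for (var i = 0; i < groupers.Length; i++)
{
    dicts[i] = new Dictionary<object, HashSet<T>>();
}

foreach (var item in source)
{
    for (var i = 0; i < groupers.Length; i++)
    {
        var key = groupers[i].Getter(item);
        if (!dicts[i].TryGetValue(key, out var set))
        {
            dicts[i][key] = set = new HashSet<T>();
        }
        set.Add(item);
    }
}
// The labelled result dictionary is built only once, after the loop.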
- … a push interface (`IObservable`) rather than a pull interface (`IEnumerable`). Rather than returning a dict mapping keys to sets of elements, you can try returning a dict mapping keys to streams of elements.
- … (`IEnumerable`) was given. Maybe the header of the question is a little misleading: the primary goal was performance optimization rather than "one iteration", but I thought that "one iteration" was the way to optimize, which it still may be if memory allocation matters.
- … `IEnumerable.ToObservable()`, which would be a perfect fit. I think the performance characteristics would be pretty good. Using a cold observable (one that doesn't do anything until there's a subscriber), nothing happens until anyone is interested in the result, and the groups won't have to be concurrently stored in memory.
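A rough sketch of that push-based idea, assuming the System.Reactive NuGet package and the data array from the question (only one grouping shown):

using System;
using System.Reactive.Linq; // IEnumerable.ToObservable(), Observable.GroupBy/Count

// A cold observable: nothing runs until Subscribe, and groups arrive
// as streams instead of being stored in sets up front.
IObservable<IGroupedObservable<int, string>> byLength =
    data.ToObservable().GroupBy(s => s.Length);

byLength.Subscribe(group =>
    group.Count().Subscribe(n =>
        Console.WriteLine($"Length {group.Key}: {n} item(s)")));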