7
\$\begingroup\$

I needed to group the same dataset in several different ways. So instead of repeatedly querying the dataset, I made an extension method that can do it in one pass. The caveat is that the result is materialized in dictionaries, because I didn't manage to find a way to avoid that. Maybe you can?

public static IDictionary<string, Dictionary<object, HashSet<T>>> MultiGroupBy<T>(this IEnumerable<T> source, params (string Label, Func<T, object> Getter)[] groupers)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupers == null) throw new ArgumentNullException(nameof(groupers));

    IDictionary<string, Dictionary<object, HashSet<T>>> results = new Dictionary<string, Dictionary<object, HashSet<T>>>();

    using (var enumer = source.GetEnumerator())
    {
        while (enumer.MoveNext())
        {
            foreach ((var label, var func) in groupers)
            {
                if (!results.TryGetValue(label, out var dict))
                {
                    dict = new Dictionary<object, HashSet<T>>();
                    results[label] = dict;
                }

                var key = func(enumer.Current);
                if (!dict.TryGetValue(key, out var set))
                {
                    set = new HashSet<T>();
                    dict[key] = set;
                }

                set.Add(enumer.Current);
            }
        }
    }

    return results;
}

Use case:

static void TestMultiGrouping()
{
    string[] data =
    {
        "Black",
        "White",
        "Yellow",
        "green",
        "Red",
        "blue",
        "cyan",
        "Magenta",
        "Orange"
    };

    foreach (var result in data.MultiGroupBy(
        ("First UCase", s => s.Length > 0 && char.IsUpper(s[0])),
        ("Length", s => s.Length),
        ("Length Four", s => s.Length == 4),
        ("Contains 'e'", s => s.Contains('e')),
        ("Num n's", s => s.Count(c => c == 'n'))))
    {
        Console.WriteLine($"Results for {result.Key}:");
        foreach (var dict in result.Value)
        {
            Console.WriteLine($"{dict.Key}: {dict.Value.Count} [{string.Join(", ", dict.Value)}]");
        }
        Console.WriteLine();
    }
}
asked Jul 18, 2019 at 10:43
\$\endgroup\$
3
  • \$\begingroup\$ If you're trying to avoid materializing this data, I think that is better suited to a push interface (IObservable) rather than a pull interface (IEnumerable). Rather than returning a dict mapping keys to sets of elements, you can try returning a dict mapping keys to streams of elements. \$\endgroup\$ Commented Jul 19, 2019 at 5:05
  • \$\begingroup\$ @Alexander, thanks for the input. Your suggestion could be a solution, but the dataset's datatype (IEnumerable) was given. Maybe the title of the question is a little misleading - the primary goal was performance optimization rather than "one iteration", but I thought that "one iteration" was the way to optimize - which it still may be if memory allocation matters. \$\endgroup\$ Commented Jul 19, 2019 at 6:14
  • \$\begingroup\$ Rx.NET adds IEnumerable.ToObservable(), which would be a perfect fit. I think the performance characteristics would be pretty good. Using a cold observable (one that doesn't do anything until there's a subscriber), nothing happens until someone is interested in the result, and the elements won't have to be stored in memory concurrently \$\endgroup\$ Commented Jul 19, 2019 at 6:45

5 Answers 5

4
\$\begingroup\$

If you only want to enumerate source once, then you'll have to cache it somehow. Either by materializing it immediately, as you do, or whenever the first group is enumerated, but that's more complicated.
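That lazier variant could be sketched with a `Lazy<T[]>` plus a small deferral helper. This is only a sketch with made-up names (`MultiGroupByLazy` and `Defer` are mine): the source is still materialized, but not until one of the groups is first enumerated, and it is enumerated at most once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyMultiGroup
{
    // Sketch: source is not enumerated until the first grouping is enumerated,
    // and it is enumerated at most once thanks to Lazy<T[]>.
    public static IDictionary<string, IEnumerable<IGrouping<object, T>>> MultiGroupByLazy<T>(
        this IEnumerable<T> source,
        params (string label, Func<T, object> keySelector)[] groupings)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (groupings == null) throw new ArgumentNullException(nameof(groupings));

        var cached = new Lazy<T[]>(source.ToArray);
        return groupings.ToDictionary(
            grouping => grouping.label,
            grouping => Defer(() => cached.Value.GroupBy(grouping.keySelector)));
    }

    // Wraps a factory in an iterator so cached.Value is only touched on enumeration.
    private static IEnumerable<TOut> Defer<TOut>(Func<IEnumerable<TOut>> factory)
    {
        foreach (var item in factory())
        {
            yield return item;
        }
    }
}
```

Whether the extra indirection is worth it depends on how often the groups go unused; if every group is always enumerated, the eager version below is simpler.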

If you don't mind duplicate entries in your groups and throwing on duplicate labels, then your code can be simplified to the following:

public static IDictionary<string, IEnumerable<IGrouping<object, T>>> MultiGroupBy<T>(
    this IEnumerable<T> source,
    params (string label, Func<T, object> keySelector)[] groupings)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupings == null) throw new ArgumentNullException(nameof(groupings));

    var materializedSource = source.ToArray();
    return groupings.ToDictionary(
        grouping => grouping.label,
        grouping => materializedSource.GroupBy(grouping.keySelector));
}

This materializes source up-front, but each grouping is lazily evaluated. Some quick enumeration tests with randomly generated strings show a roughly 40% speed improvement. I haven't measured memory consumption, but I expect that to be a bit higher due to the extra references/values stored in materializedSource.

I suspect the main reason for the speed difference is that your code performs a lookup into results for every item/grouping combination, something that separate GroupBy calls don't need to do.


Other notes:

  • That using GetEnumerator/while MoveNext construction can be simplified to a foreach loop.
  • You do not guard against duplicate labels, so you can end up with mixed results (and even mixed key types).
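A guard for that second point could be as simple as the following sketch (`ThrowOnDuplicateLabels` is a made-up name; note that the `ToDictionary`-based version above already throws `ArgumentException` on a duplicate label, so this mainly applies to the original loop):

```csharp
using System;
using System.Linq;

public static class LabelGuard
{
    // Sketch: fail fast on duplicate labels instead of silently merging
    // two groupers' results into the same inner dictionary.
    public static void ThrowOnDuplicateLabels<T>(
        (string label, Func<T, object> keySelector)[] groupings)
    {
        var duplicates = groupings
            .GroupBy(grouping => grouping.label)
            .Where(group => group.Count() > 1)
            .Select(group => group.Key)
            .ToArray();

        if (duplicates.Length > 0)
        {
            throw new ArgumentException(
                $"Duplicate labels: {string.Join(", ", duplicates)}",
                nameof(groupings));
        }
    }
}
```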
answered Jul 18, 2019 at 12:43
\$\endgroup\$
3
  • \$\begingroup\$ Nifty. But I can't get the same result as you when it comes to performance: if the source is a materialized array itself - yes, but if not, then mine is about twice as fast as yours, and for large sets even more. One thing yours showed me, though, is that I can't use HashSet<T> as the inner collection, because it handles duplicates wrongly for value types. HashSet<T> made sense in the original context though. \$\endgroup\$ Commented Jul 18, 2019 at 13:46
  • \$\begingroup\$ About foreach vs. GetEnumerator() - it is my experience that the latter is slightly (sometimes much) faster - which is strange because foreach calls GetEnumerator(). Duplicate labels - thanks - has to be dealt with. \$\endgroup\$ Commented Jul 18, 2019 at 13:46
  • 3
    \$\begingroup\$ It's getting into micro-optimisation, but there's an argument for using ToList() in preference to ToArray() for temporary private eager evaluation results: the implementation is effectively almost the same, but ToArray() does a final copy to a shorter array for most inputs. \$\endgroup\$ Commented Jul 18, 2019 at 14:17
5
\$\begingroup\$
public static IDictionary<string, Dictionary<object, HashSet<T>>> MultiGroupBy<T>(this IEnumerable<T> source, params (string Label, Func<T, object> Getter)[] groupers)

I don't understand the mixture of interfaces (IDictionary) and implementations (Dictionary, HashSet), nor the mixture of generics (<T>) and non-generics (object). Why is it not

public static IDictionary<string, IDictionary<K, ISet<T>>> MultiGroupBy<T, K>(this IEnumerable<T> source, params (string Label, Func<T, K> Getter)[] groupers)

?


 IDictionary<string, Dictionary<object, HashSet<T>>> results = new Dictionary<string, Dictionary<object, HashSet<T>>>();

 using (var enumer = source.GetEnumerator())
 {
     while (enumer.MoveNext())
     {
         foreach ((var label, var func) in groupers)
         {
             if (!results.TryGetValue(label, out var dict))
             {
                 dict = new Dictionary<object, HashSet<T>>();
                 results[label] = dict;
             }
             ...

I'm not quite sure why you want to return an empty dictionary if the source is empty. As a caller of your library, I'd probably rather get a dictionary mapping the grouper names to empty dictionaries.

That also simplifies the initialisation:

 var results = groupers.ToDictionary(grouper => grouper.Item1, _ => new Dictionary<object, HashSet<T>>());

 using (var enumer = source.GetEnumerator())
 {
     while (enumer.MoveNext())
     {
         ...
     }
 }

KISS. foreach is much kinder on the maintenance programmer, who doesn't have to check for correct usage patterns of the unsugared API. Using MoveNext() / Current for speed is the epitome of premature optimisation unless benchmarking shows that it's a bottleneck, in which case there should be a comment explaining the bottleneck to justify the more complex code.

Moreover, if this is a bottleneck then it seems likely that the dictionary lookups in results for every single element in the source will be slower than the overhead of foreach, so you could start by replacing results with a List<(string Label, Func<T, K> Getter, IDictionary<K, ISet<T>> Groups)> and just convert it to a dictionary after the loop.
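Such a list-based variant might look roughly like this - a sketch only, keeping `object` keys rather than the `K` parameter for brevity, and with a made-up name (`MultiGroupByList`) so it doesn't clash with the original:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MultiGroupList
{
    // Sketch: the per-element dictionary lookup on the label is replaced by a
    // plain list walk; the labelled dictionary is built once, after the loop.
    public static IDictionary<string, IDictionary<object, ISet<T>>> MultiGroupByList<T>(
        this IEnumerable<T> source,
        params (string Label, Func<T, object> Getter)[] groupers)
    {
        var buckets = groupers
            .Select(g => (g.Label, g.Getter,
                Groups: (IDictionary<object, ISet<T>>)new Dictionary<object, ISet<T>>()))
            .ToList();

        foreach (var elt in source)
        {
            foreach (var (_, getter, groups) in buckets)
            {
                var key = getter(elt);
                if (!groups.TryGetValue(key, out var set))
                {
                    groups[key] = set = new HashSet<T>();
                }
                set.Add(elt);
            }
        }

        return buckets.ToDictionary(bucket => bucket.Label, bucket => bucket.Groups);
    }
}
```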


 foreach ((var label, var func) in groupers)

var (label, func) saves the repetition.


After my proposed refactors and some minor tidying of whitespace, I get

public static IDictionary<string, IDictionary<K, ISet<T>>> MultiGroupBy<T, K>(this IEnumerable<T> source, params (string Label, Func<T, K> Getter)[] groupers)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupers == null) throw new ArgumentNullException(nameof(groupers));

    var results = groupers.ToDictionary(grouper => grouper.Item1, _ => (IDictionary<K, ISet<T>>)new Dictionary<K, ISet<T>>());

    foreach (var elt in source)
    {
        foreach (var (label, func) in groupers)
        {
            var dict = results[label];
            var key = func(elt);
            if (!dict.TryGetValue(key, out var set))
            {
                set = new HashSet<T>();
                dict[key] = set;
            }
            set.Add(elt);
        }
    }

    return results;
}
answered Jul 18, 2019 at 14:13
\$\endgroup\$
5
  • \$\begingroup\$ Fine solution. You're right about the interface/class mix. One thing though: I can't use the type parameter K, because the keys may not all be of the same type - that's why I use object. Instantiating the results dictionary the way you do is definitely smarter and cleaner code - I'll use that, but apparently I can't measure any significant difference from mine. About ISet/HashSet, see my comment to Pieter Witvoet's answer. Thanks for the interesting points - as always. \$\endgroup\$ Commented Jul 18, 2019 at 14:47
  • \$\begingroup\$ LINQ provides public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer);. Why not use that one? \$\endgroup\$ Commented Jul 18, 2019 at 14:49
  • 1
    \$\begingroup\$ @HenrikHansen, nothing stops you calling it with object for K. I admit that the downside is that the inference is less likely to work, so it's usually going to be necessary to use explicit type arguments. \$\endgroup\$ Commented Jul 18, 2019 at 15:25
  • \$\begingroup\$ @dfhwze, was that intended for me or someone else? \$\endgroup\$ Commented Jul 18, 2019 at 15:25
  • \$\begingroup\$ It was intended on Pieter's answer. See my answer for an explanation. \$\endgroup\$ Commented Jul 18, 2019 at 15:26
5
\$\begingroup\$

GroupBy vs ToLookup

From reference source: Linq Enumerable

Dictionary<object, HashSet<T>>> can be replaced with a ILookup<object, T>.

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    // impl ..
}

There is also an overload in case you require compliant behavior with HashSet<T>.

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    // impl ..
}

This is much faster than GroupBy. Have a look at the implementation of the latter.

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    return new GroupedEnumerable<TSource, TKey, TSource>(source, keySelector, IdentityFunction<TSource>.Instance, null);
}

And GroupedEnumerable wraps Lookup.

public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator()
{
    return Lookup<TKey, TElement>.Create<TSource>(source, keySelector, elementSelector, comparer).GetEnumerator();
}
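The practical difference is easy to observe: GroupBy defers its work until the result is enumerated, while ToLookup enumerates the source immediately. A minimal demonstration with a side-effect counter on the source:

```csharp
using System;
using System.Linq;

class GroupByVsToLookup
{
    static void Main()
    {
        var evaluations = 0;
        // Each pull from this sequence increments the counter.
        var source = Enumerable.Range(0, 4).Select(i => { evaluations++; return i; });

        var grouped = source.GroupBy(i => i % 2); // deferred: nothing enumerated yet
        Console.WriteLine(evaluations); // 0

        var lookup = source.ToLookup(i => i % 2); // eager: enumerates right away
        Console.WriteLine(evaluations); // 4

        grouped.Count(); // GroupBy only does its work when enumerated
        Console.WriteLine(evaluations); // 8
    }
}
```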

Refactored Code

Pieter's answer could be updated with a performance boost by substituting GroupBy with ToLookup, also including Peter Taylor's ToList() micro-optimization.

public static IDictionary<string, ILookup<object, T>> MultiLookupBy<T>(
    this IEnumerable<T> source, params (string Label, Func<T, object> Getter)[] groupings)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupings == null) throw new ArgumentNullException(nameof(groupings));

    var materializedSource = source.ToList();
    return groupings.ToDictionary(
        grouping => grouping.Label,
        grouping => materializedSource.ToLookup(grouping.Getter));
}

And your test code would change a bit.

 sb.AppendLine($"Results for {result.Key}:");
 foreach (var dict in result.Value)
 {
     sb.AppendLine($"{dict.Key}: {dict.Count()} [{string.Join(", ", dict)}]");
 }
 sb.AppendLine();

With this refactored code, I get performance close to the OP's original.

answered Jul 18, 2019 at 15:24
\$\endgroup\$
2
  • 1
    \$\begingroup\$ This seems to be the "testwinner", both in elegance and performance (for large datasets, about 20 %). My problem is now to accept an answer, because I think you have contributed evenly to the best solution. I have written a number [1,99] in notepad. Comment a number, and the closest will be accepted :-) \$\endgroup\$ Commented Jul 18, 2019 at 15:55
  • 1
    \$\begingroup\$ @HenrikHansen you should usually accept the answer which is the most useful to you and to future readers who stumble upon this problem. People landing here via search will often take the accepted answer as the final "best" solution - so the accepted answer should include all relevant facts and not just be a partial answer. - If you are worried about fame, the accepted answer could be edited into a community answer, with credits to both posters. \$\endgroup\$ Commented Jul 19, 2019 at 11:24
3
\$\begingroup\$

Consistency with LINQ

I find that, in order to be consistent with other LINQ APIs and to make the usage of this extension more intuitive, you should slightly adjust the parameter names and the return value, and rename the method to ToLookups.

ToLookup calls its Func parameter keySelector, and since this extension accepts a collection of them, I suggest the name keySelectors.

As far as the return value is concerned, I would use ILookup twice here so that the behaviour of the result is consistent.

Unexpected behaviour due to HashSet

If you require unique elements, then the source should be pre-filtered. Silently dropping duplicates here is not something I would expect from a grouping; on the contrary, it should group them together, because that is what grouping is for. A HashSet could also change the order of elements, which the built-in grouping wouldn't, so that's another surprise here.
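That duplicate-swallowing is easy to demonstrate: a HashSet keeps only one of two equal strings, while ToLookup keeps both occurrences under the same key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class HashSetVsLookup
{
    static void Main()
    {
        var data = new[] { "red", "red", "blue" };

        // HashSet silently drops the second "red"...
        var set = new HashSet<string>(data);
        Console.WriteLine(set.Count); // 2

        // ...while a lookup groups both occurrences under the same key.
        var byLength = data.ToLookup(s => s.Length);
        Console.WriteLine(byLength[3].Count()); // 2: "red" and "red"
        Console.WriteLine(byLength[4].Count()); // 1: "blue"
    }
}
```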

Suggested code

This is how I think it should look:

public static ILookup<string, ILookup<object, T>> ToLookups<T>
(
    this IEnumerable<T> source,
    params (string Name, Func<T, object> KeySelector)[] keySelectors
)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (keySelectors == null) throw new ArgumentNullException(nameof(keySelectors));

    var materializedSource = source.ToList();

    return
        keySelectors
            .Select(t => (t.Name, Lookup: materializedSource.ToLookup(t.KeySelector)))
            .ToLookup(t => t.Name, t => t.Lookup);
}
answered Jul 20, 2019 at 6:44
\$\endgroup\$
4
  • \$\begingroup\$ Check my answer on HashSet behavior. You can use ToLookup with an overload that takes a IEqualityComparer<TKey> comparer. \$\endgroup\$ Commented Jul 20, 2019 at 6:48
  • \$\begingroup\$ @dfhwze I'm not quite sure what that has to do with the HashSet? The IEqualityComparer<TKey> is for grouping, whereas the HashSet is for throwing away duplicates - and a grouping shouldn't be doing that. \$\endgroup\$ Commented Jul 20, 2019 at 6:52
  • \$\begingroup\$ You are right, it's on the Key. Doh! \$\endgroup\$ Commented Jul 20, 2019 at 6:56
  • \$\begingroup\$ Thanks for the answer. Regarding HashSet, you're absolutely right - see my first comment to Pieter Witvoet's answer. I'll consider the renaming, but I think I prefer my name for the method itself. I look forward to benchmarking your solution a little later. \$\endgroup\$ Commented Jul 20, 2019 at 7:06
2
\$\begingroup\$

FYI. I've now tested the different versions of the algorithm from the different answers, and the result is as follows:

Data Size: 10
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 0.38341 0.05530 16.09750 19.17070 2.24480 [Milliseconds]
dfhwze : 50 0.09890 0.01250 3.96660 4.94510 0.55250 [Milliseconds]
Peter Tayl: 50 0.14559 0.01500 6.16400 7.27940 0.85970 [Milliseconds]
T3chb0t : 50 0.18089 0.01240 8.06260 9.04470 1.12590 [Milliseconds]
Original : 50 0.11584 0.01640 4.54850 5.79220 0.63330 [Milliseconds]
Data Size: 100
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 0.52665 0.48760 0.78700 26.33230 0.05190 [Milliseconds]
dfhwze : 50 0.14118 0.11800 0.24010 7.05920 0.02070 [Milliseconds]
Peter Tayl: 50 0.15725 0.14010 0.35670 7.86250 0.03030 [Milliseconds]
T3chb0t : 50 0.13385 0.11880 0.18680 6.69250 0.01470 [Milliseconds]
Original : 50 0.15542 0.14090 0.32780 7.77100 0.02600 [Milliseconds]
Data Size: 1000
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 4.86897 4.56660 5.49500 243.44840 0.19180 [Milliseconds]
dfhwze : 50 1.22802 1.14460 1.55030 61.40110 0.10070 [Milliseconds]
Peter Tayl: 50 1.51039 1.41420 1.83450 75.51970 0.10540 [Milliseconds]
T3chb0t : 50 1.33878 1.13730 2.61480 66.93920 0.21000 [Milliseconds]
Original : 50 1.53352 1.39930 1.93510 76.67620 0.12120 [Milliseconds]
Data Size: 10000
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 53.30435 48.53940 59.39360 2665.21760 2.12420 [Milliseconds]
dfhwze : 50 13.29163 11.58010 17.93610 664.58150 1.42940 [Milliseconds]
Peter Tayl: 50 15.99885 13.73030 19.87350 799.94260 1.62800 [Milliseconds]
T3chb0t : 50 13.35479 11.60260 17.27620 667.73940 1.33350 [Milliseconds]
Original : 50 16.06655 14.10760 21.15530 803.32750 1.57870 [Milliseconds]
Data Size: 100000
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 759.18213 671.44490 972.02490 37959.10640 106.57280 [Milliseconds]
dfhwze : 50 184.68625 157.19610 240.79290 9234.31240 27.82440 [Milliseconds]
Peter Tayl: 50 247.55367 207.27300 296.28640 12377.68350 38.71610 [Milliseconds]
T3chb0t : 50 200.40129 159.78880 241.07520 10020.06430 31.49570 [Milliseconds]
Original : 50 250.01759 208.41280 324.99400 12500.87940 39.78020 [Milliseconds]
Data Size: 500000
Name Iterations Average Min Max Total Std Dev Units
Pieter Wit: 50 4241.30253 3572.39540 4887.39420 212065.12660 382.99050 [Milliseconds]
dfhwze : 50 1009.33538 798.42660 1143.81710 50466.76910 124.30220 [Milliseconds]
Peter Tayl: 50 1344.13312 1085.37460 1562.34310 67206.65590 185.08020 [Milliseconds]
T3chb0t : 50 1002.87650 784.16660 1195.38060 50143.82510 136.03740 [Milliseconds]
Original : 50 1354.36220 1072.92070 1536.09860 67718.10980 171.94550 [Milliseconds]

Test Data: randomly generated strings of length [0, 20), and the testcase was:

 foreach (var result in data.MultiGroupBy(
     ("First UCase", s => s.Length > 0 && char.IsUpper(s[0])),
     ("Length", s => s.Length),
     ("Length Four", s => s.Length == 4),
     ("Contains 'e'", s => s.Contains('e')),
     ("Num 'n's", s => s.Count(c => c == 'n'))))
 {
     foreach (var dict in result.Value)
     {
         sum += dict.Value.Count;
     }
 }

In order to get equivalent results, I replaced the HashSet in the original with a List.

It's somewhat disappointing that my effort to do it in one iteration didn't pay off.

answered Jul 21, 2019 at 4:45
\$\endgroup\$
2
  • \$\begingroup\$ I'm too lazy to try it myself, but what happens when you replace the results dictionaries with an array in the original and Peter Taylor's answers? There's really no reason to use a dictionary over the labels in the loop, when the groupers are already in an ordered list. \$\endgroup\$ Commented Jul 21, 2019 at 9:17
  • \$\begingroup\$ @VisualMelon: Good suggestion. I've tried with the original, and it seems to improve it by about 7-10 percent. \$\endgroup\$ Commented Jul 21, 2019 at 14:12
