I want to process a ton of JSON objects as quickly as possible.
I've done my best to make the processing as concise and efficient as I can.
How can I make it more efficient?
private void processButton_Click(object sender, EventArgs e)
{
    // For measuring how long the processing takes.
    var stopwatch = Stopwatch.StartNew();

    // Get the IDs of the completed RootObjects.
    // There are no duplicate IDs.
    var completed = new List<string>();
    foreach (var file in Directory.EnumerateFiles("C:\\Completed", "*.json"))
    {
        completed.AddRange(
            JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)).Select(o => o.id));
    }
    Console.WriteLine($"completed.Count: {completed.Count}");

    // Get the unfinished RootObjects.
    //
    // 78,198 of the unfinished RootObjects share their ID with another.
    // The duplicates are removed in the next step. (#2)
    //
    // The unfinished RootObjects also contain ALL of the completed RootObjects.
    // The completed RootObjects are ignored in the next step. (#1)
    var unfinished = new List<RootObject>();
    foreach (var file in Directory.EnumerateFiles("C:\\Unfinished", "*.json"))
    {
        unfinished.AddRange(JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)));
    }
    Console.WriteLine($"unfinished.Count: {unfinished.Count}");

    var processed =
        unfinished.Where(o => !completed.Contains(o.id))               // (#1) Ignore all completed RootObjects.
                  .GroupBy(o => o.id).Select(objects => objects.First()) // (#2) Remove all duplicate RootObjects.
                  .ToList();
    Console.WriteLine($"processed.Count: {processed.Count}");

    stopwatch.Stop();
    Console.WriteLine($"stopwatch.ElapsedMilliseconds: {stopwatch.ElapsedMilliseconds}");
}
// Output:
// completed.Count: 35649
// unfinished.Count: 250315
// processed.Count: 136468
// stopwatch.ElapsedMilliseconds: 75875
If you require any further information, let me know.
- Code Review isn't an online performance profiling service. There are tools you can use in the first place. – πάντα ῥεῖ, Jan 19, 2017 at 18:58
- The problem is that you might be reading way too much. Instead of reading the whole file you should only read some lines of it. To do that you might want to ensure that the objects are ordered in the file by id. – Bruno Costa, Jan 19, 2017 at 21:32
2 Answers
LINQ (convenient)
You could rewrite the loops with LINQ so this
var completed = new List<string>();
foreach (var file in Directory.EnumerateFiles("C:\\Completed", "*.json"))
{
    completed.AddRange(
        JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)).Select(o => o.id));
}
could become this
var completed =
    Directory.EnumerateFiles("C:\\Completed", "*.json")
        .Select(File.ReadAllText)
        .Select(json => JsonConvert.DeserializeObject<List<RootObject>>(json).Select(x => x.id))
        .SelectMany(x => x)
        .ToList();
You can do the same with the other loop.
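For example, the unfinished loop could become something like this (a sketch following the same pattern; RootObject and the folder path are the ones from the question):

var unfinished =
    Directory.EnumerateFiles("C:\\Unfinished", "*.json")
        .Select(File.ReadAllText)
        .SelectMany(json => JsonConvert.DeserializeObject<List<RootObject>>(json))
        .ToList();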
When using LINQ you might experience some performance hit, but don't let that discourage you from using it. Use LINQ first because it's easy to write and read, and only rewrite it as a plain loop if you're not satisfied with its performance.
In most cases you won't even notice that LINQ is in action.
Run a profiler if you're not sure. Don't blindly rewrite everything with loops just because LINQ could potentially slow something down, unless you really, really, really need to.
var processed =
    unfinished.Where(o => !completed.Contains(o.id))               // (#1) Ignore all completed RootObjects.
              .GroupBy(o => o.id).Select(objects => objects.First()) // (#2) Remove all duplicate RootObjects.
              .ToList();
I think this is the most inefficient part of your code, because the completed.Contains method is an O(n) operation. There are two alternatives: either use a HashSet or the Except extension, which
[..] returns those elements in first that do not appear in second. It does not also return those elements in second that do not appear in first.
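The HashSet route is the smaller change of the two: keep completed as ids, but store them in a HashSet<string> so each lookup becomes O(1) on average instead of O(n). A minimal sketch of just that change, reusing the names from the question:

var completedIds = new HashSet<string>(completed);

var processed =
    unfinished.Where(o => !completedIds.Contains(o.id))              // (#1) now an O(1) lookup on average
              .GroupBy(o => o.id).Select(objects => objects.First()) // (#2) unchanged
              .ToList();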
In order to make the Except version work, and to make its usage easier, you should get entire RootObjects from the first query instead of just the ids, and you'll need a custom comparer for the RootObject. You can reuse that comparer for the Distinct extension, so you can replace the
.GroupBy(o => o.id).Select(objects => objects.First())
with
.Distinct(new RootObjectComparer())
The new query is now much easier to understand and does not require any comments. It also runs the Distinct before the Except, so the Except has fewer items to process.
var processed =
    unfinished
        .Distinct(new RootObjectComparer())
        .Except(completed, new RootObjectComparer())
        .ToList();
The custom comparer.
class RootObjectComparer : IEqualityComparer<RootObject>
{
    public bool Equals(RootObject left, RootObject right)
    {
        return left.id == right.id;
    }

    public int GetHashCode(RootObject obj)
    {
        // id is a string in the question's model, so hash the string itself.
        return obj.id.GetHashCode();
    }
}
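For the Except version, completed then has to hold RootObjects rather than ids. A minimal sketch of that first query, following the same LINQ pattern as above (folder path taken from the question):

var completed =
    Directory.EnumerateFiles("C:\\Completed", "*.json")
        .Select(File.ReadAllText)
        .SelectMany(json => JsonConvert.DeserializeObject<List<RootObject>>(json))
        .ToList();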
LINQ free (fast)
You could mix LINQ and the HashSet into the following, where you pass not only the root-objects to the constructor but also the custom comparer.
A hash-set stores only unique objects, so there's no need to filter them.
var unfinished = new HashSet<RootObject>(
    Directory
        .EnumerateFiles("C:\\Unfinished", "*.json")
        .Select(File.ReadAllText)
        .Select(json => JsonConvert.DeserializeObject<List<RootObject>>(json))
        .SelectMany(x => x),
    new RootObjectComparer()
);
The fastest version, however, should be this LINQ-free one, where you import the lists directly into the hash-set.
var unfinished = new HashSet<RootObject>(new RootObjectComparer());
foreach (var file in Directory.EnumerateFiles("C:\\Unfinished", "*.json"))
{
    unfinished.UnionWith(
        JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)));
}
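Here, too, completed has to be a collection of RootObjects rather than ids for the exclusion below to compile. A sketch of loading it in the same LINQ-free style (a plain List<RootObject> would also do, since ExceptWith uses the unfinished set's own comparer):

var completed = new HashSet<RootObject>(new RootObjectComparer());
foreach (var file in Directory.EnumerateFiles("C:\\Completed", "*.json"))
{
    completed.UnionWith(
        JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)));
}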
Then you exclude the completed items with
unfinished.ExceptWith(completed);
- If you really are going to try and squeeze as much performance as you can out of this (while ignoring technical debt), why not recommend using for loops instead of foreach? Or even going as far as partially unwinding your loops to give out-of-order execution a leg up? If he is handling tens or hundreds of millions of objects there will be very significant performance gains. – Douglas Gaskell, Jan 19, 2017 at 23:01
- @DouglasGaskell foreach in this case will be faster anyway, because if you have a lot of files then with EnumerateFiles you're enumerating them only once. If you wanted to use a for loop you'd have to get all the files first, either with GetFiles or by calling ToList, and by doing so you'd enumerate them all twice. MSDN states that using EnumerateFiles with lots of files is faster. – t3chb0t, Jan 20, 2017 at 07:36
- With your help I've reduced the processing time by 95%. Thank you! – Owen, Jan 21, 2017 at 22:06
Don't have much to say about your code as it's rather short but I have a few notes:
- I'd use verbatim strings for the directory: "C:\\Completed" -> @"C:\Completed"
- You can refactor the logic for obtaining the files into a separate method to shorten your foreach loops:

private static IEnumerable<List<RootObject>> GetFiles(string directory)
{
    return Directory.EnumerateFiles(directory, "*.json")
        .Select(file => JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)));
}
Usage:
var completed = new List<string>();
foreach (var file in GetFiles(@"C:\Completed"))
{
    completed.AddRange(file.Select(o => o.id));
}
Console.WriteLine($"completed.Count: {completed.Count}");

var unfinished = new List<RootObject>();
foreach (var file in GetFiles(@"C:\Unfinished"))
{
    unfinished.AddRange(file);
}
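For what it's worth, the helper also combines with the LINQ style from the first answer; a minimal sketch (same RootObject and paths as above):

var completed = GetFiles(@"C:\Completed").SelectMany(list => list).Select(o => o.id).ToList();
var unfinished = GetFiles(@"C:\Unfinished").SelectMany(list => list).ToList();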
- GetFiles should be renamed to something else because this method does not return files but an IEnumerable<RootObject>. – Heslacher, Jan 20, 2017 at 05:16
- I'm not sure what the purpose of RootObject is; I will leave that to the OP. – Denis, Jan 20, 2017 at 05:20