I want to process a ton of JSON objects as quickly as possible.
I've done my best to make the processing as concise and efficient as I can.
How can I make it more efficient?
private void processButton_Click(object sender, EventArgs e)
{
    // For measuring how long the processing takes.
    var stopwatch = Stopwatch.StartNew();

    // Get the IDs of the completed RootObjects.
    // There are no duplicate IDs.
    var completed = new List<string>();
    foreach (var file in Directory.EnumerateFiles("C:\\Completed", "*.json"))
    {
        completed.AddRange(
            JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)).Select(o => o.id));
    }
    Console.WriteLine($"completed.Count: {completed.Count}");

    // Get the unfinished RootObjects.
    //
    // 78,198 of the unfinished RootObjects share their ID with another.
    // The duplicates are removed in the next step. (#2)
    //
    // The unfinished RootObjects also contain ALL of the completed RootObjects.
    // The completed RootObjects are ignored in the next step. (#1)
    var unfinished = new List<RootObject>();
    foreach (var file in Directory.EnumerateFiles("C:\\Unfinished", "*.json"))
    {
        unfinished.AddRange(JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)));
    }
    Console.WriteLine($"unfinished.Count: {unfinished.Count}");

    var processed =
        unfinished.Where(o => !completed.Contains(o.id))               // (#1) Ignore all completed RootObjects.
                  .GroupBy(o => o.id).Select(objects => objects.First()) // (#2) Remove all duplicate RootObjects.
                  .ToList();
    Console.WriteLine($"processed.Count: {processed.Count}");

    stopwatch.Stop();
    Console.WriteLine($"stopwatch.ElapsedMilliseconds: {stopwatch.ElapsedMilliseconds}");
}
// Output:
// completed.Count: 35649
// unfinished.Count: 250315
// processed.Count: 136468
// stopwatch.ElapsedMilliseconds: 75875
If you require any further information, let me know.
- Code Review isn't an online performance profiling service. There are tools you can use in the first place. – πάντα ῥεῖ, Jan 19, 2017 at 18:58
- The problem is that you might be reading way too much. Instead of reading the whole file you should only read some lines of it. To do that you might want to ensure that the objects are ordered in the file by id. – Bruno Costa, Jan 19, 2017 at 21:32
2 Answers
LINQ (convenient)
You could rewrite the loops with LINQ so this
var completed = new List<string>();
foreach (var file in Directory.EnumerateFiles("C:\\Completed", "*.json"))
{
    completed.AddRange(
        JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)).Select(o => o.id));
}
could become this
var completed =
    Directory.EnumerateFiles("C:\\Completed", "*.json")
        .Select(File.ReadAllText)
        .Select(json => JsonConvert.DeserializeObject<List<RootObject>>(json).Select(x => x.id))
        .SelectMany(x => x)
        .ToList();
You can do the same with the other loop.
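For example, the unfinished loop could become something like this (a sketch following the same pattern; RootObject and the folder path are the ones from the question):

var unfinished =
    Directory.EnumerateFiles("C:\\Unfinished", "*.json")
        .Select(File.ReadAllText)
        .SelectMany(json => JsonConvert.DeserializeObject<List<RootObject>>(json))
        .ToList();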
When using LINQ you might experience some performance hit, but don't let that discourage you from using it. Use LINQ first because it's easy to write and read, and only rewrite it as a plain loop if you're not satisfied with its performance.
In most cases you won't even notice that LINQ is in action.
Run a profiler if you're not sure. Don't blindly rewrite everything with loops just because LINQ could potentially slow something down, unless you really, really, really need to.
var processed =
    unfinished.Where(o => !completed.Contains(o.id))               // (#1) Ignore all completed RootObjects.
              .GroupBy(o => o.id).Select(objects => objects.First()) // (#2) Remove all duplicate RootObjects.
              .ToList();
I think this is the most inefficient part of your code, because the completed.Contains method is an O(n) operation. There are two alternatives: either use a HashSet or the Except extension, which
[..] returns those elements in first that do not appear in second. It does not also return those elements in second that do not appear in first.
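The HashSet route is the smaller change of the two: keep completed as ids, but store them in a HashSet<string> so each lookup becomes O(1) on average instead of O(n). A minimal sketch of just that change, reusing the names from the question:

var completedIds = new HashSet<string>(completed);

var processed =
    unfinished.Where(o => !completedIds.Contains(o.id))              // (#1) now an O(1) lookup on average
              .GroupBy(o => o.id).Select(objects => objects.First()) // (#2) unchanged
              .ToList();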
In order to make the Except version work, and to make its usage easier, you should get entire RootObjects from the first query instead of just the ids, and you'll need a custom comparer for the RootObject. You can reuse that comparer for the Distinct extension, so you can replace the
.GroupBy(o => o.id).Select(objects => objects.First())
with
.Distinct(new RootObjectComparer())
The new query is now much easier to understand and does not require any comments. It also runs the Distinct before the Except, so the Except has fewer items to process.
var processed =
    unfinished
        .Distinct(new RootObjectComparer())
        .Except(completed, new RootObjectComparer())
        .ToList();
The custom comparer.
class RootObjectComparer : IEqualityComparer<RootObject>
{
    public bool Equals(RootObject left, RootObject right)
    {
        return left.id == right.id;
    }

    public int GetHashCode(RootObject obj)
    {
        // id is a string in the question's model, so hash the string itself.
        return obj.id.GetHashCode();
    }
}
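For the Except version, completed then has to hold RootObjects rather than ids. A minimal sketch of that first query, following the same LINQ pattern as above (folder path taken from the question):

var completed =
    Directory.EnumerateFiles("C:\\Completed", "*.json")
        .Select(File.ReadAllText)
        .SelectMany(json => JsonConvert.DeserializeObject<List<RootObject>>(json))
        .ToList();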
LINQ free (fast)
You could mix LINQ and the HashSet into the following, where you pass not only the root-objects to the constructor but also the custom comparer.
A hash-set stores only unique objects, so there's no need to filter them.
var unfinished = new HashSet<RootObject>(
    Directory
        .EnumerateFiles("C:\\Unfinished", "*.json")
        .Select(File.ReadAllText)
        .Select(json => JsonConvert.DeserializeObject<List<RootObject>>(json))
        .SelectMany(x => x),
    new RootObjectComparer()
);
The fastest version, however, should be this LINQ-free one, where you import the lists directly into the hash-set.
var unfinished = new HashSet<RootObject>(new RootObjectComparer());
foreach (var file in Directory.EnumerateFiles("C:\\Unfinished", "*.json"))
{
    unfinished.UnionWith(
        JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)));
}
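Here, too, completed has to be a collection of RootObjects rather than ids for the exclusion below to compile. A sketch of loading it in the same LINQ-free style (a plain List<RootObject> would also do, since ExceptWith uses the unfinished set's own comparer):

var completed = new HashSet<RootObject>(new RootObjectComparer());
foreach (var file in Directory.EnumerateFiles("C:\\Completed", "*.json"))
{
    completed.UnionWith(
        JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)));
}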
Then you exclude the completed items with
unfinished.ExceptWith(completed);
- If you really are going to try and squeeze as much performance as you can out of this (while ignoring technical debt), why not recommend using for loops instead of foreach? Or even going as far as partially unwinding your loops to give out-of-order execution a leg up? If he is handling tens or hundreds of millions of objects there will be very significant performance gains. – Douglas Gaskell, Jan 19, 2017 at 23:01
- @DouglasGaskell foreach in this case will be faster anyway, because if you have a lot of files then with EnumerateFiles you're enumerating them only once. If you wanted to use a for loop you'd have to get all the files first, either with GetFiles or by calling ToList, and by doing so you'd enumerate them all twice. MSDN states that using EnumerateFiles with lots of files is faster. – t3chb0t, Jan 20, 2017 at 07:36
- With your help I've reduced the processing time by 95%. Thank you! – Owen, Jan 21, 2017 at 22:06
Don't have much to say about your code as it's rather short but I have a few notes:
- I'd use verbatim strings for the directory: "C:\\Completed" -> @"C:\Completed"
- You can refactor the logic for obtaining the files into a separate method to shorten your foreach loops:

private static IEnumerable<List<RootObject>> GetFiles(string directory)
{
    return Directory.EnumerateFiles(directory, "*.json")
        .Select(file => JsonConvert.DeserializeObject<List<RootObject>>(File.ReadAllText(file)));
}
Usage:
var completed = new List<string>();
foreach (var file in GetFiles(@"C:\Completed"))
{
    completed.AddRange(file.Select(o => o.id));
}
Console.WriteLine($"completed.Count: {completed.Count}");

var unfinished = new List<RootObject>();
foreach (var file in GetFiles(@"C:\Unfinished"))
{
    unfinished.AddRange(file);
}
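For what it's worth, the helper also combines with the LINQ style from the first answer; a minimal sketch (same RootObject and paths as above):

var completed = GetFiles(@"C:\Completed").SelectMany(list => list).Select(o => o.id).ToList();
var unfinished = GetFiles(@"C:\Unfinished").SelectMany(list => list).ToList();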
- GetFiles should be renamed to something else because this method does not return files but an IEnumerable<RootObject>. – Heslacher, Jan 20, 2017 at 05:16
- I'm not sure what the purpose of RootObject is; I will leave that to the OP. – Denis, Jan 20, 2017 at 05:20