I have a system that checks the status of a large number of entities on a schedule that runs every minute. For each entity there is a JSON file whose fields indicate the statuses of its different attributes. The system dumps these JSON files on a network share.
Each run of the every-minute schedule generates JSON for roughly 20k entities, each with tens of attributes, like this:
```json
[
    {
        "entityid": 12345,
        "attribute1": "queued",
        "attribute2": "pending"
    },
    {
        "entityid": 34563,
        "attribute1": "running",
        "attribute2": "successful"
    }
]
```
I need to be able to track the changes of the entities' attribute statuses over time, for instance to answer questions like: when did the status of entity x become "pending"? What is the best way to store this data and generate such stats?
3 Answers
Overview
I think you can solve your problem in a relatively simple 3-step process:
1. Given two (consecutive) snapshots of the state of your entities, determine the changes between them.
2. Repeat this step until all (available) snapshots are processed, and store the changes somewhere.
3. Query the stored changes for something that is of interest to you.
Finding the Changes
I would most likely create a hash table for the first snapshot. There are a few options here:
- Use the `Id` to map to a data structure that holds your entity's values.
- Use an `(Id, AttributeName)` tuple as the key to map to the values directly. Depending on your language, this might only make sense if all values are of the same type.
- Do the same as above, but use one hash table for each type of attribute.
Now you turn the second snapshot into a hash table and compare it to the first. When you've found and stored all changes, you discard the first hash table (but keep the second one) and repeat this procedure with snapshots two and three - and so on...
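As a rough illustration, here is a minimal Python sketch of this comparison step. The snapshot file names, the use of `entityid` as the key, and the `find_changes` helper are assumptions based on the JSON shown in the question, not an existing implementation:

```python
import json
from datetime import datetime, timezone

def load_snapshot(path):
    """Turn one JSON dump into a hash table keyed by entity id."""
    with open(path) as f:
        entities = json.load(f)
    return {e["entityid"]: e for e in entities}

def find_changes(old, new, timestamp):
    """Yield (time, entity_id, attribute, old_value, new_value) for changed attributes."""
    for entity_id, new_entity in new.items():
        old_entity = old.get(entity_id)
        if old_entity is None:
            continue  # newly added entity; see the section on added/removed entities
        for attr, new_value in new_entity.items():
            if attr == "entityid":
                continue
            old_value = old_entity.get(attr)
            if old_value != new_value:
                yield (timestamp, entity_id, attr, old_value, new_value)

# Compare two consecutive one-minute dumps (hypothetical file names):
old = load_snapshot("snapshot_1200.json")
new = load_snapshot("snapshot_1201.json")
changes = list(find_changes(old, new, datetime.now(timezone.utc)))
```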
Storing the Changes
Each change you find can be represented by a tuple such as `(Time, EntityId, AttributeName, OldValue, NewValue)`. Depending on what you'd like to query, you may not need all of these fields.
Once you've found the changes, the question becomes where to store them. A database seems like the ideal solution. If you have enough memory and don't want to persist the changes, you can use an in-memory DB.
The database will provide all the features to make querying easy and efficient. In particular, you'll have an established query-language and can create the relevant indices.
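As one possible sketch, this is what storing and querying the change tuples could look like with SQLite in Python; the table name, column names, and index are my own illustrative choices, not a prescription:

```python
import sqlite3

conn = sqlite3.connect("changes.db")  # or ":memory:" for an in-memory DB
conn.execute("""
    CREATE TABLE IF NOT EXISTS attribute_changes (
        time      TEXT,
        entity_id INTEGER,
        attribute TEXT,
        old_value TEXT,
        new_value TEXT
    )""")
# Index tailored to the "when did entity x reach status y" kind of question.
conn.execute("""
    CREATE INDEX IF NOT EXISTS idx_entity_attr_value
    ON attribute_changes (entity_id, attribute, new_value)""")

# 'changes' would come from the comparison step above; a tiny made-up example here:
changes = [("2021-05-20T17:48:00", 12345, "attribute1", "queued", "running")]
conn.executemany("INSERT INTO attribute_changes VALUES (?, ?, ?, ?, ?)", changes)
conn.commit()

# When did attribute1 of entity 12345 become "running"?
rows = conn.execute(
    """SELECT time FROM attribute_changes
       WHERE entity_id = ? AND attribute = ? AND new_value = ?""",
    (12345, "attribute1", "running")).fetchall()
```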
Added and Removed Entities
If the set of monitored entities remains constant, you can find all differences by simply iterating over one hash table's keys and comparing the key's values in both tables.
However, when entities may be added and removed, it may be helpful to deal with each case (added, changed, removed) separately.
Added entities can be easily found while building the new hash table. Simply check whether the entity already existed in the old one.
Removed entities can be found together with changed entities while iterating over the entities in the old table.
Alternatively, you can of course use the intersect/complement operations on your key-sets.
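In Python, for example, the dictionary key views already behave like sets, so this boils down to a few set operations (the sample data here is made up):

```python
# old and new are the snapshot hash tables keyed by entity id.
old = {12345: {"attribute1": "queued"},  34563: {"attribute1": "running"}}
new = {12345: {"attribute1": "running"}, 99999: {"attribute1": "queued"}}

added_ids   = new.keys() - old.keys()   # entities that appeared
removed_ids = old.keys() - new.keys()   # entities that disappeared
common_ids  = new.keys() & old.keys()   # candidates for changed attributes
```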
You could work with versioning and immutability: instead of modifying an entity you create a new record for it, and the current state of an entity is the record with that entityid and the highest version. Clean up the records once entities are fully out of scope:
```json
[
    {
        "entityid": 12345,
        "attribute1": "queued",
        "attribute2": "pending",
        "version": "1",
        "created": "17:25"
    },
    {
        "entityid": 12345,
        "attribute1": "running",
        "attribute2": "successful",
        "version": "2",
        "created": "17:48"
    },
    {
        "entityid": 34563,
        "attribute1": "running",
        "attribute2": "successful",
        "version": "1",
        "created": "17:20"
    },
    {
        "entityid": 34563,
        "attribute1": "finished",
        "attribute2": "successful",
        "version": "2",
        "created": "17:47"
    }
]
```
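With such versioned records, "when did attribute2 of entity 12345 become 'pending'" amounts to scanning that entity's versions in order. A small Python sketch (the function name and the in-memory record list are just for illustration, reusing the example data above):

```python
def first_time_with_value(records, entity_id, attribute, value):
    """Return the 'created' time of the earliest version where the attribute has the given value."""
    versions = sorted(
        (r for r in records if r["entityid"] == entity_id),
        key=lambda r: int(r["version"]),
    )
    for r in versions:
        if r.get(attribute) == value:
            return r["created"]
    return None

records = [
    {"entityid": 12345, "attribute1": "queued",  "attribute2": "pending",    "version": "1", "created": "17:25"},
    {"entityid": 12345, "attribute1": "running", "attribute2": "successful", "version": "2", "created": "17:48"},
]
print(first_time_with_value(records, 12345, "attribute2", "pending"))  # -> 17:25
```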
Comment (Flater, May 20, 2021): For completeness' sake, this answer is essentially using an event sourcing approach. There's a lot of online resources on this topic if you look for it.
Transforming my comment as an answer, as I think it's a valid suggestion.
Quick warning: I think this is a very controversial answer, as it comes with a lot of caveats. Make sure to read them.
Go Low-Tech
Rather than using a full-on programmatic approach, you might get away with a "low-tech" implementation using `diff`, `rsync`, or a `git` repository as storage, since you mention dumping JSON files.
Caveats
BUT as mentioned, that's really controversial, and that's HIGHLY dependent on quite a lot of things:
- your use case,
- your context,
- how often your data changes,
- how much data you store (how many entities and their size),
- if the order of the entities changes (not just their content),
- and what you want to do with the results.
Also, this is only interesting if you're OK with doing that check at the filesystem level. Otherwise, going with a DB or in-memory solution would be best. Again, it depends on your scenario, but my gut feeling is that a diff would do, at least for a PoC.
It's also quite likely that libraries already exist for this purpose, letting you achieve something similar without implementing the diffing yourself (e.g. https://www.npmjs.com/package/json-diff).
As mentioned, it's a rather controversial approach :)
It won't fly either if you use a "bad" filesystem that doesn't deal well with large folders, or is very slow at processing a lot of small files.
Run-Through
Without providing a complete implementation, my quick and dirty way would be to:
- dump your entities with one entity per file, with the entity ID as the filename, and commit the changes,
- diff your files on each change,
- whenever you need to check the history, look at the files with diff changes and retrieve what you need from the history (timeline, changes, author, etc.).
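A rough Python sketch of that flow, assuming a git repository has already been initialised somewhere on the share (the path, file layout, and helper names are my own illustration):

```python
import json
import subprocess
from pathlib import Path

REPO = Path("/mnt/share/entity-history")  # hypothetical git repo on the network share

def commit_snapshot(snapshot_path):
    """Write one file per entity and record the resulting changes as a git commit."""
    with open(snapshot_path) as f:
        entities = json.load(f)
    for e in entities:
        path = REPO / f"{e['entityid']}.json"
        path.write_text(json.dumps(e, indent=2, sort_keys=True))
    subprocess.run(["git", "-C", str(REPO), "add", "."], check=True)
    subprocess.run(
        ["git", "-C", str(REPO), "commit", "--allow-empty", "-m", f"snapshot {snapshot_path}"],
        check=True,
    )

def history_of(entity_id):
    """Return the commit history (with diffs) of a single entity's file."""
    result = subprocess.run(
        ["git", "-C", str(REPO), "log", "-p", "--", f"{entity_id}.json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```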