
I have a system that checks the status of a large number of entities on a schedule, every minute. For each entity, there is a JSON record with fields indicating the statuses of its different attributes. The system dumps these JSON files on a network share.

Each run of the every-minute schedule generates a JSON file with 20k-odd entities like these, each having tens of attributes:

[
  {
    "entityid": 12345,
    "attribute1": "queued",
    "attribute2": "pending"
  },
  {
    "entityid": 34563,
    "attribute1": "running",
    "attribute2": "successful"
  }
]

I need to be able to track the changes of the entities' attribute statuses over time; for instance, to answer questions like: when did the status of entity x become "pending"? What is the best way to store this data and generate the stats?

asked Nov 30, 2018 at 20:47
  • You need more than just a "these entities just changed their states in this way" notification, right? If so, how much history do you need to retain? Commented Dec 1, 2018 at 14:14
  • How long does this stuff have to live (especially subsequent JSON file updates)? To me, this shouts "Put it in a database". Commented Aug 24, 2020 at 8:17
  • Is this something that deep object diffing, like deep-diff (JavaScript) or deepdiff (Python), could help with? Commented May 20, 2021 at 4:45
  • Just an idea: you might get away with a "low-tech" implementation using diff, rsync, or a git repository as storage. BUT that's HIGHLY dependent on: your use case, your context, how often your data would change, how much data you'd store, whether the order of the entities might change as well as the content of each entity, and what you want to do with the results. Also, it's only interesting if you're OK doing that check at the filesystem level. Otherwise, going with a DB or in-memory solution would be best. Again, it depends on your scenario, but my gut feeling is that a diff would do. Commented Oct 9, 2023 at 15:19
  • The best way is to capture the event that makes the change, rather than check for a change after the fact. Commented Nov 8, 2023 at 21:15

3 Answers


Overview

I think you can solve your problem in a relatively simple 3-step process:

  1. Given two (consecutive) snapshots of the state of your entities, determine the changes between them.

  2. Repeat step 1 for each consecutive pair until all (available) snapshots are processed, storing the changes somewhere.

  3. Query the stored changes for something that is of interest to you.

Finding the Changes

I would most likely create a hash table for the first snapshot. There are a few options here:

  • Use the Id to map to a data structure that holds your entity's values

  • Use an (Id, AttributeName) tuple as the key to map to the values directly. Depending on your language, this might only make sense if all values are of the same type.

  • Do the same as above, but use one hash table for each type of attribute.

Now you turn the second snapshot into a hash table and compare it to the first. When you've found and stored all changes, you discard the first hash table (but keep the second one) and repeat this procedure with snapshots two and three - and so on...
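For illustration, here's a minimal Python sketch of that comparison step, using the first option (Id mapped to the entity's values) for the hash tables. The file names and the timestamp argument are placeholders, not anything prescribed by the original setup:

import json

def load_snapshot(path):
    """Load one scheduled dump and key it by entityid (assumed unique per run)."""
    with open(path) as f:
        return {entity["entityid"]: entity for entity in json.load(f)}

def diff_snapshots(old, new, timestamp):
    """Yield one (time, entityid, attribute, old_value, new_value) tuple per change."""
    for entity_id, new_entity in new.items():
        old_entity = old.get(entity_id, {})
        for attribute, new_value in new_entity.items():
            if attribute == "entityid":
                continue
            old_value = old_entity.get(attribute)
            if old_value != new_value:
                yield (timestamp, entity_id, attribute, old_value, new_value)

# Compare two consecutive dumps from the network share (file names are made up):
old = load_snapshot("run_0847.json")
new = load_snapshot("run_0848.json")
changes = list(diff_snapshots(old, new, "2018-11-30T08:48"))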

Storing the Changes

Each change you find can be represented by a tuple such as (Time, EntityId, AttributeName, OldValue, NewValue). Depending on what you'd like to query, you may not need all of these fields.

Once you've found the changes, the question becomes where to store them. A database seems like the ideal solution. If you have enough memory and don't want to persist the changes, you can use an in-memory DB.

The database will provide all the features to make querying easy and efficient. In particular, you'll have an established query-language and can create the relevant indices.
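As a sketch of what that could look like with SQLite (the table layout and index are just one reasonable choice, not a prescription), the change tuples map straight onto a table, and the question from the original post becomes a one-line query:

import sqlite3

# Change tuples produced by the comparison step, e.g.:
changes = [("2018-11-30T08:48", 12345, "attribute2", "queued", "pending")]

conn = sqlite3.connect("changes.db")  # or ":memory:" if persistence isn't needed
conn.execute("""
    CREATE TABLE IF NOT EXISTS changes (
        time      TEXT,
        entityid  INTEGER,
        attribute TEXT,
        old_value TEXT,
        new_value TEXT
    )
""")
# Index the columns the queries filter on.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_entity_attr ON changes (entityid, attribute)"
)
conn.executemany("INSERT INTO changes VALUES (?, ?, ?, ?, ?)", changes)
conn.commit()

# "When did the status of entity 12345 first become 'pending'?"
(when,) = conn.execute(
    "SELECT MIN(time) FROM changes WHERE entityid = ? AND new_value = ?",
    (12345, "pending"),
).fetchone()
print(when)  # 2018-11-30T08:48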

Added and Removed Entities

If the set of monitored entities remains constant, you can find all differences by simply iterating over one hash table's keys and comparing the key's values in both tables.

However, when entities may be added and removed, it may be helpful to deal with each case (added, changed, removed) separately.

  • Added entities can be easily found while building the new hash table. Simply check whether the entity already existed in the old one.

  • Removed entities can be found together with changed entities while iterating over the entities in the old table.

Alternatively, you can of course use the intersect/complement operations on your key-sets.
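Since the snapshots are already keyed by Id, Python's dict key views support those set operations directly; a tiny illustration with made-up data:

old = {12345: {"attribute1": "queued"},  34563: {"attribute1": "running"}}
new = {34563: {"attribute1": "running"}, 99999: {"attribute1": "queued"}}

added_ids   = new.keys() - old.keys()   # {99999}: record as newly added
removed_ids = old.keys() - new.keys()   # {12345}: record as removed
common_ids  = new.keys() & old.keys()   # {34563}: compare attribute values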

answered Dec 2, 2018 at 18:58

You could work with versioning and immutability: instead of updating an entity, you create a new record for it, so an entity's current state is the record with its entityid and the highest version. Clean up the records once entities are fully out of scope:

[
  {
    "entityid": 12345,
    "attribute1": "queued",
    "attribute2": "pending",
    "version": "1",
    "created": "17:25"
  },
  {
    "entityid": 12345,
    "attribute1": "running",
    "attribute2": "successful",
    "version": "2",
    "created": "17:48"
  },
  {
    "entityid": 34563,
    "attribute1": "running",
    "attribute2": "successful",
    "version": "1",
    "created": "17:20"
  },
  {
    "entityid": 34563,
    "attribute1": "finished",
    "attribute2": "successful",
    "version": "2",
    "created": "17:47"
  }
]
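For completeness, here's a rough Python sketch (the file name is an assumption) of how the original question, "when did entity x become pending?", could be answered against such versioned records:

import json

# Hypothetical: the versioned records above, stored in one file.
with open("entities.json") as f:
    records = json.load(f)

# "When did attribute2 of entity 12345 become 'pending'?"
matches = [
    r for r in records
    if r["entityid"] == 12345 and r["attribute2"] == "pending"
]
first = min(matches, key=lambda r: int(r["version"]), default=None)
if first:
    print(first["created"])  # "17:25" for the sample data above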
answered May 20, 2021 at 7:16
  • For completeness' sake, this answer is essentially using an event sourcing approach. There are a lot of online resources on this topic if you look for them. Commented May 20, 2021 at 7:47

Transforming my comment as an answer, as I think it's a valid suggestion.

Quick warning: I think this is a very controversial answer, as it comes with a lot of caveats. Make sure to read them.

Go Low-Tech

Rather than using a full-on programmatic approach, you might get away with a "low-tech" implementation using diff, rsync, or a git repository as storage, as you mention dumping JSON files.

Caveats

BUT, as mentioned, this is really controversial and HIGHLY dependent on quite a few things:

  • your use case,
  • your context,
  • how often your data changes,
  • how much data you store (how many entities and their size),
  • whether the order of the entities changes (not just their content),
  • and what you want to do with the results.

Also, this is only interesting if you're OK doing that check at the filesystem level. Otherwise, going with a DB or in-memory solution would be best. Again, it depends on your scenario, but my gut feeling is that a diff would do, at least for a PoC.

It's also quite likely that libraries already exist for this purpose, letting you achieve something similar without implementing the diffing yourself (e.g. https://www.npmjs.com/package/json-diff).

As mentioned, it's a rather controversial approach :)

It won't fly either if you use a "bad" filesystem that doesn't deal well with large folders or is very slow at processing a lot of small files.

Run-Through

Without providing a complete implementation, my quick and dirty way would be to:

  • dump your entities one per file, with the entity ID as the filename, and commit the changes,
  • diff your files on each change,
  • whenever you need to check the history, look for files with diff changes and retrieve what you need from the history (timeline, changes, author, etc.); a rough sketch follows below.
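A hedged Python sketch of that run-through (the repository path and file layout are assumptions, and the repository must be initialised with git init first):

import json
import subprocess
from pathlib import Path

REPO = Path("entity-store")  # hypothetical; run `git init entity-store` once

def commit_snapshot(snapshot_path):
    """Write one file per entity, then let git record and timestamp the diff."""
    for entity in json.loads(Path(snapshot_path).read_text()):
        out = REPO / f"{entity['entityid']}.json"
        out.write_text(json.dumps(entity, indent=2, sort_keys=True))
    subprocess.run(["git", "add", "-A"], cwd=REPO, check=True)
    # Note: git commit fails when nothing changed; that case is ignored here.
    subprocess.run(["git", "commit", "-m", f"snapshot {snapshot_path}"], cwd=REPO)

# The history of one entity then comes from git itself, e.g.:
#   git log -p -- 12345.json

Writing each entity with sorted keys keeps the per-file diffs stable even if the attribute order varies between runs, which is one of the caveats listed above.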
answered Oct 9, 2023 at 15:33
