I have a system that checks the status of a large number of entities on a schedule that runs every minute. For each entity there is a JSON file whose fields indicate the statuses of its different attributes. The system dumps these JSON files on a network share.
Each run of the every-minute schedule generates JSON for roughly 20k entities, each with tens of attributes, like this:
```json
[
    {
        "entityid": 12345,
        "attribute1": "queued",
        "attribute2": "pending"
    },
    {
        "entityid": 34563,
        "attribute1": "running",
        "attribute2": "successful"
    }
]
```
I need to be able to track the changes of the entities' attribute statuses over time, for instance to answer questions like: when did the status of entity x become "pending"? What is the best way to store this data and generate such stats?
3 Answers
Overview
I think you can solve your problem in a relatively simple 3-step process:
1. Given two (consecutive) snapshots of the state of your entities, determine the changes between them.
2. Repeat this step until all (available) snapshots are processed, and store the changes somewhere.
3. Query the stored changes for something that is of interest to you.
Finding the Changes
I would most likely create a hash table for the first snapshot. There are a few options here:
- Use the `Id` to map to a data structure that holds your entity's values.
- Use an `(Id, AttributeName)` tuple as the key to map to the values directly. Depending on your language, this might only make sense if all values are of the same type.
- Do the same as above, but use one hash table for each type of attribute.
Now you turn the second snapshot into a hash table and compare it to the first. When you've found and stored all changes, you discard the first hash table (but keep the second one) and repeat this procedure with snapshots two and three - and so on...
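As a rough illustration, here is a minimal Python sketch of this comparison step. The snapshot file names, the use of `entityid` as the key, and the `find_changes` helper are assumptions based on the JSON shown in the question, not an existing implementation:

```python
import json
from datetime import datetime, timezone

def load_snapshot(path):
    """Turn one JSON dump into a hash table keyed by entity id."""
    with open(path) as f:
        entities = json.load(f)
    return {e["entityid"]: e for e in entities}

def find_changes(old, new, timestamp):
    """Yield (time, entity_id, attribute, old_value, new_value) for changed attributes."""
    for entity_id, new_entity in new.items():
        old_entity = old.get(entity_id)
        if old_entity is None:
            continue  # newly added entity; see the section on added/removed entities
        for attr, new_value in new_entity.items():
            if attr == "entityid":
                continue
            old_value = old_entity.get(attr)
            if old_value != new_value:
                yield (timestamp, entity_id, attr, old_value, new_value)

# Compare two consecutive one-minute dumps (hypothetical file names):
old = load_snapshot("snapshot_1200.json")
new = load_snapshot("snapshot_1201.json")
changes = list(find_changes(old, new, datetime.now(timezone.utc)))
```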
Storing the Changes
Each change you find can be represented by a tuple such as `(Time, EntityId, AttributeName, OldValue, NewValue)`. Depending on what you'd like to query, you may not need all of these fields.
Once you've found the changes, the question becomes where to store them. A database seems like the ideal solution. If you have enough memory and don't want to persist the changes, you can use an in-memory DB.
The database will provide all the features to make querying easy and efficient. In particular, you'll have an established query-language and can create the relevant indices.
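As one possible sketch, this is what storing and querying the change tuples could look like with SQLite in Python; the table name, column names, and index are my own illustrative choices, not a prescription:

```python
import sqlite3

conn = sqlite3.connect("changes.db")  # or ":memory:" for an in-memory DB
conn.execute("""
    CREATE TABLE IF NOT EXISTS attribute_changes (
        time      TEXT,
        entity_id INTEGER,
        attribute TEXT,
        old_value TEXT,
        new_value TEXT
    )""")
# Index tailored to the "when did entity x reach status y" kind of question.
conn.execute("""
    CREATE INDEX IF NOT EXISTS idx_entity_attr_value
    ON attribute_changes (entity_id, attribute, new_value)""")

# 'changes' would come from the comparison step above; a tiny made-up example here:
changes = [("2021-05-20T17:48:00", 12345, "attribute1", "queued", "running")]
conn.executemany("INSERT INTO attribute_changes VALUES (?, ?, ?, ?, ?)", changes)
conn.commit()

# When did attribute1 of entity 12345 become "running"?
rows = conn.execute(
    """SELECT time FROM attribute_changes
       WHERE entity_id = ? AND attribute = ? AND new_value = ?""",
    (12345, "attribute1", "running")).fetchall()
```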
Added and Removed Entities
If the set of monitored entities remains constant, you can find all differences by simply iterating over one hash table's keys and comparing the key's values in both tables.
However, when entities may be added and removed, it may be helpful to deal with each case (added, changed, removed) separately.
Added entities can be easily found while building the new hash table. Simply check whether the entity already existed in the old one.
Removed entities can be found together with changed entities while iterating over the entities in the old table.
Alternatively, you can of course use the intersect/complement operations on your key-sets.
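In Python, for example, the dictionary key views already behave like sets, so this boils down to a few set operations (the sample data here is made up):

```python
# old and new are the snapshot hash tables keyed by entity id.
old = {12345: {"attribute1": "queued"},  34563: {"attribute1": "running"}}
new = {12345: {"attribute1": "running"}, 99999: {"attribute1": "queued"}}

added_ids   = new.keys() - old.keys()   # entities that appeared
removed_ids = old.keys() - new.keys()   # entities that disappeared
common_ids  = new.keys() & old.keys()   # candidates for changed attributes
```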
You could work with versioning and immutability: instead of modifying an entity you create a new record for it, and the current state of an entity is the record with that entityid and the highest version. Clean up the records once entities are fully out of scope:
```json
[
    {
        "entityid": 12345,
        "attribute1": "queued",
        "attribute2": "pending",
        "version": "1",
        "created": "17:25"
    },
    {
        "entityid": 12345,
        "attribute1": "running",
        "attribute2": "successful",
        "version": "2",
        "created": "17:48"
    },
    {
        "entityid": 34563,
        "attribute1": "running",
        "attribute2": "successful",
        "version": "1",
        "created": "17:20"
    },
    {
        "entityid": 34563,
        "attribute1": "finished",
        "attribute2": "successful",
        "version": "2",
        "created": "17:47"
    }
]
```
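With such versioned records, "when did attribute2 of entity 12345 become 'pending'" amounts to scanning that entity's versions in order. A small Python sketch (the function name and the in-memory record list are just for illustration, reusing the example data above):

```python
def first_time_with_value(records, entity_id, attribute, value):
    """Return the 'created' time of the earliest version where the attribute has the given value."""
    versions = sorted(
        (r for r in records if r["entityid"] == entity_id),
        key=lambda r: int(r["version"]),
    )
    for r in versions:
        if r.get(attribute) == value:
            return r["created"]
    return None

records = [
    {"entityid": 12345, "attribute1": "queued",  "attribute2": "pending",    "version": "1", "created": "17:25"},
    {"entityid": 12345, "attribute1": "running", "attribute2": "successful", "version": "2", "created": "17:48"},
]
print(first_time_with_value(records, 12345, "attribute2", "pending"))  # -> 17:25
```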
Comment (Flater, May 20, 2021): For completeness' sake, this answer is essentially using an event sourcing approach. There's a lot of online resources on this topic if you look for it.
Transforming my comment as an answer, as I think it's a valid suggestion.
Quick warning: I think this is a very controversial answer, as it comes with a lot of caveats. Make sure to read them.
Go Low-Tech
Rather than using a full-on programmatic approach, you might get away with a "low-tech" implementation using `diff`, `rsync`, or a `git` repository as storage, since you mention dumping JSON files.
Caveats
BUT as mentioned, that's really controversial, and that's HIGHLY dependent on quite a lot of things:
- your use case,
- your context,
- how often your data changes,
- how much data you store (how many entities and their size),
- if the order of the entities changes (not just their content),
- and what you want to do with the results.
Also, this is only interesting if you're OK with doing that check at the filesystem level. Otherwise, going with a DB or in-memory solution would be best. Again, it depends on your scenario, but my gut feeling is that a diff would do, at least for a PoC.
It's also quite likely that libraries already exist for this purpose, letting you achieve something similar without implementing the diffing yourself (e.g. https://www.npmjs.com/package/json-diff).
As mentioned, it's a rather controversial approach :)
It won't fly either if you use a "bad" filesystem that doesn't deal well with large folders, or is very slow at processing a lot of small files.
Run-Through
Without providing a complete implementation, my quick and dirty way would be to:
- dump your entities with one entity per file, with the entity ID as the filename, and commit the changes,
- diff your files on each change,
- whenever you need to check the history, look at the files with diff changes and retrieve what you need from the history (timeline, changes, author, etc.).
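A rough Python sketch of that flow, assuming a git repository has already been initialised somewhere on the share (the path, file layout, and helper names are my own illustration):

```python
import json
import subprocess
from pathlib import Path

REPO = Path("/mnt/share/entity-history")  # hypothetical git repo on the network share

def commit_snapshot(snapshot_path):
    """Write one file per entity and record the resulting changes as a git commit."""
    with open(snapshot_path) as f:
        entities = json.load(f)
    for e in entities:
        path = REPO / f"{e['entityid']}.json"
        path.write_text(json.dumps(e, indent=2, sort_keys=True))
    subprocess.run(["git", "-C", str(REPO), "add", "."], check=True)
    subprocess.run(
        ["git", "-C", str(REPO), "commit", "--allow-empty", "-m", f"snapshot {snapshot_path}"],
        check=True,
    )

def history_of(entity_id):
    """Return the commit history (with diffs) of a single entity's file."""
    result = subprocess.run(
        ["git", "-C", str(REPO), "log", "-p", "--", f"{entity_id}.json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```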