7

Preface

We have been stuck for years using a monolithic program with terrible and inefficient database structure. Now we are finally moving away from it with our own program.

The only area we are having difficulties with is the database. Our data is financial and our existing program causes loss of data mostly through overwrites (UPDATE, DELETE). We have decided to move toward an append-only structure to keep all of the data we accumulate. We may look into using a warehouse if it grows past our initial estimates, but in the meantime, we expect the database to grow to a few millions rows over a few years for some of the larger tables; not massive in scale at this point.

We are using PostgreSQL and will be denying modification privileges such as UPDATE and DELETE for users to ensure historical consistency of data.


Our Ideas

We thought we might be able to contain all current and historic data into one table such that we query the last record (or newest timestamp) for each id as the following table suggests.

To clarify, we have dozens of tables; Below represents a simple version of one such tables.

╔═════╦════╦════════╦════════╦═════════════════════╗
║ key ║ id ║ field1 ║ field2 ║ timestamp ║
╠═════╬════╬════════╬════════╬═════════════════════╣
║ 0 ║ 0 ║ a ║ b ║ 1970年01月01日 00:00:01 ║
╠═════╬════╬════════╬════════╬═════════════════════╣
║ 1 ║ 1 ║ x ║ y ║ 1970年01月01日 00:00:02 ║
╠═════╬════╬════════╬════════╬═════════════════════╣
║ 2 ║ 0 ║ a ║ c ║ 1970年01月01日 00:05:01 ║ # new version of id=0 row
╚═════╩════╩════════╩════════╩═════════════════════╝

We have also considered NoSQL solutions, except we prefer to remain ACID compliant (and most NoSQL solutions cannot meet that expectation).

We have also looked into other products such as AWS QLDB, but we're limited by budget, so we've decided to go with PostgreSQL for the time being and will re-evaluate when necessary.


Questions

  1. How can we best structure the database to ensure optimal performance, but also keep in mind developer usage and ease of use?
  2. Does our existing idea for structuring the database allow for optimal performance and is it coherent?
  3. What limitations and issues (apart from disk space) might we be creating for ourselves with this plan?
asked Jun 11, 2019 at 20:41
1
  • You might find of interest this Q & A. Commented Jun 13, 2019 at 18:19

2 Answers 2

7

This sounds like a use-case for temporal tables. AFAIK PostgreSQL does not implement this natively. There are extensions for it. It can be written in user-code, usually with triggers.

The basic idea is that the data table has two date/time columns. These define the interval during which the value in a row is applicable - it is "the" value for that key during that interval. By key I mean natural key, not a surrogate row-labeling key i.e. the id column, not the key column in your example. It is best to make this a "closed-open" interval so the start time is included but the end time excluded. The resolution of the interval start and end columns has to be appropriate for the update frequency.

When a new value for that key arrives, the existing row is updated to set its end and a new row is written. The end date/time for the old row and the start date/time for the new row are equal, hence the closed-open formulation.

To use your example: the first row arrives

║ key ║ id ║ field1 ║ field2 ║ from ║ to
║ 0 ║ 0 ║ a ║ b ║ 1970年01月01日 00:00:01 ║ 9999年12月31日 23:59:59 ║

Some time later the second row for this key arrives. The first row is updated, the second row is written

║ key ║ id ║ field1 ║ field2 ║ from ║ to
║ 0 ║ 0 ║ a ║ b ║ 1970年01月01日 00:00:01 ║ 1970年01月01日 00:05:01 ║
║ 2 ║ 0 ║ a ║ c ║ 1970年01月01日 00:05:01 ║ 9999年12月31日 23:59:59 ║

To read a value

select <columns>
from <table>
where id = 0
and from <= <right now>
and to > <right now>

The place holder <right now> represents the current instant on the server. It can be local, UTC or whatever. It just has to be consistent with what was used when the row was written. Using UTC uniformly makes this easier.

A consistent set of values can be read from the database from any point in history by using that historical time in the predicate instead of <right now>.

To delete a value set to to the current time.

There are, of course, downsides. Each table must have extra columns. These must be indexed. Ensuring intervals for each key do not overlap using only constraints is non-trivial. Every query must have temporal predicates and they must all use exactly the same time value.

These complexities can be mitigated by adding an archive table. When an existing row's to values is set, that row is also deleted from the data table inserted into the corresponding archive table. So each id value will only have one row in the data table, and many retired rows in the archive table. Queries no longer need the additional temporal predicates. Indexes (on the data table) need not include the interval columns. Historical queries become more complex, and every new value stored will incur the cost of an additional delete and insert.

answered Jun 13, 2019 at 1:22
1
  • This is not append only pattern you call update. You can do it just with new timestamp, distinct on the key with max timestamp will give you latest record. Finally you need to group by to prevent duplication, eg case where same timestamp appeared for same record. Commented Nov 6, 2021 at 5:39
5

I would add a boolean column named active to the table the is TRUE initially and gets changed to FALSE when a row version is deleted or updated.

This will make your queries simpler, and you can use partial indexes to efficiently search only the active rows.

You could use partitioning to keep the active rows in a different partition than the old ones. That will make deactivating rows more expensive, but you can use the active partition directly in your SQL statements. This will also simplify moving the old data elsewhere (foreign tables can be partitions).

answered Jun 12, 2019 at 6:22
1
  • if you have a single datetime created_at field you could also work out the timeline with this system, to know which version was active on whichever past date. Commented Feb 25, 2021 at 1:21

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.