keep your snapshots simple

DEV Community

if any of that feels familiar from data vault, that is because effective satellites are essentially the same idea expressed in vault vocabulary. i wrote about a related grain trap in left join an effective satellite without duplicating rows, and the same care applies here. the moment you have validity intervals, every join and every filter has to respect them.

what a dbt snapshot is

a dbt snapshot is dbt's built-in implementation of type 2 history capture. you write a query that returns the current state of a source, and dbt takes responsibility for comparing that current state against the snapshot table on every run, closing rows that changed and inserting new versions. the columns it adds are:

dbt_valid_from: when this version became current
dbt_valid_to: when this version stopped being current (or the sentinel for current rows)
dbt_scd_id: a hash of the unique key plus dbt_valid_from for stable surrogate identity

i wrote a longer companion piece on the practical migration story in dbt snapshots, moving from merges to native history. that post is the "how to do it well" view. this post is the "how to do less of it" view.

when a snapshot is the right call

reach for a snapshot when all of the following are true:

the source system overwrites the row in place and does not retain history of its own
you genuinely need point-in-time answers, not just "the current value"
you cannot reconstruct the historical state from a deterministic formula or from another system that already keeps history
the source has a stable grain and a reliable unique key
you can guarantee the snapshot will run on a cadence that catches every change you care about

if any one of those is false, a snapshot is probably the wrong tool. for example, if the source already publishes change events to a message broker, capture the events into a regular append-only table and model on top of that. if the value is a deterministic function of other inputs (for example a derived score from frozen reference data), recompute it. if the source has a cycle_id or some other natural temporal key, join on that key directly instead of leaning on validity intervals.

the cost: complexity, brittleness, maintenance, backups

this is the part most adoption guides skip over. snapshots look free in the demo and feel free for the first month. then the bills start to come in.

complexity in the dag

a snapshot is a node in your dag, but it does not behave like other nodes. it is the only model type that has hidden state from prior runs, and it requires its own command (dbt snapshot) on its own schedule. a dbt project that contains snapshots has, in practice, two pipelines that have to stay in lockstep, namely the regular dbt build and the snapshot pipeline. when one of them lags or fails, downstream consumers see stale or partial history and the symptoms can be subtle.

every snapshot also forces every downstream model that reads it to think about validity intervals. queries that used to be a simple select now need a current-row filter or a point-in-time predicate, and reviewers have to verify that filter on every change.

brittleness under change

snapshots are unusually sensitive to upstream changes. a few examples i have seen close up:

a source query is widened to include a new column, and the check strategy now flags every row as changed on the next run, doubling the table overnight
a source briefly drops keys (because of a partial backfill or a bad upstream join), and a hard_deletes = invalidate snapshot closes thousands of rows that are still valid
duplicate keys appear in the source for a single run, and the snapshot either fails or quietly bloats with overlapping intervals
the source schema changes type on a column, and the snapshot now refuses to merge because the staged data and the historical data disagree

each of these takes a careful, surgical fix. you cannot just rerun the snapshot from scratch, because the original sequence of source values is gone.

maintenance and backups

because snapshot output is not reproducible, you have to treat the snapshot table itself as production data, not a derived artifact. that means:

regular backups of the snapshot tables to a separate schema or storage location, with a retention policy you actually enforce
a documented recovery runbook for partial corruption (for example, restore from backup, replay only specific keys, validate intervals)
alerting on row-count deltas, row-count ratios per run, and anomalies in dbt_valid_to distributions, so a misconfigured run does not run for a week before anyone notices
change review for any edit to the snapshot definition, because changing check_cols, unique_key, or the source query can rewrite history in non-obvious ways

a regular dbt model needs none of that. a snapshot needs all of it, and the cost scales with the number of snapshots in your project.

the worst pattern: snapshots on top of snapshots

if there is one rule i would carve into the wall, never build a snapshot on top of another snapshot. mixing two type 2 tables produces validity intervals on top of validity intervals, and the result is almost never what anyone wants.

why this is so dangerous

a single type 2 table is already a careful object. its grain is the business key plus the validity interval, and every consumer has to filter to a single moment in time before doing anything else. when you stack a second snapshot on top of it, you are now tracking history of a thing that was already historical, and the questions you can sensibly ask multiply in ugly ways.

think about what the row count of snapshot_b becomes when its source query reads from snapshot_a without point-in-time filtering. for any business key, you get the cartesian product of versions, which means changes in snapshot_b get attributed to the wrong intervals of snapshot_a. even if you do filter for current rows, the second snapshot will react to every change in the first, including changes that have nothing to do with the attributes you care about, so you end up with a much noisier history than you wanted.

even if you carefully filter the inner snapshot to its current row, you have lost something important. the outer snapshot now records history of a moving target. when you reread the outer snapshot at a past timestamp, the row you get back was generated against the inner snapshot's current state at the time the outer run executed, not against the inner snapshot's state at the same past timestamp. this is the validity-on-validity trap, and it is almost impossible to reason about by inspection.

what to do instead

if you find yourself wanting to build a second snapshot on top of a first, treat that as a signal that the design is wrong, not as a problem to solve in sql. a few healthier alternatives:

have one snapshot per source object that genuinely needs history, and read it directly in your information delivery layer
if you need a derived attribute that depends on a snapshot, compute that attribute in a regular view that filters the snapshot to a single point in time and is itself recomputable
if the derived attribute genuinely needs its own history, snapshot the source inputs independently and join them by point-in-time logic in a downstream view, instead of stacking the snapshots themselves
if your business has a natural temporal key (cycle, period, year), prefer joining by that key over inferring history from validity intervals

the goal is to push as much of the temporal logic as you can into deterministic transformations, and keep the snapshots themselves at the edges of the dag.

less is more, simple is better

most of the temporal questions you think need a snapshot do not. before adding one, run through this short checklist:

is the value already historical somewhere upstream, in events, in a cycle table, or in another system, where i can read it without snapshotting myself?
can i compute the value deterministically from current inputs, so any past answer is just a recomputation against frozen reference data?
is the source overwriting history in place with no other record of the prior value?
if i never built this snapshot, would consumers really lose information they care about, or just convenience?

if the answer to either of the first two is yes, do not add a snapshot. if the answer to the third is no, do not add a snapshot. if the answer to the fourth is "just convenience", do not add a snapshot.

what you are aiming for is a warehouse that is mostly deterministic, with a small ring of carefully managed snapshots at the edges. the deterministic core is cheap to rebuild, easy to test, and forgiving to refactor. the snapshot ring is where the real cost lives, so you want it to be small enough that you can afford to back it up, monitor it, and recover it when something goes wrong.

simple beats clever here. one well-run snapshot you understand is worth ten clever snapshots that nobody can rebuild.

faq

when is a snapshot definitely worth it?

when the source overwrites in place, you have a real business need for point-in-time answers, and there is no upstream event log or cycle key to lean on instead. operational systems that mutate rows without retaining history are the canonical case.

what is the single biggest mistake people make with snapshots?

reading from a snapshot in another snapshot's source query. the validity-on-validity trap is the worst class of bug to debug, because the symptom shows up far away from the cause and the table looks plausible at a glance.

how do i reduce the number of snapshots in an existing project?

start with the snapshots that are read by the smallest number of downstream models, and ask whether the consumers really need history or just the current row. if they only need current, replace the snapshot with a regular view. for the snapshots that genuinely need history, make sure each one is independent and that nothing else in the project reads a snapshot to feed another snapshot.

should i ever full-refresh a snapshot?

almost never in production. a full refresh wipes the historical rows that no longer match the current source, which is the entire reason you built the snapshot in the first place. treat the snapshot table like operational data, not a derived artifact.