Skip to content
DEV Community

DEV Community

Collapse Expand
vinimabreu profile image
Vinicius Pereira
AI and data engineer. RAG, agents, automations, and scraping pipelines, all built with tests and CI. I write about the small reliable pieces I reuse.
  • Pronouns
    he/him
  • Work
    AI and data engineer, freelance
  • Joined

The line that'll stick with me is "a model cannot mourn the data it was never allowed to read." From the building side that's the whole problem in one sentence: confidence has almost no correlation with completeness. A RAG system hands you a clean, sure-sounding answer off three documents exactly the way it would off three hundred, and nothing in the output tells you which one you got.

What you name well is that this is sliding from a reasoning problem to an access problem. Two deployments of the same model can now disagree purely on what each was allowed to retrieve, which quietly breaks something builders lean on: a fixed question is supposed to have a stable answer. Once retrieval depends on licensing and robots.txt, "correct" becomes a function of what was reachable that day.

The only honest response I've found is to make the system show its evidence boundary instead of hiding it: cite what it used, and abstain or flag when the evidence is thin rather than smoothing over the gap. It's the museum instinct you describe, applied to a pipeline. Treat the missing horizon as a field in the output, not an absence the user never learns about.

Really good framing on the training-vs-retrieval split. I hadn't thought about how invisibly retrieval fragmentation would erode reproducibility.

Collapse Expand
kenwalger profile image
Ken W Alger
Systems architect & technical product leader with roots in bare-metal engineering. I design modern local-first, data-sovereign AI platforms in Go/Python and scale elite core infrastructure teams.
  • Location
    Oregon
  • Joined

I think the reproducibility point is the one that keeps growing in importance the more I think about it.

Historically, if two people queried the same search engine, database, or archive, we largely assumed they were operating against the same body of information. Differences in outcomes were usually attributed to interpretation, expertise, or reasoning.

Increasingly, that assumption may no longer hold.

As you point out, two deployments of the same model can now produce different answers simply because their evidence boundaries differ. Same question. Same model. Different retrieval horizon.

That's a subtle shift, but it has significant implications for trust, reproducibility, and even debugging. When two systems disagree, are we looking at a reasoning difference or an evidence difference? Without visibility into the boundary, it's difficult to tell.

I also like your framing of applying the museum instinct to the pipeline. Archivists, historians, and curators have long understood that provenance isn't just about what is present. It's also about documenting what is missing, unknown, restricted, or unavailable. The absence often tells part of the story.

The more I look at retrieval systems, the more I suspect that evidence boundaries may eventually become as important as citations themselves. Knowing what contributed to an answer is valuable. Knowing the shape of what could not contribute may be equally valuable.

Collapse Expand
vinimabreu profile image
Vinicius Pereira
AI and data engineer. RAG, agents, automations, and scraping pipelines, all built with tests and CI. I write about the small reliable pieces I reuse.
  • Pronouns
    he/him
  • Work
    AI and data engineer, freelance
  • Joined

Yeah, the reproducibility angle is the sleeper here, and I think it only gets teeth if you make the evidence boundary a real artifact instead of a description. Concretely, attach a retrieval manifest to every answer: not just the docs it cited, but the candidates it saw and excluded, each with a reason code (out of license, below the rank cutoff, stale, blocked by robots). The second you have that, your "reasoning difference or evidence difference?" question stops being a judgment call. When two deployments disagree you diff the manifests first, if the evidence sets differ it's an access problem, and if they're identical but the answers still diverge then it's genuinely reasoning or nondeterminism. Right now people debug that backwards, staring at the outputs, because the boundary was never captured in the first place.

And that's the same move that turns "the shape of what could not contribute" from a nice phrase into a field you can actually query. The exclusions with their reasons are the negative space, logged. Citations tell you the answer's support, the exclusion log tells you its blind spots, and imo you need both to really trust the thing. Enjoyed this exchange a lot.

Collapse Expand
kenielzep97 profile image
Self-Correcting Systems
I run a small multi-agent system and write about the part nobody tells you: memory that keeps your corrections, not your flattery. Truth over comfort.
  • Email
  • Education
    Self-taught. No degree. Learned by building, failing, and correcting.
  • Pronouns
    He
  • Joined

The line that stopped me was "a model cannot mourn the data it was never allowed to read." That’s the whole problem in one sentence. A partial-evidence answer and a full-evidence answer show up wearing the same confidence, and the user never sees the missing horizon you’re pointing at. I’ve been circling the same thing from the agent side absence as a first-class provenance category, not a footnote. A system that can’t state what it wasn’t allowed to see is grading its own paper. Most retrieval setups log what they found and stay silent on what they were blocked from, which is the moment "confident" and "complete" quietly stop meaning the same thing. The retrieval-now / training-later split is the sharp part. Watching that one.

Collapse Expand
kenwalger profile image
Ken W Alger
Systems architect & technical product leader with roots in bare-metal engineering. I design modern local-first, data-sovereign AI platforms in Go/Python and scale elite core infrastructure teams.
  • Location
    Oregon
  • Joined

I think you've put your finger on the part that keeps pulling at me as well.

Most systems are very good at explaining presence. Here are the sources I found. Here are the documents I cited. Here are the passages that influenced the answer.

They're far less capable of explaining absence.

Was a source unavailable because it didn't exist? Because it wasn't indexed? Because it was behind a paywall? Because a crawler was blocked? Because a licensing agreement excluded it? Those are very different conditions, yet they often collapse into the same user experience: silence.

I particularly like your observation that a system unable to articulate its own boundaries is effectively grading its own paper. Confidence becomes a much weaker signal when the evidence horizon is unknown.

The retrieval-now / training-later distinction is fascinating to me for the same reason. Training data limitations eventually become historical artifacts. Retrieval limitations are happening in real time. Information Borders can shift overnight because access policies, agreements, and restrictions change overnight.

In that sense, the answer isn't just a reflection of what the model knows. It's increasingly a reflection of what the model was permitted to know at the moment the question was asked.

That feels like a very different problem than the one we've spent the last few years discussing.

Collapse Expand
kenielzep97 profile image
Self-Correcting Systems
I run a small multi-agent system and write about the part nobody tells you: memory that keeps your corrections, not your flattery. Truth over comfort.
  • Email
  • Education
    Self-taught. No degree. Learned by building, failing, and correcting.
  • Pronouns
    He
  • Joined

"Good at explaining presence, bad at explaining absence" is cleaner than how I had it, and I’m stealing it. Mine was "data deficits hidden behind a curtain" yours actually names the four different conditions that get flattened into one silence: doesn’t exist, wasn’t indexed, paywalled, blocked. Those are different failures with different fixes, and right now they all just look like "I don’t know." The line that’s going to sit with me is "what the model was permitted to know at the moment the question was asked." That’s not a knowledge limitation, that’s a permission state and permission states change without anyone re-running the question. Same prompt, different hour, different border. That’s a harder problem than training cutoffs ever were, because training cutoffs at least hold still.

Thread Thread
kenwalger profile image
Ken W Alger
Systems architect & technical product leader with roots in bare-metal engineering. I design modern local-first, data-sovereign AI platforms in Go/Python and scale elite core infrastructure teams.
  • Location
    Oregon
  • Joined

I think you've captured the distinction perfectly.

A training cutoff is ultimately a historical boundary. It's frustrating, but at least it's stable. If we rerun the same question tomorrow, the model's training corpus hasn't changed.

Permission states are different. They're operational.

A publisher can change a robots.txt file. A licensing agreement can expire. An API can become unavailable. A paywall can appear. A retrieval index can be updated. The evidence boundary can shift without the underlying question changing at all.

That's what fascinates me about the idea of Information Borders. We tend to think of knowledge limitations as properties of the model. Increasingly, some of the most important limitations may be properties of the environment surrounding the model.

And I agree that those four conditions matter because they imply different remedies. "The information doesn't exist" is a very different situation from "the information exists, but the system wasn't permitted to access it." Today those often collapse into the same answer: silence.

The more I think about it, the more it feels like we're moving from a world where knowledge was primarily constrained by memory to one where knowledge is increasingly constrained by access. That's a subtle shift, but a consequential one.

View full discussion (11 comments)

Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment's permalink.

For further actions, you may consider blocking this person and/or reporting abuse

DEV Community

We're a place where coders share, stay up-to-date and grow their careers.

Log in Create account

AltStyle によって変換されたページ (->オリジナル) /