The New Information Borders

DEV Community

Software architect focused on practical systems: code that has to survive beyond the demo. If I mention nxusKit or nxus.SYSTEMS in an answer, I disclose the affiliation (Principal Architect there).

Joined

Jun 11, 2026

• Jun 30

I think "known blind spots" nails it. Then the answer can say "here is what I found" while the evidence packet says "here is the shape of what the system could not see." That would make cross-system comparisons much more honest.

kenwalger profile image

Ken W Alger

Systems architect & technical product leader with roots in bare-metal engineering. I design modern local-first, data-sovereign AI platforms in Go/Python and scale elite core infrastructure teams.

Location

Oregon
Joined

Apr 7, 2017

• Jun 30

Exactly. Today most answer systems treat citations as a record of inclusion: here are the sources that contributed to this answer.

What's missing is the complementary record of exclusion.

Not necessarily an exhaustive inventory of everything the system didn't see, but at least some indication of the boundaries it was operating within. Was the corpus limited? Were certain publishers unavailable? Were there licensing restrictions? Were portions of the web excluded from retrieval?

The phrase "known blind spots" resonates with me because it shifts the conversation from certainty to context. An answer doesn't become less useful because it acknowledges its boundaries. In many cases, it becomes more trustworthy.

And I agree that it would make cross-system comparisons much more meaningful. Instead of asking only "Which answer is better?" we could also ask "Which evidence horizon was each system operating within?" Those are related questions, but they're not the same question.

vinimabreu profile image

Vinicius Pereira

AI and data engineer. RAG, agents, automations, and scraping pipelines, all built with tests and CI. I write about the small reliable pieces I reuse.

Pronouns

he/him
Work

AI and data engineer, freelance
Joined

Jun 30, 2026

• Jun 30

The line that'll stick with me is "a model cannot mourn the data it was never allowed to read." From the building side that's the whole problem in one sentence: confidence has almost no correlation with completeness. A RAG system hands you a clean, sure-sounding answer off three documents exactly the way it would off three hundred, and nothing in the output tells you which one you got.

What you name well is that this is sliding from a reasoning problem to an access problem. Two deployments of the same model can now disagree purely on what each was allowed to retrieve, which quietly breaks something builders lean on: a fixed question is supposed to have a stable answer. Once retrieval depends on licensing and robots.txt, "correct" becomes a function of what was reachable that day.

The only honest response I've found is to make the system show its evidence boundary instead of hiding it: cite what it used, and abstain or flag when the evidence is thin rather than smoothing over the gap. It's the museum instinct you describe, applied to a pipeline. Treat the missing horizon as a field in the output, not an absence the user never learns about.

Really good framing on the training-vs-retrieval split. I hadn't thought about how invisibly retrieval fragmentation would erode reproducibility.

kenwalger profile image

Ken W Alger

Systems architect & technical product leader with roots in bare-metal engineering. I design modern local-first, data-sovereign AI platforms in Go/Python and scale elite core infrastructure teams.

Location

Oregon
Joined

Apr 7, 2017

• Jun 30

I think the reproducibility point is the one that keeps growing in importance the more I think about it.

Historically, if two people queried the same search engine, database, or archive, we largely assumed they were operating against the same body of information. Differences in outcomes were usually attributed to interpretation, expertise, or reasoning.

Increasingly, that assumption may no longer hold.

As you point out, two deployments of the same model can now produce different answers simply because their evidence boundaries differ. Same question. Same model. Different retrieval horizon.

That's a subtle shift, but it has significant implications for trust, reproducibility, and even debugging. When two systems disagree, are we looking at a reasoning difference or an evidence difference? Without visibility into the boundary, it's difficult to tell.

I also like your framing of applying the museum instinct to the pipeline. Archivists, historians, and curators have long understood that provenance isn't just about what is present. It's also about documenting what is missing, unknown, restricted, or unavailable. The absence often tells part of the story.

The more I look at retrieval systems, the more I suspect that evidence boundaries may eventually become as important as citations themselves. Knowing what contributed to an answer is valuable. Knowing the shape of what could not contribute may be equally valuable.

vinimabreu profile image

Vinicius Pereira

AI and data engineer. RAG, agents, automations, and scraping pipelines, all built with tests and CI. I write about the small reliable pieces I reuse.

Pronouns

he/him
Work

AI and data engineer, freelance
Joined

Jun 30, 2026

• Jul 1

Yeah, the reproducibility angle is the sleeper here, and I think it only gets teeth if you make the evidence boundary a real artifact instead of a description. Concretely, attach a retrieval manifest to every answer: not just the docs it cited, but the candidates it saw and excluded, each with a reason code (out of license, below the rank cutoff, stale, blocked by robots). The second you have that, your "reasoning difference or evidence difference?" question stops being a judgment call. When two deployments disagree you diff the manifests first, if the evidence sets differ it's an access problem, and if they're identical but the answers still diverge then it's genuinely reasoning or nondeterminism. Right now people debug that backwards, staring at the outputs, because the boundary was never captured in the first place.

And that's the same move that turns "the shape of what could not contribute" from a nice phrase into a field you can actually query. The exclusions with their reasons are the negative space, logged. Citations tell you the answer's support, the exclusion log tells you its blind spots, and imo you need both to really trust the thing. Enjoyed this exchange a lot.

kenielzep97 profile image

Self-Correcting Systems

I run a small multi-agent system and write about the part nobody tells you: memory that keeps your corrections, not your flattery. Truth over comfort.

Email

keniel13@icloud.com
Education

Self-taught. No degree. Learned by building, failing, and correcting.
Pronouns

He
Joined

May 23, 2026

• Jun 30

The line that stopped me was "a model cannot mourn the data it was never allowed to read." That’s the whole problem in one sentence. A partial-evidence answer and a full-evidence answer show up wearing the same confidence, and the user never sees the missing horizon you’re pointing at. I’ve been circling the same thing from the agent side absence as a first-class provenance category, not a footnote. A system that can’t state what it wasn’t allowed to see is grading its own paper. Most retrieval setups log what they found and stay silent on what they were blocked from, which is the moment "confident" and "complete" quietly stop meaning the same thing. The retrieval-now / training-later split is the sharp part. Watching that one.

kenwalger profile image

Ken W Alger

Systems architect & technical product leader with roots in bare-metal engineering. I design modern local-first, data-sovereign AI platforms in Go/Python and scale elite core infrastructure teams.

Location

Oregon
Joined

Apr 7, 2017

• Jun 30

I think you've put your finger on the part that keeps pulling at me as well.

Most systems are very good at explaining presence. Here are the sources I found. Here are the documents I cited. Here are the passages that influenced the answer.

They're far less capable of explaining absence.

Was a source unavailable because it didn't exist? Because it wasn't indexed? Because it was behind a paywall? Because a crawler was blocked? Because a licensing agreement excluded it? Those are very different conditions, yet they often collapse into the same user experience: silence.

I particularly like your observation that a system unable to articulate its own boundaries is effectively grading its own paper. Confidence becomes a much weaker signal when the evidence horizon is unknown.

The retrieval-now / training-later distinction is fascinating to me for the same reason. Training data limitations eventually become historical artifacts. Retrieval limitations are happening in real time. Information Borders can shift overnight because access policies, agreements, and restrictions change overnight.

In that sense, the answer isn't just a reflection of what the model knows. It's increasingly a reflection of what the model was permitted to know at the moment the question was asked.

That feels like a very different problem than the one we've spent the last few years discussing.

kenielzep97 profile image

Self-Correcting Systems

I run a small multi-agent system and write about the part nobody tells you: memory that keeps your corrections, not your flattery. Truth over comfort.

Email

keniel13@icloud.com
Education

Self-taught. No degree. Learned by building, failing, and correcting.
Pronouns

He
Joined

May 23, 2026

• Jun 30

"Good at explaining presence, bad at explaining absence" is cleaner than how I had it, and I’m stealing it. Mine was "data deficits hidden behind a curtain" yours actually names the four different conditions that get flattened into one silence: doesn’t exist, wasn’t indexed, paywalled, blocked. Those are different failures with different fixes, and right now they all just look like "I don’t know." The line that’s going to sit with me is "what the model was permitted to know at the moment the question was asked." That’s not a knowledge limitation, that’s a permission state and permission states change without anyone re-running the question. Same prompt, different hour, different border. That’s a harder problem than training cutoffs ever were, because training cutoffs at least hold still.

kenwalger profile image

Ken W Alger

Systems architect & technical product leader with roots in bare-metal engineering. I design modern local-first, data-sovereign AI platforms in Go/Python and scale elite core infrastructure teams.

Location

Oregon
Joined

Apr 7, 2017

• Jun 30

I think you've captured the distinction perfectly.

A training cutoff is ultimately a historical boundary. It's frustrating, but at least it's stable. If we rerun the same question tomorrow, the model's training corpus hasn't changed.

Permission states are different. They're operational.

A publisher can change a robots.txt file. A licensing agreement can expire. An API can become unavailable. A paywall can appear. A retrieval index can be updated. The evidence boundary can shift without the underlying question changing at all.

That's what fascinates me about the idea of Information Borders. We tend to think of knowledge limitations as properties of the model. Increasingly, some of the most important limitations may be properties of the environment surrounding the model.

And I agree that those four conditions matter because they imply different remedies. "The information doesn't exist" is a very different situation from "the information exists, but the system wasn't permitted to access it." Today those often collapse into the same answer: silence.

The more I think about it, the more it feels like we're moving from a world where knowledge was primarily constrained by memory to one where knowledge is increasingly constrained by access. That's a subtle shift, but a consequential one.

View full discussion (11 comments)