Ensure that no node duplicates exist in the adjacency list of any node. #522
marianotepper wants to merge 6 commits into main
Conversation
@tlwillke left a comment
LGTM. Would like to see the results of the perf benchmark GHA.
Really good catch. I left some minor nits inline. I agree with Ted regarding measuring the perf, but I'm woefully behind the times on the work you've done there, so I'll hold off on approving since I don't know the standards.
stale comment describing old strategy -- can clean up on commit to make this clearer
stale comment describing old strategy -- can clean up on commit
I think this could be marginally cleaner -- at i = 0, we check nodes[insertionPoint] at each conditional, and the loop bounds are unnecessarily pessimistic (i.e., if the insertion point is right in the middle, we'll waste a bunch of loop iterations in both directions, when we could just go to the max radius around the insertion point to hit one end of the array). No idea if this will matter in practice but might be worth measuring.
e.g., something like

```java
int n = this.size;
int[] a = this.nodes;
int left = insertionPoint - 1;
int right = insertionPoint;
// Exact hit fast path
if (right < n && a[right] == newNode) return true;
// Expand outward
while (left >= 0 || right < n) {
    if (left >= 0 && a[left] == newNode) return true;
    if (right < n && a[right] == newNode) return true;
    left--;
    right++;
}
return false;
```
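A note on the design choice behind this sketch (an interpretation, not something stated explicitly in the thread): since the list is ordered by score, a duplicate whose recorded score is close to the new one lands near the score-derived insertion point, so the outward scan tends to find it within a few steps. When the two scores diverge, as with PQ (see the description below), a hit is found later, and a miss always costs a full scan of the array.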
Initial benchmarking shows that index construction got a bit slower (~10%), which is not surprising. Even if this PR moves things in the right direction conceptually, in practice it does not offer a net benefit. I think that we need a better solution and not just a patch. I will convert the PR to a draft until that better solution is in place.
The most recent commits overhaul the strategy used to ensure that edges in the graph are unique.
Now, each adjacency list is sorted by node ID in ascending order.
NodeArray has a method `public NodesIterator getIteratorSortedByScores()` that is used when pruning each adjacency list, so that nodes are explored in descending order of score. This adds an extra sorting operation to the pruning step. However, pruning is much less frequent than insertion into the adjacency lists (because of backlinks), so we should not see a detrimental effect on performance.
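For concreteness, here is a minimal sketch of what such a structure could look like, assuming an ascending-ordinal layout with a score-ordered view for pruning. The class, fields, and methods below are illustrative, not JVector's actual NodeArray API:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative sketch only: names and layout are assumptions, not
// JVector's actual NodeArray implementation.
final class SortedNodeArray {
    private final int[] nodes;    // node ordinals, ascending order
    private final float[] scores; // scores[i] belongs to nodes[i]
    private int size;

    SortedNodeArray(int capacity) {
        nodes = new int[capacity];
        scores = new float[capacity];
    }

    /** Inserts (node, score); returns false if the ordinal is already present.
     *  Capacity checks elided for brevity. */
    boolean insert(int node, float score) {
        int idx = Arrays.binarySearch(nodes, 0, size, node);
        if (idx >= 0) {
            return false; // duplicate ordinal, regardless of its score
        }
        int ip = -(idx + 1); // insertion point keeping ascending-ordinal order
        System.arraycopy(nodes, ip, nodes, ip + 1, size - ip);
        System.arraycopy(scores, ip, scores, ip + 1, size - ip);
        nodes[ip] = node;
        scores[ip] = score;
        size++;
        return true;
    }

    /** Indices of the entries in descending-score order, used for pruning. */
    int[] orderedByScoreDesc() {
        return IntStream.range(0, size)
                .boxed()
                .sorted((i, j) -> Float.compare(scores[j], scores[i]))
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```

With this layout, the duplicate check at insertion time is a single binary search, and the score ordering is only materialized when pruning, which matches the frequency argument above.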
...st the expected size down by 4 in GraphIndexBuilderTest.testEstimatedBytes
Most nodes in a graph have MANY duplicated entries in their adjacency list. This is undesirable as it reduces the effective degree of the graph.
This was occurring because scores were the main vehicle for checking whether a node was inserted twice into an adjacency list. However, when we build a graph with PQ we use a mix of `sim(x1, quant(x2))` and `sim(quant(x1), quant(x2))` similarities. Because quantization is lossy, `sim(x1, quant(x2)) != sim(quant(x1), quant(x2))`. Thus, we can have two different scores associated with a given node, depending on which quantized similarity we use.

This PR changes the way we compute duplicates in NodeArray, to prevent the emergence of these duplicates. Now, we only use the node ordinals for these checks.
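A hedged sketch of the failure mode, under one plausible reading of the old behavior (all names below are hypothetical, not the actual NodeArray code): the score locates where to look for a duplicate, so an earlier copy of the same node stored under a different quantized score can escape detection, while a check driven purely by node ordinals cannot.

```java
// Illustrative sketch of the failure mode; class, fields, and methods are
// hypothetical, not JVector's actual NodeArray.
final class ScoreSortedNeighbors {
    final int[] nodes;    // neighbor ordinals, ordered by descending score
    final float[] scores; // scores[i] is the recorded score of nodes[i]

    ScoreSortedNeighbors(int[] nodes, float[] scores) {
        this.nodes = nodes;
        this.scores = scores;
    }

    // Score-driven check (one plausible reading of the old behavior):
    // locate where newScore would insert and compare ordinals only there.
    // With PQ, the earlier copy of newNode may have been stored under a
    // different score, so it can sit far from ip and escape detection.
    boolean duplicateNearScore(int newNode, float newScore) {
        int ip = 0;
        while (ip < nodes.length && scores[ip] > newScore) {
            ip++;
        }
        return (ip < nodes.length && nodes[ip] == newNode)
            || (ip > 0 && nodes[ip - 1] == newNode);
    }

    // Ordinal-driven check (this PR): scan by node id, ignoring scores.
    boolean duplicateByOrdinal(int newNode) {
        for (int node : nodes) {
            if (node == newNode) {
                return true;
            }
        }
        return false;
    }
}
```

The ordinal scan is O(degree) per insertion; that cost is what motivates the sorted-by-ordinal layout described in the comments above.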
One potential future improvement is to order every adjacency list by node ordinal, so that duplicate checks are faster. There is a delicate tradeoff between accelerating these checks and decelerating the diversity computation, and it needs to be analyzed carefully.