Skip to content

Comments

Remap heap graph ordinals on serialization#618

Open
ashkrisk wants to merge 1 commit intomainfrom
ohg-serde-ord-map
Open

Remap heap graph ordinals on serialization#618
ashkrisk wants to merge 1 commit intomainfrom
ohg-serde-ord-map

Conversation

@ashkrisk
Copy link
Contributor

There is currently no way to alter the ordinals of the vectors in an OnHeapGraphIndex. This limitation extends to the save and load methods used for serializing and de-serializing the heap graph.

In particular, this means that any ordinal "holes" created by deleting nodes from the OnHeapGraphIndex can never be compacted away. This is different from the behavior of the disk graph writer, which not only provides a way to remap ordinals, it automatically compacts ordinals by default.

This PR adds the option to remap ordinals while serializing the graph, preventing the accumulation of ordinal holes.

Note that this is only useful for those systems which rely on repeatedly loading, mutating, and then saving the OnHeapGraphIndex, which is arguably an anti-pattern.

Add the option to avoid ordinal holes in the serialized heap graph.
@github-actions
Copy link
Contributor

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@jshook
Copy link
Contributor

jshook commented Feb 17, 2026

Does this pose a risk for stable ordinal mapping with external system which retain their own ordinal maps? We need to identify how the through line from non-vector data remains connected to the associated vectors/embeddings in jvector. If this is already handled effectively here, we should make it very clear in documentation with examples.

@ashkrisk
Copy link
Contributor Author

Short answer: no, since the client application needs to opt-in to this by calling either saveWithCompactOrdinals or the save overload which takes an OrdinalMapper as input.

Longer answer:
OnDiskGraphIndex actually does this already. If you take a graph index with missing ordinals (due to deletions) and pass them to the GraphIndexWriter, the output ordinals will be automatically re-numbered to avoid any ordinal "holes" in the saved graph.

Say the application creates a graph with some deleted ordinals. Then it does something like this:

OnHeapGraphIndex graph = ...;
var writer = new GraphIndexWriter.Builder(graph, ...) (...) .build();  // no OrdinalMapper supplied
writer.write(...);
graph.save(...);

This saves two files. One corresponding to the OnDiskGraphIndex, and the other corresponding to the serialized OnHeapGraphIndex. Since the disk graph ordinals are automatically re-numbered, but the heap graph's ordinals are not, these two sets of ordinals are not consistent with each other. A bug related to this was recently fixed in Opensearch.

Additionally, since the heap graph serialization process does not provide an option to pass in a mapper, the application has no choice but to live with the ordinal mismatch and compensate for it by introducing additional code complexity.

This PR doesn't attempt to change the disk graph's existing philosophy of ordinal re-numbering during write, it simply adds the same functionality to the heap graph serialization. The key difference is that the user has to "opt-in" to ordinal compaction to prevent nasty surprises in existing code. While I agree that the whole ordinal mapping system could do with more thorough documentation, that's not within the scope of this PR and should be handled separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants