feat: support segmented inverted index build and search by Xuanwo · Pull Request #6305 · lance-format/lance

Xuanwo · 2026-03-26T14:06:33Z

This PR teaches inverted/FTS indices to participate in the segment-based build workflow and to search across multiple committed segments with a shared BM25 scorer. It keeps the current on-disk inverted format intact while aligning FTS with the newer execute_uncommitted() -> create_index_segment_builder() -> commit_existing_index_segments() path.

This is the first vertical slice for segmented inverted indices: build, commit, and query now work end-to-end, and the follow-up work can focus on compaction and metadata acceleration instead of basic control-plane wiring.

github-actions · 2026-03-26T14:12:13Z

PR Review: feat: support segmented inverted index build and search

Overall this is a solid first slice for segmented FTS. The BM25 cross-segment scoring approach is correct (global corpus stats passed to per-segment search, top-k merge via min-heap). A few items worth addressing:

P1: Duplicated scorer-merge logic (3x copy-paste)

The pattern of merging MemBM25Scorer across segments is repeated identically in MatchQueryExec, PhraseQueryExec, and FlatMatchQueryExec (fts.rs):

let mut base_scorer = first_index.bm25_base_scorer(&tokens);
for index in indices.iter().skip(1) {
    let segment_scorer = index.bm25_base_scorer(&tokens);
    base_scorer.total_tokens += segment_scorer.total_tokens;
    base_scorer.num_docs += segment_scorer.num_docs;
    for (token, count) in segment_scorer.token_docs {
        *base_scorer.token_docs.entry(token).or_insert(0) += count;
    }
}

The same top-k heap merge is also duplicated between MatchQueryExec and PhraseQueryExec. This makes a future bug fix easy to miss in one of the three locations. Consider extracting these into shared helpers (e.g. merge_bm25_scorers(&indices, &tokens) and a helper for the heap-merge-across-segments pattern).
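As a rough illustration of the heap-merge helper suggested above, the sketch below keeps a bounded min-heap of size k while draining per-segment results. The `Hit` type and the `merge_top_k` name are hypothetical and are not taken from the Lance codebase; the real executors work on their own result types and streams.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

/// Hypothetical search hit (not a Lance type): a BM25 score plus a row id.
#[derive(Debug, Clone, Copy)]
pub struct Hit {
    pub score: f32,
    pub row_id: u64,
}

// Total ordering by score, with row id as a deterministic tie-breaker.
impl Ord for Hit {
    fn cmp(&self, other: &Self) -> Ordering {
        self.score
            .total_cmp(&other.score)
            .then_with(|| self.row_id.cmp(&other.row_id))
    }
}

impl PartialOrd for Hit {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Hit {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Hit {}

/// Merge per-segment hits into a global top-k, best score first.
/// Assumes every segment was scored with the same global corpus stats,
/// so scores are directly comparable across segments.
pub fn merge_top_k<I>(segments: I, k: usize) -> Vec<Hit>
where
    I: IntoIterator<Item = Vec<Hit>>,
{
    if k == 0 {
        return Vec::new();
    }
    // `Reverse` turns the max-heap into a min-heap, so the weakest hit is on top.
    let mut heap: BinaryHeap<Reverse<Hit>> = BinaryHeap::with_capacity(k + 1);
    for hits in segments {
        for hit in hits {
            heap.push(Reverse(hit));
            if heap.len() > k {
                heap.pop(); // evict the current lowest-scoring hit
            }
        }
    }
    let mut out: Vec<Hit> = heap.into_iter().map(|Reverse(hit)| hit).collect();
    out.sort_by(|a, b| b.cmp(a)); // descending by score
    out
}
```

With one helper of this shape, MatchQueryExec and PhraseQueryExec could share a single merge path, so a later fix would only need to land in one place.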

P1: `MemBM25Scorer` fields exposed as `pub`

The merging code directly mutates base_scorer.total_tokens, base_scorer.num_docs, and base_scorer.token_docs. If these fields were already public that's fine, but if this PR made them pub, consider adding a merge(&mut self, other: &MemBM25Scorer) method instead — it keeps the invariant (consistent stats) in one place and avoids leaking internal representation.
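A minimal sketch of what such a merge method could look like follows. `ScorerStats` is a hypothetical stand-in for MemBM25Scorer: the field names come from the snippet quoted earlier in this review, but the field types, the constructor, and the accessors are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the scorer discussed above.
/// Field names follow the review; the types are assumed.
#[derive(Debug, Default, Clone)]
pub struct ScorerStats {
    total_tokens: u64,
    num_docs: usize,
    token_docs: HashMap<String, usize>,
}

impl ScorerStats {
    pub fn new(total_tokens: u64, num_docs: usize, token_docs: HashMap<String, usize>) -> Self {
        Self { total_tokens, num_docs, token_docs }
    }

    /// Fold another segment's corpus statistics into this one.
    /// Keeping the update here means the fields can stay private.
    pub fn merge(&mut self, other: &ScorerStats) {
        self.total_tokens += other.total_tokens;
        self.num_docs += other.num_docs;
        for (token, count) in &other.token_docs {
            *self.token_docs.entry(token.clone()).or_insert(0) += *count;
        }
    }

    /// Number of documents covered by the merged statistics.
    pub fn num_docs(&self) -> usize {
        self.num_docs
    }

    /// Average document length over all merged segments (0.0 when empty).
    pub fn avg_doc_length(&self) -> f32 {
        if self.num_docs == 0 {
            0.0
        } else {
            self.total_tokens as f32 / self.num_docs as f32
        }
    }

    /// Number of documents containing `token` across all merged segments.
    pub fn doc_freq(&self, token: &str) -> usize {
        self.token_docs.get(token).copied().unwrap_or(0)
    }
}

/// Build global statistics from any number of per-segment statistics.
pub fn merge_all<'a, I>(segments: I) -> ScorerStats
where
    I: IntoIterator<Item = &'a ScorerStats>,
{
    let mut merged = ScorerStats::default();
    for stats in segments {
        merged.merge(stats);
    }
    merged
}
```

The three query executors would then call one function along the lines of `merge_all` instead of repeating the loop, and the scorer's fields would not need to be `pub`.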

Minor observations (non-blocking)

_details loaded and discarded: In MatchQueryExec and PhraseQueryExec, let _details = load_fts_segment_details(...) performs I/O across all segments purely for validation. This adds per-query latency. Consider deferring this validation to index build time or making it optional at query time.
Sequential segment search: Each segment is searched in a for loop. With many segments this could become a bottleneck. Understood this is a first slice — just flagging for the follow-up.
Nit: let mut tokenizer = tokenizer; in flat_bm25_search_stream is a no-op rebinding (the parameter is already owned).

Tests

Good coverage: segmented match query, phrase query, mixed indexed + unindexed fragments, and the legacy-index tokenizer fallback test. The test_index_segment_builder_fts_commits_multi_segment_logical_index in create.rs validates the build pipeline end-to-end.

codecov · 2026-03-26T15:46:32Z

Codecov Report

❌ Patch coverage is 83.06011% with 124 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/io/exec/fts.rs	79.32%	24 Missing and 31 partials ⚠️
rust/lance/src/index/scalar/inverted.rs	65.18%	37 Missing and 10 partials ⚠️
rust/lance/src/index/create.rs	93.06%	19 Missing ⚠️
rust/lance-index/src/scalar/inverted/index.rs	95.23%	0 Missing and 2 partials ⚠️
rust/lance/src/dataset/scanner.rs	50.00%	0 Missing and 1 partial ⚠️


Xuanwo added 2 commits March 26, 2026 22:05

feat: support segmented inverted index build and search

a6d4a36

test: cover segmented inverted index workflows

1efcfa9

github-actions bot added the enhancement New feature or request label Mar 26, 2026

Xuanwo marked this pull request as draft March 26, 2026 14:22

Xuanwo added 4 commits March 26, 2026 22:35

fix: address segmented inverted index review issues

b71d333

fix: update inverted bench for shared bm25 scorer

f69eb66

fix: satisfy clippy visibility checks

ea80a5b

fix: restore vector segment builder compatibility

28e963b

fix: require explicit index type for segment builder

f9a7c71

github-actions bot added the python label Mar 26, 2026

fix: require explicit segment types in java builder

6d242df

github-actions bot added the java label Mar 26, 2026

Xuanwo added 2 commits March 27, 2026 00:23

fix: skip segment commit path for vector extensions

5fd5794

fix: decode inverted index details from payload

88dab0e

Xuanwo marked this pull request as ready for review March 27, 2026 08:15

Xuanwo mentioned this pull request Mar 27, 2026

Tracking: Distributed Indexes Search #6309

Open

23 tasks

Xuanwo wants to merge 10 commits into main from feat/fts-segment-pr1
