Skip to content

feat: support segmented inverted index build and search#6305

Open
Xuanwo wants to merge 10 commits intomainfrom
feat/fts-segment-pr1
Open

feat: support segmented inverted index build and search#6305
Xuanwo wants to merge 10 commits intomainfrom
feat/fts-segment-pr1

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Mar 26, 2026

This PR teaches inverted/FTS indices to participate in the segment-based build workflow and to search across multiple committed segments with a shared BM25 scorer. It keeps the current on-disk inverted format intact while aligning FTS with the newer execute_uncommitted() -> create_index_segment_builder() -> commit_existing_index_segments() path.

This is the first vertical slice for segmented inverted indices: build, commit, and query now work end-to-end, and the follow-up work can focus on compaction and metadata acceleration instead of basic control-plane wiring.

@github-actions github-actions bot added the enhancement New feature or request label Mar 26, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review: feat: support segmented inverted index build and search

Overall this is a solid first slice for segmented FTS. The BM25 cross-segment scoring approach is correct (global corpus stats passed to per-segment search, top-k merge via min-heap). A few items worth addressing:

P1: Duplicated scorer-merge logic (3x copy-paste)

The pattern of merging MemBM25Scorer across segments is repeated identically in MatchQueryExec, PhraseQueryExec, and FlatMatchQueryExec (fts.rs):

let mut base_scorer = first_index.bm25_base_scorer(&tokens);
for index in indices.iter().skip(1) {
    let segment_scorer = index.bm25_base_scorer(&tokens);
    base_scorer.total_tokens += segment_scorer.total_tokens;
    base_scorer.num_docs += segment_scorer.num_docs;
    for (token, count) in segment_scorer.token_docs {
        *base_scorer.token_docs.entry(token).or_insert(0) += count;
    }
}

The same top-k heap merge is also duplicated between MatchQueryExec and PhraseQueryExec. This makes a future bug fix easy to miss in one of the three locations. Consider extracting these into shared helpers (e.g. merge_bm25_scorers(&indices, &tokens) and a helper for the heap-merge-across-segments pattern).

P1: MemBM25Scorer fields exposed as pub

The merging code directly mutates base_scorer.total_tokens, base_scorer.num_docs, and base_scorer.token_docs. If these fields were already public that's fine, but if this PR made them pub, consider adding a merge(&mut self, other: &MemBM25Scorer) method instead — it keeps the invariant (consistent stats) in one place and avoids leaking internal representation.

Minor observations (non-blocking)

  • _details loaded and discarded: In MatchQueryExec and PhraseQueryExec, let _details = load_fts_segment_details(...) performs I/O across all segments purely for validation. This adds per-query latency. Consider deferring this validation to index build time or making it optional at query time.

  • Sequential segment search: Each segment is searched in a for loop. With many segments this could become a bottleneck. Understood this is a first slice — just flagging for the follow-up.

  • Nit: let mut tokenizer = tokenizer; in flat_bm25_search_stream is a no-op rebinding (the parameter is already owned).

Tests

Good coverage: segmented match query, phrase query, mixed indexed + unindexed fragments, and the legacy-index tokenizer fallback test. The test_index_segment_builder_fts_commits_multi_segment_logical_index in create.rs validates the build pipeline end-to-end.

@Xuanwo Xuanwo marked this pull request as draft March 26, 2026 14:22
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 26, 2026

@github-actions github-actions bot added the java label Mar 26, 2026
@Xuanwo Xuanwo marked this pull request as ready for review March 27, 2026 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant