Skip to content

fix(opensearch): Use the same method for getting title that the title embedding logic uses; small cleanup for content embedding#7638

Merged
acaprau merged 1 commit intomainfrom
andrei/260121/0/opensearch/fix-title-insertion-and-create-func-for-content-embedding-enrichment
Jan 21, 2026
Merged

fix(opensearch): Use the same method for getting title that the title embedding logic uses; small cleanup for content embedding#7638
acaprau merged 1 commit intomainfrom
andrei/260121/0/opensearch/fix-title-insertion-and-create-func-for-content-embedding-enrichment

Conversation

@acaprau
Copy link
Copy Markdown
Contributor

@acaprau acaprau commented Jan 21, 2026

Description

_index_vespa_chunk uses get_title_for_document_index() to populate the title field in a document index entry, which also happens to be same data used to create a title embedding. The OpenSearch flow needs to use this too.

This PR also encapsulates the enrichment logic for creating a content embedding in a similar way that content text enrichment logic is encapsulated in backend/onyx/document_index/chunk_content_enrichment.py.

How Has This Been Tested?

@evan-onyx first wrote this title fix in https://github.com/onyx-dot-app/onyx/pull/7560/changes#diff-d09eebe79e128bf021e791c21a4ff9cf98d85d79e4da36b237885de09bbd5618 and it works.

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Use get_title_for_document_index in OpenSearch so the stored title matches the title used for embeddings and corrects fallback behavior. Also add a helper for content-embedding enrichment and rename the text-enrichment helper for clarity.

  • Bug Fixes

    • OpenSearch sets title via get_title_for_document_index(), matching embedder logic and falling back to semantic_identifier when title is None.
  • Refactors

    • Added generate_enriched_content_for_chunk_embedding() and used it in the embedder.
    • Renamed generate_enriched_content_for_chunk → generate_enriched_content_for_chunk_text and updated OpenSearch/Vespa call sites.

Written for commit f786e38. Summary will update on new commits.

@acaprau acaprau requested a review from a team as a code owner January 21, 2026 21:02
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Jan 21, 2026

Greptile Summary

Fixed OpenSearch to use get_title_for_document_index() for the title field, ensuring consistency with Vespa and the title embedding logic. This method properly falls back to semantic_identifier when title is None (but not empty string), preventing title/embedding mismatches.

  • Refactored content enrichment logic by extracting generate_enriched_content_for_chunk_embedding() function that encapsulates the embedding-specific enrichment (uses metadata_suffix_semantic)
  • Renamed existing function to generate_enriched_content_for_chunk_text() to clarify it's for text/BM25 indexing (uses metadata_suffix_keyword)
  • Updated all call sites in OpenSearch, Vespa, and embedder to use the appropriate enrichment functions

Confidence Score: 5/5

  • This PR is safe to merge with no issues found
  • The changes are straightforward refactoring and bug fixes that improve code consistency. The title fix addresses a real inconsistency between OpenSearch and Vespa/embedder. The content enrichment refactoring follows existing patterns and properly separates concerns between embedding and text indexing. All changes are well-contained with clear intent.
  • No files require special attention

Important Files Changed

Filename Overview
backend/onyx/document_index/chunk_content_enrichment.py Extracted content embedding enrichment logic into a dedicated function, matching the existing pattern for text enrichment
backend/onyx/document_index/opensearch/opensearch_document_index.py Fixed title field to use get_title_for_document_index() matching Vespa and embedder logic; renamed function to use new text enrichment function

Sequence Diagram

sequenceDiagram
    participant Chunk as DocMetadataAwareIndexChunk
    participant Embedder as DefaultIndexingEmbedder
    participant EnrichFunc as generate_enriched_content_for_chunk_embedding
    participant OpenSearch as opensearch_document_index
    participant TextEnrichFunc as generate_enriched_content_for_chunk_text
    participant Doc as Document

    Note over Embedder,EnrichFunc: Content Embedding Flow
    Embedder->>EnrichFunc: Call with DocAwareChunk
    EnrichFunc-->>Embedder: Return enriched content (with metadata_suffix_semantic)
    Embedder->>Doc: get_title_for_document_index()
    Doc-->>Embedder: Return title (or semantic_identifier if title is None)
    
    Note over OpenSearch,TextEnrichFunc: OpenSearch Indexing Flow
    OpenSearch->>Doc: get_title_for_document_index()
    Note right of OpenSearch: Fixed: Now uses same<br/>method as embedder
    Doc-->>OpenSearch: Return consistent title
    OpenSearch->>TextEnrichFunc: Call with DocMetadataAwareIndexChunk
    TextEnrichFunc-->>OpenSearch: Return enriched content (with metadata_suffix_keyword)
Loading

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 4 files

@acaprau acaprau enabled auto-merge January 21, 2026 21:06
@acaprau acaprau added this pull request to the merge queue Jan 21, 2026
Merged via the queue into main with commit 267042a Jan 21, 2026
78 checks passed
@acaprau acaprau deleted the andrei/260121/0/opensearch/fix-title-insertion-and-create-func-for-content-embedding-enrichment branch January 21, 2026 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants