feat(opensearch): Fix some stuff around metadata to improve code and match what we store in Vespa #7448

Merged
acaprau merged 3 commits into main from
andrei/260115/1/opensearch/some-metadata-stuff
Jan 16, 2026

@acaprau acaprau commented Jan 16, 2026

Description

We now store a metadata list in OpenSearch. It will be used to filter on metadata fields and also to reconstruct the metadata dict. This is more space efficient than storing both the dict and the list, which is what we do in Vespa.
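
To make the round trip concrete, here is a minimal sketch of flattening a metadata dict into a list of strings and rebuilding it. The function names and the `===` separator are hypothetical and chosen only for illustration; the actual helpers and separator in the codebase may differ.

```python
from __future__ import annotations

# Assumption: the separator between key and value. The real one is not shown in this PR.
SEPARATOR = "==="


def metadata_dict_to_list(metadata: dict[str, str | list[str]]) -> list[str]:
    """Flatten a metadata dict into "key<sep>value" strings, one per value."""
    flattened: list[str] = []
    for key, value in metadata.items():
        values = value if isinstance(value, list) else [value]
        flattened.extend(f"{key}{SEPARATOR}{v}" for v in values)
    return flattened


def metadata_list_to_dict(metadata_list: list[str]) -> dict[str, str | list[str]]:
    """Rebuild the dict; a key that appears more than once becomes a list."""
    rebuilt: dict[str, str | list[str]] = {}
    for entry in metadata_list:
        # Split on the first separator only, so values may contain it.
        key, _, value = entry.partition(SEPARATOR)
        if key not in rebuilt:
            rebuilt[key] = value
            continue
        existing = rebuilt[key]
        if isinstance(existing, list):
            existing.append(value)
        else:
            rebuilt[key] = [existing, value]
    return rebuilt
```

In this sketch a single-element list comes back as a plain string, and a key that itself contains the separator would be split incorrectly, so a real implementation has to decide how to handle both cases.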

Also cleaned up the source links dict we return on retrieval to have keys which are ints not strs.
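
As an illustration of why the keys can end up as strs, the sketch below assumes the source links dict is serialized as JSON when stored (an assumption, not confirmed by this PR). JSON object keys are always strings, so int keys need to be cast back on read. The function name is hypothetical.

```python
import json


def parse_source_links(raw: str) -> dict[int, str]:
    """Deserialize a JSON object and cast its keys back to ints."""
    return {int(key): link for key, link in json.loads(raw).items()}


# Int keys are silently turned into strings by json.dumps.
stored = json.dumps({0: "https://example.com/a", 120: "https://example.com/b"})
source_links = parse_source_links(stored)
```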

How Has This Been Tested?

I trust CI.

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Store metadata in OpenSearch as a flattened list (metadata_list) and add helpers to rebuild the dict at read time. Updates schema, indexing, and retrieval to reduce storage, align with Vespa filtering, and fix a source_links type issue.

  • Refactors

    • Replace metadata dict storage with metadata_list (key-value strings) and add two converters.
    • Update schema: metadata → metadata_list, add metadata_suffix, project_ids → user_projects.
    • Indexer now writes metadata_list and metadata_suffix; skips empty arrays by sending None.
    • Retrieval converts metadata_list back to dict and casts source_links keys to ints.
  • Migration

    • Recreate or reindex the OpenSearch index due to mapping changes.
    • Field changes: metadata → metadata_list, project_ids → user_projects, add metadata_suffix.
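
One way filtering on the flattened list might look is sketched below as an OpenSearch query body built in Python. It assumes metadata_list is mapped as a keyword field and that entries use a `===` separator; both are assumptions, and the helper name is hypothetical.

```python
# Assumption: the separator between key and value in each metadata_list entry.
SEPARATOR = "==="


def build_metadata_filter(required: dict[str, str]) -> dict:
    """Build a bool query that requires every given key/value pair to be present."""
    return {
        "bool": {
            "filter": [
                # An exact-match term query against the flattened entry.
                {"term": {"metadata_list": f"{key}{SEPARATOR}{value}"}}
                for key, value in required.items()
            ]
        }
    }


query_body = {"query": build_metadata_filter({"author": "alice", "team": "search"})}
```

Because each entry is a single string, one exact-match clause per pair is enough, which is part of what makes the list form convenient for filtering.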

Written for commit 10341d2. Summary will update on new commits.

@acaprau acaprau requested a review from a team as a code owner January 16, 2026 00:59

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 4 files

 SOURCE_LINKS_FIELD_NAME = "source_links"
 DOCUMENT_SETS_FIELD_NAME = "document_sets"
-PROJECT_IDS_FIELD_NAME = "project_ids"
+USER_PROJECTS_FIELD_NAME = "user_projects"

P0: This rename breaks an existing import. opensearch_document_index.py imports PROJECT_IDS_FIELD_NAME which no longer exists after this rename. This will cause an ImportError at runtime. The import and usage in opensearch_document_index.py should be updated to use USER_PROJECTS_FIELD_NAME.
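
The failure mode described here can be reproduced in isolation. The sketch below uses a throwaway in-memory module named `schema_sketch` (hypothetical, standing in for schema.py) that defines only the new constant, then attempts to import the old name.

```python
import sys
import types

# Stand-in for schema.py after the rename: only the new constant exists.
schema = types.ModuleType("schema_sketch")
schema.USER_PROJECTS_FIELD_NAME = "user_projects"
sys.modules["schema_sketch"] = schema

try:
    # The stale import that the review comment flags.
    from schema_sketch import PROJECT_IDS_FIELD_NAME
except ImportError as exc:
    error_message = str(exc)
else:
    error_message = None

# The updated import succeeds.
from schema_sketch import USER_PROJECTS_FIELD_NAME
```

The error is raised when the importing module is first loaded, not when the constant is first used, so any code path that imports that module would be affected.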

This comment was marked as off-topic.

greptile-apps bot commented Jan 16, 2026

Greptile Summary

This PR refactors OpenSearch metadata storage to use a flattened list format (metadata_list) instead of a JSON-stringified dict. The list matches the format used for metadata filtering in Vespa, and storing only the list is more space efficient than storing both the dict and the list. Key changes:

  • New utility functions in models.py for converting between metadata dict and list-of-strings formats
  • Schema changes: metadata → metadata_list, project_ids → user_projects, added metadata_suffix field
  • Retrieval improvements: source_links keys now properly cast to integers, metadata_suffix correctly read from storage for content cleanup

Critical Issue: The import and usage of PROJECT_IDS_FIELD_NAME in opensearch_document_index.py was not updated to match the schema rename to USER_PROJECTS_FIELD_NAME, which will cause an ImportError at runtime.

Confidence Score: 1/5

  • This PR contains an import error that will crash the application at startup when OpenSearch is used.
  • The PR has a critical bug where PROJECT_IDS_FIELD_NAME is imported but no longer exists in schema.py after being renamed to USER_PROJECTS_FIELD_NAME. This will cause an ImportError at runtime, preventing the OpenSearch module from loading.
  • backend/onyx/document_index/opensearch/opensearch_document_index.py requires immediate attention - line 47 imports a non-existent constant and line 562 uses it.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/onyx/connectors/models.py | Added two utility functions for metadata conversion between dict and list formats. Clean, well-documented implementation with proper edge case handling. |
| backend/onyx/context/search/models.py | Minor documentation-only change: Added TODO comment about metadata dict schema improvement. |
| backend/onyx/document_index/opensearch/schema.py | Schema updates: renamed metadata → metadata_list, project_ids → user_projects, added metadata_suffix field. Breaking schema change requiring index recreation. |
| backend/onyx/document_index/opensearch/opensearch_document_index.py | Critical bug: imports PROJECT_IDS_FIELD_NAME which no longer exists in schema.py (renamed to USER_PROJECTS_FIELD_NAME). Will cause ImportError at runtime. |

Sequence Diagram

sequenceDiagram
    participant Chunk as DocMetadataAwareIndexChunk
    participant Convert as Indexer
    participant OS as OpenSearch
    participant Retrieve as Retriever
    participant InfChunk as InferenceChunkUncleaned

    Note over Chunk,InfChunk: Indexing Flow
    Chunk->>Convert: chunk with metadata dict
    Convert->>Convert: get_metadata_str_attributes
    Note right of Convert: Converts dict to list
    Convert->>OS: Store metadata_list, metadata_suffix

    Note over Chunk,InfChunk: Retrieval Flow
    OS->>Retrieve: Return chunk with metadata_list
    Retrieve->>Retrieve: convert_metadata_list_of_strings_to_dict
    Note right of Retrieve: Reconstructs dict from list
    Retrieve->>InfChunk: Return with metadata dict

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. backend/onyx/document_index/opensearch/opensearch_document_index.py, line 47

    syntax: PROJECT_IDS_FIELD_NAME no longer exists in schema.py (renamed to USER_PROJECTS_FIELD_NAME). This will cause an ImportError at runtime.

  2. backend/onyx/document_index/opensearch/opensearch_document_index.py, line 562

    syntax: After fixing the import, this usage also needs to be updated to use the new constant name.

4 files reviewed, 2 comments

# Small optimization, if this list is empty we can supply None to
# OpenSearch and it will not store any data at all for this field, which
# is different from supplying an empty list.
user_projects=chunk.user_project if chunk.user_project else None,

chunk.user_project or None
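
The suggested form behaves the same as the conditional expression in the snippet under review. A small sketch, with hypothetical function names, comparing the two:

```python
from typing import Optional


def normalize_conditional(values: Optional[list]) -> Optional[list]:
    """The form used in the PR: spell out the truthiness check."""
    return values if values else None


def normalize_or(values: Optional[list]) -> Optional[list]:
    """The reviewer's suggestion: `or` returns the right operand when the left is falsy."""
    return values or None
```

Both return None for an empty list or for None, and return the list unchanged otherwise.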

@acaprau acaprau enabled auto-merge January 16, 2026 03:14
@acaprau acaprau added this pull request to the merge queue Jan 16, 2026
Merged via the queue into main with commit e0a9723 Jan 16, 2026
75 checks passed
@acaprau acaprau deleted the andrei/260115/1/opensearch/some-metadata-stuff branch January 16, 2026 03:51
rohoswagger pushed a commit that referenced this pull request Jan 19, 2026
jessicasingh7 pushed a commit that referenced this pull request Jan 21, 2026