Conversation
There was a problem hiding this comment.
1 issue found across 4 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/onyx/document_index/opensearch/schema.py">
<violation number="1" location="backend/onyx/document_index/opensearch/schema.py:38">
P0: This rename breaks an existing import. `opensearch_document_index.py` imports `PROJECT_IDS_FIELD_NAME` which no longer exists after this rename. This will cause an `ImportError` at runtime. The import and usage in `opensearch_document_index.py` should be updated to use `USER_PROJECTS_FIELD_NAME`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| SOURCE_LINKS_FIELD_NAME = "source_links" | ||
| DOCUMENT_SETS_FIELD_NAME = "document_sets" | ||
| PROJECT_IDS_FIELD_NAME = "project_ids" | ||
| USER_PROJECTS_FIELD_NAME = "user_projects" |
There was a problem hiding this comment.
P0: This rename breaks an existing import. opensearch_document_index.py imports PROJECT_IDS_FIELD_NAME which no longer exists after this rename. This will cause an ImportError at runtime. The import and usage in opensearch_document_index.py should be updated to use USER_PROJECTS_FIELD_NAME.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/onyx/document_index/opensearch/schema.py, line 38:
<comment>This rename breaks an existing import. `opensearch_document_index.py` imports `PROJECT_IDS_FIELD_NAME` which no longer exists after this rename. This will cause an `ImportError` at runtime. The import and usage in `opensearch_document_index.py` should be updated to use `USER_PROJECTS_FIELD_NAME`.</comment>
<file context>
@@ -35,14 +35,15 @@
SOURCE_LINKS_FIELD_NAME = "source_links"
DOCUMENT_SETS_FIELD_NAME = "document_sets"
-PROJECT_IDS_FIELD_NAME = "project_ids"
+USER_PROJECTS_FIELD_NAME = "user_projects"
DOCUMENT_ID_FIELD_NAME = "document_id"
CHUNK_INDEX_FIELD_NAME = "chunk_index"
</file context>
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
Greptile SummaryThis PR refactors OpenSearch metadata storage to use a flattened list format (
Critical Issue: The import and usage of Confidence Score: 1/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Chunk as DocMetadataAwareIndexChunk
participant Convert as Indexer
participant OS as OpenSearch
participant Retrieve as Retriever
participant InfChunk as InferenceChunkUncleaned
Note over Chunk,InfChunk: Indexing Flow
Chunk->>Convert: chunk with metadata dict
Convert->>Convert: get_metadata_str_attributes
Note right of Convert: Converts dict to list
Convert->>OS: Store metadata_list, metadata_suffix
Note over Chunk,InfChunk: Retrieval Flow
OS->>Retrieve: Return chunk with metadata_list
Retrieve->>Retrieve: convert_metadata_list_of_strings_to_dict
Note right of Retrieve: Reconstructs dict from list
Retrieve->>InfChunk: Return with metadata dict
|
There was a problem hiding this comment.
Additional Comments (2)
-
backend/onyx/document_index/opensearch/opensearch_document_index.py, line 47 (link)syntax:
PROJECT_IDS_FIELD_NAMEno longer exists inschema.py(renamed toUSER_PROJECTS_FIELD_NAME). This will cause anImportErrorat runtime. -
backend/onyx/document_index/opensearch/opensearch_document_index.py, line 562 (link)syntax: After fixing the import, this usage also needs to be updated to use the new constant name.
4 files reviewed, 2 comments
| # Small optimization, if this list is empty we can supply None to | ||
| # OpenSearch and it will not store any data at all for this field, which | ||
| # is different from supplying an empty list. | ||
| user_projects=chunk.user_project if chunk.user_project else None, |
There was a problem hiding this comment.
chunk.user_project or None
| SOURCE_LINKS_FIELD_NAME = "source_links" | ||
| DOCUMENT_SETS_FIELD_NAME = "document_sets" | ||
| PROJECT_IDS_FIELD_NAME = "project_ids" | ||
| USER_PROJECTS_FIELD_NAME = "user_projects" |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
…match what we store in Vespa (#7448)
…match what we store in Vespa (#7448)
Description
We now store metadata list in OpenSearch, this will be used to filter on metadata fields but also used to reconstruct the metadata dict. This is more space efficient than storing both the dict and list, which is what we do in Vespa.
Also cleaned up the source links dict we return on retrieval to have keys which are ints not strs.
How Has This Been Tested?
I trust CI.
Additional Options
Summary by cubic
Store metadata in OpenSearch as a flattened list (metadata_list) and add helpers to rebuild the dict at read time. Updates schema, indexing, and retrieval to reduce storage, align with Vespa filtering, and fix a source_links type issue.
Refactors
Migration
Written for commit 10341d2. Summary will update on new commits.