
fix: Metadata file for larger zips #8327

Merged
yuhongsun96 merged 2 commits into main from reintro-metadata-file on Feb 11, 2026

Conversation

@yuhongsun96

@yuhongsun96 yuhongsun96 commented Feb 11, 2026

Description

We load the metadata file into a dict in memory and then save it as a jsonl in Postgres. If the dict is large, it does not get saved, and the metadata overrides for the file connector are not applied. Instead of saving the entire dict, we can keep just a pointer to it in Postgres and load it when the FileConnector actually runs.
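
A minimal sketch of the pointer idea, not the actual Onyx code. `InMemoryFileStore`, `build_connector_config`, and the `file_paths` key are hypothetical stand-ins; only `zip_metadata_file_id` comes from this PR. It illustrates that the stored config stays small no matter how large the metadata is.

```python
import json
import uuid
from typing import Any


class InMemoryFileStore:
    """Hypothetical stand-in for the real file store."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def save_file(self, content: bytes) -> str:
        file_id = str(uuid.uuid4())
        self._files[file_id] = content
        return file_id

    def read_file(self, file_id: str) -> bytes:
        return self._files[file_id]


def build_connector_config(
    file_store: InMemoryFileStore,
    file_paths: list[str],
    metadata: list[dict[str, Any]],
) -> dict[str, Any]:
    """Write the metadata blob to the file store and keep only its ID."""
    blob = json.dumps(metadata).encode("utf-8")
    metadata_file_id = file_store.save_file(blob)
    return {"file_paths": file_paths, "zip_metadata_file_id": metadata_file_id}


store = InMemoryFileStore()
big_metadata = [
    {"filename": f"doc_{i}.txt", "link": f"https://example.com/{i}"}
    for i in range(10_000)
]
config = build_connector_config(store, ["doc_0.txt"], big_metadata)
```

The config that would go into the database holds one short ID, while the full metadata remains retrievable through `store.read_file(config["zip_metadata_file_id"])`.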

How Has This Been Tested?

Tried uploading a zip with a metadata file that was not working previously; it now works!

Additional Options

  • [Required] I have considered whether this PR needs to be cherry-picked to the latest beta branch.
  • [Optional] Override Linear Check

Summary by cubic

Store ZIP metadata (.onyx_metadata.json) in the file store and reference it by file_id, loading it at run time. This fixes failures with large metadata, keeps overrides reliable for big ZIP uploads, and allows optional document_id overrides via the id field.

  • Bug Fixes

    • Save metadata JSON to the file store (CONNECTOR_METADATA) and keep only its file_id in connector config.
    • Load metadata lazily via zip_metadata_file_id; fall back to the deprecated inline dict for backward compatibility.
    • Merge metadata on file updates and drop entries for removed files.
  • Migration

    • Connector config changed: zip_metadata -> zip_metadata_file_id.
    • Existing connectors still work; re-upload ZIPs to use the new file store format.
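
The lazy load with a backward-compatible fallback could look roughly like the sketch below. `load_zip_metadata` and the `read_file` callable are hypothetical names; the list-of-entries layout with a `filename` key is assumed from the review comments on this PR, not confirmed against the code.

```python
import json
from typing import Any, Callable


def load_zip_metadata(
    config: dict[str, Any],
    read_file: Callable[[str], bytes],
) -> dict[str, dict[str, Any]]:
    """Return metadata keyed by filename, preferring the file-store pointer."""
    file_id = config.get("zip_metadata_file_id")
    if file_id:
        raw = json.loads(read_file(file_id).decode("utf-8"))
        # Assumed layout: a list of entries, each carrying a "filename" key.
        if isinstance(raw, list):
            return {entry["filename"]: entry for entry in raw}
        return raw
    # Deprecated path: older connectors stored the dict inline in the config.
    return config.get("zip_metadata") or {}


# Usage: one connector on the new format, one on the old inline format.
files = {"meta-1": json.dumps([{"filename": "a.txt", "title": "A"}]).encode("utf-8")}
new_style = load_zip_metadata({"zip_metadata_file_id": "meta-1"}, files.__getitem__)
old_style = load_zip_metadata({"zip_metadata": {"b.txt": {"title": "B"}}}, files.__getitem__)
```

Because the file is only read inside this function, nothing large passes through the connector config at creation time.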

Written for commit e6f91c1. Summary will update on new commits.

@yuhongsun96 yuhongsun96 requested a review from a team as a code owner February 11, 2026 03:58
@greptile-apps

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR refactors how metadata files are stored for file connectors. Previously, the entire .onyx_metadata.json dictionary was stored inline in the connector's PostgreSQL config, which failed for large ZIP files due to size limitations. Now, the metadata is saved to the file store and only the file ID pointer is stored in the database.

Key Changes:

  • Metadata files are now stored in the file store with FileOrigin.CONNECTOR_METADATA and referenced by ID
  • FileUploadResponse changed from zip_metadata: dict to zip_metadata_file_id: str | None
  • LocalFileConnector loads metadata from file store at runtime instead of receiving it inline
  • Added backwards compatibility for connectors using the old inline zip_metadata dict format
  • Added document_id field to OnyxMetadata to support custom document ID overrides from metadata
  • Updated all tests and frontend code to use the new file ID approach
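
As an illustration of the document ID override, here is a hedged sketch. `MetadataEntry` is a trimmed-down, hypothetical stand-in for `OnyxMetadata`, and the `FILE_CONNECTOR__` default prefix is an assumption made for the example. Reading the override from the `id` field follows the cubic summary above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MetadataEntry:
    """Hypothetical, reduced stand-in for OnyxMetadata."""

    document_id: Optional[str] = None
    title: Optional[str] = None


def entry_from_raw(raw: dict[str, Any]) -> MetadataEntry:
    # Per the summary above, the override is supplied through the "id" field.
    return MetadataEntry(document_id=raw.get("id"), title=raw.get("title"))


def resolve_document_id(file_id: str, entry: Optional[MetadataEntry]) -> str:
    """Use the metadata-provided ID when present, else a derived default."""
    if entry is not None and entry.document_id:
        return entry.document_id
    # Assumed default scheme, for illustration only.
    return f"FILE_CONNECTOR__{file_id}"


overridden = resolve_document_id("abc", entry_from_raw({"id": "custom-1"}))
defaulted = resolve_document_id("abc", entry_from_raw({"title": "No id here"}))
```

Treating an empty string like a missing value keeps a blank `id` in the metadata file from producing an empty document ID.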

Issues Found:

  • Storage leak in update_connector_files(): when merging metadata and creating a new file, the old metadata file is not deleted from the file store

Confidence Score: 3/5

  • This PR has one logic issue that causes a storage leak but is otherwise safe to merge
  • Score reflects a storage leak bug in the metadata update logic where old metadata files are not cleaned up. The core refactoring is sound with good backwards compatibility, comprehensive test updates, and the change successfully addresses the original problem of large metadata dictionaries. However, the leak should be fixed to prevent unbounded storage growth over time.
  • Pay close attention to backend/onyx/server/documents/connector.py - the update_connector_files() function needs cleanup logic for old metadata files
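
One possible shape for the missing cleanup, as a sketch only: `InMemoryFileStore` and `replace_metadata_file` are hypothetical, and the real file store interface may differ. The old file is deleted only after the merged file has been saved, so a failed save never leaves the connector without metadata.

```python
import json
import uuid
from typing import Any, Optional


class InMemoryFileStore:
    """Hypothetical stand-in for the real file store."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}

    def save_file(self, content: bytes) -> str:
        file_id = str(uuid.uuid4())
        self.files[file_id] = content
        return file_id

    def delete_file(self, file_id: str) -> None:
        del self.files[file_id]


def replace_metadata_file(
    file_store: InMemoryFileStore,
    old_file_id: Optional[str],
    merged: dict[str, Any],
) -> str:
    """Save merged metadata, then remove the file it supersedes."""
    new_file_id = file_store.save_file(json.dumps(merged).encode("utf-8"))
    if old_file_id and old_file_id != new_file_id:
        try:
            file_store.delete_file(old_file_id)
        except KeyError:
            # Old file already gone; nothing left to clean up.
            pass
    return new_file_id


store = InMemoryFileStore()
old_id = store.save_file(b"{}")
new_id = replace_metadata_file(store, old_id, {"a.txt": {"title": "A"}})
```

After the call, the store holds exactly one metadata file, so repeated updates do not accumulate orphans.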

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/onyx/server/documents/connector.py | Refactored to store metadata in file store instead of inline dict; potential storage leak from not cleaning up old metadata files when updating |
| backend/onyx/connectors/file/connector.py | Added metadata file loading from file store with backwards compatibility; allows document_id override from metadata |
| backend/onyx/server/documents/models.py | Changed FileUploadResponse to use zip_metadata_file_id instead of inline dict |
| backend/onyx/connectors/models.py | Added document_id field to OnyxMetadata for custom document ID override |

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as API/Connector Endpoints
    participant FileStore as File Store
    participant DB as PostgreSQL
    participant Connector as LocalFileConnector

    Note over Client,Connector: File Upload Flow
    Client->>API: Upload ZIP with .onyx_metadata.json
    API->>API: Extract metadata from ZIP
    API->>FileStore: Save metadata file (get file_id)
    FileStore-->>API: Return metadata_file_id
    API->>FileStore: Save uploaded files
    FileStore-->>API: Return file_ids
    API->>DB: Store connector config with zip_metadata_file_id
    API-->>Client: Return file_paths and zip_metadata_file_id

    Note over Client,Connector: Indexing Flow
    Connector->>DB: Read connector config
    DB-->>Connector: Return config with zip_metadata_file_id
    Connector->>FileStore: Load metadata using file_id
    FileStore-->>Connector: Return metadata JSON
    Connector->>Connector: Parse metadata dict
    loop For each file
        Connector->>FileStore: Read file content
        FileStore-->>Connector: Return file data
        Connector->>Connector: Apply metadata overrides
        Connector->>Connector: Process and index file
    end

    Note over Client,Connector: Update Connector Flow
    Client->>API: Update files (add/remove)
    API->>FileStore: Load current metadata (if exists)
    API->>API: Merge old and new metadata
    API->>FileStore: Save merged metadata as new file
    Note over FileStore: Old metadata file NOT deleted (leak)
    API->>DB: Update config with new zip_metadata_file_id
    API-->>Client: Return updated config
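
The merge step in the update flow above can be sketched as follows. `merge_zip_metadata` is a hypothetical helper, and letting newly uploaded entries win over existing ones for the same filename is an assumption rather than confirmed behavior.

```python
from typing import Any

Metadata = dict[str, dict[str, Any]]


def merge_zip_metadata(
    current: Metadata,
    incoming: Metadata,
    remaining_files: set[str],
) -> Metadata:
    """Merge old and new metadata, dropping entries for removed files."""
    # Assumption: entries from the new upload override older ones.
    merged = {**current, **incoming}
    return {
        name: entry for name, entry in merged.items() if name in remaining_files
    }


current = {"a.txt": {"title": "old A"}, "b.txt": {"title": "B"}}
incoming = {"a.txt": {"title": "new A"}, "c.txt": {"title": "C"}}
# b.txt was removed in this update; a.txt and c.txt remain.
merged = merge_zip_metadata(current, incoming, {"a.txt", "c.txt"})
```

Filtering against the set of remaining files is what keeps stale overrides from lingering after a file is removed from the connector.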


@greptile-apps greptile-apps bot left a comment


16 files reviewed, 1 comment



@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 16 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/server/documents/connector.py">

<violation number="1" location="backend/onyx/server/documents/connector.py:703">
P2: Overly broad `except Exception` silently drops metadata on any error. If loading fails (e.g., a list item missing `"filename"` key, or an unexpected file store issue), metadata is silently lost and the user won't know their metadata overrides weren't applied. Consider narrowing to specific exceptions (`json.JSONDecodeError`, `KeyError`, `FileNotFoundError`) and/or raising an `HTTPException` for unexpected failures so users are notified.

(Based on your team's feedback about narrowing exception scope rather than catching Exception.) [FEEDBACK_USED]</violation>

<violation number="2" location="backend/onyx/server/documents/connector.py:772">
P2: Orphaned metadata files in file store: when merged metadata is saved as a new file, the old metadata files (`current_zip_metadata_file_id` and `new_zip_metadata_file_id`) are never deleted. Each update cycle leaks previously-stored metadata files. Consider deleting the old files after the new merged file is successfully saved (e.g., `file_store.delete_file(current_zip_metadata_file_id)`).</violation>
</file>
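
For the first violation, narrowing the handler might look like the sketch below. `MetadataLoadError` and `load_metadata_strict` are hypothetical names; in an API endpoint the error could be translated into an HTTP error response so the user learns that overrides were not applied. Only the three exception types named in the comment are caught; anything else propagates.

```python
import json
from typing import Any, Callable


class MetadataLoadError(Exception):
    """Hypothetical error type that an API layer could map to an HTTP error."""


def load_metadata_strict(
    file_id: str,
    read_file: Callable[[str], bytes],
) -> dict[str, dict[str, Any]]:
    """Load metadata, surfacing expected failures instead of hiding them."""
    try:
        raw = json.loads(read_file(file_id).decode("utf-8"))
        # Assumed layout: a list of entries, each with a "filename" key.
        return {entry["filename"]: entry for entry in raw}
    except (json.JSONDecodeError, KeyError, FileNotFoundError) as e:
        # Expected failure modes: report them rather than dropping metadata.
        raise MetadataLoadError(
            f"Could not load metadata file {file_id}: {e!r}"
        ) from e
```

Compared with a bare `except Exception`, this keeps malformed JSON, a missing `filename` key, and a missing file visible to the caller, while leaving genuinely unexpected errors unmasked.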



@evan-onyx evan-onyx left a comment


okay, left nits

@yuhongsun96 yuhongsun96 added this pull request to the merge queue Feb 11, 2026
Merged via the queue into main with commit 90dc6b1 Feb 11, 2026
123 of 125 checks passed
@yuhongsun96 yuhongsun96 deleted the reintro-metadata-file branch February 11, 2026 22:23
