feat(craft): local file connector #8304

Merged
rohoswagger merged 9 commits into main from file-folder-connector on Feb 12, 2026

Conversation

@rohoswagger rohoswagger commented Feb 10, 2026

feat: local file/folder connector :big_grin:

Description

a big one

How Has This Been Tested?

locally, craft-dev

Additional Options

  • [Required] I have considered whether this PR needs to be cherry-picked to the latest beta branch.
  • [Optional] Override Linear Check

@rohoswagger rohoswagger requested a review from a team as a code owner February 10, 2026 20:36
@rohoswagger rohoswagger changed the base branch from main to whuang/craft-file-sync-lock February 10, 2026 20:37
@rohoswagger rohoswagger changed the title from "file folder connector" to "feat: file/folder connector" Feb 10, 2026
@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 33 files

greptile-apps bot commented Feb 10, 2026

Greptile Overview

Greptile Summary

Implements a local file/folder connector that allows Craft users to upload raw binary files (Excel, PowerPoint, Word, CSV, etc.) to their sandbox environment for agent access, with per-file sync control and quota management.

Key changes:

  • Added new CRAFT_FILE document source with RAW_BINARY processing mode that stores files directly in S3/local storage without text extraction
  • Created comprehensive API endpoints for file upload (single/batch/zip), deletion, sync toggling, and directory management with 500MB per-file and 10GB total storage limits
  • Implemented filtered file syncing system that respects per-file sync_disabled flags by creating selective symlinks in sandbox workspaces
  • Built frontend modal with tree view, drag-and-drop upload, folder management, and per-file sync toggles using proper refresh-components
  • Integrated with existing sandbox provisioning workflow requiring sandbox recreation when user library changes
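
The per-file and total storage limits listed above could be enforced with a check along the following lines. This is a minimal sketch, not the PR's code: `check_upload_quota`, `QuotaExceededError`, and the constant names are hypothetical, and reading 500MB and 10GB as binary units is an assumption.

```python
# Limits taken from the summary; binary units are an assumption.
MAX_FILE_BYTES = 500 * 1024 * 1024
MAX_TOTAL_BYTES = 10 * 1024 * 1024 * 1024


class QuotaExceededError(Exception):
    """Hypothetical error raised when an upload would break a limit."""


def check_upload_quota(current_usage_bytes: int, incoming_sizes: list[int]) -> int:
    """Return the projected usage, or raise if any limit would be exceeded."""
    for size in incoming_sizes:
        if size > MAX_FILE_BYTES:
            raise QuotaExceededError(f"file of {size} bytes exceeds the per-file limit")
    projected = current_usage_bytes + sum(incoming_sizes)
    if projected > MAX_TOTAL_BYTES:
        raise QuotaExceededError(f"projected usage {projected} exceeds the total quota")
    return projected
```

Checking the whole batch before writing anything keeps a rejected upload from leaving partial files behind, at least for quota failures.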

Architecture decisions:

  • Files bypass the standard indexing pipeline (no chunking/embedding/Vespa) and are stored as raw binaries for direct Python library access (openpyxl, python-pptx, etc.)
  • Uses deterministic document IDs based on path hashing to support upsert semantics on re-upload
  • Sandbox file visibility controlled via filtered symlinks rather than physical file deletion to maintain data integrity
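
Deterministic IDs based on path hashing might look like the sketch below. The exact scheme used in the PR is not shown in this conversation, so the choice of SHA-256, the inclusion of a user ID, the path normalization, and the `CRAFT_FILE__` prefix are all assumptions made for illustration.

```python
import hashlib


def make_document_id(user_id: str, file_path: str) -> str:
    """Derive a stable ID from the owner and the normalized path (hypothetical scheme)."""
    # Normalize so "reports/q1.xlsx" and "/reports/q1.xlsx/" map to the same ID.
    normalized = "/" + file_path.strip("/")
    digest = hashlib.sha256(f"{user_id}:{normalized}".encode("utf-8")).hexdigest()
    return f"CRAFT_FILE__{digest[:32]}"
```

Because the ID depends only on the owner and the path, re-uploading a file to the same location produces the same ID, which is what makes an upsert possible.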

Known limitations documented in code:

  • Upload endpoints read entire file content into memory (up to 500MB) instead of streaming via multipart upload
  • Multi-file uploads are not atomic - partial failures leave some files persisted
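
The first limitation could be avoided by copying each upload in fixed-size chunks rather than reading it in one call. The following is a minimal sketch, assuming the upload is exposed as a binary file-like object; `copy_in_chunks`, `FileTooLargeError`, and the 1 MiB chunk size are hypothetical and do not come from the PR.

```python
import io
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for illustration


class FileTooLargeError(Exception):
    """Hypothetical error for uploads over the size limit."""


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, max_bytes: int) -> int:
    """Copy src to dst one chunk at a time; raise once max_bytes is exceeded."""
    total = 0
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            return total
        total += len(chunk)
        if total > max_bytes:
            raise FileTooLargeError(f"upload exceeds {max_bytes} bytes")
        dst.write(chunk)


# Usage with in-memory streams standing in for the upload and the storage target.
src = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 5))
dst = io.BytesIO()
written = copy_in_chunks(src, dst, max_bytes=10 * CHUNK_SIZE)
```

Peak memory stays near one chunk regardless of file size, and the size limit is enforced while reading instead of after the whole body has been buffered.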

Confidence Score: 4/5

  • Safe to merge with minor memory optimization consideration for production load
  • The implementation is well-architected with proper separation of concerns, comprehensive error handling, and extensive integration tests. The code follows established patterns and includes thoughtful TODOs documenting known limitations. Frontend adheres to all web standards. Confidence reduced from 5 to 4 due to the acknowledged memory issue where 500MB files are loaded entirely into memory during upload, which could cause issues under concurrent load in production.
  • backend/onyx/server/features/build/api/user_library.py - Consider implementing streaming multipart upload to S3 to avoid loading 500MB files into memory

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/onyx/server/features/build/api/user_library.py | New API endpoints for user file uploads to S3 with sync toggles and deletion - memory concerns noted in TODOs for upload handling |
| backend/onyx/server/features/build/db/user_library.py | DB operations for user library storage quota tracking and connector setup with orphan recovery logic |
| backend/onyx/db/document.py | Added utility functions for querying documents by source type and updating document metadata |
| backend/onyx/server/features/build/indexing/persistent_document_writer.py | Extended with raw binary file write/delete methods for both local and S3 storage backends |
| backend/onyx/server/features/build/sandbox/local/local_sandbox_manager.py | Implemented filtered file symlink system to exclude disabled files from sandbox sessions |
| backend/onyx/server/features/build/sandbox/kubernetes/kubernetes_sandbox_manager.py | Updated Kubernetes sandbox setup to support excluded user library paths via shell script filtering |
| web/src/app/craft/v1/configure/components/UserLibraryModal.tsx | New modal for file uploads with tree view, sync toggles, and folder management - uses proper refresh-components |
| web/src/app/craft/v1/configure/page.tsx | Integrated user library modal and change tracking with reprovision workflow - fixed icon import |

Sequence Diagram

sequenceDiagram
    participant User
    participant Frontend as UserLibraryModal
    participant API as user_library.py
    participant DB as user_library DB
    participant Storage as S3/LocalStorage
    participant Sandbox as SandboxManager
    
    User->>Frontend: Upload files
    Frontend->>API: POST /upload (files, path)
    API->>DB: Check storage quota
    API->>DB: Get/create CRAFT_FILE connector
    loop For each file
        API->>Storage: Write raw binary file
        API->>DB: Upsert document metadata
    end
    API->>DB: Update connector status
    API->>Sandbox: Trigger file sync (Celery task)
    API-->>Frontend: Upload response
    
    User->>Frontend: Toggle file sync
    Frontend->>API: PATCH /files/{id}/toggle
    API->>DB: Update doc_metadata (sync_disabled)
    API-->>Frontend: Success
    
    User->>Frontend: Delete file
    Frontend->>API: DELETE /files/{id}
    API->>Storage: Delete raw file
    API->>DB: Delete document record
    API->>Sandbox: Trigger file sync
    API-->>Frontend: Success
    
    Note over Sandbox: Sync task filters disabled files<br/>and creates symlinks in workspace

@rohoswagger rohoswagger changed the title from "feat: file/folder connector" to "feat(craft): local file connector" Feb 10, 2026
@rohoswagger rohoswagger force-pushed the file-folder-connector branch from 51856e5 to 0fe294a Compare February 10, 2026 20:57
@rohoswagger rohoswagger force-pushed the file-folder-connector branch from 0fe294a to bd7da75 Compare February 10, 2026 21:33
@rohoswagger
Contributor Author

@greptile

@rohoswagger
Contributor Author

@cubic-dev

cubic-dev-ai bot commented Feb 11, 2026

@cubic-dev

@rohoswagger I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 36 files

github-actions bot commented Feb 11, 2026

Preview Deployment

| Status | Preview | Commit | Updated |
| --- | --- | --- | --- |
|  | https://onyx-preview-7qz42hgrn-danswer.vercel.app | 41650ee | 2026-02-12 19:43:00 UTC |

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 5 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/background/indexing/run_docfetching.py">

<violation number="1" location="backend/onyx/background/indexing/run_docfetching.py:684">
P2: RAW_BINARY batches now fall into the regular docprocessing path because the explicit guard was removed. If a RAW_BINARY connector yields a batch, it will be processed and queued despite being documented as bypassing the connector indexing flow. Add an explicit RAW_BINARY guard (as before) to skip those batches.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
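
The explicit guard the bot asks for could take roughly this shape. The surrounding code in run_docfetching.py is not shown in this conversation, so this is only a sketch: `ProcessingMode`, its `REGULAR` member, and `select_batches_for_docprocessing` are hypothetical stand-ins, and only the `RAW_BINARY` name comes from the PR.

```python
from enum import Enum


class ProcessingMode(str, Enum):
    # Hypothetical stand-in for the PR's processing-mode enum.
    REGULAR = "REGULAR"
    RAW_BINARY = "RAW_BINARY"


def select_batches_for_docprocessing(
    mode: ProcessingMode, batches: list[list[str]]
) -> list[list[str]]:
    """Return the batches that should be queued for docprocessing."""
    if mode == ProcessingMode.RAW_BINARY:
        # Explicit guard: raw binary files bypass the indexing pipeline,
        # so nothing from such a connector is queued.
        return []
    return list(batches)
```

Keeping the guard explicit means a RAW_BINARY connector that unexpectedly yields a batch is skipped rather than chunked and embedded.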

file_name = file_path.split("/")[-1]

# Guess content type
content_type, _ = mimetypes.guess_type(file_name)
Contributor

We have a util for this, I'm pretty sure.

Contributor Author

Didn't find one; we just use mimetypes.guess_type().
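
If a shared helper were added for this, it might look like the sketch below. `guess_content_type` and the `application/octet-stream` fallback are assumptions for illustration; the only call taken from the snippet under review is `mimetypes.guess_type()`.

```python
import mimetypes
import posixpath

DEFAULT_CONTENT_TYPE = "application/octet-stream"  # assumed fallback for unknown types


def guess_content_type(file_path: str) -> str:
    """Guess a MIME type from the file name, with a generic binary fallback."""
    file_name = posixpath.basename(file_path)
    # guess_type returns (type, encoding); type is None when the extension is unknown.
    content_type, _encoding = mimetypes.guess_type(file_name)
    return content_type or DEFAULT_CONTENT_TYPE
```

The fallback matters because `mimetypes.guess_type()` returns `None` for files without a recognized extension.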

enabled: bool = Query(...),
user: User = Depends(current_user),
db_session: Session = Depends(get_session),
) -> dict[str, Any]:
Contributor

Please return a Pydantic BaseModel here instead of dict[str, Any].

)
heredoc_delim = f"_EXCL_{uuid4().hex[:12]}_"
files_symlink_setup = f"""
# Create filtered files directory with exclusions
Contributor

Ideally this should be a separate Python (or really any high-level language) script that lives in our repo, gets copied onto the pod, and is invoked with the right arguments.
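
A standalone script of the kind suggested here might be structured as follows. This is a sketch under stated assumptions, not the PR's implementation: `build_filtered_view`, the `--source`/`--dest`/`--exclude` flags, and the convention that exclusions are paths relative to the source directory are all hypothetical.

```python
import argparse
import os
from pathlib import Path


def build_filtered_view(source: Path, dest: Path, excluded: set[str]) -> list[str]:
    """Symlink every file under source into dest, skipping excluded relative paths."""
    linked: list[str] = []
    for root, _dirs, files in os.walk(source):
        for name in files:
            src_file = Path(root) / name
            rel = src_file.relative_to(source).as_posix()
            if rel in excluded:
                continue  # sync-disabled file: leave it out of the sandbox view
            target = dest / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            if target.is_symlink() or target.exists():
                target.unlink()  # replace a stale link from a previous run
            target.symlink_to(src_file.resolve())
            linked.append(rel)
    return sorted(linked)


def main(argv: list[str]) -> None:
    """Hypothetical CLI entry point, e.g. main(sys.argv[1:]) on the pod."""
    parser = argparse.ArgumentParser(description="Build a filtered symlink view")
    parser.add_argument("--source", type=Path, required=True)
    parser.add_argument("--dest", type=Path, required=True)
    parser.add_argument("--exclude", action="append", default=[])
    args = parser.parse_args(argv)
    build_filtered_view(args.source, args.dest, set(args.exclude))
```

Passing exclusions as arguments avoids embedding file paths in a generated shell heredoc, and the linking logic can be unit tested without a pod.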

@rohoswagger
Contributor Author

@greptile

@rohoswagger
Contributor Author

@cubic-dev

cubic-dev-ai bot commented Feb 12, 2026

@cubic-dev

@rohoswagger I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 32 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="web/src/app/craft/v1/configure/components/ConfigureOverlays.tsx">

<violation number="1" location="web/src/app/craft/v1/configure/components/ConfigureOverlays.tsx:55">
P2: When isUpdating is true, this still renders a clickable action button with no handler, so users can focus/click a non-functional control. Consider hiding or disabling the action while updating instead of leaving an enabled button with no onAction.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
