feat(filesys): implement hierarchy injection into vector db chunks#7548

Merged
evan-onyx merged 2 commits into main from feat/file-struct5
Jan 28, 2026
Conversation

@evan-onyx evan-onyx commented Jan 20, 2026

Description

injecting hierarchy info into vector db chunks

How Has This Been Tested?

not yet

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Inject hierarchy ancestors into vector DB chunks to improve relevance and enable lineage-aware filtering. Adds a Redis-backed hierarchy cache and wires ancestor resolution into docfetching and indexing.

  • New Features
    • Chunks now include ancestor_hierarchy_node_ids in DocMetadataAwareIndexChunk.
    • Redis hierarchy cache for node_id -> parent_id and raw_id -> node_id with 6h TTL and a distributed lock refresh.
    • Docfetching upserts hierarchy nodes and caches them in batches for fast lookups.
    • Indexing adapter resolves ancestors from parent_hierarchy_raw_node_id via cache, falling back to the SOURCE node on cache miss.
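
The ancestor resolution described above can be pictured as a walk up a node_id -> parent_id map. The following is a minimal sketch, not the PR's code: the function name `resolve_ancestors`, the plain dict standing in for the Redis cache, the choice to include the starting node in the result, and the reading of "fall back to the SOURCE node" as returning only that node's id are all assumptions.

```python
from typing import Mapping, Optional


def resolve_ancestors(
    start_node_id: Optional[int],
    parent_of: Mapping[int, Optional[int]],
    source_node_id: int,
) -> list[int]:
    """Return the start node followed by its ancestors, nearest first.

    `parent_of` is a stand-in for the cached node_id -> parent_id map.
    When the start node is unknown (a cache miss), fall back to the
    SOURCE node alone. That fallback shape is an assumption.
    """
    if start_node_id is None or start_node_id not in parent_of:
        return [source_node_id]

    ancestors: list[int] = []
    seen: set[int] = set()
    current: Optional[int] = start_node_id
    # Stop at the root (parent is None) or if a cycle would repeat a node.
    while current is not None and current not in seen:
        seen.add(current)
        ancestors.append(current)
        current = parent_of.get(current)
    return ancestors


# Node 1 is the root (SOURCE), 2 is its child, 3 is a child of 2.
parent_of = {1: None, 2: 1, 3: 2}
print(resolve_ancestors(3, parent_of, source_node_id=1))   # [3, 2, 1]
print(resolve_ancestors(99, parent_of, source_node_id=1))  # [1]
```

The `seen` set is only a guard against malformed data; a well-formed hierarchy never revisits a node.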

Written for commit 18d21ad. Summary will update on new commits.

@evan-onyx evan-onyx requested a review from a team as a code owner January 20, 2026 02:24

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 4 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/indexing/adapters/document_indexing_adapter.py">

<violation number="1" location="backend/onyx/indexing/adapters/document_indexing_adapter.py:156">
P2: Use a safe lookup for ancestor_hierarchy_node_ids to avoid KeyError when a chunk’s document id isn’t present in doc_id_to_ancestor_ids.</violation>
</file>

<file name="backend/onyx/redis/redis_hierarchy.py">

<violation number="1" location="backend/onyx/redis/redis_hierarchy.py:113">
P2: Guard the enum parse so malformed cache values don’t raise ValueError and violate the function’s contract of returning (parent_id, None) for invalid formats.</violation>
</file>
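
For the first issue (the ancestor lookup in document_indexing_adapter.py), a safe lookup could look like the sketch below. Only the name `doc_id_to_ancestor_ids` comes from the review; the helper `ancestors_for_chunk` and the empty-list default are assumptions, and the adapter may prefer a different default.

```python
def ancestors_for_chunk(
    doc_id: str, doc_id_to_ancestor_ids: dict[str, list[int]]
) -> list[int]:
    """Look up ancestor ids for a chunk's document without raising KeyError."""
    # dict.get with a default covers documents missing from the map;
    # list(...) returns a copy so callers cannot mutate the shared entry.
    return list(doc_id_to_ancestor_ids.get(doc_id, []))


doc_id_to_ancestor_ids = {"doc-a": [3, 2, 1]}
print(ancestors_for_chunk("doc-a", doc_id_to_ancestor_ids))  # [3, 2, 1]
print(ancestors_for_chunk("doc-b", doc_id_to_ancestor_ids))  # []
```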


greptile-apps bot commented Jan 20, 2026

Greptile Summary

  • Implements hierarchy injection into Vespa chunks by adding ancestor hierarchy node IDs to document chunks for improved relevance and lineage-aware filtering
  • Introduces a Redis-based hierarchy cache system with 6-hour TTL and distributed locking to optimize ancestor resolution performance during document indexing
  • Wires hierarchy caching into the document extraction phase and integrates ancestor ID resolution into the document indexing adapter with cache-first lookup strategy
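
To make the TTL and locking idea concrete, here is a rough sketch of a cache write and read. It is not the PR's implementation: the key layout, the function names, and the 30-second lock timeout are hypothetical, and `FakeRedis` is an in-memory stand-in that mimics only the `hset`, `expire`, `hget`, and `lock` calls used here so the example runs without a server. With a real redis-py client, `hget` returns bytes unless `decode_responses=True` is set.

```python
import time
from contextlib import contextmanager
from typing import Any, Iterator, Optional

HIERARCHY_TTL_SECONDS = 6 * 60 * 60  # 6 hours, as stated in the summary


class FakeRedis:
    """In-memory stand-in exposing only the calls used below."""

    def __init__(self) -> None:
        self._hashes: dict[str, dict[str, str]] = {}
        self._expires_at: dict[str, float] = {}

    def hset(self, name: str, mapping: dict[str, str]) -> int:
        self._hashes.setdefault(name, {}).update(mapping)
        return len(mapping)

    def expire(self, name: str, seconds: int) -> bool:
        self._expires_at[name] = time.monotonic() + seconds
        return name in self._hashes

    def hget(self, name: str, key: str) -> Optional[str]:
        # Drop the whole hash once its deadline has passed.
        if time.monotonic() >= self._expires_at.get(name, float("inf")):
            self._hashes.pop(name, None)
        return self._hashes.get(name, {}).get(key)

    @contextmanager
    def lock(self, name: str, timeout: int) -> Iterator[None]:
        yield  # single-process fake: there is nothing to contend with


def parents_key(source: str) -> str:
    return f"hierarchy:{source}:parents"  # hypothetical key layout


def cache_parent_links(
    client: Any, source: str, parent_of: dict[int, Optional[int]]
) -> None:
    """Write node_id -> parent_id entries and refresh the TTL under a lock."""
    if not parent_of:
        return  # redis-py rejects an hset call with an empty mapping
    # An empty string encodes "no parent" (a root node).
    payload = {str(n): "" if p is None else str(p) for n, p in parent_of.items()}
    key = parents_key(source)
    with client.lock(f"{key}:lock", timeout=30):
        client.hset(key, mapping=payload)
        client.expire(key, HIERARCHY_TTL_SECONDS)


def get_parent(client: Any, source: str, node_id: int) -> tuple[Optional[int], bool]:
    """Return (parent_id, found); found is False on a cache miss."""
    raw = client.hget(parents_key(source), str(node_id))
    if raw is None:
        return None, False
    return (int(raw) if raw else None), True


client = FakeRedis()
cache_parent_links(client, "drive", {1: None, 2: 1})
print(get_parent(client, "drive", 2))  # (1, True)
print(get_parent(client, "drive", 7))  # (None, False)
```

Returning a `found` flag alongside the value keeps "root node with no parent" distinct from "not in the cache", which matters when a miss should trigger a fallback.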

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/onyx/redis/redis_hierarchy.py | New Redis hierarchy cache system with ancestor resolution, distributed locking, and TTL-based expiration; needs Redis connection error handling |
| backend/onyx/indexing/adapters/document_indexing_adapter.py | Integrates ancestor hierarchy resolution into chunk metadata enrichment using Redis cache with database fallback |
| backend/onyx/background/indexing/run_docfetching.py | Adds hierarchy node caching to Redis during document extraction phase after PostgreSQL persistence |

Confidence score: 4/5

  • This PR introduces substantial new functionality with Redis caching and hierarchy resolution that appears well-architected but has some implementation concerns
  • Score reduced due to missing Redis connection error handling that could cause crashes, potential memory usage growth without bounds checking, and lack of testing mentioned in PR description
  • Pay close attention to backend/onyx/redis/redis_hierarchy.py for error handling improvements and backend/onyx/indexing/adapters/document_indexing_adapter.py for cache integration logic

Sequence Diagram

sequenceDiagram
    participant User
    participant "DocFetching Service" as DF
    participant "Connector Runner" as CR
    participant "Postgres DB" as PG
    participant "Redis Cache" as RC
    participant "Vespa Index" as VI
    participant "DocProcessing Task" as DP

    User->>DF: "Start document extraction"
    DF->>PG: "Get index attempt and connector config"
    DF->>CR: "Initialize connector with time window"
    
    loop "For each document batch"
        CR->>DF: "Return doc batch + hierarchy nodes"
        DF->>PG: "Upsert hierarchy nodes to database"
        DF->>RC: "Cache hierarchy nodes for fast lookup"
        DF->>DF: "Strip null characters from documents"
        DF->>DF: "Store document batch in file storage"
        DF->>DP: "Queue document processing task"
    end
    
    DP->>DP: "Load documents from batch storage"
    DP->>RC: "Get ancestor hierarchy IDs from parent raw ID"
    DP->>DP: "Build metadata-aware chunks with ancestors"
    DP->>VI: "Index chunks with hierarchy metadata"
    DP->>PG: "Mark documents as indexed"

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 4 comments

parts = value.split(":", 1)
parent_str = parts[0]
node_type_str = parts[1] if len(parts) > 1 else ""
parent_id = int(parent_str) if parent_str else None

logic: This could raise ValueError if parent_str contains non-numeric characters. Wrap the int() conversion in try/except.

Path: backend/onyx/redis/redis_hierarchy.py
Line: 111

@evan-onyx (author) replied:

parent_str being non-numeric is the kind of bug we would want to raise loudly
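
As an illustration of the "raise loudly" stance, the sketch below parses a cached `parent_id:node_type` value without swallowing errors. It mirrors the lines under review, but the function name `parse_cache_value` and the `HierarchyNodeType` members shown are hypothetical placeholders, not the real enum.

```python
from enum import Enum
from typing import Optional


class HierarchyNodeType(str, Enum):
    # Hypothetical members for illustration only.
    SOURCE = "source"
    FOLDER = "folder"


def parse_cache_value(
    value: str,
) -> tuple[Optional[int], Optional[HierarchyNodeType]]:
    """Parse "parent_id:node_type"; either half may be empty.

    Malformed input is not caught: int() and the enum constructor both
    raise ValueError, so a corrupted cache entry surfaces immediately.
    """
    parent_str, _, node_type_str = value.partition(":")
    parent_id = int(parent_str) if parent_str else None
    node_type = HierarchyNodeType(node_type_str) if node_type_str else None
    return parent_id, node_type


print(parse_cache_value("42:folder"))
print(parse_cache_value(":source"))

try:
    parse_cache_value("abc:folder")
except ValueError as exc:
    print(f"corrupted entry rejected: {exc}")
```

The trade-off is the one debated in this thread: a tolerant parser keeps indexing running on bad data, while the strict version turns a bad cache write into an immediate, traceable failure.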

node_type_str = parts[1] if len(parts) > 1 else ""
parent_id = int(parent_str) if parent_str else None

node_type = HierarchyNodeType(node_type_str) if node_type_str else None

logic: This could raise ValueError if node_type_str is not a valid enum value. Wrap the HierarchyNodeType() constructor in try/except.

Path: backend/onyx/redis/redis_hierarchy.py
Line: 113

@evan-onyx (author) replied:

same as above

else:
value_str = str(value)

return int(value_str), True

logic: This could raise ValueError if value_str contains non-numeric characters. Wrap the int() conversion in try/except.

Path: backend/onyx/redis/redis_hierarchy.py
Line: 230

@evan-onyx (author) replied:

if that happens we have bigger probs
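
For context on the `(int, bool)` return shape in this thread, here is a small sketch of decoding a cached integer. Only the `else` branch and the final `return` appear in the diff; the bytes branch, the `None` handling, and the function name `decode_cached_int` are assumptions about the surrounding code.

```python
from typing import Optional, Union


def decode_cached_int(value: Union[bytes, str, None]) -> tuple[Optional[int], bool]:
    """Return (node_id, found) for a raw cache value.

    A missing value is reported as (None, False). A non-numeric value is
    left to raise ValueError, matching the author's position that such
    data indicates a bug elsewhere.
    """
    if value is None:
        return None, False
    if isinstance(value, bytes):
        # Redis clients commonly return bytes unless configured to decode.
        value_str = value.decode("utf-8")
    else:
        value_str = str(value)
    return int(value_str), True


print(decode_cached_int(b"17"))  # (17, True)
print(decode_cached_int(None))   # (None, False)
```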

@evan-onyx evan-onyx changed the title implement hierarchy injection into vespa chunks implement hierarchy injection into vector db chunks Jan 27, 2026
@evan-onyx evan-onyx changed the title implement hierarchy injection into vector db chunks feat(filesys): implement hierarchy injection into vector db chunks Jan 27, 2026
@evan-onyx evan-onyx force-pushed the feat/file-struct5 branch 2 times, most recently from e2495a4 to 60a3353 Compare January 27, 2026 23:40
Base automatically changed from feat/file-struct4 to main January 28, 2026 04:11
switch from bitmap to int list

implement hierarchy injection into vespa chunks

nits
@evan-onyx evan-onyx added this pull request to the merge queue Jan 28, 2026
Merged via the queue into main with commit a2dc752 Jan 28, 2026
78 of 86 checks passed
@evan-onyx evan-onyx deleted the feat/file-struct5 branch January 28, 2026 05:35

2 participants