feat(filesys): implement hierarchy injection into vector db chunks #7548
2 issues found across 4 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/onyx/indexing/adapters/document_indexing_adapter.py">
<violation number="1" location="backend/onyx/indexing/adapters/document_indexing_adapter.py:156">
P2: Use a safe lookup for ancestor_hierarchy_node_ids to avoid KeyError when a chunk’s document id isn’t present in doc_id_to_ancestor_ids.</violation>
</file>
<file name="backend/onyx/redis/redis_hierarchy.py">
<violation number="1" location="backend/onyx/redis/redis_hierarchy.py:113">
P2: Guard the enum parse so malformed cache values don’t raise ValueError and violate the function’s contract of returning (parent_id, None) for invalid formats.</violation>
</file>
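A minimal sketch of the safe lookup suggested for `document_indexing_adapter.py`, assuming `doc_id_to_ancestor_ids` is a plain dict from document id to a list of ancestor node ids. The `Chunk` class and the `attach_ancestors` helper are hypothetical stand-ins, not the PR's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical stand-in for the real chunk model.
    document_id: str
    ancestor_hierarchy_node_ids: list[int] = field(default_factory=list)


def attach_ancestors(
    chunks: list[Chunk], doc_id_to_ancestor_ids: dict[str, list[int]]
) -> None:
    """Attach ancestor ids to each chunk, defaulting to an empty list."""
    for chunk in chunks:
        # .get() avoids a KeyError when the document id is missing from the map.
        ancestors = doc_id_to_ancestor_ids.get(chunk.document_id, [])
        # Copy so that chunks never share one mutable list.
        chunk.ancestor_hierarchy_node_ids = list(ancestors)
```

Whether a missing document id should silently yield an empty list or be logged is a design choice the review comment leaves open.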
Greptile Summary
Confidence score: 4/5
Sequence Diagram
sequenceDiagram
participant User
participant "DocFetching Service" as DF
participant "Connector Runner" as CR
participant "Postgres DB" as PG
participant "Redis Cache" as RC
participant "Vespa Index" as VI
participant "DocProcessing Task" as DP
User->>DF: "Start document extraction"
DF->>PG: "Get index attempt and connector config"
DF->>CR: "Initialize connector with time window"
loop "For each document batch"
CR->>DF: "Return doc batch + hierarchy nodes"
DF->>PG: "Upsert hierarchy nodes to database"
DF->>RC: "Cache hierarchy nodes for fast lookup"
DF->>DF: "Strip null characters from documents"
DF->>DF: "Store document batch in file storage"
DF->>DP: "Queue document processing task"
end
DP->>DP: "Load documents from batch storage"
DP->>RC: "Get ancestor hierarchy IDs from parent raw ID"
DP->>DP: "Build metadata-aware chunks with ancestors"
DP->>VI: "Index chunks with hierarchy metadata"
DP->>PG: "Mark documents as indexed"
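The "Get ancestor hierarchy IDs from parent raw ID" step can be pictured as a walk up a parent map. The sketch below assumes the cache can be modeled as a dict from node id to parent id (`None` at the root); the function name and this layout are illustrative assumptions, and the PR's actual Redis structure may differ.

```python
from typing import Optional


def get_ancestor_ids(
    node_id: int, parent_of: dict[int, Optional[int]]
) -> list[int]:
    """Return node_id followed by its ancestors, nearest first."""
    ancestors: list[int] = []
    seen: set[int] = set()
    current: Optional[int] = node_id
    while current is not None and current not in seen:
        seen.add(current)  # guards against cycles in bad cache data
        ancestors.append(current)
        # A node missing from the map is treated as a root.
        current = parent_of.get(current)
    return ancestors
```

For example, with `parent_of = {3: 2, 2: 1, 1: None}`, calling `get_ancestor_ids(3, parent_of)` walks 3, then 2, then 1, and stops at the root.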
parts = value.split(":", 1)
parent_str = parts[0]
node_type_str = parts[1] if len(parts) > 1 else ""
parent_id = int(parent_str) if parent_str else None
logic: This could raise ValueError if parent_str contains non-numeric characters. Wrap the int() conversion in try/except.
parent_str being non-numeric is the kind of bug we would want to raise loudly
node_type_str = parts[1] if len(parts) > 1 else ""
parent_id = int(parent_str) if parent_str else None

node_type = HierarchyNodeType(node_type_str) if node_type_str else None
logic: This could raise ValueError if node_type_str is not a valid enum value. Wrap the HierarchyNodeType() constructor in try/except.
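One way to guard only the enum parse, sketched under assumptions: the `HierarchyNodeType` members and the `parse_cache_value` name below are hypothetical, and the cached value is assumed to have the form `parent_id:node_type`. The `int()` conversion is deliberately left unguarded, so a non-numeric parent id still raises.

```python
from enum import Enum
from typing import Optional


class HierarchyNodeType(str, Enum):
    # Hypothetical members; the real enum lives in the Onyx codebase.
    FOLDER = "folder"
    FILE = "file"


def parse_cache_value(
    value: str,
) -> tuple[Optional[int], Optional[HierarchyNodeType]]:
    """Parse "parent_id:node_type", tolerating an unknown node type."""
    parent_str, _, node_type_str = value.partition(":")
    # Left unguarded on purpose: a non-numeric parent id raises ValueError.
    parent_id = int(parent_str) if parent_str else None
    try:
        node_type = HierarchyNodeType(node_type_str) if node_type_str else None
    except ValueError:
        # Malformed or unknown type: fall back to None.
        node_type = None
    return parent_id, node_type
```

With this shape, a value with an unrecognized type yields `(parent_id, None)`, which is the contract the cubic review describes for invalid formats.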
else:
    value_str = str(value)

return int(value_str), True
logic: This could raise ValueError if value_str contains non-numeric characters. Wrap the int() conversion in try/except.
if that happens we have bigger probs
switch from bitmap to int list
implement hierarchy injection into vespa chunks
nits
Description
Injects hierarchy info into vector DB chunks.
How Has This Been Tested?
not yet
Summary by cubic
Inject hierarchy ancestors into vector DB chunks to improve relevance and enable lineage-aware filtering. Adds a Redis-backed hierarchy cache and wires ancestor resolution into docfetching and indexing.
Written for commit 18d21ad. Summary will update on new commits.