feat: Maintain correct docs on replay #7683

Merged
yuhongsun96 merged 2 commits into main from selected-docs
Jan 23, 2026

@yuhongsun96 yuhongsun96 commented Jan 23, 2026

Description

Previously, replaying a session showed all the docs that came back from the search; now it shows the same docs as the first pass. Addresses: ENG-3135

Additionally, cited sources are now sorted on replay by the order in which they appear in the text.
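
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes citations appear in the answer text as bracketed numbers such as `[3]`, and the function names and the `citation_to_doc` shape (citation number to document ID) are hypothetical.

```python
import re

# Assumption: citations are rendered in the text as bracketed integers, e.g. "[3]".
CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def order_citations_by_first_appearance(text: str) -> list[int]:
    """Return citation numbers in the order they first appear in the text."""
    seen: set[int] = set()
    ordered: list[int] = []
    for match in CITATION_PATTERN.finditer(text):
        num = int(match.group(1))
        if num not in seen:
            seen.add(num)
            ordered.append(num)
    return ordered


def sort_cited_docs(text: str, citation_to_doc: dict[int, str]) -> list[str]:
    """Return cited document IDs ordered by first appearance, without repeats."""
    doc_ids: list[str] = []
    for num in order_citations_by_first_appearance(text):
        doc_id = citation_to_doc.get(num)  # skip numbers with no known document
        if doc_id is not None and doc_id not in doc_ids:
            doc_ids.append(doc_id)
    return doc_ids
```

For example, with the text `"See [3] and [1], then [3] again and [2]."` the citation order is 3, 1, 2, regardless of the order in which the rows come back from the DB.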

How Has This Been Tested?

Verified with main chat loop with some tools
Verified with deep research

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Fixes doc selection on chat replay so it shows the same displayed docs as the original turn, not the full search results. Addresses ENG-3135 (doc selection on replay).

  • Bug Fixes
    • Added displayed_docs to SearchDocsResponse and saved those per tool call; replay now uses this subset.
    • Tracked all fetched search docs and emitted citation numbers in ChatStateContainer; saved only emitted citations, deduped docs, and ordered citations by first appearance.
    • Updated save logic to create DB entries from all_search_docs, link tool calls to displayed docs, and build citations from the emitted mapping.

Written for commit 84e6067. Summary will update on new commits.

@yuhongsun96 yuhongsun96 requested a review from a team as a code owner January 23, 2026 00:58

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 8 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/tools/fake_tools/research_agent.py">

<violation number="1" location="backend/onyx/tools/fake_tools/research_agent.py:508">
P2: `displayed_docs or search_docs` treats an empty list as falsy and will persist all search docs even when no docs were displayed. Preserve an empty `displayed_docs` by checking for `None` explicitly.</violation>
</file>
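
The difference between the two checks can be seen in a small sketch. The function names are hypothetical and docs are represented as plain strings for brevity; which behaviour is wanted when nothing was displayed is a judgment call for the PR author.

```python
from typing import Optional


def docs_to_persist_truthy(
    displayed_docs: Optional[list[str]], search_docs: list[str]
) -> list[str]:
    # `or` falls back whenever displayed_docs is falsy, which includes [].
    return displayed_docs or search_docs


def docs_to_persist_explicit(
    displayed_docs: Optional[list[str]], search_docs: list[str]
) -> list[str]:
    # Fall back only when no displayed subset was produced at all (None).
    return search_docs if displayed_docs is None else displayed_docs


all_docs = ["doc-a", "doc-b"]
print(docs_to_persist_truthy([], all_docs))      # ['doc-a', 'doc-b']
print(docs_to_persist_explicit([], all_docs))    # []
print(docs_to_persist_explicit(None, all_docs))  # ['doc-a', 'doc-b']
```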

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


greptile-apps bot commented Jan 23, 2026

Greptile Overview

Greptile Summary

This PR fixes the document selection issue on chat replay (ENG-3135) by separating the concepts of "all fetched documents" from "displayed documents" throughout the chat pipeline.

Key Changes:

  • Added displayed_docs field to SearchDocsResponse to distinguish between all search results and the subset shown to users
  • Introduced ChatStateContainer tracking for all search docs (deduplicated by document_id) and emitted citations
  • Modified save_chat_turn() to create DB entries from all fetched docs while linking only displayed docs to tool calls
  • Added citation filtering to save only citations that were actually emitted during streaming
  • Applied changes consistently across both main chat loop and research agent flows
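
As a rough sketch of the state tracking listed above (not the actual `ChatStateContainer`): `add_search_docs` and `add_emitted_citation` are method names mentioned in this PR, while the `SearchDoc` stand-in, the getters, and the "first occurrence wins" deduplication rule are assumptions made for illustration.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchDoc:
    # Hypothetical stand-in for the real search doc model.
    document_id: str
    title: str


class StateContainerSketch:
    """Tracks all fetched docs (deduplicated) and emitted citation numbers."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # dict preserves insertion order, so docs stay in fetch order.
        self._all_search_docs: dict[str, SearchDoc] = {}
        self._emitted_citations: set[int] = set()

    def add_search_docs(self, docs: list[SearchDoc]) -> None:
        with self._lock:
            for doc in docs:
                # Assumption: the first doc seen for a document_id is kept.
                self._all_search_docs.setdefault(doc.document_id, doc)

    def add_emitted_citation(self, citation_num: int) -> None:
        with self._lock:
            self._emitted_citations.add(citation_num)

    def get_all_search_docs(self) -> list[SearchDoc]:
        with self._lock:
            return list(self._all_search_docs.values())

    def get_emitted_citations(self) -> set[int]:
        with self._lock:
            # Return a copy so callers cannot mutate internal state.
            return set(self._emitted_citations)
```

Keeping the full fetched set separate from the per-tool-call displayed subset is what lets the save step store every doc once while replay reads only what was shown.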

Impact:
When replaying a chat session, the UI now displays the same documents that were shown originally (via displayed_docs) rather than the full search results, ensuring replay accuracy matches the original experience.

Confidence Score: 4/5

  • This PR is safe to merge with low risk - the changes are well-structured and maintain backward compatibility
  • The implementation correctly separates displayed docs from all search docs and adds proper citation tracking. The logic is consistent across both chat flows (main loop and research agent). Minor risk exists in the fallback logic and the complexity of the deduplication strategy, but the changes are well-contained and the core logic appears sound.
  • Pay attention to backend/onyx/chat/save_chat.py - the most complex file with multiple deduplication and mapping steps

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/onyx/chat/chat_state.py | Added search doc tracking and citation emission tracking to ChatStateContainer with thread-safe methods and deduplication support |
| backend/onyx/chat/llm_loop.py | Extracts displayed_docs, adds all search_docs to state container, and saves displayed_docs or search_docs fallback to tool call info |
| backend/onyx/chat/llm_step.py | Added tracking of emitted citations by calling state_container.add_emitted_citation() whenever a citation is streamed to the frontend |
| backend/onyx/chat/save_chat.py | Refactored to create DB docs from pre-deduplicated all_search_docs, link displayed docs to tool calls, and filter citations by emitted set |

Sequence Diagram

sequenceDiagram
    participant SearchTool
    participant LLMLoop
    participant StateContainer
    participant LLMStep
    participant SaveChat
    participant DB

    SearchTool->>SearchTool: Execute search query
    SearchTool->>SearchTool: Generate search_docs & final_ui_docs
    SearchTool->>LLMLoop: Return SearchDocsResponse<br/>(search_docs, displayed_docs, citation_mapping)
    
    LLMLoop->>StateContainer: add_search_docs(search_docs)<br/>(stores ALL search docs)
    LLMLoop->>LLMLoop: Create ToolCallInfo with<br/>displayed_docs or search_docs
    LLMLoop->>StateContainer: add_tool_call(tool_call_info)
    
    LLMLoop->>LLMStep: Stream LLM response
    LLMStep->>LLMStep: Process citation in answer
    LLMStep->>StateContainer: add_emitted_citation(citation_num)<br/>(track citations that appear in text)
    LLMStep->>StateContainer: set_citation_mapping(citation_to_doc)
    
    LLMLoop->>SaveChat: save_chat_turn(citation_to_doc,<br/>all_search_docs, emitted_citations)
    
    SaveChat->>DB: Create SearchDoc entries<br/>from all_search_docs
    SaveChat->>SaveChat: Build tool_call -> displayed_docs mapping
    SaveChat->>SaveChat: Filter citations by emitted_citations
    SaveChat->>DB: Link displayed_docs to ToolCalls
    SaveChat->>DB: Link all search_docs to ChatMessage
    SaveChat->>DB: Save citations mapping (emitted only)
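
The last SaveChat steps in the diagram can be approximated with the sketch below. It is an illustration under assumptions, not the logic in `save_chat.py`: both helper names, the string document IDs, and the `db_id_by_document_id` lookup (document ID to DB row ID) are hypothetical.

```python
def filter_emitted_citations(
    citation_to_doc: dict[int, str], emitted_citations: set[int]
) -> dict[int, str]:
    """Keep only citations that were actually streamed to the user."""
    return {
        num: doc_id
        for num, doc_id in citation_to_doc.items()
        if num in emitted_citations
    }


def link_displayed_docs(
    displayed_by_tool_call: dict[str, list[str]],
    db_id_by_document_id: dict[str, int],
) -> dict[str, list[int]]:
    """Map each tool call to the DB IDs of the docs it displayed."""
    links: dict[str, list[int]] = {}
    for tool_call_id, doc_ids in displayed_by_tool_call.items():
        seen: set[int] = set()
        db_ids: list[int] = []
        for doc_id in doc_ids:
            db_id = db_id_by_document_id.get(doc_id)
            # Skip docs without a DB row and avoid linking the same row twice.
            if db_id is not None and db_id not in seen:
                seen.add(db_id)
                db_ids.append(db_id)
        links[tool_call_id] = db_ids
    return links
```

Because DB rows are created once from the deduplicated full set, two tool calls that displayed the same document end up linked to the same row.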

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

  tool_call_arguments=tool_call.tool_args,
  tool_call_response=tool_response.llm_facing_response,
- search_docs=search_docs,
+ search_docs=displayed_docs or search_docs,
Contributor

When would displayed_docs be None?

Contributor

Don't the research agents need all search docs?

Contributor Author

The LLM filtering step could fail, and I think it's OK to default to the larger set in that case. It would likely also be OK to just let it fail.

Contributor Author

Changing this to the displayed docs only affects how the tool call is saved in the DB. It's for replaying, and it's for the internal search / web search tool.


// Separate cited documents from other documents
const citedDocumentIds = useMemo(() => {
// Get citations in order and build a set of cited document IDs
Contributor

Why are we adding a citation order? I thought earlier citations were already in order of importance.

Contributor Author

Mmm, it definitely wasn't showing in order in the UI. We just load them from the DB and pass them to the frontend via the relationship table, and there was no sorting or anything prior to this, so I'm pretty sure it wasn't sorted.

@yuhongsun96 yuhongsun96 merged commit 3e4a1f8 into main Jan 23, 2026
78 of 79 checks passed
@yuhongsun96 yuhongsun96 deleted the selected-docs branch January 23, 2026 03:24