
feat(url): Open url around snippet #7488

Merged
Danelegend merged 16 commits into main from open_rp on Jan 23, 2026

Conversation

@Danelegend
Contributor

@Danelegend Danelegend commented Jan 17, 2026

Description

Currently, the agent may decide to run a web_search against some query. This returns a list of URLs with snippets associated with them. The LLM then determines which URLs it wants to read based on title + snippet, and we read each website via the open_url tool and parse the first 15,000 tokens of the page into the LLM's context window.

The snippet is the primary motivator for the LLM to choose a website to read, so we should base our reading around the snippet. This PR locates the snippet within the website's content and loads the content so that the snippet sits in the middle, with the surrounding content above and below loaded around it. This way we are more likely to load content the LLM will find relevant.
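The centering described above can be sketched roughly as follows. This is a hypothetical standalone version, not the PR's actual `_truncate_content_around_snippet` / `_expand_range_centered` helpers (whose signatures aren't shown in this thread): expand a fixed character budget symmetrically around the matched snippet, shifting leftover budget to the other side when a page boundary is hit.

```python
# Hypothetical sketch of snippet-centered truncation: the window is expanded
# symmetrically around the matched snippet and clamped to the content bounds.
def truncate_around_snippet(
    content: str,
    snippet_start: int,
    snippet_end: int,
    max_chars: int = 15_000,
) -> str:
    """Return a max_chars window of content with the snippet roughly centered."""
    snippet_len = snippet_end - snippet_start
    if snippet_len >= max_chars:
        # Snippet alone fills the budget; keep its head.
        return content[snippet_start : snippet_start + max_chars]

    budget = max_chars - snippet_len
    start = snippet_start - budget // 2
    end = snippet_end + (budget - budget // 2)

    # Clamp to the page bounds, shifting leftover budget to the other side.
    if start < 0:
        end = min(len(content), end - start)
        start = 0
    if end > len(content):
        start = max(0, start - (end - len(content)))
        end = len(content)
    return content[start:end]
```

When the snippet sits near the top of the page this degrades gracefully into the old behavior (a window anchored at the start), which is why the fallback path is cheap.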

How Has This Been Tested?

Played around with prompting on localhost.

There is a wide array of unit tests covering the output
(~3.8k lines of tests + data)

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Center the open_url content around the search snippet so the LLM reads the most relevant parts of a page. This improves context quality and reduces wasted tokens.

  • New Features
    • Find the snippet in page content (normalization + fuzzy token match) and truncate around it to a ~15k char window.
    • Extract a url→snippet map from search results and pass it through tool_runner to open_url; use it when building InferenceSections.
    • Fallback to the old truncation when no snippet is found.
    • Added comprehensive unit tests for snippet matching across HTML entities, zero-width chars, punctuation, and unicode cases.
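The normalization step listed above could look roughly like this. This is an illustrative sketch, not the PR's actual `_normalize_text_with_mapping`; the set of normalizations (HTML entities, zero-width chars, unicode compatibility forms, whitespace) is assumed from the test categories named in the summary.

```python
import html
import re
import unicodedata

# Zero-width space/joiners and BOM, mapped to None so str.translate drops them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def normalize_text(text: str) -> str:
    """Hypothetical normalizer: fold HTML entities, unicode compatibility
    characters, zero-width characters, and whitespace runs so the snippet
    and the page content compare on equal footing."""
    text = html.unescape(text)                  # &amp; -> &, &nbsp; -> \xa0
    text = unicodedata.normalize("NFKC", text)  # ligatures, \xa0 -> space, etc.
    text = text.translate(ZERO_WIDTH)           # drop zero-width characters
    text = re.sub(r"\s+", " ", text)            # collapse whitespace runs
    return text.strip().lower()
```

Running both the snippet and the page through the same normalizer before matching is what lets a plain substring search succeed before the fuzzy pass is needed.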

Written for commit 35a04a1. Summary will update on new commits.

@Danelegend Danelegend requested a review from a team as a code owner January 17, 2026 04:03
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 8 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py">

<violation number="1" location="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py:279">
P2: Guard against `content_word_positions` being shorter than the computed window before indexing; otherwise this can throw `IndexError` for content that normalizes differently from `processed_content`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@greptile-apps
Contributor

greptile-apps bot commented Jan 17, 2026

Greptile Summary

This PR improves web search result relevance by centering open_url content around the search snippet that motivated the LLM to click through.

Key Changes

  • New snippet matcher (snippet_matcher.py): Two-phase matching strategy (normalization + fuzzy) that handles HTML entities, unicode normalization, zero-width chars, and punctuation differences
  • Snippet-centered truncation (web_search/utils.py): Locates snippet in scraped content and expands symmetrically to 15k char window, placing snippet in the middle rather than truncating from top
  • URL→snippet mapping: Extracts snippets from search results and threads them through llm_loop.py → tool_runner.py → open_url_tool.py → content fetching
  • Robust fallback: When snippet not found or no snippet provided, falls back to original top-truncation behavior
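The `extract_url_snippet_map` helper described above presumably looks something like the following sketch. The `link`/`snippet` field names are assumptions; the real SearchDoc model is not shown in this thread.

```python
# Hypothetical sketch of the url -> snippet map threaded from the search
# results through tool_runner to open_url (field names are assumed).
def extract_url_snippet_map(search_docs: list[dict]) -> dict[str, str]:
    """Map each result URL to its snippet, keeping the first snippet seen."""
    url_snippet_map: dict[str, str] = {}
    for doc in search_docs:
        url, snippet = doc.get("link"), doc.get("snippet")
        if url and snippet and url not in url_snippet_map:
            url_snippet_map[url] = snippet
    return url_snippet_map
```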

Testing

  • ~3.8k lines of test data across 931 JSON test cases covering normalization edge cases, HTML entities, unicode, and multiple match scenarios
  • Integration tests with 215-line tartan.txt fixture testing snippet positioning at start/middle/end, fuzzy matching, and fallback behavior
  • 10-char buffer tolerance in tests accounts for normalization variations

The implementation is well-structured with clear separation of concerns. Previous review comments about typos and type annotations have been addressed.

Confidence Score: 4/5

  • Safe to merge with minor monitoring recommended for snippet matching edge cases in production
  • Excellent test coverage (~3.8k lines), clean architecture with fallback handling, and previous review issues addressed. Score is 4 (not 5) due to the complexity of text normalization which may have edge cases in production that tests don't cover, though the fallback mechanism mitigates this risk.
  • No files require special attention - implementation is clean and well-tested

Important Files Changed

Filename Overview
backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py New module implementing snippet matching with normalization and fuzzy matching strategies. Comprehensive test coverage. Previous comments addressed normalizer comment and type annotation issues.
backend/onyx/tools/tool_implementations/web_search/utils.py Added snippet-centered truncation logic and extract_url_snippet_map helper. Implementation correctly centers content around snippets with fallback to top truncation.
backend/onyx/tools/tool_runner.py Added url_snippet_map parameter and threaded it to OpenURLTool override kwargs. Clean integration with existing parallel tool execution logic.
backend/onyx/chat/llm_loop.py Imports extract_url_snippet_map and passes extracted snippets to run_tool_calls. Minimal change with correct integration.
backend/tests/unit/onyx/tools/tool_implementations/websearch/test_websearch_utils.py New tests for snippet-centered truncation using large tartan.txt fixture. Tests cover no snippet, snippet at bounds, middle positioning, fallback, and fuzzy matching.

Sequence Diagram

sequenceDiagram
    participant LLM as LLM Loop
    participant TR as Tool Runner
    participant WST as WebSearchTool
    participant OUT as OpenURLTool
    participant SM as Snippet Matcher
    participant WU as Web Utils

    Note over LLM: Agent decides to search web
    LLM->>WST: Execute web_search(queries)
    WST->>WST: Query search providers
    WST-->>LLM: Return SearchDocs with snippets
    
    Note over LLM: Extract url→snippet map
    LLM->>WU: extract_url_snippet_map(search_docs)
    WU-->>LLM: url_snippet_map dict
    
    Note over LLM: Agent decides to open URLs
    LLM->>TR: run_tool_calls(url_snippet_map)
    TR->>OUT: run(urls, url_snippet_map)
    
    Note over OUT: Fetch web content
    OUT->>OUT: _fetch_web_content(urls, url_snippet_map)
    
    loop For each URL
        OUT->>OUT: Scrape URL content
        OUT->>WU: inference_section_from_internet_page_scrape(content, snippet)
        
        alt Snippet provided
            WU->>SM: find_snippet_in_content(content, snippet)
            
            alt Normalize and match
                SM->>SM: _normalize_text_with_mapping()
                SM->>SM: Direct string match on normalized text
                SM-->>WU: Match found at indices
            else Token-based fuzzy match
                SM->>SM: fuzz.partial_ratio_alignment()
                SM-->>WU: Fuzzy match at indices
            end
            
            WU->>WU: _truncate_content_around_snippet()
            WU->>WU: _expand_range_centered()
            WU-->>OUT: InferenceSection (snippet-centered)
        else No snippet / not found
            WU->>WU: truncate_search_result_content()
            WU-->>OUT: InferenceSection (top 15k chars)
        end
    end
    
    OUT-->>TR: ToolResponse with sections
    TR-->>LLM: Sections with relevant content
    Note over LLM: LLM reads centered content

Contributor

@greptile-apps greptile-apps bot left a comment


8 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


7 issues found across 202 files (changes from recent commits).

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/utils/threadpool_concurrency.py">

<violation number="1">
P2: Avoid logging raw exception strings here; exceptions can embed sensitive URLs or tokens. Log only safe metadata such as the exception type.

(Based on your team's feedback about avoiding raw exception strings that may contain URLs with temporary auth tokens.) [FEEDBACK_USED]</violation>
</file>

<file name="backend/onyx/llm/models.py">

<violation number="1">
P3: The comment indicates strings are accepted, but the updated LanguageModelInput type no longer allows str, so the comment is now misleading. Update the type or the comment to match the actual accepted inputs.</violation>
</file>

<file name="backend/ee/onyx/secondary_llm_flows/query_expansion.py">

<violation number="1">
P2: Avoid logging the raw exception string here; it may include URLs or sensitive request details. Log only safe metadata like the exception type.

(Based on your team's feedback about not logging raw exception strings with URLs/tokens.) [FEEDBACK_USED]</violation>
</file>

<file name="backend/onyx/configs/app_configs.py">

<violation number="1">
P2: Avoid hardcoded defaults for new env configs so missing configuration is detectable. Defaulting to http/127.0.0.1 can silently route production traffic to a local address instead of forcing explicit configuration.

(Based on your team's feedback about avoiding truthy/hardcoded defaults for env configs.) [FEEDBACK_USED]</violation>
</file>

<file name="contributing_guides/contribution_process.md">

<violation number="1">
P3: The heading level jumps from "##" to "#" for steps 3 and "Implicit agreements", which breaks the section hierarchy and formatting. Use consistent "##" headings for these sections to keep the document structure aligned.</violation>
</file>

<file name="backend/onyx/image_gen/providers/vertex_img_gen.py">

<violation number="1">
P2: Guard against invalid JSON in vertex_credentials so validation fails gracefully with ImageProviderCredentialsError instead of raising JSONDecodeError.</violation>
</file>

<file name="backend/onyx/llm/prompt_cache/providers/factory.py">

<violation number="1">
P3: Update the Args docstring to describe the new `llm_config` parameter instead of a provider string to avoid misleading documentation.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@Danelegend
Contributor Author

@greptile

Contributor


we could probably use git lfs/another file store for our test files (slack-https://onyx-company.slack.com/archives/C0771QKDBPE/p1768864722539549)

Contributor Author


Just had a chat w/ jamison. We decided the best course of action is to leave this here for now

]
},
{
"category": "no_match",
Contributor

@jessicasingh7 jessicasingh7 Jan 19, 2026


not sure we need such a verbose "content" for the no_match case

Contributor Author


I'd disagree here. A false positive is probably worse than a false negative. Because we are fuzzy searching (which, although not a black box, we can treat as one that spits out a somewhat random number), we want to be sure we won't produce false positives.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py">

<violation number="1" location="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py:232">
P1: The indices returned (`res.src_start`, `res.src_end`) are positions in the **processed** string, not the original `content`. The old implementation mapped these back to original positions via `_get_word_positions()`, but this refactored code returns processed string indices directly. This will cause incorrect snippet extraction since callers expect indices into the original content string.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

score = res.score

if score >= (min_threshold * 100):
    start_idx = res.src_start
Contributor

@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026


P1: The indices returned (res.src_start, res.src_end) are positions in the processed string, not the original content. The old implementation mapped these back to original positions via _get_word_positions(), but this refactored code returns processed string indices directly. This will cause incorrect snippet extraction since callers expect indices into the original content string.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py, line 232:

<comment>The indices returned (`res.src_start`, `res.src_end`) are positions in the **processed** string, not the original `content`. The old implementation mapped these back to original positions via `_get_word_positions()`, but this refactored code returns processed string indices directly. This will cause incorrect snippet extraction since callers expect indices into the original content string.</comment>

<file context>
@@ -246,83 +221,21 @@ def _token_based_match(
-        else:
-            original_end = content_word_positions[end_word_idx][1]
+    if score >= (min_threshold * 100):
+        start_idx = res.src_start
+        end_idx = res.src_end
 
</file context>
Fix with Cubic
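The fix this comment asks for is to carry an index mapping through normalization, so that match positions found in the processed string can be translated back into indices of the original `content`. A minimal sketch of the idea (hypothetical; not the PR's actual `_get_word_positions` or `_normalize_text_with_mapping`):

```python
def normalize_with_mapping(text: str) -> tuple[str, list[int]]:
    """Lowercase and collapse whitespace while recording, for every character
    of the processed string, its index in the original text."""
    out_chars: list[str] = []
    mapping: list[int] = []  # mapping[i] = original index of processed char i
    prev_space = True        # True so leading whitespace is skipped
    for i, ch in enumerate(text):
        if ch.isspace():
            if prev_space:
                continue  # collapse whitespace runs
            ch, prev_space = " ", True
        else:
            prev_space = False
        out_chars.append(ch.lower())
        mapping.append(i)
    return "".join(out_chars), mapping

def to_original_span(mapping: list[int], start: int, end: int) -> tuple[int, int]:
    """Translate a [start, end) span in the processed string to original indices."""
    return mapping[start], mapping[end - 1] + 1
```

With this in place, `res.src_start`/`res.src_end` from the fuzzy matcher would be passed through `to_original_span` before slicing `content`.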

Contributor


@Danelegend is this valid?

@Danelegend
Contributor Author

@greptile

score = res.score

if score >= (min_threshold * 100):
    start_idx = res.src_start
Contributor


@Danelegend is this valid?

@Danelegend Danelegend added this pull request to the merge queue Jan 23, 2026
Merged via the queue into main with commit 31db112 Jan 23, 2026
76 checks passed
@Danelegend Danelegend deleted the open_rp branch January 23, 2026 17:06