
feat(url): Open url around snippet #7488

Merged
Danelegend merged 16 commits into main from open_rp on Jan 23, 2026

Conversation

@Danelegend
Contributor

@Danelegend Danelegend commented Jan 17, 2026

Description

Currently, the agent may decide to run a web_search against some query. This returns a list of URLs with snippets associated with them. The LLM then determines which URLs it wants to read based on title + snippet, and we read each website via the open_url tool and parse the first 15,000 tokens of the page into the LLM's context window.

The snippet is the primary motivator for the LLM to choose a website to read, so we should base our reading around the snippet. This PR locates the snippet within the website's content and loads the content so that the snippet sits in the middle, with the surrounding content above and below loaded around it. This way we are more likely to load content the LLM will find relevant.
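The centering described above can be sketched roughly as follows. This is a hypothetical standalone version, not the PR's actual `_truncate_content_around_snippet` / `_expand_range_centered` helpers (whose signatures aren't shown in this thread): expand a fixed character budget symmetrically around the matched snippet, shifting leftover budget to the other side when a page boundary is hit.

```python
# Hypothetical sketch of snippet-centered truncation: the window is expanded
# symmetrically around the matched snippet and clamped to the content bounds.
def truncate_around_snippet(
    content: str,
    snippet_start: int,
    snippet_end: int,
    max_chars: int = 15_000,
) -> str:
    """Return a max_chars window of content with the snippet roughly centered."""
    snippet_len = snippet_end - snippet_start
    if snippet_len >= max_chars:
        # Snippet alone fills the budget; keep its head.
        return content[snippet_start : snippet_start + max_chars]

    budget = max_chars - snippet_len
    start = snippet_start - budget // 2
    end = snippet_end + (budget - budget // 2)

    # Clamp to the page bounds, shifting leftover budget to the other side.
    if start < 0:
        end = min(len(content), end - start)
        start = 0
    if end > len(content):
        start = max(0, start - (end - len(content)))
        end = len(content)
    return content[start:end]
```

When the snippet sits near the top of the page this degrades gracefully into the old behavior (a window anchored at the start), which is why the fallback path is cheap.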

How Has This Been Tested?

Played around with prompting on localhost.

There is a wide array of unit tests covering the output
(~3.8k lines of tests + data)

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Center the open_url content around the search snippet so the LLM reads the most relevant parts of a page. This improves context quality and reduces wasted tokens.

  • New Features
    • Find the snippet in page content (normalization + fuzzy token match) and truncate around it to a ~15k char window.
    • Extract a url→snippet map from search results and pass it through tool_runner to open_url; use it when building InferenceSections.
    • Fallback to the old truncation when no snippet is found.
    • Added comprehensive unit tests for snippet matching across HTML entities, zero-width chars, punctuation, and unicode cases.
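The normalization step listed above could look roughly like this. This is an illustrative sketch, not the PR's actual `_normalize_text_with_mapping`; the set of normalizations (HTML entities, zero-width chars, unicode compatibility forms, whitespace) is assumed from the test categories named in the summary.

```python
import html
import re
import unicodedata

# Zero-width space/joiners and BOM, mapped to None so str.translate drops them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def normalize_text(text: str) -> str:
    """Hypothetical normalizer: fold HTML entities, unicode compatibility
    characters, zero-width characters, and whitespace runs so the snippet
    and the page content compare on equal footing."""
    text = html.unescape(text)                  # &amp; -> &, &nbsp; -> \xa0
    text = unicodedata.normalize("NFKC", text)  # ligatures, \xa0 -> space, etc.
    text = text.translate(ZERO_WIDTH)           # drop zero-width characters
    text = re.sub(r"\s+", " ", text)            # collapse whitespace runs
    return text.strip().lower()
```

Running both the snippet and the page through the same normalizer before matching is what lets a plain substring search succeed before the fuzzy pass is needed.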

Written for commit 35a04a1. Summary will update on new commits.

@Danelegend Danelegend requested a review from a team as a code owner January 17, 2026 04:03
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 8 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py">

<violation number="1" location="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py:279">
P2: Guard against `content_word_positions` being shorter than the computed window before indexing; otherwise this can throw `IndexError` for content that normalizes differently from `processed_content`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@greptile-apps
Contributor

greptile-apps bot commented Jan 17, 2026

Greptile Summary

This PR improves web search result relevance by centering open_url content around the search snippet that motivated the LLM to click through.

Key Changes

  • New snippet matcher (snippet_matcher.py): Two-phase matching strategy (normalization + fuzzy) that handles HTML entities, unicode normalization, zero-width chars, and punctuation differences
  • Snippet-centered truncation (web_search/utils.py): Locates snippet in scraped content and expands symmetrically to 15k char window, placing snippet in the middle rather than truncating from top
  • URL→snippet mapping: Extracts snippets from search results and threads them through llm_loop.py → tool_runner.py → open_url_tool.py → content fetching
  • Robust fallback: When snippet not found or no snippet provided, falls back to original top-truncation behavior
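The `extract_url_snippet_map` helper described above presumably looks something like the following sketch. The `link`/`snippet` field names are assumptions; the real SearchDoc model is not shown in this thread.

```python
# Hypothetical sketch of the url -> snippet map threaded from the search
# results through tool_runner to open_url (field names are assumed).
def extract_url_snippet_map(search_docs: list[dict]) -> dict[str, str]:
    """Map each result URL to its snippet, keeping the first snippet seen."""
    url_snippet_map: dict[str, str] = {}
    for doc in search_docs:
        url, snippet = doc.get("link"), doc.get("snippet")
        if url and snippet and url not in url_snippet_map:
            url_snippet_map[url] = snippet
    return url_snippet_map
```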

Testing

  • ~3.8k lines of test data across 931 JSON test cases covering normalization edge cases, HTML entities, unicode, and multiple match scenarios
  • Integration tests with 215-line tartan.txt fixture testing snippet positioning at start/middle/end, fuzzy matching, and fallback behavior
  • 10-char buffer tolerance in tests accounts for normalization variations

The implementation is well-structured with clear separation of concerns. Previous review comments about typos and type annotations have been addressed.

Confidence Score: 4/5

  • Safe to merge with minor monitoring recommended for snippet matching edge cases in production
  • Excellent test coverage (~3.8k lines), clean architecture with fallback handling, and previous review issues addressed. Score is 4 (not 5) due to the complexity of text normalization which may have edge cases in production that tests don't cover, though the fallback mechanism mitigates this risk.
  • No files require special attention - implementation is clean and well-tested

Important Files Changed

Filename Overview
backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py New module implementing snippet matching with normalization and fuzzy matching strategies. Comprehensive test coverage. Previous comments addressed normalizer comment and type annotation issues.
backend/onyx/tools/tool_implementations/web_search/utils.py Added snippet-centered truncation logic and extract_url_snippet_map helper. Implementation correctly centers content around snippets with fallback to top truncation.
backend/onyx/tools/tool_runner.py Added url_snippet_map parameter and threaded it to OpenURLTool override kwargs. Clean integration with existing parallel tool execution logic.
backend/onyx/chat/llm_loop.py Imports extract_url_snippet_map and passes extracted snippets to run_tool_calls. Minimal change with correct integration.
backend/tests/unit/onyx/tools/tool_implementations/websearch/test_websearch_utils.py New tests for snippet-centered truncation using large tartan.txt fixture. Tests cover no snippet, snippet at bounds, middle positioning, fallback, and fuzzy matching.

Sequence Diagram

sequenceDiagram
    participant LLM as LLM Loop
    participant TR as Tool Runner
    participant WST as WebSearchTool
    participant OUT as OpenURLTool
    participant SM as Snippet Matcher
    participant WU as Web Utils

    Note over LLM: Agent decides to search web
    LLM->>WST: Execute web_search(queries)
    WST->>WST: Query search providers
    WST-->>LLM: Return SearchDocs with snippets
    
    Note over LLM: Extract url→snippet map
    LLM->>WU: extract_url_snippet_map(search_docs)
    WU-->>LLM: url_snippet_map dict
    
    Note over LLM: Agent decides to open URLs
    LLM->>TR: run_tool_calls(url_snippet_map)
    TR->>OUT: run(urls, url_snippet_map)
    
    Note over OUT: Fetch web content
    OUT->>OUT: _fetch_web_content(urls, url_snippet_map)
    
    loop For each URL
        OUT->>OUT: Scrape URL content
        OUT->>WU: inference_section_from_internet_page_scrape(content, snippet)
        
        alt Snippet provided
            WU->>SM: find_snippet_in_content(content, snippet)
            
            alt Normalize and match
                SM->>SM: _normalize_text_with_mapping()
                SM->>SM: Direct string match on normalized text
                SM-->>WU: Match found at indices
            else Token-based fuzzy match
                SM->>SM: fuzz.partial_ratio_alignment()
                SM-->>WU: Fuzzy match at indices
            end
            
            WU->>WU: _truncate_content_around_snippet()
            WU->>WU: _expand_range_centered()
            WU-->>OUT: InferenceSection (snippet-centered)
        else No snippet / not found
            WU->>WU: truncate_search_result_content()
            WU-->>OUT: InferenceSection (top 15k chars)
        end
    end
    
    OUT-->>TR: ToolResponse with sections
    TR-->>LLM: Sections with relevant content
    Note over LLM: LLM reads centered content

Contributor

@greptile-apps greptile-apps bot left a comment


8 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


7 issues found across 202 files (changes from recent commits).

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/utils/threadpool_concurrency.py">

<violation number="1">
P2: Avoid logging raw exception strings here; exceptions can embed sensitive URLs or tokens. Log only safe metadata such as the exception type.

(Based on your team's feedback about avoiding raw exception strings that may contain URLs with temporary auth tokens.) [FEEDBACK_USED]</violation>
</file>

<file name="backend/onyx/llm/models.py">

<violation number="1">
P3: The comment indicates strings are accepted, but the updated LanguageModelInput type no longer allows str, so the comment is now misleading. Update the type or the comment to match the actual accepted inputs.</violation>
</file>

<file name="backend/ee/onyx/secondary_llm_flows/query_expansion.py">

<violation number="1">
P2: Avoid logging the raw exception string here; it may include URLs or sensitive request details. Log only safe metadata like the exception type.

(Based on your team's feedback about not logging raw exception strings with URLs/tokens.) [FEEDBACK_USED]</violation>
</file>

<file name="backend/onyx/configs/app_configs.py">

<violation number="1">
P2: Avoid hardcoded defaults for new env configs so missing configuration is detectable. Defaulting to http/127.0.0.1 can silently route production traffic to a local address instead of forcing explicit configuration.

(Based on your team's feedback about avoiding truthy/hardcoded defaults for env configs.) [FEEDBACK_USED]</violation>
</file>

<file name="contributing_guides/contribution_process.md">

<violation number="1">
P3: The heading level jumps from "##" to "#" for steps 3 and "Implicit agreements", which breaks the section hierarchy and formatting. Use consistent "##" headings for these sections to keep the document structure aligned.</violation>
</file>

<file name="backend/onyx/image_gen/providers/vertex_img_gen.py">

<violation number="1">
P2: Guard against invalid JSON in vertex_credentials so validation fails gracefully with ImageProviderCredentialsError instead of raising JSONDecodeError.</violation>
</file>

<file name="backend/onyx/llm/prompt_cache/providers/factory.py">

<violation number="1">
P3: Update the Args docstring to describe the new `llm_config` parameter instead of a provider string to avoid misleading documentation.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@Danelegend
Contributor Author

@greptile

Contributor


we could probably use git lfs/another file store for our test files (slack-https://onyx-company.slack.com/archives/C0771QKDBPE/p1768864722539549)

Contributor Author


Just had a chat w/ jamison. We decided the best course of action is to leave this here for now

]
},
{
"category": "no_match",
Contributor

@jessicasingh7 jessicasingh7 Jan 19, 2026


not sure we need such a verbose "content" for the no_match case

Contributor Author


I'd disagree here. A false positive is probably worse than a false negative. Because we are fuzzy searching (which, although not a black box, we can treat as one that spits out a somewhat random number), we want to be sure we won't produce false positives.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py">

<violation number="1" location="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py:232">
P1: The indices returned (`res.src_start`, `res.src_end`) are positions in the **processed** string, not the original `content`. The old implementation mapped these back to original positions via `_get_word_positions()`, but this refactored code returns processed string indices directly. This will cause incorrect snippet extraction since callers expect indices into the original content string.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

score = res.score

if score >= (min_threshold * 100):
    start_idx = res.src_start
Contributor

@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026


P1: The indices returned (res.src_start, res.src_end) are positions in the processed string, not the original content. The old implementation mapped these back to original positions via _get_word_positions(), but this refactored code returns processed string indices directly. This will cause incorrect snippet extraction since callers expect indices into the original content string.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py, line 232:

<comment>The indices returned (`res.src_start`, `res.src_end`) are positions in the **processed** string, not the original `content`. The old implementation mapped these back to original positions via `_get_word_positions()`, but this refactored code returns processed string indices directly. This will cause incorrect snippet extraction since callers expect indices into the original content string.</comment>

<file context>
@@ -246,83 +221,21 @@ def _token_based_match(
-        else:
-            original_end = content_word_positions[end_word_idx][1]
+    if score >= (min_threshold * 100):
+        start_idx = res.src_start
+        end_idx = res.src_end
 
</file context>
Fix with Cubic
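The fix this comment asks for is to carry an index mapping through normalization, so that match positions found in the processed string can be translated back into indices of the original `content`. A minimal sketch of the idea (hypothetical; not the PR's actual `_get_word_positions` or `_normalize_text_with_mapping`):

```python
def normalize_with_mapping(text: str) -> tuple[str, list[int]]:
    """Lowercase and collapse whitespace while recording, for every character
    of the processed string, its index in the original text."""
    out_chars: list[str] = []
    mapping: list[int] = []  # mapping[i] = original index of processed char i
    prev_space = True        # True so leading whitespace is skipped
    for i, ch in enumerate(text):
        if ch.isspace():
            if prev_space:
                continue  # collapse whitespace runs
            ch, prev_space = " ", True
        else:
            prev_space = False
        out_chars.append(ch.lower())
        mapping.append(i)
    return "".join(out_chars), mapping

def to_original_span(mapping: list[int], start: int, end: int) -> tuple[int, int]:
    """Translate a [start, end) span in the processed string to original indices."""
    return mapping[start], mapping[end - 1] + 1
```

With this in place, `res.src_start`/`res.src_end` from the fuzzy matcher would be passed through `to_original_span` before slicing `content`.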

Contributor


@Danelegend is this valid?

@Danelegend
Contributor Author

@greptile

score = res.score

if score >= (min_threshold * 100):
    start_idx = res.src_start
Contributor


@Danelegend is this valid?

@Danelegend Danelegend added this pull request to the merge queue Jan 23, 2026
Merged via the queue into main with commit 31db112 Jan 23, 2026
76 checks passed
@Danelegend Danelegend deleted the open_rp branch January 23, 2026 17:06