Conversation
1 issue found across 8 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py">
<violation number="1" location="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py:279">
P2: Guard against `content_word_positions` being shorter than the computed window before indexing; otherwise this can throw `IndexError` for content that normalizes differently from `processed_content`.</violation>
</file>
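The guard the reviewer asks for can be sketched roughly as follows. This is a hypothetical illustration, not the file's actual code: the names `content_word_positions`, `start_word_idx`, and `window_size` are assumptions inferred from the comment.

```python
def safe_window_span(content_word_positions, start_word_idx, window_size):
    """Return (start_char, end_char) for a word window, or None when the
    window would run past the available word positions.

    content_word_positions is assumed to be a list of (start, end) character
    offsets, one per word of the normalized content.
    """
    end_word_idx = start_word_idx + window_size - 1
    if end_word_idx >= len(content_word_positions):
        # Content normalized to fewer words than the snippet window expects;
        # bail out instead of raising IndexError.
        return None
    start_char = content_word_positions[start_word_idx][0]
    end_char = content_word_positions[end_word_idx][1]
    return start_char, end_char
```

Returning `None` lets the caller fall back to the no-match path rather than crashing on pages whose normalization differs from `processed_content`.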
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Greptile Summary
This PR improves web search result relevance by centering the content loaded by open_url around the search snippet.
Key Changes
Testing
The implementation is well-structured with clear separation of concerns. Previous review comments about typos and type annotations have been addressed.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant LLM as LLM Loop
    participant TR as Tool Runner
    participant WST as WebSearchTool
    participant OUT as OpenURLTool
    participant SM as Snippet Matcher
    participant WU as Web Utils
    Note over LLM: Agent decides to search web
    LLM->>WST: Execute web_search(queries)
    WST->>WST: Query search providers
    WST-->>LLM: Return SearchDocs with snippets
    Note over LLM: Extract url→snippet map
    LLM->>WU: extract_url_snippet_map(search_docs)
    WU-->>LLM: url_snippet_map dict
    Note over LLM: Agent decides to open URLs
    LLM->>TR: run_tool_calls(url_snippet_map)
    TR->>OUT: run(urls, url_snippet_map)
    Note over OUT: Fetch web content
    OUT->>OUT: _fetch_web_content(urls, url_snippet_map)
    loop For each URL
        OUT->>OUT: Scrape URL content
        OUT->>WU: inference_section_from_internet_page_scrape(content, snippet)
        alt Snippet provided
            WU->>SM: find_snippet_in_content(content, snippet)
            alt Normalize and match
                SM->>SM: _normalize_text_with_mapping()
                SM->>SM: Direct string match on normalized text
                SM-->>WU: Match found at indices
            else Token-based fuzzy match
                SM->>SM: fuzz.partial_ratio_alignment()
                SM-->>WU: Fuzzy match at indices
            end
            WU->>WU: _truncate_content_around_snippet()
            WU->>WU: _expand_range_centered()
            WU-->>OUT: InferenceSection (snippet-centered)
        else No snippet / not found
            WU->>WU: truncate_search_result_content()
            WU-->>OUT: InferenceSection (top 15k chars)
        end
    end
    OUT-->>TR: ToolResponse with sections
    TR-->>LLM: Sections with relevant content
    Note over LLM: LLM reads centered content
```
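The two-stage matching flow shown above (direct match on normalized text, then a token-based fuzzy fallback) can be sketched roughly like this. It is an illustrative stand-in, not the PR's implementation: the real code uses rapidfuzz's `partial_ratio_alignment`, while this sketch substitutes stdlib `difflib` for the fuzzy stage, and `find_snippet`/`min_threshold` are hypothetical names.

```python
import difflib

def find_snippet(content: str, snippet: str, min_threshold: float = 0.8):
    """Return (start, end) indices of the snippet within the normalized
    content, or None when no confident match is found."""
    # Stage 0: normalize both strings (lowercase, collapse whitespace)
    norm_content = " ".join(content.lower().split())
    norm_snippet = " ".join(snippet.lower().split())
    # Stage 1: direct substring match on the normalized text
    idx = norm_content.find(norm_snippet)
    if idx != -1:
        return idx, idx + len(norm_snippet)
    # Stage 2: fuzzy fallback (stand-in for rapidfuzz partial_ratio_alignment)
    sm = difflib.SequenceMatcher(None, norm_snippet, norm_content, autojunk=False)
    match = sm.find_longest_match(0, len(norm_snippet), 0, len(norm_content))
    if match.size / max(len(norm_snippet), 1) >= min_threshold:
        return match.b, match.b + match.size
    return None
```

Note the returned indices refer to the normalized string; mapping them back to the original content is a separate step, which is exactly what one of the review comments below flags.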
7 issues found across 202 files (changes from recent commits).
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/onyx/utils/threadpool_concurrency.py">
<violation number="1">
P2: Avoid logging raw exception strings here; exceptions can embed sensitive URLs or tokens. Log only safe metadata such as the exception type.
(Based on your team's feedback about avoiding raw exception strings that may contain URLs with temporary auth tokens.) [FEEDBACK_USED]</violation>
</file>
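The pattern the reviewer suggests can be sketched as follows; `log_safe_failure` is a hypothetical helper name, not code from the PR.

```python
import logging

logger = logging.getLogger(__name__)

def log_safe_failure(exc: Exception) -> str:
    """Log only safe metadata about a failure; str(exc) may embed URLs
    carrying temporary auth tokens, so only the exception type is recorded."""
    message = f"Operation failed with {type(exc).__name__}"
    logger.warning(message)
    return message
```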
<file name="backend/onyx/llm/models.py">
<violation number="1">
P3: The comment indicates strings are accepted, but the updated LanguageModelInput type no longer allows str, so the comment is now misleading. Update the type or the comment to match the actual accepted inputs.</violation>
</file>
<file name="backend/ee/onyx/secondary_llm_flows/query_expansion.py">
<violation number="1">
P2: Avoid logging the raw exception string here; it may include URLs or sensitive request details. Log only safe metadata like the exception type.
(Based on your team's feedback about not logging raw exception strings with URLs/tokens.) [FEEDBACK_USED]</violation>
</file>
<file name="backend/onyx/configs/app_configs.py">
<violation number="1">
P2: Avoid hardcoded defaults for new env configs so missing configuration is detectable. Defaulting to http/127.0.0.1 can silently route production traffic to a local address instead of forcing explicit configuration.
(Based on your team's feedback about avoiding truthy/hardcoded defaults for env configs.) [FEEDBACK_USED]</violation>
</file>
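A fail-fast alternative to a hardcoded default might look like the sketch below; `require_env` is a hypothetical helper, not part of `app_configs.py`.

```python
import os

def require_env(name: str) -> str:
    """Fail fast when a required env var is missing, instead of silently
    defaulting to something like http://127.0.0.1 in production."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Required environment variable {name} is not set")
    return value
```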
<file name="contributing_guides/contribution_process.md">
<violation number="1">
P3: The heading level jumps from "##" to "#" for steps 3 and "Implicit agreements", which breaks the section hierarchy and formatting. Use consistent "##" headings for these sections to keep the document structure aligned.</violation>
</file>
<file name="backend/onyx/image_gen/providers/vertex_img_gen.py">
<violation number="1">
P2: Guard against invalid JSON in vertex_credentials so validation fails gracefully with ImageProviderCredentialsError instead of raising JSONDecodeError.</violation>
</file>
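The suggested guard could take roughly this shape; the parsing function name is hypothetical, and `ImageProviderCredentialsError` here is a stand-in for the project's actual exception type.

```python
import json

class ImageProviderCredentialsError(Exception):
    """Stand-in for the project's credentials error type."""

def parse_vertex_credentials(raw: str) -> dict:
    """Parse vertex_credentials JSON, converting malformed input into the
    domain error instead of letting JSONDecodeError escape."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        raise ImageProviderCredentialsError(
            "vertex_credentials is not valid JSON"
        ) from e
```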
<file name="backend/onyx/llm/prompt_cache/providers/factory.py">
<violation number="1">
P3: Update the Args docstring to describe the new `llm_config` parameter instead of a provider string to avoid misleading documentation.</violation>
</file>
@greptile
we could probably use git lfs/another file store for our test files (Slack: https://onyx-company.slack.com/archives/C0771QKDBPE/p1768864722539549)
Just had a chat w/ Jamison. We decided the best course of action is to leave this here for now.
```json
    ]
  },
  {
    "category": "no_match",
```
not sure if we need such verbose "content" for the no_match case
I'd disagree here. A false positive here is probably worse than a false negative. Because we are fuzzy searching (which, although not truly a black box, we can treat as one that spits out a somewhat random number), we want to be sure that we're not going to get false positives.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py">
<violation number="1" location="backend/onyx/tools/tool_implementations/open_url/snippet_matcher.py:232">
P1: The indices returned (`res.src_start`, `res.src_end`) are positions in the **processed** string, not the original `content`. The old implementation mapped these back to original positions via `_get_word_positions()`, but this refactored code returns processed string indices directly. This will cause incorrect snippet extraction since callers expect indices into the original content string.</violation>
</file>
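The mapping the reviewer says went missing can be sketched roughly as follows. This is not the PR's `_get_word_positions()` (which worked per word); it is a hypothetical per-character variant just to show the idea of carrying normalized-to-original index mappings alongside the normalized string.

```python
def normalize_with_mapping(content: str):
    """Lowercase and collapse whitespace, recording for each normalized
    character its index in the original string."""
    norm_chars, mapping = [], []
    prev_space = True  # treat start-of-string as whitespace to skip leading runs
    for i, ch in enumerate(content):
        if ch.isspace():
            if not prev_space:
                norm_chars.append(" ")
                mapping.append(i)
            prev_space = True
        else:
            norm_chars.append(ch.lower())
            mapping.append(i)
            prev_space = False
    # Drop a trailing separator introduced by trailing whitespace
    if norm_chars and norm_chars[-1] == " ":
        norm_chars.pop()
        mapping.pop()
    return "".join(norm_chars), mapping

def to_original_span(mapping, norm_start, norm_end):
    """Map a [norm_start, norm_end) span in the normalized string back to a
    span in the original content (norm_end exclusive)."""
    return mapping[norm_start], mapping[norm_end - 1] + 1
```

With such a mapping in hand, `res.src_start`/`res.src_end` from the fuzzy matcher can be translated before being returned to callers that expect original-content indices.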
```python
score = res.score

if score >= (min_threshold * 100):
    start_idx = res.src_start
```
@greptile
Description
Currently the agent may decide to run a web_search for some query. This returns a list of URLs with snippets associated with those URLs. The LLM then determines which URLs it wants to read based on title + snippet, and we then read each website via the open_url tool and parse the first 15000 tokens of the website into the LLM's context window.
The snippet is the primary motivator for the LLM to choose a website to read, so we should base our reading around the snippet. This PR locates where the snippet is in the content of the website and loads the content such that the snippet is in the middle, with surrounding content above and below loaded in. This way we are more likely to load content the LLM will find relevant.
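The centering described above can be sketched as a simple character-budget window; `center_window` and its parameters are hypothetical names for illustration, not the PR's `_expand_range_centered()`.

```python
def center_window(content: str, start: int, end: int, max_chars: int = 15000):
    """Return a slice of content of at most max_chars characters with the
    matched snippet [start, end) roughly in the middle."""
    snippet_len = end - start
    budget = max(max_chars - snippet_len, 0)
    left = start - budget // 2
    right = end + (budget - budget // 2)
    # Clamp to the document and push unused budget to the other side
    if left < 0:
        right += -left
        left = 0
    if right > len(content):
        left -= right - len(content)
        right = len(content)
    left = max(left, 0)
    return content[left:right]
```

When the snippet sits near an edge of the page, the unused half of the budget spills over to the other side, so the full character allowance is still used.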
How Has This Been Tested?
Played around with prompting on localhost.
There is a wide array of unit tests (~3.8k lines of tests + data) to verify the output.
Summary by cubic
Center the open_url content around the search snippet so the LLM reads the most relevant parts of a page. This improves context quality and reduces wasted tokens.
Written for commit 35a04a1. Summary will update on new commits.