fix(web search): removing site: operator from exa query #7248

Merged
jessicasingh7 merged 6 commits into main from jessica/web-search-eng-3276
Jan 12, 2026

Conversation

@jessicasingh7
Contributor

@jessicasingh7 jessicasingh7 commented Jan 7, 2026

Description

ENG-3276

Before vs. After
image

How Has This Been Tested?

Additional Options

  • [Optional] Override Linear Check

@jessicasingh7 jessicasingh7 requested a review from a team as a code owner January 7, 2026 02:21
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 2 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/tools/tool_implementations/web_search/web_search_tool.py">

<violation number="1" location="backend/onyx/tools/tool_implementations/web_search/web_search_tool.py:148">
P1: Regex inconsistency: extraction allows space after `site:` but removal doesn't. If user writes `site: example.com`, the domain will be extracted but `site: example.com` won't be removed from the query.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment on lines +148 to +150
cleaned_query = re.sub(
    r"site:\S+\s*", "", query, flags=re.IGNORECASE
).strip()
Contributor

@cubic-dev-ai cubic-dev-ai bot Jan 7, 2026

P1: Regex inconsistency: extraction allows space after site: but removal doesn't. If user writes site: example.com, the domain will be extracted but site: example.com won't be removed from the query.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/onyx/tools/tool_implementations/web_search/web_search_tool.py, line 148:

<comment>Regex inconsistency: extraction allows space after `site:` but removal doesn't. If user writes `site: example.com`, the domain will be extracted but `site: example.com` won't be removed from the query.</comment>

<file context>
@@ -118,12 +122,62 @@ def emit_start(self, turn_index: int) -> None:
+            site_domains = re.findall(r"site:\s*([^\s]+)", query, re.IGNORECASE)
+
+            # Remove site: operator for Exa
+            cleaned_query = re.sub(
+                r"site:\S+\s*", "", query, flags=re.IGNORECASE
+            ).strip()
</file context>
Suggested change
- cleaned_query = re.sub(
-     r"site:\S+\s*", "", query, flags=re.IGNORECASE
- ).strip()
+ cleaned_query = re.sub(
+     r"site:\s*\S+\s*", "", query, flags=re.IGNORECASE
+ ).strip()

✅ Addressed in d520fd2
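
The mismatch flagged in this thread can be reproduced in isolation. The snippet below is a minimal sketch that reuses the three patterns quoted above; the sample query string is made up for illustration and does not come from the PR.

```python
import re

# Hypothetical sample query with a space after "site:".
query = "site: example.com python tutorials"

# Extraction pattern from the diff: tolerates whitespace after "site:".
domains = re.findall(r"site:\s*([^\s]+)", query, re.IGNORECASE)

# Original removal pattern: needs a non-space character right after "site:".
old_cleaned = re.sub(r"site:\S+\s*", "", query, flags=re.IGNORECASE).strip()

# Suggested removal pattern: also tolerates whitespace after "site:".
new_cleaned = re.sub(r"site:\s*\S+\s*", "", query, flags=re.IGNORECASE).strip()

print(domains)      # domain is extracted either way
print(old_cleaned)  # operator is still present in the query
print(new_cleaned)  # operator and domain are removed
```

With the original removal pattern the domain is extracted but the operator stays in the query text, which is the inconsistency the review describes.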

@greptile-apps
Contributor

greptile-apps bot commented Jan 7, 2026

Greptile Summary

Fixed Exa web search by converting `site:` operators to Exa's native `include_domains` parameter instead of passing them in the query string. Exa doesn't support the `site:` syntax, causing searches to fail. The fix extracts domains from `site:` operators, removes them from the query, and passes them via Exa's API parameter. Also fixed the regex pattern mismatch from the previous review, where extraction used `site:\s*([^\s]+)` but removal used `site:\S+\s*`; now both use `site:\s*\S+\s*` for consistency.

Key changes:

  • Added _transform_queries_for_provider() to extract domains and clean queries for Exa
  • Modified _execute_single_search() to accept include_domains parameter with fallback logic
  • Updated ExaClient.search() to support include_domains parameter
  • Improved error handling to return structured message when no results found instead of raising exception
  • Added SectionEnd emission for proper streaming completion signal
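
As a rough illustration of the first bullet, the sketch below extracts domains and cleans each query. It is not the PR's implementation: the function name `transform_queries`, the shared `SITE_PATTERN`, and the handling of a query that consists only of `site:` operators are all assumptions made for this example.

```python
import re

# Assumed single pattern used for both extraction and removal.
SITE_PATTERN = re.compile(r"site:\s*(\S+)\s*", re.IGNORECASE)


def transform_queries(queries: list[str]) -> tuple[list[str], dict[str, list[str]]]:
    """Return cleaned queries plus a map from cleaned query to its domains."""
    cleaned_queries: list[str] = []
    query_domains_map: dict[str, list[str]] = {}

    for query in queries:
        domains = SITE_PATTERN.findall(query)
        cleaned = SITE_PATTERN.sub("", query).strip()

        if not cleaned and domains:
            # Assumption for this sketch only: if nothing is left after
            # stripping the operators, search for the domain names themselves.
            cleaned = " ".join(domains)

        if domains:
            query_domains_map[cleaned] = domains
        cleaned_queries.append(cleaned)

    return cleaned_queries, query_domains_map


cleaned, domain_map = transform_queries(["site:example.com python", "plain query"])
print(cleaned)
print(domain_map)
```

Queries without a `site:` operator pass through unchanged and get no entry in the map, so the caller can treat a missing key as "no domain restriction".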

Confidence Score: 5/5

  • Safe to merge - addresses reported bug with clean implementation
  • The PR correctly fixes the Exa site: operator issue by using Exa's native API parameter, fixes the previously reported regex mismatch, includes proper fallback logic, and improves error handling. The changes are focused, well-structured, and follow good practices.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/onyx/tools/tool_implementations/web_search/web_search_tool.py | Fixed regex pattern mismatch for site: operator extraction/removal. Added Exa-specific query transformation with domain extraction and fallback search logic. Improved error handling for empty search results. |
| backend/onyx/tools/tool_implementations/web_search/clients/exa_client.py | Added include_domains parameter to search() method to support Exa's native domain filtering API. |

Sequence Diagram

sequenceDiagram
    participant User
    participant WebSearchTool
    participant ExaClient
    participant ExaAPI

    User->>WebSearchTool: run(queries=["site:example.com python"])
    WebSearchTool->>WebSearchTool: _transform_queries_for_provider()
    Note over WebSearchTool: Extract domains: ["example.com"]<br/>Clean query: "python"<br/>Map: {"python": ["example.com"]}
    WebSearchTool->>WebSearchTool: emit SearchToolQueriesDelta
    WebSearchTool->>WebSearchTool: _execute_single_search(query="python", include_domains=["example.com"])
    alt include_domains provided
        WebSearchTool->>ExaClient: search(query="python", include_domains=["example.com"])
        ExaClient->>ExaAPI: search_and_contents(query="python", include_domains=["example.com"])
        ExaAPI-->>ExaClient: results
        ExaClient-->>WebSearchTool: WebSearchResult[]
        alt results found
            WebSearchTool->>WebSearchTool: return results
        else no results
            WebSearchTool->>ExaClient: search(query="python", include_domains=None)
            Note over WebSearchTool: Fallback without domain restriction
            ExaClient->>ExaAPI: search_and_contents(query="python", include_domains=None)
            ExaAPI-->>ExaClient: results
            ExaClient-->>WebSearchTool: WebSearchResult[]
        end
    else no include_domains
        WebSearchTool->>ExaClient: search(query="python", include_domains=None)
        ExaClient->>ExaAPI: search_and_contents(query="python", include_domains=None)
        ExaAPI-->>ExaClient: results
        ExaClient-->>WebSearchTool: WebSearchResult[]
    end
    WebSearchTool->>WebSearchTool: emit SearchToolDocumentsDelta
    WebSearchTool->>WebSearchTool: emit SectionEnd
    WebSearchTool->>User: ToolResponse
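
The fallback branch in the diagram can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `search_with_fallback`, `SearchFn`, and `fake_search` are hypothetical names, and results are plain strings rather than `WebSearchResult` objects.

```python
from typing import Callable, List, Optional

# Hypothetical provider signature: (query, include_domains) -> results.
SearchFn = Callable[[str, Optional[List[str]]], List[str]]


def search_with_fallback(
    search: SearchFn, query: str, include_domains: Optional[List[str]] = None
) -> List[str]:
    """Try the domain-restricted search first, then retry unrestricted."""
    if include_domains:
        results = search(query, include_domains)
        if results:
            return results
    # Either no domains were given or the restricted search came back empty.
    return search(query, None)


calls: List[Optional[List[str]]] = []


def fake_search(query: str, include_domains: Optional[List[str]]) -> List[str]:
    """Stand-in provider that finds nothing when a domain filter is set."""
    calls.append(include_domains)
    return [] if include_domains else [f"result for {query}"]


results = search_with_fallback(fake_search, "python", ["example.com"])
print(results, calls)
```

The stub records each call, so the two-step behaviour (restricted first, unrestricted second) is visible in `calls`.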

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. backend/onyx/tools/tool_implementations/web_search/clients/exa_client.py, line 36 (link)

    style: ternary is redundant: include_domains already defaults to None, and empty list [] would also be falsy. This line effectively just converts [] to None.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
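
    The line under discussion is not quoted in this thread, so the following is only a guess at its shape: a ternary of the form `include_domains if include_domains else None`. Under that assumption, a small sketch shows what the ternary actually changes. The function names are hypothetical.

```python
def with_ternary(include_domains=None):
    # Assumed shape of the reviewed line.
    return include_domains if include_domains else None


def with_or(include_domains=None):
    # Shorter spelling with the same result for lists and None.
    return include_domains or None


for value in (None, [], ["example.com"]):
    print(value, with_ternary(value), with_or(value))
```

    The only input whose value changes is the empty list, which becomes `None`, matching the reviewer's description.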

2 files reviewed, 2 comments

@jessicasingh7 jessicasingh7 force-pushed the jessica/web-search-eng-3276 branch from a77afc6 to 8b1916b Compare January 7, 2026 17:48
@jessicasingh7
Contributor Author

@greptile

@greptile-apps
Contributor

greptile-apps bot commented Jan 7, 2026

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".


def _transform_queries_for_provider(
    self, queries: list[str]
) -> tuple[list[str], dict[str, list[str]]]:
Contributor

to me this type is a lil too complex, could you make it a BaseModel?

Contributor

Or maybe define type names ie. QueryDomainMap = dict[str, list[str]]. Might make it a bit more readable
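
Both suggestions in this thread could look roughly like the sketch below. It uses a standard-library dataclass so the example runs without extra dependencies; a pydantic `BaseModel` with the same two fields would be the closer match to the first comment. `TransformedQueries` is a hypothetical name, not one from the PR.

```python
from dataclasses import dataclass, field

# Named alias, as suggested in the second comment.
QueryDomainMap = dict[str, list[str]]


@dataclass
class TransformedQueries:
    """Hypothetical container replacing the bare tuple return type."""

    cleaned_queries: list[str]
    query_domains_map: QueryDomainMap = field(default_factory=dict)


transformed = TransformedQueries(
    cleaned_queries=["python"],
    query_domains_map={"python": ["example.com"]},
)
print(transformed.cleaned_queries, transformed.query_domains_map)
```

Callers then read `transformed.query_domains_map` by name instead of unpacking a positional tuple.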

cleaned_query = re.sub(
    r"site:\s*\S+\s*", "", query, flags=re.IGNORECASE
).strip()
if not cleaned_query and site_domains:
Contributor

would be good to have a comment here to explain why this happens/ why we do this


cleaned_queries.append(cleaned_query)

return cleaned_queries if cleaned_queries else queries, query_domains_map
Contributor

nit: cleaned_queries or queries, query_domains_map
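
A quick check that the two spellings behave the same, including how the trailing comma builds the tuple. The wrapper functions are hypothetical and exist only to compare the two return expressions.

```python
def with_ternary(cleaned_queries, queries, query_domains_map):
    # Parsed as ((cleaned_queries if cleaned_queries else queries), query_domains_map).
    return cleaned_queries if cleaned_queries else queries, query_domains_map


def with_or(cleaned_queries, queries, query_domains_map):
    # Parsed as ((cleaned_queries or queries), query_domains_map).
    return cleaned_queries or queries, query_domains_map


original = ["site:example.com python"]
domain_map = {"python": ["example.com"]}

print(with_ternary(["python"], original, domain_map))
print(with_or([], original, {}))
```

In both forms the conditional binds tighter than the comma, so the function always returns a two-element tuple.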

if not added_any:
    break

if not all_search_results:
Contributor

imo we probably should propagate this error up?

Contributor

^.^

Contributor Author

Seems like you address this in your PR @yuhongsun96 ?

    include_domains: list[str] | None = None,
) -> list[WebSearchResult]:
    """Execute a single search query and return results."""
    if include_domains:
Contributor

why would include domains not work?

"""
query_domains_map: dict[str, list[str]] = {}

if not isinstance(self._provider, ExaClient):
Contributor

If we're checking the instance, might be worth making an abstract method (default to nothing) and override in ExaClient. Could be a bit cleaner.

But also, that could just be more work so ceebs

Danelegend
Danelegend previously approved these changes Jan 9, 2026
Contributor

@Danelegend Danelegend left a comment

Got handed reviewing duty for this. Looks functionally good but couple style things that could change. If we wanna get it in quick tho, here's a tick

@Danelegend Danelegend dismissed their stale review January 9, 2026 04:02

yuhong handed me to wrong pr to review haha.

Contributor

@evan-onyx evan-onyx left a comment

mostly looks good, one more issue to address

@jessicasingh7 jessicasingh7 added this pull request to the merge queue Jan 12, 2026
Merged via the queue into main with commit cd36baa Jan 12, 2026
73 checks passed
@jessicasingh7 jessicasingh7 deleted the jessica/web-search-eng-3276 branch January 12, 2026 18:27
4 participants