
chore: NLTK and stopwords#7587

Merged
yuhongsun96 merged 2 commits into main from search-tuneup on Jan 20, 2026

Conversation

@yuhongsun96 yuhongsun96 commented Jan 20, 2026

Description

Remove NLTK and replace the stopword handling. The stopword list is the same as NLTK's anyway.

How Has This Been Tested?

Ran queries, works

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Removed NLTK by replacing stopword handling and n-gram generation with lightweight local utilities, simplifying deployment with no behavior change. Added num_hits to cap search results and wired it through the search pipeline.

  • New Features

    • Added num_hits to requests and enforced result truncation.
    • Propagated limit/offset to retrieval for consistent caps.
  • Refactors

    • Dropped NLTK usage and dependency (Dockerfile, setup, tests, requirements).
    • Introduced english_stopwords utilities and used them for Slack recency parsing and query keyword extraction.
    • Replaced nltk ngrams with a local _ngrams in entity normalization.
    • Removed NLTK download logic and related Trivy ignore entries.
    • Simplified search runner by removing the unused dedupe and NLTK helpers.

Written for commit 01b2e1f. Summary will update on new commits.
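
The reviews below point out that the english_stopwords module itself is missing from the diff. As a rough illustration only (not the PR's actual code), a minimal module of this shape would satisfy both imports; ENGLISH_STOPWORDS_SET and strip_stopwords are the names the PR imports, while the word list and the function's signature are assumptions, with the list abbreviated here:

```python
# Hypothetical sketch of backend/onyx/natural_language_processing/english_stopwords.py.
# The PR imports ENGLISH_STOPWORDS_SET and strip_stopwords from this module;
# the full NLTK-equivalent English stopword list is abbreviated below.
ENGLISH_STOPWORDS_SET: frozenset[str] = frozenset(
    {
        "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
        "in", "is", "it", "of", "on", "or", "the", "to", "was", "with",
        # ... remainder of the NLTK English stopword list ...
    }
)


def strip_stopwords(query: str) -> list[str]:
    """Return the query's whitespace tokens with English stopwords removed.

    Signature is a guess based on how the PR uses it to build query_keywords.
    """
    return [
        token
        for token in query.split()
        if token.lower() not in ENGLISH_STOPWORDS_SET
    ]
```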

@yuhongsun96 yuhongsun96 requested a review from a team as a code owner January 20, 2026 21:03

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 15 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/context/search/pipeline.py">

<violation number="1" location="backend/onyx/context/search/pipeline.py:22">
P1: Importing strip_stopwords from onyx.natural_language_processing.english_stopwords will raise ModuleNotFoundError because that module/file is missing in the repo. Ensure the module is added or update the import to the correct existing stopwords module.</violation>
</file>

<file name="backend/onyx/context/search/federated/slack_search_utils.py">

<violation number="1" location="backend/onyx/context/search/federated/slack_search_utils.py:18">
P1: The new import references `onyx.natural_language_processing.english_stopwords`, but that module is not present in the repository. This will raise `ModuleNotFoundError` at import time. Add the missing module or update the import to an existing stopword provider.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

from onyx.db.models import User
from onyx.document_index.interfaces import DocumentIndex
from onyx.llm.interfaces import LLM
from onyx.natural_language_processing.english_stopwords import strip_stopwords

@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P1: Importing strip_stopwords from onyx.natural_language_processing.english_stopwords will raise ModuleNotFoundError because that module/file is missing in the repo. Ensure the module is added or update the import to the correct existing stopwords module.

At backend/onyx/context/search/pipeline.py, line 22.

<file context>
@@ -19,6 +19,7 @@
 from onyx.db.models import User
 from onyx.document_index.interfaces import DocumentIndex
 from onyx.llm.interfaces import LLM
+from onyx.natural_language_processing.english_stopwords import strip_stopwords
 from onyx.secondary_llm_flows.source_filter import extract_source_filter
 from onyx.secondary_llm_flows.time_filter import extract_time_filter
</file context>

from onyx.llm.interfaces import LLM
from onyx.llm.models import UserMessage
from onyx.llm.utils import llm_response_to_string
from onyx.natural_language_processing.english_stopwords import ENGLISH_STOPWORDS_SET

@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P1: The new import references onyx.natural_language_processing.english_stopwords, but that module is not present in the repository. This will raise ModuleNotFoundError at import time. Add the missing module or update the import to an existing stopword provider.

At backend/onyx/context/search/federated/slack_search_utils.py, line 18.

<file context>
@@ -15,6 +15,7 @@
 from onyx.llm.interfaces import LLM
 from onyx.llm.models import UserMessage
 from onyx.llm.utils import llm_response_to_string
+from onyx.natural_language_processing.english_stopwords import ENGLISH_STOPWORDS_SET
 from onyx.onyxbot.slack.models import ChannelType
 from onyx.prompts.federated_search import SLACK_DATE_EXTRACTION_PROMPT
</file context>

greptile-apps bot commented Jan 20, 2026

Greptile Summary

This PR attempts to remove the NLTK dependency and replace stopword handling with a custom implementation, but the replacement module is missing from the PR, causing critical import errors.

Key Changes

  • Removed NLTK dependency from pyproject.toml, requirements/default.txt, and lockfiles
  • Removed NLTK data downloads from Dockerfile and test setup
  • Replaced NLTK's ngrams function with a custom _ngrams() implementation in backend/onyx/kg/clustering/normalizations.py
  • Moved query_keywords field from BasicChunkRequest to ChunkIndexRequest for better encapsulation
  • Added num_hits parameter (default 50) to SendSearchQueryRequest to control search result limits

Critical Issues

The PR imports from onyx.natural_language_processing.english_stopwords module that doesn't exist in the codebase. Two files import from this missing module:

  • backend/onyx/context/search/federated/slack_search_utils.py imports ENGLISH_STOPWORDS_SET
  • backend/onyx/context/search/pipeline.py imports strip_stopwords

This will cause immediate ImportError failures at runtime when any search operation is attempted.

Root Cause

The PR description states "Remove NLTK and replace the stopword handling. It's the same list as nltk anyway." This suggests the author intended to include a new backend/onyx/natural_language_processing/english_stopwords.py file containing the stopwords list and helper function, but this file was not committed to the PR.

Confidence Score: 0/5

  • This PR is NOT safe to merge - it will cause immediate runtime failures
  • Score is 0 because the PR imports from a non-existent module (onyx.natural_language_processing.english_stopwords), which will cause ImportError exceptions whenever search operations are performed. The missing module needs to be added before this PR can function.
  • backend/onyx/context/search/federated/slack_search_utils.py and backend/onyx/context/search/pipeline.py both import from the missing english_stopwords module and will fail at runtime

Important Files Changed

  • backend/onyx/context/search/federated/slack_search_utils.py — Imports from the non-existent english_stopwords module, causing a runtime error. Replaced NLTK stopwords with the missing module.
  • backend/onyx/context/search/pipeline.py — Imports strip_stopwords from the non-existent module, causing a runtime error. Populates query_keywords using the missing function.
  • backend/onyx/context/search/models.py — Moved query_keywords from BasicChunkRequest to ChunkIndexRequest. Clean refactoring that makes field usage more explicit.
  • backend/onyx/kg/clustering/normalizations.py — Replaced NLTK's ngrams with a custom implementation. Simple and correct n-gram generation function.
  • backend/ee/onyx/search/process_search_query.py — Added num_hits parameter to control the search result limit. Correctly passes the parameter through and truncates results.

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as query_and_chat API
    participant ProcessSearch as process_search_query
    participant Pipeline as search_pipeline
    participant StopWords as english_stopwords (MISSING)
    participant Index as DocumentIndex

    Client->>API: SendSearchQueryRequest(query, num_hits=50)
    API->>ProcessSearch: stream_search_query(request)
    
    alt Single Query
        ProcessSearch->>ProcessSearch: _run_single_search(query, num_hits)
        ProcessSearch->>Pipeline: search_pipeline(ChunkSearchRequest)
        Pipeline->>StopWords: strip_stopwords(query)
        Note over StopWords: ❌ MODULE MISSING<br/>ImportError at runtime
        StopWords-->>Pipeline: query_keywords
        Pipeline->>Index: search_chunks(query_keywords)
        Index-->>Pipeline: chunks
        Pipeline-->>ProcessSearch: chunks
    else Multiple Queries
        ProcessSearch->>ProcessSearch: Run searches in parallel
        loop For each query
            ProcessSearch->>Pipeline: search_pipeline(query, num_hits)
            Pipeline->>StopWords: strip_stopwords(query)
            Note over StopWords: ❌ MODULE MISSING<br/>ImportError at runtime
            StopWords-->>Pipeline: query_keywords
            Pipeline->>Index: search_chunks(query_keywords)
            Index-->>Pipeline: chunks
        end
        ProcessSearch->>ProcessSearch: Merge with RRF
    end
    
    ProcessSearch->>ProcessSearch: merge_individual_chunks()
    ProcessSearch->>ProcessSearch: Truncate to num_hits
    ProcessSearch-->>API: sections
    API-->>Client: Search results
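
The diagram's "Merge with RRF" step refers to reciprocal rank fusion. The thread does not show that code, but the standard formula, score(d) = sum over lists of 1 / (k + rank of d), can be sketched as follows; k = 60 is the conventional constant from the RRF literature, not necessarily the PR's value, and the function name is hypothetical:

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each document scores sum(1 / (k + rank))
    across the ranked lists it appears in; higher totals rank first.
    Illustrative sketch, not the PR's implementation."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

A document that appears near the top of several query result lists outranks one that appears high in only a single list, which is why RRF suits the multi-query branch of the pipeline.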


@greptile-apps greptile-apps bot left a comment

14 files reviewed, 3 comments


from onyx.llm.interfaces import LLM
from onyx.llm.models import UserMessage
from onyx.llm.utils import llm_response_to_string
from onyx.natural_language_processing.english_stopwords import ENGLISH_STOPWORDS_SET

syntax: The module onyx.natural_language_processing.english_stopwords doesn't exist in the codebase. This import will cause a runtime error.

Check that the english_stopwords.py file with ENGLISH_STOPWORDS_SET definition was included in this PR.

At backend/onyx/context/search/federated/slack_search_utils.py, line 18.

from onyx.db.models import User
from onyx.document_index.interfaces import DocumentIndex
from onyx.llm.interfaces import LLM
from onyx.natural_language_processing.english_stopwords import strip_stopwords

syntax: The module onyx.natural_language_processing.english_stopwords doesn't exist in the codebase. This import will cause a runtime error.

Check that the english_stopwords.py file with strip_stopwords function was included in this PR.

At backend/onyx/context/search/pipeline.py, line 22.

sections = merge_individual_chunks(chunks)

# Truncate to the requested number of hits
sections = sections[: request.num_hits]

logic: If request.num_hits is None (the model default), the slice sections[: None] does not fail; it simply returns all sections. Consider whether returning everything is the intended behavior.

Suggested change:
- sections = sections[: request.num_hits]
+ sections = sections[: request.num_hits] if request.num_hits else sections
At backend/ee/onyx/search/process_search_query.py, line 182.
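
The slice semantics behind this comment can be checked directly: a None upper bound never raises, it just skips truncation. Note that the suggested falsy guard also changes the num_hits == 0 case from "return nothing" to "no limit", which may or may not be desired (sections and num_hits below are stand-in values, not the PR's data):

```python
# Demonstrates the Python slice semantics relevant to the review comment:
# a None upper bound returns the whole list, so sections[: None] cannot raise.
sections = ["s1", "s2", "s3"]

assert sections[:None] == ["s1", "s2", "s3"]  # None bound: no truncation
assert sections[:2] == ["s1", "s2"]           # normal cap
assert sections[:0] == []                     # num_hits == 0 truncates to empty

# The suggested falsy guard treats num_hits == 0 as "no limit" instead:
num_hits = 0
guarded = sections[:num_hits] if num_hits else sections
assert guarded == ["s1", "s2", "s3"]
```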

@yuhongsun96 yuhongsun96 merged commit 83a543a into main Jan 20, 2026
77 checks passed
@yuhongsun96 yuhongsun96 deleted the search-tuneup branch January 20, 2026 21:36
rohoswagger pushed a commit that referenced this pull request Jan 20, 2026
jessicasingh7 pushed a commit that referenced this pull request Jan 21, 2026