
chore: NLTK and stopwords#7587

Merged
yuhongsun96 merged 2 commits into main from search-tuneup on Jan 20, 2026

Conversation

@yuhongsun96 yuhongsun96 commented Jan 20, 2026

Description

Remove NLTK and replace the stopword handling. The stopword list is the same as NLTK's anyway.

How Has This Been Tested?

Ran queries, works

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Removed NLTK by replacing stopword handling and n-gram generation with lightweight local utilities, simplifying deployment with no behavior change. Added num_hits to cap search results and wired it through the search pipeline.

  • New Features

    • Added num_hits to requests and enforced result truncation.
    • Propagated limit/offset to retrieval for consistent caps.
  • Refactors

    • Dropped NLTK usage and dependency (Dockerfile, setup, tests, requirements).
    • Introduced english_stopwords utilities and used them for Slack recency parsing and query keyword extraction.
    • Replaced nltk ngrams with a local _ngrams in entity normalization.
    • Removed NLTK download logic and related Trivy ignore entries.
    • Simplified search runner by removing the unused dedupe and NLTK helpers.

Written for commit 01b2e1f. Summary will update on new commits.
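
The reviews below point out that the english_stopwords module itself is missing from the diff. As a rough illustration only (not the PR's actual code), a minimal module of this shape would satisfy both imports; ENGLISH_STOPWORDS_SET and strip_stopwords are the names the PR imports, while the word list and the function's signature are assumptions, with the list abbreviated here:

```python
# Hypothetical sketch of backend/onyx/natural_language_processing/english_stopwords.py.
# The PR imports ENGLISH_STOPWORDS_SET and strip_stopwords from this module;
# the full NLTK-equivalent English stopword list is abbreviated below.
ENGLISH_STOPWORDS_SET: frozenset[str] = frozenset(
    {
        "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
        "in", "is", "it", "of", "on", "or", "the", "to", "was", "with",
        # ... remainder of the NLTK English stopword list ...
    }
)


def strip_stopwords(query: str) -> list[str]:
    """Return the query's whitespace tokens with English stopwords removed.

    Signature is a guess based on how the PR uses it to build query_keywords.
    """
    return [
        token
        for token in query.split()
        if token.lower() not in ENGLISH_STOPWORDS_SET
    ]
```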

@yuhongsun96 yuhongsun96 requested a review from a team as a code owner January 20, 2026 21:03

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 15 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/context/search/pipeline.py">

<violation number="1" location="backend/onyx/context/search/pipeline.py:22">
P1: Importing strip_stopwords from onyx.natural_language_processing.english_stopwords will raise ModuleNotFoundError because that module/file is missing in the repo. Ensure the module is added or update the import to the correct existing stopwords module.</violation>
</file>

<file name="backend/onyx/context/search/federated/slack_search_utils.py">

<violation number="1" location="backend/onyx/context/search/federated/slack_search_utils.py:18">
P1: The new import references `onyx.natural_language_processing.english_stopwords`, but that module is not present in the repository. This will raise `ModuleNotFoundError` at import time. Add the missing module or update the import to an existing stopword provider.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

from onyx.db.models import User
from onyx.document_index.interfaces import DocumentIndex
from onyx.llm.interfaces import LLM
from onyx.natural_language_processing.english_stopwords import strip_stopwords

@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P1: Importing strip_stopwords from onyx.natural_language_processing.english_stopwords will raise ModuleNotFoundError because that module/file is missing in the repo. Ensure the module is added or update the import to the correct existing stopwords module.

At backend/onyx/context/search/pipeline.py, line 22.

<file context>
@@ -19,6 +19,7 @@
 from onyx.db.models import User
 from onyx.document_index.interfaces import DocumentIndex
 from onyx.llm.interfaces import LLM
+from onyx.natural_language_processing.english_stopwords import strip_stopwords
 from onyx.secondary_llm_flows.source_filter import extract_source_filter
 from onyx.secondary_llm_flows.time_filter import extract_time_filter
</file context>

from onyx.llm.interfaces import LLM
from onyx.llm.models import UserMessage
from onyx.llm.utils import llm_response_to_string
from onyx.natural_language_processing.english_stopwords import ENGLISH_STOPWORDS_SET

@cubic-dev-ai cubic-dev-ai bot Jan 20, 2026

P1: The new import references onyx.natural_language_processing.english_stopwords, but that module is not present in the repository. This will raise ModuleNotFoundError at import time. Add the missing module or update the import to an existing stopword provider.

At backend/onyx/context/search/federated/slack_search_utils.py, line 18.

<file context>
@@ -15,6 +15,7 @@
 from onyx.llm.interfaces import LLM
 from onyx.llm.models import UserMessage
 from onyx.llm.utils import llm_response_to_string
+from onyx.natural_language_processing.english_stopwords import ENGLISH_STOPWORDS_SET
 from onyx.onyxbot.slack.models import ChannelType
 from onyx.prompts.federated_search import SLACK_DATE_EXTRACTION_PROMPT
</file context>

greptile-apps bot commented Jan 20, 2026

Greptile Summary

This PR attempts to remove the NLTK dependency and replace stopword handling with a custom implementation, but the replacement module is missing from the PR, causing critical import errors.

Key Changes

  • Removed NLTK dependency from pyproject.toml, requirements/default.txt, and lockfiles
  • Removed NLTK data downloads from Dockerfile and test setup
  • Replaced NLTK's ngrams function with a custom _ngrams() implementation in backend/onyx/kg/clustering/normalizations.py
  • Moved query_keywords field from BasicChunkRequest to ChunkIndexRequest for better encapsulation
  • Added num_hits parameter (default 50) to SendSearchQueryRequest to control search result limits

Critical Issues

The PR imports from onyx.natural_language_processing.english_stopwords module that doesn't exist in the codebase. Two files import from this missing module:

  • backend/onyx/context/search/federated/slack_search_utils.py imports ENGLISH_STOPWORDS_SET
  • backend/onyx/context/search/pipeline.py imports strip_stopwords

This will cause immediate ImportError failures at runtime when any search operation is attempted.

Root Cause

The PR description states "Remove NLTK and replace the stopword handling. It's the same list as nltk anyway." This suggests the author intended to include a new backend/onyx/natural_language_processing/english_stopwords.py file containing the stopwords list and helper function, but this file was not committed to the PR.

Confidence Score: 0/5

  • This PR is NOT safe to merge - it will cause immediate runtime failures
  • Score is 0 because the PR imports from a non-existent module (onyx.natural_language_processing.english_stopwords), which will cause ImportError exceptions whenever search operations are performed. The missing module needs to be added before this PR can function.
  • backend/onyx/context/search/federated/slack_search_utils.py and backend/onyx/context/search/pipeline.py both import from the missing english_stopwords module and will fail at runtime

Important Files Changed

  • backend/onyx/context/search/federated/slack_search_utils.py — Imports from the non-existent english_stopwords module, causing a runtime error. Replaced NLTK stopwords with the missing module.
  • backend/onyx/context/search/pipeline.py — Imports strip_stopwords from the non-existent module, causing a runtime error. Populates query_keywords using the missing function.
  • backend/onyx/context/search/models.py — Moved query_keywords from BasicChunkRequest to ChunkIndexRequest. Clean refactoring that makes field usage more explicit.
  • backend/onyx/kg/clustering/normalizations.py — Replaced NLTK's ngrams with a custom implementation. Simple and correct n-gram generation function.
  • backend/ee/onyx/search/process_search_query.py — Added num_hits parameter to control the search result limit. Correctly passes the parameter through and truncates results.

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as query_and_chat API
    participant ProcessSearch as process_search_query
    participant Pipeline as search_pipeline
    participant StopWords as english_stopwords (MISSING)
    participant Index as DocumentIndex

    Client->>API: SendSearchQueryRequest(query, num_hits=50)
    API->>ProcessSearch: stream_search_query(request)
    
    alt Single Query
        ProcessSearch->>ProcessSearch: _run_single_search(query, num_hits)
        ProcessSearch->>Pipeline: search_pipeline(ChunkSearchRequest)
        Pipeline->>StopWords: strip_stopwords(query)
        Note over StopWords: ❌ MODULE MISSING<br/>ImportError at runtime
        StopWords-->>Pipeline: query_keywords
        Pipeline->>Index: search_chunks(query_keywords)
        Index-->>Pipeline: chunks
        Pipeline-->>ProcessSearch: chunks
    else Multiple Queries
        ProcessSearch->>ProcessSearch: Run searches in parallel
        loop For each query
            ProcessSearch->>Pipeline: search_pipeline(query, num_hits)
            Pipeline->>StopWords: strip_stopwords(query)
            Note over StopWords: ❌ MODULE MISSING<br/>ImportError at runtime
            StopWords-->>Pipeline: query_keywords
            Pipeline->>Index: search_chunks(query_keywords)
            Index-->>Pipeline: chunks
        end
        ProcessSearch->>ProcessSearch: Merge with RRF
    end
    
    ProcessSearch->>ProcessSearch: merge_individual_chunks()
    ProcessSearch->>ProcessSearch: Truncate to num_hits
    ProcessSearch-->>API: sections
    API-->>Client: Search results
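
The diagram's "Merge with RRF" step refers to reciprocal rank fusion. The thread does not show that code, but the standard formula, score(d) = sum over lists of 1 / (k + rank of d), can be sketched as follows; k = 60 is the conventional constant from the RRF literature, not necessarily the PR's value, and the function name is hypothetical:

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each document scores sum(1 / (k + rank))
    across the ranked lists it appears in; higher totals rank first.
    Illustrative sketch, not the PR's implementation."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

A document that appears near the top of several query result lists outranks one that appears high in only a single list, which is why RRF suits the multi-query branch of the pipeline.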


@greptile-apps greptile-apps bot left a comment

14 files reviewed, 3 comments


from onyx.llm.interfaces import LLM
from onyx.llm.models import UserMessage
from onyx.llm.utils import llm_response_to_string
from onyx.natural_language_processing.english_stopwords import ENGLISH_STOPWORDS_SET

syntax: The module onyx.natural_language_processing.english_stopwords doesn't exist in the codebase. This import will cause a runtime error.

Check that the english_stopwords.py file with ENGLISH_STOPWORDS_SET definition was included in this PR.

At backend/onyx/context/search/federated/slack_search_utils.py, line 18.

from onyx.db.models import User
from onyx.document_index.interfaces import DocumentIndex
from onyx.llm.interfaces import LLM
from onyx.natural_language_processing.english_stopwords import strip_stopwords

syntax: The module onyx.natural_language_processing.english_stopwords doesn't exist in the codebase. This import will cause a runtime error.

Check that the english_stopwords.py file with strip_stopwords function was included in this PR.

At backend/onyx/context/search/pipeline.py, line 22.

sections = merge_individual_chunks(chunks)

# Truncate to the requested number of hits
sections = sections[: request.num_hits]

logic: If request.num_hits is None (the model default), the slice sections[: None] does not fail; it simply returns all sections. Consider whether returning everything is the intended behavior.

Suggested change:
- sections = sections[: request.num_hits]
+ sections = sections[: request.num_hits] if request.num_hits else sections
At backend/ee/onyx/search/process_search_query.py, line 182.
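
The slice semantics behind this comment can be checked directly: a None upper bound never raises, it just skips truncation. Note that the suggested falsy guard also changes the num_hits == 0 case from "return nothing" to "no limit", which may or may not be desired (sections and num_hits below are stand-in values, not the PR's data):

```python
# Demonstrates the Python slice semantics relevant to the review comment:
# a None upper bound returns the whole list, so sections[: None] cannot raise.
sections = ["s1", "s2", "s3"]

assert sections[:None] == ["s1", "s2", "s3"]  # None bound: no truncation
assert sections[:2] == ["s1", "s2"]           # normal cap
assert sections[:0] == []                     # num_hits == 0 truncates to empty

# The suggested falsy guard treats num_hits == 0 as "no limit" instead:
num_hits = 0
guarded = sections[:num_hits] if num_hits else sections
assert guarded == ["s1", "s2", "s3"]
```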

@yuhongsun96 yuhongsun96 merged commit 83a543a into main Jan 20, 2026
77 checks passed
@yuhongsun96 yuhongsun96 deleted the search-tuneup branch January 20, 2026 21:36
rohoswagger pushed a commit that referenced this pull request Jan 20, 2026
jessicasingh7 pushed a commit that referenced this pull request Jan 21, 2026