perf: switch BeautifulSoup parser from html.parser to lxml for web crawler by nmgarza5 · Pull Request #7350 · onyx-dot-app/onyx

nmgarza5 · 2026-01-11T20:24:16Z

Description

Switch BeautifulSoup's HTML parser from html.parser to lxml.

Why:

5-10x faster parsing
More tolerant of malformed HTML fragments
Lower memory usage
lxml is already a dependency (lxml==5.3.0)

Changes:

parse_html_page_basic(): use lxml parser
web_html_cleanup(): use lxml parser

How Has This Been Tested?

Verified lxml parses HTML correctly via manual testing (running a web search)
Existing tests pass

Additional Options

[Optional] Override Linear Check

lxml is 5-10x faster and more tolerant of malformed HTML.

cubic-dev-ai

No issues found across 1 file

greptile-apps · 2026-01-11T20:33:22Z

Greptile Overview

Greptile Summary

This PR switches BeautifulSoup's HTML parser from html.parser to lxml in two functions: parse_html_page_basic() and web_html_cleanup(). The change is motivated by performance improvements (5-10x faster), better tolerance for malformed HTML, and lower memory usage.

What Changed:

parse_html_page_basic(): now uses lxml parser instead of html.parser
web_html_cleanup(): now uses lxml parser instead of html.parser

Impact:
These functions are widely used across the codebase for HTML processing in multiple connectors (web, zendesk, teams, guru, freshdesk, discourse, document360, bookstack, drupal_wiki, loopio, and more).

Key Concern:
While the performance benefits are desirable, this creates an inconsistency in the codebase. There are at least 10+ other BeautifulSoup instantiations across various connectors that still use html.parser:

xenforo/connector.py (2 instances)
web/connector.py (2 instances)
testrail/connector.py (1 instance)
productboard/connector.py (1 instance)
highspot/utils.py (1 instance)
google_site/connector.py (1 instance)
confluence/onyx_confluence.py (1 instance)
imap/connector.py (1 instance)

Some of these (testrail, confluence) even pass their html.parser-based soup objects to format_document_soup(), which is also used by the changed functions, potentially creating subtle behavioral differences.

Parser Differences:
lxml and html.parser can behave differently:

lxml automatically adds <html> and <body> wrapper tags when parsing HTML fragments
Different handling of malformed HTML
Potential whitespace normalization differences

The PR description states that tests pass, but there's only one test file (test_html_utils.py) with a simple table parsing test, which may not catch all edge cases or behavioral differences.

Confidence Score: 4/5

This PR is safe to merge with low risk - the changes are straightforward and performance-focused
Score of 4/5 reflects that the code changes are minimal, well-intentioned (performance improvement), and the dependency is already present. However, the score is not 5/5 due to: (1) inconsistency created with other BeautifulSoup usages across the codebase that still use html.parser, and (2) potential subtle behavioral differences between lxml and html.parser that may not be fully covered by the existing test suite (only one simple test exists). The PR is safe to merge but would benefit from either updating all BeautifulSoup instances for consistency or documenting the rationale for partial adoption.
No files require special attention - the single changed file is straightforward

Important Files Changed

File Analysis

Filename	Score	Overview
backend/onyx/file_processing/html_utils.py	4/5	Changed BeautifulSoup parser from html.parser to lxml in parse_html_page_basic() and web_html_cleanup(). Performance improvement but creates inconsistency with other BeautifulSoup usages in the codebase.

Sequence Diagram

sequenceDiagram
    participant Connector as Various Connectors
    participant ParseBasic as parse_html_page_basic()
    participant WebCleanup as web_html_cleanup()
    participant FormatSoup as format_document_soup()
    participant BS4 as BeautifulSoup
    
    Note over Connector,BS4: HTML Processing Flow (Post-Change)
    
    Connector->>ParseBasic: HTML string/BytesIO
    ParseBasic->>BS4: BeautifulSoup(text, "lxml")
    BS4-->>ParseBasic: soup object
    ParseBasic->>FormatSoup: format_document_soup(soup)
    FormatSoup-->>ParseBasic: formatted text
    ParseBasic-->>Connector: parsed text
    
    Connector->>WebCleanup: HTML string
    WebCleanup->>BS4: BeautifulSoup(page_content, "lxml")
    BS4-->>WebCleanup: soup object
    WebCleanup->>WebCleanup: extract title, cleanup
    WebCleanup->>FormatSoup: format_document_soup(soup)
    FormatSoup-->>WebCleanup: formatted text
    WebCleanup-->>Connector: ParsedHTML
    
    Note over Connector,BS4: Other Connectors Still Use html.parser
    
    Connector->>BS4: BeautifulSoup(content, "html.parser")
    BS4-->>Connector: soup object (different behavior)
    Connector->>FormatSoup: format_document_soup(soup)
    FormatSoup-->>Connector: formatted text

greptile-apps

_{1 file reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

backend/onyx/file_processing/html_utils.py

…awler (#7350)

perf: switch BeautifulSoup parser from html.parser to lxml

d421734

lxml is 5-10x faster and more tolerant of malformed HTML.

nmgarza5 requested a review from a team as a code owner January 11, 2026 20:24

cubic-dev-ai bot reviewed Jan 11, 2026

View reviewed changes

greptile-apps bot reviewed Jan 11, 2026

View reviewed changes

backend/onyx/file_processing/html_utils.py Show resolved Hide resolved

nmgarza5 changed the title ~~perf: switch BeautifulSoup parser from html.parser to lxml~~ perf: switch BeautifulSoup parser from html.parser to lxml for web crawler Jan 11, 2026

nmgarza5 requested review from evan-onyx and jessicasingh7 January 11, 2026 20:37

evan-onyx approved these changes Jan 11, 2026

View reviewed changes

nmgarza5 added this pull request to the merge queue Jan 11, 2026

Merged via the queue into main with commit cb2951a Jan 11, 2026
81 of 84 checks passed

nmgarza5 deleted the nikg/lxml-parser branch January 11, 2026 20:50

jessicasingh7 pushed a commit that referenced this pull request Jan 12, 2026

perf: switch BeautifulSoup parser from html.parser to lxml for web cr…

0e10e3c

…awler (#7350)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: switch BeautifulSoup parser from html.parser to lxml for web crawler#7350

perf: switch BeautifulSoup parser from html.parser to lxml for web crawler#7350
nmgarza5 merged 1 commit intomainfrom
nikg/lxml-parser

nmgarza5 commented Jan 11, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

greptile-apps bot commented Jan 11, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nmgarza5 commented Jan 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

How Has This Been Tested?

Additional Options

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Jan 11, 2026

Greptile Overview

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nmgarza5 commented Jan 11, 2026 •

edited

Loading