Skip to content

perf: switch BeautifulSoup parser from html.parser to lxml for web crawler#7350

Merged
nmgarza5 merged 1 commit intomainfrom
nikg/lxml-parser
Jan 11, 2026
Merged

perf: switch BeautifulSoup parser from html.parser to lxml for web crawler#7350
nmgarza5 merged 1 commit intomainfrom
nikg/lxml-parser

Conversation

@nmgarza5
Copy link
Copy Markdown
Contributor

@nmgarza5 nmgarza5 commented Jan 11, 2026

Description

Switch BeautifulSoup's HTML parser from html.parser to lxml.

Why:

  • 5-10x faster parsing
  • More tolerant of malformed HTML fragments
  • Lower memory usage
  • lxml is already a dependency (lxml==5.3.0)

Changes:

  • parse_html_page_basic(): use lxml parser
  • web_html_cleanup(): use lxml parser

How Has This Been Tested?

  • Verified lxml parses HTML correctly via manual testing (running a web search)
  • Existing tests pass

Additional Options

  • [Optional] Override Linear Check

lxml is 5-10x faster and more tolerant of malformed HTML.
@nmgarza5 nmgarza5 requested a review from a team as a code owner January 11, 2026 20:24
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 1 file

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Jan 11, 2026

Greptile Overview

Greptile Summary

This PR switches BeautifulSoup's HTML parser from html.parser to lxml in two functions: parse_html_page_basic() and web_html_cleanup(). The change is motivated by performance improvements (5-10x faster), better tolerance for malformed HTML, and lower memory usage.

What Changed:

  • parse_html_page_basic(): now uses lxml parser instead of html.parser
  • web_html_cleanup(): now uses lxml parser instead of html.parser

Impact:
These functions are widely used across the codebase for HTML processing in multiple connectors (web, zendesk, teams, guru, freshdesk, discourse, document360, bookstack, drupal_wiki, loopio, and more).

Key Concern:
While the performance benefits are desirable, this creates an inconsistency in the codebase. There are at least 10+ other BeautifulSoup instantiations across various connectors that still use html.parser:

  • xenforo/connector.py (2 instances)
  • web/connector.py (2 instances)
  • testrail/connector.py (1 instance)
  • productboard/connector.py (1 instance)
  • highspot/utils.py (1 instance)
  • google_site/connector.py (1 instance)
  • confluence/onyx_confluence.py (1 instance)
  • imap/connector.py (1 instance)

Some of these (testrail, confluence) even pass their html.parser-based soup objects to format_document_soup(), which is also used by the changed functions, potentially creating subtle behavioral differences.

Parser Differences:
lxml and html.parser can behave differently:

  • lxml automatically adds <html> and <body> wrapper tags when parsing HTML fragments
  • Different handling of malformed HTML
  • Potential whitespace normalization differences

The PR description states that tests pass, but there's only one test file (test_html_utils.py) with a simple table parsing test, which may not catch all edge cases or behavioral differences.

Confidence Score: 4/5

  • This PR is safe to merge with low risk - the changes are straightforward and performance-focused
  • Score of 4/5 reflects that the code changes are minimal, well-intentioned (performance improvement), and the dependency is already present. However, the score is not 5/5 due to: (1) inconsistency created with other BeautifulSoup usages across the codebase that still use html.parser, and (2) potential subtle behavioral differences between lxml and html.parser that may not be fully covered by the existing test suite (only one simple test exists). The PR is safe to merge but would benefit from either updating all BeautifulSoup instances for consistency or documenting the rationale for partial adoption.
  • No files require special attention - the single changed file is straightforward

Important Files Changed

File Analysis

Filename Score Overview
backend/onyx/file_processing/html_utils.py 4/5 Changed BeautifulSoup parser from html.parser to lxml in parse_html_page_basic() and web_html_cleanup(). Performance improvement but creates inconsistency with other BeautifulSoup usages in the codebase.

Sequence Diagram

sequenceDiagram
    participant Connector as Various Connectors
    participant ParseBasic as parse_html_page_basic()
    participant WebCleanup as web_html_cleanup()
    participant FormatSoup as format_document_soup()
    participant BS4 as BeautifulSoup
    
    Note over Connector,BS4: HTML Processing Flow (Post-Change)
    
    Connector->>ParseBasic: HTML string/BytesIO
    ParseBasic->>BS4: BeautifulSoup(text, "lxml")
    BS4-->>ParseBasic: soup object
    ParseBasic->>FormatSoup: format_document_soup(soup)
    FormatSoup-->>ParseBasic: formatted text
    ParseBasic-->>Connector: parsed text
    
    Connector->>WebCleanup: HTML string
    WebCleanup->>BS4: BeautifulSoup(page_content, "lxml")
    BS4-->>WebCleanup: soup object
    WebCleanup->>WebCleanup: extract title, cleanup
    WebCleanup->>FormatSoup: format_document_soup(soup)
    FormatSoup-->>WebCleanup: formatted text
    WebCleanup-->>Connector: ParsedHTML
    
    Note over Connector,BS4: Other Connectors Still Use html.parser
    
    Connector->>BS4: BeautifulSoup(content, "html.parser")
    BS4-->>Connector: soup object (different behavior)
    Connector->>FormatSoup: format_document_soup(soup)
    FormatSoup-->>Connector: formatted text
Loading

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@nmgarza5 nmgarza5 changed the title perf: switch BeautifulSoup parser from html.parser to lxml perf: switch BeautifulSoup parser from html.parser to lxml for web crawler Jan 11, 2026
@nmgarza5 nmgarza5 added this pull request to the merge queue Jan 11, 2026
Merged via the queue into main with commit cb2951a Jan 11, 2026
81 of 84 checks passed
@nmgarza5 nmgarza5 deleted the nikg/lxml-parser branch January 11, 2026 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants