perf: switch BeautifulSoup parser from html.parser to lxml for web crawler #7350
lxml is 5-10x faster and more tolerant of malformed HTML.
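The speed figure is easy to sanity-check locally. Below is a minimal timing sketch, assuming `bs4` is installed and `lxml` is optionally available. `SAMPLE_HTML` and `time_parser` are made up for illustration and are not part of the PR; actual ratios depend on the input and the machine.

```python
import timeit
from typing import Optional

from bs4 import BeautifulSoup, FeatureNotFound

# Synthetic document (an assumption for illustration, not real crawler input).
SAMPLE_HTML = (
    "<html><body>"
    + "<div><p>row <a href='#'>link</a></p></div>" * 500
    + "</body></html>"
)


def time_parser(parser: str, repeats: int = 3) -> Optional[float]:
    """Return the best wall-clock time in seconds for one parse, or None if
    BeautifulSoup cannot find the requested parser."""
    try:
        BeautifulSoup("<p></p>", parser)  # probe that the parser is available
    except FeatureNotFound:
        return None
    runs = timeit.repeat(
        lambda: BeautifulSoup(SAMPLE_HTML, parser), number=1, repeat=repeats
    )
    return min(runs)


for name in ("html.parser", "lxml"):
    elapsed = time_parser(name)
    if elapsed is None:
        print(f"{name}: not installed")
    else:
        print(f"{name}: {elapsed * 1000:.1f} ms")
```

Taking the minimum over several runs reduces noise from other processes; a real comparison should use pages representative of what the crawler actually fetches.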
Greptile Overview
Greptile Summary
This PR switches BeautifulSoup's HTML parser from html.parser to lxml.
What Changed:
Impact:
Key Concern:
Some of these (testrail, confluence) even pass their
Parser Differences:
The PR description states that tests pass, but there's only one test file (
Confidence Score: 4/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant Connector as Various Connectors
participant ParseBasic as parse_html_page_basic()
participant WebCleanup as web_html_cleanup()
participant FormatSoup as format_document_soup()
participant BS4 as BeautifulSoup
Note over Connector,BS4: HTML Processing Flow (Post-Change)
Connector->>ParseBasic: HTML string/BytesIO
ParseBasic->>BS4: BeautifulSoup(text, "lxml")
BS4-->>ParseBasic: soup object
ParseBasic->>FormatSoup: format_document_soup(soup)
FormatSoup-->>ParseBasic: formatted text
ParseBasic-->>Connector: parsed text
Connector->>WebCleanup: HTML string
WebCleanup->>BS4: BeautifulSoup(page_content, "lxml")
BS4-->>WebCleanup: soup object
WebCleanup->>WebCleanup: extract title, cleanup
WebCleanup->>FormatSoup: format_document_soup(soup)
FormatSoup-->>WebCleanup: formatted text
WebCleanup-->>Connector: ParsedHTML
Note over Connector,BS4: Other Connectors Still Use html.parser
Connector->>BS4: BeautifulSoup(content, "html.parser")
BS4-->>Connector: soup object (different behavior)
Connector->>FormatSoup: format_document_soup(soup)
FormatSoup-->>Connector: formatted text
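The "different behavior" noted in the diagram can be seen on an HTML fragment: `html.parser` keeps the fragment as written, while `lxml` wraps it in a full `<html><body>` document. The sketch below assumes `bs4` is installed and skips the `lxml` case if that parser is missing; `FRAGMENT` and `top_level_tags` are illustrative names, not code from the PR.

```python
from typing import List

from bs4 import BeautifulSoup, FeatureNotFound, Tag

# A fragment with an unclosed <b>, typical of loosely written HTML.
FRAGMENT = "<p>Hello <b>world</p>"


def top_level_tags(html: str, parser: str) -> List[str]:
    """Return the names of the tags sitting directly under the soup root."""
    soup = BeautifulSoup(html, parser)
    return [child.name for child in soup.children if isinstance(child, Tag)]


# html.parser leaves the fragment as a bare <p> at the top level.
print("html.parser:", top_level_tags(FRAGMENT, "html.parser"))

try:
    # lxml builds a complete document, so the top-level tag becomes <html>.
    print("lxml:", top_level_tags(FRAGMENT, "lxml"))
except FeatureNotFound:
    print("lxml: not installed")
```

Code that walks `soup.children` or checks the root element can therefore see different trees depending on the parser, which matters when some connectors pass `html.parser` soups and others pass `lxml` soups into the same `format_document_soup()`.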
Description
Switch BeautifulSoup's HTML parser from html.parser to lxml.
Why:
lxml is already a dependency (lxml==5.3.0)
Changes:
parse_html_page_basic(): use lxml parser
web_html_cleanup(): use lxml parser
How Has This Been Tested?
lxml parses HTML correctly via manual testing (running a web search)
Additional Options
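For illustration, the change amounts to passing a different parser name to the `BeautifulSoup` constructor. The sketch below is not the PR's code: `DEFAULT_PARSER` and `make_soup` are hypothetical names, and the fallback to `html.parser` is an added assumption for environments where `lxml` is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical shared constant, so every call site names the same parser.
DEFAULT_PARSER = "lxml"


def make_soup(html: str, parser: str = DEFAULT_PARSER) -> BeautifulSoup:
    """Parse html with the requested parser, falling back to the stdlib
    html.parser when BeautifulSoup cannot find the requested one."""
    try:
        return BeautifulSoup(html, parser)
    except FeatureNotFound:
        # Assumption: a slower parse is preferable to failing outright.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<html><body><p>hi</p></body></html>")
print(soup.find("p").get_text())
```

Routing every call site through one helper like this would also be one way to keep connectors from mixing parsers, though whether that fits the codebase is a judgment for the maintainers.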