Description
Problem
Agents that need to process file attachments (PDFs, images, audio, video) face several limitations today. Azure Content Understanding (CU) is an Azure AI service that extracts structured content from documents, images, audio, and video using state-of-the-art OCR, transcription, and field extraction. It addresses the following gaps in the Agent Framework:
- Poor OCR / structure extraction — Free digital-text PDF parsers miss scanned content, handwritten text, complex tables, and multi-column layouts. CU provides state-of-the-art OCR with markdown extraction that preserves document structure.
- No multimodal support — Most LLMs don't natively accept audio, video, or rich document formats (DOCX, XLSX, PPTX). Even those that accept images often can't handle multi-page PDFs or long audio. A preprocessing layer is needed to extract structured text from these formats before sending to the LLM.
- No built-in integration — Developers today must write custom code to call CU, manage analysis state across turns, handle timeouts, and format results for the LLM — this is boilerplate that should be handled by a reusable context provider.
Proposed solution
A new optional package agent-framework-azure-ai-contentunderstanding that integrates CU into the Agent Framework as a BaseContextProvider. It follows the same pattern as the existing agent-framework-azure-ai-search package — an optional connector that users install separately when they need the capability.
Package details:
| Field | Value |
| --- | --- |
| PyPI name | `agent-framework-azure-ai-contentunderstanding` |
| Python module | `agent_framework_azure_ai_contentunderstanding` |
| Pattern | `BaseContextProvider` (same as `AzureAISearchContextProvider`) |
| Dependencies | `agent-framework-core`, `azure-ai-contentunderstanding`, `aiohttp`, `filetype` |
Key features:
- Auto-detects file attachments in `Message.contents`, sends them to CU for analysis, and injects structured results (markdown + extracted fields) into the LLM context
- Supports documents (PDF, DOCX, XLSX, PPTX, HTML), images (JPEG, PNG, TIFF, BMP), audio (WAV, MP3, M4A, FLAC), and video (MP4, MOV, AVI, WebM)
- Works with any LLM client: the extracted markdown/fields are plain text, so any model can consume them
- Auto-detects the analyzer by media type (`prebuilt-documentSearch`, `prebuilt-audioSearch`, `prebuilt-videoSearch`)
- Per-file analyzer override via `additional_properties["analyzer_id"]`
- Background processing with a configurable `max_wait` timeout
- Multi-document session state with status tracking (`analyzing` / `uploading` / `ready` / `failed`)
- MIME sniffing for misidentified files (`application/octet-stream`)
- Optional `file_search` integration for token-efficient RAG on large documents
- Per-session state isolation (safe for concurrent sessions on the same provider instance)
- Follows the existing `BaseContextProvider` pattern: zero custom wiring needed
Implementation plan
This feature will be implemented in both Python and .NET. Python will be delivered first to gather feedback on the API surface and usage patterns, then .NET will follow.
Python PR: #4829
Code Sample 1 — Multi-turn document Q&A
```python
# Credential import from azure-identity; the Agent, AgentSession, Message,
# Content, and ContentUnderstandingContextProvider imports come from the
# agent-framework packages and are omitted here for brevity.
from azure.identity import AzureCliCredential

cu = ContentUnderstandingContextProvider(
    endpoint="https://my-resource.services.ai.azure.com/",
    credential=AzureCliCredential(),
)

async with cu:
    agent = Agent(client=client, name="DocQA", instructions="...", context_providers=[cu])
    session = AgentSession()

    # Turn 1: upload PDF; CU extracts markdown + fields and injects them into the LLM context
    response = await agent.run(
        Message(role="user", contents=[
            Content.from_text("What's on this invoice?"),
            Content.from_uri("https://example.com/invoice.pdf", media_type="application/pdf",
                             additional_properties={"filename": "invoice.pdf"}),
        ]),
        session=session,
    )

    # Turn 2: follow-up; no re-upload, CU results are cached in session state
    response = await agent.run("What is the total amount due?", session=session)
```

Complete samples: 01_document_qa.py · 02_multi_turn_session.py
Code Sample 2 — file_search integration for large documents
```python
# CU extracts markdown → auto-uploads it to the vector store → file_search tool registered
cu = ContentUnderstandingContextProvider(
    endpoint="https://my-resource.services.ai.azure.com/",
    credential=credential,
    file_search=FileSearchConfig.from_foundry(
        openai_client,
        vector_store_id=vector_store.id,
        file_search_tool=client.get_file_search_tool(vector_store_ids=[vector_store.id]),
    ),
)
```

Complete sample: 06_large_doc_file_search.py
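The point of the `file_search` path is that a large document's extracted markdown goes into the vector store in retrievable pieces instead of being stuffed into the prompt. A hypothetical sketch of the chunking step such a pipeline performs before upload (this helper is illustrative only and is not part of the package API):

```python
def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split CU-extracted markdown into chunks on paragraph boundaries,
    so each vector-store entry stays small enough for token-efficient RAG.
    Hypothetical helper; the real integration handles upload internally."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in markdown.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed the budget
        if size + len(para) > max_chars and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2  # +2 for the paragraph separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps the structure CU's markdown extraction preserved (headings, tables) intact within each chunk, which is what makes the retrieved snippets useful to the LLM.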