Document Parsing
Learn about supported file formats, processing configuration, and how to extract structured data from documents.
Supported Formats
ParseSphere accepts a wide range of document formats, each with specific extraction capabilities:
PDF Documents
Native PDFs: Extracts text, tables, and metadata with high accuracy. Text-based PDFs process faster than scanned documents.
Scanned PDFs: Applies OCR to extract text from images. Includes automatic table detection within scanned pages. Processing time depends on page count and image quality.
Microsoft Office
Word (.docx): Extracts full text content, embedded tables, and document structure. Preserves formatting information and paragraph boundaries.
PowerPoint (.pptx): Captures slide text, speaker notes, and applies OCR to embedded images. Tables detected within images are returned as structured data.
Excel (.xlsx): Extracts all sheets with full preservation of cell values and data types. For querying Excel files with natural language, use Workspaces instead.
Tabular Data
CSV: Parsed with automatic delimiter detection and column type inference. Best suited for Workspaces for natural language queries.
Plain Text & Structured Text
Text (.txt): Read directly with support for RTF format detection. Fastest processing option.
Markdown (.md): Parsed with full Markdown structure preservation.
JSON (.json): Extracted as raw text content. Useful for API responses, configuration files, and data exports.
XML (.xml): Extracted as raw text content. Supports data interchange files, SOAP responses, and configuration.
HTML (.html, .htm): Extracted as raw text content. Useful for saved web pages and exported reports.
YAML (.yaml, .yml): Extracted as raw text content. Common for configuration files and CI/CD pipelines.
Log (.log): Extracted as raw text content. Useful for server logs and application log analysis.
File Size Limits
All file formats accept documents up to 200 MB. Larger files should be split or compressed before upload.
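A quick pre-upload check avoids rejected requests for oversized files. This snippet is a sketch; the constant simply encodes the 200 MB limit stated above.

```python
import os

MAX_UPLOAD_BYTES = 200 * 1024 * 1024  # 200 MB upload limit

def check_upload_size(path):
    """Return True if the file fits within the upload limit."""
    return os.path.getsize(path) <= MAX_UPLOAD_BYTES
```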
Image Handling
Images embedded within documents (PDFs, Word, PowerPoint) are automatically processed with OCR to extract text. Standalone image files (PNG, JPG) are not supported for parsing.
Processing Configuration
Configure how ParseSphere processes your documents using these parameters:
| Parameter | Default | Description |
|---|---|---|
| file | (required) | Document to parse (max 200 MB). Supported formats: PDF, DOCX, PPTX, XLSX, CSV, TXT, MD, JSON, XML, HTML, YAML, LOG |
| process_images | true | Enable OCR for images embedded in documents and scanned pages. Disable for text-native documents to reduce processing time |
| extract_tables | true | Extract tables as structured JSON objects with headers, rows, and metadata |
| chunk | false | Split document into semantic chunks for vector databases or RAG pipelines |
| chunk_size | 600 | Maximum chunk size in tokens (only used when chunk=true) |
| chunk_overlap | 200 | Number of overlapping tokens between chunks (only used when chunk=true) |
| session_ttl | 1800 | Results cache duration in seconds. Default is 30 minutes. Minimum: 60 seconds |
| webhook_url | (none) | URL to receive an HTTP POST notification when processing completes |
| webhook_secret | (none) | Secret for HMAC-SHA256 signature verification of webhook payloads |
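As a sketch of how these parameters map onto multipart form fields, the helper below assembles the non-file fields for a request. The function name and the boolean-to-string encoding are our assumptions, not part of any official SDK.

```python
def build_parse_fields(process_images=True, extract_tables=True,
                       chunk=False, chunk_size=600, chunk_overlap=200,
                       session_ttl=1800, webhook_url=None, webhook_secret=None):
    """Assemble the non-file form fields for POST /v1/parses.

    Booleans are sent as lowercase strings, a common multipart
    convention (an assumption here); defaults mirror the table above.
    """
    fields = {
        "process_images": str(process_images).lower(),
        "extract_tables": str(extract_tables).lower(),
        "chunk": str(chunk).lower(),
        "session_ttl": str(max(session_ttl, 60)),  # documented minimum is 60 s
    }
    if chunk:
        # chunk_size / chunk_overlap are only meaningful when chunk=true
        fields["chunk_size"] = str(chunk_size)
        fields["chunk_overlap"] = str(chunk_overlap)
    if webhook_url:
        fields["webhook_url"] = webhook_url
    if webhook_secret:
        fields["webhook_secret"] = webhook_secret
    return fields
```

Pass the resulting dict alongside the file in your HTTP client's multipart upload.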
Understanding process_images
Optimize Processing Time
Disable process_images when processing text-native documents to reduce processing time by 50-70%.
When enabled, ParseSphere automatically:
- Detects scanned PDFs and applies OCR
- Extracts text from images embedded in documents
- Recognizes tables within images
- Processes handwritten content (with reduced accuracy)
When to disable: Processing Word documents, text-based PDFs, or any document where images are decorative rather than content-bearing.
Understanding extract_tables
Controls whether tables are returned as structured JSON objects. Table extraction support varies by format:
| Format | Native Tables | OCR Tables | Best Use Case |
|---|---|---|---|
| PDF | ✓ | ✓ | All table extraction |
| Word (.docx) | ✓ | ✗ | Native tables only |
| PowerPoint (.pptx) | ✗ | ✓ | Image-based tables |
| Excel/CSV | N/A | N/A | Use Workspaces |
Each extracted table includes:
- Headers: Column names (if detected)
- Rows: Data as key-value dictionaries
- Metadata: Row/column counts, page number (when applicable)
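Since each table arrives as headers plus rows of key-value dictionaries, converting one into flat CSV-style rows is straightforward. The helper below is a sketch (the function name is ours); it uses the header order and fills missing cells with empty strings.

```python
def table_to_rows(table):
    """Convert an extracted table (headers + row dicts) to a list of lists.

    The first row is the header row; cells absent from a row dict
    become empty strings.
    """
    headers = table.get("headers") or []
    grid = [headers]
    for row in table.get("rows", []):
        grid.append([str(row.get(h, "")) for h in headers])
    return grid
```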
Information
For Excel and CSV files, use Workspaces to query data with natural language instead of extracting as table objects.
Understanding chunk
Perfect for RAG
Enable chunking when building retrieval-augmented generation (RAG) systems or storing content in vector databases.
Set chunk=true to split text at semantic boundaries:
- Sentence boundaries: Intelligent splitting that preserves sentence integrity
- Token limits: Configurable via the chunk_size parameter (default: 600 tokens)
- Overlap: Configurable via the chunk_overlap parameter (default: 200 tokens)
Chunk response format: An array of text strings:
{
"chunks": [
"First chunk of text content...",
"Second chunk with overlapping context...",
"Third chunk continues the document..."
]
}
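To see how chunk_size and chunk_overlap interact, the sketch below splits a token list with a plain sliding window. This is only an illustration of the size/overlap arithmetic; the actual service also respects sentence boundaries, which this sketch does not.

```python
def window_chunks(tokens, chunk_size=600, chunk_overlap=200):
    """Illustrative sliding-window chunking.

    Each chunk holds up to chunk_size tokens and repeats the last
    chunk_overlap tokens of its predecessor. Not the real algorithm,
    which additionally splits at sentence boundaries.
    """
    step = chunk_size - chunk_overlap
    assert step > 0, "overlap must be smaller than chunk size"
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```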
Understanding session_ttl
Controls result cache duration:
- Default: 1800 seconds (30 minutes)
- Minimum: 60 seconds (1 minute)
Use longer TTL when:
- Sharing results across multiple systems
- Processing reference documents repeatedly
- Building user-facing applications with multiple views
Use shorter TTL for:
- Sensitive documents
- High-volume processing
- Cost optimization
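Deciding whether a cached result is still retrievable is simple arithmetic on the TTL; a minimal sketch, assuming you record the completion timestamp when the webhook or poll returns (the function name is ours):

```python
import time

def result_expired(completed_at, session_ttl=1800, now=None):
    """Return True once the cached result's TTL has elapsed.

    completed_at and now are Unix timestamps; session_ttl is seconds.
    """
    now = time.time() if now is None else now
    return (now - completed_at) >= session_ttl
```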
Submit a Document
Create a parse job by uploading a document:
POST /v1/parses: Submit a document for text extraction and processing
curl -X POST https://api.parsesphere.com/v1/parses \
-H "Authorization: Bearer sk_your_api_key" \
  -F "file=@contract.pdf"

Receiving Results via Webhook
Production Best Practice
Use webhooks instead of polling for production applications. Webhooks are more reliable, reduce API calls, and provide instant notifications.
Provide a webhook_url parameter to receive an HTTP POST when processing completes:
POST https://your-app.com/webhook
Webhook notification sent by ParseSphere when processing completes:
{
"parse_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"result": {
"text": "Full extracted text content...",
"metadata": {
"filename": "contract.pdf",
"file_type": "pdf",
"file_size": 2048576,
"page_count": 50,
"processing_time": 45.2,
"extraction_method": "standard",
"characters": 125000,
"tokens": 31250,
"ocr_pages": 0,
"table_count": 3,
"native_tables": 3,
"ocr_tables": 0
},
"tables": [
{
"page": 5,
"headers": ["Name", "Amount", "Date"],
"rows": [
{"Name": "Item 1", "Amount": "100", "Date": "2024-01-01"},
{"Name": "Item 2", "Amount": "200", "Date": "2024-01-02"}
],
"row_count": 2,
"column_count": 3
}
],
"chunks": null
},
"timestamp": "2025-01-03T12:00:45Z",
"processing_time": 45.2
}

Result Structure
text: The complete extracted text content from the document.
metadata: Document information including:
- filename, file_type, file_size: Basic file information
- page_count: Number of pages (for paginated formats)
- processing_time: Time taken to process in seconds
- extraction_method: Method used (standard or ocr)
- characters: Character count of extracted text
- tokens: Token count for AI processing cost estimation
- ocr_pages: Number of pages processed with OCR
- table_count, native_tables, ocr_tables: Table extraction statistics
tables: Array of extracted tables, or null if none found or extract_tables=false. Each table includes page number, headers, rows as dictionaries, and counts.
chunks: Array of text strings if chunk=true, otherwise null.
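Because tables and chunks can both be null, a consumer should normalize them before iterating. A small sketch (the helper name is ours):

```python
def normalize_result(result):
    """Replace null tables/chunks with empty lists so callers can
    iterate without None checks, and default missing fields."""
    return {
        "text": result.get("text", ""),
        "metadata": result.get("metadata", {}),
        "tables": result.get("tables") or [],
        "chunks": result.get("chunks") or [],
    }
```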
Webhook Headers
Each webhook request includes these headers:
- X-ParseSphere-Signature: HMAC-SHA256 signature for payload verification (prefixed with sha256=)
- X-ParseSphere-Idempotency-Key: Unique key (the parse_id) for deduplicating deliveries
- Content-Type: application/json
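The idempotency key makes it easy to drop duplicate deliveries. An in-memory sketch follows; a production system would use a shared store (e.g. Redis) so retries across processes are also deduplicated.

```python
_seen_keys = set()

def is_duplicate_delivery(idempotency_key):
    """Return True if this key (the parse_id) was already processed.

    In-memory only, for illustration; process restarts forget
    previously seen keys.
    """
    if idempotency_key in _seen_keys:
        return True
    _seen_keys.add(idempotency_key)
    return False
```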
Webhook Security
Always Verify Signatures
Never trust webhook payloads without verifying the HMAC signature. This prevents spoofed or malicious requests.
The signature is computed over a deterministically sorted JSON serialization of the payload. To verify:
- Parse the request body as JSON
- Re-serialize with sorted keys: json.dumps(body, sort_keys=True)
- Compute HMAC-SHA256 using your webhook_secret
- Compare against the header value (strip the sha256= prefix first)
import hmac
import hashlib
import json
def verify_webhook(body_dict, signature_header, secret):
"""Verify webhook signature.
The server signs json.dumps(payload, sort_keys=True),
and the signature header is prefixed with 'sha256='.
"""
# Re-serialize with sorted keys to match server signing
canonical_payload = json.dumps(body_dict, sort_keys=True)
expected = hmac.new(
secret.encode('utf-8'),
canonical_payload.encode('utf-8'),
hashlib.sha256
).hexdigest()
# Strip the 'sha256=' prefix from the header
received = signature_header.removeprefix("sha256=")
return hmac.compare_digest(expected, received)
# In your webhook handler
@app.post("/webhook")
def handle_webhook(request):
signature = request.headers.get("X-ParseSphere-Signature")
body = request.json()
if not verify_webhook(body, signature, WEBHOOK_SECRET):
return {"error": "Invalid signature"}, 401
# Use X-ParseSphere-Idempotency-Key to deduplicate
idempotency_key = request.headers.get("X-ParseSphere-Idempotency-Key")
# Process webhook...
    return {"success": True}

Processing Duration
Processing time varies based on file size, content complexity, and document type. Key factors that affect processing speed:
Text-native PDFs: Fastest processing. Affected by page count and file size.
Scanned PDFs: Slower processing due to OCR requirements. Affected by page count, image quality, and content complexity.
Word/PowerPoint: Medium processing speed. Affected by embedded images and table count.
Excel/CSV: Fast processing. Affected by file size and row count.
Large files (over 10 MB): Variable processing time; may require additional processing steps.
Information
The API provides an estimated processing time when you submit a document. Actual processing time depends on current system load, file complexity, and selected options.
Warning
Design your integration with appropriate timeout handling and use webhooks for reliable notifications instead of polling.
Optimization Tips
Reduce Processing Time:
- Disable process_images for text-native documents
- Disable extract_tables if you don't need table data
- Compress large PDFs before upload
- Split documents over 200 MB
Improve Accuracy:
- Use high-quality scans (300+ DPI)
- Ensure good contrast and minimal noise
- Provide straight-aligned documents
- Use text-based PDFs when possible
What's Next?
Explore related topics:
- Quick Start - Make your first parse request
- Core Concepts - Understand parse job lifecycle
- VIVI Document Intelligence - Query tabular data with natural language
- Error Handling - Handle parsing errors