Document Processing

Overview

Empire Chain processes documents through its Docling module. This guide shows how to load documents of several types, extract their text, tables, and images, and configure how processing runs.

Supported Document Types

PDF Documents
Text Files
Images (with OCR)
Word Documents
HTML Pages

Basic Document Processing

Loading Documents

from empire_chain.docling import Docling

# Load a PDF document
pdf_doc = Docling("document.pdf")

# Load a text file
text_doc = Docling("document.txt")

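# Load from bytes, stating the content type explicitly
with open("document.pdf", "rb") as f:
    bytes_content = f.read()
doc = Docling(bytes_content, content_type="application/pdf")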
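
When a document arrives as bytes, the content type has to come from somewhere. One option is to derive it from the original filename with Python's standard mimetypes module. This is a minimal sketch: the helper name guess_content_type and the application/octet-stream fallback are assumptions for illustration, not part of Empire Chain.

```python
import mimetypes


def guess_content_type(filename: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the filename's extension (hypothetical helper)."""
    content_type, _encoding = mimetypes.guess_type(filename)
    # guess_type returns None for unknown or missing extensions
    return content_type or default


print(guess_content_type("document.pdf"))
```

The returned string could then be passed as the content_type argument shown above.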

Text Extraction

# Extract all text
text = doc.extract_text()

# Extract text from specific pages
text = doc.extract_text(pages=[1, 3, 5])

# Extract with layout preservation
text = doc.extract_text(preserve_layout=True)

Document Analysis

# Get document metadata
metadata = doc.get_metadata()

# Analyze document structure
structure = doc.analyze_structure()

# Extract tables
tables = doc.extract_tables()

Advanced Features

OCR Processing

# Enable OCR for image-based PDFs
doc = Docling("scanned.pdf", config={"ocr": {"enabled": True}})

# Extract text with OCR
text = doc.extract_text()

Document Transformation

# Convert to different format
doc.convert_to("docx")

# Split document
doc.split(pages=[1, 3, 5])

# Merge documents
Docling.merge(["doc1.pdf", "doc2.pdf"], output="merged.pdf")

Content Extraction

# Extract images
images = doc.extract_images()

# Extract tables to pandas DataFrame
tables = doc.extract_tables(output_format="pandas")

# Extract specific regions
content = doc.extract_region(bbox=(100, 100, 500, 500))

Best Practices

Memory Management

# Use a context manager so the document's resources are released when the block exits
with Docling("large.pdf") as doc:
    text = doc.extract_text()

Error Handling

from empire_chain.docling import Docling, DocumentError  # assumed location of DocumentError

try:
    doc = Docling("document.pdf")
    text = doc.extract_text()
except DocumentError as e:
    print(f"Error processing document: {e}")

Batch Processing

from empire_chain.docling import BatchProcessor

processor = BatchProcessor()
results = processor.process_directory("docs/")
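
For readers who want to see what directory-level batching involves, here is a minimal sketch built only on the standard library. It does not show how BatchProcessor is implemented. The function process_directory, its process_one callback, and the (results, errors) return layout are hypothetical stand-ins, and max_workers=4 merely mirrors the value used in the Configuration Options example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Tuple


def process_directory(
    directory: str,
    process_one: Callable[[Path], str],
    max_workers: int = 4,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Apply process_one to every file in directory.

    Returns (results, errors), both keyed by filename.
    """
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every file up front, then collect each outcome in order
        futures = {pool.submit(process_one, path): path for path in files}
        for future, path in futures.items():
            try:
                results[path.name] = future.result()
            except Exception as exc:  # one bad file should not abort the batch
                errors[path.name] = str(exc)
    return results, errors
```

Passing a callback that wraps the single-document calls from earlier sections would give per-file results while keeping failures separate, so one unreadable file does not stop the rest of the run.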

Configuration Options

config = {
    "ocr": {
        "enabled": True,
        "language": "eng",
        "dpi": 300
    },
    "extraction": {
        "preserve_layout": True,
        "include_images": False
    },
    "processing": {
        "chunk_size": 1000,
        "max_workers": 4
    }
}

doc = Docling("document.pdf", config=config)
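
The nested layout makes it natural to override a few settings and keep the rest. Whether Docling fills in missing keys itself is not stated here, so the sketch below shows one way to do the merge on the caller's side. DEFAULT_CONFIG simply repeats the values from the example above, and merge_config is a hypothetical helper, not an Empire Chain function.

```python
import copy
from typing import Any, Dict

# Same values as the configuration example above
DEFAULT_CONFIG: Dict[str, Any] = {
    "ocr": {"enabled": True, "language": "eng", "dpi": 300},
    "extraction": {"preserve_layout": True, "include_images": False},
    "processing": {"chunk_size": 1000, "max_workers": 4},
}


def merge_config(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which nested keys from overrides replace those in base."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys in the same section are preserved
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


config = merge_config(DEFAULT_CONFIG, {"ocr": {"dpi": 150}})
print(config["ocr"])  # {'enabled': True, 'language': 'eng', 'dpi': 150}
```

Because the merge works on a deep copy, DEFAULT_CONFIG is left unchanged and can be reused for other documents.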