Document Intelligence and Analysis
Transform unstructured documents into actionable intelligence with AI-powered analysis. This guide covers document intelligence use cases, implementation patterns, and best practices for extracting maximum value from your documents.
Overview
What is Document Intelligence?
Document intelligence combines multiple AI capabilities to understand, analyze, and extract insights from documents:
- Text Extraction: Convert PDFs to machine-readable text
- Entity Recognition: Identify people, companies, dates, amounts
- Relationship Extraction: Understand connections between entities
- Sentiment Analysis: Gauge tone and sentiment
- Summarization: Generate concise summaries
- Classification: Categorize documents automatically
- Question Answering: Answer questions about document content
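In practice these capabilities surface together in one analysis result per document. A minimal sketch of what such a combined payload might look like, and how question answering can route over it (the field names here are illustrative assumptions, not Alactic's documented response schema):

```python
# Illustrative shape of a combined document-intelligence result.
# Field names are assumptions for this sketch, not a documented schema.
sample_analysis = {
    "text": "Invoice #1042 from Acme Corp, due 2024-03-01 for $5,000.",
    "entities": {
        "company": ["Acme Corp"],
        "date": ["2024-03-01"],
        "money": ["$5,000"],
    },
    "sentiment": {"label": "neutral", "score": 0.1},
    "summary": "Invoice from Acme Corp for $5,000, due March 1, 2024.",
    "classification": "invoice",
}

def answer_question(analysis, question):
    """Toy question answering: route questions to extracted entities."""
    q = question.lower()
    if "who" in q:
        return analysis["entities"]["company"][0]
    if "when" in q or "due" in q:
        return analysis["entities"]["date"][0]
    return analysis["summary"]

print(answer_question(sample_analysis, "Who sent this invoice?"))
```

A production system would route questions with the model itself rather than keyword rules; the sketch only shows how the extracted pieces fit together.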
Common Use Cases
Invoice Processing
Challenge:
- Manual data entry from hundreds of invoices monthly
- Prone to errors and delays
- Expensive labor costs
- Difficult to track payment schedules
Solution with Alactic:
Step 1: Upload invoices in batch
import requests

def process_invoice_batch(invoice_files):
    # Open files for upload and make sure handles are closed afterwards
    handles = [open(f, "rb") for f in invoice_files]
    try:
        response = requests.post(
            "https://your-vm-ip/api/v1/batch",
            headers={"X-Deployment-Key": "ak-xxxxx"},
            files=[("files[]", h) for h in handles],
            data={
                "model": "gpt-4o-mini",
                "analysis_depth": "standard"
            }
        )
    finally:
        for h in handles:
            h.close()
    return response.json()["batch_id"]
Step 2: Extract structured data
Configure extraction to focus on key invoice fields:
- Vendor name and address
- Invoice number and date
- Due date
- Line items (description, quantity, unit price, total)
- Subtotal, tax, total amount
- Payment terms
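When a field slips through model extraction, lightweight regex fallbacks over the raw text can backstop it. A sketch for two of the fields above (the patterns are illustrative; real invoices vary widely):

```python
import re

# Regex fallbacks for invoice number and due date over raw extracted text.
# Patterns are illustrative assumptions, not a complete parser.
def extract_invoice_number(text):
    match = re.search(
        r"invoice\s*(?:#|no\.?|number)?\s*[:\-]?\s*([A-Z0-9\-]+)",
        text, re.IGNORECASE)
    return match.group(1) if match else None

def extract_due_date(text):
    # Matches ISO-style dates only; extend for locale-specific formats
    match = re.search(
        r"due\s*(?:date)?\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})",
        text, re.IGNORECASE)
    return match.group(1) if match else None

text = "Invoice #INV-2041 ... Due date: 2024-07-15"
print(extract_invoice_number(text), extract_due_date(text))
```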
Step 3: Export to accounting system
def export_to_accounting(invoice_data):
    structured_data = {
        "vendor": invoice_data["entities"]["company"][0],
        "invoice_number": extract_invoice_number(invoice_data),
        "date": invoice_data["entities"]["date"][0],
        "amount": invoice_data["entities"]["money"][0],
        "line_items": parse_line_items(invoice_data["text"])
    }
    # Export to QuickBooks, Xero, etc.
    accounting_api.create_bill(structured_data)
Results:
- Processing time: 5 minutes vs 2 hours manually
- Accuracy: 98%+ with GPT-4o
- Cost savings: $1,500/month in labor
- Faster payment processing
Best Practices:
- Use GPT-4o mini for simple invoices (cost-effective)
- Use GPT-4o for complex invoices with multiple line items
- Implement validation rules for critical fields
- Review flagged invoices manually
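The validation-rule practice above can be as simple as cross-checking arithmetic and required fields before export. A sketch, with assumed field names matching the extraction step:

```python
# Sketch of invoice validation: required fields plus an arithmetic
# cross-check. Field names mirror the extraction step and are assumptions.
def validate_invoice(inv, tolerance=0.01):
    errors = []
    for field in ("vendor", "invoice_number", "date", "amount"):
        if not inv.get(field):
            errors.append(f"missing field: {field}")
    # Totals must add up: subtotal + tax == total, within rounding tolerance
    if inv.get("subtotal") is not None and inv.get("tax") is not None:
        expected = inv["subtotal"] + inv["tax"]
        if abs(expected - inv.get("total", 0)) > tolerance:
            errors.append(f"total {inv.get('total')} != subtotal+tax {expected}")
    return (len(errors) == 0, errors)

ok, errs = validate_invoice({
    "vendor": "Acme Corp", "invoice_number": "INV-2041",
    "date": "2024-07-01", "amount": 1100.0,
    "subtotal": 1000.0, "tax": 100.0, "total": 1100.0,
})
print(ok, errs)
```

Invoices that fail any rule go to the manual-review queue rather than straight to the accounting system.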
Contract Analysis
Challenge:
- Hundreds of contracts to review for M&A due diligence
- Identify key terms, obligations, and risks
- Manual review: 1-2 hours per contract
- Expensive legal fees
Solution with Alactic:
Step 1: Process contract portfolio
import os

def analyze_contracts(contract_folder):
    contracts = os.listdir(contract_folder)
    for contract in contracts:
        result = process_document(
            file_path=f"{contract_folder}/{contract}",
            model="gpt-4o",  # Use GPT-4o for accuracy
            analysis_depth="deep",  # Get detailed analysis
            custom_prompt="Extract: parties, effective date, term length, "
                          "termination clauses, payment terms, obligations, "
                          "indemnification, liability limits, governing law"
        )
        save_to_database(contract, result)
Step 2: Extract key provisions
Focus on critical contract elements:
- Parties: Who are the contracting parties?
- Term: Contract duration and renewal terms
- Payment terms: Amounts, schedules, payment methods
- Obligations: What must each party do?
- Termination: Conditions for ending contract
- Liability: Liability caps and limitations
- Indemnification: Who protects whom from what?
- Governing law: Which jurisdiction applies?
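One way to feed this checklist into the `custom_prompt` parameter is to generate the prompt from the provision list, so recurring contract types can share a template. A small sketch (the prompt wording is an assumption):

```python
# Sketch: turning the provision checklist above into a custom extraction
# prompt. The wording is illustrative; tune it per contract type.
PROVISIONS = [
    "parties", "term and renewal", "payment terms", "obligations",
    "termination conditions", "liability caps", "indemnification",
    "governing law",
]

def build_contract_prompt(provisions):
    lines = [f"- {p}" for p in provisions]
    return ("Extract the following provisions from the contract:\n"
            + "\n".join(lines))

prompt = build_contract_prompt(PROVISIONS)
```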
Step 3: Risk assessment
def assess_contract_risk(contract_analysis):
    risk_factors = {
        "unlimited_liability": "high",
        "auto_renewal": "medium",
        "long_term": "medium",
        "broad_indemnification": "high",
        "no_termination_clause": "high"
    }
    identified_risks = []
    for risk, severity in risk_factors.items():
        if check_for_risk(contract_analysis, risk):
            identified_risks.append({
                "risk": risk,
                "severity": severity,
                "details": extract_details(contract_analysis, risk)
            })
    return identified_risks
Step 4: Generate summary report
def generate_contract_summary(contracts_data):
    summary = {
        "total_contracts": len(contracts_data),
        "total_value": sum(c["payment_terms"]["total"] for c in contracts_data),
        "high_risk_contracts": [c for c in contracts_data if c["risk_level"] == "high"],
        "expiring_soon": [c for c in contracts_data if c["days_until_expiry"] < 90],
        "key_obligations": aggregate_obligations(contracts_data),
        "liability_exposure": calculate_liability(contracts_data)
    }
    return summary
Results:
- Review time: 50 contracts in 2 hours vs 100 hours manually
- Cost savings: $50,000 in legal fees
- Better risk identification
- Faster deal closure
Best Practices:
- Always use GPT-4o for legal contracts (accuracy critical)
- Use Deep Analysis for comprehensive extraction
- Have legal team review AI-flagged risks
- Build template for recurring contract types
Research Paper Analysis
Challenge:
- Literature review requires reading 500+ papers
- Identify relevant studies and key findings
- Manual process: 3-6 months
- Risk of missing important research
Solution with Alactic:
Step 1: Collect papers
def scrape_pubmed(search_query, num_papers=500):
    papers = []
    # Search PubMed via a client wrapper (pubmed_api is assumed here)
    search_results = pubmed_api.search(search_query, max_results=num_papers)
    for paper_id in search_results:
        # Get PDF URL
        pdf_url = get_paper_pdf_url(paper_id)
        papers.append(pdf_url)
    return papers
Step 2: Process with Alactic
import requests

def analyze_research_papers(paper_urls):
    results = []
    # Process in batches of 50
    for i in range(0, len(paper_urls), 50):
        batch = paper_urls[i:i+50]
        response = requests.post(
            "https://your-vm-ip/api/v1/batch",
            headers={"X-Deployment-Key": "ak-xxxxx"},
            json={
                "urls": batch,
                "model": "gpt-4o",
                "analysis_depth": "deep",
                "extract": [
                    "research_question",
                    "methodology",
                    "sample_size",
                    "key_findings",
                    "limitations",
                    "conclusions"
                ]
            }
        )
        batch_id = response.json()["batch_id"]
        results.extend(wait_for_batch(batch_id))
    return results
Step 3: Extract structured data
For each paper, extract:
- Research question: What did they investigate?
- Methodology: How did they conduct the study?
- Sample size: How many participants/data points?
- Key findings: What did they discover?
- Statistical significance: P-values, confidence intervals
- Limitations: Study weaknesses
- Conclusions: What do results mean?
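Before the synthesis step, it is worth checking each paper's record against this field list and flagging incomplete extractions. A sketch, with the field names above as assumed record keys:

```python
# Sketch: completeness check for a per-paper extraction record.
# Field names mirror the list above and are assumptions.
REQUIRED_FIELDS = [
    "research_question", "methodology", "sample_size",
    "key_findings", "limitations", "conclusions",
]

def completeness(paper):
    """Return (fraction of fields present, list of missing fields)."""
    missing = [f for f in REQUIRED_FIELDS if not paper.get(f)]
    score = (len(REQUIRED_FIELDS) - len(missing)) / len(REQUIRED_FIELDS)
    return score, missing

score, missing = completeness({
    "research_question": "Does X affect Y?", "methodology": "RCT",
    "sample_size": 120, "key_findings": "X raised Y by 12%",
    "limitations": "", "conclusions": "Supports the hypothesis",
})
```

Papers scoring below a threshold can be reprocessed with Deep Analysis or routed to a human reviewer.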
Step 4: Synthesize findings
def synthesize_literature(papers_data):
    synthesis = {
        "total_papers": len(papers_data),
        "date_range": get_date_range(papers_data),
        "common_methodologies": count_methodologies(papers_data),
        "key_themes": extract_themes(papers_data),
        "consistent_findings": find_consensus(papers_data),
        "contradictions": find_contradictions(papers_data),
        "research_gaps": identify_gaps(papers_data),
        "meta_analysis_data": prepare_meta_analysis(papers_data)
    }
    return synthesis
Results:
- Analysis time: 2 weeks vs 6 months
- More comprehensive coverage
- Quantitative synthesis possible
- Better identification of research gaps
Best Practices:
- Use GPT-4o for research papers (complex content)
- Enable Deep Analysis for full extraction
- Validate key findings against paper text
- Have domain expert review synthesis
Financial Report Analysis
Challenge:
- Analyze quarterly reports from 100+ companies
- Extract financial metrics and trends
- Identify investment opportunities or risks
- Manual analysis: Full-time job
Solution with Alactic:
Step 1: Collect financial reports
def collect_earnings_reports(companies, quarter):
    reports = []
    for company in companies:
        # Get 10-Q or earnings report URL
        report_url = get_sec_filing(company, "10-Q", quarter)
        reports.append({
            "company": company,
            "url": report_url,
            "quarter": quarter
        })
    return reports
Step 2: Process reports
def analyze_financial_reports(reports):
    results = []
    for report in reports:
        analysis = process_document(
            url=report["url"],
            model="gpt-4o",
            analysis_depth="deep",
            extract=[
                "revenue",
                "net_income",
                "operating_margin",
                "eps",
                "guidance",
                "key_metrics",
                "risks_mentioned",
                "management_commentary"
            ]
        )
        results.append({
            "company": report["company"],
            "quarter": report["quarter"],
            "financials": analysis["entities"]["money"],
            "metrics": analysis["key_metrics"],
            "sentiment": analysis["sentiment"],
            "risks": analysis["risks_mentioned"]
        })
    return results
Step 3: Comparative analysis
import pandas as pd

def compare_companies(companies_data):
    comparison = pd.DataFrame(companies_data)
    # Calculate metrics
    comparison["revenue_growth"] = calculate_growth(comparison, "revenue")
    comparison["margin_trend"] = calculate_trend(comparison, "operating_margin")
    comparison["sentiment_score"] = normalize_sentiment(comparison, "sentiment")
    # Rank companies
    comparison["investment_score"] = (
        comparison["revenue_growth"] * 0.3 +
        comparison["margin_trend"] * 0.3 +
        comparison["sentiment_score"] * 0.2 +
        comparison["guidance_strength"] * 0.2
    )
    return comparison.sort_values("investment_score", ascending=False)
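The weighted ranking can also be expressed as a standalone scoring function, which makes the weights easy to unit-test in isolation (the weights mirror the snippet and are illustrative, not an investment recommendation):

```python
# Weighted investment score, standalone. Weights mirror the DataFrame
# version above; they are illustrative, not a recommendation.
WEIGHTS = {
    "revenue_growth": 0.3,
    "margin_trend": 0.3,
    "sentiment_score": 0.2,
    "guidance_strength": 0.2,
}

def investment_score(metrics):
    """All inputs are assumed normalized to a comparable 0-1 scale."""
    return sum(metrics[k] * w for k, w in WEIGHTS.items())

print(investment_score({
    "revenue_growth": 0.8, "margin_trend": 0.5,
    "sentiment_score": 0.6, "guidance_strength": 0.7,
}))
```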
Results:
- Analysis time: 2 days per quarter vs a continuous full-time effort
- Identify opportunities faster
- Data-driven investment decisions
- Better risk management
Best Practices:
- Use GPT-4o for financial reports (precision matters)
- Cross-reference extracted numbers with tables
- Track metrics over multiple quarters
- Combine with quantitative data
Legal Discovery
Challenge:
- E-discovery with 100,000+ documents
- Identify relevant documents for litigation
- Manual review: $5-10 million
- Timeline: 12-18 months
Solution with Alactic:
Step 1: Initial processing
def process_discovery_documents(document_folder):
    documents = collect_all_files(document_folder)
    # Process in large batches
    batch_size = 500  # Enterprise plan capacity
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        result = process_batch(
            files=batch,
            model="gpt-4o-mini",  # Cost-effective for initial pass
            analysis_depth="standard",
            enable_vectors=True  # Enable semantic search
        )
        store_results(result)
Step 2: Relevance scoring
def score_relevance(document, case_keywords):
    # case_entities and case_description are assumed module-level case context
    # Get document analysis
    analysis = get_document_analysis(document["id"])
    # Calculate relevance score
    keyword_matches = count_keyword_matches(analysis["text"], case_keywords)
    entity_relevance = check_relevant_entities(analysis["entities"], case_entities)
    semantic_similarity = calculate_semantic_similarity(
        analysis["summary"],
        case_description
    )
    relevance_score = (
        keyword_matches * 0.3 +
        entity_relevance * 0.4 +
        semantic_similarity * 0.3
    )
    return relevance_score
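A minimal version of the `calculate_semantic_similarity` step is cosine similarity between embedding vectors. In a real deployment the vectors would come from the platform's vector store; the short vectors here are toys:

```python
import math

# Cosine similarity between two embedding vectors; 1.0 means identical
# direction, 0.0 means orthogonal. A toy stand-in for semantic similarity.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0

similarity = cosine_similarity([0.1, 0.8, 0.3], [0.2, 0.7, 0.4])
```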
Step 3: Prioritized review
def prioritize_for_review(documents_data, threshold=0.7):
    # case_keywords is assumed module-level case context
    # Score all documents
    scored_docs = [
        {
            "doc": doc,
            "relevance": score_relevance(doc, case_keywords)
        }
        for doc in documents_data
    ]
    # Sort by relevance
    scored_docs.sort(key=lambda x: x["relevance"], reverse=True)
    # High relevance: manual review required
    high_relevance = [d for d in scored_docs if d["relevance"] > threshold]
    # Medium relevance: spot check
    medium_relevance = [d for d in scored_docs if 0.4 < d["relevance"] <= threshold]
    # Low relevance: can likely exclude
    low_relevance = [d for d in scored_docs if d["relevance"] <= 0.4]
    return {
        "high": high_relevance,
        "medium": medium_relevance,
        "low": low_relevance,
        "review_reduction": len(low_relevance) / len(scored_docs) * 100
    }
Step 4: Hot document identification
def identify_hot_documents(documents_data):
    hot_docs = []
    # Look for smoking-gun indicator phrases; extend this list with
    # case-specific date ranges and key-people communications
    indicators = [
        "confidential",
        "do not distribute",
        "privileged",
        "attorney-client",
        "destroy after reading",
    ]
    for doc in documents_data:
        hot_score = calculate_hot_score(doc, indicators)
        if hot_score > 0.8:
            hot_docs.append({
                "document": doc,
                "hot_score": hot_score,
                "reasons": explain_hot_score(doc, indicators)
            })
    return hot_docs
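One simple way to implement the `calculate_hot_score` helper is the fraction of indicator phrases present in the document text. This is a sketch operating on raw text; a production version would also weight entities and date ranges:

```python
# Sketch of calculate_hot_score: fraction of indicator phrases found in
# the document's text. Case-insensitive substring matching only.
def calculate_hot_score(text, indicators):
    text_lower = text.lower()
    hits = sum(1 for phrase in indicators if phrase in text_lower)
    return hits / len(indicators) if indicators else 0.0

indicators = ["confidential", "do not distribute", "privileged",
              "attorney-client", "destroy after reading"]
score = calculate_hot_score(
    "CONFIDENTIAL - privileged and attorney-client material, do not distribute",
    indicators,
)
```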
Results:
- Review set reduced by 90% (10,000 vs 100,000 docs)
- Cost: $500K vs $5M
- Timeline: 3 months vs 18 months
- Better case outcomes with AI-identified hot docs
Best Practices:
- Use mini for initial pass (cost control)
- Use GPT-4o for high-relevance documents
- Enable vector search for semantic queries
- Have legal team validate high-relevance docs
- Use cascading review strategy
Medical Record Processing
Challenge:
- Extract patient information from unstructured medical records
- Identify diagnoses, medications, allergies, procedures
- Manual extraction: 30 minutes per record
- Error-prone and inconsistent
Solution with Alactic:
Step 1: HIPAA-compliant deployment
Ensure compliance:
- Enterprise plan with BAA
- Data encryption at rest and in transit
- Access controls and audit logging
- Data retention policies
Step 2: Process medical records
def process_medical_record(record_pdf):
    result = process_document(
        file_path=record_pdf,
        model="gpt-4o",  # Accuracy critical for medical data
        analysis_depth="deep",
        extract=[
            "patient_demographics",
            "diagnoses",
            "medications",
            "allergies",
            "procedures",
            "lab_results",
            "vital_signs",
            "medical_history",
            "treatment_plan"
        ]
    )
    return structure_medical_data(result)
Step 3: Extract structured medical data
def structure_medical_data(analysis):
    structured = {
        "patient": {
            "name": extract_patient_name(analysis),
            "dob": extract_date_of_birth(analysis),
            "mrn": extract_medical_record_number(analysis)
        },
        "diagnoses": [
            {
                "condition": d["value"],
                "icd_code": map_to_icd10(d["value"]),
                "date": d["date"]
            }
            for d in analysis["entities"]["medical_condition"]
        ],
        "medications": [
            {
                "name": m["value"],
                "dosage": extract_dosage(m),
                "frequency": extract_frequency(m),
                "start_date": m["date"]
            }
            for m in analysis["entities"]["medication"]
        ],
        "allergies": [
            {
                "allergen": a["value"],
                "reaction": extract_reaction(a),
                "severity": classify_severity(a)
            }
            for a in analysis["entities"]["allergy"]
        ]
    }
    return structured
Step 4: Validate and store
def validate_medical_data(structured_data):
    # Validation rules
    validations = [
        validate_mrn_format(structured_data["patient"]["mrn"]),
        validate_icd_codes(structured_data["diagnoses"]),
        check_medication_interactions(structured_data["medications"]),
        verify_allergy_severity(structured_data["allergies"]),
        ensure_completeness(structured_data)
    ]
    if all(validations):
        store_in_ehr(structured_data)
        return True
    else:
        flag_for_manual_review(structured_data, validations)
        return False
Results:
- Processing time: 2 minutes vs 30 minutes per record
- Accuracy: 99%+ for critical fields
- Cost savings: $25 per record
- Improved data accessibility for care teams
Best Practices:
- Always use GPT-4o for medical records (accuracy critical)
- Implement strict validation rules
- Have medical staff review AI-extracted data
- Maintain audit trail for all extractions
- Ensure HIPAA compliance at all stages
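The audit-trail practice above can be implemented as a tamper-evident record per extraction. A minimal sketch, assuming the record and field names shown here; adapt it to your compliance requirements:

```python
import datetime
import hashlib
import json

# Sketch of an audit-trail entry for one extraction: who, when, and a
# hash of exactly what was extracted. Field names are assumptions.
def audit_entry(record_id, user, extracted):
    payload = json.dumps(extracted, sort_keys=True).encode()
    return {
        "record_id": record_id,
        "user": user,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "extraction_sha256": hashlib.sha256(payload).hexdigest(),
    }

entry = audit_entry("MRN-001", "nurse.jane", {"diagnoses": ["J45.909"]})
```

Hashing the canonicalized extraction (sorted keys) means any later change to the stored data no longer matches the logged digest.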
Implementation Patterns
Batch Processing Pattern
When to use:
- Processing large volumes of similar documents
- Not time-sensitive (can process overnight)
- Want to maximize efficiency
Implementation:
import time

def batch_processing_pattern(documents, batch_size=50):
    results = []
    # Split into batches
    batches = [documents[i:i+batch_size] for i in range(0, len(documents), batch_size)]
    for batch in batches:
        # Submit batch
        batch_id = submit_batch(batch)
        # Wait for completion
        while not is_batch_complete(batch_id):
            time.sleep(10)
        # Collect results
        batch_results = get_batch_results(batch_id)
        results.extend(batch_results)
        # Clean up (optional)
        cleanup_batch(batch_id)
    return results
Benefits:
- Efficient API usage
- Lower cost per document
- Easier error handling
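The fixed 10-second poll in the pattern above can be replaced with capped exponential backoff, so short batches return quickly while long batches don't hammer the API. A self-contained sketch; `check` stands in for `is_batch_complete`, and `fake_check` simulates a batch that completes on the third poll:

```python
import time

# Poll with capped exponential backoff instead of a fixed sleep.
def wait_for_batch_with_backoff(check, initial=1.0, cap=60.0, max_wait=600.0):
    waited, delay = 0.0, initial
    while waited < max_wait:
        if check():
            return True
        step = min(delay, cap)
        time.sleep(step)
        waited += step
        delay *= 2  # double the wait each round, up to the cap
    return False

calls = {"n": 0}

def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

done = wait_for_batch_with_backoff(fake_check, initial=0.01)
```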
Streaming Pattern
When to use:
- Real-time processing required
- Documents arrive continuously
- Need immediate results
Implementation:
import queue
import threading

def streaming_processing_pattern():
    document_queue = queue.Queue()
    results_queue = queue.Queue()

    # Worker threads
    def worker():
        while True:
            doc = document_queue.get()
            if doc is None:
                break
            result = process_document(doc)
            results_queue.put(result)
            document_queue.task_done()

    # Start workers
    num_workers = 5
    threads = []
    for _ in range(num_workers):
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)

    # Feed documents
    for doc in incoming_documents():
        document_queue.put(doc)

    # Wait for completion
    document_queue.join()

    # Stop workers and wait for them to exit
    for _ in range(num_workers):
        document_queue.put(None)
    for t in threads:
        t.join()

    return collect_results(results_queue)
Benefits:
- Low latency
- Handles variable load
- Scalable
Cascade Pattern
When to use:
- Want to optimize cost and quality
- Can tolerate two-pass processing
- Have clear quality thresholds
Implementation:
def cascade_processing_pattern(document):
    # First pass: GPT-4o mini (fast and cheap)
    result_mini = process_document(
        document,
        model="gpt-4o-mini",
        analysis_depth="standard"
    )
    # Check quality
    if result_mini["confidence"] > 0.85:
        # High confidence: use mini result
        return result_mini
    else:
        # Low confidence: reprocess with GPT-4o
        result_gpt4 = process_document(
            document,
            model="gpt-4o",
            analysis_depth="deep"
        )
        return result_gpt4
Benefits:
- Optimal cost-quality balance
- 80-90% of docs use cheaper model
- High quality when needed
Best Practices
Quality Assurance
1. Implement validation rules
def validate_extraction(extracted_data, document_type):
    rules = get_validation_rules(document_type)
    errors = []
    for rule in rules:
        if not rule.validate(extracted_data):
            errors.append({
                "rule": rule.name,
                "message": rule.error_message
            })
    return len(errors) == 0, errors
2. Manual spot checks
- Review 5-10% of processed documents randomly
- Focus on high-value or high-risk documents
- Track accuracy over time
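The random spot-check can be a seeded sample draw over processed document IDs, so the review set is reproducible for audits. A sketch:

```python
import random

# Sketch of the 5-10% random spot-check. Seeding makes the draw
# reproducible for audit purposes.
def spot_check_sample(doc_ids, rate=0.05, seed=42):
    k = max(1, int(len(doc_ids) * rate))
    rng = random.Random(seed)
    return rng.sample(doc_ids, k)

sample = spot_check_sample([f"doc-{i}" for i in range(200)], rate=0.05)
```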
3. Confidence scoring
def calculate_confidence(analysis):
    factors = {
        "entity_count": len(analysis["entities"]),
        "clear_structure": has_clear_structure(analysis["text"]),
        "ambiguity_score": calculate_ambiguity(analysis["text"]),
        "model_certainty": analysis["model_metadata"]["certainty"]
    }
    confidence = compute_confidence(factors)
    return confidence
Performance Optimization
1. Choose appropriate analysis depth
- Quick Extract: Text only, no analysis (fastest)
- Standard: Summary + key points (balanced)
- Deep: Full analysis (slowest, most comprehensive)
2. Optimize model selection
- Use mini by default
- Use GPT-4o selectively
- Implement cascade strategy
3. Enable vector storage selectively
Only enable if using semantic search:
enable_vectors = use_case in ["legal_discovery", "research_synthesis"]
Cost Management
1. Monitor usage
def monitor_costs(budget_threshold):
    usage = get_monthly_usage()
    metrics = {
        "documents_processed": usage["total_documents"],
        "model_costs": usage["model_costs"],
        "cost_per_document": usage["model_costs"] / usage["total_documents"],
        "projected_monthly": usage["model_costs"] / usage["days_elapsed"] * 30
    }
    if metrics["projected_monthly"] > budget_threshold:
        alert_team("Projected costs exceed budget")
    return metrics
2. Optimize processing
- Use Quick Extract when appropriate
- Disable vectors if not needed
- Batch process during off-peak hours
3. Track ROI
def calculate_roi(annual_alactic_cost, labor_cost_saved, error_cost_avoided,
                  time_to_value_benefit):
    # All inputs are annual figures; Alactic cost covers infrastructure
    # plus model usage. Benefits: labor saved, error costs avoided, and
    # the value of faster turnaround.
    total_benefit = labor_cost_saved + error_cost_avoided + time_to_value_benefit
    roi = (total_benefit - annual_alactic_cost) / annual_alactic_cost * 100
    return {
        "roi_percentage": roi,
        "monthly_savings": (total_benefit - annual_alactic_cost) / 12,
        "payback_period_months": annual_alactic_cost / (total_benefit / 12)
    }