
Document Intelligence and Analysis

Transform unstructured documents into actionable intelligence with AI-powered analysis. This guide covers document intelligence use cases, implementation patterns, and best practices for extracting maximum value from your documents.

Overview

What is Document Intelligence?

Document intelligence combines multiple AI capabilities to understand, analyze, and extract insights from documents:

  • Text Extraction: Convert PDFs to machine-readable text
  • Entity Recognition: Identify people, companies, dates, amounts
  • Relationship Extraction: Understand connections between entities
  • Sentiment Analysis: Gauge tone and sentiment
  • Summarization: Generate concise summaries
  • Classification: Categorize documents automatically
  • Question Answering: Answer questions about document content

Common Use Cases

Invoice Processing

Challenge:

  • Manual data entry from hundreds of invoices monthly
  • Prone to errors and delays
  • Expensive labor costs
  • Difficult to track payment schedules

Solution with Alactic:

Step 1: Upload invoices in batch

import requests

def process_invoice_batch(invoice_files):
    files = [("files[]", open(f, "rb")) for f in invoice_files]

    response = requests.post(
        "https://your-vm-ip/api/v1/batch",
        headers={"X-Deployment-Key": "ak-xxxxx"},
        files=files,
        data={
            "model": "gpt-4o-mini",
            "analysis_depth": "standard"
        }
    )

    return response.json()["batch_id"]

Step 2: Extract structured data

Configure extraction to focus on key invoice fields:

  • Vendor name and address
  • Invoice number and date
  • Due date
  • Line items (description, quantity, unit price, total)
  • Subtotal, tax, total amount
  • Payment terms
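The field list above can be packaged as an extraction specification and reused across batches. A minimal sketch of building that request payload (the "extract" key and the field names are illustrative assumptions, not the exact Alactic API schema):

```python
# Key invoice fields to target; the names are illustrative
# assumptions, not a documented schema.
INVOICE_FIELDS = [
    "vendor_name", "vendor_address",
    "invoice_number", "invoice_date", "due_date",
    "line_items", "subtotal", "tax", "total_amount",
    "payment_terms",
]

def build_invoice_extraction(model="gpt-4o-mini", analysis_depth="standard"):
    """Assemble the request options for focused invoice extraction."""
    return {
        "model": model,
        "analysis_depth": analysis_depth,
        "extract": INVOICE_FIELDS,
    }
```

Keeping the field list in one place means every batch asks for the same schema, which makes downstream validation and export predictable.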

Step 3: Export to accounting system

def export_to_accounting(invoice_data):
    structured_data = {
        "vendor": invoice_data["entities"]["company"][0],
        "invoice_number": extract_invoice_number(invoice_data),
        "date": invoice_data["entities"]["date"][0],
        "amount": invoice_data["entities"]["money"][0],
        "line_items": parse_line_items(invoice_data["text"])
    }

    # Export to QuickBooks, Xero, etc.
    accounting_api.create_bill(structured_data)

Results:

  • Processing time: 5 minutes vs 2 hours manually
  • Accuracy: 98%+ with GPT-4o
  • Cost savings: $1,500/month in labor
  • Faster payment processing

Best Practices:

  • Use GPT-4o mini for simple invoices (cost-effective)
  • Use GPT-4o for complex invoices with multiple line items
  • Implement validation rules for critical fields
  • Review flagged invoices manually
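The validation and manual-review practices above can be combined in a small gate that decides whether an invoice exports automatically. A sketch, assuming the field names from the export step; the invoice-number pattern and date format are example conventions to adapt to your vendors:

```python
import re
from datetime import datetime

def validate_invoice(inv):
    """Return a list of problems with an extracted invoice.

    An empty list means the invoice passed all checks and can be
    exported automatically; anything else routes to manual review.
    """
    errors = []
    if not inv.get("vendor"):
        errors.append("missing vendor")
    # Example convention: alphanumeric invoice numbers, 4+ characters
    if not re.fullmatch(r"[A-Z0-9-]{4,}", inv.get("invoice_number", "")):
        errors.append("invoice_number format")
    try:
        issued = datetime.strptime(inv.get("date", ""), "%Y-%m-%d")
        due = datetime.strptime(inv.get("due_date", ""), "%Y-%m-%d")
        if due < issued:
            errors.append("due date precedes invoice date")
    except ValueError:
        errors.append("unparseable date")
    if inv.get("amount", 0) <= 0:
        errors.append("non-positive amount")
    return errors
```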

Contract Analysis

Challenge:

  • Hundreds of contracts to review for M&A due diligence
  • Identify key terms, obligations, and risks
  • Manual review: 1-2 hours per contract
  • Expensive legal fees

Solution with Alactic:

Step 1: Process contract portfolio

import os

def analyze_contracts(contract_folder):
    contracts = os.listdir(contract_folder)

    for contract in contracts:
        result = process_document(
            file_path=f"{contract_folder}/{contract}",
            model="gpt-4o",  # Use GPT-4o for accuracy
            analysis_depth="deep",  # Get detailed analysis
            custom_prompt="Extract: parties, effective date, term length, "
                          "termination clauses, payment terms, obligations, "
                          "indemnification, liability limits, governing law"
        )

        save_to_database(contract, result)

Step 2: Extract key provisions

Focus on critical contract elements:

  • Parties: Who are the contracting parties?
  • Term: Contract duration and renewal terms
  • Payment terms: Amounts, schedules, payment methods
  • Obligations: What must each party do?
  • Termination: Conditions for ending contract
  • Liability: Liability caps and limitations
  • Indemnification: Who protects whom from what?
  • Governing law: Which jurisdiction applies?
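A simple completeness check over these provisions helps flag contracts where the extraction (or the contract itself) is missing a critical element. A sketch, assuming the extraction result is a dict keyed by the provision names above (an illustrative shape, not the exact response format):

```python
# The critical provisions listed above, as illustrative dict keys.
REQUIRED_PROVISIONS = [
    "parties", "term", "payment_terms", "obligations",
    "termination", "liability", "indemnification", "governing_law",
]

def find_missing_provisions(extracted):
    """Return the provisions absent or empty in an extraction result."""
    return [p for p in REQUIRED_PROVISIONS if not extracted.get(p)]
```

A contract with no termination clause, for example, shows up here before the risk scoring in Step 3 ever runs.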

Step 3: Risk assessment

def assess_contract_risk(contract_analysis):
    risk_factors = {
        "unlimited_liability": "high",
        "auto_renewal": "medium",
        "long_term": "medium",
        "broad_indemnification": "high",
        "no_termination_clause": "high"
    }

    identified_risks = []
    for risk, severity in risk_factors.items():
        if check_for_risk(contract_analysis, risk):
            identified_risks.append({
                "risk": risk,
                "severity": severity,
                "details": extract_details(contract_analysis, risk)
            })

    return identified_risks

Step 4: Generate summary report

def generate_contract_summary(contracts_data):
    summary = {
        "total_contracts": len(contracts_data),
        "total_value": sum(c["payment_terms"]["total"] for c in contracts_data),
        "high_risk_contracts": [c for c in contracts_data if c["risk_level"] == "high"],
        "expiring_soon": [c for c in contracts_data if c["days_until_expiry"] < 90],
        "key_obligations": aggregate_obligations(contracts_data),
        "liability_exposure": calculate_liability(contracts_data)
    }

    return summary

Results:

  • Review time: 50 contracts in 2 hours vs 100 hours manually
  • Cost savings: $50,000 in legal fees
  • Better risk identification
  • Faster deal closure

Best Practices:

  • Always use GPT-4o for legal contracts (accuracy critical)
  • Use Deep Analysis for comprehensive extraction
  • Have legal team review AI-flagged risks
  • Build template for recurring contract types

Research Paper Analysis

Challenge:

  • Literature review requires reading 500+ papers
  • Identify relevant studies and key findings
  • Manual process: 3-6 months
  • Risk of missing important research

Solution with Alactic:

Step 1: Collect papers

import requests
from bs4 import BeautifulSoup

def scrape_pubmed(search_query, num_papers=500):
    papers = []

    # Search PubMed (pubmed_api and get_paper_pdf_url are helper wrappers)
    search_results = pubmed_api.search(search_query, max_results=num_papers)

    for paper_id in search_results:
        # Get PDF URL
        pdf_url = get_paper_pdf_url(paper_id)
        papers.append(pdf_url)

    return papers

Step 2: Process with Alactic

def analyze_research_papers(paper_urls):
    results = []

    # Process in batches of 50
    for i in range(0, len(paper_urls), 50):
        batch = paper_urls[i:i+50]

        batch_result = requests.post(
            "https://your-vm-ip/api/v1/batch",
            headers={"X-Deployment-Key": "ak-xxxxx"},
            json={
                "urls": batch,
                "model": "gpt-4o",
                "analysis_depth": "deep",
                "extract": [
                    "research_question",
                    "methodology",
                    "sample_size",
                    "key_findings",
                    "limitations",
                    "conclusions"
                ]
            }
        )

        results.extend(wait_for_batch(batch_result.json()["batch_id"]))

    return results

Step 3: Extract structured data

For each paper, extract:

  • Research question: What did they investigate?
  • Methodology: How did they conduct the study?
  • Sample size: How many participants/data points?
  • Key findings: What did they discover?
  • Statistical significance: P-values, confidence intervals
  • Limitations: Study weaknesses
  • Conclusions: What do results mean?
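Some of these fields benefit from a deterministic fallback when the model leaves them empty. Sample size, for instance, can often be recovered from the abstract with a simple pattern. A heuristic sketch (not part of the Alactic API):

```python
import re

def parse_sample_size(text):
    """Pull the first participant count from phrases like 'n = 120';
    returns None when no such count appears."""
    match = re.search(r"\bn\s*=\s*(\d[\d,]*)", text, flags=re.IGNORECASE)
    if not match:
        return None
    return int(match.group(1).replace(",", ""))
```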

Step 4: Synthesize findings

def synthesize_literature(papers_data):
synthesis = {
"total_papers": len(papers_data),
"date_range": get_date_range(papers_data),
"common_methodologies": count_methodologies(papers_data),
"key_themes": extract_themes(papers_data),
"consistent_findings": find_consensus(papers_data),
"contradictions": find_contradictions(papers_data),
"research_gaps": identify_gaps(papers_data),
"meta_analysis_data": prepare_meta_analysis(papers_data)
}

return synthesis

Results:

  • Analysis time: 2 weeks vs 6 months
  • More comprehensive coverage
  • Quantitative synthesis possible
  • Better identification of research gaps

Best Practices:

  • Use GPT-4o for research papers (complex content)
  • Enable Deep Analysis for full extraction
  • Validate key findings against paper text
  • Have domain expert review synthesis

Financial Report Analysis

Challenge:

  • Analyze quarterly reports from 100+ companies
  • Extract financial metrics and trends
  • Identify investment opportunities or risks
  • Manual analysis: Full-time job

Solution with Alactic:

Step 1: Collect financial reports

def collect_earnings_reports(companies, quarter):
    reports = []

    for company in companies:
        # Get 10-Q or earnings report URL
        report_url = get_sec_filing(company, "10-Q", quarter)
        reports.append({
            "company": company,
            "url": report_url,
            "quarter": quarter
        })

    return reports

Step 2: Process reports

def analyze_financial_reports(reports):
    results = []

    for report in reports:
        analysis = process_document(
            url=report["url"],
            model="gpt-4o",
            analysis_depth="deep",
            extract=[
                "revenue",
                "net_income",
                "operating_margin",
                "eps",
                "guidance",
                "key_metrics",
                "risks_mentioned",
                "management_commentary"
            ]
        )

        results.append({
            "company": report["company"],
            "quarter": report["quarter"],
            "financials": analysis["entities"]["money"],
            "metrics": analysis["key_metrics"],
            "sentiment": analysis["sentiment"],
            "risks": analysis["risks_mentioned"]
        })

    return results

Step 3: Comparative analysis

import pandas as pd

def compare_companies(companies_data):
    comparison = pd.DataFrame(companies_data)

    # Calculate metrics
    comparison["revenue_growth"] = calculate_growth(comparison, "revenue")
    comparison["margin_trend"] = calculate_trend(comparison, "operating_margin")
    comparison["sentiment_score"] = normalize_sentiment(comparison, "sentiment")

    # Rank companies
    comparison["investment_score"] = (
        comparison["revenue_growth"] * 0.3 +
        comparison["margin_trend"] * 0.3 +
        comparison["sentiment_score"] * 0.2 +
        comparison["guidance_strength"] * 0.2
    )

    return comparison.sort_values("investment_score", ascending=False)

Results:

  • Analysis time: 2 days vs continuous monitoring
  • Identify opportunities faster
  • Data-driven investment decisions
  • Better risk management

Best Practices:

  • Use GPT-4o for financial reports (precision matters)
  • Cross-reference extracted numbers with tables
  • Track metrics over multiple quarters
  • Combine with quantitative data
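Cross-referencing extracted numbers can be automated as a tolerance check: compare each AI-extracted figure against the value read from the report's own tables and flag any mismatch for review. A minimal sketch (the 0.5% relative tolerance is an example threshold):

```python
def reconcile(extracted, table_value, tolerance=0.005):
    """True when an extracted figure matches the tabulated value
    within a relative tolerance; zero requires an exact match."""
    if table_value == 0:
        return extracted == 0
    return abs(extracted - table_value) / abs(table_value) <= tolerance

def flag_mismatches(pairs, tolerance=0.005):
    """Return metric names whose extracted and tabulated values disagree.

    `pairs` maps metric name -> (extracted, table_value).
    """
    return [name for name, (ext, tab) in pairs.items()
            if not reconcile(ext, tab, tolerance)]
```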

Legal Document Discovery (E-Discovery)

Challenge:

  • E-discovery with 100,000+ documents
  • Identify relevant documents for litigation
  • Manual review: $5-10 million
  • Timeline: 12-18 months

Solution with Alactic:

Step 1: Initial processing

def process_discovery_documents(document_folder):
    documents = collect_all_files(document_folder)

    # Process in large batches
    batch_size = 500  # Enterprise plan capacity

    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]

        result = process_batch(
            files=batch,
            model="gpt-4o-mini",  # Cost-effective for initial pass
            analysis_depth="standard",
            enable_vectors=True  # Enable semantic search
        )

        store_results(result)

Step 2: Relevance scoring

def score_relevance(document, case_keywords):
    # Get document analysis
    analysis = get_document_analysis(document["id"])

    # Calculate relevance score
    keyword_matches = count_keyword_matches(analysis["text"], case_keywords)
    entity_relevance = check_relevant_entities(analysis["entities"], case_entities)
    semantic_similarity = calculate_semantic_similarity(
        analysis["summary"],
        case_description
    )

    relevance_score = (
        keyword_matches * 0.3 +
        entity_relevance * 0.4 +
        semantic_similarity * 0.3
    )

    return relevance_score

Step 3: Prioritized review

def prioritize_for_review(documents_data, threshold=0.7):
    # Score all documents
    scored_docs = [
        {
            "doc": doc,
            "relevance": score_relevance(doc, case_keywords)
        }
        for doc in documents_data
    ]

    # Sort by relevance
    scored_docs.sort(key=lambda x: x["relevance"], reverse=True)

    # High relevance: manual review required
    high_relevance = [d for d in scored_docs if d["relevance"] > threshold]

    # Medium relevance: spot check
    medium_relevance = [d for d in scored_docs if 0.4 < d["relevance"] <= threshold]

    # Low relevance: can likely exclude
    low_relevance = [d for d in scored_docs if d["relevance"] <= 0.4]

    return {
        "high": high_relevance,
        "medium": medium_relevance,
        "low": low_relevance,
        "review_reduction": len(low_relevance) / len(scored_docs) * 100
    }

Step 4: Hot document identification

def identify_hot_documents(documents_data):
    hot_docs = []

    # Look for smoking-gun indicators; specific_date_ranges and
    # key_people_communication are case-specific lists defined elsewhere
    indicators = [
        "confidential",
        "do not distribute",
        "privileged",
        "attorney-client",
        "destroy after reading",
        specific_date_ranges,
        key_people_communication
    ]

    for doc in documents_data:
        hot_score = calculate_hot_score(doc, indicators)
        if hot_score > 0.8:
            hot_docs.append({
                "document": doc,
                "hot_score": hot_score,
                "reasons": explain_hot_score(doc, indicators)
            })

    return hot_docs

Results:

  • Review set reduced by 90% (10,000 vs 100,000 docs)
  • Cost: $500K vs $5M
  • Timeline: 3 months vs 18 months
  • Better case outcomes with AI-identified hot docs

Best Practices:

  • Use GPT-4o mini for the initial pass (cost control)
  • Use GPT-4o for high-relevance documents
  • Enable vector search for semantic queries
  • Have legal team validate high-relevance docs
  • Use cascading review strategy

Medical Record Processing

Challenge:

  • Extract patient information from unstructured medical records
  • Identify diagnoses, medications, allergies, procedures
  • Manual extraction: 30 minutes per record
  • Error-prone and inconsistent

Solution with Alactic:

Step 1: HIPAA-compliant deployment

Ensure compliance:

  • Enterprise plan with BAA
  • Data encryption at rest and in transit
  • Access controls and audit logging
  • Data retention policies

Step 2: Process medical records

def process_medical_record(record_pdf):
    result = process_document(
        file_path=record_pdf,
        model="gpt-4o",  # Accuracy critical for medical data
        analysis_depth="deep",
        extract=[
            "patient_demographics",
            "diagnoses",
            "medications",
            "allergies",
            "procedures",
            "lab_results",
            "vital_signs",
            "medical_history",
            "treatment_plan"
        ]
    )

    return structure_medical_data(result)

Step 3: Extract structured medical data

def structure_medical_data(analysis):
    structured = {
        "patient": {
            "name": extract_patient_name(analysis),
            "dob": extract_date_of_birth(analysis),
            "mrn": extract_medical_record_number(analysis)
        },
        "diagnoses": [
            {
                "condition": d["value"],
                "icd_code": map_to_icd10(d["value"]),
                "date": d["date"]
            }
            for d in analysis["entities"]["medical_condition"]
        ],
        "medications": [
            {
                "name": m["value"],
                "dosage": extract_dosage(m),
                "frequency": extract_frequency(m),
                "start_date": m["date"]
            }
            for m in analysis["entities"]["medication"]
        ],
        "allergies": [
            {
                "allergen": a["value"],
                "reaction": extract_reaction(a),
                "severity": classify_severity(a)
            }
            for a in analysis["entities"]["allergy"]
        ]
    }

    return structured

Step 4: Validate and store

def validate_medical_data(structured_data):
    # Validation rules
    validations = [
        validate_mrn_format(structured_data["patient"]["mrn"]),
        validate_icd_codes(structured_data["diagnoses"]),
        check_medication_interactions(structured_data["medications"]),
        verify_allergy_severity(structured_data["allergies"]),
        ensure_completeness(structured_data)
    ]

    if all(validations):
        store_in_ehr(structured_data)
        return True
    else:
        flag_for_manual_review(structured_data, validations)
        return False

Results:

  • Processing time: 2 minutes vs 30 minutes per record
  • Accuracy: 99%+ for critical fields
  • Cost savings: $25 per record
  • Improved data accessibility for care teams

Best Practices:

  • Always use GPT-4o for medical records (accuracy critical)
  • Implement strict validation rules
  • Have medical staff review AI-extracted data
  • Maintain audit trail for all extractions
  • Ensure HIPAA compliance at all stages

Implementation Patterns

Batch Processing Pattern

When to use:

  • Processing large volumes of similar documents
  • Not time-sensitive (can process overnight)
  • Want to maximize efficiency

Implementation:

import time

def batch_processing_pattern(documents, batch_size=50):
    results = []

    # Split into batches
    batches = [documents[i:i+batch_size] for i in range(0, len(documents), batch_size)]

    for batch in batches:
        # Submit batch
        batch_id = submit_batch(batch)

        # Wait for completion
        while not is_batch_complete(batch_id):
            time.sleep(10)

        # Collect results
        batch_results = get_batch_results(batch_id)
        results.extend(batch_results)

        # Clean up (optional)
        cleanup_batch(batch_id)

    return results

Benefits:

  • Efficient API usage
  • Lower cost per document
  • Easier error handling

Streaming Pattern

When to use:

  • Real-time processing required
  • Documents arrive continuously
  • Need immediate results

Implementation:

import queue
import threading

def streaming_processing_pattern():
    document_queue = queue.Queue()
    results_queue = queue.Queue()

    # Worker threads
    def worker():
        while True:
            doc = document_queue.get()
            if doc is None:
                break

            result = process_document(doc)
            results_queue.put(result)
            document_queue.task_done()

    # Start workers
    num_workers = 5
    threads = []
    for _ in range(num_workers):
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)

    # Feed documents
    for doc in incoming_documents():
        document_queue.put(doc)

    # Wait for completion
    document_queue.join()

    # Stop workers
    for _ in range(num_workers):
        document_queue.put(None)

    return collect_results(results_queue)

Benefits:

  • Low latency
  • Handles variable load
  • Scalable

Cascade Pattern

When to use:

  • Want to optimize cost and quality
  • Can tolerate two-pass processing
  • Have clear quality thresholds

Implementation:

def cascade_processing_pattern(document):
    # First pass: GPT-4o mini (fast and cheap)
    result_mini = process_document(
        document,
        model="gpt-4o-mini",
        analysis_depth="standard"
    )

    # Check quality
    if result_mini["confidence"] > 0.85:
        # High confidence: use mini result
        return result_mini
    else:
        # Low confidence: reprocess with GPT-4o
        result_gpt4 = process_document(
            document,
            model="gpt-4o",
            analysis_depth="deep"
        )
        return result_gpt4

Benefits:

  • Optimal cost-quality balance
  • 80-90% of docs use cheaper model
  • High quality when needed

Best Practices

Quality Assurance

1. Implement validation rules

def validate_extraction(extracted_data, document_type):
    rules = get_validation_rules(document_type)

    errors = []
    for rule in rules:
        if not rule.validate(extracted_data):
            errors.append({
                "rule": rule.name,
                "message": rule.error_message
            })

    return len(errors) == 0, errors

2. Manual spot checks

  • Review 5-10% of processed documents randomly
  • Focus on high-value or high-risk documents
  • Track accuracy over time
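The 5-10% sample is easy to draw reproducibly, so the same audit set can be re-examined later when tracking accuracy over time. A sketch:

```python
import random

def sample_for_review(doc_ids, rate=0.05, seed=42):
    """Draw a reproducible random sample of processed documents
    (5% by default) for manual spot checking."""
    rng = random.Random(seed)
    k = max(1, round(len(doc_ids) * rate))
    return rng.sample(doc_ids, k)
```

Fixing the seed means the audit set is stable across runs; vary it per review cycle to cover different documents.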

3. Confidence scoring

def calculate_confidence(analysis):
    factors = {
        "entity_count": len(analysis["entities"]),
        "clear_structure": has_clear_structure(analysis["text"]),
        "ambiguity_score": calculate_ambiguity(analysis["text"]),
        "model_certainty": analysis["model_metadata"]["certainty"]
    }

    confidence = compute_confidence(factors)
    return confidence

Performance Optimization

1. Choose appropriate analysis depth

  • Quick Extract: Text only, no analysis (fastest)
  • Standard: Summary + key points (balanced)
  • Deep: Full analysis (slowest, most comprehensive)
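A small routing helper keeps the depth choice consistent across pipelines. A sketch, assuming document types are known up front; the type names and the "quick_extract" value are illustrative, not a fixed taxonomy:

```python
# Document types where full analysis is worth the extra latency
# (illustrative names, not a fixed taxonomy).
DEEP_TYPES = {"contract", "medical_record", "research_paper"}
TEXT_ONLY_TYPES = {"receipt", "simple_form"}

def choose_depth(doc_type, force_deep=False):
    """Map a document type to an analysis depth per the trade-offs above."""
    if force_deep or doc_type in DEEP_TYPES:
        return "deep"
    if doc_type in TEXT_ONLY_TYPES:
        return "quick_extract"
    return "standard"
```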

2. Optimize model selection

  • Use GPT-4o mini by default
  • Use GPT-4o selectively
  • Implement cascade strategy

3. Enable vector storage selectively

Only enable if using semantic search:

enable_vectors = use_case in ["legal_discovery", "research_synthesis"]

Cost Management

1. Monitor usage

def monitor_costs():
    usage = get_monthly_usage()

    metrics = {
        "documents_processed": usage["total_documents"],
        "model_costs": usage["model_costs"],
        "cost_per_document": usage["model_costs"] / usage["total_documents"],
        "projected_monthly": usage["model_costs"] / usage["days_elapsed"] * 30
    }

    if metrics["projected_monthly"] > budget_threshold:
        alert_team("Projected costs exceed budget")

    return metrics

2. Optimize processing

  • Use Quick Extract when appropriate
  • Disable vectors if not needed
  • Batch process during off-peak hours

3. Track ROI

def calculate_roi(documents_processed, processing_time_saved):
    # Cost
    alactic_cost = infrastructure_cost + model_costs

    # Benefit
    labor_cost_saved = processing_time_saved * hourly_labor_rate
    error_cost_avoided = error_rate_reduction * cost_per_error
    time_to_value_benefit = faster_processing * value_per_day

    total_benefit = labor_cost_saved + error_cost_avoided + time_to_value_benefit

    roi = (total_benefit - alactic_cost) / alactic_cost * 100

    return {
        "roi_percentage": roi,
        "monthly_savings": total_benefit - alactic_cost,
        "payback_period_months": alactic_cost / (total_benefit / 12)
    }