Document Intelligence and Analysis
Transform unstructured documents into actionable intelligence with AI-powered analysis. This guide covers document intelligence use cases, implementation patterns, and best practices for extracting maximum value from your documents.
Overview
What is Document Intelligence?
Document intelligence combines multiple AI capabilities to understand, analyze, and extract insights from documents:
- Text Extraction: Convert PDFs to machine-readable text
- Entity Recognition: Identify people, companies, dates, amounts
- Relationship Extraction: Understand connections between entities
- Sentiment Analysis: Gauge tone and sentiment
- Summarization: Generate concise summaries
- Classification: Categorize documents automatically
- Question Answering: Answer questions about document content
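In practice these capabilities surface together in one analysis result per document. A minimal sketch of what such a combined payload might look like, and how question answering can route over it (the field names here are illustrative assumptions, not Alactic's documented response schema):

```python
# Illustrative shape of a combined document-intelligence result.
# Field names are assumptions for this sketch, not a documented schema.
sample_analysis = {
    "text": "Invoice #1042 from Acme Corp, due 2024-03-01 for $5,000.",
    "entities": {
        "company": ["Acme Corp"],
        "date": ["2024-03-01"],
        "money": ["$5,000"],
    },
    "sentiment": {"label": "neutral", "score": 0.1},
    "summary": "Invoice from Acme Corp for $5,000, due March 1, 2024.",
    "classification": "invoice",
}

def answer_question(analysis, question):
    """Toy question answering: route questions to extracted entities."""
    q = question.lower()
    if "who" in q:
        return analysis["entities"]["company"][0]
    if "when" in q or "due" in q:
        return analysis["entities"]["date"][0]
    return analysis["summary"]

print(answer_question(sample_analysis, "Who sent this invoice?"))
```

A production system would route questions with the model itself rather than keyword rules; the sketch only shows how the extracted pieces fit together.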
Common Use Cases
Invoice Processing
Challenge:
- Manual data entry from hundreds of invoices monthly
- Prone to errors and delays
- Expensive labor costs
- Difficult to track payment schedules
Solution with Alactic:
Step 1: Upload invoices in batch
import requests

def process_invoice_batch(invoice_files):
    # Open files for upload and make sure handles are closed afterwards
    handles = [open(f, "rb") for f in invoice_files]
    try:
        response = requests.post(
            "https://your-vm-ip/api/v1/batch",
            headers={"X-Deployment-Key": "ak-xxxxx"},
            files=[("files[]", h) for h in handles],
            data={
                "model": "gpt-4o-mini",
                "analysis_depth": "standard"
            }
        )
    finally:
        for h in handles:
            h.close()
    return response.json()["batch_id"]
Step 2: Extract structured data
Configure extraction to focus on key invoice fields:
- Vendor name and address
- Invoice number and date
- Due date
- Line items (description, quantity, unit price, total)
- Subtotal, tax, total amount
- Payment terms
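When a field slips through model extraction, lightweight regex fallbacks over the raw text can backstop it. A sketch for two of the fields above (the patterns are illustrative; real invoices vary widely):

```python
import re

# Regex fallbacks for invoice number and due date over raw extracted text.
# Patterns are illustrative assumptions, not a complete parser.
def extract_invoice_number(text):
    match = re.search(
        r"invoice\s*(?:#|no\.?|number)?\s*[:\-]?\s*([A-Z0-9\-]+)",
        text, re.IGNORECASE)
    return match.group(1) if match else None

def extract_due_date(text):
    # Matches ISO-style dates only; extend for locale-specific formats
    match = re.search(
        r"due\s*(?:date)?\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})",
        text, re.IGNORECASE)
    return match.group(1) if match else None

text = "Invoice #INV-2041 ... Due date: 2024-07-15"
print(extract_invoice_number(text), extract_due_date(text))
```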
Step 3: Export to accounting system
def export_to_accounting(invoice_data):
    structured_data = {
        "vendor": invoice_data["entities"]["company"][0],
        "invoice_number": extract_invoice_number(invoice_data),
        "date": invoice_data["entities"]["date"][0],
        "amount": invoice_data["entities"]["money"][0],
        "line_items": parse_line_items(invoice_data["text"])
    }
    # Export to QuickBooks, Xero, etc.
    accounting_api.create_bill(structured_data)
Results:
- Processing time: 5 minutes vs 2 hours manually
- Accuracy: 98%+ with GPT-4o
- Cost savings: $1,500/month in labor
- Faster payment processing
Best Practices:
- Use GPT-4o mini for simple invoices (cost-effective)
- Use GPT-4o for complex invoices with multiple line items
- Implement validation rules for critical fields
- Review flagged invoices manually
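The validation-rule practice above can be as simple as cross-checking arithmetic and required fields before export. A sketch, with assumed field names matching the extraction step:

```python
# Sketch of invoice validation: required fields plus an arithmetic
# cross-check. Field names mirror the extraction step and are assumptions.
def validate_invoice(inv, tolerance=0.01):
    errors = []
    for field in ("vendor", "invoice_number", "date", "amount"):
        if not inv.get(field):
            errors.append(f"missing field: {field}")
    # Totals must add up: subtotal + tax == total, within rounding tolerance
    if inv.get("subtotal") is not None and inv.get("tax") is not None:
        expected = inv["subtotal"] + inv["tax"]
        if abs(expected - inv.get("total", 0)) > tolerance:
            errors.append(f"total {inv.get('total')} != subtotal+tax {expected}")
    return (len(errors) == 0, errors)

ok, errs = validate_invoice({
    "vendor": "Acme Corp", "invoice_number": "INV-2041",
    "date": "2024-07-01", "amount": 1100.0,
    "subtotal": 1000.0, "tax": 100.0, "total": 1100.0,
})
print(ok, errs)
```

Invoices that fail any rule go to the manual-review queue rather than straight to the accounting system.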
Contract Analysis
Challenge:
- Hundreds of contracts to review for M&A due diligence
- Identify key terms, obligations, and risks
- Manual review: 1-2 hours per contract
- Expensive legal fees
Solution with Alactic:
Step 1: Process contract portfolio
import os

def analyze_contracts(contract_folder):
    contracts = os.listdir(contract_folder)
    for contract in contracts:
        result = process_document(
            file_path=f"{contract_folder}/{contract}",
            model="gpt-4o",  # Use GPT-4o for accuracy
            analysis_depth="deep",  # Get detailed analysis
            custom_prompt="Extract: parties, effective date, term length, "
                          "termination clauses, payment terms, obligations, "
                          "indemnification, liability limits, governing law"
        )
        save_to_database(contract, result)
Step 2: Extract key provisions
Focus on critical contract elements:
- Parties: Who are the contracting parties?
- Term: Contract duration and renewal terms
- Payment terms: Amounts, schedules, payment methods
- Obligations: What must each party do?
- Termination: Conditions for ending contract
- Liability: Liability caps and limitations
- Indemnification: Who protects whom from what?
- Governing law: Which jurisdiction applies?
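One way to feed this checklist into the `custom_prompt` parameter is to generate the prompt from the provision list, so recurring contract types can share a template. A small sketch (the prompt wording is an assumption):

```python
# Sketch: turning the provision checklist above into a custom extraction
# prompt. The wording is illustrative; tune it per contract type.
PROVISIONS = [
    "parties", "term and renewal", "payment terms", "obligations",
    "termination conditions", "liability caps", "indemnification",
    "governing law",
]

def build_contract_prompt(provisions):
    lines = [f"- {p}" for p in provisions]
    return ("Extract the following provisions from the contract:\n"
            + "\n".join(lines))

prompt = build_contract_prompt(PROVISIONS)
```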
Step 3: Risk assessment
def assess_contract_risk(contract_analysis):
    risk_factors = {
        "unlimited_liability": "high",
        "auto_renewal": "medium",
        "long_term": "medium",
        "broad_indemnification": "high",
        "no_termination_clause": "high"
    }
    identified_risks = []
    for risk, severity in risk_factors.items():
        if check_for_risk(contract_analysis, risk):
            identified_risks.append({
                "risk": risk,
                "severity": severity,
                "details": extract_details(contract_analysis, risk)
            })
    return identified_risks
Step 4: Generate summary report
def generate_contract_summary(contracts_data):
    summary = {
        "total_contracts": len(contracts_data),
        "total_value": sum(c["payment_terms"]["total"] for c in contracts_data),
        "high_risk_contracts": [c for c in contracts_data if c["risk_level"] == "high"],
        "expiring_soon": [c for c in contracts_data if c["days_until_expiry"] < 90],
        "key_obligations": aggregate_obligations(contracts_data),
        "liability_exposure": calculate_liability(contracts_data)
    }
    return summary
Results:
- Review time: 50 contracts in 2 hours vs 100 hours manually
- Cost savings: $50,000 in legal fees
- Better risk identification
- Faster deal closure
Best Practices:
- Always use GPT-4o for legal contracts (accuracy critical)
- Use Deep Analysis for comprehensive extraction
- Have legal team review AI-flagged risks
- Build template for recurring contract types
Research Paper Analysis
Challenge:
- Literature review requires reading 500+ papers
- Identify relevant studies and key findings
- Manual process: 3-6 months
- Risk of missing important research
Solution with Alactic:
Step 1: Collect papers
def scrape_pubmed(search_query, num_papers=500):
    papers = []
    # Search PubMed via a client wrapper (pubmed_api is assumed here)
    search_results = pubmed_api.search(search_query, max_results=num_papers)
    for paper_id in search_results:
        # Get PDF URL
        pdf_url = get_paper_pdf_url(paper_id)
        papers.append(pdf_url)
    return papers
Step 2: Process with Alactic
import requests

def analyze_research_papers(paper_urls):
    results = []
    # Process in batches of 50
    for i in range(0, len(paper_urls), 50):
        batch = paper_urls[i:i+50]
        response = requests.post(
            "https://your-vm-ip/api/v1/batch",
            headers={"X-Deployment-Key": "ak-xxxxx"},
            json={
                "urls": batch,
                "model": "gpt-4o",
                "analysis_depth": "deep",
                "extract": [
                    "research_question",
                    "methodology",
                    "sample_size",
                    "key_findings",
                    "limitations",
                    "conclusions"
                ]
            }
        )
        batch_id = response.json()["batch_id"]
        results.extend(wait_for_batch(batch_id))
    return results
Step 3: Extract structured data
For each paper, extract:
- Research question: What did they investigate?
- Methodology: How did they conduct the study?
- Sample size: How many participants/data points?
- Key findings: What did they discover?
- Statistical significance: P-values, confidence intervals
- Limitations: Study weaknesses
- Conclusions: What do results mean?
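Before the synthesis step, it is worth checking each paper's record against this field list and flagging incomplete extractions. A sketch, with the field names above as assumed record keys:

```python
# Sketch: completeness check for a per-paper extraction record.
# Field names mirror the list above and are assumptions.
REQUIRED_FIELDS = [
    "research_question", "methodology", "sample_size",
    "key_findings", "limitations", "conclusions",
]

def completeness(paper):
    """Return (fraction of fields present, list of missing fields)."""
    missing = [f for f in REQUIRED_FIELDS if not paper.get(f)]
    score = (len(REQUIRED_FIELDS) - len(missing)) / len(REQUIRED_FIELDS)
    return score, missing

score, missing = completeness({
    "research_question": "Does X affect Y?", "methodology": "RCT",
    "sample_size": 120, "key_findings": "X raised Y by 12%",
    "limitations": "", "conclusions": "Supports the hypothesis",
})
```

Papers scoring below a threshold can be reprocessed with Deep Analysis or routed to a human reviewer.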
Step 4: Synthesize findings
def synthesize_literature(papers_data):
    synthesis = {
        "total_papers": len(papers_data),
        "date_range": get_date_range(papers_data),
        "common_methodologies": count_methodologies(papers_data),
        "key_themes": extract_themes(papers_data),
        "consistent_findings": find_consensus(papers_data),
        "contradictions": find_contradictions(papers_data),
        "research_gaps": identify_gaps(papers_data),
        "meta_analysis_data": prepare_meta_analysis(papers_data)
    }
    return synthesis
Results:
- Analysis time: 2 weeks vs 6 months
- More comprehensive coverage
- Quantitative synthesis possible
- Better identification of research gaps
Best Practices:
- Use GPT-4o for research papers (complex content)
- Enable Deep Analysis for full extraction
- Validate key findings against paper text
- Have domain expert review synthesis
Financial Report Analysis
Challenge:
- Analyze quarterly reports from 100+ companies
- Extract financial metrics and trends
- Identify investment opportunities or risks
- Manual analysis: Full-time job
Solution with Alactic:
Step 1: Collect financial reports
def collect_earnings_reports(companies, quarter):
    reports = []
    for company in companies:
        # Get 10-Q or earnings report URL
        report_url = get_sec_filing(company, "10-Q", quarter)
        reports.append({
            "company": company,
            "url": report_url,
            "quarter": quarter
        })
    return reports
Step 2: Process reports
def analyze_financial_reports(reports):
    results = []
    for report in reports:
        analysis = process_document(
            url=report["url"],
            model="gpt-4o",
            analysis_depth="deep",
            extract=[
                "revenue",
                "net_income",
                "operating_margin",
                "eps",
                "guidance",
                "key_metrics",
                "risks_mentioned",
                "management_commentary"
            ]
        )
        results.append({
            "company": report["company"],
            "quarter": report["quarter"],
            "financials": analysis["entities"]["money"],
            "metrics": analysis["key_metrics"],
            "sentiment": analysis["sentiment"],
            "risks": analysis["risks_mentioned"]
        })
    return results
Step 3: Comparative analysis
import pandas as pd

def compare_companies(companies_data):
    comparison = pd.DataFrame(companies_data)
    # Calculate metrics
    comparison["revenue_growth"] = calculate_growth(comparison, "revenue")
    comparison["margin_trend"] = calculate_trend(comparison, "operating_margin")
    comparison["sentiment_score"] = normalize_sentiment(comparison, "sentiment")
    # Rank companies
    comparison["investment_score"] = (
        comparison["revenue_growth"] * 0.3 +
        comparison["margin_trend"] * 0.3 +
        comparison["sentiment_score"] * 0.2 +
        comparison["guidance_strength"] * 0.2
    )
    return comparison.sort_values("investment_score", ascending=False)
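The weighted ranking can also be expressed as a standalone scoring function, which makes the weights easy to unit-test in isolation (the weights mirror the snippet and are illustrative, not an investment recommendation):

```python
# Weighted investment score, standalone. Weights mirror the DataFrame
# version above; they are illustrative, not a recommendation.
WEIGHTS = {
    "revenue_growth": 0.3,
    "margin_trend": 0.3,
    "sentiment_score": 0.2,
    "guidance_strength": 0.2,
}

def investment_score(metrics):
    """All inputs are assumed normalized to a comparable 0-1 scale."""
    return sum(metrics[k] * w for k, w in WEIGHTS.items())

print(investment_score({
    "revenue_growth": 0.8, "margin_trend": 0.5,
    "sentiment_score": 0.6, "guidance_strength": 0.7,
}))
```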
Results:
- Analysis time: 2 days per quarter vs a continuous full-time effort
- Identify opportunities faster
- Data-driven investment decisions
- Better risk management
Best Practices:
- Use GPT-4o for financial reports (precision matters)
- Cross-reference extracted numbers with tables
- Track metrics over multiple quarters
- Combine with quantitative data
Legal Discovery
Challenge:
- E-discovery with 100,000+ documents
- Identify relevant documents for litigation
- Manual review: $5-10 million
- Timeline: 12-18 months
Solution with Alactic:
Step 1: Initial processing
def process_discovery_documents(document_folder):
    documents = collect_all_files(document_folder)
    # Process in large batches
    batch_size = 500  # Enterprise plan capacity
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        result = process_batch(
            files=batch,
            model="gpt-4o-mini",  # Cost-effective for initial pass
            analysis_depth="standard",
            enable_vectors=True  # Enable semantic search
        )
        store_results(result)
Step 2: Relevance scoring
def score_relevance(document, case_keywords):
    # case_entities and case_description are assumed module-level case context
    # Get document analysis
    analysis = get_document_analysis(document["id"])
    # Calculate relevance score
    keyword_matches = count_keyword_matches(analysis["text"], case_keywords)
    entity_relevance = check_relevant_entities(analysis["entities"], case_entities)
    semantic_similarity = calculate_semantic_similarity(
        analysis["summary"],
        case_description
    )
    relevance_score = (
        keyword_matches * 0.3 +
        entity_relevance * 0.4 +
        semantic_similarity * 0.3
    )
    return relevance_score
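A minimal version of the `calculate_semantic_similarity` step is cosine similarity between embedding vectors. In a real deployment the vectors would come from the platform's vector store; the short vectors here are toys:

```python
import math

# Cosine similarity between two embedding vectors; 1.0 means identical
# direction, 0.0 means orthogonal. A toy stand-in for semantic similarity.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0

similarity = cosine_similarity([0.1, 0.8, 0.3], [0.2, 0.7, 0.4])
```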
Step 3: Prioritized review
def prioritize_for_review(documents_data, threshold=0.7):
    # case_keywords is assumed module-level case context
    # Score all documents
    scored_docs = [
        {
            "doc": doc,
            "relevance": score_relevance(doc, case_keywords)
        }
        for doc in documents_data
    ]
    # Sort by relevance
    scored_docs.sort(key=lambda x: x["relevance"], reverse=True)
    # High relevance: manual review required
    high_relevance = [d for d in scored_docs if d["relevance"] > threshold]
    # Medium relevance: spot check
    medium_relevance = [d for d in scored_docs if 0.4 < d["relevance"] <= threshold]
    # Low relevance: can likely exclude
    low_relevance = [d for d in scored_docs if d["relevance"] <= 0.4]
    return {
        "high": high_relevance,
        "medium": medium_relevance,
        "low": low_relevance,
        "review_reduction": len(low_relevance) / len(scored_docs) * 100
    }
Step 4: Hot document identification
def identify_hot_documents(documents_data):
    hot_docs = []
    # Look for smoking-gun indicator phrases; extend this list with
    # case-specific date ranges and key-people communications
    indicators = [
        "confidential",
        "do not distribute",
        "privileged",
        "attorney-client",
        "destroy after reading",
    ]
    for doc in documents_data:
        hot_score = calculate_hot_score(doc, indicators)
        if hot_score > 0.8:
            hot_docs.append({
                "document": doc,
                "hot_score": hot_score,
                "reasons": explain_hot_score(doc, indicators)
            })
    return hot_docs
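One simple way to implement the `calculate_hot_score` helper is the fraction of indicator phrases present in the document text. This is a sketch operating on raw text; a production version would also weight entities and date ranges:

```python
# Sketch of calculate_hot_score: fraction of indicator phrases found in
# the document's text. Case-insensitive substring matching only.
def calculate_hot_score(text, indicators):
    text_lower = text.lower()
    hits = sum(1 for phrase in indicators if phrase in text_lower)
    return hits / len(indicators) if indicators else 0.0

indicators = ["confidential", "do not distribute", "privileged",
              "attorney-client", "destroy after reading"]
score = calculate_hot_score(
    "CONFIDENTIAL - privileged and attorney-client material, do not distribute",
    indicators,
)
```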
Results:
- Review set reduced by 90% (10,000 vs 100,000 docs)
- Cost: $500K vs $5M
- Timeline: 3 months vs 18 months
- Better case outcomes with AI-identified hot docs
Best Practices:
- Use mini for initial pass (cost control)
- Use GPT-4o for high-relevance documents
- Enable vector search for semantic queries
- Have legal team validate high-relevance docs
- Use cascading review strategy
Medical Record Processing
Challenge:
- Extract patient information from unstructured medical records
- Identify diagnoses, medications, allergies, procedures
- Manual extraction: 30 minutes per record
- Error-prone and inconsistent
Solution with Alactic:
Step 1: HIPAA-compliant deployment
Ensure compliance:
- Enterprise plan with BAA
- Data encryption at rest and in transit
- Access controls and audit logging
- Data retention policies
Step 2: Process medical records
def process_medical_record(record_pdf):
    result = process_document(
        file_path=record_pdf,
        model="gpt-4o",  # Accuracy critical for medical data
        analysis_depth="deep",
        extract=[
            "patient_demographics",
            "diagnoses",
            "medications",
            "allergies",
            "procedures",
            "lab_results",
            "vital_signs",
            "medical_history",
            "treatment_plan"
        ]
    )
    return structure_medical_data(result)
Step 3: Extract structured medical data
def structure_medical_data(analysis):
    structured = {
        "patient": {
            "name": extract_patient_name(analysis),
            "dob": extract_date_of_birth(analysis),
            "mrn": extract_medical_record_number(analysis)
        },
        "diagnoses": [
            {
                "condition": d["value"],
                "icd_code": map_to_icd10(d["value"]),
                "date": d["date"]
            }
            for d in analysis["entities"]["medical_condition"]
        ],
        "medications": [
            {
                "name": m["value"],
                "dosage": extract_dosage(m),
                "frequency": extract_frequency(m),
                "start_date": m["date"]
            }
            for m in analysis["entities"]["medication"]
        ],
        "allergies": [
            {
                "allergen": a["value"],
                "reaction": extract_reaction(a),
                "severity": classify_severity(a)
            }
            for a in analysis["entities"]["allergy"]
        ]
    }
    return structured
Step 4: Validate and store
def validate_medical_data(structured_data):
    # Validation rules
    validations = [
        validate_mrn_format(structured_data["patient"]["mrn"]),
        validate_icd_codes(structured_data["diagnoses"]),
        check_medication_interactions(structured_data["medications"]),
        verify_allergy_severity(structured_data["allergies"]),
        ensure_completeness(structured_data)
    ]
    if all(validations):
        store_in_ehr(structured_data)
        return True
    else:
        flag_for_manual_review(structured_data, validations)
        return False
Results:
- Processing time: 2 minutes vs 30 minutes per record
- Accuracy: 99%+ for critical fields
- Cost savings: $25 per record
- Improved data accessibility for care teams
Best Practices:
- Always use GPT-4o for medical records (accuracy critical)
- Implement strict validation rules
- Have medical staff review AI-extracted data
- Maintain audit trail for all extractions
- Ensure HIPAA compliance at all stages
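The audit-trail practice above can be implemented as a tamper-evident record per extraction. A minimal sketch, assuming the record and field names shown here; adapt it to your compliance requirements:

```python
import datetime
import hashlib
import json

# Sketch of an audit-trail entry for one extraction: who, when, and a
# hash of exactly what was extracted. Field names are assumptions.
def audit_entry(record_id, user, extracted):
    payload = json.dumps(extracted, sort_keys=True).encode()
    return {
        "record_id": record_id,
        "user": user,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "extraction_sha256": hashlib.sha256(payload).hexdigest(),
    }

entry = audit_entry("MRN-001", "nurse.jane", {"diagnoses": ["J45.909"]})
```

Hashing the canonicalized extraction (sorted keys) means any later change to the stored data no longer matches the logged digest.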
Implementation Patterns
Batch Processing Pattern
When to use:
- Processing large volumes of similar documents
- Not time-sensitive (can process overnight)
- Want to maximize efficiency
Implementation:
import time

def batch_processing_pattern(documents, batch_size=50):
    results = []
    # Split into batches
    batches = [documents[i:i+batch_size] for i in range(0, len(documents), batch_size)]
    for batch in batches:
        # Submit batch
        batch_id = submit_batch(batch)
        # Wait for completion
        while not is_batch_complete(batch_id):
            time.sleep(10)
        # Collect results
        batch_results = get_batch_results(batch_id)
        results.extend(batch_results)
        # Clean up (optional)
        cleanup_batch(batch_id)
    return results
Benefits:
- Efficient API usage
- Lower cost per document
- Easier error handling
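The fixed 10-second poll in the pattern above can be replaced with capped exponential backoff, so short batches return quickly while long batches don't hammer the API. A self-contained sketch; `check` stands in for `is_batch_complete`, and `fake_check` simulates a batch that completes on the third poll:

```python
import time

# Poll with capped exponential backoff instead of a fixed sleep.
def wait_for_batch_with_backoff(check, initial=1.0, cap=60.0, max_wait=600.0):
    waited, delay = 0.0, initial
    while waited < max_wait:
        if check():
            return True
        step = min(delay, cap)
        time.sleep(step)
        waited += step
        delay *= 2  # double the wait each round, up to the cap
    return False

calls = {"n": 0}

def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

done = wait_for_batch_with_backoff(fake_check, initial=0.01)
```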
Streaming Pattern
When to use:
- Real-time processing required
- Documents arrive continuously
- Need immediate results
Implementation:
import queue
import threading

def streaming_processing_pattern():
    document_queue = queue.Queue()
    results_queue = queue.Queue()

    # Worker threads
    def worker():
        while True:
            doc = document_queue.get()
            if doc is None:
                break
            result = process_document(doc)
            results_queue.put(result)
            document_queue.task_done()

    # Start workers
    num_workers = 5
    threads = []
    for _ in range(num_workers):
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)

    # Feed documents
    for doc in incoming_documents():
        document_queue.put(doc)

    # Wait for completion
    document_queue.join()

    # Stop workers and wait for them to exit
    for _ in range(num_workers):
        document_queue.put(None)
    for t in threads:
        t.join()

    return collect_results(results_queue)
Benefits:
- Low latency
- Handles variable load
- Scalable
Cascade Pattern
When to use:
- Want to optimize cost and quality
- Can tolerate two-pass processing
- Have clear quality thresholds
Implementation:
def cascade_processing_pattern(document):
    # First pass: GPT-4o mini (fast and cheap)
    result_mini = process_document(
        document,
        model="gpt-4o-mini",
        analysis_depth="standard"
    )
    # Check quality
    if result_mini["confidence"] > 0.85:
        # High confidence: use mini result
        return result_mini
    else:
        # Low confidence: reprocess with GPT-4o
        result_gpt4 = process_document(
            document,
            model="gpt-4o",
            analysis_depth="deep"
        )
        return result_gpt4
Benefits:
- Optimal cost-quality balance
- 80-90% of docs use cheaper model
- High quality when needed
Best Practices
Quality Assurance
1. Implement validation rules
def validate_extraction(extracted_data, document_type):
    rules = get_validation_rules(document_type)
    errors = []
    for rule in rules:
        if not rule.validate(extracted_data):
            errors.append({
                "rule": rule.name,
                "message": rule.error_message
            })
    return len(errors) == 0, errors
2. Manual spot checks
- Review 5-10% of processed documents randomly
- Focus on high-value or high-risk documents
- Track accuracy over time
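The random spot-check can be a seeded sample draw over processed document IDs, so the review set is reproducible for audits. A sketch:

```python
import random

# Sketch of the 5-10% random spot-check. Seeding makes the draw
# reproducible for audit purposes.
def spot_check_sample(doc_ids, rate=0.05, seed=42):
    k = max(1, int(len(doc_ids) * rate))
    rng = random.Random(seed)
    return rng.sample(doc_ids, k)

sample = spot_check_sample([f"doc-{i}" for i in range(200)], rate=0.05)
```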
3. Confidence scoring
def calculate_confidence(analysis):
    factors = {
        "entity_count": len(analysis["entities"]),
        "clear_structure": has_clear_structure(analysis["text"]),
        "ambiguity_score": calculate_ambiguity(analysis["text"]),
        "model_certainty": analysis["model_metadata"]["certainty"]
    }
    confidence = compute_confidence(factors)
    return confidence
Performance Optimization
1. Choose appropriate analysis depth
- Quick Extract: Text only, no analysis (fastest)
- Standard: Summary + key points (balanced)
- Deep: Full analysis (slowest, most comprehensive)
2. Optimize model selection
- Use mini by default
- Use GPT-4o selectively
- Implement cascade strategy
3. Enable vector storage selectively
Only enable if using semantic search:
enable_vectors = use_case in ["legal_discovery", "research_synthesis"]
Cost Management
1. Monitor usage
def monitor_costs(budget_threshold):
    usage = get_monthly_usage()
    metrics = {
        "documents_processed": usage["total_documents"],
        "model_costs": usage["model_costs"],
        "cost_per_document": usage["model_costs"] / usage["total_documents"],
        "projected_monthly": usage["model_costs"] / usage["days_elapsed"] * 30
    }
    if metrics["projected_monthly"] > budget_threshold:
        alert_team("Projected costs exceed budget")
    return metrics
2. Optimize processing
- Use Quick Extract when appropriate
- Disable vectors if not needed
- Batch process during off-peak hours
3. Track ROI
def calculate_roi(annual_alactic_cost, labor_cost_saved, error_cost_avoided,
                  time_to_value_benefit):
    # All inputs are annual figures; Alactic cost covers infrastructure
    # plus model usage. Benefits: labor saved, error costs avoided, and
    # the value of faster turnaround.
    total_benefit = labor_cost_saved + error_cost_avoided + time_to_value_benefit
    roi = (total_benefit - annual_alactic_cost) / annual_alactic_cost * 100
    return {
        "roi_percentage": roi,
        "monthly_savings": (total_benefit - annual_alactic_cost) / 12,
        "payback_period_months": annual_alactic_cost / (total_benefit / 12)
    }