
Monitoring and Observability

Monitoring your Alactic AGI deployment ensures optimal performance, reliability, and cost efficiency. This comprehensive guide covers monitoring strategies, tools, metrics, and best practices for maintaining a healthy production deployment.

Monitoring Strategy

Why Monitor?

Key Benefits:

  • Early Problem Detection: Identify issues before users notice
  • Performance Optimization: Find bottlenecks and optimize
  • Cost Control: Track spending and prevent overruns
  • Capacity Planning: Forecast growth and scale proactively
  • Troubleshooting: Quickly diagnose and resolve issues

Monitoring Layers

1. Infrastructure Layer

  • VM health (CPU, memory, disk)
  • Network connectivity
  • Azure service availability

2. Application Layer

  • Service status (API, frontend, worker)
  • Processing queue depth
  • Error rates and types

3. Business Layer

  • Documents processed
  • Processing success rate
  • User activity patterns

4. Cost Layer

  • Infrastructure spending
  • Model API costs
  • Monthly budget tracking

Built-in Dashboard Monitoring

Usage Statistics

Access via Settings → Usage Statistics

Key Metrics:

Document Processing:

PDFs Processed This Month: 45 / 100
URLs Scraped This Month: 132 / 200
Remaining Quota: 55 PDFs, 68 URLs

Storage:

Storage Used: 3.2 GB / 5 GB (64%)
Growth Rate: +120 MB/week
Projected Full: ~15 weeks
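The projection can be recomputed from the figures above as a back-of-envelope check (assuming the growth rate stays flat):

```shell
# (quota - used) in MB, divided by weekly growth in MB
used_gb=3.2; quota_gb=5; growth_mb_per_week=120
weeks=$(awk -v u="$used_gb" -v q="$quota_gb" -v g="$growth_mb_per_week" \
  'BEGIN { printf "%d", (q - u) * 1024 / g }')
echo "Projected full in ~${weeks} weeks"
```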

Token Usage:

Input Tokens (Month): 2,450,000
Output Tokens (Month): 385,000
Total Cost: $2.13
Average per Document: $0.0117

Model Distribution:

GPT-4o mini: 140 documents (77%), $0.50
GPT-4o: 42 documents (23%), $1.63
Total: 182 documents, $2.13

Processing Status

Active Jobs:

Processing: 3 jobs
Queued: 12 jobs
Completed (Today): 28 jobs
Failed (Today): 1 job

Recent Activity:

[2024-03-20 14:35] invoice_March.pdf - Completed (18.2s)
[2024-03-20 14:33] contract_NDA.pdf - Completed (42.5s)
[2024-03-20 14:30] article_scrape.url - Failed (timeout)

Service Health

Check service status:

Dashboard → Settings → Service Status

✓ API Service: Running (healthy)
✓ Worker Service: Running (2 jobs active)
✓ Frontend: Running
✓ Solr Search: Running
✓ Database: Connected
✓ Azure OpenAI: Connected (120ms latency)

What to Watch:

  • Any service showing "Stopped" or "Failed"
  • High latency to Azure OpenAI (more than 500ms)
  • Database connection errors

Azure Monitor Integration

Enable Azure Monitor

Step 1: Create Log Analytics Workspace

  1. Go to Azure Portal
  2. Click "+ Create a resource"
  3. Search "Log Analytics Workspace"
  4. Click "Create"
  5. Configure:
    • Resource group: Same as Alactic deployment
    • Name: alactic-logs-workspace
    • Region: Same as VM region
    • Pricing tier: Pay-as-you-go (about $2-5/month)
  6. Click "Review + Create"

Step 2: Connect VM to Log Analytics

  1. Go to your VM resource
  2. Click "Insights" in left sidebar
  3. Click "Enable"
  4. Select your Log Analytics workspace
  5. Wait 5-10 minutes for setup

Step 3: Enable Diagnostic Logs

For each resource (VM, Cosmos DB, Storage):

  1. Go to resource
  2. Click "Diagnostic settings"
  3. Click "+ Add diagnostic setting"
  4. Configure:
    • Name: "Send to Log Analytics"
    • Select all log categories
    • Destination: Log Analytics workspace
  5. Click "Save"
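For repeatable setups, the same diagnostic setting can be scripted with the Azure CLI. The resource group, workspace, and resource ID below are placeholders to adapt to your deployment:

```shell
# Placeholder names throughout; substitute your resource group, the target
# resource ID (VM, Cosmos DB, or Storage), and your workspace name.
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group my-alactic-rg \
  --workspace-name alactic-logs-workspace \
  --query id -o tsv)

# "allLogs" selects every log category the resource emits.
az monitor diagnostic-settings create \
  --name "Send to Log Analytics" \
  --resource "/subscriptions/<sub-id>/resourceGroups/my-alactic-rg/providers/Microsoft.DocumentDB/databaseAccounts/my-cosmos" \
  --workspace "$WORKSPACE_ID" \
  --logs '[{"categoryGroup": "allLogs", "enabled": true}]' \
  --metrics '[{"category": "AllMetrics", "enabled": true}]'
```

Repeat the `diagnostic-settings create` call once per resource, changing only `--resource`.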

Key Metrics to Monitor

Virtual Machine Metrics:

Metric                Threshold                Action
--------------------  -----------------------  ---------------------------------------
Percentage CPU        more than 80% sustained  Investigate workload or upgrade
Available Memory      less than 1 GB           Check for memory leaks or upgrade
Disk Queue Length     more than 5              Disk bottleneck; upgrade to faster disk
Network In/Out        Near bandwidth limit     Check for unexpected traffic
Disk Read/Write IOPS  Near IOPS limit          Upgrade to Premium SSD

Azure OpenAI Metrics:

Metric           Threshold            Action
---------------  -------------------  --------------------------------
HTTP 429 Errors  Any occurrence       Request quota increase
Average Latency  more than 5 seconds  Check network or OpenAI status
Failed Requests  more than 5%         Investigate errors; check quota
Token Usage      Approaching quota    Plan for overage or reduce usage

Cosmos DB Metrics:

Metric                   Threshold                      Action
-----------------------  -----------------------------  ---------------------------
Request Units (RU/s)     Near 5,000 (serverless limit)  Optimize queries or upgrade
Storage Used             more than 40 GB                Clean up old data
HTTP 429 (Rate Limited)  Any occurrence                 Reduce request rate

Querying Logs with KQL

Access Log Analytics:

  1. Azure Portal → Log Analytics workspace
  2. Click "Logs"
  3. Write KQL queries

Common Queries:

1. VM CPU Usage Over Time

Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where Computer == "your-vm-name"
| where TimeGenerated > ago(24h)
| summarize avg(CounterValue) by bin(TimeGenerated, 5m)
| render timechart

2. Application Errors

Syslog
| where Facility == "user" and SeverityLevel == "err"
| where TimeGenerated > ago(24h)
| project TimeGenerated, ProcessName, SyslogMessage
| order by TimeGenerated desc

3. Azure OpenAI API Calls

AzureDiagnostics
| where ResourceType == "COGNITIVESERVICES/ACCOUNTS"
| where TimeGenerated > ago(24h)
| summarize Count=count() by OperationName, ResultType
| order by Count desc

4. Failed Document Processing

Syslog
| where ProcessName == "alactic-worker"
| where SyslogMessage contains "failed" or SyslogMessage contains "error"
| where TimeGenerated > ago(7d)
| project TimeGenerated, SyslogMessage
| order by TimeGenerated desc

5. Cost Analysis by Service

AzureDiagnostics
| where ResourceType == "COGNITIVESERVICES/ACCOUNTS"
| where TimeGenerated > ago(30d)
| extend Tokens = toint(Properties.tokens)
| summarize TotalTokens = sum(Tokens) by Model = tostring(Properties.model)
| extend EstimatedCost = TotalTokens * 0.15 / 1000000 // Adjust rate per model

Alert Configuration

1. High CPU Usage

Condition:

  • Metric: Percentage CPU
  • Operator: Greater than
  • Threshold: 85%
  • Duration: 15 minutes

Action: Email alert

Why: Indicates VM struggling, may need upgrade or workload reduction.

2. Low Memory

Condition:

  • Metric: Available Memory Bytes
  • Operator: Less than
  • Threshold: 1 GB (1,073,741,824 bytes)
  • Duration: 5 minutes

Action: Email alert

Why: May cause service crashes or poor performance.
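Memory thresholds are entered as raw byte values, so the conversion from GiB is just powers of 1024:

```shell
# GiB to bytes, for metric thresholds that expect raw byte values
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }
gib_to_bytes 1   # 1073741824
```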

3. Azure OpenAI Throttling

Condition:

  • Metric: HTTP 429 responses
  • Operator: Greater than
  • Threshold: 1
  • Duration: 5 minutes

Action: Email + Webhook to pause processing

Why: Hitting rate limits, requests failing.

4. High Disk Usage

Condition:

  • Metric: Percentage Disk Used
  • Operator: Greater than
  • Threshold: 85%
  • Duration: 1 hour

Action: Email alert

Why: Running out of storage, may cause failures.

5. Service Down

Condition:

  • Metric: VM Availability
  • Operator: Equals
  • Value: 0 (not available)
  • Duration: 5 minutes

Action: Email + SMS to on-call engineer

Why: Critical - services inaccessible.

6. High Daily Cost

Condition:

  • Metric: Daily Cost
  • Operator: Greater than
  • Threshold: $15 (adjust per plan)
  • Evaluation: Daily

Action: Email alert

Why: Unexpected spending, may indicate issue or misuse.

7. Processing Failures

Condition:

  • Custom log query: Failed jobs more than 10% of total
  • Evaluation: Every 1 hour

KQL Query:

Syslog
| where ProcessName == "alactic-worker"
| where TimeGenerated > ago(1h)
| summarize Total = count(), Failed = countif(SyslogMessage contains "failed")
| extend FailureRate = (Failed * 100.0) / Total
| where FailureRate > 10

Action: Email alert

Why: High failure rate indicates systemic issue.

Creating Alerts in Azure Portal

Step-by-Step:

  1. Go to Azure Portal → Monitor
  2. Click "Alerts" in left sidebar
  3. Click "+ Create" → "Alert rule"
  4. Select scope:
    • Click "Select resource"
    • Choose your VM, Cosmos DB, or OpenAI resource
  5. Define condition:
    • Click "Add condition"
    • Select metric (e.g., "Percentage CPU")
    • Configure threshold and duration
  6. Add an action group (email, SMS, or webhook recipients)
  7. Configure alert details:
    • Severity: Select appropriate level
    • Alert rule name: "Alactic High CPU Usage"
    • Description: "CPU usage exceeded 85% for 15 minutes"
  8. Click "Create alert rule"
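The portal flow above maps to a single Azure CLI call. The resource group, VM name, subscription ID, and action group below are placeholders; the condition mirrors alert #1 (CPU above 85% for 15 minutes):

```shell
# Placeholders throughout; adapt names and IDs to your deployment.
az monitor metrics alert create \
  --name "Alactic High CPU Usage" \
  --resource-group my-alactic-rg \
  --scopes "/subscriptions/<sub-id>/resourceGroups/my-alactic-rg/providers/Microsoft.Compute/virtualMachines/my-alactic-vm" \
  --condition "avg Percentage CPU > 85" \
  --window-size 15m \
  --evaluation-frequency 5m \
  --severity 2 \
  --description "CPU usage exceeded 85% for 15 minutes" \
  --action "/subscriptions/<sub-id>/resourceGroups/my-alactic-rg/providers/microsoft.insights/actionGroups/alactic-ops"
```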

Application-Level Monitoring

Service Health Checks

Check Service Status via SSH:

# SSH into VM
ssh -i ~/.ssh/id_rsa appuser@your-vm-ip

# Check all Alactic services
sudo systemctl status alactic-*

# Individual services
sudo systemctl status alactic-api
sudo systemctl status alactic-worker
sudo systemctl status alactic-ui
sudo systemctl status alactic-solr

# View recent logs
sudo journalctl -u alactic-api -n 50
sudo journalctl -u alactic-worker -n 50 --follow

Automated Health Check Script:

#!/bin/bash
# save as check-health.sh

echo "Checking Alactic Services Health..."

# Check API service
if systemctl is-active --quiet alactic-api; then
    echo "✓ API Service: Running"
else
    echo "✗ API Service: FAILED"
    sudo systemctl restart alactic-api
fi

# Check Worker service
if systemctl is-active --quiet alactic-worker; then
    echo "✓ Worker Service: Running"
else
    echo "✗ Worker Service: FAILED"
    sudo systemctl restart alactic-worker
fi

# Check disk space
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
    echo "⚠ Disk Usage: ${DISK_USAGE}% (HIGH)"
else
    echo "✓ Disk Usage: ${DISK_USAGE}%"
fi

# Check memory
MEMORY_USAGE=$(free | awk '/Mem/ {printf "%.0f", $3/$2 * 100}')
if [ "$MEMORY_USAGE" -gt 85 ]; then
    echo "⚠ Memory Usage: ${MEMORY_USAGE}% (HIGH)"
else
    echo "✓ Memory Usage: ${MEMORY_USAGE}%"
fi

Schedule with cron:

# Edit crontab
crontab -e

# Add line (run every 5 minutes)
*/5 * * * * /path/to/check-health.sh >> /var/log/alactic-health.log 2>&1

Queue Monitoring

Check Processing Queue:

# View queue depth via API
curl -H "X-Deployment-Key: ak-xxxxx" \
  https://your-vm-ip/api/v1/queue/status

# Returns:
# {
#   "queued": 15,
#   "processing": 3,
#   "completed_today": 42,
#   "failed_today": 2
# }

What to Watch:

  • Queue depth more than 50: May indicate processing bottleneck
  • Failed count more than 10% of total: Investigate error patterns
  • Processing jobs stuck: May need worker restart
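These checks can run from cron as a small script. The endpoint and key are the ones shown above; the JSON parsing uses grep so it works without jq, and a captured sample response stands in for a live call:

```shell
# In production, fetch live data instead of the sample:
#   response=$(curl -s -H "X-Deployment-Key: ak-xxxxx" https://your-vm-ip/api/v1/queue/status)
response='{"queued": 55, "processing": 3, "completed_today": 42, "failed_today": 6}'

# Extract a numeric field from the flat JSON response
field() { echo "$response" | grep -o "\"$1\": *[0-9]\+" | grep -o '[0-9]\+'; }
queued=$(field queued)
failed=$(field failed_today)
completed=$(field completed_today)
total=$((failed + completed))

# Thresholds from the "What to Watch" list above
[ "$queued" -gt 50 ] && echo "WARN: queue depth ${queued} exceeds 50"
if [ "$total" -gt 0 ] && [ $((failed * 100)) -gt $((total * 10)) ]; then
    echo "WARN: failure rate above 10% (${failed}/${total} jobs)"
fi
```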

Error Tracking

View Application Errors:

# Recent errors from API service
sudo journalctl -u alactic-api -p err -n 100

# Recent errors from worker service
sudo journalctl -u alactic-worker -p err -n 100

# Search for specific error pattern
sudo journalctl -u alactic-worker | grep -i "timeout"

Common Error Patterns:

Error Message                         Cause                         Solution
------------------------------------  ----------------------------  -------------------------------------
"Connection timeout to Azure OpenAI"  Network issue or OpenAI down  Check network, retry
"429 Too Many Requests"               Rate limit exceeded           Reduce request rate or increase quota
"PDF parsing failed"                  Corrupted or unsupported PDF  Check file validity
"Database connection failed"          Cosmos DB issue               Check Cosmos DB status
"Out of memory"                       Memory exhausted              Restart service or upgrade plan

Cost Monitoring

Azure Cost Management

Access Cost Analysis:

  1. Azure Portal → Cost Management + Billing
  2. Click "Cost analysis"
  3. Set scope: Your resource group
  4. View current month spending

Key Cost Breakdowns:

By Resource:

Virtual Machine: $105.00 (71%)
Cosmos DB: $12.00 (8%)
Storage: $8.00 (5%)
Azure OpenAI: $2.34 (2%)
Network: $2.80 (2%)
Other: $16.86 (12%)
Total: $147.00

By Service Category:

Compute: $105.00
Database: $12.00
Storage: $8.00
AI Services: $2.34
Networking: $2.80

Trend Analysis:

  • Compare month-over-month
  • Identify cost spikes
  • Project end-of-month total

Cost Budgets and Alerts

Create Budget:

  1. Azure Portal → Cost Management
  2. Click "Budgets"
  3. Click "+ Add"
  4. Configure:
    • Name: "Alactic Monthly Budget"
    • Reset period: Monthly
    • Amount: $200 (includes buffer)
  5. Add alert conditions:
    • 75% of budget: Email notification
    • 90% of budget: Email notification
    • 100% of budget: Email + stop processing action
  6. Click "Create"

Budget Alert Actions:

75% threshold ($150):

  • Email: "Approaching monthly budget"
  • Review usage patterns
  • Identify any anomalies

90% threshold ($180):

  • Email: "Critical budget level"
  • Investigate high costs immediately
  • Consider reducing usage

100% threshold ($200):

  • Email: "Budget exceeded"
  • Consider automatic actions:
    • Deallocate VM (stops processing, saves costs)
    • Disable Azure OpenAI (prevent additional model costs)
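The deallocate action can be wired to the 100% alert via an automation runbook, or simply run by hand when the alert fires. Resource names below are placeholders:

```shell
# Deallocating (not just stopping) releases the compute, so the VM stops
# accruing charges; storage, Cosmos DB, and other resources continue to bill.
az vm deallocate --resource-group my-alactic-rg --name my-alactic-vm

# When the new billing period starts:
az vm start --resource-group my-alactic-rg --name my-alactic-vm
```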

Token Usage Tracking

Monitor in Dashboard:

Settings → Usage Statistics → Token Usage

Key Metrics:

Current Month:
Input Tokens: 2,450,000
Output Tokens: 385,000

Model Breakdown:
GPT-4o mini:
Input: 2,100,000 ($0.315)
Output: 310,000 ($0.186)

GPT-4o:
Input: 350,000 ($0.875)
Output: 75,000 ($0.750)

Total Cost: $2.13
Average per Document: $0.0117

Cost Optimization Tips:

  1. Use GPT-4o mini by default
  2. Enable GPT-4o only for complex documents
  3. Use Quick Extract to reduce output tokens
  4. Minimize reprocessing
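The per-model costs above follow directly from token counts times per-million-token rates. The rates below are the ones implied by the dashboard figures; confirm them against current Azure OpenAI pricing:

```shell
# cost <input_tokens> <output_tokens> <input_rate_per_1M> <output_rate_per_1M>
cost() {
  awk -v i="$1" -v o="$2" -v ri="$3" -v ro="$4" \
    'BEGIN { printf "%.3f", i * ri / 1e6 + o * ro / 1e6 }'
}

mini=$(cost 2100000 310000 0.15 0.60)   # GPT-4o mini tokens from above
full=$(cost 350000  75000  2.50 10.00)  # GPT-4o tokens from above
echo "GPT-4o mini: \$${mini}   GPT-4o: \$${full}"
```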

Performance Monitoring

Processing Speed Tracking

Monitor Processing Times:

Dashboard → Recent Activity

Average Processing Times (Last 24 Hours):

PDF Processing:
1-5 pages: 15.2s avg
6-10 pages: 32.4s avg
11-20 pages: 58.7s avg
21-50 pages: 2m 35s avg

URL Scraping:
Simple sites: 12.3s avg
Complex sites: 28.9s avg

Model Inference:
GPT-4o mini: 0.4s avg
GPT-4o: 0.6s avg

Performance Trends:

Track weekly averages:

  • Week 1: 18.5s avg
  • Week 2: 19.2s avg (slight increase)
  • Week 3: 22.1s avg (investigate)
  • Week 4: 18.8s avg (back to normal)

Investigate if:

  • Processing times increase more than 20% week-over-week
  • Consistently slower than baseline
  • High variance in processing times

Throughput Monitoring

Track Documents Processed:

Daily Throughput:
Monday: 28 documents
Tuesday: 35 documents
Wednesday: 42 documents (peak)
Thursday: 31 documents
Friday: 24 documents
Weekend: 8 documents

Weekly Total: 168 documents
Monthly Projection: 720 documents (exceeds quota!)

Action: Plan for upgrade if consistently exceeding quota.
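The monthly projection is simple arithmetic on the weekly total (30-day month assumed):

```shell
weekly=168
monthly=$(( weekly * 30 / 7 ))   # scale a 7-day total to a 30-day month
echo "Monthly projection: ${monthly} documents"
```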

Latency Analysis

Component Latency Breakdown:

Total Processing Time: 18.5s

Breakdown:
File upload: 1.2s (6%)
PDF parsing: 3.8s (21%)
Text extraction: 2.1s (11%)
Model inference: 0.4s (2%)
Summary generation: 8.5s (46%)
Vector embedding: 1.8s (10%)
Database write: 0.7s (4%)

Bottleneck: Summary generation (46% of time)

Optimization:

  • Use Quick Extract if summary not needed
  • Consider GPT-4o mini for faster inference

Uptime Monitoring

Availability Tracking

Azure VM Availability:

Azure Portal → VM → Metrics → Availability

Last 30 Days:
Uptime: 99.95%
Downtime: 21.6 minutes

Incidents:
- March 5: 12 minutes (Azure maintenance)
- March 18: 9 minutes (manual restart)

SLA:

  • Free/Pro/Pro+ Plans: No SLA guarantee
  • Enterprise Plan: 99.9% SLA

External Monitoring

Use Third-Party Uptime Monitors:

Recommended Services:

  • UptimeRobot (free for 50 monitors)
  • Pingdom
  • StatusCake

Configuration:

  1. Sign up for service
  2. Add an HTTP(S) monitor for your deployment's public URL
  3. Add alert contacts
  4. Monitor dashboard and API separately

Benefits:

  • Independent verification (not relying on Azure)
  • Historical uptime data
  • Public status page for users

Best Practices

Monitoring Checklist

Daily:

  • Check dashboard for failed jobs
  • Review service health status
  • Monitor queue depth

Weekly:

  • Review processing times and throughput
  • Check Azure Monitor alerts
  • Analyze cost trends
  • Review error logs

Monthly:

  • Comprehensive cost analysis
  • Capacity planning review
  • Performance trend analysis
  • Security audit (access logs)
  • Update monitoring thresholds if needed

Alert Fatigue Prevention

Avoid Too Many Alerts:

  • Set appropriate thresholds (not too sensitive)
  • Group related alerts
  • Use alert severity levels
  • Implement alert cooldown periods

Alert Priority Levels:

P1 (Critical):

  • VM down
  • Service completely failed
  • Security breach

P2 (High):

  • High error rate (more than 20%)
  • Critical resource exhaustion (CPU more than 95%)
  • Budget exceeded

P3 (Medium):

  • Elevated error rate (more than 10%)
  • High resource usage (more than 85%)
  • Processing delays

P4 (Low):

  • Warning thresholds crossed
  • Informational alerts

Dashboard Customization

Create Custom Views:

Azure Portal → Dashboard → New Dashboard

Recommended Tiles:

  1. VM CPU Usage (line chart, 24h)
  2. VM Memory Usage (line chart, 24h)
  3. Cosmos DB Request Units (line chart, 24h)
  4. Azure OpenAI Token Usage (bar chart, monthly)
  5. Daily Cost (bar chart, last 30 days)
  6. Active Alerts (list)
  7. Recent Deployments (list)

Share with Team:

  • Azure Portal → Dashboard → Share
  • Set permissions (Viewer, Contributor)