Monitoring and Observability
Monitoring your Alactic AGI deployment helps you maintain performance, reliability, and cost efficiency. This guide covers monitoring strategies, tools, metrics, and best practices for keeping a production deployment healthy.
Monitoring Strategy
Why Monitor?
Key Benefits:
- Early Problem Detection: Identify issues before users notice
- Performance Optimization: Find bottlenecks and optimize
- Cost Control: Track spending and prevent overruns
- Capacity Planning: Forecast growth and scale proactively
- Troubleshooting: Quickly diagnose and resolve issues
Monitoring Layers
1. Infrastructure Layer
- VM health (CPU, memory, disk)
- Network connectivity
- Azure service availability
2. Application Layer
- Service status (API, frontend, worker)
- Processing queue depth
- Error rates and types
3. Business Layer
- Documents processed
- Processing success rate
- User activity patterns
4. Cost Layer
- Infrastructure spending
- Model API costs
- Monthly budget tracking
Built-in Dashboard Monitoring
Usage Statistics
Access via Settings → Usage Statistics
Key Metrics:
Document Processing:
PDFs Processed This Month: 45 / 100
URLs Scraped This Month: 132 / 200
Remaining Quota: 55 PDFs, 68 URLs
Storage:
Storage Used: 3.2 GB / 5 GB (64%)
Growth Rate: +120 MB/week
Projected Full: 6 weeks
Token Usage:
Input Tokens (Month): 2,450,000
Output Tokens (Month): 385,000
Total Cost: $2.13
Average per Document: $0.0117
Model Distribution:
GPT-4o mini: 140 documents (77%), $0.50
GPT-4o: 42 documents (23%), $1.63
Total: 182 documents, $2.13
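As a sanity check, the per-token arithmetic behind these figures can be reproduced in a short shell snippet. The per-million-token rates below are assumptions for GPT-4o mini (verify against current Azure OpenAI pricing), so the result illustrates the method rather than your exact dashboard total, which depends on the model mix.

```shell
# Estimate monthly cost from token counts.
# Rates are assumed GPT-4o mini prices per 1M tokens; check current pricing.
INPUT_TOKENS=2450000
OUTPUT_TOKENS=385000
INPUT_RATE=0.15    # $ per 1M input tokens (assumption)
OUTPUT_RATE=0.60   # $ per 1M output tokens (assumption)

COST=$(awk -v i="$INPUT_TOKENS" -v o="$OUTPUT_TOKENS" \
          -v irate="$INPUT_RATE" -v orate="$OUTPUT_RATE" \
  'BEGIN { printf "%.2f", (i * irate + o * orate) / 1000000 }')
echo "Estimated cost: \$${COST}"
```

Divide the result by documents processed to get the average cost per document.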
Processing Status
Active Jobs:
Processing: 3 jobs
Queued: 12 jobs
Completed (Today): 28 jobs
Failed (Today): 1 job
Recent Activity:
[2024-03-20 14:35] invoice_March.pdf - Completed (18.2s)
[2024-03-20 14:33] contract_NDA.pdf - Completed (42.5s)
[2024-03-20 14:30] article_scrape.url - Failed (timeout)
Service Health
Check service status:
Dashboard → Settings → Service Status
✓ API Service: Running (healthy)
✓ Worker Service: Running (2 jobs active)
✓ Frontend: Running
✓ Solr Search: Running
✓ Database: Connected
✓ Azure OpenAI: Connected (120ms latency)
What to Watch:
- Any service showing "Stopped" or "Failed"
- High latency to Azure OpenAI (more than 500ms)
- Database connection errors
Azure Monitor Integration
Enable Azure Monitor
Step 1: Create Log Analytics Workspace
- Go to Azure Portal
- Click "+ Create a resource"
- Search "Log Analytics Workspace"
- Click "Create"
- Configure:
- Resource group: Same as Alactic deployment
- Name: alactic-logs-workspace
- Region: Same as VM region
- Pricing tier: Pay-as-you-go (about $2-5/month)
- Click "Review + Create"
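If you prefer scripting the setup, the same workspace can be created with the Azure CLI. This is a sketch: the resource group name is a placeholder for your Alactic deployment's group, and it assumes you are already logged in with `az login`.

```shell
# Create the Log Analytics workspace from the CLI
# (resource group and region are placeholders; match your deployment).
az monitor log-analytics workspace create \
  --resource-group alactic-rg \
  --workspace-name alactic-logs-workspace \
  --location eastus
```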
Step 2: Connect VM to Log Analytics
- Go to your VM resource
- Click "Insights" in left sidebar
- Click "Enable"
- Select your Log Analytics workspace
- Wait 5-10 minutes for setup
Step 3: Enable Diagnostic Logs
For each resource (VM, Cosmos DB, Storage):
- Go to resource
- Click "Diagnostic settings"
- Click "+ Add diagnostic setting"
- Configure:
- Name: "Send to Log Analytics"
- Select all log categories
- Destination: Log Analytics workspace
- Click "Save"
Key Metrics to Monitor
Virtual Machine Metrics:
| Metric | Threshold | Action |
|---|---|---|
| Percentage CPU | more than 80% sustained | Investigate workload or upgrade |
| Available Memory | less than 1 GB | Check for memory leaks or upgrade |
| Disk Queue Length | more than 5 | Disk bottleneck, upgrade to faster disk |
| Network In/Out | Near bandwidth limit | Check for unexpected traffic |
| Disk Read/Write IOPS | Near IOPS limit | Upgrade to Premium SSD |
Azure OpenAI Metrics:
| Metric | Threshold | Action |
|---|---|---|
| HTTP 429 Errors | Any occurrence | Request quota increase |
| Average Latency | more than 5 seconds | Check network or OpenAI status |
| Failed Requests | more than 5% | Investigate errors, check quota |
| Token Usage | Approaching quota | Plan for overage or reduce usage |
Cosmos DB Metrics:
| Metric | Threshold | Action |
|---|---|---|
| Request Units (RU/s) | Near 5,000 (serverless limit) | Optimize queries or upgrade |
| Storage Used | more than 40 GB | Clean up old data |
| HTTP 429 (Rate Limited) | Any occurrence | Reduce request rate |
Querying Logs with KQL
Access Log Analytics:
- Azure Portal → Log Analytics workspace
- Click "Logs"
- Write KQL queries
Common Queries:
1. VM CPU Usage Over Time
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where Computer == "your-vm-name"
| where TimeGenerated > ago(24h)
| summarize avg(CounterValue) by bin(TimeGenerated, 5m)
| render timechart
2. Application Errors
Syslog
| where Facility == "user" and SeverityLevel == "err"
| where TimeGenerated > ago(24h)
| project TimeGenerated, ProcessName, SyslogMessage
| order by TimeGenerated desc
3. Azure OpenAI API Calls
AzureDiagnostics
| where ResourceType == "COGNITIVESERVICES/ACCOUNTS"
| where TimeGenerated > ago(24h)
| summarize Count=count() by OperationName, ResultType
| order by Count desc
4. Failed Document Processing
Syslog
| where ProcessName == "alactic-worker"
| where SyslogMessage contains "failed" or SyslogMessage contains "error"
| where TimeGenerated > ago(7d)
| project TimeGenerated, SyslogMessage
| order by TimeGenerated desc
5. Cost Analysis by Service
AzureDiagnostics
| where ResourceType == "COGNITIVESERVICES/ACCOUNTS"
| where TimeGenerated > ago(30d)
| extend Tokens = toint(Properties.tokens)
| summarize TotalTokens = sum(Tokens) by Model = tostring(Properties.model)
| extend EstimatedCost = TotalTokens * 0.15 / 1000000 // Adjust rate per model
Alert Configuration
Recommended Alerts
1. High CPU Usage
Condition:
- Metric: Percentage CPU
- Operator: Greater than
- Threshold: 85%
- Duration: 15 minutes
Action:
- Email: admin@yourcompany.com
- SMS: (optional)
Why: Indicates the VM is struggling and may need an upgrade or a lighter workload.
2. Low Memory
Condition:
- Metric: Available Memory Bytes
- Operator: Less than
- Threshold: 1 GB (1,073,741,824 bytes)
- Duration: 5 minutes
Action: Email alert
Why: May cause service crashes or poor performance.
3. Azure OpenAI Throttling
Condition:
- Metric: HTTP 429 responses
- Operator: Greater than
- Threshold: 1
- Duration: 5 minutes
Action: Email + Webhook to pause processing
Why: You are hitting rate limits and requests are failing.
4. High Disk Usage
Condition:
- Metric: Percentage Disk Used
- Operator: Greater than
- Threshold: 85%
- Duration: 1 hour
Action: Email alert
Why: The disk is filling up, which may cause processing failures.
5. Service Down
Condition:
- Metric: VM Availability
- Operator: Equals
- Value: 0 (not available)
- Duration: 5 minutes
Action: Email + SMS to on-call engineer
Why: Critical - services inaccessible.
6. High Daily Cost
Condition:
- Metric: Daily Cost
- Operator: Greater than
- Threshold: $15 (adjust per plan)
- Evaluation: Daily
Action: Email alert
Why: Unexpected spending may indicate an issue or misuse.
7. Processing Failures
Condition:
- Custom log query: Failed jobs more than 10% of total
- Evaluation: Every 1 hour
KQL Query:
Syslog
| where ProcessName == "alactic-worker"
| where TimeGenerated > ago(1h)
| summarize Total = count(),
Failed = countif(SyslogMessage contains "failed")
| extend FailureRate = (Failed * 100.0) / Total
| where FailureRate > 10
Action: Email alert
Why: High failure rate indicates systemic issue.
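The failure-rate arithmetic in the KQL query above can be checked locally with sample counts; the numbers here are illustrative, not from a real deployment.

```shell
# Mirror the alert's failure-rate math with sample counts.
TOTAL=42
FAILED=5

RATE=$(awk -v t="$TOTAL" -v f="$FAILED" 'BEGIN { printf "%.1f", f * 100.0 / t }')
# The alert fires when the rate exceeds the 10% threshold.
if awk -v r="$RATE" 'BEGIN { exit !(r > 10) }'; then
  echo "ALERT: failure rate ${RATE}% exceeds 10%"
else
  echo "OK: failure rate ${RATE}%"
fi
```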
Creating Alerts in Azure Portal
Step-by-Step:
- Go to Azure Portal → Monitor
- Click "Alerts" in left sidebar
- Click "+ Create" → "Alert rule"
- Select scope:
- Click "Select resource"
- Choose your VM, Cosmos DB, or OpenAI resource
- Define condition:
- Click "Add condition"
- Select metric (e.g., "Percentage CPU")
- Configure threshold and duration
- Add action group:
- Click "+ Create action group"
- Name: "Alactic Admins"
- Add actions:
- Email: admin@yourcompany.com
- SMS: +1-555-123-4567 (optional)
- Webhook: https://your-webhook-url (optional)
- Configure alert details:
- Severity: Select appropriate level
- Alert rule name: "Alactic High CPU Usage"
- Description: "CPU usage exceeded 85% for 15 minutes"
- Click "Create alert rule"
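The same rule can be sketched with the Azure CLI. The resource group name, VM resource ID, and action group ID below are placeholders; the condition mirrors the high-CPU alert configured above.

```shell
# Create the high-CPU alert rule from the CLI
# (scopes and action group are placeholder IDs from your deployment).
az monitor metrics alert create \
  --name "Alactic High CPU Usage" \
  --resource-group alactic-rg \
  --scopes <vm-resource-id> \
  --condition "avg Percentage CPU > 85" \
  --window-size 15m \
  --evaluation-frequency 5m \
  --action <action-group-id> \
  --description "CPU usage exceeded 85% for 15 minutes"
```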
Application-Level Monitoring
Service Health Checks
Check Service Status via SSH:
# SSH into VM
ssh -i ~/.ssh/id_rsa appuser@your-vm-ip
# Check all Alactic services
sudo systemctl status alactic-*
# Individual services
sudo systemctl status alactic-api
sudo systemctl status alactic-worker
sudo systemctl status alactic-ui
sudo systemctl status alactic-solr
# View recent logs
sudo journalctl -u alactic-api -n 50
sudo journalctl -u alactic-worker -n 50 --follow
Automated Health Check Script:
#!/bin/bash
# save as check-health.sh
echo "Checking Alactic Services Health..."
# Check API service
if systemctl is-active --quiet alactic-api; then
echo "✓ API Service: Running"
else
echo "✗ API Service: FAILED"
sudo systemctl restart alactic-api
fi
# Check Worker service
if systemctl is-active --quiet alactic-worker; then
echo "✓ Worker Service: Running"
else
echo "✗ Worker Service: FAILED"
sudo systemctl restart alactic-worker
fi
# Check disk space
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
echo "⚠ Disk Usage: ${DISK_USAGE}% (HIGH)"
else
echo "✓ Disk Usage: ${DISK_USAGE}%"
fi
# Check memory
MEMORY_USAGE=$(free | awk '/^Mem/ {printf "%.0f", $3/$2 * 100}')
if [ "$MEMORY_USAGE" -gt 85 ]; then
echo "⚠ Memory Usage: ${MEMORY_USAGE}% (HIGH)"
else
echo "✓ Memory Usage: ${MEMORY_USAGE}%"
fi
Schedule with cron:
# Edit crontab
crontab -e
# Add line (run every 5 minutes)
*/5 * * * * /path/to/check-health.sh >> /var/log/alactic-health.log 2>&1
Queue Monitoring
Check Processing Queue:
# View queue depth via API
curl -H "X-Deployment-Key: ak-xxxxx" \
https://your-vm-ip/api/v1/queue/status
# Returns:
# {
# "queued": 15,
# "processing": 3,
# "completed_today": 42,
# "failed_today": 2
# }
What to Watch:
- Queue depth more than 50: May indicate processing bottleneck
- Failed count more than 10% of total: Investigate error patterns
- Processing jobs stuck: May need worker restart
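The watch rules above can be scripted. This sketch parses the queue depth from the status JSON and warns past the backlog threshold; a hard-coded sample response stands in for the live `curl` call shown earlier so the parsing logic is self-contained.

```shell
# Warn when queue depth exceeds 50.
# Sample response stands in for the live /api/v1/queue/status call.
RESPONSE='{"queued": 62, "processing": 3, "completed_today": 42, "failed_today": 2}'

QUEUED=$(echo "$RESPONSE" | grep -o '"queued": *[0-9]*' | grep -o '[0-9]*$')
if [ "$QUEUED" -gt 50 ]; then
  echo "WARN: queue depth ${QUEUED} exceeds 50"
else
  echo "OK: queue depth ${QUEUED}"
fi
```

In production, pipe the real API response into the same parsing step, or use `jq` if it is installed on the VM.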
Error Tracking
View Application Errors:
# Recent errors from API service
sudo journalctl -u alactic-api -p err -n 100
# Recent errors from worker service
sudo journalctl -u alactic-worker -p err -n 100
# Search for specific error pattern
sudo journalctl -u alactic-worker | grep -i "timeout"
Common Error Patterns:
| Error Message | Cause | Solution |
|---|---|---|
| "Connection timeout to Azure OpenAI" | Network issue or OpenAI down | Check network, retry |
| "429 Too Many Requests" | Rate limit exceeded | Reduce request rate or increase quota |
| "PDF parsing failed" | Corrupted or unsupported PDF | Check file validity |
| "Database connection failed" | Cosmos DB issue | Check Cosmos DB status |
| "Out of memory" | Memory exhausted | Restart service or upgrade plan |
Cost Monitoring
Azure Cost Management
Access Cost Analysis:
- Azure Portal → Cost Management + Billing
- Click "Cost analysis"
- Set scope: Your resource group
- View current month spending
Key Cost Breakdowns:
By Resource:
Virtual Machine: $105.00 (71%)
Cosmos DB: $12.00 (8%)
Storage: $8.00 (5%)
Azure OpenAI: $2.34 (2%)
Network: $2.80 (2%)
Other: $16.86 (12%)
Total: $147.00
By Service Category:
Compute: $105.00
Database: $12.00
Storage: $8.00
AI Services: $2.34
Networking: $2.80
Trend Analysis:
- Compare month-over-month
- Identify cost spikes
- Project end-of-month total
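A simple linear projection is enough for the end-of-month estimate: divide spend-to-date by the day of the month and multiply by the days in the month. The figures below are sample values.

```shell
# Project end-of-month spend from month-to-date cost (sample figures).
SPEND_TO_DATE=98.00   # $ spent so far this month (sample)
DAY_OF_MONTH=20
DAYS_IN_MONTH=30

PROJECTED=$(awk -v s="$SPEND_TO_DATE" -v d="$DAY_OF_MONTH" -v m="$DAYS_IN_MONTH" \
  'BEGIN { printf "%.2f", s / d * m }')
echo "Projected month-end total: \$${PROJECTED}"
```

Compare the projection against your budget thresholds; a projection above the 75% alert level warrants an early look at usage patterns.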
Cost Budgets and Alerts
Create Budget:
- Azure Portal → Cost Management
- Click "Budgets"
- Click "+ Add"
- Configure:
- Name: "Alactic Monthly Budget"
- Reset period: Monthly
- Amount: $200 (includes buffer)
- Add alert conditions:
- 75% of budget: Email notification
- 90% of budget: Email notification
- 100% of budget: Email + stop processing action
- Click "Create"
Budget Alert Actions:
75% threshold ($150):
- Email: "Approaching monthly budget"
- Review usage patterns
- Identify any anomalies
90% threshold ($180):
- Email: "Critical budget level"
- Investigate high costs immediately
- Consider reducing usage
100% threshold ($200):
- Email: "Budget exceeded"
- Consider automatic actions:
- Deallocate VM (stops processing, saves costs)
- Disable Azure OpenAI (prevent additional model costs)
Token Usage Tracking
Monitor in Dashboard:
Settings → Usage Statistics → Token Usage
Key Metrics:
Current Month:
Input Tokens: 2,450,000
Output Tokens: 385,000
Model Breakdown:
GPT-4o mini:
Input: 2,100,000 ($0.315)
Output: 310,000 ($0.186)
GPT-4o:
Input: 350,000 ($0.875)
Output: 75,000 ($0.750)
Total Cost: $2.13
Average per Document: $0.0117
Cost Optimization Tips:
- Use GPT-4o mini by default
- Enable GPT-4o only for complex documents
- Use Quick Extract to reduce output tokens
- Minimize reprocessing
Performance Monitoring
Processing Speed Tracking
Monitor Processing Times:
Dashboard → Recent Activity
Average Processing Times (Last 24 Hours):
PDF Processing:
1-5 pages: 15.2s avg
6-10 pages: 32.4s avg
11-20 pages: 58.7s avg
21-50 pages: 2m 35s avg
URL Scraping:
Simple sites: 12.3s avg
Complex sites: 28.9s avg
Model Inference:
GPT-4o mini: 0.4s avg
GPT-4o: 0.6s avg
Performance Trends:
Track weekly averages:
- Week 1: 18.5s avg
- Week 2: 19.2s avg (slight increase)
- Week 3: 22.1s avg (investigate)
- Week 4: 18.8s avg (back to normal)
Investigate if:
- Processing times increase more than 20% week-over-week
- Consistently slower than baseline
- High variance in processing times
Throughput Monitoring
Track Documents Processed:
Daily Throughput:
Monday: 28 documents
Tuesday: 35 documents
Wednesday: 42 documents (peak)
Thursday: 31 documents
Friday: 24 documents
Weekend: 8 documents
Weekly Total: 168 documents
Monthly Projection: 720 documents (exceeds quota!)
Action: Plan for upgrade if consistently exceeding quota.
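The monthly projection above comes from scaling the weekly total to a 30-day month, which can be sketched in one line:

```shell
# Scale a weekly document total to a 30-day month.
WEEKLY=168
PROJECTED=$(( WEEKLY * 30 / 7 ))
echo "Monthly projection: ${PROJECTED} documents"
```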
Latency Analysis
Component Latency Breakdown:
Total Processing Time: 18.5s
Breakdown:
File upload: 1.2s (6%)
PDF parsing: 3.8s (21%)
Text extraction: 2.1s (11%)
Model inference: 0.4s (2%)
Summary generation: 8.5s (46%)
Vector embedding: 1.8s (10%)
Database write: 0.7s (4%)
Bottleneck: Summary generation (46% of time)
Optimization:
- Use Quick Extract if summary not needed
- Consider GPT-4o mini for faster inference
Uptime Monitoring
Availability Tracking
Azure VM Availability:
Azure Portal → VM → Metrics → Availability
Last 30 Days:
Uptime: 99.95%
Downtime: 21.6 minutes
Incidents:
- March 5: 12 minutes (Azure maintenance)
- March 18: 9 minutes (manual restart)
SLA:
- Free/Pro/Pro+ Plans: No SLA guarantee
- Enterprise Plan: 99.9% SLA
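The downtime figure above follows directly from the uptime percentage: a 30-day window has 43,200 minutes, so the conversion can be sketched as:

```shell
# Convert an uptime percentage into downtime minutes over 30 days.
UPTIME=99.95
DOWNTIME=$(awk -v u="$UPTIME" \
  'BEGIN { printf "%.1f", (100 - u) / 100 * 30 * 24 * 60 }')
echo "Downtime over 30 days: ${DOWNTIME} minutes"
```

The same arithmetic shows why SLA targets matter: 99.9% still allows about 43 minutes of downtime per month.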
External Monitoring
Use Third-Party Uptime Monitors:
Recommended Services:
- UptimeRobot (free for 50 monitors)
- Pingdom
- StatusCake
Configuration:
- Sign up for service
- Add HTTP(S) monitor:
- URL: https://your-vm-ip/api/v1/health
- Check interval: Every 5 minutes
- Timeout: 30 seconds
- Add alert contacts
- Monitor dashboard and API separately
Benefits:
- Independent verification (not relying on Azure)
- Historical uptime data
- Public status page for users
Best Practices
Monitoring Checklist
Daily:
- Check dashboard for failed jobs
- Review service health status
- Monitor queue depth
Weekly:
- Review processing times and throughput
- Check Azure Monitor alerts
- Analyze cost trends
- Review error logs
Monthly:
- Comprehensive cost analysis
- Capacity planning review
- Performance trend analysis
- Security audit (access logs)
- Update monitoring thresholds if needed
Alert Fatigue Prevention
Avoid Too Many Alerts:
- Set appropriate thresholds (not too sensitive)
- Group related alerts
- Use alert severity levels
- Implement alert cooldown periods
Alert Priority Levels:
P1 (Critical):
- VM down
- Service completely failed
- Security breach
P2 (High):
- High error rate (more than 20%)
- Critical resource exhaustion (CPU more than 95%)
- Budget exceeded
P3 (Medium):
- Elevated error rate (more than 10%)
- High resource usage (more than 85%)
- Processing delays
P4 (Low):
- Warning thresholds crossed
- Informational alerts
Dashboard Customization
Create Custom Views:
Azure Portal → Dashboard → New Dashboard
Recommended Tiles:
- VM CPU Usage (line chart, 24h)
- VM Memory Usage (line chart, 24h)
- Cosmos DB Request Units (line chart, 24h)
- Azure OpenAI Token Usage (bar chart, monthly)
- Daily Cost (bar chart, last 30 days)
- Active Alerts (list)
- Recent Deployments (list)
Share with Team:
- Azure Portal → Dashboard → Share
- Set permissions (Viewer, Contributor)