Monitoring and Observability
Monitoring your Alactic AGI deployment helps you maintain performance, reliability, and cost efficiency. This guide covers monitoring strategies, tools, metrics, and best practices for keeping a production deployment healthy.
Monitoring Strategy
Why Monitor?
Key Benefits:
- Early Problem Detection: Identify issues before users notice
- Performance Optimization: Find bottlenecks and optimize
- Cost Control: Track spending and prevent overruns
- Capacity Planning: Forecast growth and scale proactively
- Troubleshooting: Quickly diagnose and resolve issues
Monitoring Layers
1. Infrastructure Layer
- VM health (CPU, memory, disk)
- Network connectivity
- Azure service availability
2. Application Layer
- Service status (API, frontend, worker)
- Processing queue depth
- Error rates and types
3. Business Layer
- Documents processed
- Processing success rate
- User activity patterns
4. Cost Layer
- Infrastructure spending
- Model API costs
- Monthly budget tracking
Built-in Dashboard Monitoring
Usage Statistics
Access via Settings → Usage Statistics
Key Metrics:
Document Processing:
PDFs Processed This Month: 45 / 100
URLs Scraped This Month: 132 / 200
Remaining Quota: 55 PDFs, 68 URLs
Storage:
Storage Used: 3.2 GB / 5 GB (64%)
Growth Rate: +120 MB/week
Projected Full: 6 weeks
Token Usage:
Input Tokens (Month): 2,450,000
Output Tokens (Month): 385,000
Total Cost: $2.13
Average per Document: $0.0117
Model Distribution:
GPT-4o mini: 140 documents (77%), $0.50
GPT-4o: 42 documents (23%), $1.63
Total: 182 documents, $2.13
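As a sanity check, the per-token arithmetic behind these figures can be reproduced in a short shell snippet. The per-million-token rates below are assumptions for GPT-4o mini (verify against current Azure OpenAI pricing), so the result illustrates the method rather than your exact dashboard total, which depends on the model mix.

```shell
# Estimate monthly cost from token counts.
# Rates are assumed GPT-4o mini prices per 1M tokens; check current pricing.
INPUT_TOKENS=2450000
OUTPUT_TOKENS=385000
INPUT_RATE=0.15    # $ per 1M input tokens (assumption)
OUTPUT_RATE=0.60   # $ per 1M output tokens (assumption)

COST=$(awk -v i="$INPUT_TOKENS" -v o="$OUTPUT_TOKENS" \
          -v irate="$INPUT_RATE" -v orate="$OUTPUT_RATE" \
  'BEGIN { printf "%.2f", (i * irate + o * orate) / 1000000 }')
echo "Estimated cost: \$${COST}"
```

Divide the result by documents processed to get the average cost per document.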
Processing Status
Active Jobs:
Processing: 3 jobs
Queued: 12 jobs
Completed (Today): 28 jobs
Failed (Today): 1 job
Recent Activity:
[2024-03-20 14:35] invoice_March.pdf - Completed (18.2s)
[2024-03-20 14:33] contract_NDA.pdf - Completed (42.5s)
[2024-03-20 14:30] article_scrape.url - Failed (timeout)
Service Health
Check service status:
Dashboard → Settings → Service Status
✓ API Service: Running (healthy)
✓ Worker Service: Running (2 jobs active)
✓ Frontend: Running
✓ Solr Search: Running
✓ Database: Connected
✓ Azure OpenAI: Connected (120ms latency)
What to Watch:
- Any service showing "Stopped" or "Failed"
- High latency to Azure OpenAI (more than 500ms)
- Database connection errors
Azure Monitor Integration
Enable Azure Monitor
Step 1: Create Log Analytics Workspace
- Go to Azure Portal
- Click "+ Create a resource"
- Search "Log Analytics Workspace"
- Click "Create"
- Configure:
- Resource group: Same as Alactic deployment
- Name: alactic-logs-workspace
- Region: Same as VM region
- Pricing tier: Pay-as-you-go (about $2-5/month)
- Click "Review + Create"
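If you prefer scripting the setup, the same workspace can be created with the Azure CLI. This is a sketch: the resource group name is a placeholder for your Alactic deployment's group, and it assumes you are already logged in with `az login`.

```shell
# Create the Log Analytics workspace from the CLI
# (resource group and region are placeholders; match your deployment).
az monitor log-analytics workspace create \
  --resource-group alactic-rg \
  --workspace-name alactic-logs-workspace \
  --location eastus
```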
Step 2: Connect VM to Log Analytics
- Go to your VM resource
- Click "Insights" in left sidebar
- Click "Enable"
- Select your Log Analytics workspace
- Wait 5-10 minutes for setup
Step 3: Enable Diagnostic Logs
For each resource (VM, Cosmos DB, Storage):
- Go to resource
- Click "Diagnostic settings"
- Click "+ Add diagnostic setting"
- Configure:
- Name: "Send to Log Analytics"
- Select all log categories
- Destination: Log Analytics workspace
- Click "Save"
Key Metrics to Monitor
Virtual Machine Metrics:
| Metric | Threshold | Action |
|---|---|---|
| Percentage CPU | more than 80% sustained | Investigate workload or upgrade |
| Available Memory | less than 1 GB | Check for memory leaks or upgrade |
| Disk Queue Length | more than 5 | Disk bottleneck, upgrade to faster disk |
| Network In/Out | Near bandwidth limit | Check for unexpected traffic |
| Disk Read/Write IOPS | Near IOPS limit | Upgrade to Premium SSD |
Azure OpenAI Metrics:
| Metric | Threshold | Action |
|---|---|---|
| HTTP 429 Errors | Any occurrence | Request quota increase |
| Average Latency | more than 5 seconds | Check network or OpenAI status |
| Failed Requests | more than 5% | Investigate errors, check quota |
| Token Usage | Approaching quota | Plan for overage or reduce usage |
Cosmos DB Metrics:
| Metric | Threshold | Action |
|---|---|---|
| Request Units (RU/s) | Near 5,000 (serverless limit) | Optimize queries or upgrade |
| Storage Used | more than 40 GB | Clean up old data |
| HTTP 429 (Rate Limited) | Any occurrence | Reduce request rate |
Querying Logs with KQL
Access Log Analytics:
- Azure Portal → Log Analytics workspace
- Click "Logs"
- Write KQL queries
Common Queries:
1. VM CPU Usage Over Time
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where Computer == "your-vm-name"
| where TimeGenerated > ago(24h)
| summarize avg(CounterValue) by bin(TimeGenerated, 5m)
| render timechart
2. Application Errors
Syslog
| where Facility == "user" and SeverityLevel == "err"
| where TimeGenerated > ago(24h)
| project TimeGenerated, ProcessName, SyslogMessage
| order by TimeGenerated desc
3. Azure OpenAI API Calls
AzureDiagnostics
| where ResourceType == "COGNITIVESERVICES/ACCOUNTS"
| where TimeGenerated > ago(24h)
| summarize Count=count() by OperationName, ResultType
| order by Count desc
4. Failed Document Processing
Syslog
| where ProcessName == "alactic-worker"
| where SyslogMessage contains "failed" or SyslogMessage contains "error"
| where TimeGenerated > ago(7d)
| project TimeGenerated, SyslogMessage
| order by TimeGenerated desc
5. Cost Analysis by Service
AzureDiagnostics
| where ResourceType == "COGNITIVESERVICES/ACCOUNTS"
| where TimeGenerated > ago(30d)
| extend Tokens = toint(Properties.tokens)
| summarize TotalTokens = sum(Tokens) by Model = tostring(Properties.model)
| extend EstimatedCost = TotalTokens * 0.15 / 1000000 // Adjust rate per model
Alert Configuration
Recommended Alerts
1. High CPU Usage
Condition:
- Metric: Percentage CPU
- Operator: Greater than
- Threshold: 85%
- Duration: 15 minutes
Action:
- Email: admin@yourcompany.com
- SMS: (optional)
Why: Indicates the VM is struggling and may need an upgrade or a lighter workload.
2. Low Memory
Condition:
- Metric: Available Memory Bytes
- Operator: Less than
- Threshold: 1 GB (1,073,741,824 bytes)
- Duration: 5 minutes
Action: Email alert
Why: May cause service crashes or poor performance.
3. Azure OpenAI Throttling
Condition:
- Metric: HTTP 429 responses
- Operator: Greater than
- Threshold: 1
- Duration: 5 minutes
Action: Email + Webhook to pause processing
Why: You are hitting rate limits and requests are failing.
4. High Disk Usage
Condition:
- Metric: Percentage Disk Used
- Operator: Greater than
- Threshold: 85%
- Duration: 1 hour
Action: Email alert
Why: The disk is filling up, which may cause processing failures.
5. Service Down
Condition:
- Metric: VM Availability
- Operator: Equals
- Value: 0 (not available)
- Duration: 5 minutes
Action: Email + SMS to on-call engineer
Why: Critical - services inaccessible.
6. High Daily Cost
Condition:
- Metric: Daily Cost
- Operator: Greater than
- Threshold: $15 (adjust per plan)
- Evaluation: Daily
Action: Email alert
Why: Unexpected spending may indicate an issue or misuse.
7. Processing Failures
Condition:
- Custom log query: Failed jobs more than 10% of total
- Evaluation: Every 1 hour
KQL Query:
Syslog
| where ProcessName == "alactic-worker"
| where TimeGenerated > ago(1h)
| summarize Total = count(),
Failed = countif(SyslogMessage contains "failed")
| extend FailureRate = (Failed * 100.0) / Total
| where FailureRate > 10
Action: Email alert
Why: High failure rate indicates systemic issue.
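The failure-rate arithmetic in the KQL query above can be checked locally with sample counts; the numbers here are illustrative, not from a real deployment.

```shell
# Mirror the alert's failure-rate math with sample counts.
TOTAL=42
FAILED=5

RATE=$(awk -v t="$TOTAL" -v f="$FAILED" 'BEGIN { printf "%.1f", f * 100.0 / t }')
# The alert fires when the rate exceeds the 10% threshold.
if awk -v r="$RATE" 'BEGIN { exit !(r > 10) }'; then
  echo "ALERT: failure rate ${RATE}% exceeds 10%"
else
  echo "OK: failure rate ${RATE}%"
fi
```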
Creating Alerts in Azure Portal
Step-by-Step:
- Go to Azure Portal → Monitor
- Click "Alerts" in left sidebar
- Click "+ Create" → "Alert rule"
- Select scope:
- Click "Select resource"
- Choose your VM, Cosmos DB, or OpenAI resource
- Define condition:
- Click "Add condition"
- Select metric (e.g., "Percentage CPU")
- Configure threshold and duration
- Add action group:
- Click "+ Create action group"
- Name: "Alactic Admins"
- Add actions:
- Email: admin@yourcompany.com
- SMS: +1-555-123-4567 (optional)
- Webhook: https://your-webhook-url (optional)
- Configure alert details:
- Severity: Select appropriate level
- Alert rule name: "Alactic High CPU Usage"
- Description: "CPU usage exceeded 85% for 15 minutes"
- Click "Create alert rule"
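The same rule can be sketched with the Azure CLI. The resource group name, VM resource ID, and action group ID below are placeholders; the condition mirrors the high-CPU alert configured above.

```shell
# Create the high-CPU alert rule from the CLI
# (scopes and action group are placeholder IDs from your deployment).
az monitor metrics alert create \
  --name "Alactic High CPU Usage" \
  --resource-group alactic-rg \
  --scopes <vm-resource-id> \
  --condition "avg Percentage CPU > 85" \
  --window-size 15m \
  --evaluation-frequency 5m \
  --action <action-group-id> \
  --description "CPU usage exceeded 85% for 15 minutes"
```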
Application-Level Monitoring
Service Health Checks
Check Service Status via SSH:
# SSH into VM
ssh -i ~/.ssh/id_rsa appuser@your-vm-ip
# Check all Alactic services
sudo systemctl status alactic-*
# Individual services
sudo systemctl status alactic-api
sudo systemctl status alactic-worker
sudo systemctl status alactic-ui
sudo systemctl status alactic-solr
# View recent logs
sudo journalctl -u alactic-api -n 50
sudo journalctl -u alactic-worker -n 50 --follow
Automated Health Check Script:
#!/bin/bash
# save as check-health.sh
echo "Checking Alactic Services Health..."
# Check API service
if systemctl is-active --quiet alactic-api; then
echo "✓ API Service: Running"
else
echo "✗ API Service: FAILED"
sudo systemctl restart alactic-api
fi
# Check Worker service
if systemctl is-active --quiet alactic-worker; then
echo "✓ Worker Service: Running"
else
echo "✗ Worker Service: FAILED"
sudo systemctl restart alactic-worker
fi
# Check disk space
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
echo "⚠ Disk Usage: ${DISK_USAGE}% (HIGH)"
else
echo "✓ Disk Usage: ${DISK_USAGE}%"
fi
# Check memory
MEMORY_USAGE=$(free | awk '/^Mem/ {printf "%.0f", $3/$2 * 100}')
if [ "$MEMORY_USAGE" -gt 85 ]; then
echo "⚠ Memory Usage: ${MEMORY_USAGE}% (HIGH)"
else
echo "✓ Memory Usage: ${MEMORY_USAGE}%"
fi
Schedule with cron:
# Edit crontab
crontab -e
# Add line (run every 5 minutes)
*/5 * * * * /path/to/check-health.sh >> /var/log/alactic-health.log 2>&1
Queue Monitoring
Check Processing Queue:
# View queue depth via API
curl -H "X-Deployment-Key: ak-xxxxx" \
https://your-vm-ip/api/v1/queue/status
# Returns:
# {
# "queued": 15,
# "processing": 3,
# "completed_today": 42,
# "failed_today": 2
# }
What to Watch:
- Queue depth more than 50: May indicate processing bottleneck
- Failed count more than 10% of total: Investigate error patterns
- Processing jobs stuck: May need worker restart
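The watch rules above can be scripted. This sketch parses the queue depth from the status JSON and warns past the backlog threshold; a hard-coded sample response stands in for the live `curl` call shown earlier so the parsing logic is self-contained.

```shell
# Warn when queue depth exceeds 50.
# Sample response stands in for the live /api/v1/queue/status call.
RESPONSE='{"queued": 62, "processing": 3, "completed_today": 42, "failed_today": 2}'

QUEUED=$(echo "$RESPONSE" | grep -o '"queued": *[0-9]*' | grep -o '[0-9]*$')
if [ "$QUEUED" -gt 50 ]; then
  echo "WARN: queue depth ${QUEUED} exceeds 50"
else
  echo "OK: queue depth ${QUEUED}"
fi
```

In production, pipe the real API response into the same parsing step, or use `jq` if it is installed on the VM.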
Error Tracking
View Application Errors:
# Recent errors from API service
sudo journalctl -u alactic-api -p err -n 100
# Recent errors from worker service
sudo journalctl -u alactic-worker -p err -n 100
# Search for specific error pattern
sudo journalctl -u alactic-worker | grep -i "timeout"
Common Error Patterns:
| Error Message | Cause | Solution |
|---|---|---|
| "Connection timeout to Azure OpenAI" | Network issue or OpenAI down | Check network, retry |
| "429 Too Many Requests" | Rate limit exceeded | Reduce request rate or increase quota |
| "PDF parsing failed" | Corrupted or unsupported PDF | Check file validity |
| "Database connection failed" | Cosmos DB issue | Check Cosmos DB status |
| "Out of memory" | Memory exhausted | Restart service or upgrade plan |
Cost Monitoring
Azure Cost Management
Access Cost Analysis:
- Azure Portal → Cost Management + Billing
- Click "Cost analysis"
- Set scope: Your resource group
- View current month spending
Key Cost Breakdowns:
By Resource:
Virtual Machine: $105.00 (71%)
Cosmos DB: $12.00 (8%)
Storage: $8.00 (5%)
Azure OpenAI: $2.34 (2%)
Network: $2.80 (2%)
Other: $16.86 (12%)
Total: $147.00
By Service Category:
Compute: $105.00
Database: $12.00
Storage: $8.00
AI Services: $2.34
Networking: $2.80
Trend Analysis:
- Compare month-over-month
- Identify cost spikes
- Project end-of-month total
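A simple linear projection is enough for the end-of-month estimate: divide spend-to-date by the day of the month and multiply by the days in the month. The figures below are sample values.

```shell
# Project end-of-month spend from month-to-date cost (sample figures).
SPEND_TO_DATE=98.00   # $ spent so far this month (sample)
DAY_OF_MONTH=20
DAYS_IN_MONTH=30

PROJECTED=$(awk -v s="$SPEND_TO_DATE" -v d="$DAY_OF_MONTH" -v m="$DAYS_IN_MONTH" \
  'BEGIN { printf "%.2f", s / d * m }')
echo "Projected month-end total: \$${PROJECTED}"
```

Compare the projection against your budget thresholds; a projection above the 75% alert level warrants an early look at usage patterns.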
Cost Budgets and Alerts
Create Budget:
- Azure Portal → Cost Management
- Click "Budgets"
- Click "+ Add"
- Configure:
- Name: "Alactic Monthly Budget"
- Reset period: Monthly
- Amount: $200 (includes buffer)
- Add alert conditions:
- 75% of budget: Email notification
- 90% of budget: Email notification
- 100% of budget: Email + stop processing action
- Click "Create"
Budget Alert Actions:
75% threshold ($150):
- Email: "Approaching monthly budget"
- Review usage patterns
- Identify any anomalies
90% threshold ($180):
- Email: "Critical budget level"
- Investigate high costs immediately
- Consider reducing usage
100% threshold ($200):
- Email: "Budget exceeded"
- Consider automatic actions:
- Deallocate VM (stops processing, saves costs)
- Disable Azure OpenAI (prevent additional model costs)
Token Usage Tracking
Monitor in Dashboard:
Settings → Usage Statistics → Token Usage
Key Metrics:
Current Month:
Input Tokens: 2,450,000
Output Tokens: 385,000
Model Breakdown:
GPT-4o mini:
Input: 2,100,000 ($0.315)
Output: 310,000 ($0.186)
GPT-4o:
Input: 350,000 ($0.875)
Output: 75,000 ($0.750)
Total Cost: $2.13
Average per Document: $0.0117
Cost Optimization Tips:
- Use GPT-4o mini by default
- Enable GPT-4o only for complex documents
- Use Quick Extract to reduce output tokens
- Minimize reprocessing
Performance Monitoring
Processing Speed Tracking
Monitor Processing Times:
Dashboard → Recent Activity
Average Processing Times (Last 24 Hours):
PDF Processing:
1-5 pages: 15.2s avg
6-10 pages: 32.4s avg
11-20 pages: 58.7s avg
21-50 pages: 2m 35s avg
URL Scraping:
Simple sites: 12.3s avg
Complex sites: 28.9s avg
Model Inference:
GPT-4o mini: 0.4s avg
GPT-4o: 0.6s avg
Performance Trends:
Track weekly averages:
- Week 1: 18.5s avg
- Week 2: 19.2s avg (slight increase)
- Week 3: 22.1s avg (investigate)
- Week 4: 18.8s avg (back to normal)
Investigate if:
- Processing times increase more than 20% week-over-week
- Consistently slower than baseline
- High variance in processing times
Throughput Monitoring
Track Documents Processed:
Daily Throughput:
Monday: 28 documents
Tuesday: 35 documents
Wednesday: 42 documents (peak)
Thursday: 31 documents
Friday: 24 documents
Weekend: 8 documents
Weekly Total: 168 documents
Monthly Projection: 720 documents (exceeds quota!)
Action: Plan for upgrade if consistently exceeding quota.
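The monthly projection above comes from scaling the weekly total to a 30-day month, which can be sketched in one line:

```shell
# Scale a weekly document total to a 30-day month.
WEEKLY=168
PROJECTED=$(( WEEKLY * 30 / 7 ))
echo "Monthly projection: ${PROJECTED} documents"
```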
Latency Analysis
Component Latency Breakdown:
Total Processing Time: 18.5s
Breakdown:
File upload: 1.2s (6%)
PDF parsing: 3.8s (21%)
Text extraction: 2.1s (11%)
Model inference: 0.4s (2%)
Summary generation: 8.5s (46%)
Vector embedding: 1.8s (10%)
Database write: 0.7s (4%)
Bottleneck: Summary generation (46% of time)
Optimization:
- Use Quick Extract if summary not needed
- Consider GPT-4o mini for faster inference
Uptime Monitoring
Availability Tracking
Azure VM Availability:
Azure Portal → VM → Metrics → Availability
Last 30 Days:
Uptime: 99.95%
Downtime: 21.6 minutes
Incidents:
- March 5: 12 minutes (Azure maintenance)
- March 18: 9 minutes (manual restart)
SLA:
- Free/Pro/Pro+ Plans: No SLA guarantee
- Enterprise Plan: 99.9% SLA
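The downtime figure above follows directly from the uptime percentage: a 30-day window has 43,200 minutes, so the conversion can be sketched as:

```shell
# Convert an uptime percentage into downtime minutes over 30 days.
UPTIME=99.95
DOWNTIME=$(awk -v u="$UPTIME" \
  'BEGIN { printf "%.1f", (100 - u) / 100 * 30 * 24 * 60 }')
echo "Downtime over 30 days: ${DOWNTIME} minutes"
```

The same arithmetic shows why SLA targets matter: 99.9% still allows about 43 minutes of downtime per month.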
External Monitoring
Use Third-Party Uptime Monitors:
Recommended Services:
- UptimeRobot (free for 50 monitors)
- Pingdom
- StatusCake
Configuration:
- Sign up for service
- Add HTTP(S) monitor:
- URL: https://your-vm-ip/api/v1/health
- Check interval: Every 5 minutes
- Timeout: 30 seconds
- Add alert contacts
- Monitor dashboard and API separately
Benefits:
- Independent verification (not relying on Azure)
- Historical uptime data
- Public status page for users
Best Practices
Monitoring Checklist
Daily:
- Check dashboard for failed jobs
- Review service health status
- Monitor queue depth
Weekly:
- Review processing times and throughput
- Check Azure Monitor alerts
- Analyze cost trends
- Review error logs
Monthly:
- Comprehensive cost analysis
- Capacity planning review
- Performance trend analysis
- Security audit (access logs)
- Update monitoring thresholds if needed
Alert Fatigue Prevention
Avoid Too Many Alerts:
- Set appropriate thresholds (not too sensitive)
- Group related alerts
- Use alert severity levels
- Implement alert cooldown periods
Alert Priority Levels:
P1 (Critical):
- VM down
- Service completely failed
- Security breach
P2 (High):
- High error rate (more than 20%)
- Critical resource exhaustion (CPU more than 95%)
- Budget exceeded
P3 (Medium):
- Elevated error rate (more than 10%)
- High resource usage (more than 85%)
- Processing delays
P4 (Low):
- Warning thresholds crossed
- Informational alerts
Dashboard Customization
Create Custom Views:
Azure Portal → Dashboard → New Dashboard
Recommended Tiles:
- VM CPU Usage (line chart, 24h)
- VM Memory Usage (line chart, 24h)
- Cosmos DB Request Units (line chart, 24h)
- Azure OpenAI Token Usage (bar chart, monthly)
- Daily Cost (bar chart, last 30 days)
- Active Alerts (list)
- Recent Deployments (list)
Share with Team:
- Azure Portal → Dashboard → Share
- Set permissions (Viewer, Contributor)