Deployment Troubleshooting

This guide covers common issues encountered during Azure Marketplace deployment and how to resolve them. Follow the troubleshooting steps in order, starting with the most common issues.

Pre-Deployment Validation

Before attempting deployment, verify these requirements are met:

Azure OpenAI Service Access

Check if you have access:

Go to Azure Portal
Click "+ Create a resource"
Search for "Azure OpenAI"
Select a region (e.g., East US)
Check if you can proceed to "Create"

If you see "Request Access" or "Not Available", you need to apply for access:

Visit Azure OpenAI Access Request Form
Fill out application form
Wait 1-5 business days for approval
Check email for approval notification
Retry deployment after approval

Why is this required?
Azure OpenAI Service is a controlled service requiring Microsoft approval to prevent misuse.

vCPU Quota Check

Verify sufficient quota:

Go to Azure Portal → Subscriptions
Select your subscription
Click "Usage + quotas" in left sidebar
Filter by:
- Provider: "Microsoft.Compute"
- Region: Your target deployment region (e.g., East US)
Find "Standard DSv3 Family vCPUs" or "Standard BS Family vCPUs"
Check: Current usage + required vCPUs ≤ Limit

Required vCPUs by plan:

Free Plan: 2 vCPUs (Standard_B2s)
Pro Plan: 2 vCPUs (Standard_D2s_v3)
Pro+ Plan: 4 vCPUs (Standard_D4s_v3)
Enterprise Plan: 8+ vCPUs
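
The check "Current usage + required vCPUs ≤ Limit" can be scripted once the numbers have been read from the portal or the CLI. The following is a minimal sketch: the function names required_vcpus and has_vcpu_quota are hypothetical, and the Enterprise plan is treated as needing its stated minimum of 8 vCPUs.

```shell
# Sketch only: these function names are illustrative, not Alactic tooling.

# Print the vCPUs a plan needs, using the figures listed above.
# Enterprise is "8+", so 8 is treated as the minimum.
required_vcpus() {
  case "$1" in
    Free|Pro)   echo 2 ;;
    Pro+)       echo 4 ;;
    Enterprise) echo 8 ;;
    *)          echo "unknown plan: $1" >&2; return 1 ;;
  esac
}

# Succeed when current usage + required vCPUs <= limit.
# Arguments: <current usage> <required vCPUs> <limit>
has_vcpu_quota() {
  [ $(( $1 + $2 )) -le "$3" ]
}

# Current usage and limits per VM family can be listed with:
#   az vm list-usage --location eastus --output table
```

For example, has_vcpu_quota 6 "$(required_vcpus Pro+)" 10 succeeds because 6 + 4 does not exceed the limit of 10.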

If quota insufficient:

Click "Request quota increase"
Select VM series (DSv3 or BS)
Enter new limit (current + needed)
Provide justification: "Alactic AGI deployment"
Submit request
Wait 1-2 business days for approval

SSH Key Pair

Verify you have SSH keys:

Windows (PowerShell):

Test-Path ~/.ssh/id_rsa.pub
# Should return: True

Mac/Linux:

ls ~/.ssh/id_rsa.pub
# Should list the file

If keys don't exist, generate them:

# Windows/Mac/Linux
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

# Press Enter for default location
# Enter passphrase (optional but recommended)

Get public key content:

# Windows
Get-Content ~/.ssh/id_rsa.pub

# Mac/Linux
cat ~/.ssh/id_rsa.pub

Copy the entire output (starts with ssh-rsa).

Deployment Failure: Azure OpenAI Quota Exceeded

Error Message:

Deployment failed with error:
The subscription does not have enough quota for 'StandardGPT4o' in region 'eastus'.
Current usage: 100K TPM, Available: 100K TPM, Requested: 30K TPM

Cause:
Azure OpenAI has per-subscription Tokens Per Minute (TPM) quotas. You've reached the limit.

Solution 1: Request Quota Increase

Go to Azure OpenAI Studio
Click "Quotas" in left sidebar
Select your subscription and region
Find model (GPT-4o or GPT-4o mini)
Click "Request quota increase"
Enter new TPM limit (recommend 150K for Pro, 300K for Pro+)
Provide justification: "Alactic AGI document processing workload"
Submit request
Wait 1-3 business days for approval
Retry deployment

Solution 2: Deploy in Different Region

If quota unavailable in your preferred region:

Check Regional Availability guide
Select alternative region with capacity
Retry deployment in new region

Solution 3: Delete Existing Azure OpenAI Deployments

If you have other Azure OpenAI resources using quota:

Go to Azure Portal
Search for "Azure OpenAI" resources
Review each resource's model deployments
Delete unused model deployments to free quota
Wait 5-10 minutes for quota to update
Retry Alactic deployment

Deployment Failure: Resource Provider Not Registered

Error Message:

The subscription is not registered to use namespace 'Microsoft.CognitiveServices'

Cause:
Azure requires resource providers to be registered before use. First-time users may not have this provider registered.

Solution:

Go to Azure Portal → Subscriptions
Select your subscription
Click "Resource providers" in left sidebar
Search for "Microsoft.CognitiveServices"
Click "Register"
Wait 2-5 minutes for registration to complete
Repeat for these providers if not registered:
- Microsoft.CognitiveServices
- Microsoft.DocumentDB (for Cosmos DB)
- Microsoft.KeyVault
- Microsoft.Storage
- Microsoft.Network
- Microsoft.Compute
Retry deployment

Or use Azure CLI:

az provider register --namespace Microsoft.CognitiveServices
az provider register --namespace Microsoft.DocumentDB
az provider register --namespace Microsoft.KeyVault
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.Network
az provider register --namespace Microsoft.Compute

# Check registration status
az provider show --namespace Microsoft.CognitiveServices --query "registrationState"
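
The six registrations above can also be run as a loop that reports each provider's state. This is a sketch, assuming the Azure CLI is installed and logged in; register_providers is a hypothetical helper name, and --wait is used so that each registration finishes before its state is read.

```shell
# Sketch only: register_providers is an illustrative helper name.

# Namespaces listed in the steps above.
PROVIDERS="Microsoft.CognitiveServices Microsoft.DocumentDB Microsoft.KeyVault
Microsoft.Storage Microsoft.Network Microsoft.Compute"

# Register each namespace, then print its registration state.
# --wait blocks until registration finishes, which can take a few minutes.
register_providers() {
  for ns in $PROVIDERS; do
    az provider register --namespace "$ns" --wait || return 1
    state=$(az provider show --namespace "$ns" \
      --query registrationState --output tsv)
    echo "$ns: $state"
  done
}
```

Calling register_providers prints one line per namespace; every line should end in "Registered" before the deployment is retried.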

Deployment Stuck: VM Extension Installation Timeout

Symptoms:

Deployment shows "Running" for 30+ minutes
Last status: "Installing VM extensions..."
Eventually times out with error

Cause:
The VM custom script extension is slow or failing while it downloads files or installs packages.

Solution 1: Check Internet Connectivity

Go to Azure Portal → Resource Groups
Find your Alactic deployment resource group
Click on Virtual Machine resource
Click "Connect" → "SSH"
Use your SSH private key to connect

Check internet:

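# ICMP may be blocked on Azure networks, so a failed ping alone does not prove an outage
ping -c 4 8.8.8.8
# An HTTP response here confirms outbound connectivity
curl -I https://www.microsoft.com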

If no connectivity, check Network Security Group rules

Solution 2: Check Extension Logs

SSH into VM and check logs:

# View extension logs
sudo cat /var/log/azure/custom-script/handler.log

# Check for errors
sudo grep -i error /var/log/azure/custom-script/handler.log

# View deployment script output
sudo cat /var/lib/waagent/custom-script/download/0/stdout
sudo cat /var/lib/waagent/custom-script/download/0/stderr

Common errors and solutions:

Error: "Could not download deployment package"

Blob storage URL incorrect
Blob not public
Solution: Verify blob URL and permissions

Error: "Package installation failed"

apt-get update failed
Solution: Run sudo apt-get update manually, check for repository issues

Error: "Python package installation failed"

PyPI connection timeout
Solution: Retry deployment, check VM's outbound connectivity
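
The mapping from error message to fix can be applied to a log file automatically. The sketch below assumes the log contains the three messages exactly as quoted above; diagnose_extension_log is a hypothetical helper, not part of the deployment package.

```shell
# Sketch only: diagnose_extension_log is an illustrative helper, and the
# matched strings are the three error messages quoted above.

# Scan a log file and print a hint for each known error it contains.
# Returns 1 when none of the known errors is found.
diagnose_extension_log() {
  log="$1"
  found=1
  if grep -qi "Could not download deployment package" "$log"; then
    echo "download: verify the blob URL and its access permissions"
    found=0
  fi
  # Match the generic apt failure, but not the Python-specific message.
  if grep -i "Package installation failed" "$log" | grep -qvi "Python package"; then
    echo "apt: run 'sudo apt-get update' manually and check repositories"
    found=0
  fi
  if grep -qi "Python package installation failed" "$log"; then
    echo "pypi: check outbound connectivity, then retry the deployment"
    found=0
  fi
  return "$found"
}
```

The stdout and stderr files from Solution 2 are readable only by root, so copy them to a readable location with sudo before passing the copy to the function.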

Solution 3: Delete and Redeploy

If extension continues failing:

Go to Azure Portal → Resource Groups
Find deployment resource group
Click "Delete resource group"
Confirm deletion
Wait 5-10 minutes for cleanup
Retry deployment from marketplace

Deployment Failure: Invalid SSH Key Format

Error Message:

VM deployment failed: Invalid SSH public key format

Cause:
SSH public key contains invalid characters, wrong format, or line breaks.

Solution 1: Verify Key Format

Valid SSH public key format:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... user@hostname

Should:

Start with ssh-rsa (or ssh-ed25519)
Be one continuous line (no line breaks)
End with comment (optional)

Should NOT:

Have -----BEGIN PUBLIC KEY----- headers
Have line breaks in middle of key
Have special characters outside Base64 set
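
The rules above can be checked before pasting a key into the deployment form. This is a minimal sketch: is_valid_pubkey is a hypothetical helper that tests only the shape of the string (key type, Base64 body, optional comment, single line), not whether the key material is usable.

```shell
# Sketch only: is_valid_pubkey is an illustrative helper that applies the
# rules listed above; it does not verify the key material itself.

# Argument: the public key as a single string.
is_valid_pubkey() {
  key="$1"
  # Pattern: key type, a space, Base64 data, then an optional comment.
  re='^(ssh-rsa|ssh-ed25519) [A-Za-z0-9+/]+={0,3}( .*)?$'
  # Reject keys that span more than one line.
  [ $(printf '%s\n' "$key" | wc -l) -eq 1 ] || return 1
  printf '%s\n' "$key" | grep -Eq "$re"
}

# ssh-keygen can also parse the file and fails on a malformed key:
#   ssh-keygen -l -f ~/.ssh/id_rsa.pub
```

On Mac or Linux, is_valid_pubkey "$(cat ~/.ssh/id_rsa.pub)" && echo "format OK" runs the check against the default key file.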

Solution 2: Copy Key Correctly

Windows:

Get-Content ~/.ssh/id_rsa.pub | Set-Clipboard

Mac:

pbcopy < ~/.ssh/id_rsa.pub

Linux:

cat ~/.ssh/id_rsa.pub | xclip -selection clipboard

Solution 3: Generate New Key

If key appears corrupted:

# Generate new key pair
ssh-keygen -t rsa -b 4096 -f ~/.ssh/alactic_key

# Copy public key
cat ~/.ssh/alactic_key.pub

Use new public key in deployment form.

Deployment Failure: Model Deployment Timeout

Error Message:

Azure OpenAI model deployment failed: Operation timed out

Cause:
Azure OpenAI Service is experiencing high demand, so the model deployment is queued and times out.

Solution 1: Wait and Retry

Delete failed deployment
Wait 15-30 minutes
Retry deployment
Try during off-peak hours (early morning UTC)
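
If the deployment is triggered from a script rather than the portal, the wait-and-retry cycle can be automated. The sketch below is a generic retry loop; retry_with_delay is a hypothetical helper, and the command passed to it is whatever starts your deployment.

```shell
# Sketch only: retry_with_delay is an illustrative helper, not an Azure or
# Alactic command.

# Usage: retry_with_delay <attempts> <delay-seconds> <command> [args...]
# Runs the command until it succeeds or the attempts are used up.
retry_with_delay() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts" >&2
      return 1
    fi
    echo "attempt $n failed; waiting ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}
```

A delay of 900 to 1800 seconds corresponds to the 15-30 minute wait recommended above.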

Solution 2: Deploy Smaller Model First

If GPT-4o deployment fails:

Try deploying GPT-4o mini instead (Free or Pro Plan)
Check if deployment succeeds
If successful, issue is GPT-4o capacity
Contact Azure Support for GPT-4o quota

Solution 3: Use Different Region

Some regions have more capacity:

Check Regional Availability
Select region with better capacity (e.g., East US 2 if East US fails)
Retry deployment

Post-Deployment: Services Not Starting

Symptoms:

Deployment completed successfully
Can access VM via SSH
Dashboard URL doesn't load (connection refused or timeout)

Cause:
Services failed to start after deployment.

Diagnosis:

SSH into VM and check service status:

# Check all services
# (the pattern is quoted so the shell does not expand it against local files)
sudo systemctl status 'alactic-*'

# Check specific services
sudo systemctl status alactic-api
sudo systemctl status alactic-ui
sudo systemctl status alactic-solr
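
To see at a glance which services are down, the three checks can be folded into one loop. This is a sketch, assuming the unit names used in the commands above; failed_services is a hypothetical helper.

```shell
# Sketch only: failed_services is an illustrative helper. The unit names are
# the ones used in the commands above.

SERVICES="alactic-api alactic-ui alactic-solr"

# Print every unit that is not active; return 1 if any was printed.
failed_services() {
  rc=0
  for svc in $SERVICES; do
    if ! systemctl is-active --quiet "$svc"; then
      echo "$svc"
      rc=1
    fi
  done
  return "$rc"
}
```

An empty result means all three units are active; any name printed is the service to investigate with journalctl in Solution 1.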

Solution 1: Service Start Failed

If service status shows "failed":

# View service logs
sudo journalctl -u alactic-api -n 50

# Common issues:
# - Port already in use
# - Missing environment variables
# - Python package import errors

# Restart service
sudo systemctl restart alactic-api

Solution 2: Check Environment Variables

# Verify .env file exists
cat ~/.env

# Should contain:
# DEPLOYMENT_KEY=ak-xxxxx
# AZURE_OPENAI_ENDPOINT=https://...
# AZURE_OPENAI_KEY=...
# COSMOS_DB_ENDPOINT=https://...
# COSMOS_DB_KEY=...
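
Rather than reading the file by eye, the presence of each expected key can be tested directly. The following sketch assumes the five key names listed above are the complete set; missing_env_keys is a hypothetical helper.

```shell
# Sketch only: missing_env_keys is an illustrative helper. The key names are
# the ones listed above.

REQUIRED_KEYS="DEPLOYMENT_KEY AZURE_OPENAI_ENDPOINT AZURE_OPENAI_KEY
COSMOS_DB_ENDPOINT COSMOS_DB_KEY"

# Print each required key that has no non-empty assignment in the file.
# Values are never printed, so the output is safe to share with support.
missing_env_keys() {
  file="$1"
  rc=0
  for key in $REQUIRED_KEYS; do
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "$key"
      rc=1
    fi
  done
  return "$rc"
}
```

Running missing_env_keys ~/.env prints nothing when every key is set.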

If missing, regenerate from Key Vault:

# Get Key Vault name from deployment
KEYVAULT_NAME="alactic-kv-xxxxx"

# Retrieve secrets
az keyvault secret show --vault-name "$KEYVAULT_NAME" --name deployment-key --query value -o tsv
az keyvault secret show --vault-name "$KEYVAULT_NAME" --name azure-openai-endpoint --query value -o tsv

Solution 3: Firewall/NSG Issues

Check if ports are blocked:

# Check if services are listening (use "sudo ss -tlnp" if netstat is not installed)
sudo netstat -tlnp | grep -E ':(80|443|8983)'

# Should see:
# :80 - nginx
# :443 - nginx
# :8983 - solr

If services running but dashboard unreachable, check Network Security Group:

Go to Azure Portal → Resource Groups
Click Network Security Group resource
Click "Inbound security rules"
Verify rule exists:
- Name: AllowHTTPS
- Port: 443
- Protocol: TCP
- Action: Allow
If missing, click "Add" to create rule
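
The same rule can be checked and created from the Azure CLI. This is a sketch under assumptions: ensure_https_rule is a hypothetical helper, the resource group and NSG names must be supplied by you, and priority 1000 is an arbitrary value that must not collide with an existing rule.

```shell
# Sketch only: ensure_https_rule is an illustrative helper. Priority 1000 is
# an assumed free slot; pick another value if it is already taken.

# Usage: ensure_https_rule <resource-group> <nsg-name>
ensure_https_rule() {
  rg="$1"
  nsg="$2"
  # "rule show" fails when the rule does not exist.
  if az network nsg rule show --resource-group "$rg" --nsg-name "$nsg" \
       --name AllowHTTPS --output none 2>/dev/null; then
    echo "AllowHTTPS already exists"
    return 0
  fi
  az network nsg rule create --resource-group "$rg" --nsg-name "$nsg" \
    --name AllowHTTPS --priority 1000 --direction Inbound --access Allow \
    --protocol Tcp --destination-port-ranges 443
}
```

The function only looks for a rule named AllowHTTPS, so a differently named rule that already allows port 443 is not detected.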

Solution 4: Check SSL Certificate

# Verify SSL certificate valid
sudo nginx -t

# Check certificate files
ls -l /etc/ssl/alactic/

# Should see:
# cert.pem
# key.pem

If missing, regenerate:

# Generate self-signed certificate
sudo mkdir -p /etc/ssl/alactic
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/ssl/alactic/key.pem \
  -out /etc/ssl/alactic/cert.pem \
  -subj "/CN=alactic-agi"

# Restart nginx
sudo systemctl restart nginx

Post-Deployment: Cannot Login to Dashboard

Symptoms:

Dashboard loads
Login page visible
Enter deployment key
Error: "Invalid deployment key" or "Authentication failed"

Cause:
Deployment key mismatch or authentication service issue.

Solution 1: Verify Deployment Key

Get correct key from Key Vault:

Go to Azure Portal → Resource Groups
Click Key Vault resource (alactic-kv-xxxxx)
Click "Secrets" in left sidebar
Click "deployment-key" secret
Click current version
Click "Show Secret Value"
Copy key (format: ak-xxxxx)
Try logging in again with this key

Or use Azure CLI:

az keyvault secret show \
  --vault-name alactic-kv-xxxxx \
  --name deployment-key \
  --query value -o tsv

Solution 2: Check Key in VM

SSH into VM and verify key:

# Check .env file
grep DEPLOYMENT_KEY ~/.env

# Should show: DEPLOYMENT_KEY=ak-xxxxx

If different from Key Vault, update:

# Get correct key
CORRECT_KEY=$(az keyvault secret show --vault-name alactic-kv-xxxxx --name deployment-key --query value -o tsv)

# Update .env ("|" as the sed delimiter avoids breaking on "/" characters in the key)
sudo sed -i "s|^DEPLOYMENT_KEY=.*|DEPLOYMENT_KEY=$CORRECT_KEY|" ~/.env

# Restart API service
sudo systemctl restart alactic-api

Solution 3: Check API Service

# Verify API responds
curl -H "X-Deployment-Key: ak-xxxxx" http://localhost:8000/health

# Should return:
# {"status": "healthy"}
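
The health check can be turned into a pass/fail test for use in scripts. This sketch assumes the response shape shown above and the local port 8000 from the curl command; is_healthy and check_api_health are hypothetical helper names.

```shell
# Sketch only: both helpers are illustrative. The endpoint, header, and
# response shape are the ones shown in the commands above.

# Read a JSON body on stdin; succeed only if it reports status "healthy".
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Usage: check_api_health <deployment-key>
# -f makes curl fail on HTTP errors; --max-time avoids hanging on a dead port.
check_api_health() {
  curl -fsS --max-time 5 -H "X-Deployment-Key: $1" \
    http://localhost:8000/health | is_healthy
}
```

A non-zero exit status from check_api_health means the API either did not answer or did not report itself as healthy.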

If error:

# Check API logs
sudo journalctl -u alactic-api -n 100

# Restart API
sudo systemctl restart alactic-api

Post-Deployment: High Costs

Symptoms:

Deployment successful
Not processing documents yet
Seeing charges in Azure Cost Management

Cause:
Infrastructure resources running 24/7.

Expected Baseline Costs:

Plan	Daily Cost	Monthly Cost (30 days)	Main Components (per day)
Free	$2.40	$72	VM ($1.70), Storage ($0.50), Cosmos DB ($0.20)
Pro	$4.90	$147	VM ($3.50), Storage ($1.00), Cosmos DB ($0.40)
Pro+	$9.83	$295	VM ($7.50), Storage ($2.00), Cosmos DB ($0.33)
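
The table's figures follow from simple arithmetic: the daily cost is the sum of the three components, and the monthly cost is the daily cost over a 30-day month, rounded to whole dollars. A small sketch (daily_cost and monthly_cost are hypothetical helper names) reproduces them, which is useful when comparing against your own Cost Management numbers.

```shell
# Sketch only: helper names are illustrative; amounts are in US dollars.

# Usage: daily_cost <vm> <storage> <cosmos-db>
daily_cost() {
  awk -v vm="$1" -v st="$2" -v db="$3" 'BEGIN { printf "%.2f\n", vm + st + db }'
}

# Usage: monthly_cost <daily-cost>   (30-day month, rounded to whole dollars)
monthly_cost() {
  awk -v d="$1" 'BEGIN { printf "%.0f\n", d * 30 }'
}

daily_cost 1.70 0.50 0.20   # Free plan: prints 2.40
monthly_cost 2.40           # Free plan: prints 72
```

Daily charges well above these baselines, with no documents being processed, point to one of the checks that follow.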

If costs higher than expected:

Check 1: Verify Plan

SSH into VM

Check VM SKU:

curl -H Metadata:true "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-01-01&format=text"

# Should return:
# Free: Standard_B2s
# Pro: Standard_D2s_v3
# Pro+: Standard_D4s_v3

Check 2: Review Azure OpenAI Usage

Go to Azure Portal → Resource Groups
Click Azure OpenAI resource
Click "Metrics" in left sidebar
Select metric: "Processed Inference Tokens"
Check if usage unexpected

If seeing token usage without processing documents:

Health checks may be consuming tokens
Check API logs for unexpected calls

Check 3: Stop/Deallocate VM When Not in Use

For development/testing, stop VM when idle:

# Deallocate VM (stops compute billing, keeps disks; a plain "az vm stop" does not stop compute charges)
az vm deallocate --resource-group <your-rg> --name <vm-name>

# Start VM when needed
az vm start --resource-group <your-rg> --name <vm-name>

Savings: Eliminates VM compute costs (~$1.70-7.50/day)
Downside: Dashboard inaccessible while stopped

Getting Help

If issues persist after troubleshooting:

Check Deployment Logs:

Go to Azure Portal → Resource Groups
Click "Deployments" in left sidebar
Click your deployment
Review operation details and error messages

Contact Support:

Email: support@alacticai.com
Include: Deployment ID, error message, troubleshooting steps attempted
Enterprise plans: 24-hour response SLA

Community Help:

Support Portal: alactic.io/support
Community Forum: community.alactic.ai

Azure Support: For Azure-specific issues (quotas, regional capacity):

Azure Support Portal
An Azure support plan is required to submit a ticket

Prevention: Pre-Deployment Checklist

Use this checklist before every deployment:

Azure OpenAI Service access approved
Sufficient vCPU quota in target region
SSH key pair generated and tested
All required resource providers registered
Selected region supports all required services
Verified subscription has sufficient credits/budget
Documented deployment key storage location
Reviewed Regional Availability guide

Following this checklist prevents the most common deployment failures described in this guide.
