Deployment Troubleshooting

This guide covers common issues encountered during Azure Marketplace deployment and how to resolve them. Follow the troubleshooting steps in order, starting with the most common issues.

Pre-Deployment Validation

Before attempting deployment, verify these requirements are met:

Azure OpenAI Service Access

Check if you have access:

  1. Go to Azure Portal
  2. Click "+ Create a resource"
  3. Search for "Azure OpenAI"
  4. Select a region (e.g., East US)
  5. Check if you can proceed to "Create"

If you see "Request Access" or "Not Available":

You need to apply for access:

  1. Visit Azure OpenAI Access Request Form
  2. Fill out application form
  3. Wait 1-5 business days for approval
  4. Check email for approval notification
  5. Retry deployment after approval

Why is this required?
Azure OpenAI Service is a controlled service requiring Microsoft approval to prevent misuse.

vCPU Quota Check

Verify sufficient quota:

  1. Go to Azure Portal → Subscriptions
  2. Select your subscription
  3. Click "Usage + quotas" in left sidebar
  4. Filter by:
    • Provider: "Microsoft.Compute"
    • Region: Your target deployment region (e.g., East US)
  5. Find "Standard DSv3 Family vCPUs" or "Standard BS Family vCPUs"
  6. Check: Current usage + required vCPUs ≤ Limit

Required vCPUs by plan:

  • Free Plan: 2 vCPUs (Standard_B2s)
  • Pro Plan: 2 vCPUs (Standard_D2s_v3)
  • Pro+ Plan: 4 vCPUs (Standard_D4s_v3)
  • Enterprise Plan: 8+ vCPUs
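
The quota check in step 6 and the per-plan requirements can be expressed as a small shell helper. This is a minimal sketch: the function names `required_vcpus` and `quota_ok` are illustrative and not part of any Alactic or Azure tooling, and the Enterprise value uses the stated minimum of 8.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the vCPUs a plan needs, per the list above (Enterprise: minimum of 8).
required_vcpus() {
  case "$1" in
    free|pro)   echo 2 ;;
    pro+)       echo 4 ;;
    enterprise) echo 8 ;;
    *)          echo "unknown plan: $1" >&2; return 1 ;;
  esac
}

# Succeed when current usage + required vCPUs <= limit.
# Usage: quota_ok <current-usage> <required> <limit>
quota_ok() {
  [ $(( $1 + $2 )) -le "$3" ]
}
```

For example, with 6 vCPUs in use and a limit of 10, `quota_ok 6 "$(required_vcpus pro+)" 10` succeeds because 6 + 4 ≤ 10. Current usage and limits for a region can also be listed from the CLI with `az vm list-usage --location eastus -o table`.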

If quota insufficient:

  1. Click "Request quota increase"
  2. Select VM series (DSv3 or BS)
  3. Enter new limit (current + needed)
  4. Provide justification: "Alactic AGI deployment"
  5. Submit request
  6. Wait 1-2 business days for approval

SSH Key Pair

Verify you have SSH keys:

Windows (PowerShell):

Test-Path ~/.ssh/id_rsa.pub
# Should return: True

Mac/Linux:

ls ~/.ssh/id_rsa.pub
# Should list the file

If keys don't exist, generate them:

# Windows/Mac/Linux
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

# Press Enter for default location
# Enter passphrase (optional but recommended)

Get public key content:

# Windows
Get-Content ~/.ssh/id_rsa.pub

# Mac/Linux
cat ~/.ssh/id_rsa.pub

Copy the entire output (starts with ssh-rsa).

Deployment Failure: Azure OpenAI Quota Exceeded

Error Message:

Deployment failed with error:
The subscription does not have enough quota for 'StandardGPT4o' in region 'eastus'.
Current usage: 100K TPM, Available: 100K TPM, Requested: 30K TPM

Cause:
Azure OpenAI enforces per-subscription, per-region Tokens Per Minute (TPM) quotas for each model. The requested deployment would exceed the quota that remains.

Solution 1: Request Quota Increase

  1. Go to Azure OpenAI Studio
  2. Click "Quotas" in left sidebar
  3. Select your subscription and region
  4. Find model (GPT-4o or GPT-4o mini)
  5. Click "Request quota increase"
  6. Enter new TPM limit (recommend 150K for Pro, 300K for Pro+)
  7. Provide justification: "Alactic AGI document processing workload"
  8. Submit request
  9. Wait 1-3 business days for approval
  10. Retry deployment

Solution 2: Deploy in Different Region

If quota unavailable in your preferred region:

  1. Check Regional Availability guide
  2. Select alternative region with capacity
  3. Retry deployment in new region

Solution 3: Delete Existing Azure OpenAI Deployments

If you have other Azure OpenAI resources using quota:

  1. Go to Azure Portal
  2. Search for "Azure OpenAI" resources
  3. Review each resource's model deployments
  4. Delete unused model deployments to free quota
  5. Wait 5-10 minutes for quota to update
  6. Retry Alactic deployment
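
To review model deployments without clicking through each resource, the same inspection can be sketched with the Azure CLI. This is an illustration under assumptions: the function name is hypothetical, and it assumes your Azure OpenAI accounts report the kind `OpenAI`. It only lists deployments; deleting them is left as a manual step.

```shell
# Sketch: list model deployments for every Azure OpenAI account in the
# current subscription. Assumes accounts report kind "OpenAI".
list_openai_deployments() {
  az cognitiveservices account list \
      --query "[?kind=='OpenAI'].[name, resourceGroup]" -o tsv |
  while read -r name rg; do
    echo "== $name ($rg) =="
    # </dev/null keeps the inner command from consuming the loop's input.
    az cognitiveservices account deployment list \
        --name "$name" --resource-group "$rg" -o table </dev/null
  done
}
```

Run `list_openai_deployments` after `az login` and look for deployments you no longer use.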

Deployment Failure: Resource Provider Not Registered

Error Message:

The subscription is not registered to use namespace 'Microsoft.CognitiveServices'

Cause:
Azure requires resource providers to be registered before use. First-time users may not have this provider registered.

Solution:

  1. Go to Azure Portal → Subscriptions
  2. Select your subscription
  3. Click "Resource providers" in left sidebar
  4. Search for "Microsoft.CognitiveServices"
  5. Click "Register"
  6. Wait 2-5 minutes for registration to complete
  7. Repeat for these providers if not registered:
    • Microsoft.CognitiveServices
    • Microsoft.DocumentDB (for Cosmos DB)
    • Microsoft.KeyVault
    • Microsoft.Storage
    • Microsoft.Network
    • Microsoft.Compute
  8. Retry deployment

Or use Azure CLI:

az provider register --namespace Microsoft.CognitiveServices
az provider register --namespace Microsoft.DocumentDB
az provider register --namespace Microsoft.KeyVault
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.Network
az provider register --namespace Microsoft.Compute

# Check registration status
az provider show --namespace Microsoft.CognitiveServices --query "registrationState"
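
Because registration is asynchronous, it can help to register all six providers and then poll until each one reports `Registered`. The following is a minimal sketch, not an official script: the function name, polling interval, and attempt count are assumptions chosen for illustration.

```shell
# Sketch: register each provider, then poll until it reports "Registered".
# The function name and polling defaults are illustrative assumptions.
PROVIDERS="Microsoft.CognitiveServices Microsoft.DocumentDB Microsoft.KeyVault Microsoft.Storage Microsoft.Network Microsoft.Compute"

# Usage: register_providers [interval-seconds] [max-checks-per-provider]
register_providers() {
  interval="${1:-15}"
  attempts="${2:-20}"
  for ns in $PROVIDERS; do
    az provider register --namespace "$ns" >/dev/null || return 1
    i=0
    state=""
    while [ "$i" -lt "$attempts" ]; do
      state=$(az provider show --namespace "$ns" --query registrationState -o tsv)
      [ "$state" = "Registered" ] && break
      i=$(( i + 1 ))
      sleep "$interval"
    done
    echo "$ns: $state"
    [ "$state" = "Registered" ] || return 1
  done
}
```

With the defaults, each provider is checked every 15 seconds for up to 5 minutes, which covers the 2-5 minute registration time noted above.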

Deployment Stuck: VM Extension Installation Timeout

Symptoms:

  • Deployment shows "Running" for 30+ minutes
  • Last status: "Installing VM extensions..."
  • Eventually times out with error

Cause:
The VM custom script extension is slow or failing while it downloads files or installs packages.

Solution 1: Check Internet Connectivity

  1. Go to Azure Portal → Resource Groups
  2. Find your Alactic deployment resource group
  3. Click on Virtual Machine resource
  4. Click "Connect" → "SSH"
  5. Use your SSH private key to connect
  6. Check internet:
    ping -c 4 8.8.8.8
    curl -I https://www.microsoft.com
  7. If no connectivity, check Network Security Group rules

Solution 2: Check Extension Logs

SSH into VM and check logs:

# View extension logs
sudo cat /var/log/azure/custom-script/handler.log

# Check for errors
sudo grep -i error /var/log/azure/custom-script/handler.log

# View deployment script output
sudo cat /var/lib/waagent/custom-script/download/0/stdout
sudo cat /var/lib/waagent/custom-script/download/0/stderr

Common errors and solutions:

Error: "Could not download deployment package"

  • Blob storage URL incorrect
  • Blob not public
  • Solution: Verify blob URL and permissions

Error: "Package installation failed"

  • apt-get update failed
  • Solution: Run sudo apt-get update manually, check for repository issues

Error: "Python package installation failed"

  • PyPI connection timeout
  • Solution: Retry deployment, check VM's outbound connectivity
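
The three cases above can be checked in one pass over the log. This is a minimal sketch that assumes the error strings appear verbatim in the log file; the function name `diagnose_extension_log` is hypothetical.

```shell
# Sketch: scan a custom script extension log for the three error strings
# listed above and print the matching suggestion.
# Returns non-zero when none of the known errors is found.
diagnose_extension_log() {
  log="$1"
  found=1
  if grep -qF "Could not download deployment package" "$log"; then
    echo "download: verify blob URL and permissions"
    found=0
  fi
  if grep -qF "Python package installation failed" "$log"; then
    echo "pip: retry deployment, check outbound connectivity"
    found=0
  fi
  if grep -qF "Package installation failed" "$log"; then
    echo "apt: run 'sudo apt-get update' manually, check repositories"
    found=0
  fi
  return "$found"
}
```

The log files named above are readable only by root, so run the function in a root shell or against a copy of the log, for example `diagnose_extension_log /var/log/azure/custom-script/handler.log`.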

Solution 3: Delete and Redeploy

If extension continues failing:

  1. Go to Azure Portal → Resource Groups
  2. Find deployment resource group
  3. Click "Delete resource group"
  4. Confirm deletion
  5. Wait 5-10 minutes for cleanup
  6. Retry deployment from marketplace

Deployment Failure: Invalid SSH Key Format

Error Message:

VM deployment failed: Invalid SSH public key format

Cause:
SSH public key contains invalid characters, wrong format, or line breaks.

Solution 1: Verify Key Format

Valid SSH public key format:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... user@hostname

Should:

  • Start with ssh-rsa (or ssh-ed25519)
  • Be one continuous line (no line breaks)
  • End with comment (optional)

Should NOT:

  • Have -----BEGIN PUBLIC KEY----- headers
  • Have line breaks in middle of key
  • Have special characters outside Base64 set
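
These rules can be checked before pasting the key into the deployment form. The function below is a rough sketch, not the validation Azure itself performs: the name `is_valid_ssh_pubkey` is hypothetical, and it accepts only the two key types named in this guide.

```shell
# Sketch: check a public key string against the rules above.
# Accepts only ssh-rsa and ssh-ed25519, the types named in this guide.
is_valid_ssh_pubkey() {
  key="$1"
  # Must be exactly one line (no embedded line breaks).
  [ "$(printf '%s\n' "$key" | grep -c '')" -eq 1 ] || return 1
  # Key type, one space, Base64 body, then an optional comment.
  printf '%s\n' "$key" |
    grep -Eq '^(ssh-rsa|ssh-ed25519) [A-Za-z0-9+/]+={0,2}( .*)?$'
}
```

Usage: `is_valid_ssh_pubkey "$(cat ~/.ssh/id_rsa.pub)" && echo "key looks well-formed"`. For a stricter check, `ssh-keygen -l -f ~/.ssh/id_rsa.pub` prints the key fingerprint and fails if the file cannot be parsed.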

Solution 2: Copy Key Correctly

Windows:

Get-Content ~/.ssh/id_rsa.pub | Set-Clipboard

Mac:

pbcopy < ~/.ssh/id_rsa.pub

Linux:

cat ~/.ssh/id_rsa.pub | xclip -selection clipboard

Solution 3: Generate New Key

If key appears corrupted:

# Generate new key pair
ssh-keygen -t rsa -b 4096 -f ~/.ssh/alactic_key

# Copy public key
cat ~/.ssh/alactic_key.pub

Use new public key in deployment form.

Deployment Failure: Model Deployment Timeout

Error Message:

Azure OpenAI model deployment failed: Operation timed out

Cause:
Azure OpenAI Service is experiencing high demand, so the model deployment was queued until the operation timed out.

Solution 1: Wait and Retry

  1. Delete failed deployment
  2. Wait 15-30 minutes
  3. Retry deployment
  4. Try during off-peak hours (early morning UTC)

Solution 2: Deploy Smaller Model First

If GPT-4o deployment fails:

  1. Try deploying GPT-4o mini instead (Free or Pro Plan)
  2. Check if deployment succeeds
  3. If successful, issue is GPT-4o capacity
  4. Contact Azure Support for GPT-4o quota

Solution 3: Use Different Region

Some regions have more capacity:

  1. Check Regional Availability
  2. Select region with better capacity (e.g., East US 2 if East US fails)
  3. Retry deployment

Post-Deployment: Services Not Starting

Symptoms:

  • Deployment completed successfully
  • Can access VM via SSH
  • Dashboard URL doesn't load (connection refused or timeout)

Cause:
Services failed to start after deployment.

Diagnosis:

SSH into VM and check service status:

# Check all services
sudo systemctl status 'alactic-*'

# Check specific services
sudo systemctl status alactic-api
sudo systemctl status alactic-ui
sudo systemctl status alactic-solr
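
For a quick pass/fail summary of the three services, a loop over `systemctl is-active` is often easier to read than full status output. This is a minimal sketch; the function name `check_services` is an assumption, and the service names are the ones listed above.

```shell
# Sketch: report which of the services named above are active.
# Returns non-zero if any service is not active.
check_services() {
  rc=0
  for svc in alactic-api alactic-ui alactic-solr; do
    if systemctl is-active --quiet "$svc"; then
      echo "$svc: active"
    else
      echo "$svc: NOT active (see: sudo journalctl -u $svc -n 50)"
      rc=1
    fi
  done
  return "$rc"
}
```

Run `check_services` on the VM; any line marked NOT active points to the service whose logs to inspect first.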

Solution 1: Service Start Failed

If service status shows "failed":

# View service logs
sudo journalctl -u alactic-api -n 50

# Common issues:
# - Port already in use
# - Missing environment variables
# - Python package import errors

# Restart service
sudo systemctl restart alactic-api

Solution 2: Check Environment Variables

# Verify .env file exists
cat ~/.env

# Should contain:
# DEPLOYMENT_KEY=ak-xxxxx
# AZURE_OPENAI_ENDPOINT=https://...
# AZURE_OPENAI_KEY=...
# COSMOS_DB_ENDPOINT=https://...
# COSMOS_DB_KEY=...
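
Since `cat` prints secret values to the terminal, a presence check that never echoes the values may be preferable on shared screens. The sketch below assumes the five variable names listed above and a simple `NAME=value` file format; the function name `check_env_file` is hypothetical.

```shell
# Sketch: confirm each expected variable has a non-empty value in the
# .env file, without printing any secret values.
# Usage: check_env_file [path]   (defaults to ~/.env)
check_env_file() {
  env_file="${1:-$HOME/.env}"
  [ -f "$env_file" ] || { echo "not found: $env_file"; return 1; }
  missing=0
  for var in DEPLOYMENT_KEY AZURE_OPENAI_ENDPOINT AZURE_OPENAI_KEY \
             COSMOS_DB_ENDPOINT COSMOS_DB_KEY; do
    if ! grep -Eq "^${var}=.+" "$env_file"; then
      echo "missing or empty: $var"
      missing=1
    fi
  done
  return "$missing"
}
```

The function prints one line per missing variable and returns non-zero if any are absent, so it can gate a service restart: `check_env_file && sudo systemctl restart alactic-api`.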

If missing, regenerate from Key Vault:

# Get Key Vault name from deployment
KEYVAULT_NAME="alactic-kv-xxxxx"

# Retrieve secrets
az keyvault secret show --vault-name $KEYVAULT_NAME --name deployment-key --query value -o tsv
az keyvault secret show --vault-name $KEYVAULT_NAME --name azure-openai-endpoint --query value -o tsv

Solution 3: Firewall/NSG Issues

Check if ports are blocked:

# Check if services listening
sudo netstat -tlnp | grep -E ':(80|443|8983)'

# Should see:
# :80 - nginx
# :443 - nginx
# :8983 - solr

If services running but dashboard unreachable, check Network Security Group:

  1. Go to Azure Portal → Resource Groups
  2. Click Network Security Group resource
  3. Click "Inbound security rules"
  4. Verify rule exists:
    • Name: AllowHTTPS
    • Port: 443
    • Protocol: TCP
    • Action: Allow
  5. If missing, click "Add" to create rule

Solution 4: Check SSL Certificate

# Test nginx configuration (fails if the configured certificate files are missing or unreadable)
sudo nginx -t

# Check certificate files
ls -l /etc/ssl/alactic/

# Should see:
# cert.pem
# key.pem

If missing, regenerate:

# Generate self-signed certificate
sudo mkdir -p /etc/ssl/alactic
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout /etc/ssl/alactic/key.pem \
-out /etc/ssl/alactic/cert.pem \
-subj "/CN=alactic-agi"

# Restart nginx
sudo systemctl restart nginx

Post-Deployment: Cannot Login to Dashboard

Symptoms:

  • Dashboard loads
  • Login page visible
  • Enter deployment key
  • Error: "Invalid deployment key" or "Authentication failed"

Cause:
Deployment key mismatch or authentication service issue.

Solution 1: Verify Deployment Key

Get correct key from Key Vault:

  1. Go to Azure Portal → Resource Groups
  2. Click Key Vault resource (alactic-kv-xxxxx)
  3. Click "Secrets" in left sidebar
  4. Click "deployment-key" secret
  5. Click current version
  6. Click "Show Secret Value"
  7. Copy key (format: ak-xxxxx)
  8. Try logging in again with this key

Or use Azure CLI:

az keyvault secret show \
--vault-name alactic-kv-xxxxx \
--name deployment-key \
--query value -o tsv

Solution 2: Check Key in VM

SSH into VM and verify key:

# Check .env file
grep DEPLOYMENT_KEY ~/.env

# Should show: DEPLOYMENT_KEY=ak-xxxxx

If different from Key Vault, update:

# Get correct key
CORRECT_KEY=$(az keyvault secret show --vault-name alactic-kv-xxxxx --name deployment-key --query value -o tsv)

# Update .env
sudo sed -i "s|^DEPLOYMENT_KEY=.*|DEPLOYMENT_KEY=$CORRECT_KEY|" ~/.env

# Restart API service
sudo systemctl restart alactic-api

Solution 3: Check API Service

# Verify API responds
curl -H "X-Deployment-Key: ak-xxxxx" http://localhost:8000/health

# Should return:
# {"status": "healthy"}

If error:

# Check API logs
sudo journalctl -u alactic-api -n 100

# Restart API
sudo systemctl restart alactic-api

Post-Deployment: High Costs

Symptoms:

  • Deployment successful
  • Not processing documents yet
  • Seeing charges in Azure Cost Management

Cause:
Infrastructure resources running 24/7.

Expected Baseline Costs:

Plan   Daily Cost   Monthly Cost   Main Components
Free   $2.40        $72            VM ($1.70), Storage ($0.50), Cosmos DB ($0.20)
Pro    $4.90        $147           VM ($3.50), Storage ($1.00), Cosmos DB ($0.40)
Pro+   $9.83        $295           VM ($7.50), Storage ($2.00), Cosmos DB ($0.33)
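
The baseline figures above follow from simple arithmetic: the daily cost is the sum of the three components, and the monthly cost is the daily cost over a 30-day month (an assumption inferred from the totals, not stated in the source). A small sketch for recomputing them, with hypothetical function names:

```shell
# Sketch: recompute the baseline figures from their components.
# Assumes a 30-day month, which is what the listed totals imply.

# Usage: daily_cost <vm> <storage> <cosmos-db>
daily_cost() {
  awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN { printf "%.2f\n", a + b + c }'
}

# Usage: monthly_cost <daily-cost>
monthly_cost() {
  awk -v d="$1" 'BEGIN { printf "%.0f\n", d * 30 }'
}
```

For the Pro+ plan, `daily_cost 7.50 2.00 0.33` prints 9.83 and `monthly_cost 9.83` prints 295. Comparing these against Azure Cost Management shows how much of a bill is baseline infrastructure and how much is usage.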

If costs higher than expected:

Check 1: Verify Plan

  1. SSH into VM
  2. Check VM SKU:
    curl -H Metadata:true "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-01-01&format=text"

    # Should return:
    # Free: Standard_B2s
    # Pro: Standard_D2s_v3
    # Pro+: Standard_D4s_v3
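
The mapping from VM size to plan can be wrapped in a helper so that the metadata query reports the plan directly. This is an illustrative sketch; `plan_for_vm_size` is a hypothetical name, and only the three sizes listed in this guide are recognized.

```shell
# Sketch: translate a VM size into the plan it corresponds to,
# using the three sizes listed in this guide.
plan_for_vm_size() {
  case "$1" in
    Standard_B2s)    echo "Free" ;;
    Standard_D2s_v3) echo "Pro" ;;
    Standard_D4s_v3) echo "Pro+" ;;
    *)               echo "unknown ($1)"; return 1 ;;
  esac
}
```

On the VM, pass the metadata response straight in: `plan_for_vm_size "$(curl -s -H Metadata:true "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-01-01&format=text")"`. An "unknown" result means the VM size does not match any of the three listed plans.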

Check 2: Review Azure OpenAI Usage

  1. Go to Azure Portal → Resource Groups
  2. Click Azure OpenAI resource
  3. Click "Metrics" in left sidebar
  4. Select metric: "Processed Inference Tokens"
  5. Check if usage unexpected

If seeing token usage without processing documents:

  • May have health checks consuming tokens
  • Check API logs for unexpected calls

Check 3: Stop/Deallocate VM When Not in Use

For development/testing, stop VM when idle:

# Stop VM (saves compute costs, keeps storage)
az vm deallocate --resource-group <your-rg> --name <vm-name>

# Start VM when needed
az vm start --resource-group <your-rg> --name <vm-name>

Savings: Eliminates VM compute costs (~$1.70-7.50/day)
Downside: Dashboard inaccessible while stopped

Getting Help

If issues persist after troubleshooting:

Check Deployment Logs:

  1. Go to Azure Portal → Resource Groups
  2. Click "Deployments" in left sidebar
  3. Click your deployment
  4. Review operation details and error messages

Contact Support:

  • Email: support@alacticai.com
  • Include: Deployment ID, error message, troubleshooting steps attempted
  • Enterprise plans: 24-hour response SLA

Azure Support: For Azure-specific issues (quotas, regional capacity), open a support request with Azure Support.

Prevention: Pre-Deployment Checklist

Use this checklist before every deployment:

  • Azure OpenAI Service access approved
  • Sufficient vCPU quota in target region
  • SSH key pair generated and tested
  • All required resource providers registered
  • Selected region supports all required services
  • Verified subscription has sufficient credits/budget
  • Documented deployment key storage location
  • Reviewed Regional Availability guide

Following this checklist prevents 90% of deployment failures.