Deployment Troubleshooting
This guide covers common issues encountered during Azure Marketplace deployment and how to resolve them. Follow the troubleshooting steps in order, starting with the most common issues.
Pre-Deployment Validation
Before attempting deployment, verify these requirements are met:
Azure OpenAI Service Access
Check if you have access:
- Go to Azure Portal
- Click "+ Create a resource"
- Search for "Azure OpenAI"
- Select a region (e.g., East US)
- Check if you can proceed to "Create"
If you see "Request Access" or "Not Available":
You need to apply for access:
- Visit Azure OpenAI Access Request Form
- Fill out application form
- Wait 1-5 business days for approval
- Check email for approval notification
- Retry deployment after approval
Why is this required?
Azure OpenAI Service is a controlled service requiring Microsoft approval to prevent misuse.
vCPU Quota Check
Verify sufficient quota:
- Go to Azure Portal → Subscriptions
- Select your subscription
- Click "Usage + quotas" in left sidebar
- Filter by:
- Provider: "Microsoft.Compute"
- Region: Your target deployment region (e.g., East US)
- Find "Standard DSv3 Family vCPUs" or "Standard BS Family vCPUs"
- Check: Current usage + required vCPUs ≤ Limit
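The check in the last step is simple arithmetic; a minimal sketch with placeholder values (CURRENT, LIMIT, and NEEDED are assumptions — read the real numbers off the "Usage + quotas" page):

```shell
# Placeholder values - read the real ones from "Usage + quotas"
# (az vm list-usage --location eastus -o table shows the same numbers).
CURRENT=4    # vCPUs currently in use for the VM family
LIMIT=10     # quota limit for the family
NEEDED=2     # vCPUs your plan requires (e.g. Pro: Standard_D2s_v3 = 2)
if [ $((CURRENT + NEEDED)) -le "$LIMIT" ]; then
  echo "Quota sufficient"
else
  echo "Short by $((CURRENT + NEEDED - LIMIT)) vCPUs - request an increase"
fi
```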
Required vCPUs by plan:
- Free Plan: 2 vCPUs (Standard_B2s)
- Pro Plan: 2 vCPUs (Standard_D2s_v3)
- Pro+ Plan: 4 vCPUs (Standard_D4s_v3)
- Enterprise Plan: 8+ vCPUs
If quota insufficient:
- Click "Request quota increase"
- Select VM series (DSv3 or BS)
- Enter new limit (current + needed)
- Provide justification: "Alactic AGI deployment"
- Submit request
- Wait 1-2 business days for approval
SSH Key Pair
Verify you have SSH keys:
Windows (PowerShell):
Test-Path ~/.ssh/id_rsa.pub
# Should return: True
Mac/Linux:
ls ~/.ssh/id_rsa.pub
# Should list the file
If keys don't exist, generate them:
# Windows/Mac/Linux
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
# Press Enter for default location
# Enter passphrase (optional but recommended)
Get public key content:
# Windows
Get-Content ~/.ssh/id_rsa.pub
# Mac/Linux
cat ~/.ssh/id_rsa.pub
Copy the entire output (starts with ssh-rsa).
Deployment Failure: Azure OpenAI Quota Exceeded
Error Message:
Deployment failed with error:
The subscription does not have enough quota for 'StandardGPT4o' in region 'eastus'.
Current usage: 100K TPM, Limit: 100K TPM, Requested: 30K TPM
Cause:
Azure OpenAI enforces per-subscription Tokens Per Minute (TPM) quotas, and your subscription has reached its limit.
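In TPM terms, the failure above means the limit is fully consumed; a quick headroom check using the figures from the sample error (in thousands of tokens per minute):

```shell
# Figures from the sample error above, in K TPM (thousands of tokens/minute).
TPM_LIMIT=100
TPM_USED=100
TPM_REQUESTED=30
TPM_FREE=$((TPM_LIMIT - TPM_USED))
if [ "$TPM_REQUESTED" -le "$TPM_FREE" ]; then
  echo "Deployment fits within quota"
else
  echo "Need at least $((TPM_REQUESTED - TPM_FREE))K more TPM"
fi
```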
Solution 1: Request Quota Increase
- Go to Azure OpenAI Studio
- Click "Quotas" in left sidebar
- Select your subscription and region
- Find model (GPT-4o or GPT-4o mini)
- Click "Request quota increase"
- Enter new TPM limit (recommend 150K for Pro, 300K for Pro+)
- Provide justification: "Alactic AGI document processing workload"
- Submit request
- Wait 1-3 business days for approval
- Retry deployment
Solution 2: Deploy in Different Region
If quota unavailable in your preferred region:
- Check Regional Availability guide
- Select alternative region with capacity
- Retry deployment in new region
Solution 3: Delete Existing Azure OpenAI Deployments
If you have other Azure OpenAI resources using quota:
- Go to Azure Portal
- Search for "Azure OpenAI" resources
- Review each resource's model deployments
- Delete unused model deployments to free quota
- Wait 5-10 minutes for quota to update
- Retry Alactic deployment
Deployment Failure: Resource Provider Not Registered
Error Message:
The subscription is not registered to use namespace 'Microsoft.CognitiveServices'
Cause:
Azure requires resource providers to be registered before use. First-time users may not have this provider registered.
Solution:
- Go to Azure Portal → Subscriptions
- Select your subscription
- Click "Resource providers" in left sidebar
- Search for "Microsoft.CognitiveServices"
- Click "Register"
- Wait 2-5 minutes for registration to complete
- Repeat for these providers if not registered:
- Microsoft.CognitiveServices
- Microsoft.DocumentDB (for Cosmos DB)
- Microsoft.KeyVault
- Microsoft.Storage
- Microsoft.Network
- Microsoft.Compute
- Retry deployment
Or use Azure CLI:
az provider register --namespace Microsoft.CognitiveServices
az provider register --namespace Microsoft.DocumentDB
az provider register --namespace Microsoft.KeyVault
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.Network
az provider register --namespace Microsoft.Compute
# Check registration status
az provider show --namespace Microsoft.CognitiveServices --query "registrationState"
Deployment Stuck: VM Extension Installation Timeout
Symptoms:
- Deployment shows "Running" for 30+ minutes
- Last status: "Installing VM extensions..."
- Eventually times out with error
Cause:
The VM's custom script extension is slow or failing while downloading files or installing packages.
Solution 1: Check Internet Connectivity
- Go to Azure Portal → Resource Groups
- Find your Alactic deployment resource group
- Click on Virtual Machine resource
- Click "Connect" → "SSH"
- Use your SSH private key to connect
- Check internet:
ping -c 4 8.8.8.8
curl -I https://www.microsoft.com
- If no connectivity, check Network Security Group rules
Solution 2: Check Extension Logs
SSH into VM and check logs:
# View extension logs
sudo cat /var/log/azure/custom-script/handler.log
# Check for errors
sudo grep -i error /var/log/azure/custom-script/handler.log
# View deployment script output
sudo cat /var/lib/waagent/custom-script/download/0/stdout
sudo cat /var/lib/waagent/custom-script/download/0/stderr
Common errors and solutions:
Error: "Could not download deployment package"
- Blob storage URL incorrect
- Blob not public
- Solution: Verify blob URL and permissions
Error: "Package installation failed"
- apt-get update failed
- Solution: Run sudo apt-get update manually and check for repository issues
Error: "Python package installation failed"
- PyPI connection timeout
- Solution: Retry deployment, check VM's outbound connectivity
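The failure signatures above can be scanned for in one pass; a minimal sketch (triage_extension_log is a hypothetical helper, and the patterns are taken from the three errors listed):

```shell
# Print the first few log lines matching the known failure signatures.
# Pass the extension log path, e.g. /var/log/azure/custom-script/handler.log
triage_extension_log() {
  grep -inE 'could not download|installation failed|timed? ?out' "$1" | head -n 5
}
```

Usage on the VM: triage_extension_log /var/log/azure/custom-script/handler.log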
Solution 3: Delete and Redeploy
If extension continues failing:
- Go to Azure Portal → Resource Groups
- Find deployment resource group
- Click "Delete resource group"
- Confirm deletion
- Wait 5-10 minutes for cleanup
- Retry deployment from marketplace
Deployment Failure: Invalid SSH Key Format
Error Message:
VM deployment failed: Invalid SSH public key format
Cause:
SSH public key contains invalid characters, wrong format, or line breaks.
Solution 1: Verify Key Format
Valid SSH public key format:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... user@hostname
Should:
- Start with ssh-rsa (or ssh-ed25519)
- Be one continuous line (no line breaks)
- End with a comment (optional)
Should NOT:
- Have -----BEGIN PUBLIC KEY----- headers
- Have line breaks in the middle of the key
- Have special characters outside the Base64 set
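Those rules can be checked mechanically before pasting the key into the deployment form; a minimal sketch (check_pubkey is a hypothetical helper, not part of the product):

```shell
# Sanity-check an OpenSSH public key file: single line, known key type,
# no PEM headers.
check_pubkey() {
  local f="$1"
  if [ "$(wc -l < "$f")" -gt 1 ]; then
    echo "FAIL: key has line breaks"; return 1
  fi
  if grep -q 'BEGIN PUBLIC KEY' "$f"; then
    echo "FAIL: PEM headers present - use the OpenSSH format instead"; return 1
  fi
  case "$(awk '{print $1; exit}' "$f")" in
    ssh-rsa|ssh-ed25519) echo "OK" ;;
    *) echo "FAIL: unexpected key type"; return 1 ;;
  esac
}
```

Usage: check_pubkey ~/.ssh/id_rsa.pub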
Solution 2: Copy Key Correctly
Windows:
Get-Content ~/.ssh/id_rsa.pub | Set-Clipboard
Mac:
pbcopy < ~/.ssh/id_rsa.pub
Linux:
cat ~/.ssh/id_rsa.pub | xclip -selection clipboard
Solution 3: Generate New Key
If key appears corrupted:
# Generate new key pair
ssh-keygen -t rsa -b 4096 -f ~/.ssh/alactic_key
# Copy public key
cat ~/.ssh/alactic_key.pub
Use new public key in deployment form.
Deployment Failure: Model Deployment Timeout
Error Message:
Azure OpenAI model deployment failed: Operation timed out
Cause:
Azure OpenAI Service experiencing high demand, model deployment queued.
Solution 1: Wait and Retry
- Delete failed deployment
- Wait 15-30 minutes
- Retry deployment
- Try during off-peak hours (early morning UTC)
Solution 2: Deploy Smaller Model First
If GPT-4o deployment fails:
- Try deploying GPT-4o mini instead (Free or Pro Plan)
- Check if deployment succeeds
- If successful, issue is GPT-4o capacity
- Contact Azure Support for GPT-4o quota
Solution 3: Use Different Region
Some regions have more capacity:
- Check Regional Availability
- Select region with better capacity (e.g., East US 2 if East US fails)
- Retry deployment
Post-Deployment: Services Not Starting
Symptoms:
- Deployment completed successfully
- Can access VM via SSH
- Dashboard URL doesn't load (connection refused or timeout)
Cause:
Services failed to start after deployment.
Diagnosis:
SSH into VM and check service status:
# Check all services
sudo systemctl status alactic-*
# Check specific services
sudo systemctl status alactic-api
sudo systemctl status alactic-ui
sudo systemctl status alactic-solr
Solution 1: Service Start Failed
If service status shows "failed":
# View service logs
sudo journalctl -u alactic-api -n 50
# Common issues:
# - Port already in use
# - Missing environment variables
# - Python package import errors
# Restart service
sudo systemctl restart alactic-api
Solution 2: Check Environment Variables
# Verify .env file exists
cat ~/.env
# Should contain:
# DEPLOYMENT_KEY=ak-xxxxx
# AZURE_OPENAI_ENDPOINT=https://...
# AZURE_OPENAI_KEY=...
# COSMOS_DB_ENDPOINT=https://...
# COSMOS_DB_KEY=...
If missing, regenerate from Key Vault:
# Get Key Vault name from deployment
KEYVAULT_NAME="alactic-kv-xxxxx"
# Retrieve secrets
az keyvault secret show --vault-name $KEYVAULT_NAME --name deployment-key --query value -o tsv
az keyvault secret show --vault-name $KEYVAULT_NAME --name azure-openai-endpoint --query value -o tsv
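The required variable names can also be checked in one loop; a sketch (the list mirrors the variables shown above, and check_env is a hypothetical helper):

```shell
# Report any required variable missing from an env file.
check_env() {
  local f="$1" var missing=0
  for var in DEPLOYMENT_KEY AZURE_OPENAI_ENDPOINT AZURE_OPENAI_KEY \
             COSMOS_DB_ENDPOINT COSMOS_DB_KEY; do
    grep -q "^${var}=" "$f" || { echo "Missing: $var"; missing=1; }
  done
  return $missing
}
```

Usage: check_env ~/.env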
Solution 3: Firewall/NSG Issues
Check if ports are blocked:
# Check if services listening
sudo netstat -tlnp | grep -E ':(80|443|8983)'
# Should see:
# :80 - nginx
# :443 - nginx
# :8983 - solr
If services running but dashboard unreachable, check Network Security Group:
- Go to Azure Portal → Resource Groups
- Click Network Security Group resource
- Click "Inbound security rules"
- Verify rule exists:
- Name: AllowHTTPS
- Port: 443
- Protocol: TCP
- Action: Allow
- If missing, click "Add" to create rule
Solution 4: Check SSL Certificate
# Verify SSL certificate valid
sudo nginx -t
# Check certificate files
ls -l /etc/ssl/alactic/
# Should see:
# cert.pem
# key.pem
If missing, regenerate:
# Generate self-signed certificate
sudo mkdir -p /etc/ssl/alactic
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout /etc/ssl/alactic/key.pem \
-out /etc/ssl/alactic/cert.pem \
-subj "/CN=alactic-agi"
# Restart nginx
sudo systemctl restart nginx
Post-Deployment: Cannot Login to Dashboard
Symptoms:
- Dashboard loads
- Login page visible
- Enter deployment key
- Error: "Invalid deployment key" or "Authentication failed"
Cause:
Deployment key mismatch or authentication service issue.
Solution 1: Verify Deployment Key
Get correct key from Key Vault:
- Go to Azure Portal → Resource Groups
- Click Key Vault resource (alactic-kv-xxxxx)
- Click "Secrets" in left sidebar
- Click "deployment-key" secret
- Click current version
- Click "Show Secret Value"
- Copy key (format: ak-xxxxx)
- Try logging in again with this key
Or use Azure CLI:
az keyvault secret show \
--vault-name alactic-kv-xxxxx \
--name deployment-key \
--query value -o tsv
Solution 2: Check Key in VM
SSH into VM and verify key:
# Check .env file
grep DEPLOYMENT_KEY ~/.env
# Should show: DEPLOYMENT_KEY=ak-xxxxx
If different from Key Vault, update:
# Get correct key
CORRECT_KEY=$(az keyvault secret show --vault-name alactic-kv-xxxxx --name deployment-key --query value -o tsv)
# Update .env
sudo sed -i "s/DEPLOYMENT_KEY=.*/DEPLOYMENT_KEY=$CORRECT_KEY/" ~/.env
# Restart API service
sudo systemctl restart alactic-api
Solution 3: Check API Service
# Verify API responds
curl -H "X-Deployment-Key: ak-xxxxx" http://localhost:8000/health
# Should return:
# {"status": "healthy"}
If error:
# Check API logs
sudo journalctl -u alactic-api -n 100
# Restart API
sudo systemctl restart alactic-api
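The /health response is JSON, so the status field can be pulled out for scripting; a sketch (health_status is a hypothetical helper, and python3 is assumed to be on the VM, used only for JSON parsing):

```shell
# Extract "status" from a /health JSON body read on stdin.
health_status() {
  python3 -c 'import json,sys; print(json.load(sys.stdin).get("status", "unknown"))'
}
# Example:
# curl -s -H "X-Deployment-Key: ak-xxxxx" http://localhost:8000/health | health_status
echo '{"status": "healthy"}' | health_status
```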
Post-Deployment: High Costs
Symptoms:
- Deployment successful
- Not processing documents yet
- Seeing charges in Azure Cost Management
Cause:
Infrastructure resources running 24/7.
Expected Baseline Costs:
| Plan | Daily Cost | Monthly Cost | Main Components |
|---|---|---|---|
| Free | $2.40 | $72 | VM ($1.70), Storage ($0.50), Cosmos DB ($0.20) |
| Pro | $4.90 | $147 | VM ($3.50), Storage ($1.00), Cosmos DB ($0.40) |
| Pro+ | $9.83 | $295 | VM ($7.50), Storage ($2.00), Cosmos DB ($0.33) |
If costs higher than expected:
Check 1: Verify Plan
- SSH into VM
- Check VM SKU:
curl -H Metadata:true "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-01-01&format=text"
# Should return:
# Free: Standard_B2s
# Pro: Standard_D2s_v3
# Pro+: Standard_D4s_v3
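The returned size maps back to a plan via the table above; a sketch (VMSIZE is a placeholder — substitute the value the metadata call returned):

```shell
# VMSIZE is a placeholder - substitute the value from the metadata call.
VMSIZE="Standard_D2s_v3"
case "$VMSIZE" in
  Standard_B2s)    echo "Free plan VM" ;;
  Standard_D2s_v3) echo "Pro plan VM" ;;
  Standard_D4s_v3) echo "Pro+ plan VM" ;;
  *)               echo "Unexpected size: $VMSIZE - check your plan selection" ;;
esac
```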
Check 2: Review Azure OpenAI Usage
- Go to Azure Portal → Resource Groups
- Click Azure OpenAI resource
- Click "Metrics" in left sidebar
- Select metric: "Processed Inference Tokens"
- Check if usage unexpected
If seeing token usage without processing documents:
- May have health checks consuming tokens
- Check API logs for unexpected calls
Check 3: Stop/Deallocate VM When Not in Use
For development/testing, stop VM when idle:
# Stop VM (saves compute costs, keeps storage)
az vm deallocate --resource-group <your-rg> --name <vm-name>
# Start VM when needed
az vm start --resource-group <your-rg> --name <vm-name>
Savings: Eliminates VM compute costs (~$1.70-7.50/day)
Downside: Dashboard inaccessible while stopped
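Assuming the VM cost scales linearly with hours running, the saving from a part-time schedule can be estimated (VM_DAILY comes from the baseline cost table above; the 30-day month is an approximation):

```shell
VM_DAILY=3.50      # Pro plan VM cost per day, from the table above
HOURS_STOPPED=12   # hours per day the VM is deallocated
awk -v d="$VM_DAILY" -v h="$HOURS_STOPPED" \
  'BEGIN { printf "Estimated monthly savings: $%.2f\n", d * h / 24 * 30 }'
```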
Getting Help
If issues persist after troubleshooting:
Check Deployment Logs:
- Go to Azure Portal → Resource Groups
- Click "Deployments" in left sidebar
- Click your deployment
- Review operation details and error messages
Contact Support:
- Email: support@alacticai.com
- Include: Deployment ID, error message, troubleshooting steps attempted
- Enterprise plans: 24-hour response SLA
Community Help:
- Support Portal: alactic.io/support
- Community Forum: community.alactic.ai
Azure Support: For Azure-specific issues (quotas, regional capacity):
- Azure Support Portal
- Azure Support Plans required for ticket submission
Prevention: Pre-Deployment Checklist
Use this checklist before every deployment:
- Azure OpenAI Service access approved
- Sufficient vCPU quota in target region
- SSH key pair generated and tested
- All required resource providers registered
- Selected region supports all required services
- Verified subscription has sufficient credits/budget
- Documented deployment key storage location
- Reviewed Regional Availability guide
Following this checklist prevents 90% of deployment failures.