AI-Specific Troubleshooting Guide


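This troubleshooting guide covers common problems encountered when deploying AI solutions with AZD, and provides fixes and debugging techniques specific to Azure AI services.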


Azure OpenAI Service Issues

Issue: OpenAI Service Unavailable in Region

Symptoms:

Error: The requested resource type is not available in the location 'westus'

Causes:

  • Azure OpenAI is not offered in the selected region
  • Quota in the preferred region is exhausted
  • The region has capacity constraints

Solutions:

  1. Check regional availability:

    # List the regions where OpenAI SKUs are available
    az cognitiveservices account list-skus \
      --kind OpenAI \
      --query "[].locations[]" \
      --output table

  2. Update the AZD configuration:

    # azure.yaml - force a specific region
    # (the region can also be set per environment: azd env set AZURE_LOCATION eastus2)
    infra:
      provider: bicep
      path: infra
      module: main
      parameters:
        location: "eastus2"  # Known working region

  3. Use an alternative region:

    // infra/main.bicep - multi-region fallback
    @allowed([
      'eastus2'
      'francecentral'
      'canadaeast'
      'swedencentral'
    ])
    param openAiLocation string = 'eastus2'
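
The region choice can also be made in a script before provisioning. The following is a minimal sketch, assuming you already have the list of regions returned by the availability check; `pick_region` and `PREFERRED_REGIONS` are hypothetical names used for illustration and are not part of AZD.

```python
from typing import Iterable, Optional

# Preference order mirrors the @allowed list in infra/main.bicep.
PREFERRED_REGIONS = ["eastus2", "francecentral", "canadaeast", "swedencentral"]


def pick_region(
    available: Iterable[str],
    preferred: Iterable[str] = PREFERRED_REGIONS,
) -> Optional[str]:
    """Return the first preferred region that is available, or None."""
    # Normalize display names such as "East US 2" or "EASTUS2" to "eastus2".
    normalized = {region.replace(" ", "").lower() for region in available}
    for region in preferred:
        if region in normalized:
            return region
    return None
```

The result could then be passed to `azd env set AZURE_LOCATION <region>` before running `azd up`.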

Issue: Model Deployment Quota Exceeded

Symptoms:

Error: Deployment failed due to insufficient quota

Solutions:

  1. Check current quota:

    # Check quota usage for the account
    az cognitiveservices account list-usage \
      --name YOUR_OPENAI_RESOURCE \
      --resource-group YOUR_RG

  2. Request a quota increase:

    # Submit a quota increase request (replace the placeholder GUIDs with real values)
    az support tickets create \
      --ticket-name "OpenAI Quota Increase" \
      --description "Need increased quota for production deployment" \
      --severity "minimal" \
      --problem-classification "/providers/Microsoft.Support/services/quota_service_guid/problemClassifications/quota_service_problemClassification_guid"

  3. Reduce model capacity:

    // Reduce initial capacity
    resource deployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = {
      parent: openAi      // The Azure OpenAI account resource
      name: 'chat-model'
      properties: {
        model: {
          format: 'OpenAI'
          name: 'gpt-4o-mini'
          version: '2024-07-18'
        }
      }
      sku: {
        name: 'Standard'
        capacity: 1  // Start with minimal capacity
      }
    }

Issue: Invalid API Version

Symptoms:

Error: The API version '2023-05-15' is not available for OpenAI

Solutions:

  1. Use a supported API version:

    # Use a version supported by your resource
    AZURE_OPENAI_API_VERSION = "2024-02-15-preview"

  2. Check API version compatibility:

    # List supported API versions
    az rest --method get \
      --url "https://management.azure.com/providers/Microsoft.CognitiveServices/operations?api-version=2023-05-01" \
      --query "value[?name.value=='Microsoft.CognitiveServices/accounts/read'].properties.serviceSpecification.metricSpecifications[].supportedApiVersions[]"
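
A malformed version string in configuration is one way to end up with an API version error. The sketch below only checks the format (`YYYY-MM-DD` with an optional `-preview` suffix) so that typos fail fast at startup; it does not confirm that the service supports the version. `parse_api_version` is a hypothetical helper, not part of any Azure SDK.

```python
import re
from datetime import date
from typing import Tuple

# Matches versions such as "2023-05-15" or "2024-02-15-preview".
_API_VERSION_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})(-preview)?$")


def parse_api_version(version: str) -> Tuple[date, bool]:
    """Return (release date, is_preview); raise ValueError if malformed."""
    match = _API_VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"Malformed API version: {version!r}")
    year, month, day, preview = match.groups()
    # date() itself raises ValueError for impossible dates such as month 13.
    return date(int(year), int(month), int(day)), preview is not None
```

Calling this once on `AZURE_OPENAI_API_VERSION` when the application starts surfaces a typo before the first request is sent.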

Azure AI Search Issues

Issue: Search Service Pricing Tier Insufficient

Symptoms:

Error: Semantic search requires Basic tier or higher

Solutions:

  1. Upgrade the pricing tier:

    // infra/main.bicep - use the Basic tier
    resource searchService 'Microsoft.Search/searchServices@2023-11-01' = {
      name: searchServiceName
      location: location
      sku: {
        name: 'basic'  // Minimum for semantic search
      }
      properties: {
        replicaCount: 1
        partitionCount: 1
        hostingMode: 'default'
        semanticSearch: 'standard'
      }
    }

  2. Disable semantic search (development):

    // For development environments
    resource searchService 'Microsoft.Search/searchServices@2023-11-01' = {
      name: searchServiceName
      location: location
      sku: {
        name: 'free'
      }
      properties: {
        semanticSearch: 'disabled'
      }
    }

Issue: Index Creation Fails

Symptoms:

Error: Cannot create index, insufficient permissions

Solutions:

  1. Verify the search service keys:

    # Get search service admin key
    az search admin-key show \
      --service-name YOUR_SEARCH_SERVICE \
      --resource-group YOUR_RG

  2. Check the index schema:

    # Validate index schema
    from azure.search.documents.indexes import SearchIndexClient
    from azure.search.documents.indexes.models import SearchIndex

    def validate_index_schema(index_definition: SearchIndex) -> None:
        """Validate index schema before creation."""
        required_fields = ['id', 'content']
        field_names = [field.name for field in index_definition.fields]

        for required in required_fields:
            if required not in field_names:
                raise ValueError(f"Missing required field: {required}")

  3. Use a managed identity:

    // Grant search permissions to managed identity
    // (searchIndexDataContributorRole is the role definition variable declared
    // alongside the other role assignments)
    resource searchContributor 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
      scope: searchService
      name: guid(searchService.id, containerApp.id, searchIndexDataContributorRole)
      properties: {
        principalId: containerApp.identity.principalId
        roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '8ebe5a00-799e-43f5-93ac-243d3dce84a7')
        principalType: 'ServicePrincipal'
      }
    }

Container Apps Deployment Issues

Issue: Container Build Fails

Symptoms:

Error: Failed to build container image

Solutions:

  1. Check the Dockerfile syntax:

    # Dockerfile - Python AI app example
    FROM python:3.11-slim

    WORKDIR /app

    # Install system dependencies
    RUN apt-get update && apt-get install -y \
        gcc \
        && rm -rf /var/lib/apt/lists/*

    # Copy requirements first for better caching
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    COPY . .

    EXPOSE 8000
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

  2. Verify dependencies:

    # requirements.txt - pin versions for stability
    fastapi==0.104.1
    uvicorn==0.24.0
    openai==1.3.7
    azure-identity==1.14.1
    azure-keyvault-secrets==4.7.0
    azure-search-documents==11.4.0
    azure-cosmos==4.5.1

  3. Add a health check:

    # main.py - add health check endpoint
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/health")
    async def health_check():
        return {"status": "healthy"}

Issue: Container App Fails to Start

Symptoms:

Error: Container failed to start within timeout period

Solutions:

  1. Increase the startup timeout:

    resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
      properties: {
        template: {
          containers: [
            {
              name: 'main'
              image: containerImage
              resources: {
                cpu: json('0.5')
                memory: '1Gi'
              }
              probes: [
                {
                  type: 'Startup'
                  httpGet: {
                    path: '/health'
                    port: 8000
                  }
                  initialDelaySeconds: 30
                  periodSeconds: 10
                  timeoutSeconds: 5
                  failureThreshold: 10  // Allow more time for AI models to load
                }
              ]
            }
          ]
        }
      }
    }

  2. Optimize model loading:

    # Lazy-load models to reduce startup time
    import asyncio
    from contextlib import asynccontextmanager

    from fastapi import FastAPI

    class ModelManager:
        def __init__(self):
            self._client = None
            self._lock = asyncio.Lock()  # Avoid initializing twice under concurrent requests

        async def get_client(self):
            if self._client is None:
                async with self._lock:
                    if self._client is None:
                        self._client = await self._initialize_client()
            return self._client

        async def _initialize_client(self):
            # Initialize AI client here
            pass

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Startup: create the manager, but defer client creation to first use
        app.state.model_manager = ModelManager()
        yield
        # Shutdown: release resources here if needed

    app = FastAPI(lifespan=lifespan)

Authentication and Permission Errors

Issue: Managed Identity Permission Denied

Symptoms:

Error: Authentication failed for Azure OpenAI Service

Solutions:

  1. Verify role assignments:

    # Check current role assignments
    az role assignment list \
      --assignee YOUR_MANAGED_IDENTITY_ID \
      --scope /subscriptions/YOUR_SUBSCRIPTION/resourceGroups/YOUR_RG

  2. Assign the required roles:

    // Required role assignments for AI services
    var cognitiveServicesOpenAIUserRole = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '5e0bd9bd-7b93-4f28-af87-19fc36ad61bd')
    var searchIndexDataContributorRole = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '8ebe5a00-799e-43f5-93ac-243d3dce84a7')

    resource openAiRoleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
      scope: openAi
      name: guid(openAi.id, containerApp.id, cognitiveServicesOpenAIUserRole)
      properties: {
        principalId: containerApp.identity.principalId
        roleDefinitionId: cognitiveServicesOpenAIUserRole
        principalType: 'ServicePrincipal'
      }
    }

  3. Test authentication:

    # Test managed identity authentication
    import asyncio

    from azure.core.exceptions import ClientAuthenticationError
    from azure.identity.aio import DefaultAzureCredential  # Async credential, required for await

    async def test_authentication():
        try:
            async with DefaultAzureCredential() as credential:
                token = await credential.get_token("https://cognitiveservices.azure.com/.default")
            print(f"Authentication successful: {token.token[:10]}...")
        except ClientAuthenticationError as e:
            print(f"Authentication failed: {e}")

    asyncio.run(test_authentication())

Issue: Key Vault Access Denied

Symptoms:

Error: The user, group or application does not have secrets get permission

Solutions:

  1. Grant Key Vault permissions:

    resource keyVaultAccessPolicy 'Microsoft.KeyVault/vaults/accessPolicies@2023-07-01' = {
      parent: keyVault
      name: 'add'
      properties: {
        accessPolicies: [
          {
            tenantId: subscription().tenantId
            objectId: containerApp.identity.principalId
            permissions: {
              secrets: ['get', 'list']
            }
          }
        ]
      }
    }

  2. Use RBAC instead of access policies:

    resource keyVaultSecretsUserRole 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
      scope: keyVault
      name: guid(keyVault.id, containerApp.id, 'Key Vault Secrets User')
      properties: {
        principalId: containerApp.identity.principalId
        roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '4633458b-17de-408a-b874-0445c86b69e6')
        principalType: 'ServicePrincipal'
      }
    }

Model Deployment Failures

Issue: Model Version Not Available

Symptoms:

Error: Model version 'gpt-4-32k' is not available

Solutions:

  1. Check available models:

    # List available models
    az cognitiveservices account list-models \
      --name YOUR_OPENAI_RESOURCE \
      --resource-group YOUR_RG \
      --query "[].{name:name, version:version}" \
      --output table

  2. Define a fallback model:

    // Model deployment with a fallback configuration
    @description('Primary model configuration')
    param primaryModel object = {
      format: 'OpenAI'
      name: 'gpt-4o-mini'
      version: '2024-07-18'
    }

    @description('Fallback model configuration')
    param fallbackModel object = {
      format: 'OpenAI'
      name: 'gpt-35-turbo'
      version: '0125'
    }

    // Deploy the primary model. Bicep does not fall back automatically:
    // if the primary is unavailable, redeploy with model: fallbackModel.
    resource primaryDeployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = {
      parent: openAi
      name: 'chat-model'
      properties: {
        model: primaryModel
      }
      sku: {
        name: 'Standard'
        capacity: 10
      }
    }

  3. Validate the model before deployment:

    # Pre-deployment model validation
    import os

    import httpx

    AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
    AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
    AZURE_OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")

    async def validate_model_availability(model_name: str, version: str) -> bool:
        """Check if model is available before deployment."""
        try:
            async with httpx.AsyncClient() as client:
                response = await client.get(
                    f"{AZURE_OPENAI_ENDPOINT}/openai/models",
                    params={"api-version": AZURE_OPENAI_API_VERSION},
                    headers={"api-key": AZURE_OPENAI_API_KEY},
                )
                response.raise_for_status()
                models = response.json()
                return any(
                    model["id"] == f"{model_name}-{version}"
                    for model in models.get("data", [])
                )
        except (httpx.HTTPError, ValueError, KeyError):
            return False

Performance and Scaling Issues

Issue: High-Latency Responses

Symptoms:

  • Response times > 30 seconds
  • Timeout errors
  • Poor user experience

Solutions:

  1. Implement request timeouts:

    # Configure proper timeouts
    import httpx

    client = httpx.AsyncClient(
        timeout=httpx.Timeout(
            connect=5.0,
            read=30.0,
            write=10.0,
            pool=10.0
        )
    )

  2. Add response caching:

    # Redis cache for responses
    import redis.asyncio as redis
    import json

    class ResponseCache:
        def __init__(self, redis_url: str):
            self.redis = redis.from_url(redis_url)

        async def get_cached_response(self, query_hash: str) -> str | None:
            """Get cached response if available."""
            cached = await self.redis.get(f"ai_response:{query_hash}")
            return cached.decode() if cached else None

        async def cache_response(self, query_hash: str, response: str, ttl: int = 3600):
            """Cache AI response with TTL."""
            await self.redis.setex(f"ai_response:{query_hash}", ttl, response)

  3. Configure autoscaling:

    resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
      properties: {
        template: {
          scale: {
            minReplicas: 2
            maxReplicas: 20
            rules: [
              {
                name: 'http-requests'
                http: {
                  metadata: {
                    concurrentRequests: '5'  // Scale aggressively for AI workloads
                  }
                }
              }
              {
                name: 'cpu-utilization'
                custom: {
                  type: 'cpu'
                  metadata: {
                    type: 'Utilization'
                    value: '60'  // Lower threshold for AI apps
                  }
                }
              }
            ]
          }
        }
      }
    }
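
The response cache above is keyed by a `query_hash`, but the guide does not show how that key is derived. One possible approach, sketched here under the assumption that a request is fully described by its model name, prompt, and sampling parameters, is to hash a canonical JSON form of the request. `make_query_hash` is a hypothetical helper.

```python
import hashlib
import json


def make_query_hash(model: str, prompt: str, **params) -> str:
    """Build a stable cache key from the model, prompt, and parameters."""
    payload = json.dumps(
        {
            "model": model,
            # Collapse whitespace so trivially different prompts share a key.
            "prompt": " ".join(prompt.split()),
            "params": params,
        },
        sort_keys=True,       # Key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Caching in this way is mainly useful for requests that are expected to give repeatable answers, for example those sent with a temperature of 0.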

Issue: Out-of-Memory Errors

Symptoms:

Error: Container killed due to memory limit exceeded

Solutions:

  1. Increase the memory allocation:

    resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
      properties: {
        template: {
          containers: [
            {
              name: 'main'
              resources: {
                cpu: json('1.0')
                memory: '2Gi'  // Increase for AI workloads
              }
            }
          ]
        }
      }
    }

  2. Optimize memory usage:

    # Memory-efficient model handling
    import gc
    import psutil

    class MemoryOptimizedAI:
        def __init__(self):
            self.max_memory_percent = 80

        async def process_request(self, request):
            """Process request with memory monitoring."""
            # Check memory usage before processing
            memory_percent = psutil.virtual_memory().percent
            if memory_percent > self.max_memory_percent:
                gc.collect()  # Force garbage collection

            result = await self._process_ai_request(request)

            # Clean up after processing
            gc.collect()
            return result

        async def _process_ai_request(self, request):
            # Call the AI service here
            raise NotImplementedError

Cost and Quota Management

Issue: Unexpectedly High Costs

Symptoms:

  • Azure bill is higher than expected
  • Token usage exceeds estimates
  • Budget alerts are triggered

Solutions:

  1. Implement cost controls:

    # Token usage tracking
    class TokenTracker:
        def __init__(self, monthly_limit: int = 100000):
            self.monthly_limit = monthly_limit
            self.current_usage = 0

        async def track_usage(self, prompt_tokens: int, completion_tokens: int):
            """Track token usage with limits."""
            total_tokens = prompt_tokens + completion_tokens
            self.current_usage += total_tokens

            if self.current_usage > self.monthly_limit:
                raise RuntimeError("Monthly token limit exceeded")

            return total_tokens

  2. Set up cost alerts:

    resource budgetAlert 'Microsoft.Consumption/budgets@2023-05-01' = {
      name: 'ai-workload-budget'
      properties: {
        timePeriod: {
          startDate: '2024-01-01'
          endDate: '2024-12-31'
        }
        timeGrain: 'Monthly'
        amount: 500  // $500 monthly limit
        category: 'Cost'
        notifications: {
          Actual_GreaterThan_80_Percent: {
            enabled: true
            operator: 'GreaterThan'
            threshold: 80
            contactEmails: ['admin@company.com']
            contactRoles: ['Owner']
          }
        }
      }
    }

  3. Optimize model selection:

    # Cost-aware model selection
    MODEL_COSTS = {
        'gpt-4o-mini': 0.00015,  # per 1K tokens
        'gpt-4': 0.03,           # per 1K tokens
        'gpt-35-turbo': 0.0015   # per 1K tokens
    }

    def select_model_by_cost(complexity: str, budget_remaining: float) -> str:
        """Select model based on complexity and budget."""
        if complexity == 'simple' or budget_remaining < 10:
            return 'gpt-4o-mini'
        elif complexity == 'medium':
            return 'gpt-35-turbo'
        else:
            return 'gpt-4'
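
To connect token counts to the per-1K rates in the model selection step, the following sketch estimates the cost of a single request. It assumes one flat rate per model for both prompt and completion tokens, which is a simplification, and the rates are the illustrative figures from this guide rather than current prices. `estimate_cost` is a hypothetical helper.

```python
# Illustrative per-1K-token rates copied from this guide; check current pricing.
MODEL_COSTS = {
    "gpt-4o-mini": 0.00015,
    "gpt-4": 0.03,
    "gpt-35-turbo": 0.0015,
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request in dollars at a flat per-1K rate."""
    if model not in MODEL_COSTS:
        raise ValueError(f"No cost configured for model: {model}")
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * MODEL_COSTS[model]
```

Summing these estimates per request gives a running figure that can be compared against the budget passed to `select_model_by_cost`.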

Debugging Tools and Techniques

AZD Debugging Commands

    # Enable verbose logging
    azd up --debug

    # Check deployment status
    azd show

    # View application logs
    azd monitor --logs

    # Check environment variables
    azd env get-values

Application Debugging

  1. Structured logging:

    import logging
    import json

    # Configure structured logging for AI applications
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

    logger = logging.getLogger(__name__)

    def log_ai_request(model: str, tokens: int, latency: float, success: bool):
        """Log AI request details."""
        logger.info(json.dumps({
            'event': 'ai_request',
            'model': model,
            'tokens': tokens,
            'latency_ms': latency,
            'success': success
        }))

  2. Health check endpoint:

    # Assumes the AZURE_* settings and the FastAPI `app` are defined elsewhere
    from azure.identity.aio import DefaultAzureCredential
    from azure.search.documents.indexes.aio import SearchIndexClient
    from openai import AsyncAzureOpenAI

    @app.get("/debug/health")
    async def detailed_health_check():
        """Comprehensive health check for debugging."""
        checks = {}

        # Check OpenAI connectivity
        try:
            client = AsyncAzureOpenAI(
                azure_endpoint=AZURE_OPENAI_ENDPOINT,
                api_key=AZURE_OPENAI_API_KEY,
                api_version=AZURE_OPENAI_API_VERSION,
            )
            await client.models.list()
            checks['openai'] = {'status': 'healthy'}
        except Exception as e:
            checks['openai'] = {'status': 'unhealthy', 'error': str(e)}

        # Check Search service
        try:
            async with DefaultAzureCredential() as credential:
                async with SearchIndexClient(
                    endpoint=AZURE_SEARCH_ENDPOINT,
                    credential=credential,
                ) as search_client:
                    indexes = [name async for name in search_client.list_index_names()]
            checks['search'] = {'status': 'healthy', 'indexes': indexes}
        except Exception as e:
            checks['search'] = {'status': 'unhealthy', 'error': str(e)}

        return checks

  3. Performance monitoring:

    import json
    import logging
    import time
    from functools import wraps

    logger = logging.getLogger(__name__)

    def monitor_performance(func):
        """Decorator to monitor async function performance."""
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            success = False
            try:
                result = await func(*args, **kwargs)
                success = True
                return result
            finally:
                # Runs on both success and failure; exceptions still propagate
                latency = (time.perf_counter() - start_time) * 1000
                logger.info(json.dumps({
                    'function': func.__name__,
                    'latency_ms': latency,
                    'success': success
                }))
        return wrapper

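Common Error Codes and Solutions

| Error Code | Description           | Solution                                              |
|------------|-----------------------|-------------------------------------------------------|
| 401        | Unauthorized          | Check API keys and managed identity configuration     |
| 403        | Forbidden             | Verify RBAC role assignments                          |
| 429        | Rate limited          | Implement retry logic with exponential backoff        |
| 500        | Internal server error | Check model deployment status and logs                |
| 503        | Service unavailable   | Verify service health and regional availability       |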
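
The 429 entry recommends retrying with exponential backoff. A minimal sketch of that pattern follows, using only the standard library. `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises on HTTP 429, and `retry_with_backoff` is an illustrative helper rather than an existing API.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the HTTP 429 error raised by a client library."""


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
) -> T:
    """Retry `call` on RateLimitError, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter keeps many clients from retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

A call site might look like `await retry_with_backoff(lambda: client.chat.completions.create(...))`, with the `except` clause adapted to the real exception type. The `openai` Python client also accepts a `max_retries` option, which may be sufficient on its own.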

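Next Steps

  1. Review the AI Model Deployment Guide for deployment best practices
  2. Complete Production AI Practices for enterprise-ready solutions
  3. Join the Microsoft Foundry Discord for community support
  4. Open an issue in the AZD GitHub repository for AZD-specific problems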


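Disclaimer
This document was translated using the AI translation service Co-op Translator. While we strive for accuracy, automated translations may contain errors or inaccuracies. The document in its original language should be considered the authoritative source. For critical information, professional human translation is recommended. We accept no liability for misunderstandings or misinterpretations arising from the use of this translation.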

