Overview
The Portkey Enterprise Gateway exposes detailed telemetry data through Prometheus metrics, enabling comprehensive observability for LLM gateway operations. These metrics cover the entire request lifecycle from authentication through response delivery, including cost tracking, performance monitoring, and cache analytics. This monitoring capability is essential for:
- Performance Optimization: Identify bottlenecks and optimize gateway performance
- Cost Management: Track and analyze LLM usage costs in real-time
- Capacity Planning: Understand traffic patterns and scaling requirements
- SLA Monitoring: Ensure service level agreements are met
- Security Monitoring: Track authentication and authorization performance
Metrics Endpoint
Endpoint: `/metrics`
Method: GET
Content-Type: `text/plain; version=0.0.4; charset=utf-8`
Authentication: Typically open (check your deployment configuration)
Ensure your Prometheus server can access the `/metrics` endpoint. In production deployments, consider securing this endpoint or restricting access to monitoring infrastructure only.
Global Configuration
Default Labels
All custom metrics are automatically labeled with:
- `app`: Service identifier (from the `SERVICE_NAME` environment variable)
- `env`: Deployment environment (from the `NODE_ENV` environment variable)
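As a quick illustration, these default labels can be used to scope any query to a single deployment. The following is a minimal PromQL sketch; the `app` and `env` values are placeholders, and `request_count` is the gateway request counter documented below.

```promql
# Request rate for one gateway deployment in production (placeholder label values)
sum(rate(request_count{app="portkey-gateway", env="production"}[5m]))
```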
Standard Node.js Runtime Metrics
The gateway automatically exposes Node.js runtime metrics with the `node_` prefix, using the `prom-client` default collectors:
Process Metrics
- CPU Usage: `node_process_cpu_user_seconds_total`, `node_process_cpu_system_seconds_total`
- Memory: `node_process_resident_memory_bytes`, `node_process_heap_bytes`
- File Descriptors: `node_process_open_fds`, `node_process_max_fds`
- Process Uptime: `node_process_start_time_seconds`
Event Loop Metrics
- Event Loop Lag: `node_eventloop_lag_seconds` (critical for detecting Node.js performance issues)
- Event Loop Utilization: Tracks how busy the event loop is
Garbage Collection Metrics
- GC Duration: `node_gc_duration_seconds` with custom buckets optimized for LLM gateway workloads
- Custom GC Buckets: `[0.001, 0.01, 0.1, 1, 1.5, 2, 3, 5, 7, 10, 15, 20, 30, 45, 60, 90, 120, 240, 500, 1000, 6000]` (seconds)
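For instance, these runtime metrics can be used to watch for event-loop saturation and long GC pauses. A minimal PromQL sketch, assuming `node_eventloop_lag_seconds` is exported as a gauge and `node_gc_duration_seconds` as a histogram with the standard `_bucket` series:

```promql
# Average event loop lag per instance over the last 5 minutes
avg_over_time(node_eventloop_lag_seconds[5m])

# P95 garbage collection pause duration (seconds)
histogram_quantile(0.95, sum by (le) (rate(node_gc_duration_seconds_bucket[5m])))

# Resident memory per gateway instance
node_process_resident_memory_bytes
```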
Custom Application Metrics
The Portkey Enterprise Gateway exposes 14 custom metrics designed to provide deep visibility into LLM gateway operations, performance characteristics, and business metrics.
Universal Label Schema
All custom metrics share a common labeling schema enabling multi-dimensional analysis:
- `method`: HTTP verb (GET, POST, PUT, DELETE, etc.)
- `endpoint`: Normalized API endpoint path (e.g., `/v1/chat/completions`, `/v1/completions`)
- `code`: HTTP response status code (200, 400, 500, etc.)
- `provider`: LLM provider identifier (openai, anthropic, azure-openai, etc.)
- `model`: Specific model name (gpt-4, claude-3, etc.)
- `source`: Request origination source or client identifier
- `stream`: Boolean indicator for streaming responses (“true”/“false”)
- `cacheStatus`: Cache interaction result (“hit”, “miss”, “disabled”, “error”)
- `metadata_*`: Dynamic labels from request metadata (see the configuration section)
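For example, this shared schema lets any custom metric be sliced along these dimensions. A PromQL sketch using the request counter documented below:

```promql
# Request rate broken down by provider and model
sum by (provider, model) (rate(request_count[5m]))

# Streaming vs. non-streaming traffic
sum by (stream) (rate(request_count[5m]))

# Requests served from cache, per endpoint
sum by (endpoint) (rate(request_count{cacheStatus="hit"}[5m]))
```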
1. Gateway Request Counter
Metric Name: `request_count`
Type: Counter
Unit: Requests
Technical Description: Monotonic counter tracking every HTTP request processed by the gateway. This is the primary metric for understanding traffic volume, request patterns, and success/failure rates across different dimensions. A query sketch follows the use-case list below.
Use Cases:
- Traffic Analysis: Monitor request volume trends and identify peak usage periods
- Error Rate Monitoring: Calculate error rates by dividing 4xx/5xx responses by total requests
- Provider Distribution: Understand which LLM providers are most heavily utilized
- Model Popularity: Track adoption of different AI models across your organization
- Cache Effectiveness: Monitor cache hit rates to optimize performance and costs
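A sketch of the error-rate and model-popularity queries described above (PromQL; windows and status-code patterns are illustrative):

```promql
# 4xx/5xx error rate over the last 5 minutes
sum(rate(request_count{code=~"4..|5.."}[5m])) / sum(rate(request_count[5m]))

# Model adoption: requests per model over the last day
sort_desc(sum by (model) (increase(request_count[1d])))
```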
2. HTTP Request Duration Distribution
Metric Name: `http_request_duration_seconds`
Type: Histogram
Unit: Seconds
Technical Description: Measures the complete HTTP request-response cycle duration from the gateway’s perspective. This includes all processing time: authentication, middleware execution, provider communication, response processing, and network transmission back to the client. A percentile query sketch follows the bullets below.
Bucket Configuration: `[0.1, 1, 1.5, 2, 3, 5, 7, 10, 15, 20, 30, 45, 60, 90, 120, 240, 500, 1000, 3000]`
- Optimized for typical LLM response times ranging from sub-second to several minutes
- Enables percentile analysis for SLA monitoring
- SLA Monitoring: Track P95/P99 response times against service level agreements
- Performance Regression Detection: Identify when response times degrade
- Endpoint Performance Analysis: Compare performance across different API endpoints
- Client-Side Latency Tracking: Full end-to-end timing from client perspective
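A per-endpoint percentile sketch (PromQL, assuming the standard Prometheus `_bucket` series for this histogram):

```promql
# P95 end-to-end latency in seconds, per endpoint
histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))
```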
3. LLM Provider Request Duration
Metric Name: `llm_request_duration_milliseconds`
Type: Histogram
Unit: Milliseconds
Technical Description: Measures the duration of actual requests sent to LLM providers, excluding gateway processing overhead. This metric isolates provider performance from internal gateway operations, enabling precise provider SLA monitoring and performance comparison. A provider-comparison query sketch follows the bullets below.
Bucket Configuration: `[0.1, 1, 2, 5, 10, 30, 50, 75, 100, 150, 200, 350, 500, 1000, 2500, 5000, 10000, 50000, 100000, 300000, 500000, 10000000]`
- High-resolution buckets for millisecond-precision analysis
- Extended range to handle very long-running requests (up to ~2.7 hours)
- Provider Performance Comparison: Benchmark response times across different LLM providers
- Model Performance Analysis: Compare latency characteristics of different models
- Provider SLA Monitoring: Track provider performance against contractual agreements
- Capacity Planning: Understand provider response time distributions for scaling decisions
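A provider comparison sketch in PromQL:

```promql
# P95 provider latency in milliseconds, compared across providers and models
histogram_quantile(0.95, sum by (le, provider, model) (rate(llm_request_duration_milliseconds_bucket[5m])))
```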
4. Portkey Processing Time (Excluding Streaming Latency)
Metric Name: `portkey_processing_time_excluding_last_byte_ms`
Type: Histogram
Unit: Milliseconds
Technical Description: Measures Portkey’s internal processing time excluding the time spent waiting for the final byte of streamed responses from LLM providers. This metric isolates the gateway’s computational overhead from provider streaming characteristics, enabling precise performance optimization of internal operations.
Key Insights:
- Excludes network latency and provider-side streaming delays
- Includes authentication, request transformation, middleware execution, and initial response processing
- Critical for identifying gateway performance bottlenecks vs. provider latency issues
- Gateway Performance Optimization: Identify internal processing bottlenecks
- Middleware Performance Analysis: Measure impact of authentication and transformation logic
- Scaling Decisions: Understand processing capacity independent of provider performance
- Performance Baseline Establishment: Set internal SLAs for gateway operations
5. LLM Last Byte Latency Analysis
Metric Name: `llm_last_byte_diff_duration_milliseconds`
Type: Histogram
Unit: Milliseconds
Technical Description: Captures the time difference between receiving the first response data and the final byte from LLM providers. This metric is essential for understanding streaming performance characteristics and time-to-first-token vs. total completion time patterns across different providers and models.
Streaming Analysis Value:
- Time-to-First-Token: Indirectly measurable by comparing with total request duration
- Streaming Efficiency: Identifies providers with consistent vs. bursty streaming patterns
- Model Behavior Analysis: Different models exhibit different streaming characteristics
- User Experience Optimization: Understand perceived responsiveness for streaming applications
- Provider Streaming Comparison: Compare streaming performance across providers
- Model Selection: Choose models based on streaming vs. batch completion preferences
- Client Application Optimization: Inform client-side timeout and buffering strategies
6. Total Portkey Request Duration
Metric Name: `portkey_request_duration_milliseconds`
Type: Histogram
Unit: Milliseconds
Technical Description: Comprehensive timing metric measuring the complete duration of Portkey request processing from initial request receipt to final response transmission. This provides the most complete view of gateway performance and serves as the authoritative metric for end-to-end processing analysis.
Scope Includes:
- Authentication and authorization processing
- Request validation and transformation
- Provider selection and routing logic
- LLM provider communication (complete)
- Response processing and transformation
- Cache operations (read/write)
- Post-processing hooks and analytics
- Comprehensive Performance Monitoring: Single metric for overall gateway health
- Capacity Planning: Understand total processing requirements for scaling
- Performance Baseline: Primary metric for SLA establishment and monitoring
- Troubleshooting: First metric to check during performance investigations
7. LLM Cost Accumulator
Metric Name: `llm_cost_sum`
Type: Gauge
Unit: Currency Units (USD)
Technical Description: Real-time accumulator tracking the total monetary cost of LLM API usage across all providers and models. This gauge provides immediate visibility into cost burn rates and enables fine-grained cost analysis across multiple dimensions, including users, applications, models, and providers. A cost query sketch follows the bullets below.
Cost Calculation Features:
- Multi-Provider Support: Normalized cost tracking across different provider pricing models
- Token-Based Accuracy: Precise cost calculation based on actual token consumption
- Real-Time Updates: Immediate cost visibility for budget monitoring
- Dimensional Analysis: Cost breakdown by any label dimension
- Budget Monitoring: Real-time tracking against spending limits
- Cost Attribution: Identify highest-cost users, applications, or use cases
- ROI Analysis: Measure cost efficiency across different models and providers
- Chargeback/Showback: Accurate cost allocation for internal billing
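A cost attribution sketch (PromQL; because `llm_cost_sum` is a gauge-style accumulator, `delta()` is used for windowed spend):

```promql
# Current accumulated spend, broken down by provider and model
sum by (provider, model) (llm_cost_sum)

# Top five sources by spend over the last 24 hours
topk(5, sum by (source) (delta(llm_cost_sum[24h])))
```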
8. Authentication Performance Analysis
Metric Name: `authentication_duration_milliseconds`
Type: Histogram
Unit: Milliseconds
Technical Description: Measures the complete authentication and authorization pipeline duration, including API key validation, usage limit verification, and permission checks. This metric is critical for identifying authentication bottlenecks that could impact overall gateway performance and user experience.
Authentication Pipeline Components:
- API Key Validation: Cryptographic verification and database lookups
- Usage Limit Checks: Real-time quota verification against configured limits
- Permission Validation: Role-based access control (RBAC) enforcement
- Workspace/Organization Context: Multi-tenant authorization processing
- Database Performance: Identifies slow authentication database queries
- Caching Effectiveness: Measures benefit of authentication result caching
- Security vs. Performance: Balances security thoroughness with response time requirements
- Security Performance Monitoring: Ensure security checks don’t degrade user experience
- Database Optimization: Identify authentication-related database performance issues
- Caching Strategy: Optimize authentication result caching for better performance
- Bottleneck Identification: Pinpoint authentication components causing delays
9. Rate Limiting Performance Metrics
Metric Name: `api_key_rate_limit_check_duration_milliseconds`
Type: Histogram
Unit: Milliseconds
Technical Description: Tracks the performance of hierarchical rate limiting checks across organization, workspace, and user levels. This metric monitors the computational overhead of the multi-level rate limiting system and identifies performance impacts of complex rate limiting policies.
Rate Limiting Hierarchy:
- Organization Level: Global rate limits across entire organization
- Workspace Level: Team or project-specific rate limits
- User Level: Individual user rate limits and quotas
- API Key Level: Specific API key usage limits
- Redis Performance: Measures Redis-based rate limiting performance
- Multi-Level Complexity: Tracks overhead of hierarchical limit checking
- Atomic Operations: Performance of distributed rate limiting algorithms
- Rate Limiting Optimization: Optimize rate limiting algorithms for better performance
- Redis Performance Monitoring: Track distributed rate limiting infrastructure performance
- Policy Impact Analysis: Measure performance impact of complex rate limiting policies
- Scaling Capacity Planning: Understand rate limiting overhead for capacity planning
10. Pre-Request Middleware Pipeline Performance
Metric Name: `pre_request_processing_duration_milliseconds`
Type: Histogram
Unit: Milliseconds
Technical Description: Comprehensive timing of the pre-request processing pipeline that prepares requests for LLM provider execution. This metric captures the overhead of all preparatory operations required before sending requests to upstream providers.
Pipeline Components:
- Request Context Creation: Building execution context and metadata
- Prompt Template Processing: Variable substitution and template rendering
- Guardrails Retrieval: Fetching and applying content safety policies
- Provider Configuration: Loading provider-specific settings and authentication
- Cache Key Generation: Computing cache identifiers for request deduplication
- Request Transformation: Converting requests to provider-specific formats
- Template Caching: Optimize prompt template compilation and caching
- Guardrails Performance: Minimize overhead of content safety checks
- Configuration Caching: Reduce database queries for provider settings
- Middleware Performance Optimization: Identify and optimize slow preprocessing steps
- Request Preparation Monitoring: Track overhead of request enhancement features
- Feature Impact Analysis: Measure performance cost of advanced features
- Pipeline Efficiency: Optimize the request processing pipeline for better throughput
11. Post-Request Processing Pipeline Performance
Metric Name: `post_request_processing_duration_milliseconds`
Type: Histogram
Unit: Milliseconds
Technical Description: Measures the duration of post-request processing operations that occur after receiving responses from LLM providers but before returning results to clients. This metric captures the overhead of response enhancement, logging, analytics, and cleanup operations.
Post-Processing Components:
- Response Transformation: Converting provider responses to standardized formats
- Analytics Data Collection: Gathering metrics and usage statistics
- Audit Logging: Recording detailed request/response logs for compliance
- Cache Writing: Storing responses for future cache hits
- Webhook Execution: Triggering configured post-request webhooks
- Cost Calculation: Computing and recording usage costs
- Analytics Overhead: Measures cost of detailed usage analytics
- Compliance Impact: Tracks overhead of audit logging requirements
- Cache Performance: Monitors cache write operation performance
- Response Processing Optimization: Minimize post-request processing overhead
- Analytics Performance: Optimize data collection and storage operations
- Cache Write Performance: Monitor and optimize cache storage operations
- Compliance Monitoring: Track overhead of regulatory compliance features
12. Cache System Performance Analysis
Metric Name: `llm_cache_processing_duration_milliseconds`
Type: Histogram
Unit: Milliseconds
Technical Description: Dedicated metric for measuring cache system performance, including cache key computation, cache lookups, cache hits/misses, and cache storage operations. This metric is essential for optimizing cache configuration and understanding cache system impact on overall performance. A cache query sketch follows the bullets below.
Cache Operation Types:
- Cache Key Generation: Computing deterministic cache identifiers
- Cache Lookup Operations: Reading from distributed cache storage
- Cache Hit Processing: Deserializing and returning cached responses
- Cache Miss Handling: Managing cache misses and preparing for cache writes
- Cache Storage Operations: Writing new responses to cache storage
- Hit vs. Miss Performance: Compare cache hit vs. miss processing times
- Storage Backend Performance: Monitor Redis, Memcached, or other cache backend performance
- Serialization Overhead: Track cost of response serialization/deserialization
- Network Latency: Measure distributed cache network performance
- Cache Strategy Tuning: Optimize cache TTL and eviction policies
- Serialization Optimization: Improve response serialization performance
- Cache Backend Scaling: Plan cache infrastructure scaling based on performance data
- Cache Hit Rate Optimization: Improve cache key generation for better hit rates
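A cache performance sketch comparing hit and miss processing times (PromQL, using the histogram `_sum`/`_count` series):

```promql
# Average cache processing time in ms, split by cache outcome
sum by (cacheStatus) (rate(llm_cache_processing_duration_milliseconds_sum[5m]))
  / sum by (cacheStatus) (rate(llm_cache_processing_duration_milliseconds_count[5m]))
```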
13. gRPC Request Conversion Performance
Metric Name: `grpc_req_conversion_duration_milliseconds`
Type: Histogram
Unit: Milliseconds
Technical Description: Measures the performance of converting incoming gRPC requests to HTTP format before forwarding to the internal HTTP handler. This metric is critical for understanding the overhead introduced by the gRPC-to-HTTP adapter layer and identifying potential bottlenecks in the gRPC gateway implementation.
Bucket Configuration: `[0.01, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]`
- Optimized for conversion operations typically ranging from sub-millisecond to several seconds
- Higher resolution at lower latencies to detect micro-optimizations
- gRPC Metadata Extraction: Converting gRPC metadata to HTTP headers
- Request Body Processing: Transforming gRPC request body to HTTP format
- Protocol Translation: Converting gRPC semantics to HTTP semantics
- Header Validation: Filtering and validating headers for HTTP compatibility
- Adapter Efficiency: Identify inefficiencies in the gRPC-to-HTTP conversion logic
- Serialization Performance: Monitor overhead of request format transformation
- Memory Usage: Track memory allocation patterns during conversion
- Protocol Overhead: Measure the cost of protocol translation
- gRPC Gateway Optimization: Optimize the conversion layer for better throughput
- Service Migration: Compare gRPC vs. HTTP performance during service transitions
- Protocol Selection: Inform decisions about when to use gRPC vs. HTTP
- Performance Regression Detection: Alert on conversion performance degradation
Configuration
Dynamic Metadata Label System
The Portkey Enterprise Gateway supports dynamic metadata labeling through request-specific metadata injection. This powerful feature enables fine-grained observability across custom dimensions specific to your organization’s structure and use cases.
Environment Variable: `PROMETHEUS_LABELS_METADATA_ALLOWED_KEYS`
Configuration Format: Comma-separated list of metadata keys
- Metadata is passed via the `x-portkey-metadata` HTTP header as a JSON string
- The JSON string is parsed and keys are validated against the allowlist for security
- Invalid or missing metadata values are handled gracefully (empty object returned)
- Labels are automatically prefixed with `metadata_` to avoid naming conflicts
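For example, if `team` were one of the allowed keys, the resulting `metadata_team` label (hypothetical; it depends entirely on your allowlist) could drive per-team breakdowns:

```promql
# Hypothetical: request volume per team, assuming "team" is in the allowlist
sum by (metadata_team) (rate(request_count[5m]))

# Hypothetical: cost attribution per team
sum by (metadata_team) (llm_cost_sum)
```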
Security Considerations:
- Cardinality Control: Limit allowed keys to prevent metric explosion
- Sensitive Data Protection: Avoid including PII or sensitive information in labels
- Performance Impact: Each additional label increases metric storage requirements
Comprehensive Monitoring Examples
Traffic Analysis & Capacity Planning
Request Volume Monitoring:
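A request-volume sketch in PromQL (time windows are illustrative):

```promql
# Total gateway request rate (requests per second)
sum(rate(request_count[5m]))

# Request rate per endpoint, for spotting peak-traffic patterns
sum by (endpoint) (rate(request_count[5m]))
```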
Performance Monitoring & SLA Tracking
Latency Percentile Analysis:
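Percentile analysis against SLA targets (PromQL sketch, assuming the standard `_bucket` histogram series):

```promql
# P99 end-to-end latency in seconds
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# P99 provider latency in milliseconds, per provider
histogram_quantile(0.99, sum by (le, provider) (rate(llm_request_duration_milliseconds_bucket[5m])))
```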
Cache Performance Optimization
Cache Effectiveness Analysis:
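A cache-effectiveness sketch (PromQL):

```promql
# Cache hit ratio: hits / (hits + misses)
sum(rate(request_count{cacheStatus="hit"}[5m]))
  / sum(rate(request_count{cacheStatus=~"hit|miss"}[5m]))

# Average cache processing time in ms
sum(rate(llm_cache_processing_duration_milliseconds_sum[5m]))
  / sum(rate(llm_cache_processing_duration_milliseconds_count[5m]))
```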
Cost Management & Business Intelligence
Real-Time Cost Monitoring:
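A cost-monitoring sketch (PromQL; `delta()` is used because `llm_cost_sum` is a gauge accumulator):

```promql
# Spend over the last hour, per provider
sum by (provider) (delta(llm_cost_sum[1h]))

# Current total accumulated spend across all providers
sum(llm_cost_sum)
```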
Security & Authentication Performance
Authentication Performance Monitoring:
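An authentication and rate-limiting latency sketch (PromQL):

```promql
# P95 authentication pipeline latency in ms
histogram_quantile(0.95, sum by (le) (rate(authentication_duration_milliseconds_bucket[5m])))

# P95 rate-limit check latency in ms
histogram_quantile(0.95, sum by (le) (rate(api_key_rate_limit_check_duration_milliseconds_bucket[5m])))
```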
Advanced Performance Analysis
Processing Pipeline Breakdown:
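A stage-by-stage breakdown of average internal processing time (PromQL sketch using the histogram `_sum`/`_count` series):

```promql
# Average pre-request processing time (ms)
sum(rate(pre_request_processing_duration_milliseconds_sum[5m])) / sum(rate(pre_request_processing_duration_milliseconds_count[5m]))

# Average post-request processing time (ms)
sum(rate(post_request_processing_duration_milliseconds_sum[5m])) / sum(rate(post_request_processing_duration_milliseconds_count[5m]))

# Average gateway processing time excluding streaming latency (ms)
sum(rate(portkey_processing_time_excluding_last_byte_ms_sum[5m])) / sum(rate(portkey_processing_time_excluding_last_byte_ms_count[5m]))
```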
gRPC Gateway Performance Analysis
gRPC Conversion Efficiency Monitoring:
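A gRPC conversion overhead sketch (PromQL):

```promql
# P99 gRPC-to-HTTP conversion time in ms
histogram_quantile(0.99, sum by (le) (rate(grpc_req_conversion_duration_milliseconds_bucket[5m])))

# Average conversion time in ms
sum(rate(grpc_req_conversion_duration_milliseconds_sum[5m])) / sum(rate(grpc_req_conversion_duration_milliseconds_count[5m]))
```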
Alerting Rules Examples
Critical Performance Alerts
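Example alert expressions, shown here as PromQL; thresholds are illustrative, and each expression would be wrapped in a Prometheus alerting rule with an appropriate `for:` duration:

```promql
# Gateway P99 end-to-end latency above 60 seconds
histogram_quantile(0.99, sum by (le) (rate(portkey_request_duration_milliseconds_bucket[5m]))) > 60000

# 5xx error rate above 5%
sum(rate(request_count{code=~"5.."}[5m])) / sum(rate(request_count[5m])) > 0.05

# P99 event loop lag above 1 second (Node.js saturation)
quantile_over_time(0.99, node_eventloop_lag_seconds[5m]) > 1
```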
Cost Management Alerts
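Example cost alert expressions (PromQL; the budget thresholds are placeholders to adapt to your spending limits):

```promql
# Hourly spend above a $100 budget threshold
sum(delta(llm_cost_sum[1h])) > 100

# Any single provider exceeding $50 of spend in an hour
sum by (provider) (delta(llm_cost_sum[1h])) > 50
```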
Security Considerations
When deploying metrics collection in production:
- Endpoint Security: Consider securing the `/metrics` endpoint with authentication or network restrictions
- Data Sensitivity: Avoid including sensitive information in metadata labels
- Cardinality Management: Limit metadata keys to prevent metric explosion and storage issues
- Network Security: Ensure secure communication between Prometheus and the gateway
- Access Control: Implement appropriate access controls for monitoring dashboards
Troubleshooting
Common Issues
High Cardinality Metrics:
- Symptom: Prometheus storage growth, query performance degradation
- Solution: Review the `PROMETHEUS_LABELS_METADATA_ALLOWED_KEYS` configuration and limit high-cardinality labels
Missing Metrics:
- Check that the `/metrics` endpoint is accessible
- Verify Prometheus scraping configuration
- Review gateway logs for metric collection errors
Performance Impact:
- Monitor the overhead of metrics collection on gateway performance
- Consider adjusting scrape intervals for high-volume deployments
Related Documentation
- Analytics Dashboard - SaaS monitoring and analytics
- Private Cloud Architecture - Deployment architecture overview
- Observability - General observability features