This feature is available on all Portkey plans.
Portkey's load balancing feature efficiently distributes network traffic across multiple LLMs. This ensures high availability and optimal performance of your generative AI apps, preventing any single LLM from becoming a performance bottleneck.

Enable Load Balancing

To enable Load Balancing, modify the config object to include a strategy with loadbalance mode. Here's a quick example that splits traffic 75/25 between an OpenAI and an Azure OpenAI account:
{
  "strategy": {
      "mode": "loadbalance"
  },
  "targets": [
    {
      "provider":"@openai-virtual-key",
      "weight": 0.75
    },
    {
      "provider":"@azure-virtual-key",
      "weight": 0.25
    }
  ]
}

You can create this config in Portkey and then reference it in your requests.
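For instance, once the config above is saved, a request can reference it by its config ID. Here is a minimal TypeScript sketch using the gateway's REST endpoint; the config ID pc-loadbalance-xxx and the model name are placeholders:

// Call Portkey's chat completions endpoint with a saved load-balancing config.
const response = await fetch("https://api.portkey.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-portkey-api-key": process.env.PORTKEY_API_KEY!, // your Portkey API key
    "x-portkey-config": "pc-loadbalance-xxx",          // placeholder config ID
  },
  body: JSON.stringify({
    model: "gpt-4o", // placeholder model name
    messages: [{ role: "user", content: "Hello!" }],
  }),
});
console.log(await response.json());

With the weights above, roughly three out of four such requests are served by the OpenAI target and the rest by the Azure OpenAI target.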

How Load Balancing Works

  1. Defining the Loadbalance Targets & their Weights: You provide a list of providers, and assign a weight value to each target. The weights represent the relative share of requests that should be routed to each target.
  2. Weight Normalization: Portkey first sums all the weights you provided for the targets, then divides each target's weight by that total to calculate its normalized weight. This ensures the weights add up to 1 (or 100%), allowing Portkey to distribute the load proportionally. For example, say you have three targets with weights 5, 3, and 1. The total is 9 (5 + 3 + 1), so Portkey normalizes the weights as follows (see the sketch after this list):
    • Target 1: 5 / 9 ≈ 0.56 (56% of the traffic)
    • Target 2: 3 / 9 ≈ 0.33 (33% of the traffic)
    • Target 3: 1 / 9 ≈ 0.11 (11% of the traffic)
  3. Request Distribution: When a request comes in, Portkey routes it to a target LLM based on the normalized weight probabilities. This ensures the traffic is distributed across the LLMs according to the specified weights.
  • The default weight value is 1; if weight is not set for a target, this default applies.
  • The minimum weight value is 0.
  • You can set "weight": 0 for a specific target to stop routing traffic to it without removing it from your config.
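To make the normalization and selection steps concrete, here is a short TypeScript sketch of the same logic. This is an illustration of the technique, not Portkey's actual implementation:

// Weight normalization and weighted random selection, as described above.
interface Target {
  provider: string;
  weight?: number;
}

function pickTarget(targets: Target[]): Target {
  const weights = targets.map((t) => t.weight ?? 1); // missing weight defaults to 1
  const total = weights.reduce((sum, w) => sum + w, 0);
  let r = Math.random(); // uniform draw in [0, 1)
  for (let i = 0; i < targets.length; i++) {
    r -= weights[i] / total; // subtract each normalized weight
    if (r < 0) return targets[i];
  }
  return targets[targets.length - 1]; // guard against floating-point drift
}

// Weights 5, 3, and 1 route roughly 56%, 33%, and 11% of requests.
pickTarget([
  { provider: "@provider-a", weight: 5 },
  { provider: "@provider-b", weight: 3 },
  { provider: "@provider-c", weight: 1 },
]);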

Sticky Load Balancing

Sticky load balancing ensures that requests with the same identifier are consistently routed to the same target. This is useful for:
  • Maintaining conversation context across multiple requests
  • Ensuring consistent model behavior for A/B testing
  • Session-based routing for user-specific experiences

Configuration

Add sticky_session to your load balancing strategy:
{
  "strategy": {
    "mode": "loadbalance",
    "sticky_session": {
      "hash_fields": ["metadata.user_id"],
      "ttl": 3600
    }
  },
  "targets": [
    {
      "provider": "@openai-virtual-key",
      "weight": 0.5
    },
    {
      "provider": "@anthropic-virtual-key", 
      "weight": 0.5
    }
  ]
}

Parameters

  • hash_fields (array): Fields used to generate the sticky session identifier. Supports dot notation for nested fields (e.g., metadata.user_id, metadata.session_id).
  • ttl (number): Time-to-live in seconds for the sticky session. After this period, a new target may be selected. Default: 3600 (1 hour).
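For hash_fields such as metadata.user_id to resolve, each request must carry that metadata. One way to supply it is the x-portkey-metadata header, sketched below in TypeScript; the config ID and user ID are placeholders:

// Send metadata with the request so the gateway can compute the sticky hash.
const response = await fetch("https://api.portkey.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-portkey-api-key": process.env.PORTKEY_API_KEY!,
    "x-portkey-config": "pc-sticky-xxx", // placeholder: config with sticky_session
    "x-portkey-metadata": JSON.stringify({ user_id: "user-123" }), // placeholder ID
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: "Continue our conversation." }],
  }),
});

Requests that share the same user_id hash to the same identifier and therefore reach the same target until the TTL expires.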

How It Works

  1. Identifier Generation: When a request arrives, Portkey generates a hash from the specified hash_fields values
  2. Target Lookup: The hash is used to look up the previously assigned target from cache
  3. Consistent Routing: If a cached assignment exists and hasn’t expired, the request goes to the same target
  4. New Assignment: If no cached assignment exists, a new target is selected based on weights and cached for future requests
Sticky sessions use a two-tier cache system (in-memory + Redis) for fast lookups and persistence across gateway instances in distributed deployments.
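The four steps above can be sketched in a few lines of TypeScript. This is an in-memory illustration only: Portkey's real cache is the two-tier system described above, and the fallback selection here is a plain random pick standing in for weighted selection:

// Sticky-session routing: hash the configured fields, then reuse or assign a target.
import { createHash } from "node:crypto";

const targets = ["@openai-virtual-key", "@anthropic-virtual-key"];
const assignments = new Map<string, { provider: string; expiresAt: number }>();

function routeSticky(
  metadata: Record<string, string>,
  hashFields: string[],
  ttlSeconds: number,
): string {
  // 1. Identifier generation: hash the values of the configured fields.
  const key = createHash("sha256")
    .update(hashFields.map((f) => metadata[f.replace(/^metadata\./, "")] ?? "").join("|"))
    .digest("hex");

  // 2–3. Target lookup: reuse the cached assignment if it hasn't expired.
  const cached = assignments.get(key);
  if (cached && cached.expiresAt > Date.now()) return cached.provider;

  // 4. New assignment: select a target (weighted in practice) and cache it.
  const provider = targets[Math.floor(Math.random() * targets.length)];
  assignments.set(key, { provider, expiresAt: Date.now() + ttlSeconds * 1000 });
  return provider;
}

routeSticky({ user_id: "user-123" }, ["metadata.user_id"], 3600);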

Caveats and Considerations

While the Load Balancing feature offers numerous benefits, there are a few things to consider:
  1. Ensure the LLMs in your list are compatible with your use case. Not all LLMs offer the same capabilities or respond in the same format.
  2. Track your usage of each LLM. Depending on your weight distribution, consumption can vary significantly across providers.
  3. Keep in mind that each LLM has its own latency and pricing; diversifying your traffic can affect both cost and response time.
  4. Sticky sessions require Redis for persistence across gateway instances. Without Redis, sticky sessions will only work within a single gateway instance’s memory.