This feature is available on all Portkey plans.
Portkey's load balancing feature efficiently distributes network traffic across multiple LLMs. This ensures high availability and optimal performance of your generative AI apps, preventing any single LLM from becoming a performance bottleneck.

Enable Load Balancing

To enable Load Balancing, modify the config object to include a strategy with loadbalance mode. Here's a quick example that splits traffic 75/25 between an OpenAI and an Azure OpenAI account:
{
  "strategy": {
      "mode": "loadbalance"
  },
  "targets": [
    {
      "provider":"@openai-virtual-key",
      "weight": 0.75
    },
    {
      "provider":"@azure-virtual-key",
      "weight": 0.25
    }
  ]
}

You can create this config in Portkey and then reference it in your requests.
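For instance, once the config above is saved, a request can reference it by its config ID. Here is a minimal TypeScript sketch using the gateway's REST endpoint; the config ID pc-loadbalance-xxx and the model name are placeholders:

// Call Portkey's chat completions endpoint with a saved load-balancing config.
const response = await fetch("https://api.portkey.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-portkey-api-key": process.env.PORTKEY_API_KEY!, // your Portkey API key
    "x-portkey-config": "pc-loadbalance-xxx",          // placeholder config ID
  },
  body: JSON.stringify({
    model: "gpt-4o", // placeholder model name
    messages: [{ role: "user", content: "Hello!" }],
  }),
});
console.log(await response.json());

With the weights above, roughly three out of four such requests are served by the OpenAI target and the rest by the Azure OpenAI target.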

How Load Balancing Works

  1. Defining the Loadbalance Targets & their Weights: You provide a list of providers, and assign a weight value to each target. The weights represent the relative share of requests that should be routed to each target.
  2. Weight Normalization: Portkey first sums all the weights you provided for the targets, then divides each target's weight by that total to calculate its normalized weight. This ensures the weights add up to 1 (or 100%), allowing Portkey to distribute the load proportionally. For example, say you have three targets with weights 5, 3, and 1. The total is 9 (5 + 3 + 1), so Portkey normalizes the weights as follows (see the sketch after this list):
    • Target 1: 5 / 9 ≈ 0.56 (56% of the traffic)
    • Target 2: 3 / 9 ≈ 0.33 (33% of the traffic)
    • Target 3: 1 / 9 ≈ 0.11 (11% of the traffic)
  3. Request Distribution: When a request comes in, Portkey routes it to a target LLM based on the normalized weight probabilities. This ensures the traffic is distributed across the LLMs according to the specified weights.
  • The default weight value is 1; if weight is not set for a target, this default applies.
  • The minimum weight value is 0.
  • You can set "weight": 0 for a specific target to stop routing traffic to it without removing it from your config.
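To make the normalization and selection steps concrete, here is a short TypeScript sketch of the same logic. This is an illustration of the technique, not Portkey's actual implementation:

// Weight normalization and weighted random selection, as described above.
interface Target {
  provider: string;
  weight?: number;
}

function pickTarget(targets: Target[]): Target {
  const weights = targets.map((t) => t.weight ?? 1); // missing weight defaults to 1
  const total = weights.reduce((sum, w) => sum + w, 0);
  let r = Math.random(); // uniform draw in [0, 1)
  for (let i = 0; i < targets.length; i++) {
    r -= weights[i] / total; // subtract each normalized weight
    if (r < 0) return targets[i];
  }
  return targets[targets.length - 1]; // guard against floating-point drift
}

// Weights 5, 3, and 1 route roughly 56%, 33%, and 11% of requests.
pickTarget([
  { provider: "@provider-a", weight: 5 },
  { provider: "@provider-b", weight: 3 },
  { provider: "@provider-c", weight: 1 },
]);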

Sticky Load Balancing

Sticky load balancing ensures that requests with the same identifier are consistently routed to the same target. This is useful for:
  • Maintaining conversation context across multiple requests
  • Ensuring consistent model behavior for A/B testing
  • Session-based routing for user-specific experiences

Configuration

Add sticky_session to your load balancing strategy:
{
  "strategy": {
    "mode": "loadbalance",
    "sticky_session": {
      "hash_fields": ["metadata.user_id"],
      "ttl": 3600
    }
  },
  "targets": [
    {
      "provider": "@openai-virtual-key",
      "weight": 0.5
    },
    {
      "provider": "@anthropic-virtual-key", 
      "weight": 0.5
    }
  ]
}

Parameters

  • hash_fields (array): Fields used to generate the sticky session identifier. Supports dot notation for nested fields (e.g., metadata.user_id, metadata.session_id).
  • ttl (number): Time-to-live in seconds for the sticky session. After this period, a new target may be selected. Default: 3600 (1 hour).
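For hash_fields such as metadata.user_id to resolve, each request must carry that metadata. One way to supply it is the x-portkey-metadata header, sketched below in TypeScript; the config ID and user ID are placeholders:

// Send metadata with the request so the gateway can compute the sticky hash.
const response = await fetch("https://api.portkey.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-portkey-api-key": process.env.PORTKEY_API_KEY!,
    "x-portkey-config": "pc-sticky-xxx", // placeholder: config with sticky_session
    "x-portkey-metadata": JSON.stringify({ user_id: "user-123" }), // placeholder ID
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: "Continue our conversation." }],
  }),
});

Requests that share the same user_id hash to the same identifier and therefore reach the same target until the TTL expires.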

How It Works

  1. Identifier Generation: When a request arrives, Portkey generates a hash from the specified hash_fields values
  2. Target Lookup: The hash is used to look up the previously assigned target from cache
  3. Consistent Routing: If a cached assignment exists and hasn’t expired, the request goes to the same target
  4. New Assignment: If no cached assignment exists, a new target is selected based on weights and cached for future requests
Sticky sessions use a two-tier cache system (in-memory + Redis) for fast lookups and persistence across gateway instances in distributed deployments.
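The four steps above can be sketched in a few lines of TypeScript. This is an in-memory illustration only: Portkey's real cache is the two-tier system described above, and the fallback selection here is a plain random pick standing in for weighted selection:

// Sticky-session routing: hash the configured fields, then reuse or assign a target.
import { createHash } from "node:crypto";

const targets = ["@openai-virtual-key", "@anthropic-virtual-key"];
const assignments = new Map<string, { provider: string; expiresAt: number }>();

function routeSticky(
  metadata: Record<string, string>,
  hashFields: string[],
  ttlSeconds: number,
): string {
  // 1. Identifier generation: hash the values of the configured fields.
  const key = createHash("sha256")
    .update(hashFields.map((f) => metadata[f.replace(/^metadata\./, "")] ?? "").join("|"))
    .digest("hex");

  // 2–3. Target lookup: reuse the cached assignment if it hasn't expired.
  const cached = assignments.get(key);
  if (cached && cached.expiresAt > Date.now()) return cached.provider;

  // 4. New assignment: select a target (weighted in practice) and cache it.
  const provider = targets[Math.floor(Math.random() * targets.length)];
  assignments.set(key, { provider, expiresAt: Date.now() + ttlSeconds * 1000 });
  return provider;
}

routeSticky({ user_id: "user-123" }, ["metadata.user_id"], 3600);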

Caveats and Considerations

While the Load Balancing feature offers numerous benefits, there are a few things to consider:
  1. Ensure the LLMs in your list are compatible with your use case. Not all LLMs offer the same capabilities or respond in the same format.
  2. Track your usage of each LLM. Depending on your weight distribution, consumption can vary significantly across providers.
  3. Keep in mind that each LLM has its own latency and pricing; diversifying your traffic can affect both cost and response time.
  4. Sticky sessions require Redis for persistence across gateway instances. Without Redis, sticky sessions will only work within a single gateway instance’s memory.