FutureAGI is an AI lifecycle platform that provides automated evaluation, tracing, and quality assessment for LLM applications. When combined with Portkey, you get a complete end-to-end observability solution covering both operational performance and response quality. 
Portkey handles the “what happened, how fast, and how much did it cost?” while FutureAGI answers “how good was the response?” 
## Why FutureAGI + Portkey?

The integration creates a powerful synergy:
- **Portkey** acts as the operational layer - unifying API calls, managing keys, and monitoring metrics like latency, cost, and request volume
- **FutureAGI** acts as the quality layer - capturing full request context and running automated evaluations to score model outputs
## Getting Started

### Prerequisites

Before integrating FutureAGI with Portkey, ensure you have:
- Python 3.8+ installed
- API keys:
  - Portkey API key
  - FutureAGI API key and secret key
### Installation

```bash
pip install portkey-ai fi-instrumentation traceai-portkey
```

### Setting up Environment Variables

Create a `.env` file in your project root:
```bash
# .env
PORTKEY_API_KEY="your-portkey-api-key"
FI_API_KEY="your-futureagi-api-key"
FI_SECRET_KEY="your-futureagi-secret-key"
```

## Integration Guide

### Step 1: Basic Setup

Import the necessary libraries and configure your environment:
```python
import asyncio
import json
import time

from portkey_ai import Portkey
from traceai_portkey import PortkeyInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType,
    EvalSpanKind, EvalName, ModelChoices
)
from dotenv import load_dotenv

load_dotenv()
```

### Step 2: Configure Evaluation Tags

Set up comprehensive evaluation tags to automatically assess model responses:
```python
def setup_tracing(project_version_name: str):
    """Setup tracing with comprehensive evaluation tags"""
    tracer_provider = register(
        project_name="Model-Benchmarking",
        project_type=ProjectType.EXPERIMENT,
        project_version_name=project_version_name,
        eval_tags=[
            # Evaluates if the response is concise
            EvalTag(
                type=EvalTagType.OBSERVATION_SPAN,
                value=EvalSpanKind.LLM,
                eval_name=EvalName.IS_CONCISE,
                custom_eval_name="Is_Concise",
                mapping={"input": "llm.output_messages.0.message.content"},
                model=ModelChoices.TURING_LARGE
            ),
            # Evaluates context adherence
            EvalTag(
                type=EvalTagType.OBSERVATION_SPAN,
                value=EvalSpanKind.LLM,
                eval_name=EvalName.CONTEXT_ADHERENCE,
                custom_eval_name="Response_Quality",
                mapping={
                    "context": "llm.input_messages.0.message.content",
                    "output": "llm.output_messages.0.message.content",
                },
                model=ModelChoices.TURING_LARGE
            ),
            # Evaluates task completion
            EvalTag(
                type=EvalTagType.OBSERVATION_SPAN,
                value=EvalSpanKind.LLM,
                eval_name=EvalName.TASK_COMPLETION,
                custom_eval_name="Task_Completion",
                mapping={
                    "input": "llm.input_messages.0.message.content",
                    "output": "llm.output_messages.0.message.content",
                },
                model=ModelChoices.TURING_LARGE
            ),
        ]
    )

    # Instrument the Portkey library
    PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)
    return tracer_provider
```

The `mapping` parameter in each `EvalTag` tells the evaluator where to find the necessary data within the trace. This is crucial for accurate evaluation.
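The keys of `mapping` are the fields the chosen evaluator expects (for example `input`, `output`, or `context`), and the values are attribute paths on the captured LLM span. A minimal sketch, reusing only the paths from the tags above (adjust the message index if a request contains more than one message):

```python
# Keys: fields the evaluator expects; values: span attribute paths
mapping = {
    "input": "llm.input_messages.0.message.content",    # first input message (the prompt)
    "output": "llm.output_messages.0.message.content",  # first output message (the response)
}
```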
### Step 3: Define Models and Test Scenarios

Configure the models you want to test and create your test scenarios:
```python
def get_models():
    """Setup model configurations with their Portkey Virtual Keys"""
    return [
        {
            "name": "GPT-4o",
            "provider": "OpenAI",
            "virtual_key": "openai-virtual-key",
            "model_id": "gpt-4o"
        },
        {
            "name": "Claude-3.7-Sonnet",
            "provider": "Anthropic",
            "virtual_key": "anthropic-virtual-key",
            "model_id": "claude-3-7-sonnet-latest"
        },
        {
            "name": "Llama-3-70b",
            "provider": "Groq",
            "virtual_key": "groq-virtual-key",
            "model_id": "llama3-70b-8192"
        },
    ]


def get_test_scenarios():
    """Returns a dictionary of test scenarios"""
    return {
        "reasoning_logic": "A farmer has 17 sheep. All but 9 die. How many are left?",
        "creative_writing": "Write a 6-word story about a robot who discovers music.",
        "code_generation": "Write a Python function to find the nth Fibonacci number.",
    }
```

### Step 4: Execute Tests with Automatic Evaluation

Run tests on each model while capturing both operational metrics and quality evaluations:
```python
async def test_model(model_config, prompt):
    """Tests a single model with a single prompt and returns the response"""
    tracer_provider = setup_tracing(model_config["name"])
    print(f"Testing {model_config['name']}...")

    client = Portkey(virtual_key=model_config["virtual_key"])

    start_time = time.time()
    completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model_config["model_id"],
        max_tokens=1024,
        temperature=0.5
    )
    response_time = time.time() - start_time

    response_text = completion.choices[0].message.content or ""
    return response_text


async def main():
    """Main execution function to run all tests"""
    models_to_test = get_models()
    scenarios = get_test_scenarios()

    for test_name, prompt in scenarios.items():
        print(f"\n{'=' * 20} SCENARIO: {test_name.upper()} {'=' * 20}")
        print(f"PROMPT: {prompt}")
        print("-" * 60)

        for model in models_to_test:
            await test_model(model, prompt)

        await asyncio.sleep(1)  # Brief pause between scenarios

    PortkeyInstrumentor().uninstrument()


if __name__ == "__main__":
    asyncio.run(main())
```

## Viewing Results

After running your tests, you’ll have two powerful dashboards to analyze performance:
### FutureAGI Dashboard - Quality View

Navigate to the **Prototype** tab in your FutureAGI dashboard to find your “Model-Benchmarking” project.
Key features: 
- Automated evaluation scores for each model response
- Detailed trace analysis with quality metrics
- Comparison views across different models
 
### Portkey Dashboard - Operational View

Access your Portkey dashboard to see operational metrics for all API calls:
Key metrics: 
- **Unified Logs**: Single view of all requests across providers
- **Cost Tracking**: Automatic cost calculation for every call
- **Latency Monitoring**: Response time comparisons across models
- **Token Usage**: Detailed token consumption analytics
## Advanced Use Cases

### Complex Agentic Workflows

The integration supports tracing complex workflows where you chain multiple LLM calls:
```python
# Example: E-commerce assistant with multiple LLM calls
async def ecommerce_assistant_workflow(user_query):
    # Step 1: Intent classification
    intent = await classify_intent(user_query)

    # Step 2: Product search
    products = await search_products(intent)

    # Step 3: Generate response
    response = await generate_response(products, user_query)

    # All steps are automatically traced and evaluated
    return response
```
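The helpers above (`classify_intent`, `search_products`, `generate_response`) are placeholders for your own functions. As a rough sketch, each LLM-backed step can wrap its own Portkey call so it is captured as a separate traced (and evaluated) span; the helper name, prompt, and Virtual Key below are illustrative only:

```python
# Illustrative sketch: `classify_intent` is a placeholder helper, not a
# Portkey or FutureAGI API. It reuses the client pattern from Step 4.
async def classify_intent(user_query: str) -> str:
    client = Portkey(virtual_key="openai-virtual-key")  # your Virtual Key
    completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "Classify the user's shopping intent in one word."},
            {"role": "user", "content": user_query},
        ],
        model="gpt-4o",
        max_tokens=10,
        temperature=0.0,
    )
    # PortkeyInstrumentor captures this call as its own LLM span
    return completion.choices[0].message.content or ""
```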
### CI/CD Integration

Leverage this integration in your CI/CD pipelines for:

- **Automated Model Testing**: Run evaluation suites on new model versions
- **Quality Gates**: Set thresholds for evaluation scores before deployment (see the sketch below)
- **Performance Monitoring**: Track degradation in model quality over time
- **Cost Optimization**: Monitor and alert on cost spikes
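As a rough illustration of a quality gate, the script below compares evaluation scores against minimum thresholds and fails the pipeline if any are missed. How you retrieve the scores is up to you; `fetch_eval_scores` is a hypothetical helper standing in for exporting results from your FutureAGI project, not an actual SDK call:

```python
import sys


def fetch_eval_scores(project_name):
    # Hypothetical helper: replace with however you export average scores
    # from your FutureAGI "Model-Benchmarking" project.
    raise NotImplementedError


# Minimum acceptable average score per evaluation, keyed by the
# custom_eval_name values defined in Step 2
THRESHOLDS = {"Is_Concise": 0.7, "Response_Quality": 0.8, "Task_Completion": 0.8}


def quality_gate():
    scores = fetch_eval_scores("Model-Benchmarking")
    failures = {
        name: scores.get(name, 0.0)
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    }
    if failures:
        print(f"Quality gate failed: {failures}")
        sys.exit(1)  # Non-zero exit fails the CI job
    print("Quality gate passed")


if __name__ == "__main__":
    quality_gate()
```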
## Benefits
- **Comprehensive Observability**: Track both operational metrics (cost, latency) and quality metrics (accuracy, relevance) in one place
- **Automated Evaluation**: No manual evaluation needed - FutureAGI automatically scores responses on multiple dimensions
- **Multi-Model Comparison**: Easily compare different models side-by-side on the same tasks
- **Production Ready**: Built-in alerting and monitoring for your production LLM applications
## Example Notebooks
- **Interactive Colab Notebook**: Try out the FutureAGI + Portkey integration with our interactive notebook
## Next Steps
1. Create your FutureAGI account
2. Set up Virtual Keys in Portkey
3. Run the example code to see automated evaluation in action
4. Customize evaluation tags for your specific use cases
5. Integrate into your CI/CD pipeline for continuous model quality monitoring