CUGA Profiling

This directory contains tools for profiling CUGA digital sales tasks with different configurations and models, extracting performance metrics and LLM call information from Langfuse.

Directory Structure

system_tests/profiling/
├── README.md                    # This file
├── run_experiment.sh            # Main entry point for running experiments
├── serve.sh                     # HTTP server for viewing results
├── bin/                         # Internal scripts
│   ├── profile_digital_sales_tasks.py
│   ├── run_profiling.sh
│   └── run_experiment.sh
├── config/                      # Configuration files
│   ├── default_experiment.yaml  # Default experiment configuration
│   ├── fast_vs_accurate.yaml    # Example: Fast vs Accurate comparison
│   └── .secrets.yaml            # Secrets file (git-ignored)
├── experiments/                 # Experiment results and comparison HTML
│   └── comparison.html
└── reports/                     # Individual profiling reports

Quick Start

1. Set Up Environment Variables

Create a .env file in the project root or export these variables:

export LANGFUSE_PUBLIC_KEY="pk-your-public-key"
export LANGFUSE_SECRET_KEY="sk-your-secret-key"
export LANGFUSE_HOST="https://cloud.langfuse.com"  # Optional
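
If you prefer a .env file, the equivalent entries use the same variable names without the export keyword. A minimal sketch, assuming a standard KEY=value dotenv format:

LANGFUSE_PUBLIC_KEY=pk-your-public-key
LANGFUSE_SECRET_KEY=sk-your-secret-key
# Optional:
LANGFUSE_HOST=https://cloud.langfuse.com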

2. Run an Experiment

The simplest way to run experiments is to use the configuration files:

# Run default experiment (fast vs balanced)
./system_tests/profiling/run_experiment.sh

# Run a specific experiment configuration
./system_tests/profiling/run_experiment.sh --config fast_vs_accurate.yaml

# Run and automatically open results in browser
./system_tests/profiling/run_experiment.sh --config default_experiment.yaml --open

3. View Results

Results are automatically saved to system_tests/profiling/experiments/ and can be viewed in the HTML dashboard:

# Start the server (serves experiments directory)
./system_tests/profiling/serve.sh

# Or start and open browser automatically
./system_tests/profiling/serve.sh --open

# Use a different port
./system_tests/profiling/serve.sh --port 3000

Configuration Files

Configuration files are YAML, loaded with Dynaconf. Each file defines an experiment with multiple runs and comparison settings.

Example Configuration

profiling:
  configs:
    - "settings.openai.toml"
  modes:
    - "fast"
    - "balanced"
  tasks:
    - "test_get_top_account_by_revenue_stream"
  runs: 3

experiment:
  name: "fast_vs_balanced"
  description: "Compare fast and balanced modes"
  
  runs:
    - name: "fast_mode"
      test_id: "settings.openai.toml:fast:test_get_top_account_by_revenue_stream"
      iterations: 3
      output: "experiments/fast_{{timestamp}}.json"
      env:
        MODEL_NAME: "Azure/gpt-4o"  # Set environment variable
    
    - name: "balanced_mode"
      test_id: "settings.openai.toml:balanced:test_get_top_account_by_revenue_stream"
      iterations: 3
      output: "experiments/balanced_{{timestamp}}.json"
      env:
        MODEL_NAME: null  # Unset environment variable
  
  comparison:
    generate_html: true
    html_output: "experiments/comparison.html"
    auto_open: false

Configuration Options

Profiling Section

  • configs: List of configuration files to test (e.g., settings.openai.toml)
  • modes: List of CUGA modes (fast, balanced, accurate)
  • tasks: List of test tasks to run
  • runs: Number of iterations per configuration
  • output: Output directory and filename settings
  • langfuse: Langfuse connection settings (credentials from env vars)

Experiment Section

  • name: Name of the experiment
  • description: Description of what's being tested
  • runs: List of experiment runs to execute
    • name: Display name for the run
    • test_id: Specific test to run (format: config:mode:task)
    • iterations: Number of times to run this test
    • output: Output file path (use {{timestamp}} for dynamic naming)
    • env: (Optional) Environment variables to set/unset for this run
      • Set a variable: VAR_NAME: "value"
      • Unset a variable: VAR_NAME: null
  • comparison: Settings for generating comparison HTML

Available Test IDs

Test IDs follow the format: config:mode:task

Configurations:

  • settings.openai.toml
  • settings.azure.toml
  • settings.watsonx.toml

Modes:

  • fast
  • balanced
  • accurate

Tasks:

  • test_get_top_account_by_revenue_stream
  • test_list_my_accounts
  • test_find_vp_sales_active_high_value_accounts
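
For example, combining one entry from each list gives the test ID:

settings.azure.toml:balanced:test_list_my_accounts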

To list all available test IDs:

./system_tests/profiling/bin/run_profiling.sh --list-tests

Advanced Usage

Command Line Interface

You can also use CLI arguments directly:

# Run specific configuration with CLI args
./system_tests/profiling/bin/run_profiling.sh \
  --configs settings.openai.toml \
  --modes fast,balanced \
  --runs 3

# Run a single test by ID
./system_tests/profiling/bin/run_profiling.sh \
  --test-id settings.openai.toml:fast:test_get_top_account_by_revenue_stream \
  --runs 5

# Use config file but override runs
./system_tests/profiling/bin/run_profiling.sh \
  --config-file default_experiment.yaml \
  --runs 5

Direct Python Usage

# Run with config file
cd /path/to/project
uv run python system_tests/profiling/bin/profile_digital_sales_tasks.py \
  --config-file default_experiment.yaml

# Run with CLI arguments
uv run python system_tests/profiling/bin/profile_digital_sales_tasks.py \
  --configs settings.openai.toml \
  --modes fast \
  --tasks test_get_top_account_by_revenue_stream \
  --runs 3 \
  --output system_tests/profiling/reports/my_report.json

Output

Profiling Reports

Individual profiling runs generate JSON reports with:

  • Summary Statistics: Total tests, success rate, timing
  • Configuration Stats: Performance per config/mode
  • Langfuse Metrics: LLM calls, tokens, costs, node timings
  • Detailed Results: Complete test execution details
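
For a quick look at a generated report's structure, you can pretty-print it with Python's built-in json.tool (the report path here is the one from the earlier example):

uv run python -m json.tool system_tests/profiling/reports/my_report.json | head -n 40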

Comparison HTML

The comparison HTML (experiments/comparison.html) provides:

Interactive Visualizations:

  • 📊 Execution time comparison charts
  • 💰 Cost analysis across modes
  • 🎯 Token usage visualization
  • 🔄 LLM calls breakdown
  • 📊 Execution time variability (Min/Avg/Max with range and std dev)
  • ⚡ Time breakdown (generation vs processing)
  • 📈 Performance radar chart (normalized comparison)

Detailed Tables:

  • Summary view of all experiments
  • Configuration statistics table
  • Per-run Langfuse metrics
  • Aggregated metrics across runs

Features:

  • Tab navigation between charts and tables
  • Color-coded modes (Fast=green, Balanced=blue, Accurate=orange)
  • Interactive tooltips on hover
  • Automatic loading of all JSON files in the directory
  • Modern, responsive design

Creating Custom Experiments

  1. Create a new YAML file in system_tests/profiling/config/:

cp system_tests/profiling/config/default_experiment.yaml system_tests/profiling/config/my_experiment.yaml

  2. Edit the configuration to match your experiment's needs.

  3. Run your experiment:

./system_tests/profiling/run_experiment.sh --config my_experiment.yaml

Tips

  • Use {{timestamp}} in output paths for unique filenames
  • CLI arguments override config file settings
  • The HTML comparison automatically picks up new JSON files
  • Set credentials in .env or config/.secrets.yaml
  • Use --open flag to automatically open results in browser
  • Use env in experiment runs to set/unset environment variables per run
  • Set env.VAR: null to explicitly unset an environment variable

Troubleshooting

Port Conflicts

The scripts automatically clean up processes on ports 8000, 8001, 7860.
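
If a port is still occupied after an interrupted run, you can free it manually. A minimal sketch, assuming lsof is available:

# Find and kill whatever is listening on port 8000
lsof -ti :8000 | xargs kill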

Missing Credentials

Ensure LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY are set:

# Check if set
echo $LANGFUSE_PUBLIC_KEY
echo $LANGFUSE_SECRET_KEY

# Set temporarily
export LANGFUSE_PUBLIC_KEY="pk-..."
export LANGFUSE_SECRET_KEY="sk-..."

# Or add to .env file

Configuration Not Found

If a config file isn't found, check that:

  • The file exists in system_tests/profiling/config/
  • The filename is correct (case-sensitive)
  • You're running the command from the project root
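
To confirm which configuration files are actually present:

ls system_tests/profiling/config/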

Examples

Compare Fast vs Balanced (3 runs each)

./system_tests/profiling/run_experiment.sh --config default_experiment.yaml

Compare Providers (OpenAI vs Azure vs WatsonX)

./system_tests/profiling/run_experiment.sh --config providers_comparison.yaml

This compares different LLM providers using the same mode (balanced).
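
If providers_comparison.yaml is not already in your config/ directory, a sketch along these lines would set it up. It reuses the experiment schema shown earlier with balanced mode for each provider; the run names and output paths are illustrative:

cat > system_tests/profiling/config/providers_comparison.yaml <<'EOF'
experiment:
  name: "providers_comparison"
  description: "Compare OpenAI, Azure, and WatsonX in balanced mode"
  runs:
    - name: "openai_balanced"
      test_id: "settings.openai.toml:balanced:test_get_top_account_by_revenue_stream"
      iterations: 3
      output: "experiments/openai_balanced_{{timestamp}}.json"
    - name: "azure_balanced"
      test_id: "settings.azure.toml:balanced:test_get_top_account_by_revenue_stream"
      iterations: 3
      output: "experiments/azure_balanced_{{timestamp}}.json"
    - name: "watsonx_balanced"
      test_id: "settings.watsonx.toml:balanced:test_get_top_account_by_revenue_stream"
      iterations: 3
      output: "experiments/watsonx_balanced_{{timestamp}}.json"
EOF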

Compare All Modes with OpenAI

Create system_tests/profiling/config/all_modes.yaml:

experiment:
  name: "all_modes_comparison"
  runs:
    - name: "fast"
      test_id: "settings.openai.toml:fast:test_get_top_account_by_revenue_stream"
      iterations: 5
      output: "experiments/fast_{{timestamp}}.json"
    - name: "balanced"
      test_id: "settings.openai.toml:balanced:test_get_top_account_by_revenue_stream"
      iterations: 5
      output: "experiments/balanced_{{timestamp}}.json"
    - name: "accurate"
      test_id: "settings.openai.toml:accurate:test_get_top_account_by_revenue_stream"
      iterations: 5
      output: "experiments/accurate_{{timestamp}}.json"

Then run:

./system_tests/profiling/run_experiment.sh --config all_modes.yaml --open

Full Matrix Comparison (Providers Γ— Modes)

./system_tests/profiling/run_experiment.sh --config full_matrix_comparison.yaml

This creates a comprehensive comparison across multiple providers and modes.