# **CUGA Evaluation**

An evaluation framework for **CUGA**, enabling you to **test your APIs** against structured test cases with detailed scoring and reporting.

---

## **Features**

- ✅ Validate **API responses** against expected outputs
- ✅ Score **keywords**, **tool calls**, and **response similarity**
- ✅ Generate **JSON** and **CSV** reports for easy analysis
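
For intuition, here is a minimal sketch of how keyword and tool-call scores *could* be computed as the fraction of expected items found in the agent's answer. This is an illustration only, not CUGA's actual scoring code, and the function names are made up:

```python
# Illustrative sketch only -- not CUGA's real scoring logic.
def score_keywords(response: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the response (case-insensitive)."""
    if not expected_keywords:
        return 1.0
    hits = sum(1 for kw in expected_keywords if kw.lower() in response.lower())
    return hits / len(expected_keywords)


def score_tool_calls(actual_calls: list[dict], expected_calls: list[dict]) -> float:
    """Fraction of expected tool calls matched by name and args."""
    if not expected_calls:
        return 1.0
    matched = sum(
        1
        for expected in expected_calls
        if any(
            call.get("name") == expected["name"]
            and call.get("args", {}) == expected.get("args", {})
            for call in actual_calls
        )
    )
    return matched / len(expected_calls)
```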
---

## **Test File Schema**

Your test file must be a **JSON** file following this structure:
```json
{
  "name": "name for the test suite",
  "title": "TestCases",
  "type": "object",
  "properties": {
    "test_cases": {
      "type": "array",
      "items": {
        "$ref": "#/definitions/TestCase"
      }
    }
  },
  "required": ["test_cases"],
  "definitions": {
    "ToolCall": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "args": { "type": "object" }
      },
      "required": ["name", "args"]
    },
    "ExpectedOutput": {
      "type": "object",
      "properties": {
        "response": { "type": "string" },
        "keywords": {
          "type": "array",
          "items": { "type": "string" }
        },
        "tool_calls": {
          "type": "array",
          "items": { "$ref": "#/definitions/ToolCall" }
        }
      },
      "required": ["response", "keywords", "tool_calls"]
    },
    "TestCase": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "description": { "type": "string" },
        "intent": { "type": "string" },
        "expected_output": { "$ref": "#/definitions/ExpectedOutput" }
      },
      "required": ["name", "description", "intent", "expected_output"]
    }
  }
}
```
---

### **Schema Overview**

| Entity             | Description                                                  |
|--------------------|--------------------------------------------------------------|
| **ToolCall**       | Represents a tool invocation with `name` and `args`.         |
| **ExpectedOutput** | Expected response, keywords, and tool calls.                 |
| **TestCase**       | Defines a single test case with intent and expected output.  |
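
Before running an evaluation you can sanity-check a test file against this schema, for example with the `jsonschema` package (a sketch; the two file paths below are placeholders for wherever you keep the schema and your test file):

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Placeholder paths -- point these at your copies of the schema and test file.
with open("test_cases_schema.json") as f:
    schema = json.load(f)
with open("my_test_cases.json") as f:
    test_file = json.load(f)

try:
    validate(instance=test_file, schema=schema)
    print("Test file matches the schema.")
except ValidationError as err:
    print(f"Schema violation: {err.message}")
```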
---

## **Output Format**

The evaluation generates **two files**:

- `results.json`
- `results.csv`

### **JSON Structure**
```json
{
  "summary": {
    "total_tests": "...",
    "avg_keyword_score": "...",
    "avg_tool_call_score": "...",
    "avg_response_score": "..."
  },
  "results": [
    {
      "index": "...",
      "test_name": "...",
      "score": {
        "keyword_score": "...",
        "tool_call_score": "...",
        "response_score": "...",
        "response_scoring_type": "..."
      },
      "details": {
        "missing_keywords": "...",
        "expected_keywords": "...",
        "expected_tool_calls": "...",
        "tool_call_mismatches": "...",
        "response_expected": "...",
        "response_actual": "...",
        "response_scoring_type": "..."
      }
    }
  ]
}
```
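
As a sketch of post-processing this report, the snippet below prints the summary and flags tests with a low keyword score; the field names come from the structure above, while the numeric types and the 0.8 threshold are assumptions:

```python
import json

# Load the report produced by the evaluation run.
with open("results.json") as f:
    report = json.load(f)

summary = report["summary"]
print(f"total tests:         {summary['total_tests']}")
print(f"avg keyword score:   {summary['avg_keyword_score']}")
print(f"avg tool call score: {summary['avg_tool_call_score']}")
print(f"avg response score:  {summary['avg_response_score']}")

# Flag individual tests with a low keyword score (0.8 is an arbitrary cutoff).
for result in report["results"]:
    if float(result["score"]["keyword_score"]) < 0.8:
        missing = result["details"]["missing_keywords"]
        print(f"{result['test_name']}: missing keywords {missing}")
```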
## **Langfuse Tracing (Optional)**

### Set up Langfuse

In a different folder (not under CUGA), run:

```bash
# Get a copy of the latest Langfuse repository
git clone https://github.com/langfuse/langfuse.git
cd langfuse

# Run the Langfuse docker compose
docker compose up
```
### Get API Keys

1. Access the Langfuse UI: Open a web browser and navigate to the URL where your self-hosted Langfuse instance is running (e.g., http://localhost:3000 if running locally with default ports).
2. Log in: Sign in with the user account you created during the initial setup, or create a new account.
3. Navigate to Project Settings:
   - Click on the "Project" menu (usually in the sidebar or top navigation).
   - Select "Settings".
4. View API Keys:
   - In the settings area, you will find a section for API keys.
   - You can view or regenerate your LANGFUSE_PUBLIC_KEY (username) and LANGFUSE_SECRET_KEY (password) there.
   - The secret key is hidden by default; you may need to click an eye icon or a specific button to reveal and copy it.
5. Add the API keys and host to your `.env` file:

   ```dotenv
   LANGFUSE_SECRET_KEY="your-secret-key"
   LANGFUSE_PUBLIC_KEY="your-public-key"
   LANGFUSE_HOST="http://localhost:3000"
   ```
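
Before enabling tracing, you can optionally verify the keys with the Langfuse Python SDK (`pip install langfuse`); this is a quick sketch that reads the same variables from the `.env` file above:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv
from langfuse import Langfuse   # pip install langfuse

# Load the LANGFUSE_* variables from the .env file created above.
load_dotenv()

client = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "http://localhost:3000"),
)

# auth_check() returns True when the keys and host are valid.
print("Langfuse reachable:", client.auth_check())
```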
### Update settings

Then in `vendor/cuga-agent/src/cuga/settings.toml` update:

```toml
langfuse_tracing = true
```
---

## **Quick Start Example**

Run the evaluation on our default `digital_sales` API using our example test case.

This is the example input JSON:
```json
{
  "name": "digital-sales",
  "test_cases": [
    {
      "name": "test_get_top_account",
      "description": "gets the top account by revenue",
      "intent": "get my top account by revenue",
      "expected_output": {
        "response": "**Top Account by Revenue** - **Name:** Andromeda Inc. - **Revenue:** $9,700,000 - **Account ID:** acc_49",
        "keywords": ["Andromeda Inc.", "9,700,000"],
        "tool_calls": [
          {
            "name": "digital_sales_get_my_accounts_my_accounts_get",
            "args": {}
          }
        ]
      }
    }
  ]
}
```
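
As a quick self-check of a test file like this one, you can confirm that every case's keywords actually occur in its expected response (a sketch, using the example file's path from step 3 below):

```python
import json

# Path of the example test file (also used in the quick start below).
with open("docs/examples/evaluation/input_example.json") as f:
    suite = json.load(f)

for case in suite["test_cases"]:
    expected = case["expected_output"]
    missing = [kw for kw in expected["keywords"] if kw not in expected["response"]]
    status = f"keywords not found in expected response: {missing}" if missing else "OK"
    print(f"{case['name']}: {status}")
```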
First, set `tracker_enabled = true` in `settings.toml`.

Now you can start running the example:
1. **Update API URL** in [mcp_servers.yaml](src/cuga/backend/tools_env/registry/config/mcp_servers.yaml):

   ```yaml
   url: http://localhost:8000/openapi.json
   ```

2. **Start the API server**:

   ```bash
   uv run digital_sales_openapi
   ```

3. **Run evaluation**:

   ```bash
   cuga evaluate docs/examples/evaluation/input_example.json
   ```
You’ll get `results.json` and `results.csv` in the project root.
---

## **Usage**

```bash
cuga evaluate -t <test file path> -r <results file path>
```
Steps:

1. Update [mcp_servers.yaml](src/cuga/backend/tools_env/registry/config/mcp_servers.yaml) with your APIs, or create a new YAML file and run:

   ```shell
   export MCP_SERVERS_FILE=<location>
   ```

2. Create a test file following the schema.
3. Run the evaluation command above.
---