# **CUGA Evaluation**

An evaluation framework for **CUGA**, enabling you to **test your APIs** against structured test cases with detailed scoring and reporting.

---

## **Features**

- ✅ Validate **API responses** against expected outputs
- ✅ Score **keywords**, **tool calls**, and **response similarity**
- ✅ Generate **JSON** and **CSV** reports for easy analysis
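
For intuition, here is a minimal sketch of how keyword and tool-call scores *could* be computed as the fraction of expected items found in the agent's answer. This is an illustration only, not CUGA's actual scoring code, and the function names are made up:

```python
# Illustrative sketch only -- not CUGA's real scoring logic.
def score_keywords(response: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the response (case-insensitive)."""
    if not expected_keywords:
        return 1.0
    hits = sum(1 for kw in expected_keywords if kw.lower() in response.lower())
    return hits / len(expected_keywords)


def score_tool_calls(actual_calls: list[dict], expected_calls: list[dict]) -> float:
    """Fraction of expected tool calls matched by name and args."""
    if not expected_calls:
        return 1.0
    matched = sum(
        1
        for expected in expected_calls
        if any(
            call.get("name") == expected["name"]
            and call.get("args", {}) == expected.get("args", {})
            for call in actual_calls
        )
    )
    return matched / len(expected_calls)
```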
---

## **Test File Schema**

Your test file must be a **JSON** file following this structure:
```json
{
  "name": "name for the test suite",
  "title": "TestCases",
  "type": "object",
  "properties": {
    "test_cases": {
      "type": "array",
      "items": {
        "$ref": "#/definitions/TestCase"
      }
    }
  },
  "required": ["test_cases"],
  "definitions": {
    "ToolCall": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "args": { "type": "object" }
      },
      "required": ["name", "args"]
    },
    "ExpectedOutput": {
      "type": "object",
      "properties": {
        "response": { "type": "string" },
        "keywords": {
          "type": "array",
          "items": { "type": "string" }
        },
        "tool_calls": {
          "type": "array",
          "items": { "$ref": "#/definitions/ToolCall" }
        }
      },
      "required": ["response", "keywords", "tool_calls"]
    },
    "TestCase": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "description": { "type": "string" },
        "intent": { "type": "string" },
        "expected_output": { "$ref": "#/definitions/ExpectedOutput" }
      },
      "required": ["name", "description", "intent", "expected_output"]
    }
  }
}
```
---

### **Schema Overview**

| Entity             | Description                                                  |
|--------------------|--------------------------------------------------------------|
| **ToolCall**       | Represents a tool invocation with `name` and `args`.         |
| **ExpectedOutput** | Expected response, keywords, and tool calls.                 |
| **TestCase**       | Defines a single test case with intent and expected output.  |
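
Before running an evaluation you can sanity-check a test file against this schema, for example with the `jsonschema` package (a sketch; the two file paths below are placeholders for wherever you keep the schema and your test file):

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Placeholder paths -- point these at your copies of the schema and test file.
with open("test_cases_schema.json") as f:
    schema = json.load(f)
with open("my_test_cases.json") as f:
    test_file = json.load(f)

try:
    validate(instance=test_file, schema=schema)
    print("Test file matches the schema.")
except ValidationError as err:
    print(f"Schema violation: {err.message}")
```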
---

## **Output Format**

The evaluation generates **two files**:

- `results.json`
- `results.csv`

### **JSON Structure**
```json
{
  "summary": {
    "total_tests": "...",
    "avg_keyword_score": "...",
    "avg_tool_call_score": "...",
    "avg_response_score": "..."
  },
  "results": [
    {
      "index": "...",
      "test_name": "...",
      "score": {
        "keyword_score": "...",
        "tool_call_score": "...",
        "response_score": "...",
        "response_scoring_type": "..."
      },
      "details": {
        "missing_keywords": "...",
        "expected_keywords": "...",
        "expected_tool_calls": "...",
        "tool_call_mismatches": "...",
        "response_expected": "...",
        "response_actual": "...",
        "response_scoring_type": "..."
      }
    }
  ]
}
```
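
As a sketch of post-processing this report, the snippet below prints the summary and flags tests with a low keyword score; the field names come from the structure above, while the numeric types and the 0.8 threshold are assumptions:

```python
import json

# Load the report produced by the evaluation run.
with open("results.json") as f:
    report = json.load(f)

summary = report["summary"]
print(f"total tests:         {summary['total_tests']}")
print(f"avg keyword score:   {summary['avg_keyword_score']}")
print(f"avg tool call score: {summary['avg_tool_call_score']}")
print(f"avg response score:  {summary['avg_response_score']}")

# Flag individual tests with a low keyword score (0.8 is an arbitrary cutoff).
for result in report["results"]:
    if float(result["score"]["keyword_score"]) < 0.8:
        missing = result["details"]["missing_keywords"]
        print(f"{result['test_name']}: missing keywords {missing}")
```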
## **Langfuse Tracing (Optional)**

### Set up Langfuse

In a different folder (not under CUGA), run:

```bash
# Get a copy of the latest Langfuse repository
git clone https://github.com/langfuse/langfuse.git
cd langfuse

# Run the Langfuse docker compose
docker compose up
```
### Get API Keys

1. Access the Langfuse UI: Open a web browser and navigate to the URL where your self-hosted Langfuse instance is running (e.g., http://localhost:3000 if running locally with default ports).
2. Log in: Sign in with the user account you created during the initial setup, or create a new account.
3. Navigate to Project Settings:
   - Click on the "Project" menu (usually in the sidebar or top navigation).
   - Select "Settings".
4. View API Keys:
   - In the settings area, you will find a section for API keys.
   - You can view or regenerate your LANGFUSE_PUBLIC_KEY (username) and LANGFUSE_SECRET_KEY (password) there.
   - The secret key is hidden by default; you may need to click an eye icon or a specific button to reveal and copy it.
5. Add the API keys and host to your `.env` file:

   ```dotenv
   LANGFUSE_SECRET_KEY="your-secret-key"
   LANGFUSE_PUBLIC_KEY="your-public-key"
   LANGFUSE_HOST="http://localhost:3000"
   ```
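
Before enabling tracing, you can optionally verify the keys with the Langfuse Python SDK (`pip install langfuse`); this is a quick sketch that reads the same variables from the `.env` file above:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv
from langfuse import Langfuse   # pip install langfuse

# Load the LANGFUSE_* variables from the .env file created above.
load_dotenv()

client = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "http://localhost:3000"),
)

# auth_check() returns True when the keys and host are valid.
print("Langfuse reachable:", client.auth_check())
```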
### Update settings

Then in `vendor/cuga-agent/src/cuga/settings.toml` update:

```toml
langfuse_tracing = true
```
---

## **Quick Start Example**

Run the evaluation on our default `digital_sales` API using our example test case.

This is the example input JSON:
```json
{
  "name": "digital-sales",
  "test_cases": [
    {
      "name": "test_get_top_account",
      "description": "gets the top account by revenue",
      "intent": "get my top account by revenue",
      "expected_output": {
        "response": "**Top Account by Revenue** - **Name:** Andromeda Inc. - **Revenue:** $9,700,000 - **Account ID:** acc_49",
        "keywords": ["Andromeda Inc.", "9,700,000"],
        "tool_calls": [
          {
            "name": "digital_sales_get_my_accounts_my_accounts_get",
            "args": {}
          }
        ]
      }
    }
  ]
}
```
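
As a quick self-check of a test file like this one, you can confirm that every case's keywords actually occur in its expected response (a sketch, using the example file's path from step 3 below):

```python
import json

# Path of the example test file (also used in the quick start below).
with open("docs/examples/evaluation/input_example.json") as f:
    suite = json.load(f)

for case in suite["test_cases"]:
    expected = case["expected_output"]
    missing = [kw for kw in expected["keywords"] if kw not in expected["response"]]
    status = f"keywords not found in expected response: {missing}" if missing else "OK"
    print(f"{case['name']}: {status}")
```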
First, set `tracker_enabled = true` in `settings.toml`.

Now you can start running the example:
1. **Update API URL** in [mcp_servers.yaml](src/cuga/backend/tools_env/registry/config/mcp_servers.yaml):

   ```yaml
   url: http://localhost:8000/openapi.json
   ```

2. **Start the API server**:

   ```bash
   uv run digital_sales_openapi
   ```

3. **Run evaluation**:

   ```bash
   cuga evaluate docs/examples/evaluation/input_example.json
   ```
You’ll get `results.json` and `results.csv` in the project root.
---

## **Usage**

```bash
cuga evaluate -t <test file path> -r <results file path>
```
Steps:

1. Update [mcp_servers.yaml](src/cuga/backend/tools_env/registry/config/mcp_servers.yaml) with your APIs, or create a new YAML file and run:

   ```shell
   export MCP_SERVERS_FILE=<location>
   ```

2. Create a test file following the schema.
3. Run the evaluation command above.
---