# **CUGA Evaluation**
An evaluation framework for **CUGA**, enabling you to **test your APIs** against structured test cases with detailed scoring and reporting.
---
## **Features**
- ✅ Validate **API responses** against expected outputs
- ✅ Score **keywords**, **tool calls**, and **response similarity**
- ✅ Generate **JSON** and **CSV** reports for easy analysis
---
## **Test File Schema**
Your test file must be a **JSON** file following this structure:
```json
{
  "name": "name for the test suite",
  "title": "TestCases",
  "type": "object",
  "properties": {
    "test_cases": {
      "type": "array",
      "items": {
        "$ref": "#/definitions/TestCase"
      }
    }
  },
  "required": ["test_cases"],
  "definitions": {
    "ToolCall": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "args": { "type": "object" }
      },
      "required": ["name", "args"]
    },
    "ExpectedOutput": {
      "type": "object",
      "properties": {
        "response": { "type": "string" },
        "keywords": {
          "type": "array",
          "items": { "type": "string" }
        },
        "tool_calls": {
          "type": "array",
          "items": { "$ref": "#/definitions/ToolCall" }
        }
      },
      "required": ["response", "keywords", "tool_calls"]
    },
    "TestCase": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "description": { "type": "string" },
        "intent": { "type": "string" },
        "expected_output": { "$ref": "#/definitions/ExpectedOutput" }
      },
      "required": ["name", "description", "intent", "expected_output"]
    }
  }
}
```
---
### **Schema Overview**
| Entity | Description |
|------------------|----------------------------------------------|
| **ToolCall** | Represents a tool invocation with `name` and `args`. |
| **ExpectedOutput** | Expected response, keywords, and tool calls. |
| **TestCase** | Defines a single test case with intent and expected output. |
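For reference, a minimal test file that conforms to this schema looks like the following (all names and values are hypothetical placeholders):
```json
{
  "name": "my-api-suite",
  "test_cases": [
    {
      "name": "test_example",
      "description": "checks one hypothetical endpoint",
      "intent": "ask the agent to perform a task",
      "expected_output": {
        "response": "the response text you expect",
        "keywords": ["keyword1", "keyword2"],
        "tool_calls": [
          { "name": "some_tool_name", "args": { "param": "value" } }
        ]
      }
    }
  ]
}
```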
---
## **Output Format**
The evaluation generates **two files**:
- `results.json`
- `results.csv`
### **JSON Structure**
```json
{
  "summary": {
    "total_tests": "...",
    "avg_keyword_score": "...",
    "avg_tool_call_score": "...",
    "avg_response_score": "..."
  },
  "results": [
    {
      "index": "...",
      "test_name": "...",
      "score": {
        "keyword_score": "...",
        "tool_call_score": "...",
        "response_score": "...",
        "response_scoring_type": "..."
      },
      "details": {
        "missing_keywords": "...",
        "expected_keywords": "...",
        "expected_tool_calls": "...",
        "tool_call_mismatches": "...",
        "response_expected": "...",
        "response_actual": "...",
        "response_scoring_type": "..."
      }
    }
  ]
}
```
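Once a run completes, you can inspect the report programmatically. Below is a minimal sketch that assumes `results.json` sits in the working directory and that `missing_keywords` is a list (field names are taken from the structure above):
```python
import json

# Load the evaluation report produced by `cuga evaluate`.
with open("results.json") as f:
    report = json.load(f)

# Print the aggregate scores from the summary block.
summary = report["summary"]
print(f"Total tests:         {summary['total_tests']}")
print(f"Avg keyword score:   {summary['avg_keyword_score']}")
print(f"Avg tool-call score: {summary['avg_tool_call_score']}")
print(f"Avg response score:  {summary['avg_response_score']}")

# Flag tests with missing keywords for a closer look
# (assumes `missing_keywords` is a list; adjust if it differs).
for result in report["results"]:
    missing = result["details"]["missing_keywords"]
    if missing:
        print(f"{result['test_name']}: missing keywords {missing}")
```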
## **Langfuse Tracing (Optional)**
### **Set up Langfuse**
In a separate folder (not under the CUGA repository), run:
```bash
# Get a copy of the latest Langfuse repository
git clone https://github.com/langfuse/langfuse.git
cd langfuse
# Run the langfuse docker compose
docker compose up
```
### **Get API Keys**
1. **Access the Langfuse UI:** Open a web browser and navigate to the URL where your self-hosted Langfuse instance is running (e.g., http://localhost:3000 if running locally with default ports).
2. **Log in:** Sign in with the account you created during the initial setup, or create a new account.
3. **Navigate to project settings:** Click the "Project" menu (usually in the sidebar or top navigation) and select "Settings".
4. **View API keys:** In the settings area you will find a section for API keys, where you can view or regenerate your `LANGFUSE_PUBLIC_KEY` (username) and `LANGFUSE_SECRET_KEY` (password). The secret key is hidden by default; you may need to click an eye icon to reveal and copy it.
5. **Add the API keys and host to your `.env` file:**
```dotenv
LANGFUSE_SECRET_KEY="your-secret-key"
LANGFUSE_PUBLIC_KEY="your-public-key"
LANGFUSE_HOST="http://localhost:3000"
```
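To verify the keys before enabling tracing, you can run a quick connectivity check with the Langfuse Python SDK. This is a minimal sketch, assuming the `langfuse` and `python-dotenv` packages are installed:
```python
from dotenv import load_dotenv  # pip install python-dotenv
from langfuse import Langfuse   # pip install langfuse

# Load LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_HOST
# from the .env file created above.
load_dotenv()

# The Langfuse client picks up the LANGFUSE_* variables from the
# environment by default.
client = Langfuse()

# auth_check() pings the server and verifies the credentials.
print("Langfuse reachable:", client.auth_check())
```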
### **Update Settings**
Then, in `vendor/cuga-agent/src/cuga/settings.toml`, set:
```toml
langfuse_tracing = true
```
---
## **Quick Start Example**
Run the evaluation against the default `digital_sales` API using the example test case below.
This is the example input JSON:
```json
{
  "name": "digital-sales",
  "test_cases": [
    {
      "name": "test_get_top_account",
      "description": "gets the top account by revenue",
      "intent": "get my top account by revenue",
      "expected_output": {
        "response": "**Top Account by Revenue** - **Name:** Andromeda Inc. - **Revenue:** $9,700,000 - **Account ID:** acc_49",
        "keywords": ["Andromeda Inc.", "9,700,000"],
        "tool_calls": [
          {
            "name": "digital_sales_get_my_accounts_my_accounts_get",
            "args": {}
          }
        ]
      }
    }
  ]
}
```
First, set `tracker_enabled = true` in `settings.toml`.
Now you can run the example:
1. **Update API URL** in [mcp_servers.yaml](src/cuga/backend/tools_env/registry/config/mcp_servers.yaml):
```yaml
url: http://localhost:8000/openapi.json
```
2. **Start the API server**:
```bash
uv run digital_sales_openapi
```
3. **Run evaluation**:
```bash
cuga evaluate docs/examples/evaluation/input_example.json
```
You’ll get `results.json` and `results.csv` in the project root.
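For the test above, a single entry in `results.json` would be shaped like the following (all scores and values here are hypothetical, for illustration only):
```json
{
  "index": 0,
  "test_name": "test_get_top_account",
  "score": {
    "keyword_score": 1.0,
    "tool_call_score": 1.0,
    "response_score": 0.95,
    "response_scoring_type": "similarity"
  },
  "details": {
    "missing_keywords": [],
    "expected_keywords": ["Andromeda Inc.", "9,700,000"],
    "expected_tool_calls": [
      { "name": "digital_sales_get_my_accounts_my_accounts_get", "args": {} }
    ],
    "tool_call_mismatches": [],
    "response_expected": "**Top Account by Revenue** - ...",
    "response_actual": "**Top Account by Revenue** - ...",
    "response_scoring_type": "similarity"
  }
}
```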
---
## **Usage**
```bash
cuga evaluate -t <test file path> -r <results file path>
```
Steps:
1. Update [mcp_servers.yaml](src/cuga/backend/tools_env/registry/config/mcp_servers.yaml) with your APIs, or create a new YAML file and point to it:
```shell
export MCP_SERVERS_FILE=<location>
```
2. Create a test file following the schema above.
3. Run the evaluation command above (see the end-to-end recap below).
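Putting the steps together, an end-to-end run looks like this (the file paths are illustrative; adjust them to your setup):
```bash
# Point the registry at your servers file (skip to use the default).
export MCP_SERVERS_FILE=./my_mcp_servers.yaml

# Start the API under test in one terminal.
uv run digital_sales_openapi

# Run the evaluation against your test file in another terminal.
cuga evaluate -t ./my_tests.json -r ./results
```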
---