


Teacher Agent System - Summary

✅ System Status: WORKING AND LEARNING

Files Overview

All files in teacher_agent_dev/ are relevant and necessary:

interfaces.py - Core data structures (Task, StudentState, TeacherAction) and ABC interfaces
mock_student.py - Student agent with learning + forgetting
mock_task_generator.py - Task generator (5 topics × 3 difficulties)
teacher_agent.py - ⭐ MAIN: UCB bandit RL algorithm
train_teacher.py - Training loop with baselines
test_teacher.py - Unit tests (all passing)
visualize.py - Plotting utilities
verify_teacher_learning.py - RL verification script
requirements.txt - Dependencies
README.md - Documentation
RL_VERIFICATION.md - RL proof document

✅ Teacher Agent is Using RL

Algorithm: Upper Confidence Bound (UCB) Multi-Armed Bandit

How it learns:

Selects an action using UCB: UCB(a) = estimated_reward(a) + exploration_bonus × sqrt(log(total_pulls) / pulls(a))
Receives a reward based on student improvement
Updates its policy: keeps a running average of the reward for each action
Uses the updated estimates for the next selection, so high-reward actions are chosen more often (exploitation)
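
The four steps above can be sketched in a few lines of Python. This is a minimal illustration of a UCB bandit, not the code in teacher_agent.py: the class name `UCBBandit`, its method names, and the default `exploration_bonus` value are assumptions made for the example.

```python
import math
import random


class UCBBandit:
    """Minimal UCB bandit over a fixed number of actions (illustrative only)."""

    def __init__(self, n_actions, exploration_bonus=2.0):
        self.exploration_bonus = exploration_bonus  # assumed default, not from the project
        self.pulls = [0] * n_actions                # pulls(a)
        self.estimates = [0.0] * n_actions          # estimated_reward(a)
        self.total_pulls = 0

    def select_action(self):
        # Try every action once first, so pulls(a) is never zero in the formula.
        for action, count in enumerate(self.pulls):
            if count == 0:
                return action
        scores = [
            self.estimates[a]
            + self.exploration_bonus
            * math.sqrt(math.log(self.total_pulls) / self.pulls[a])
            for a in range(len(self.pulls))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, action, reward):
        # Incremental running average of the rewards observed for this action.
        self.pulls[action] += 1
        self.total_pulls += 1
        self.estimates[action] += (reward - self.estimates[action]) / self.pulls[action]


if __name__ == "__main__":
    rng = random.Random(0)
    true_means = [0.2, 0.5, 0.9]  # made-up reward means, not project data
    bandit = UCBBandit(n_actions=len(true_means), exploration_bonus=1.0)
    for _ in range(2000):
        a = bandit.select_action()
        bandit.update(a, rng.gauss(true_means[a], 0.1))
    print(bandit.pulls)  # pull counts per action
```

The exploration term shrinks as an action is pulled more often, so rarely tried actions keep getting sampled early on while well-estimated, high-reward actions dominate later.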

Verification Results (from verify_teacher_learning.py):

✅ Rewards improve: 1.682 → 2.115 (+0.433)
✅ Explores all 30 actions
✅ Exploits high-reward actions (prefers current_events-hard-R)
✅ Student improves: 0.527 → 0.862 accuracy

Key Features

Teacher Agent:

Uses UCB bandit (classic RL algorithm)
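30 actions: 5 topics × 3 difficulties × 2 options (the -R suffix in action names such as current_events-hard-R suggests the two options are review vs. non-review)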
Learns from rewards (policy updates)
Balances exploration/exploitation
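
To make the size of the action space concrete, the sketch below enumerates 5 × 3 × 2 combinations. Only `current_events` and `hard` appear in this summary; the other topic and difficulty names, the reading of the third dimension as a review flag, and the `-N` suffix are placeholders assumed for illustration.

```python
from itertools import product

# Only "current_events" and "hard" come from the summary; the rest are placeholders.
TOPICS = ["current_events", "topic_2", "topic_3", "topic_4", "topic_5"]
DIFFICULTIES = ["easy", "medium", "hard"]
IS_REVIEW = [False, True]  # assumed meaning of the "2 options"

# Every (topic, difficulty, is_review) triple is one bandit arm.
ACTIONS = list(product(TOPICS, DIFFICULTIES, IS_REVIEW))


def action_label(topic, difficulty, is_review):
    """Build a label in the style of 'current_events-hard-R' (the '-N' suffix is hypothetical)."""
    return f"{topic}-{difficulty}-{'R' if is_review else 'N'}"


if __name__ == "__main__":
    print(len(ACTIONS))
    print(action_label(*ACTIONS[5]))
```

A flat list like this lets a bandit refer to each action by its integer index while the label stays readable in logs.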

Student Agent:

Learns with practice (learning_rate)
Forgets over time (Ebbinghaus curve)
Per-topic skill tracking
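
One way to combine these three features is sketched below, assuming the exponential form of the Ebbinghaus curve, retention = exp(-elapsed / memory_strength). The class `ForgettingStudent`, its parameter values, and the update rule are hypothetical and may differ from mock_student.py.

```python
import math


class ForgettingStudent:
    """Toy student: per-topic skill grows with practice and decays with time."""

    def __init__(self, topics, learning_rate=0.15, memory_strength=20.0):
        self.learning_rate = learning_rate      # assumed value
        self.memory_strength = memory_strength  # assumed value; larger means slower forgetting
        self.skill = {t: 0.0 for t in topics}
        self.last_practiced = {t: 0.0 for t in topics}
        self.clock = 0.0

    def advance(self, dt=1.0):
        # Move simulated time forward; forgetting depends on elapsed time.
        self.clock += dt

    def effective_skill(self, topic):
        # Stored skill scaled by exponential retention since the last practice.
        elapsed = self.clock - self.last_practiced[topic]
        return self.skill[topic] * math.exp(-elapsed / self.memory_strength)

    def practice(self, topic):
        # Start from what is still remembered, then close part of the gap to 1.0.
        current = self.effective_skill(topic)
        self.skill[topic] = current + self.learning_rate * (1.0 - current)
        self.last_practiced[topic] = self.clock


if __name__ == "__main__":
    student = ForgettingStudent(["current_events"])
    student.practice("current_events")
    student.advance(10.0)
    print(student.effective_skill("current_events"))
```

Because skill is stored per topic, practising one topic leaves the others to decay, which is what makes review actions potentially worthwhile for the teacher.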

Reward Function:

Base: student improvement
Bonus: harder tasks (+2.0), successful reviews (+1.0)
Penalty: wasted reviews (-0.5)
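
The reward shaping above could look roughly like the following. The summary gives the bonus and penalty values but not the conditions, so this sketch assumes that the +2.0 bonus applies to `hard` tasks, and that a review counts as successful when a caller-supplied `review_helped` flag is true and as wasted otherwise. The function name and signature are hypothetical.

```python
def compute_reward(accuracy_before, accuracy_after, difficulty, is_review,
                   review_helped=False):
    """Illustrative reward: base improvement plus the bonuses and penalty listed above."""
    reward = accuracy_after - accuracy_before  # base: student improvement
    if difficulty == "hard":                   # assumed trigger for the harder-task bonus
        reward += 2.0
    if is_review:
        # Assumed rule: a review either helped (+1.0) or was wasted (-0.5).
        reward += 1.0 if review_helped else -0.5
    return reward


if __name__ == "__main__":
    print(compute_reward(0.50, 0.60, "hard", is_review=False))
    print(compute_reward(0.50, 0.50, "easy", is_review=True, review_helped=False))
```

Under these assumptions the bonus terms are much larger than a typical accuracy change, so they can dominate which actions the bandit ends up preferring.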

Note on Student State

The teacher currently uses a non-contextual bandit: it ignores the student_state parameter when selecting actions. This is still valid RL (UCB for a multi-armed bandit), but it could be made contextual by conditioning action selection on the student state.
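
As a rough idea of what a contextual variant might look like, the sketch below keeps a separate set of UCB statistics for each coarse bucket of student skill. This is one simple option among many and is not part of the current system; `state_bucket`, `BucketedUCB`, and the use of mean skill as the context are all assumptions for illustration.

```python
import math
from collections import defaultdict


def state_bucket(mean_skill, n_buckets=3):
    """Map a mean skill in [0, 1] to a coarse bucket index (hypothetical context)."""
    return min(int(mean_skill * n_buckets), n_buckets - 1)


class BucketedUCB:
    """One independent UCB table per context bucket (illustrative only)."""

    def __init__(self, n_actions, exploration_bonus=2.0):
        self.n_actions = n_actions
        self.exploration_bonus = exploration_bonus
        self.pulls = defaultdict(lambda: [0] * n_actions)
        self.estimates = defaultdict(lambda: [0.0] * n_actions)

    def select_action(self, bucket):
        pulls, estimates = self.pulls[bucket], self.estimates[bucket]
        # Try each action once per bucket before applying the UCB formula.
        for action, count in enumerate(pulls):
            if count == 0:
                return action
        total = sum(pulls)
        return max(
            range(self.n_actions),
            key=lambda a: estimates[a]
            + self.exploration_bonus * math.sqrt(math.log(total) / pulls[a]),
        )

    def update(self, bucket, action, reward):
        # Running average of rewards, tracked separately for each bucket.
        pulls, estimates = self.pulls[bucket], self.estimates[bucket]
        pulls[action] += 1
        estimates[action] += (reward - estimates[action]) / pulls[action]
```

The trade-off is sample efficiency: with 30 actions, every extra bucket adds another 30 estimates to learn, so a small number of buckets is the safer starting point.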

Quick Start

cd teacher_agent_dev

# Run tests
python test_teacher.py

# Train teacher
python train_teacher.py

# Verify learning
python verify_teacher_learning.py

All Checks Passed ✅

✅ Teacher learns and improves (rewards increase)
✅ Teacher explores actions
✅ Teacher exploits good actions
✅ Student improves significantly
✅ All tests pass
✅ System is self-contained and functional