Teacher Agent System - Summary
✅ System Status: WORKING AND LEARNING
Files Overview
All files in teacher_agent_dev/ are relevant and necessary:
- interfaces.py - Core data structures (Task, StudentState, TeacherAction) and ABC interfaces (sketched after this list)
- mock_student.py - Student agent with learning + forgetting
- mock_task_generator.py - Task generator (5 topics × 3 difficulties)
- teacher_agent.py - ✅ MAIN: UCB bandit RL algorithm
- train_teacher.py - Training loop with baselines
- test_teacher.py - Unit tests (all passing)
- visualize.py - Plotting utilities
- verify_teacher_learning.py - RL verification script
- requirements.txt - Dependencies
- README.md - Documentation
- RL_VERIFICATION.md - RL proof document
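
For orientation, here is a rough sketch of the kind of structures interfaces.py defines. The field names below are illustrative assumptions, not copied from the file:

```python
from dataclasses import dataclass, field

# Hypothetical shapes for the core types named above; field names are
# assumptions, not the actual definitions in interfaces.py.

@dataclass
class Task:
    topic: str          # e.g. "current_events"
    difficulty: str     # "easy" | "medium" | "hard"
    is_review: bool     # whether the task revisits old material

@dataclass
class StudentState:
    skills: dict = field(default_factory=dict)  # per-topic skill in [0, 1]

@dataclass
class TeacherAction:
    topic: str
    difficulty: str
    review: bool
```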
✅ Teacher Agent is Using RL
Algorithm: Upper Confidence Bound (UCB) Multi-Armed Bandit
How it learns:
- Selects an action using the UCB score (implemented in the sketch after this list):
  UCB(a) = estimated_reward(a) + exploration_bonus × sqrt(log(total_pulls) / pulls(a))
- Receives a reward based on student improvement
- Updates its policy: running average reward for each action
- Next selection uses the updated estimates (exploits good actions)
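
A minimal sketch of that selection rule and running-average update. Class and parameter names here are illustrative assumptions, not the names used in teacher_agent.py:

```python
import math
import random

class UCBBandit:
    """Non-contextual UCB bandit over a fixed, discrete action set."""

    def __init__(self, actions, exploration_bonus=1.0):
        self.actions = list(actions)
        self.c = exploration_bonus
        self.pulls = {a: 0 for a in self.actions}    # times each action was chosen
        self.value = {a: 0.0 for a in self.actions}  # running average reward per action

    def select(self):
        # Pull each action at least once before trusting the UCB score.
        untried = [a for a in self.actions if self.pulls[a] == 0]
        if untried:
            return random.choice(untried)
        total = sum(self.pulls.values())
        # UCB(a) = estimated_reward(a) + c * sqrt(log(total_pulls) / pulls(a))
        return max(
            self.actions,
            key=lambda a: self.value[a] + self.c * math.sqrt(math.log(total) / self.pulls[a]),
        )

    def update(self, action, reward):
        # Incremental running average of observed rewards for this action.
        self.pulls[action] += 1
        self.value[action] += (reward - self.value[action]) / self.pulls[action]
```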
Verification Results (from verify_teacher_learning.py):
- ✅ Rewards improve: 1.682 → 2.115 (+0.433)
- ✅ Explores all 30 actions
- ✅ Exploits high-reward actions (prefers current_events-hard-R)
- ✅ Student improves: 0.527 → 0.862 accuracy
Key Features
Teacher Agent:
- Uses UCB bandit (classic RL algorithm)
- 30 actions: 5 topics × 3 difficulties × 2 options (enumerated in the sketch below)
- Learns from rewards (policy updates)
- Balances exploration/exploitation
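
Because the action space factors cleanly, it can be enumerated directly. The topic names other than current_events and the meaning of the two options (assumed here to be new task vs. review, based on the `-R` suffix in the verification output) are assumptions:

```python
from itertools import product

TOPICS = ["math", "science", "history", "geography", "current_events"]  # assumed names
DIFFICULTIES = ["easy", "medium", "hard"]
OPTIONS = ["new", "review"]  # assumed meaning of the "2 options"

# Build labels like "current_events-hard-R"
ACTIONS = [f"{t}-{d}-{o[0].upper()}" for t, d, o in product(TOPICS, DIFFICULTIES, OPTIONS)]
assert len(ACTIONS) == 30  # 5 x 3 x 2
```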
Student Agent:
- Learns with practice (learning_rate)
- Forgets over time (Ebbinghaus curve; see the sketch after this list)
- Per-topic skill tracking
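
A hedged sketch of the learning-plus-forgetting dynamic. The parameter names and the exponential decay form are assumptions about how mock_student.py might model it:

```python
import math

def practice(skill: float, learning_rate: float = 0.1) -> float:
    """Move per-topic skill toward 1.0 after a practice attempt (assumed update rule)."""
    return skill + learning_rate * (1.0 - skill)

def forget(skill: float, steps_since_practice: int, decay: float = 0.05) -> float:
    """Ebbinghaus-style exponential decay of skill between practice sessions."""
    return skill * math.exp(-decay * steps_since_practice)
```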
Reward Function:
- Base: student improvement
- Bonus: harder tasks (+2.0), successful reviews (+1.0)
- Penalty: wasted reviews (-0.5)
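
Putting those terms together, the reward shaping might look roughly like this sketch (the exact form of the improvement term and the review-success check are assumptions):

```python
def compute_reward(improvement: float, task_difficulty: str,
                   is_review: bool, review_was_needed: bool) -> float:
    """Assumed composition of the reward terms listed above."""
    reward = improvement                      # base: how much the student improved
    if task_difficulty == "hard":
        reward += 2.0                         # bonus for assigning harder tasks
    if is_review:
        reward += 1.0 if review_was_needed else -0.5  # successful vs. wasted review
    return reward
```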
Note on Student State
The teacher currently uses a non-contextual bandit (it does not use the student_state parameter). This is still valid RL (UCB for a multi-armed bandit), but it could be made contextual by conditioning action selection on the student state, as sketched below.
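
One low-effort way to do that would be to keep a separate UCB estimate per discretized student state. The bucketing scheme below is an assumption, and UCBBandit refers to the sketch earlier in this document:

```python
from collections import defaultdict

def bucket(student_state) -> str:
    """Discretize average student skill into a coarse context key (assumed scheme)."""
    avg = sum(student_state.skills.values()) / max(len(student_state.skills), 1)
    return "low" if avg < 0.4 else "mid" if avg < 0.7 else "high"

class ContextualUCB:
    """Maintain one UCBBandit per context bucket (sketch, not the shipped code)."""

    def __init__(self, actions):
        self.bandits = defaultdict(lambda: UCBBandit(actions))

    def select(self, student_state):
        return self.bandits[bucket(student_state)].select()

    def update(self, student_state, action, reward):
        self.bandits[bucket(student_state)].update(action, reward)
```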
Quick Start
```bash
cd teacher_agent_dev

# Run tests
python test_teacher.py

# Train teacher
python train_teacher.py

# Verify learning
python verify_teacher_learning.py
```
All Checks Passed ✅
- ✅ Teacher learns and improves (rewards increase)
- ✅ Teacher explores actions
- ✅ Teacher exploits good actions
- ✅ Student improves significantly
- ✅ All tests pass
- ✅ System is self-contained and functional