# Teacher Agent Development System
A complete teacher agent system for developing and testing meta-RL curriculum learning algorithms independently.
## Overview
This system provides:
- Mock Student Agent: Realistic student with learning + forgetting (Ebbinghaus curve)
- Mock Task Generator: Simple task generator with multiple topics and difficulties
- Teacher Agent: UCB (Upper Confidence Bound) bandit algorithm for curriculum sequencing
- Training Loop: Complete training system with evaluation
- Visualization: Plotting utilities for analysis
## Installation

```bash
pip install -r requirements.txt
```

## Quick Start

### 1. Run Tests

```bash
python test_teacher.py
```
This verifies:
- Student learns with practice
- Student forgets over time
- Teacher explores actions
- Teacher exploits good actions
### 2. Train Teacher Agent

```bash
python train_teacher.py
```

Expected output:

```
TEACHER AGENT TRAINING
Iterations: 500
Evaluation tasks: 15
Action space: 30 actions

Iteration   0 | Student Acc: 0.267 | Avg Reward: 0.850 | Action: his-ea-N
Iteration  50 | Student Acc: 0.453 | Avg Reward: 1.120 | Action: sci-me-R
...
Iteration 500 | Student Acc: 0.812 | Avg Reward: 0.780 | Action: lit-ha-N
```
### 3. Generate Visualizations
```python
from train_teacher import train_teacher
from visualize import plot_learning_curves, plot_curriculum_heatmap, plot_action_distributions
# Train teacher
history, teacher, student = train_teacher(num_iterations=500)
# Generate plots
plot_learning_curves(history)
plot_curriculum_heatmap(history)
plot_action_distributions(teacher)
```

### 4. Compare with Baselines

```python
from train_teacher import train_teacher, train_baseline_random, train_baseline_fixed
from visualize import plot_comparison
# Train all strategies
history_teacher, _, _ = train_teacher(num_iterations=500, verbose=False)
history_random = train_baseline_random(num_iterations=500)
history_fixed = train_baseline_fixed(num_iterations=500)
# Compare
plot_comparison({
'teacher': history_teacher,
'random': history_random,
'fixed': history_fixed
})
```
## Architecture

### Components
- interfaces.py: Shared data structures (Task, StudentState, TeacherAction) and ABC interfaces
- mock_student.py: Student agent with learning (improves with practice) and forgetting (Ebbinghaus curve)
- mock_task_generator.py: Simple task generator with 5 topics × 3 difficulties
- teacher_agent.py: UCB bandit algorithm for selecting curriculum actions
- train_teacher.py: Main training loop connecting all components
- test_teacher.py: Unit tests for all components
- visualize.py: Plotting utilities for analysis
### Action Space

Teacher selects from 30 actions (enumerated in the sketch after this list):
- 5 topics: history, science, literature, geography, current_events
- 3 difficulties: easy, medium, hard
- 2 options: new material or review
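A minimal sketch of how this 30-action space could be enumerated. The abbreviated labels are an assumption inferred from the training log (e.g. `his-ea-N` for history/easy/new); the actual construction in `teacher_agent.py` may differ:

```python
from itertools import product

# Assumed topic/difficulty/mode lists; abbreviations mirror log labels like "his-ea-N"
TOPICS = ["history", "science", "literature", "geography", "current_events"]
DIFFICULTIES = ["easy", "medium", "hard"]
MODES = ["new", "review"]

ACTIONS = [
    f"{topic[:3]}-{difficulty[:2]}-{mode[0].upper()}"
    for topic, difficulty, mode in product(TOPICS, DIFFICULTIES, MODES)
]
assert len(ACTIONS) == 30  # 5 topics x 3 difficulties x 2 modes
```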
### Student Model

The mock student combines four formulas (a code sketch follows this list):

- Learning: Skill improves with practice:
  `new_skill = old_skill + learning_rate * difficulty_factor * (1 - old_skill)`
- Forgetting: Retention decays over time (Ebbinghaus curve):
  `retention = exp(-forgetting_rate * time_since_practice)`
- Effective skill:
  `effective_skill = base_skill * retention`
- Accuracy:
  `accuracy = 0.25 + 0.75 * effective_skill` (25% is random guessing on a 4-choice MCQ)
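As a concrete reading of these formulas, here is a minimal sketch; the function name, per-topic dicts, and default rates are illustrative assumptions, not the actual `MockStudentAgent` API in `mock_student.py`:

```python
import math

def practice(skills, last_practiced, topic, difficulty_factor, step,
             learning_rate=0.15, forgetting_rate=0.05):
    """Apply forgetting since the last practice, then update skill and return accuracy."""
    # Forgetting: retention decays exponentially with time since last practice
    elapsed = step - last_practiced.get(topic, step)
    retention = math.exp(-forgetting_rate * elapsed)
    effective_skill = skills.get(topic, 0.0) * retention

    # Learning: effective skill moves toward 1.0, scaled by task difficulty
    skills[topic] = effective_skill + learning_rate * difficulty_factor * (1 - effective_skill)
    last_practiced[topic] = step

    # Accuracy: 25% floor corresponds to random guessing on a 4-choice MCQ
    return 0.25 + 0.75 * skills[topic]

# Example: one easy practice step on "history" at iteration 0
skills, last_practiced = {}, {}
acc = practice(skills, last_practiced, "history", difficulty_factor=1.0, step=0)
```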
### Teacher Algorithm

UCB (Upper Confidence Bound), sketched in code below:

`UCB(a) = estimated_reward(a) + exploration_bonus * sqrt(log(total_pulls) / pulls(a))`

- Balances exploration (trying new actions) vs. exploitation (using known-good actions)
- `exploration_bonus` controls adventurousness (higher = more exploration)
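A minimal sketch of UCB selection under this formula; the dictionary-based statistics and function signature are illustrative assumptions, not necessarily how `teacher_agent.py` stores them:

```python
import math

def select_action(actions, pulls, reward_sums, exploration_bonus=2.0):
    """Return the action with the highest UCB score; untried actions are taken first."""
    total_pulls = sum(pulls.get(a, 0) for a in actions)
    best_action, best_score = None, float("-inf")
    for a in actions:
        if pulls.get(a, 0) == 0:
            return a  # cold start: every action is tried once before UCB kicks in
        estimated_reward = reward_sums[a] / pulls[a]
        bonus = exploration_bonus * math.sqrt(math.log(total_pulls) / pulls[a])
        if estimated_reward + bonus > best_score:
            best_action, best_score = a, estimated_reward + bonus
    return best_action
```

Because untried actions are returned immediately, the `log(total_pulls) / pulls(a)` term is only evaluated once every action has at least one pull, which is also why division-by-zero never occurs in practice.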
### Reward Function

`reward = improvement + difficulty_bonus + review_bonus + review_penalty`

where (see the sketch after this list):

- improvement = accuracy_after - accuracy_before
- difficulty_bonus = 0.5 (easy), 1.0 (medium), 2.0 (hard)
- review_bonus = 1.0 if the task was a review and improvement > 0
- review_penalty = -0.5 if the task was a review and accuracy > 0.9 (wasted review)
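A sketch of this reward, assuming `accuracy` in the penalty term refers to post-task accuracy; the helper name and argument layout are illustrative, not necessarily those used in `train_teacher.py`:

```python
def compute_reward(accuracy_before, accuracy_after, difficulty, is_review):
    """Combine accuracy improvement with difficulty and review bonuses/penalties."""
    improvement = accuracy_after - accuracy_before
    difficulty_bonus = {"easy": 0.5, "medium": 1.0, "hard": 2.0}[difficulty]
    review_bonus = 1.0 if is_review and improvement > 0 else 0.0
    review_penalty = -0.5 if is_review and accuracy_after > 0.9 else 0.0  # wasted review
    return improvement + difficulty_bonus + review_bonus + review_penalty
```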
## Expected Behavior

### Early Iterations (0-100)
- Teacher explores all topics/difficulties
- Tries mostly easy tasks (build foundation)
- High exploration, low exploitation
### Mid Iterations (100-300)
- Starts increasing difficulty
- Discovers which topics student struggles with
- Begins strategic reviewing
### Late Iterations (300-500)
- Mostly medium/hard tasks (student is skilled)
- Reviews topics just before forgetting threshold
- High exploitation of known-good curriculum
### Emergent Behaviors
- Teacher gives harder tasks as student improves
- Teacher reviews topics ~30-50 iterations after practice (optimal timing)
- Teacher specializes in topics student finds difficult
## Success Criteria
After training, you should see:
- ✅ Student reaches >70% accuracy by iteration 500
- ✅ Teacher discovers: easy tasks first → harder tasks later
- ✅ Teacher learns to review before forgetting
- ✅ Teacher reward stabilizes (not just random)
## File Structure

```
teacher_agent_dev/
├── interfaces.py           # Shared data structures and ABC interfaces
├── mock_student.py         # Mock student with learning + forgetting
├── mock_task_generator.py  # Simple task generator
├── teacher_agent.py        # MAIN: UCB bandit teacher algorithm
├── train_teacher.py        # Training loop
├── test_teacher.py         # Unit tests
├── visualize.py            # Plotting utilities
├── requirements.txt        # Dependencies
└── README.md               # This file
```
## Customization

### Adjust Student Learning

```python
student = MockStudentAgent(
    learning_rate=0.15,   # How fast the student learns (higher = faster)
    forgetting_rate=0.05  # How fast the student forgets (higher = faster)
)
```

### Adjust Teacher Exploration

```python
teacher = TeacherAgent(
    exploration_bonus=2.0  # Higher = more exploration, lower = more exploitation
)
```

### Add More Topics/Difficulties

Edit `mock_task_generator.py` to add more templates, or modify `teacher_agent.py` to adjust the action space.
## Troubleshooting

Issue: Student doesn't learn
- Solution: Increase `learning_rate` in `MockStudentAgent`

Issue: Teacher doesn't explore
- Solution: Increase `exploration_bonus` in `TeacherAgent`

Issue: Forgetting too fast/slow
- Solution: Adjust `forgetting_rate` in `MockStudentAgent`

Issue: Division by zero errors
- Solution: UCB handles the cold start automatically (untried actions are selected first)
## Next Steps
- Replace mock components: When teammates finish the real student and task generator, swap out the mock components
- Tune hyperparameters: Adjust `learning_rate`, `forgetting_rate`, `exploration_bonus`
- Experiment with algorithms: Try different bandit algorithms (Thompson Sampling, ε-greedy)
- Add features: More sophisticated reward functions, state representations, etc.
## License
MIT