Strategy Comparison: Teacher vs Baselines
Overview
This module compares three training strategies for the student agent:
- Random Strategy: the student receives randomly selected questions from the task generator, with no curriculum structure
- Progressive Strategy: the student receives questions in increasing difficulty order (Easy → Medium → Hard), working through each topic family sequentially
- Teacher Strategy: an RL teacher agent learns an optimal curriculum using a UCB bandit algorithm
Goal
Demonstrate that the Teacher-trained student performs best, achieving the highest accuracy on difficult questions.
Running the Comparison
```bash
cd teacher_agent_dev
python compare_strategies.py
```
This will:
- Train all three strategies for 500 iterations
- Track accuracy on general questions and difficult questions
- Generate comparison plots showing all three strategies
- Print summary statistics
Output
Plot: comparison_all_strategies.png
The plot contains three subplots:
- General Accuracy Over Time: Shows how student accuracy improves on medium-difficulty questions
- Difficult Question Accuracy (key metric): shows accuracy on hard questions, the most important plot for demonstrating teacher superiority
- Learning Efficiency: Bar chart showing iterations to reach 75% target vs final performance
Key Metrics Tracked
- General Accuracy: Student performance on medium-difficulty questions from all topics
- Difficult Accuracy: Student performance on hard-difficulty questions (target metric)
- Iterations to Target: How many iterations until student reaches 75% accuracy on difficult questions
- Final Accuracy: Final performance after 500 iterations
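The iterations-to-target metric above can be computed directly from a per-iteration accuracy history. A minimal sketch (the helper name is an illustrative assumption, not taken from compare_strategies.py):

```python
def iterations_to_target(history, target=0.75):
    """Return the first 1-based iteration at which accuracy on difficult
    questions reaches the target, or None if it is never reached.

    history: list of per-iteration accuracies on the difficult test set.
    """
    for i, acc in enumerate(history, start=1):
        if acc >= target:
            return i
    return None
```

The same history list feeds the Final Accuracy metric (its last element) and the accuracy-over-time plots.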
Expected Results
The Teacher strategy should show:
- ✓ Highest final accuracy on difficult questions
- ✓ Efficient learning (good balance of speed and performance)
- ✓ Better curriculum (smarter topic/difficulty selection)
Example Output
STRATEGY COMPARISON SUMMARY
======================================================================
Random      | ✓ Reached | Iterations: 51  | Final Acc: 0.760
Progressive | ✓ Reached | Iterations: 310 | Final Acc: 0.520
Teacher     | ✓ Reached | Iterations: 55  | Final Acc: 0.880
======================================================================
Teacher wins with highest final accuracy!
Strategy Details
Random Strategy
- Completely random selection of topics and difficulties
- No curriculum structure
- Baseline for comparison
- May reach target quickly due to luck, but doesn't optimize learning
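The random baseline amounts to a uniform draw over topics and difficulties. A hypothetical sketch (not the actual compare_strategies.py code):

```python
import random

def random_task(topics, difficulties, rng=random):
    """Uniformly random (topic, difficulty) pick: the no-curriculum baseline.

    rng is injectable so a seeded random.Random instance can make
    runs reproducible, as the seed parameter in the comparison script does.
    """
    return rng.choice(topics), rng.choice(difficulties)
```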
Progressive Strategy
- Rigid curriculum: Easy → Medium → Hard for each topic sequentially
- No adaptation to student needs
- Slow to reach difficult questions
- Doesn't account for forgetting or optimal pacing
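The rigid ordering described above can be sketched as a generator that exhausts every difficulty level of one topic before moving to the next. The function name is an assumption for illustration:

```python
def progressive_schedule(topics, difficulties=("easy", "medium", "hard")):
    """Yield (topic, difficulty) pairs topic by topic, easiest first.

    Because hard questions for later topics only appear near the end,
    this schedule is slow to reach difficult material, which is the
    weakness noted above.
    """
    for topic in topics:
        for difficulty in difficulties:
            yield (topic, difficulty)
```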
Teacher Strategy
- RL-based curriculum learning
- Uses UCB bandit to balance exploration/exploitation
- Adapts based on student improvement (reward signal)
- Optimizes for efficient learning
- Can strategically review topics to prevent forgetting
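The exploration/exploitation balance above is the standard UCB1 pattern: each (topic, difficulty) pair is a bandit arm, and the reward is some measure of student improvement. A minimal sketch under those assumptions (class and method names are illustrative, not from train_teacher.py):

```python
import math

class UCBTeacher:
    """UCB1 bandit over (topic, difficulty) arms.

    Arms with high mean reward (student improvement) are exploited;
    rarely-pulled arms get an exploration bonus, which is also what
    lets the teacher revisit old topics and counter forgetting.
    """

    def __init__(self, arms, c=1.4):
        self.arms = list(arms)
        self.c = c                              # exploration weight
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}  # running mean reward
        self.total = 0

    def select(self):
        # Play every arm once before applying the UCB formula.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        # UCB1 score: mean reward + c * sqrt(ln(total) / pulls).
        return max(
            self.arms,
            key=lambda a: self.values[a]
            + self.c * math.sqrt(math.log(self.total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # reward could be the student's accuracy gain on this arm.
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Each training iteration would call `select()` to pick the next task, train the student on it, and feed the resulting improvement back through `update()`.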
Visualization Features
- Color coding: Teacher in green (highlighted as best), Random in red, Progressive in teal
- Line styles: Teacher with solid thick line, baselines with dashed/dotted
- Annotations: Final accuracy values labeled on plots
- Target line: 75% accuracy threshold marked on difficult question plot
- Summary statistics: Table showing which strategies reached target and when
Customization
You can modify parameters in compare_strategies.py:
```python
num_iterations = 500    # Number of training iterations
target_accuracy = 0.75  # Target accuracy on difficult questions
seed = 42               # Random seed for reproducibility
```
Files
- compare_strategies.py - Main comparison script
- comparison_all_strategies.png - Generated comparison plot
- train_teacher.py - Teacher training logic
- mock_student.py - Student agent implementation
- mock_task_generator.py - Task generator
Notes
- All strategies use the same student parameters for fair comparison
- Evaluation uses held-out test sets
- Teacher strategy learns from rewards based on student improvement
- Results may vary slightly due to randomness, but teacher should consistently outperform baselines