Summary of Fixes for Accuracy Drop Issues
Issues Identified
1. Accuracy Drops at the End of Training
Root Causes:
Evaluation uses NEW tasks each iteration → variance and inconsistency
- Lines 171-175: generates new tasks on-the-fly for general_accuracy
- Different tasks each time = different difficulty/variance
Forgetting rate too aggressive for 500 iterations
- Forgetting rate = 0.05
- After 500 time units: retention = exp(-0.05 * 500) ≈ 0
- All skills completely forgotten by iteration 500!
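The arithmetic behind the "completely forgotten" claim is easy to check. A minimal sketch of the exponential forgetting model described above (the function name is illustrative; the rate constants are the ones stated in this document):

```python
import math

def retention(forgetting_rate, elapsed):
    """Exponential forgetting: fraction of a skill retained after
    `elapsed` time units with no rehearsal in between."""
    return math.exp(-forgetting_rate * elapsed)

print(retention(0.05, 500))  # ~1.4e-11: effectively zero by iteration 500
print(retention(0.01, 500))  # ~0.0067: still small, but rehearsal resets the clock
```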
Evaluation timing: evaluation happened after the time advance while logging happened before it - on its own this ordering is acceptable
Fix:
- ✅ Use FIXED eval sets generated once at start
- ✅ Reduce forgetting rate from 0.05 to 0.01 (5x slower forgetting)
- ✅ Evaluation happens BEFORE time advance (accurate snapshot)
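The fixed-eval-set change can be sketched as follows (illustrative code, not the project's actual evaluation loop; the task format and function names are assumptions):

```python
import random

def make_eval_set(rng, n_tasks=15):
    # Stand-in for the real task generator; each "task" is just a difficulty here.
    return [{"difficulty": rng.random()} for _ in range(n_tasks)]

# Old behaviour: a fresh set every iteration mixes real skill change with task variance.
# Fixed behaviour: generate ONE seeded set up front and reuse it at every evaluation.
fixed_eval_set = make_eval_set(random.Random(42))

def evaluate(prob_correct, tasks):
    # Average success over the SAME tasks each time -> comparable across iterations.
    return sum(prob_correct(t) for t in tasks) / len(tasks)
```

With a fixed seed, re-running the generator reproduces the identical set, so accuracy curves from different strategies are directly comparable.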
2. Accuracy Calculation Method
Current Method:
- Uses student.evaluate(eval_tasks), which samples answers stochastically
- Accounts for forgetting correctly
- BUT: uses different tasks each time
Problems:
- Stochastic variance (random sampling)
- Inconsistent eval sets (regenerated each time)
- Small eval sets (10-15 tasks) = high variance
Better Method:
- ✅ FIXED eval sets generated once
- ✅ Same tasks used throughout = consistent measurement
- ✅ Larger eval sets (15+ tasks) for stability
Alternative (for future):
- Use expected accuracy = mean(prob_correct) instead of sampling
- Removes stochastic variance
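The expected-accuracy alternative is simple to sketch (illustrative code, not the project's actual `evaluate` method):

```python
import random

def sampled_accuracy(probs, rng):
    # Current method: draw one Bernoulli outcome per task, then average -> noisy.
    return sum(rng.random() < p for p in probs) / len(probs)

def expected_accuracy(probs):
    # Alternative: average the per-task success probabilities -> no sampling variance.
    return sum(probs) / len(probs)

probs = [0.9, 0.6, 0.3]
print(expected_accuracy(probs))           # ~0.6, identical on every call
print(sampled_accuracy(probs, random.Random(0)))  # varies with the RNG
```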
3. Mock vs Real Components
Current Mock Components:
- ✅ Mock Student: captures learning + forgetting well
- ✅ Mock Task Generator: simple but functional
- ⚠️ Simplified learning model
- ⚠️ Limited task diversity
Real Components (MentorFlow):
- Real Student: Full PPO with neural network
- Real Task Generator: Procedural generation, 15 families
Will Real Components Be Better? YES:
Real PPO Student:
- Can learn complex patterns
- Better generalization
- More realistic learning curves
- But: Slower to train
Real Task Generator:
- More diverse tasks
- Procedural generation = infinite variety
- Better tests generalization
Teacher Agent Algorithm:
- UCB algorithm will work the same
- Should perform even better with real components
- More realistic reward signals
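The claim that the UCB algorithm works the same with either backend can be seen from a generic UCB1 selection rule (a standard textbook sketch; the arm statistics, reward definition, and exploration constant `c` are illustrative, not MentorFlow's actual code):

```python
import math

def ucb_select(counts, mean_rewards, t, c=1.0):
    """Pick the next task family: highest mean reward (e.g. learning progress)
    plus an exploration bonus that shrinks as a family is used more often."""
    # Try every family at least once before trusting the statistics.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + c * math.sqrt(math.log(t) / n)
              for m, n in zip(mean_rewards, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The rule is agnostic to where the rewards come from: swapping the mock student for the real PPO student only changes the reward signal, not the selection logic.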
Expected Improvement:
- Teacher should learn better curriculum
- Student should achieve higher accuracy
- More realistic forgetting patterns (if implemented)
Applied Fixes
- ✅ Fixed evaluation to use FIXED eval sets
- ✅ Reduced forgetting rate from 0.05 → 0.01
- ✅ Evaluation happens BEFORE time advance
- ✅ All strategies use consistent eval sets
Remaining Considerations
- Forgetting Model: Could use more sophisticated model (spaced repetition optimization)
- Evaluation Method: Could use expected accuracy instead of sampling
- Eval Set Size: Could increase for more stability (currently 15 tasks, could be 50-100)
- Time Reset: Could periodically reset time to prevent complete forgetting in long training
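One way to realize the "Time Reset" idea without resetting global time is a per-skill forgetting clock, where practicing a skill restarts its own decay. A hypothetical sketch (class and method names are illustrative, not existing project code):

```python
import math

class SkillMemory:
    """Per-skill forgetting clock: retention decays from the skill's own
    last-practice time, so rehearsal prevents complete forgetting."""

    def __init__(self, forgetting_rate=0.01):
        self.rate = forgetting_rate
        self.last_practiced = {}

    def practice(self, skill, now):
        # Practicing resets this skill's clock; other skills keep decaying.
        self.last_practiced[skill] = now

    def retention(self, skill, now):
        if skill not in self.last_practiced:
            return 0.0  # never practiced
        return math.exp(-self.rate * (now - self.last_practiced[skill]))
```

This naturally combines with spaced-repetition scheduling: the teacher can re-issue a task family whenever its retention estimate drops below a threshold.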