MentorFlow / teacher_agent_dev /FIXES_SUMMARY.md
Cornelius
Deploy MentorFlow with GPU support
a52f96d
# Summary of Fixes for Accuracy Drop Issues
## Issues Identified
### 1. **Accuracy Drops at End** ❌
**Root Causes:**
1. **Evaluation uses NEW tasks each iteration** β†’ Variance and inconsistency
- Line 171-175: Generates new tasks on-the-fly for `general_accuracy`
- Different tasks each time = different difficulty/variance
2. **Forgetting rate too aggressive for 500 iterations**
- Forgetting rate = 0.05
- After 500 time units: retention = exp(-0.05 * 500) β‰ˆ 0.0
- **All skills completely forgotten by iteration 500!**
3. **Evaluation timing**: Evaluation happens after time advance, but we log before - this is actually OK
**Fix:**
- βœ… Use **FIXED eval sets** generated once at start
- βœ… Reduce forgetting rate from 0.05 to 0.01 (5x slower forgetting)
- βœ… Evaluation happens BEFORE time advance (accurate snapshot)
### 2. **Accuracy Calculation Method**
**Current Method:**
- Uses `student.evaluate(eval_tasks)` which samples answers stochastically
- Accounts for forgetting correctly
- BUT: Uses different tasks each time
**Problems:**
- Stochastic variance (random sampling)
- Inconsistent eval sets (regenerated each time)
- Small eval sets (10-15 tasks) = high variance
**Better Method:**
- βœ… **FIXED eval sets** generated once
- βœ… Same tasks used throughout = consistent measurement
- βœ… Larger eval sets (15+ tasks) for stability
**Alternative (for future):**
- Use expected accuracy = mean(prob_correct) instead of sampling
- Removes stochastic variance
### 3. **Mock vs Real Components**
**Current Mock Components:**
- βœ… Mock Student: Captures learning + forgetting well
- βœ… Mock Task Generator: Simple but functional
- ❌ Simplified learning model
- ❌ Limited task diversity
**Real Components (MentorFlow):**
- Real Student: Full PPO with neural network
- Real Task Generator: Procedural generation, 15 families
**Will Real Components Be Better?** **YES:**
1. **Real PPO Student:**
- Can learn complex patterns
- Better generalization
- More realistic learning curves
- But: Slower to train
2. **Real Task Generator:**
- More diverse tasks
- Procedural generation = infinite variety
- Better tests generalization
3. **Teacher Agent Algorithm:**
- UCB algorithm will work the same
- Should perform even better with real components
- More realistic reward signals
**Expected Improvement:**
- Teacher should learn better curriculum
- Student should achieve higher accuracy
- More realistic forgetting patterns (if implemented)
## Applied Fixes
βœ… **Fixed evaluation to use FIXED eval sets**
βœ… **Reduced forgetting rate from 0.05 β†’ 0.01**
βœ… **Evaluation happens BEFORE time advance**
βœ… **All strategies use consistent eval sets**
## Remaining Considerations
1. **Forgetting Model**: Could use more sophisticated model (spaced repetition optimization)
2. **Evaluation Method**: Could use expected accuracy instead of sampling
3. **Eval Set Size**: Could increase for more stability (currently 15 tasks, could be 50-100)
4. **Time Reset**: Could periodically reset time to prevent complete forgetting in long training