| # Summary of Fixes for Accuracy Drop Issues | |
| ## Issues Identified | |
| ### 1. **Accuracy Drops at End** | |
| **Root Causes:** | |
| 1. **Evaluation uses NEW tasks each iteration** → Variance and inconsistency | |
| - Lines 171-175: Generate new tasks on-the-fly for `general_accuracy` | |
| - Different tasks each time = different difficulty/variance | |
| 2. **Forgetting rate too aggressive for 500 iterations** | |
| - Forgetting rate = 0.05 | |
| - After 500 time units: retention = exp(-0.05 * 500) ≈ 0.0 | |
| - **All skills completely forgotten by iteration 500!** | |
| 3. **Evaluation timing**: Evaluation happens after time advance, but we log before - this is actually OK | |
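The retention arithmetic in root cause 2 can be checked with a minimal sketch. It assumes a pure exponential forgetting curve, as the formula above suggests, and that the elapsed time is measured since a skill was last practiced; `retention` is a hypothetical helper, not a function from the codebase.

```python
import math

def retention(forgetting_rate: float, elapsed: float) -> float:
    """Fraction of a skill retained after `elapsed` time units without practice.

    Assumes simple exponential decay: exp(-rate * elapsed).
    """
    return math.exp(-forgetting_rate * elapsed)

# Compare the original and the reduced forgetting rate at a few horizons.
for rate in (0.05, 0.01):
    for elapsed in (10, 100, 500):
        value = retention(rate, elapsed)
        print(f"rate={rate:<5} elapsed={elapsed:<4} retention={value:.2e}")
```

Under the same assumption, a rate of 0.01 still leaves exp(-5), under 1%, of an unpracticed skill after 500 time units, so the slower rate delays complete forgetting rather than preventing it.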
| **Fix:** | |
| - ✅ Use **FIXED eval sets** generated once at start | |
| - ✅ Reduce forgetting rate from 0.05 to 0.01 (5x slower forgetting) | |
| - ✅ Evaluation happens BEFORE time advance (accurate snapshot) | |
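The first and third fixes can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: `generate_task`, `build_fixed_eval_set`, and `ToyStudent` are hypothetical stand-ins, and only `evaluate(eval_tasks)` mirrors a call named in this summary.

```python
import random
from typing import List

def generate_task(rng: random.Random) -> dict:
    """Hypothetical stand-in for the task generator."""
    return {"topic": rng.choice(["algebra", "geometry", "logic"]),
            "difficulty": rng.randint(1, 3)}

def build_fixed_eval_set(n_tasks: int, seed: int = 0) -> List[dict]:
    """Generate the eval set once from a dedicated seeded RNG so it never changes."""
    rng = random.Random(seed)
    return [generate_task(rng) for _ in range(n_tasks)]

class ToyStudent:
    """Hypothetical student that only records the order of calls."""
    def __init__(self) -> None:
        self.clock = 0.0
        self.events: List[str] = []

    def learn(self, task: dict) -> None:
        self.events.append("learn")

    def evaluate(self, eval_tasks: List[dict]) -> float:
        self.events.append("evaluate")
        return 0.0  # placeholder accuracy

    def advance_time(self, dt: float) -> None:
        self.clock += dt
        self.events.append("advance")

EVAL_TASKS = build_fixed_eval_set(15, seed=42)  # built once, reused every iteration
student = ToyStudent()
train_rng = random.Random(1)  # separate RNG so training never perturbs the eval set

for iteration in range(3):
    student.learn(generate_task(train_rng))
    accuracy = student.evaluate(EVAL_TASKS)  # snapshot BEFORE time advances
    student.advance_time(1.0)
```

Using a separate seeded `random.Random` instance for the eval set keeps it identical across runs and across strategies, regardless of how many training tasks are drawn.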
| ### 2. **Accuracy Calculation Method** | |
| **Current Method:** | |
| - Uses `student.evaluate(eval_tasks)` which samples answers stochastically | |
| - Accounts for forgetting correctly | |
| - BUT: Uses different tasks each time | |
| **Problems:** | |
| - Stochastic variance (random sampling) | |
| - Inconsistent eval sets (regenerated each time) | |
| - Small eval sets (10-15 tasks) = high variance | |
| **Better Method:** | |
| - ✅ **FIXED eval sets** generated once | |
| - ✅ Same tasks used throughout = consistent measurement | |
| - ✅ Larger eval sets (15+ tasks) for stability | |
| **Alternative (for future):** | |
| - Use expected accuracy = mean(prob_correct) instead of sampling | |
| - Removes stochastic variance | |
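The difference between sampled and expected accuracy can be illustrated with a short sketch. It assumes the student can expose a per-task probability of answering correctly; the `probs` values below are made up for illustration, and both function names are hypothetical.

```python
import random
from statistics import mean, pstdev
from typing import List

def sampled_accuracy(prob_correct: List[float], rng: random.Random) -> float:
    """Draw one stochastic answer per task, then average (noisy)."""
    return mean(1.0 if rng.random() < p else 0.0 for p in prob_correct)

def expected_accuracy(prob_correct: List[float]) -> float:
    """Average the probabilities directly (deterministic)."""
    return mean(prob_correct)

# Illustrative per-task success probabilities for a 15-task eval set.
probs = [0.9, 0.8, 0.75, 0.6, 0.5, 0.4, 0.7, 0.85,
         0.65, 0.55, 0.3, 0.95, 0.45, 0.7, 0.6]

rng = random.Random(0)
samples = [sampled_accuracy(probs, rng) for _ in range(1000)]

print("expected accuracy:    ", expected_accuracy(probs))
print("mean of sampled runs: ", mean(samples))
print("spread of sampled runs:", pstdev(samples))
```

The sampled estimate averages out to the expected value over many runs, but any single evaluation carries the spread printed on the last line, which the expected-accuracy variant removes.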
| ### 3. **Mock vs Real Components** | |
| **Current Mock Components:** | |
| - ✅ Mock Student: Captures learning + forgetting well | |
| - ✅ Mock Task Generator: Simple but functional | |
| - ❌ Simplified learning model | |
| - ❌ Limited task diversity | |
| **Real Components (MentorFlow):** | |
| - Real Student: Full PPO with neural network | |
| - Real Task Generator: Procedural generation, 15 families | |
| **Will Real Components Be Better?** **YES:** | |
| 1. **Real PPO Student:** | |
| - Can learn complex patterns | |
| - Better generalization | |
| - More realistic learning curves | |
| - But: Slower to train | |
| 2. **Real Task Generator:** | |
| - More diverse tasks | |
| - Procedural generation = infinite variety | |
| - Tests generalization better | |
| 3. **Teacher Agent Algorithm:** | |
| - UCB algorithm will work the same | |
| - Should perform even better with real components | |
| - More realistic reward signals | |
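For reference, a UCB-style selection rule can be sketched as below. The summary only says the teacher uses "UCB"; the specific UCB1 variant, the exploration constant `c`, and the idea that each arm is a candidate task type for the teacher to pick are assumptions made for illustration.

```python
import math
from typing import List

def ucb1_select(counts: List[int], total_rewards: List[float],
                c: float = math.sqrt(2)) -> int:
    """Pick the arm with the highest upper confidence bound (UCB1).

    counts[a] is how often arm `a` was chosen; total_rewards[a] is its summed reward.
    """
    # Try every arm once before comparing confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [
        total_rewards[a] / counts[a] + c * math.sqrt(math.log(total) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# A rarely tried arm gets a large exploration bonus and is selected here.
print(ucb1_select([100, 1], [60.0, 0.5]))
```

Because the rule depends only on observed rewards and selection counts, it does not need to change when the mock student is swapped for a real one, which is consistent with the claim that the algorithm works the same.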
| **Expected Improvement:** | |
| - Teacher should learn better curriculum | |
| - Student should achieve higher accuracy | |
| - More realistic forgetting patterns (if implemented) | |
| ## Applied Fixes | |
| ✅ **Fixed evaluation to use FIXED eval sets** | |
| ✅ **Reduced forgetting rate from 0.05 → 0.01** | |
| ✅ **Evaluation happens BEFORE time advance** | |
| ✅ **All strategies use consistent eval sets** | |
| ## Remaining Considerations | |
| 1. **Forgetting Model**: Could use more sophisticated model (spaced repetition optimization) | |
| 2. **Evaluation Method**: Could use expected accuracy instead of sampling | |
| 3. **Eval Set Size**: Could increase for more stability (currently 15 tasks, could be 50-100) | |
| 4. **Time Reset**: Could periodically reset time to prevent complete forgetting in long training | |
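The effect of eval set size (consideration 3) can be estimated with the binomial standard error. This sketch assumes every task is answered independently with the same success probability `p`, which is a simplification; `accuracy_std_error` is a hypothetical helper.

```python
import math

def accuracy_std_error(p: float, n_tasks: int) -> float:
    """Standard error of a sampled accuracy over n_tasks independent tasks."""
    return math.sqrt(p * (1.0 - p) / n_tasks)

# Worst case is p = 0.5, where the sampling noise is largest.
for n_tasks in (15, 50, 100):
    print(f"n={n_tasks:<4} std_error={accuracy_std_error(0.5, n_tasks):.3f}")
```

At p = 0.5 the standard error is about 0.13 with 15 tasks and 0.05 with 100 tasks, so swings of several accuracy points between iterations are expected from sampling alone at the current size.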