# Summary of Fixes for Accuracy Drop Issues
## Issues Identified
### 1. **Accuracy Drops at End** ❌
**Root Causes:**
1. **Evaluation uses NEW tasks each iteration** → variance and inconsistency
- Lines 171-175: generates new tasks on-the-fly for `general_accuracy`
- Different tasks each time = different difficulty/variance
2. **Forgetting rate too aggressive for 500 iterations**
- Forgetting rate = 0.05
- After 500 time units: retention = exp(-0.05 * 500) ≈ 0.0 (see the retention sketch after this list)
- **All skills completely forgotten by iteration 500!**
3. **Evaluation timing**: evaluation happens after the time advance, but we log before it - this is actually OK
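To make the retention arithmetic concrete, here is a minimal sketch assuming the mock student's exponential-decay model `retention = exp(-forgetting_rate * elapsed_time)`; the function name and the 50-unit gap are illustrative, not taken from the code:

```python
import math

def retention(forgetting_rate: float, elapsed_time: float) -> float:
    """Fraction of a skill retained after `elapsed_time` units without practice."""
    return math.exp(-forgetting_rate * elapsed_time)

print(retention(0.05, 500))  # ~1.4e-11: effectively zero, hence the end-of-run drop
print(retention(0.05, 50))   # ~0.08: even a 50-unit gap erases most of a skill at the old rate
print(retention(0.01, 50))   # ~0.61: the reduced rate keeps skills alive across realistic gaps
```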
**Fix:**
- ✅ Use **FIXED eval sets** generated once at start
- ✅ Reduce forgetting rate from 0.05 to 0.01 (5x slower forgetting)
- ✅ Evaluation happens BEFORE time advance (accurate snapshot)
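These fixes slot into the training loop roughly as follows. This is a sketch, not the project's actual code: apart from `student.evaluate(eval_tasks)`, which is referenced below, the method names (`teacher.select_task`, `student.train_on`, `student.advance_time`, `task_generator.generate`) are assumed placeholders.

```python
def run_training(teacher, student, task_generator, iterations=500, eval_set_size=50):
    """Sketch of the fixed loop; the object APIs here are assumed, not verified."""
    # Fix: build the eval set ONCE and reuse it every iteration, instead of
    # regenerating fresh tasks for `general_accuracy` each time.
    eval_tasks = [task_generator.generate() for _ in range(eval_set_size)]

    history = []
    for iteration in range(iterations):
        task = teacher.select_task()
        outcome = student.train_on(task)
        teacher.update(task, outcome)

        # Fix: evaluate BEFORE advancing time, so the logged accuracy is a
        # snapshot of the state the student actually trained into.
        general_accuracy = student.evaluate(eval_tasks)
        history.append((iteration, general_accuracy))

        # Fix: the slower decay (forgetting_rate = 0.01) lives inside the student.
        student.advance_time(1)
    return history
```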
### 2. **Accuracy Calculation Method**
**Current Method:**
- Uses `student.evaluate(eval_tasks)` which samples answers stochastically
- Accounts for forgetting correctly
- BUT: Uses different tasks each time
**Problems:**
- Stochastic variance (random sampling)
- Inconsistent eval sets (regenerated each time)
- Small eval sets (10-15 tasks) = high variance
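To quantify how much variance a 10-15 task eval set introduces: with stochastic pass/fail sampling, the measured accuracy has a standard error of about sqrt(p*(1-p)/n). The check below is plain binomial arithmetic, not project code:

```python
import math

def accuracy_std_error(p: float, n_tasks: int) -> float:
    """Standard error of a sampled accuracy estimate over n independent tasks."""
    return math.sqrt(p * (1 - p) / n_tasks)

print(accuracy_std_error(0.5, 10))   # ~0.16: swings of +/-15 points look like "drops"
print(accuracy_std_error(0.5, 15))   # ~0.13
print(accuracy_std_error(0.5, 100))  # ~0.05: larger fixed sets are far more stable
```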
**Better Method:**
- ✅ **FIXED eval sets** generated once
- ✅ Same tasks used throughout = consistent measurement
- ✅ Larger eval sets (15+ tasks) for stability
**Alternative (for future):**
- Use expected accuracy = mean(prob_correct) instead of sampling
- Removes stochastic variance
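A minimal sketch of the two evaluation styles, assuming the mock student can expose a per-task success probability (the `prob_correct` name is borrowed from the note above; the method itself is hypothetical):

```python
import random

def sampled_accuracy(student, eval_tasks, rng=random):
    """Current approach: draw one correct/incorrect outcome per task (noisy)."""
    hits = sum(rng.random() < student.prob_correct(task) for task in eval_tasks)
    return hits / len(eval_tasks)

def expected_accuracy(student, eval_tasks):
    """Alternative: average the success probabilities directly (no sampling noise)."""
    return sum(student.prob_correct(task) for task in eval_tasks) / len(eval_tasks)
```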
### 3. **Mock vs Real Components**
**Current Mock Components:**
- ✅ Mock Student: Captures learning + forgetting well
- ✅ Mock Task Generator: Simple but functional
- ❌ Simplified learning model
- ❌ Limited task diversity
**Real Components (MentorFlow):**
- Real Student: Full PPO with neural network
- Real Task Generator: Procedural generation, 15 families
**Will Real Components Be Better?** **YES:**
1. **Real PPO Student:**
- Can learn complex patterns
- Better generalization
- More realistic learning curves
- But: Slower to train
2. **Real Task Generator:**
- More diverse tasks
- Procedural generation = infinite variety
- Better tests generalization
3. **Teacher Agent Algorithm:**
- UCB algorithm will work the same
- Should perform even better with real components
- More realistic reward signals
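For reference, the selection rule being carried over is standard UCB1; the sketch below is a generic implementation over task families treated as bandit arms, not MentorFlow's actual teacher:

```python
import math

class UCB1Teacher:
    """Generic UCB1 bandit over task families (a sketch, not the project's teacher)."""

    def __init__(self, n_families: int, exploration: float = 2.0):
        self.counts = [0] * n_families    # how often each family has been assigned
        self.values = [0.0] * n_families  # running mean reward per family
        self.exploration = exploration

    def select_family(self) -> int:
        # Assign every family once before trusting the confidence bounds.
        for family, count in enumerate(self.counts):
            if count == 0:
                return family
        total = sum(self.counts)
        scores = [
            value + math.sqrt(self.exploration * math.log(total) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, family: int, reward: float) -> None:
        self.counts[family] += 1
        self.values[family] += (reward - self.values[family]) / self.counts[family]
```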
**Expected Improvement:**
- Teacher should learn better curriculum
- Student should achieve higher accuracy
- More realistic forgetting patterns (if implemented)
## Applied Fixes
✅ **Fixed evaluation to use FIXED eval sets**
✅ **Reduced forgetting rate from 0.05 → 0.01**
✅ **Evaluation happens BEFORE time advance**
✅ **All strategies use consistent eval sets**
## Remaining Considerations
1. **Forgetting Model**: Could use a more sophisticated model (spaced repetition optimization; see the sketch after this list)
2. **Evaluation Method**: Could use expected accuracy instead of sampling
3. **Eval Set Size**: Could increase for more stability (currently 15 tasks, could be 50-100)
4. **Time Reset**: Could periodically reset time to prevent complete forgetting in long training
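For considerations 1 and 4, one possible direction (sketched under assumptions, not taken from the codebase) is to track time-since-last-practice per skill, so that practicing a skill resets its own clock rather than the global clock decaying everything, and to slow the decay with each repetition in the spirit of spaced repetition:

```python
import math

class SpacedForgettingModel:
    """Per-skill forgetting with repetition-dependent decay; illustrative only."""

    def __init__(self, base_rate: float = 0.01, slowdown: float = 0.8):
        self.base_rate = base_rate  # decay rate for a skill practiced exactly once
        self.slowdown = slowdown    # each extra repetition multiplies the rate by this
        self.last_practiced = {}    # skill -> time of most recent practice
        self.repetitions = {}       # skill -> number of practices so far

    def practice(self, skill: str, now: float) -> None:
        self.last_practiced[skill] = now
        self.repetitions[skill] = self.repetitions.get(skill, 0) + 1

    def retention(self, skill: str, now: float) -> float:
        if skill not in self.last_practiced:
            return 0.0
        elapsed = now - self.last_practiced[skill]
        rate = self.base_rate * self.slowdown ** (self.repetitions[skill] - 1)
        return math.exp(-rate * elapsed)
```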