# Summary of Fixes for Accuracy Drop Issues

## Issues Identified

### 1. **Accuracy Drops at End** ❌

**Root Causes:**
1. **Evaluation uses NEW tasks each iteration** → Variance and inconsistency
   - Lines 171-175: Generates new tasks on-the-fly for `general_accuracy`
   - Different tasks each time = different difficulty/variance

2. **Forgetting rate too aggressive for 500 iterations**
   - Forgetting rate = 0.05
   - After 500 time units: retention = exp(-0.05 * 500) ≈ 0.0
   - **All skills completely forgotten by iteration 500!** (see the sketch after this list)

3. **Evaluation timing**: Evaluation happens after the time advance, but we log before it - acceptable on its own, though evaluating before the advance gives a more accurate snapshot
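
As a sanity check on root cause 2, here is a minimal sketch of the exponential forgetting curve implied by retention = exp(-forgetting_rate * elapsed_time). The helper name is ours; only the formula and the two rates come from the notes above.

```python
import math

def retention(forgetting_rate: float, elapsed_time: float) -> float:
    """Fraction of a learned skill retained after `elapsed_time` without practice."""
    return math.exp(-forgetting_rate * elapsed_time)

# Original rate: retention collapses within ~50 time units.
print(retention(0.05, 50))    # ~0.08
print(retention(0.05, 500))   # ~1.4e-11, effectively 0.0

# Reduced rate: 5x slower decay, so skills survive much longer between practices.
print(retention(0.01, 50))    # ~0.61
print(retention(0.01, 500))   # ~0.007
```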

**Fix:**
- ✅ Use **FIXED eval sets** generated once at start
- ✅ Reduce forgetting rate from 0.05 to 0.01 (5x slower forgetting)
- ✅ Evaluation happens BEFORE time advance (accurate snapshot)

### 2. **Accuracy Calculation Method**

**Current Method:**
- Uses `student.evaluate(eval_tasks)` which samples answers stochastically
- Accounts for forgetting correctly
- BUT: Uses different tasks each time

**Problems:**
- Stochastic variance (random sampling)
- Inconsistent eval sets (regenerated each time)
- Small eval sets (10-15 tasks) = high variance

**Better Method:**
- ✅ **FIXED eval sets** generated once
- ✅ Same tasks used throughout = consistent measurement
- ✅ Larger eval sets (15+ tasks) for stability

**Alternative (for future):**
- Use expected accuracy = mean(prob_correct) instead of sampling (see the sketch below)
- Removes stochastic variance
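
A minimal sketch of the two evaluation modes, assuming the student can expose a per-task probability of answering correctly; the helper names and the probability values are hypothetical, not the project's actual `evaluate` API.

```python
import random
from statistics import mean

def sampled_accuracy(prob_correct: list[float], rng: random.Random) -> float:
    """Current style: draw one Bernoulli outcome per task, then average (noisy)."""
    return mean(1.0 if rng.random() < p else 0.0 for p in prob_correct)

def expected_accuracy(prob_correct: list[float]) -> float:
    """Proposed alternative: average the per-task success probabilities (no sampling noise)."""
    return mean(prob_correct)

# Hypothetical success probabilities for a fixed 15-task eval set.
probs = [0.9, 0.8, 0.7, 0.6, 0.5] * 3
rng = random.Random(0)
print(expected_accuracy(probs))                                     # always 0.7
print([round(sampled_accuracy(probs, rng), 2) for _ in range(3)])   # jumps around 0.7
```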

### 3. **Mock vs Real Components**

**Current Mock Components:**
- ✅ Mock Student: Captures learning + forgetting well
- ✅ Mock Task Generator: Simple but functional
- ❌ Simplified learning model
- ❌ Limited task diversity

**Real Components (MentorFlow):**
- Real Student: Full PPO with neural network
- Real Task Generator: Procedural generation, 15 families

**Will Real Components Be Better?** **YES:**

1. **Real PPO Student:**
   - Can learn complex patterns
   - Better generalization
   - More realistic learning curves
   - But: Slower to train

2. **Real Task Generator:**
   - More diverse tasks
   - Procedural generation = infinite variety
   - Better tests generalization

3. **Teacher Agent Algorithm:**
   - UCB algorithm will work the same (see the sketch after this list)
   - Should perform even better with real components
   - More realistic reward signals
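
Since the UCB selection logic is independent of whether the student and task generator are mocked or real, a minimal UCB1 sketch over task families is shown below. The class name, the exploration constant, and the choice of reward (per-step accuracy gain) are our assumptions, not the project's actual teacher implementation.

```python
import math

class UCB1Teacher:
    """Minimal UCB1 bandit over task families (assumed reward: accuracy gain)."""

    def __init__(self, n_families: int, exploration: float = 2.0):
        self.counts = [0] * n_families      # times each family has been chosen
        self.values = [0.0] * n_families    # running mean reward per family
        self.exploration = exploration
        self.total = 0

    def select_family(self) -> int:
        # Play every family once before applying the UCB score.
        for family, count in enumerate(self.counts):
            if count == 0:
                return family
        scores = [
            value + math.sqrt(self.exploration * math.log(self.total) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return scores.index(max(scores))

    def update(self, family: int, reward: float) -> None:
        # Incremental mean update for the chosen family.
        self.counts[family] += 1
        self.total += 1
        self.values[family] += (reward - self.values[family]) / self.counts[family]
```

Each iteration the teacher calls `select_family()`, the student trains on a task from that family, and the resulting accuracy gain is fed back through `update()`; swapping in the real PPO student changes only where the reward comes from, not this loop.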

**Expected Improvement:**
- Teacher should learn better curriculum
- Student should achieve higher accuracy
- More realistic forgetting patterns (if implemented)

## Applied Fixes

✅ **Fixed evaluation to use FIXED eval sets**
✅ **Reduced forgetting rate from 0.05 → 0.01**
✅ **Evaluation happens BEFORE time advance**
✅ **All strategies use consistent eval sets**
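
A condensed sketch of how these fixes fit together in the training loop. `teacher`, `student`, `task_generator`, and their method names are placeholders for the project's actual interfaces, and the accuracy-gain reward is an assumption.

```python
FORGETTING_RATE = 0.01   # reduced from 0.05
EVAL_SET_SIZE = 15

def run_training(teacher, student, task_generator, n_iterations=500, log=print):
    # Fix 1: build the eval set ONCE so every iteration is scored on the same tasks.
    fixed_eval_tasks = [task_generator.sample() for _ in range(EVAL_SET_SIZE)]
    prev_accuracy = 0.0

    for iteration in range(n_iterations):
        family = teacher.select_family()
        task = task_generator.sample(family=family)
        student.train_on(task)

        # Fix 3: evaluate BEFORE advancing time, so the logged accuracy is a
        # snapshot of what the student knows now, not after extra decay.
        accuracy = student.evaluate(fixed_eval_tasks)
        log(f"iter={iteration} family={family} general_accuracy={accuracy:.3f}")

        teacher.update(family, reward=accuracy - prev_accuracy)
        prev_accuracy = accuracy

        # Fix 2: time advances with the slower forgetting rate.
        student.advance_time(1, forgetting_rate=FORGETTING_RATE)
```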

## Remaining Considerations

1. **Forgetting Model**: Could use more sophisticated model (spaced repetition optimization; see the sketch below)
2. **Evaluation Method**: Could use expected accuracy instead of sampling
3. **Eval Set Size**: Could increase for more stability (currently 15 tasks, could be 50-100)
4. **Time Reset**: Could periodically reset time to prevent complete forgetting in long training
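
For consideration 1, one possible direction is to let repeated practice slow the decay, in the spirit of spaced repetition. This is a sketch of an assumed model (stability doubling per review), not something implemented in the codebase.

```python
import math

def retention_with_practice(base_rate: float, elapsed: float, reviews: int,
                            stability_growth: float = 2.0) -> float:
    """Assumed spacing-effect model: each review multiplies the memory's
    stability, shrinking the effective forgetting rate."""
    stability = stability_growth ** reviews
    return math.exp(-base_rate * elapsed / stability)

# With base_rate=0.01 over 500 time units, more reviews mean far better retention.
for reviews in (0, 2, 4):
    print(reviews, round(retention_with_practice(0.01, 500, reviews), 3))
# 0 -> 0.007, 2 -> 0.287, 4 -> 0.732
```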