# Summary of Fixes for Accuracy Drop Issues

## Issues Identified

### 1. **Accuracy Drops at End** ❌

**Root Causes:**
1. **Evaluation uses NEW tasks each iteration** → Variance and inconsistency
   - Lines 171-175: Generates new tasks on-the-fly for `general_accuracy`
   - Different tasks each time = different difficulty/variance

2. **Forgetting rate too aggressive for 500 iterations**
   - Forgetting rate = 0.05
   - After 500 time units: retention = exp(-0.05 * 500) ≈ 0.0
   - **All skills completely forgotten by iteration 500!** (see the sketch after this list)

3. **Evaluation timing**: Evaluation happens after the time advance, but we log before it - acceptable on its own, though evaluating before the advance gives a more accurate snapshot
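
As a sanity check on root cause 2, here is a minimal sketch of the exponential forgetting curve implied by retention = exp(-forgetting_rate * elapsed_time). The helper name is ours; only the formula and the two rates come from the notes above.

```python
import math

def retention(forgetting_rate: float, elapsed_time: float) -> float:
    """Fraction of a learned skill retained after `elapsed_time` without practice."""
    return math.exp(-forgetting_rate * elapsed_time)

# Original rate: retention collapses within ~50 time units.
print(retention(0.05, 50))    # ~0.08
print(retention(0.05, 500))   # ~1.4e-11, effectively 0.0

# Reduced rate: 5x slower decay, so skills survive much longer between practices.
print(retention(0.01, 50))    # ~0.61
print(retention(0.01, 500))   # ~0.007
```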

**Fix:**
- ✅ Use **FIXED eval sets** generated once at start
- ✅ Reduce forgetting rate from 0.05 to 0.01 (5x slower forgetting)
- ✅ Evaluation happens BEFORE time advance (accurate snapshot)

### 2. **Accuracy Calculation Method**

**Current Method:**
- Uses `student.evaluate(eval_tasks)` which samples answers stochastically
- Accounts for forgetting correctly
- BUT: Uses different tasks each time

**Problems:**
- Stochastic variance (random sampling)
- Inconsistent eval sets (regenerated each time)
- Small eval sets (10-15 tasks) = high variance

**Better Method:**
- ✅ **FIXED eval sets** generated once
- ✅ Same tasks used throughout = consistent measurement
- ✅ Larger eval sets (15+ tasks) for stability

**Alternative (for future):**
- Use expected accuracy = mean(prob_correct) instead of sampling (see the sketch below)
- Removes stochastic variance
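
A minimal sketch of the two evaluation modes, assuming the student can expose a per-task probability of answering correctly; the helper names and the probability values are hypothetical, not the project's actual `evaluate` API.

```python
import random
from statistics import mean

def sampled_accuracy(prob_correct: list[float], rng: random.Random) -> float:
    """Current style: draw one Bernoulli outcome per task, then average (noisy)."""
    return mean(1.0 if rng.random() < p else 0.0 for p in prob_correct)

def expected_accuracy(prob_correct: list[float]) -> float:
    """Proposed alternative: average the per-task success probabilities (no sampling noise)."""
    return mean(prob_correct)

# Hypothetical success probabilities for a fixed 15-task eval set.
probs = [0.9, 0.8, 0.7, 0.6, 0.5] * 3
rng = random.Random(0)
print(expected_accuracy(probs))                                     # always 0.7
print([round(sampled_accuracy(probs, rng), 2) for _ in range(3)])   # jumps around 0.7
```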

### 3. **Mock vs Real Components**

**Current Mock Components:**
- ✅ Mock Student: Captures learning + forgetting well
- ✅ Mock Task Generator: Simple but functional
- ❌ Simplified learning model
- ❌ Limited task diversity

**Real Components (MentorFlow):**
- Real Student: Full PPO with neural network
- Real Task Generator: Procedural generation, 15 families

**Will Real Components Be Better?** **YES:**

1. **Real PPO Student:**
   - Can learn complex patterns
   - Better generalization
   - More realistic learning curves
   - But: Slower to train

2. **Real Task Generator:**
   - More diverse tasks
   - Procedural generation = infinite variety
   - Better tests generalization

3. **Teacher Agent Algorithm:**
   - UCB algorithm will work the same (see the sketch after this list)
   - Should perform even better with real components
   - More realistic reward signals
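
Since the UCB selection logic is independent of whether the student and task generator are mocked or real, a minimal UCB1 sketch over task families is shown below. The class name, the exploration constant, and the choice of reward (per-step accuracy gain) are our assumptions, not the project's actual teacher implementation.

```python
import math

class UCB1Teacher:
    """Minimal UCB1 bandit over task families (assumed reward: accuracy gain)."""

    def __init__(self, n_families: int, exploration: float = 2.0):
        self.counts = [0] * n_families      # times each family has been chosen
        self.values = [0.0] * n_families    # running mean reward per family
        self.exploration = exploration
        self.total = 0

    def select_family(self) -> int:
        # Play every family once before applying the UCB score.
        for family, count in enumerate(self.counts):
            if count == 0:
                return family
        scores = [
            value + math.sqrt(self.exploration * math.log(self.total) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return scores.index(max(scores))

    def update(self, family: int, reward: float) -> None:
        # Incremental mean update for the chosen family.
        self.counts[family] += 1
        self.total += 1
        self.values[family] += (reward - self.values[family]) / self.counts[family]
```

Each iteration the teacher calls `select_family()`, the student trains on a task from that family, and the resulting accuracy gain is fed back through `update()`; swapping in the real PPO student changes only where the reward comes from, not this loop.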

**Expected Improvement:**
- Teacher should learn better curriculum
- Student should achieve higher accuracy
- More realistic forgetting patterns (if implemented)

## Applied Fixes

✅ **Fixed evaluation to use FIXED eval sets**
✅ **Reduced forgetting rate from 0.05 → 0.01**
✅ **Evaluation happens BEFORE time advance**
✅ **All strategies use consistent eval sets**
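
A condensed sketch of how these fixes fit together in the training loop. `teacher`, `student`, `task_generator`, and their method names are placeholders for the project's actual interfaces, and the accuracy-gain reward is an assumption.

```python
FORGETTING_RATE = 0.01   # reduced from 0.05
EVAL_SET_SIZE = 15

def run_training(teacher, student, task_generator, n_iterations=500, log=print):
    # Fix 1: build the eval set ONCE so every iteration is scored on the same tasks.
    fixed_eval_tasks = [task_generator.sample() for _ in range(EVAL_SET_SIZE)]
    prev_accuracy = 0.0

    for iteration in range(n_iterations):
        family = teacher.select_family()
        task = task_generator.sample(family=family)
        student.train_on(task)

        # Fix 3: evaluate BEFORE advancing time, so the logged accuracy is a
        # snapshot of what the student knows now, not after extra decay.
        accuracy = student.evaluate(fixed_eval_tasks)
        log(f"iter={iteration} family={family} general_accuracy={accuracy:.3f}")

        teacher.update(family, reward=accuracy - prev_accuracy)
        prev_accuracy = accuracy

        # Fix 2: time advances with the slower forgetting rate.
        student.advance_time(1, forgetting_rate=FORGETTING_RATE)
```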

## Remaining Considerations

1. **Forgetting Model**: Could use more sophisticated model (spaced repetition optimization; see the sketch below)
2. **Evaluation Method**: Could use expected accuracy instead of sampling
3. **Eval Set Size**: Could increase for more stability (currently 15 tasks, could be 50-100)
4. **Time Reset**: Could periodically reset time to prevent complete forgetting in long training
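
For consideration 1, one possible direction is to let repeated practice slow the decay, in the spirit of spaced repetition. This is a sketch of an assumed model (stability doubling per review), not something implemented in the codebase.

```python
import math

def retention_with_practice(base_rate: float, elapsed: float, reviews: int,
                            stability_growth: float = 2.0) -> float:
    """Assumed spacing-effect model: each review multiplies the memory's
    stability, shrinking the effective forgetting rate."""
    stability = stability_growth ** reviews
    return math.exp(-base_rate * elapsed / stability)

# With base_rate=0.01 over 500 time units, more reviews mean far better retention.
for reviews in (0, 2, 4):
    print(reviews, round(retention_with_practice(0.01, 500, reviews), 3))
# 0 -> 0.007, 2 -> 0.287, 4 -> 0.732
```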