Spaces:

iteratehack
/

MentorFlow

Paused

App Files Files Community

MentorFlow / teacher_agent_dev /FIXES_SUMMARY.md

Cornelius

Deploy MentorFlow with GPU support

a52f96d 12 days ago

preview code

raw

history blame contribute delete

3.13 kB

	# Summary of Fixes for Accuracy Drop Issues

	## Issues Identified

	### 1. Accuracy Drops at End ❌

	Root Causes:
	1. Evaluation uses NEW tasks each iteration → Variance and inconsistency
	- Line 171-175: Generates new tasks on-the-fly for `general_accuracy`
	- Different tasks each time = different difficulty/variance

	2. Forgetting rate too aggressive for 500 iterations
	- Forgetting rate = 0.05
	- After 500 time units: retention = exp(-0.05 * 500) ≈ 0.0
	- All skills completely forgotten by iteration 500!

	3. Evaluation timing: Evaluation happens after time advance, but we log before - this is actually OK

	Fix:
	- ✅ Use FIXED eval sets generated once at start
	- ✅ Reduce forgetting rate from 0.05 to 0.01 (5x slower forgetting)
	- ✅ Evaluation happens BEFORE time advance (accurate snapshot)

	### 2. Accuracy Calculation Method

	Current Method:
	- Uses `student.evaluate(eval_tasks)` which samples answers stochastically
	- Accounts for forgetting correctly
	- BUT: Uses different tasks each time

	Problems:
	- Stochastic variance (random sampling)
	- Inconsistent eval sets (regenerated each time)
	- Small eval sets (10-15 tasks) = high variance

	Better Method:
	- ✅ FIXED eval sets generated once
	- ✅ Same tasks used throughout = consistent measurement
	- ✅ Larger eval sets (15+ tasks) for stability

	Alternative (for future):
	- Use expected accuracy = mean(prob_correct) instead of sampling
	- Removes stochastic variance

	### 3. Mock vs Real Components

	Current Mock Components:
	- ✅ Mock Student: Captures learning + forgetting well
	- ✅ Mock Task Generator: Simple but functional
	- ❌ Simplified learning model
	- ❌ Limited task diversity

	Real Components (MentorFlow):
	- Real Student: Full PPO with neural network
	- Real Task Generator: Procedural generation, 15 families

	Will Real Components Be Better? YES:

	1. Real PPO Student:
	- Can learn complex patterns
	- Better generalization
	- More realistic learning curves
	- But: Slower to train

	2. Real Task Generator:
	- More diverse tasks
	- Procedural generation = infinite variety
	- Better tests generalization

	3. Teacher Agent Algorithm:
	- UCB algorithm will work the same
	- Should perform even better with real components
	- More realistic reward signals

	Expected Improvement:
	- Teacher should learn better curriculum
	- Student should achieve higher accuracy
	- More realistic forgetting patterns (if implemented)

	## Applied Fixes

	✅ Fixed evaluation to use FIXED eval sets
	✅ Reduced forgetting rate from 0.05 → 0.01
	✅ Evaluation happens BEFORE time advance
	✅ All strategies use consistent eval sets

	## Remaining Considerations

	1. Forgetting Model: Could use more sophisticated model (spaced repetition optimization)
	2. Evaluation Method: Could use expected accuracy instead of sampling
	3. Eval Set Size: Could increase for more stability (currently 15 tasks, could be 50-100)
	4. Time Reset: Could periodically reset time to prevent complete forgetting in long training