Cornelius
Deploy MentorFlow with GPU support
a52f96d


Summary of Fixes for Accuracy Drop Issues

Issues Identified

1. Accuracy Drops at End ❌

Root Causes:

  1. Evaluation uses NEW tasks each iteration → variance and inconsistency

    • Lines 171-175: generate new tasks on the fly for general_accuracy
    • Different tasks each time mean different difficulty, so scores vary for reasons unrelated to learning
  2. Forgetting rate too aggressive for 500 iterations

    • Forgetting rate = 0.05
    • After 500 time units: retention = exp(-0.05 * 500) = exp(-25) ≈ 1.4e-11, effectively 0
    • So by iteration 500 every skill is effectively forgotten
  3. Evaluation timing: evaluation happens after the time advance, but logging happens before it; this on its own is actually OK
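A quick check of the retention arithmetic in root cause 2. It assumes the exponential form retention = exp(-rate * t) quoted above, with t read as time since the skill was last practiced; that reading is an assumption about the mock student, not something this summary states.

```python
import math

def retention(forgetting_rate: float, elapsed: float) -> float:
    """Fraction of a skill retained after `elapsed` time units of exponential forgetting."""
    return math.exp(-forgetting_rate * elapsed)

# Compare the old rate (0.05) with the new one (0.01) over short, medium, and long gaps.
for rate in (0.05, 0.01):
    for elapsed in (10, 100, 500):
        print(f"rate={rate}  t={elapsed:>3}  retention={retention(rate, elapsed):.2e}")
```

Even at 0.01, a skill left alone for 500 units keeps only exp(-5) ≈ 0.0067, so the slower rate mainly helps over gaps of tens to about a hundred units; long runs still need the mitigations listed under Remaining Considerations.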

Fix:

  • ✅ Use FIXED eval sets generated once at the start
  • ✅ Reduce forgetting rate from 0.05 to 0.01 (5x slower forgetting)
  • ✅ Evaluate BEFORE the time advance (accurate snapshot)
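A minimal sketch of the evaluation order described in the Fix. ToyStudent is a hypothetical stand-in, not the project's mock student class: it keeps one strength per topic and forgets exponentially from the last practice. The eval topics are fixed before training, and each iteration takes its accuracy snapshot before the clock advances.

```python
import math

class ToyStudent:
    """Hypothetical stand-in for the mock student: one strength per topic, exponential forgetting."""

    def __init__(self, forgetting_rate: float = 0.01):
        self.forgetting_rate = forgetting_rate
        self.time = 0.0
        self.strength = {}        # topic -> strength right after last practice, in [0, 1]
        self.last_practiced = {}  # topic -> clock time of last practice

    def retention(self, topic: str) -> float:
        """Current (decayed) strength of a topic; 0.0 if never practiced."""
        if topic not in self.strength:
            return 0.0
        elapsed = self.time - self.last_practiced[topic]
        return self.strength[topic] * math.exp(-self.forgetting_rate * elapsed)

    def learn(self, topic: str, gain: float = 0.2) -> None:
        # Practice starts from the decayed strength, adds a gain, and restarts that topic's clock.
        self.strength[topic] = min(1.0, self.retention(topic) + gain)
        self.last_practiced[topic] = self.time

    def advance_time(self, dt: float = 1.0) -> None:
        self.time += dt


def train(student, curriculum, eval_topics):
    """Train on `curriculum`, logging expected accuracy on a FIXED eval set each iteration."""
    history = []
    for topic in curriculum:
        student.learn(topic)
        # Snapshot first, on the same eval topics every time ...
        accuracy = sum(student.retention(t) for t in eval_topics) / len(eval_topics)
        history.append(accuracy)
        # ... and only then let forgetting act.
        student.advance_time(1.0)
    return history


EVAL_TOPICS = ("add", "sub", "mul")  # built once, before training starts
history = train(ToyStudent(forgetting_rate=0.01), ["add", "sub", "mul", "add"], EVAL_TOPICS)
print(history)
```

Because the snapshot is taken before advance_time, the first logged value is the post-learning accuracy with no forgetting applied yet.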

2. Accuracy Calculation Method

Current Method:

  • Uses student.evaluate(eval_tasks), which samples answers stochastically
  • Accounts for forgetting correctly
  • BUT: Uses different tasks each time

Problems:

  • Stochastic variance (random sampling)
  • Inconsistent eval sets (regenerated each time)
  • Small eval sets (10-15 tasks) give high variance
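To see why 10-15 tasks is noisy, treat each eval task as an independent pass/fail trial with success probability p (a simplifying assumption). The standard error of measured accuracy is then sqrt(p(1 - p) / n), largest at p = 0.5:

```python
import math

def accuracy_std_error(p: float, n: int) -> float:
    """Standard error of accuracy measured as the fraction correct on n independent tasks."""
    return math.sqrt(p * (1.0 - p) / n)

# Worst case p = 0.5 for the current set sizes and the larger sizes suggested later.
for n in (10, 15, 50, 100):
    print(f"n={n:>3}  std error at p=0.5: {accuracy_std_error(0.5, n):.3f}")
```

At n = 15 one standard error is about 0.13, roughly ±13 percentage points, so swings of that size between iterations can be pure noise; at n = 100 it falls to 0.05.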

Better Method:

  • ✅ FIXED eval sets generated once
  • ✅ Same tasks used throughout = consistent measurement
  • ✅ Larger eval sets (15+ tasks) for stability
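One way to build such a fixed set, sketched with a hypothetical make_eval_set helper and (topic, difficulty) task tuples; the real task generator's interface is not shown in this summary. A dedicated seeded RNG makes the set reproducible, and every strategy is scored against the same object.

```python
import random

def make_eval_set(topics, per_topic, seed=0):
    """Generate the evaluation tasks ONCE from a fixed seed so every run and strategy sees the same set."""
    rng = random.Random(seed)  # private RNG: training-time randomness cannot change the eval set
    tasks = []
    for topic in topics:
        for _ in range(per_topic):
            tasks.append((topic, rng.randint(1, 5)))  # (topic, hypothetical difficulty 1-5)
    return tuple(tasks)  # immutable, so nothing can append to or reshuffle it later

EVAL_SET = make_eval_set(["add", "sub", "mul"], per_topic=5, seed=42)
```

Each strategy is then scored against EVAL_SET instead of freshly generated tasks.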

Alternative (for future):

  • Use expected accuracy = mean(prob_correct) instead of sampling
  • Removes stochastic variance
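A sketch comparing the two measurements on hypothetical per-task prob_correct values: sampled accuracy scatters around mean(prob_correct), while expected accuracy is a single deterministic number.

```python
import random
import statistics

def sampled_accuracy(probs, rng):
    """Draw one correct/incorrect outcome per task, as stochastic evaluation does."""
    return sum(rng.random() < p for p in probs) / len(probs)

def expected_accuracy(probs):
    """mean(prob_correct): the value sampled accuracy fluctuates around, with no sampling noise."""
    return sum(probs) / len(probs)

probs = [0.9, 0.7, 0.6, 0.4, 0.8] * 3  # hypothetical prob_correct for a 15-task eval set
rng = random.Random(0)
samples = [sampled_accuracy(probs, rng) for _ in range(1000)]

print("expected accuracy:", round(expected_accuracy(probs), 3))
print("sampled: mean", round(statistics.mean(samples), 3), "stdev", round(statistics.stdev(samples), 3))
```

With these numbers the sampled values spread by roughly 0.11 around 0.68 (sqrt(sum of p(1 - p)) / n), while expected_accuracy returns 0.68 every time.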

3. Mock vs Real Components

Current Mock Components:

  • ✅ Mock Student: Captures learning + forgetting well
  • ✅ Mock Task Generator: Simple but functional
  • ❌ Simplified learning model
  • ❌ Limited task diversity

Real Components (MentorFlow):

  • Real Student: Full PPO with neural network
  • Real Task Generator: Procedural generation, 15 families

Will Real Components Be Better? YES:

  1. Real PPO Student:

    • Can learn complex patterns
    • Better generalization
    • More realistic learning curves
    • But: Slower to train
  2. Real Task Generator:

    • More diverse tasks
    • Procedural generation = infinite variety
    • Better tests of generalization
  3. Teacher Agent Algorithm:

    • UCB algorithm will work the same
    • Should perform even better with real components
    • More realistic reward signals
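For reference, a textbook UCB1 bandit over task families. UCB1 is assumed here as the variant in use; the teacher's actual exploration constant, reward definition, and arm set may differ. Nothing in it depends on whether the student is the mock or the PPO one, only on the reward it receives.

```python
import math

class UCB1Teacher:
    """Standard UCB1 over task families (a sketch; the teacher's actual variant may differ)."""

    def __init__(self, n_arms: int, c: float = math.sqrt(2)):
        self.c = c                    # exploration strength
        self.counts = [0] * n_arms    # times each family was chosen
        self.values = [0.0] * n_arms  # running mean reward per family
        self.total = 0

    def select(self) -> int:
        for arm, count in enumerate(self.counts):
            if count == 0:  # try every family once before using the confidence bound
                return arm
        scores = [
            v + self.c * math.sqrt(math.log(self.total) / n)
            for v, n in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.total += 1
        # Incremental mean: no need to store the full reward history.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

The reward could be, for example, the change in accuracy on the fixed eval set after training on the chosen family; that choice is hypothetical.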

Expected Improvement:

  • Teacher should learn better curriculum
  • Student should achieve higher accuracy
  • More realistic forgetting patterns (if implemented)

Applied Fixes

  • ✅ Fixed evaluation to use FIXED eval sets
  • ✅ Reduced forgetting rate from 0.05 → 0.01
  • ✅ Evaluation happens BEFORE the time advance
  • ✅ All strategies use consistent eval sets

Remaining Considerations

  1. Forgetting Model: Could use a more sophisticated model (spaced repetition optimization)
  2. Evaluation Method: Could use expected accuracy instead of sampling
  3. Eval Set Size: Could increase for more stability (currently 15 tasks, could be 50-100)
  4. Time Reset: Could periodically reset time to prevent complete forgetting in long training
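Item 4 can be illustrated with a hypothetical reset policy that restarts the forgetting clock every reset_every iterations. The period of 100 is an arbitrary illustrative value, and the sketch assumes retention is driven by a single clock as in exp(-rate * t).

```python
import math
from typing import Optional

def retention_with_reset(forgetting_rate: float, iteration: int, reset_every: Optional[int] = None) -> float:
    """exp(-rate * elapsed), where the clock restarts every `reset_every` iterations if a period is given."""
    elapsed = iteration if reset_every is None else iteration % reset_every
    return math.exp(-forgetting_rate * elapsed)

rate = 0.01
lowest_no_reset = min(retention_with_reset(rate, i) for i in range(500))
lowest_with_reset = min(retention_with_reset(rate, i, reset_every=100) for i in range(500))
print(f"lowest retention over 500 iterations: no reset {lowest_no_reset:.4f}, reset every 100 {lowest_with_reset:.4f}")
```

Without a reset the floor is about 0.0068; with a 100-iteration reset retention never drops below about 0.37.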