Reinforcement Learning Findings
All gains work. RL finds solutions for every gain regime given proper tuning.
Sensitivity varies. Gains reshape the hyperparameter landscape, and the effect differs from task to task.

Unlike behavior cloning, online RL can discover successful policies across all gain settings — compliant, critically damped, and stiff — as long as key hyperparameters are re-tuned per configuration.
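
For concreteness, the three regimes can be read off the standard second-order closed-loop model. A minimal sketch (treating each joint as a decoupled unit with an assumed effective inertia, an idealization rather than the exact model used here):

```python
import math

def damping_ratio(kp: float, kd: float, inertia: float = 1.0) -> float:
    """Damping ratio of a PD-controlled joint, treating the closed loop as a
    second-order system: inertia * e'' + kd * e' + kp * e = 0."""
    return kd / (2.0 * math.sqrt(kp * inertia))

# zeta < 1: underdamped; zeta == 1: critically damped; zeta > 1: overdamped.
# "Compliant" vs. "stiff" refers to the magnitude of Kp, at whatever zeta
# the chosen Kd implies.
```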

Each dot below loads a live RL policy running in your browser via MuJoCo WASM + ONNX inference. Tap a dot to see the policy trained under that (Kp, Kd) setting.
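
Under the hood, each demo steps physics and queries the exported policy once per control tick. A rough Python equivalent of that loop, using the `mujoco` and `onnxruntime` packages (the model file, ONNX input name, observation layout, and position-target action interpretation are all assumptions; the site itself runs MuJoCo WASM and onnxruntime-web):

```python
import mujoco
import numpy as np
import onnxruntime as ort

model = mujoco.MjModel.from_xml_path("robot.xml")      # placeholder model file
data = mujoco.MjData(model)
policy = ort.InferenceSession("policy_kp60_kd4.onnx")  # hypothetical export

kp, kd = 60.0, 4.0  # example (Kp, Kd) setting; use the gains the policy saw

for _ in range(1000):
    # Observation layout is assumed; it must match what the policy trained on.
    obs = np.concatenate([data.qpos, data.qvel]).astype(np.float32)[None, :]
    (action,) = policy.run(None, {"obs": obs})  # input name "obs" is assumed
    q_target = action[0]                        # joint-position targets
    # Assuming the actuated joints occupy the tail of qpos/qvel (floating base).
    q = data.qpos[-model.nu:]
    dq = data.qvel[-model.nu:]
    # PD law: torque toward the policy's targets under the chosen gains.
    data.ctrl[:] = kp * (q_target - q) - kd * dq
    mujoco.mj_step(model, data)
```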

Joint Response

Live RL Policy. Tap a gain dot to switch policies. Drag the robot to perturb it. Pinch to zoom.

While RL can solve tasks under any gain regime, the ease of finding good hyperparameters varies. Some gain configurations produce large, smooth regions of successful settings, while others require more precise tuning. However, no single gain regime is consistently easier to optimize across all tasks.

This stands in contrast to behavior cloning, where compliant gains are clearly preferred. For RL, the choice of gains is less critical than ensuring the training pipeline is properly configured for the selected regime.
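
Tuning per configuration can be as simple as a small Optuna study over the action scale, matching the 50-trial sweeps visualized below. A minimal sketch in which `train_and_evaluate` is a placeholder for the full RL training-and-rollout pipeline, and the search range is illustrative:

```python
import optuna

KP, KD = 60.0, 4.0  # the gain configuration being tuned (example values)

def train_and_evaluate(kp: float, kd: float, action_scale: float) -> float:
    """Placeholder: train an RL policy under (kp, kd) with this action scale,
    then return its rollout success rate in [0, 1]."""
    return 0.0  # replace with the real training pipeline

def objective(trial: optuna.Trial) -> float:
    # Action scale is the hyperparameter swept in the demo; range is assumed.
    action_scale = trial.suggest_float("action_scale", 0.05, 1.0, log=True)
    return train_and_evaluate(kp=KP, kd=KD, action_scale=action_scale)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # 50 trials, as in the sweeps below

good = [t for t in study.trials if t.value is not None and t.value > 0.95]
print(f"{len(good)}/50 trials cleared 95% success")  # the "green ring" criterion
```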

Hyperparameter Sensitivity. Tap a gain dot to watch how Optuna explores the action-scale space. Green rings = >95% success.

Success rate color scale: 0% to 100%; 50 Optuna trials per gain setting.