Reinforcement Learning Findings
All gains work. RL finds solutions for every gain regime given proper tuning.
Sensitivity varies. Gains reshape the hyperparameter landscape, and the effect differs from task to task.

Unlike behavior cloning, online RL can discover successful policies across all gain settings — compliant, critically damped, and stiff — as long as key hyperparameters are re-tuned per configuration.
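
For concreteness, the three regimes can be read off the standard second-order closed-loop model. A minimal sketch (treating each joint as a decoupled unit with an assumed effective inertia, an idealization rather than the exact model used here):

```python
import math

def damping_ratio(kp: float, kd: float, inertia: float = 1.0) -> float:
    """Damping ratio of a PD-controlled joint, treating the closed loop as a
    second-order system: inertia * e'' + kd * e' + kp * e = 0."""
    return kd / (2.0 * math.sqrt(kp * inertia))

# zeta < 1: underdamped; zeta == 1: critically damped; zeta > 1: overdamped.
# "Compliant" vs. "stiff" refers to the magnitude of Kp, at whatever zeta
# the chosen Kd implies.
```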

Each dot below loads a live RL policy running in your browser via MuJoCo WASM + ONNX inference. Tap a dot to see the policy trained under that (Kp, Kd) setting.
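
Under the hood, each demo steps physics and queries the exported policy once per control tick. A rough Python equivalent of that loop, using the `mujoco` and `onnxruntime` packages (the model file, ONNX input name, observation layout, and position-target action interpretation are all assumptions; the site itself runs MuJoCo WASM and onnxruntime-web):

```python
import mujoco
import numpy as np
import onnxruntime as ort

model = mujoco.MjModel.from_xml_path("robot.xml")      # placeholder model file
data = mujoco.MjData(model)
policy = ort.InferenceSession("policy_kp60_kd4.onnx")  # hypothetical export

kp, kd = 60.0, 4.0  # example (Kp, Kd) setting; use the gains the policy saw

for _ in range(1000):
    # Observation layout is assumed; it must match what the policy trained on.
    obs = np.concatenate([data.qpos, data.qvel]).astype(np.float32)[None, :]
    (action,) = policy.run(None, {"obs": obs})  # input name "obs" is assumed
    q_target = action[0]                        # joint-position targets
    # Assuming the actuated joints occupy the tail of qpos/qvel (floating base).
    q = data.qpos[-model.nu:]
    dq = data.qvel[-model.nu:]
    # PD law: torque toward the policy's targets under the chosen gains.
    data.ctrl[:] = kp * (q_target - q) - kd * dq
    mujoco.mj_step(model, data)
```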

Joint Response

Live RL Policy. Tap a gain dot to switch policies. Drag the robot to perturb it. Pinch to zoom.

While RL can solve tasks under any gain regime, the ease of finding good hyperparameters varies. Some gain configurations produce large, smooth regions of successful settings, while others require more precise tuning. However, no single gain regime is consistently easier to optimize across all tasks.

This stands in contrast to behavior cloning, where compliant gains are clearly preferred. For RL, the choice of gains is less critical than ensuring the training pipeline is properly configured for the selected regime.
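
Tuning per configuration can be as simple as a small Optuna study over the action scale, matching the 50-trial sweeps visualized below. A minimal sketch in which `train_and_evaluate` is a placeholder for the full RL training-and-rollout pipeline, and the search range is illustrative:

```python
import optuna

KP, KD = 60.0, 4.0  # the gain configuration being tuned (example values)

def train_and_evaluate(kp: float, kd: float, action_scale: float) -> float:
    """Placeholder: train an RL policy under (kp, kd) with this action scale,
    then return its rollout success rate in [0, 1]."""
    return 0.0  # replace with the real training pipeline

def objective(trial: optuna.Trial) -> float:
    # Action scale is the hyperparameter swept in the demo; range is assumed.
    action_scale = trial.suggest_float("action_scale", 0.05, 1.0, log=True)
    return train_and_evaluate(kp=KP, kd=KD, action_scale=action_scale)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # 50 trials, as in the sweeps below

good = [t for t in study.trials if t.value is not None and t.value > 0.95]
print(f"{len(good)}/50 trials cleared 95% success")  # the "green ring" criterion
```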

Hyperparameter Sensitivity. Tap a gain dot to watch how Optuna explores the action-scale space. Green rings = >95% success.

Success rate color scale: 0% to 100%; 50 Optuna trials per gain setting.