Tune to Learn:
How Controller Gains Affect Robot Policy Learning

MIT

* Equal contribution; author order determined by coin flip

Abstract

Position controllers have become the dominant interface for executing learned manipulation policies. Yet a critical design decision remains understudied: how should we choose controller gains for policy learning? The conventional wisdom is to select gains based on desired task compliance or stiffness. However, this logic breaks down when controllers are paired with state-conditioned policies: effective stiffness emerges from the interplay between learned reactions and control dynamics, not from gains alone. We argue that gain selection should instead be guided by learnability: how amenable different gain settings are to the learning algorithm in use. In this work, we systematically investigate how position controller gains affect three core components of modern robot learning pipelines: behavior cloning, reinforcement learning from scratch, and sim-to-real transfer. Through extensive experiments across multiple tasks and robot embodiments, we find that:

  1. Behavior Cloning (BC) benefits from compliant, overdamped gain regimes,
  2. Reinforcement Learning (RL) can succeed across all gain regimes given compatible hyperparameter tuning, and
  3. Sim-to-Real transfer is harmed by stiff and overdamped gain regimes.

These findings reveal that optimal gain selection depends not on the desired task behavior, but on the learning paradigm employed.

Video Summary

Position Impedance Controllers and Gains

Position impedance controllers are commonly used in robotics to enable compliant behavior. The control law is typically defined as:

$$\tau = \underbrace{\mathbf{K}_p (\mathbf{x}_{\text{desired}} - \mathbf{x})}_{\text{feedback}} + \underbrace{\mathbf{K}_d (- \dot{\mathbf{x}})}_{\text{damping}} + \underbrace{\tau_{\text{ff}}}_{\text{feedforward}}$$

where $\tau$ is the control torque, $\mathbf{K}_p$ is the stiffness gain matrix, and $\mathbf{K}_d$ is the damping gain matrix. Click and drag the circle below to adjust the gains and see how they affect the robot's response to different excitations. The response curves are recorded from a real-world Franka Research 3.
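The control law above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names are ours, and the per-joint damping-ratio helper assumes a unit effective inertia per joint (the real Franka has configuration-dependent inertia).

```python
import numpy as np

def impedance_torque(x_desired, x, x_dot, Kp, Kd, tau_ff=None):
    """PD impedance control law: stiffness feedback + damping + feedforward."""
    if tau_ff is None:
        tau_ff = np.zeros_like(x)
    feedback = Kp @ (x_desired - x)  # pulls toward the desired position
    damping = Kd @ (-x_dot)          # resists velocity
    return feedback + damping + tau_ff

def damping_ratio(kp, kd, m=1.0):
    """Per-joint damping ratio zeta = kd / (2 * sqrt(kp * m)), assuming a
    unit-inertia second-order system: zeta < 1 is underdamped, zeta = 1
    critically damped, zeta > 1 overdamped."""
    return kd / (2.0 * np.sqrt(kp * m))
```

For example, with `kp = 100` and `kd = 20`, the helper returns a damping ratio of 1.0 (critically damped); raising `kd` to 40 pushes the joint into the overdamped regime.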

The 3D simulation below runs the same PD control law on a MuJoCo model of the Franka Research 3. Drag the gains on the chart above and observe how different gain regimes track sinusoidal or step commands — or click & drag the robot to apply perturbation forces.


Before we dig deeper into the world of low-level controllers —

How do you tune your controller gains?

Pick one to jump to the most relevant section — or just scroll through everything.

You said: Don't Tune

You're not alone — many practitioners never touch the default gains. But our experiments show that the default gain regime can quietly bottleneck policy performance. In behavior cloning, for example, swapping to a compliant, overdamped regime can improve success rates by over 30% on the same task with the same data. Keep reading to see how different gain choices interact with each stage of the learning pipeline.

You said: Tune for Task

Matching gains to the desired task stiffness is the most common heuristic — but it can be misleading when a learned policy is in the loop. Because the policy itself reacts to state, the effective stiffness is a product of both the controller gains and the learned behavior. Our findings suggest that optimizing gains for learnability — not task compliance — leads to significantly better policies.

You said: Tune for Teleop

Optimizing for a comfortable teleoperation feel is a reasonable starting point — after all, better teleop usually means better demonstration data. However, gains that feel natural for a human operator may not be optimal for the downstream learning algorithm. Our results show that the best gain regime differs depending on whether you're doing behavior cloning, RL from scratch, or sim-to-real transfer. Read on to find out which regime helps most for your pipeline.

Conclusion & Remarks

We have presented a systematic study of how position controller gains shape learning dynamics across three paradigms of modern robot learning. Our findings reveal that gains function not as behavioral parameters, but as an inductive bias that modulates the learning interface between policy and environment. Behavior cloning favors compliant, overdamped regimes; reinforcement learning adapts to any gain setting given compatible hyperparameters; and sim-to-real transfer suffers with stiff, overdamped configurations. These results provide both conceptual clarity and practical guidance for a widely used yet underexplored design decision.

Broader Implications

Whole-Body Tracking Controllers for Humanoids. Modern humanoid robots increasingly use RL-trained whole-body tracking policies as low-level controllers, analogous to the PD controllers studied here. Notably, recent work has shown that these motion tracking policies tend to be inherently stiff regardless of the underlying controller gains (Margolis et al., "SoftMimic: Learning Compliant Whole-body Control from Examples," arXiv 2025). Yet when such policies serve as the low-level interface for higher-level loco-manipulation learning, their effective compliance directly shapes the learning dynamics of the policies above them, much as PD gains do in our setting.

Learning from Wearables or Videos. Similarly, paradigms that learn manipulation skills from human videos (Qiu et al., "Humanoid Policy ~ Human Policy," arXiv 2025; Grauman et al., "Ego4D: Around the World in 3,000 Hours of Egocentric Video," CVPR 2022) or wearable devices (Chi et al., "Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots," arXiv 2024) typically treat the observed next-timestep state as the action label, implicitly assuming perfect target tracking, which our results suggest may be suboptimal for imitation learning. Whether these gain-dependent trends generalize to such cross-embodiment or whole-body control settings remains an open question, and we hope our findings offer a useful lens for investigating these directions.
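The perfect-tracking assumption above amounts to a simple labeling rule: pair each observed state with the next state as its action target. A minimal sketch (our own illustration, not from any of the cited systems):

```python
import numpy as np

def label_actions_from_states(states):
    """Build (observation, action) pairs from an observed state trajectory
    of shape (T, D), taking the next-timestep state as the action label.
    This encodes the perfect-tracking assumption: it presumes the
    controller will actually reach each commanded state in one step."""
    obs = states[:-1]      # states s_0 .. s_{T-2}
    actions = states[1:]   # labels  s_1 .. s_{T-1}
    return obs, actions
```

Under a compliant or poorly tracking controller, the executed state can lag these labels substantially, which is one way gain choice leaks into the imitation-learning signal.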

BibTeX

@inproceedings{author2026method,
  title     = {Your Paper Title Goes Here},
  author    = {One, Author and Two, Author and Three, Author and Four, Author},
  booktitle = {Conference on Robot Learning (CoRL)},
  year      = {2026}
}

Acknowledgments

This work was supported by … We thank … for helpful discussions. The website template is adapted from Nerfies.