ICRA 2025 (Under Review)
Collect Data Anywhere in Augmented Reality | DART leverages cloud-based simulation and AR to provide a scalable, intuitive, and cost-effective solution for data collection accessible anywhere in the world. This eliminates the need for a physical robot and associated environment setup costs.
Access Data through DexHub | All data collected through DART is automatically logged to the cloud and shared publicly with researchers. We provide both a web interface and an easy-to-use Python API for data access.
We recruited twenty robotics-novice participants to spend seven minutes collecting demonstrations with the Rainbow RB-Y1's and ALOHA's [Zhao et al., "Learning fine-grained bimanual manipulation with low-cost hardware," arXiv:2304.13705] default "kinematic double"-based teleoperation interfaces.
Participants collected 2.5x more demonstrations on average using DART than with real-world teleoperation setups.
Participants attributed this gap to a) reduced physical fatigue, b) better visibility of local contact interactions, and c) easier resetting of the environment. Please see our paper for controlled ablation studies.
In particular, participants spent their data-collection time more productively in DART than in the real world.
See side-by-side comparison videos below.
DART's features are carefully designed to make data collection more efficient and less fatiguing for operators. We focused on making the data collection experience fun and engaging, rather than taxing, repetitive, and boring.
One-Click Environment Reset | Unlike real-world data collection, DART allows users to reset the environment instantly with programmatically ensured randomness, resulting in (a) reduced mental and physical fatigue for data collectors, (b) increased data collection throughput, and (c) guaranteed data diversity.
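To illustrate what programmatic reset randomness buys, here is a hypothetical sketch (the function name, workspace bounds, and sampling scheme are our own illustration, not DART's actual implementation): a seeded reset makes every episode layout instant, reproducible, and provably diverse.

```python
import numpy as np

def reset_scene(seed, workspace=((-0.3, 0.3), (-0.2, 0.2))):
    """Sample a reproducible-yet-random object layout for one episode.

    Illustrative sketch only: each episode gets its own seed, so layouts
    are diverse across episodes but exactly replayable for a given seed.
    """
    rng = np.random.default_rng(seed)
    (x_lo, x_hi), (y_lo, y_hi) = workspace
    # Object position on the table plane, plus a random yaw.
    pos = np.array([rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi), 0.0])
    yaw = rng.uniform(-np.pi, np.pi)
    return pos, yaw
```

Because the randomness is seeded per episode, a "one-click" reset can also serve as an audit trail: any collected trajectory can be replayed in exactly the same scene.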
Visual Rendering happens locally on the AR device | Compared to existing VR-based teleoperation approaches, which send entire camera feeds or rendered images over the network [Iyer et al., "Open teach: A versatile teleoperation system for robotic manipulation," arXiv:2403.07870; Cheng et al., "Open-television: Teleoperation with immersive active visual feedback," arXiv:2407.01512], DART offloads the compute-intensive visual rendering to the edge AR device, reducing latency without compromising visual fidelity.
Fingertip-based Robot Controls | For an intuitive user experience, DexHub uses built-in hand-tracking rather than relying on external control devices. Parallel jaw grippers are controlled by pinching and releasing fingers. Full SE(3) pose is specified by tracking four hand keypoints, as seen below.
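The sketch below is illustrative, not DART's published algorithm: the specific choice of four keypoints (thumb tip, index tip, wrist, index knuckle) and the Gram-Schmidt frame construction are our assumptions about how an SE(3) target and a pinch-based gripper command could be derived from hand tracking.

```python
import numpy as np

def pose_from_keypoints(thumb_tip, index_tip, wrist, index_knuckle):
    """Derive an SE(3) gripper target from four tracked hand keypoints.

    Hypothetical sketch: position is the pinch midpoint; orientation is a
    right-handed frame built by Gram-Schmidt from the pinch axis and a
    rough "forward" direction. Assumes the forward direction is not
    parallel to the pinch axis.
    """
    p = 0.5 * (thumb_tip + index_tip)        # position: pinch midpoint
    x = index_tip - thumb_tip                # gripper closing axis
    x = x / np.linalg.norm(x)
    f = index_knuckle - wrist                # rough "forward" direction
    z = f - np.dot(f, x) * x                 # orthogonalize against x
    z = z / np.linalg.norm(z)
    y = np.cross(z, x)                       # completes a right-handed frame
    T = np.eye(4)
    T[:3, :3] = np.column_stack([x, y, z])
    T[:3, 3] = p
    return T

def gripper_width(thumb_tip, index_tip, max_width=0.08):
    """Map pinch distance to a parallel-jaw opening, clipped to jaw limits."""
    return float(np.clip(np.linalg.norm(index_tip - thumb_tip), 0.0, max_width))
```

Pinching fully (zero thumb-to-index distance) then maps to a closed gripper, and spreading the fingers maps to the maximum jaw opening.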
1. Data Augmentation | Simulation allows access to privileged information, such as the robot's, environment's, and objects’ ground-truth state. The ease of randomization and augmentation in simulation enables policy robustness difficult to achieve via real-world data collection. Please see our paper for experiments.
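One simple form such augmentation can take (a hypothetical sketch; the episode format and noise model are our own, and the paper's actual augmentation pipeline may differ) is replicating a logged episode with small perturbations of the privileged state observations while reusing the same actions:

```python
import numpy as np

def augment_episode(observations, actions, n_copies=4, noise_std=0.01, seed=0):
    """Replicate one logged episode with perturbed privileged-state obs.

    Illustrative only: valid when the perturbations are small enough that
    the original actions would still succeed. Returns the original episode
    plus n_copies noisy variants, all sharing the same action sequence.
    """
    rng = np.random.default_rng(seed)
    obs = np.asarray(observations, dtype=float)
    out = [(obs, actions)]
    for _ in range(n_copies):
        out.append((obs + rng.normal(0.0, noise_std, obs.shape), actions))
    return out
```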
2. RL Finetuning | One of the most powerful benefits of simulation is the possibility of using RL to finetune the policy. Simulation allows human-collected (and therefore possibly suboptimal) datasets to be refined through online reinforcement learning in massively parallelizable simulation environments. Such refinement can address the performance saturation often observed in policies trained only with supervised learning [Ross & Bagnell, "Efficient reductions for imitation learning," AISTATS 2010; Ross et al., "A reduction of imitation learning and structured prediction to no-regret online learning," AISTATS 2011; Zhao et al., "Aloha unleashed: A simple recipe for robot dexterity," CoRL 2024; Ankile et al., "From imitation to refinement: Residual RL for precise visual assembly," arXiv:2407.16677].
Anyone can spin up a cloud-hosted physics engine and start collecting robot data from home.
Servers can run in the following regions with a click of a button in our portal. Join our waitlist here to get access.
One thing to clarify: we still believe in the power of real-world data. Real-world data is paramount for aiding in last-mile Sim2Real transfer, serving as a regularizer to prevent the policy from over-exploiting simulation artifacts and/or deviating too much from the real world.
Our hypothesis, however, is that the scale and diversity of data required for robot foundation models will be most feasibly achieved through simulation, not the real world. The logic is simple: we don't have enough robots around the world to collect the amount of data required to train even a minimum-viable-product robot foundation model. We anticipate that robot foundation models will be trained on a mix of large-scale simulation data and a small fraction of real-world data.
DART and DexHub are still in the early stages of development, with room to grow. We've listed a few ideas we encourage the community to explore.
All of the above, although still easier than setting up a physical environment, are time-consuming and require a certain level of expertise. Thus, a method that can autonomously create manipulable simulation scenes, either from language descriptions or from pictures/videos of real-world scenes, could further bring down the cost of data collection and increase the diversity of the collected data.
Recent advances including GenSim [Wang et al., arXiv:2310.01361], RoboGen [Wang et al., arXiv:2311.01455], Gen2Sim [Katara et al., ICRA 2024], RoboCasa [Nasiriany et al., arXiv:2406.02523], and RialTo [Torne et al., arXiv:2403.03949] have shown promising results, but we still have a long way to go before generating truly diverse and realistic scenes resembling those of the real world.
DART currently supports controlling (a) fixed-base manipulators in a bimanual setup, (b) the upper limbs of humanoid platforms, and (c) wheeled mobile robots from keypoint-tracking inputs using a differential IK formulation. The main missing piece, however, is integration with a whole-body controller that can drive a full bipedal robot while tracking the various sets of human keypoints available from the AR device. Below are the keypoints that Apple's ARKit, for instance, provides for a human body.
Integrating recent works like HOVER [He et al., "HOVER: Versatile neural whole-body controller for humanoid robots," arXiv:2410.21229], which supports arbitrary combinations of keypoints for whole-body control, could be a good starting point to explore.
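The differential IK mapping mentioned above can be sketched with a standard damped least-squares step, a generic textbook formulation rather than DART's exact controller: joint velocities are obtained as dq = Jᵀ(JJᵀ + λ²I)⁻¹v, where v is the desired end-effector twist and λ is a damping factor that keeps the solve stable near singularities.

```python
import numpy as np

def diff_ik_step(jacobian, twist, damping=1e-2):
    """One damped least-squares differential-IK step (generic sketch).

    Maps a desired 6-DoF end-effector twist to joint velocities:
        dq = J^T (J J^T + lambda^2 I)^-1 v
    The damping term trades tracking accuracy for robustness when the
    Jacobian is near-singular.
    """
    J = np.asarray(jacobian)
    v = np.asarray(twist)
    JJt = J @ J.T
    reg = (damping ** 2) * np.eye(JJt.shape[0])
    return J.T @ np.linalg.solve(JJt + reg, v)
```

In a teleoperation loop, the twist would come from the error between the tracked hand pose and the current end-effector pose, and the Jacobian from the physics engine (e.g., MuJoCo's `mj_jac`-family functions).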
Although DART's compact message passing allows low-latency communication between the AR device and the cloud-hosted physics engine (MuJoCo), running the physics engine on-device would be even more desirable. Supporting the latest version of MuJoCo on visionOS is a promising direction to explore. Fortunately, there are already some efforts in this direction: MuJoCo on Swift.
One alternative path is to improve the fidelity of the internal physics engine that the AR device natively supports. Apple's RealityKit, for instance, supports basic collision detection between AR objects for simple physics interactions.
However, based on our experience, the physics engine is not yet mature enough to support complex robot manipulation tasks; it's simply not designed for serious roboticists.
Features that should be improved include:
Since the data collected through a crowd-sourced platform will be highly diverse and noisy, a principled way to curate it and ensure its quality is crucial. The LLM community has famously spent a significant amount of time and effort curating training data, and we believe a similar effort will be required as we enter the regime of larger-scale, crowd-sourced robot data collection.
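As a toy illustration of what rule-based curation can look like (a stand-in for the principled pipeline this calls for; the episode schema with `success` and `actions` fields is hypothetical):

```python
def curate(episodes, min_len=10, max_len=2000):
    """Keep successful episodes of plausible length, dropping duplicates.

    Toy rule-based filter: real crowd-sourced curation would also need
    learned quality scores, operator reputation, task coverage balancing,
    and similar signals beyond these simple heuristics.
    """
    seen = set()
    kept = []
    for ep in episodes:
        if not ep.get("success"):
            continue                      # drop failed demonstrations
        if not (min_len <= len(ep["actions"]) <= max_len):
            continue                      # drop degenerate or runaway episodes
        key = hash(tuple(map(tuple, ep["actions"])))
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        kept.append(ep)
    return kept
```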
We hope this project inspires the robotics community to think about, and make progress on, platforms that ignite ecosystem-level growth of robot datasets.
We believe that robot datasets matching the size of vision and language datasets cannot come from project-level data collection efforts led by single research labs or tech companies. Instead, they will require a global, community-driven effort, mirroring the internet as a data source for vision-language model training.