
DexHub and DART: Towards
Internet-Scale Robot Data Collection

Younghyo Park Jagdeep Bhatia Lars Ankile Pulkit Agrawal

ICRA 2025 (Under Review)

Visit the CoRL 2024 XE/WBCM Workshop
for an in-person demo on Nov 9th in Munich

Try it yourself on Apple Vision Pro! Please see our getting started guide.
If you're a researcher working on AI / Robotics, check out our call for research.

How robot datasets are collected today

  1. Buying a robot is expensive, limiting data collection to tech companies or research labs.
  2. Setting up environments requires either (a) constructing a fake setup around the robot in the lab or (b) moving robots to actual sites of interest, both of which are difficult and time-consuming.
  3. Teleoperating the robot suffers from visual occlusions, network delays, and a lack of rich feedback, which can slow down operators and prevent them from performing dynamic or precise tasks.
  4. Resetting the environment after every task completion is time-consuming and physically exhausting. Operators need to context-switch between robot control and environment setup far too frequently. In addition, ensuring diverse reset states is harder than it sounds; human operators often unintentionally repeat reset configurations that are easier to solve.
  5. Repetitive jobs, no matter how easy the task is, quickly lead to fatigue and operator burnout. Unfortunately, the number of required demonstrations scales with task complexity and the extent of required generalization.
  6. Post-processing collected data often happens on a local machine or private cloud. Differing data structures and storage conventions make data sharing difficult, even though sharing is essential for scaling up.
How do we make this process scale to the size of vision/language datasets?

DART reimagines robot data collection

Collect Data Anywhere in Augmented Reality  |  DART leverages cloud-based simulation and AR to provide a scalable, intuitive, and cost-effective solution for data collection accessible anywhere in the world. This eliminates the need for a physical robot and associated environment setup costs.

Access Data through DexHub  |  All data collected through DART is automatically logged to the cloud and shared publicly with researchers. We provide both a web interface and an easy-to-use Python API for data access.
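
To make the access pattern concrete, here is a minimal sketch of loading one downloaded episode. The HDF5 layout (a "qpos" observation key and an "actions" dataset) is our illustrative assumption, not DexHub's documented schema.

```python
# A minimal sketch of loading one downloaded DexHub episode; assumes an
# HDF5 layout with a "qpos" observation key and an "actions" dataset.
import h5py

with h5py.File("episode_0001.hdf5", "r") as f:   # assumed filename
    qpos = f["observations"]["qpos"][:]          # joint positions per step
    actions = f["actions"][:]                    # commanded actions per step
    print(f"loaded {len(actions)} timesteps, obs dim {qpos.shape[-1]}")
```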

DART is intuitive and efficient

User study shows higher data collection throughput and lower operator fatigue.

We recruited twenty robotics-novice participants to spend seven minutes collecting demonstrations with the Rainbow RB-Y1's and ALOHA's [Zhao et al., 2023, arXiv:2304.13705] default "kinematic double" teleoperation interfaces.

Participants collected 2.5x more demonstrations on average using DART than with real-world teleoperation setups.

Participants attributed this gap to a) reduced physical fatigue, b) better visibility of local contact interactions, and c) easier resetting of the environment. Please see our paper for controlled ablation studies.

In particular, participants spent data-collection time more productively in DART than in the real world.

See side-by-side comparison videos below.

Features of DART

What makes DART so intuitive and efficient?

DART's features are carefully designed to make data collection more efficient and less fatiguing for operators. We focused on making the data collection experience fun and engaging, rather than taxing, repetitive, and boring.

One-Click Environment Reset  |  Unlike real-world data collection, DART allows users to reset the environment instantly with programmatically ensured randomness, resulting in (a) reduced mental/physical fatigue for data collectors, (b) increased data collection throughput, and (c) data with guaranteed diversity.
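
As a flavor of what programmatically ensured randomness can look like on the simulation side, here is a minimal sketch of a seeded reset in MuJoCo (DART's backend physics engine); the scene file, object name, and placement bounds are illustrative assumptions.

```python
# A minimal sketch of a seeded one-click reset in MuJoCo; scene file,
# object name, and placement bounds are illustrative assumptions.
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("scene.xml")   # assumed scene file
data = mujoco.MjData(model)

def reset(seed: int) -> None:
    """Each seed yields a distinct but reproducible initial layout."""
    rng = np.random.default_rng(seed)
    mujoco.mj_resetData(model, data)
    adr = model.joint("cube_freejoint").qposadr[0]  # assumed free joint
    data.qpos[adr:adr + 2] = rng.uniform([-0.2, -0.2], [0.2, 0.2])  # x, y
    mujoco.mj_forward(model, data)

reset(seed=42)   # same seed -> same layout; new seed -> fresh layout
```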

DART's design choices

Visual Rendering happens locally on the AR device  |  Compared to existing VR-based teleoperation approaches, which send entire camera feeds or rendered images over the network [Iyer et al., 2024, arXiv:2403.07870; Cheng et al., 2024, arXiv:2407.01512], DART offloads the compute-intensive visual rendering process to the edge AR device, reducing latency without compromising visual fidelity.
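
For intuition on why this design keeps latency low, here is a sketch of the kind of compact per-tick state message the cloud physics engine could stream to the device; the byte layout and field choice are our assumptions, not DART's actual wire format.

```python
# Sketch of a compact per-tick state message (joint positions, not frames);
# the byte layout and field choice are assumptions, not DART's wire format.
import struct
import time

def pack_state(t: float, qpos: list[float]) -> bytes:
    """Pack a timestamp plus n joint positions as little-endian floats."""
    return struct.pack(f"<d{len(qpos)}f", t, *qpos)

msg = pack_state(time.time(), [0.0] * 30)   # e.g., a ~30-DoF robot state
print(len(msg), "bytes per tick")           # 128 B vs. megabytes of video
```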

Fingertip-based Robot Controls  |  For an intuitive user experience, DexHub uses built-in hand-tracking rather than relying on external control devices. Parallel jaw grippers are controlled by pinching and releasing fingers. Full SE(3) pose is specified by tracking four hand keypoints, as seen below.

[Figure: DART fingertip control]
For dexterous hands, full hand keypoints are used to control all joint angles.
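
As a rough illustration, the sketch below maps a pinch distance to a gripper command and builds an SE(3) frame from three of the tracked keypoints; the specific keypoints, threshold, and frame construction are our assumptions, not DART's exact implementation.

```python
# Illustrative fingertip mapping: pinch distance -> gripper command, and an
# SE(3) frame from three tracked keypoints. Thresholds and keypoint choice
# are assumptions, not DART's exact implementation.
import numpy as np

PINCH_CLOSED = 0.02   # meters; assumed distance at which the gripper closes

def gripper_command(thumb_tip: np.ndarray, index_tip: np.ndarray) -> float:
    """Map thumb-index distance to a gripper opening in [0, 1]."""
    d = np.linalg.norm(thumb_tip - index_tip)
    return float(np.clip((d - PINCH_CLOSED) / 0.06, 0.0, 1.0))

def hand_pose(p0: np.ndarray, p1: np.ndarray, p2: np.ndarray) -> np.ndarray:
    """Build a 4x4 SE(3) frame from three non-collinear keypoints."""
    x = (p1 - p0) / np.linalg.norm(p1 - p0)
    z = np.cross(x, p2 - p0)
    z /= np.linalg.norm(z)
    T = np.eye(4)
    T[:3, :3] = np.stack([x, np.cross(z, x), z], axis=1)  # columns x, y, z
    T[:3, 3] = p0
    return T
```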

Demonstrations in simulation offer many benefits

If you are a robotics researcher who already has access to a physical robot, you might be wondering,
Why should I care about collecting data in simulation if I can collect directly in the real world?
We argue that collecting data in simulation offers several benefits over real-world data collection.

1. Data Augmentation  |  Simulation provides access to privileged information, such as the ground-truth state of the robot, environment, and objects. The ease of randomization and augmentation in simulation enables policy robustness that is difficult to achieve via real-world data collection. Please see our paper for experiments.
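
For example, because the ground-truth state of every timestep is known, a recorded trajectory can be re-rendered under randomized camera viewpoints to multiply visual diversity. The sketch below uses MuJoCo's offscreen renderer; the scene file and randomization ranges are illustrative assumptions.

```python
# Re-render a recorded trajectory under randomized cameras using MuJoCo's
# offscreen renderer; scene file and ranges are illustrative assumptions.
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("scene.xml")   # assumed scene file
data = mujoco.MjData(model)
renderer = mujoco.Renderer(model, height=224, width=224)
rng = np.random.default_rng(0)

def rerender(qpos_traj: np.ndarray, n_views: int = 4) -> list:
    """Replay the same ground-truth states from several random viewpoints."""
    frames = []
    for _ in range(n_views):
        cam = mujoco.MjvCamera()
        cam.azimuth = rng.uniform(0, 360)
        cam.elevation = rng.uniform(-60, -20)
        cam.distance = rng.uniform(1.0, 2.0)
        for qpos in qpos_traj:
            data.qpos[:] = qpos                      # privileged state
            mujoco.mj_forward(model, data)
            renderer.update_scene(data, camera=cam)
            frames.append(renderer.render())
    return frames
```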

2. RL Finetuning  |  One powerful benefit of simulation is the possibility of using RL to finetune the policy. Simulation allows human-collected (and therefore possibly suboptimal) datasets to be refined through online reinforcement learning in massively parallelizable simulation environments. Such refinement can address the performance saturation often observed in policies trained only with supervised learning [Ross & Bagnell, 2010; Ross et al., 2011; Zhao et al., CoRL 2024; Ankile et al., 2024, arXiv:2407.16677].
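
As a rough sketch of what such finetuning can look like, the snippet below combines a REINFORCE-style policy-gradient term with a behavior-cloning regularizer that anchors the policy to the human demonstrations. The network shape, checkpoint, and loss weighting are illustrative assumptions, not the specific recipes of the works cited above.

```python
# Sketch: policy-gradient finetuning with a BC anchor. Network shape,
# checkpoint, and loss weighting are illustrative assumptions.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 7))
policy.load_state_dict(torch.load("bc_pretrained.pt"))  # assumed checkpoint
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
BC_WEIGHT = 0.1   # how strongly to stay anchored to the demonstrations

def finetune_step(obs, sim_actions, advantages, demo_obs, demo_actions):
    """One step: REINFORCE-style term from sim rollouts + BC regularizer."""
    dist = torch.distributions.Normal(policy(obs), torch.ones(7) * 0.1)
    pg_loss = -(dist.log_prob(sim_actions).sum(-1) * advantages).mean()
    bc_loss = ((policy(demo_obs) - demo_actions) ** 2).mean()
    loss = pg_loss + BC_WEIGHT * bc_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```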

Experience DART yourself!

Try DART on your own Apple Vision Pro!
Sign up here for access to our cloud-hosted physics engine.

Anyone can spin up a cloud-hosted physics engine and start collecting robot data from home. Servers can run in the following regions with a click of a button in our portal. Join our waitlist here to get access.

[Figure: available server regions]


Role of Real-world Datasets

One thing to clarify: we still believe in the power of real-world data. Real-world data is paramount for last-mile Sim2Real transfer, serving as a regularizer that prevents the policy from over-exploiting simulation artifacts and/or deviating too far from the real world.

Our hypothesis, however, is that the scale and diversity of data required for robot foundation models will be most feasibly achieved through simulation, not the real world. The logic is simple: there are not enough robots in the world to collect the amount of data required to train even a minimum-viable robot foundation model. We anticipate that robot foundation models will be trained on a mix of large-scale simulation data and a small fraction of real-world data.

[Figure: anticipated mix of simulation and real-world training data]

We should also note that colleagues in the field are coming up with creative ideas for collecting alternative types of real-world datasets that, although not necessarily robot-embodied, can be useful for training: Passive Demos  |  Human Motions

Call for Research

DART and DexHub are still in the early stages of development, with room to grow. We've listed a few ideas we encourage the community to explore.

Generative Simulation Scene Modeling

Although teleoperating in simulation removes the need for physical environment setup, it still requires a human to manually design the scene in simulation [Park et al., ICML 2024]. The process involves the following steps:
  • Preparing 3D Assets: Finding the right 3D asset in a public database (e.g., Objaverse [Deitke et al., CVPR 2023]), scanning real-world objects, or manually CAD-ing the objects from scratch.
  • Placing Assets in the Scene: Placing the objects in the scene in a semantically meaningful way that resembles the real-world setup, while maintaining strict non-penetration constraints (see the placement sketch after this list).
  • Reset Strategy Design: Designing a reset strategy that we can use to reset the scene after each task completion.

All of the above, although still easier than setting up a physical environment, are time-consuming and require a certain level of expertise. Thus, a method that can autonomously create manipulable simulation scenes, either from language descriptions or from pictures/videos of real-world scenes, could further bring down the cost of data collection and increase the diversity of the collected data.
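
As referenced in the placement step above, one simple way to satisfy non-penetration is rejection sampling: sample a pose, run the collision check, and retry on penetration. The sketch below does this in MuJoCo; the scene file, joint name, and tolerance are illustrative assumptions.

```python
# Rejection-sampling placement in MuJoCo: sample an (x, y) pose, then reject
# it if any contact penetrates. Scene, joint name, and tolerance are
# illustrative assumptions.
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("kitchen_scene.xml")  # assumed scene
data = mujoco.MjData(model)
rng = np.random.default_rng(0)

def place_object(joint_name: str, low, high, max_tries: int = 100) -> bool:
    """Return True once a non-penetrating placement is found."""
    adr = model.joint(joint_name).qposadr[0]
    for _ in range(max_tries):
        data.qpos[adr:adr + 2] = rng.uniform(low, high)   # random x, y
        mujoco.mj_forward(model, data)
        penetrating = any(
            data.contact[i].dist < -1e-4 for i in range(data.ncon)
        )
        if not penetrating:
            return True
    return False

place_object("mug_freejoint", low=[-0.3, -0.3], high=[0.3, 0.3])
```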

Recent advances including GenSim [Wang et al., 2023, arXiv:2310.01361], RoboGen [Wang et al., 2023, arXiv:2311.01455], Gen2Sim [Katara et al., ICRA 2024], RoboCasa [Nasiriany et al., 2024, arXiv:2406.02523], and RialTo [Torne et al., 2024, arXiv:2403.03949] have shown promising results, but we still have a long way to go before we can generate truly diverse and realistic scenes resembling those of the real world.

Keypoint Tracking Whole-Body Controllers

DART currently supports controlling (a) fixed-base manipulators in bimanual setups, (b) the upper limbs of humanoid platforms, and (c) wheeled mobile robots from keypoint-tracking inputs using a differential IK formulation. The main missing piece, however, is integration with a whole-body controller that can drive a full bipedal robot while tracking the various sets of human keypoints available from the AR device (Apple's ARKit, for instance, provides a full set of body keypoints).
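
For reference, the core of such keypoint tracking can be sketched as damped-least-squares differential IK in MuJoCo: map the error between a tracked target and the end-effector site to joint updates via the Jacobian. The robot file, site name, and gains are illustrative assumptions, and the sketch assumes a fixed-base arm (nq == nv).

```python
# Damped-least-squares differential IK toward a tracked keypoint target.
# Assumes a fixed-base arm (nq == nv); robot file and site name are
# illustrative assumptions.
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("robot.xml")   # assumed robot model
data = mujoco.MjData(model)

def dik_step(target_pos: np.ndarray, site: str = "ee_site", damping=1e-3):
    """Move the end-effector site one step toward the tracked target."""
    sid = model.site(site).id
    jacp = np.zeros((3, model.nv))
    mujoco.mj_jacSite(model, data, jacp, None, sid)
    err = target_pos - data.site_xpos[sid]          # Cartesian error
    # dq = J^T (J J^T + lambda * I)^-1 * err
    dq = jacp.T @ np.linalg.solve(jacp @ jacp.T + damping * np.eye(3), err)
    data.qpos[:model.nv] += dq
    mujoco.mj_forward(model, data)
```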

Integrating recent works like HOVER [He et al., 2024, arXiv:2410.21229], which supports arbitrary combinations of keypoints for whole-body control, can be a good starting point.

MuJoCo on VisionOS and/or Better Apple RealityKit Physics

Although DART's compact message passing allows low-latency communication between the AR device and the cloud-hosted physics engine (MuJoCo), it would still be preferable to run the physics engine on-device. Supporting the latest version of MuJoCo on VisionOS is a promising direction to explore. Fortunately, there are already some efforts in this direction: MuJoCo on Swift.

One alternative path is to improve the fidelity of the physics engine that the AR device natively supports. Apple's RealityKit, for instance, supports basic collision detection between AR objects for simple physics interactions. However, in our experience, this physics engine is not yet mature enough to support complex robot manipulation tasks; it is simply not designed for serious roboticists. Features that should be improved include:

  • Robot description support
  • Joint kinematics/dynamics support
  • Deformable material support
  • SDF-based collision detection support for tight tolerance tasks
  • Reliable contact resolution

Data Curation for Robot Learning

Since the data collected through a crowd-sourced platform will be highly diverse and noisy, a principled way to curate it and ensure its quality is crucial. The LLM community has spent a significant amount of time and effort curating training data, and we believe a similar effort will be required as we enter the regime of larger-scale, crowd-sourced robot data collection.
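
As a trivial starting point, even simple heuristics help: the sketch below keeps successful episodes and trims duration outliers, which often indicate operator confusion. The episode schema (success flag, step count) is an illustrative assumption.

```python
# A simple curation pass: keep successful episodes, trim duration outliers.
# The episode schema (success flag, step count) is an illustrative assumption.
import numpy as np

def curate(episodes: list) -> list:
    """Return episodes that succeeded and have non-outlier durations."""
    ok = [e for e in episodes if e.get("success", False)]
    lengths = np.array([e["num_steps"] for e in ok])
    lo, hi = np.percentile(lengths, [5, 95])   # trim extreme durations
    return [e for e in ok if lo <= e["num_steps"] <= hi]
```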

Future We Imagine

We hope this project inspires the robotics community to think about, and make progress on, a platform that ignites the organic growth of robot datasets.

We believe that robot datasets at the scale of vision and language datasets cannot come from project-level data collection efforts led by single research labs or tech companies. Instead, they require a global, community-driven effort, mirroring how the internet serves as the data source for training vision-language models.