DoorMan
NVIDIA GEAR Team

Infinite Visual Randomizations

Powered by IsaacLab

RGB-Only Sim-to-Real

Generalizable Policy Transfer


DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer

Haoru Xue1,2,*, Tairan He1,3,*, Zi Wang1,*
Qingwei Ben1,4, Wenli Xiao1,3, Zhengyi Luo1, Xingye Da1, Fernando Castañeda1,
Guanya Shi3, Shankar Sastry2, Linxi "Jim" Fan1,†, Yuke Zhu1,†

1NVIDIA, 2UC Berkeley, 3CMU, 4CUHK
*Equal Contribution, †Project Leads

Massive-Scale Simulation Randomization

We use a procedural generation pipeline that randomizes the physical and visual properties of articulated objects: mass, handle type, hinge damping, stiffness, texture, background, and more.
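
As a purely illustrative sketch of what such a per-episode randomization draw could look like (all names, ranges, and parameters below are assumptions, not the paper's actual configuration):

import random

HANDLE_TYPES = ["lever", "knob", "pull_bar", "push_plate"]

def sample_door_config(rng: random.Random) -> dict:
    """Draw one randomized articulated-door configuration per episode."""
    return {
        # Physical properties of the articulation (illustrative ranges).
        "door_mass_kg": rng.uniform(5.0, 40.0),
        "hinge_damping": rng.uniform(0.1, 5.0),
        "hinge_stiffness": rng.uniform(0.0, 10.0),
        "handle_type": rng.choice(HANDLE_TYPES),
        # Visual properties, resampled every episode for RGB robustness.
        "door_texture_id": rng.randrange(10_000),
        "background_id": rng.randrange(1_000),
        "light_intensity": rng.uniform(200.0, 2000.0),
    }

rng = random.Random(42)
print(sample_door_config(rng))

Resampling the visual properties on every episode is what pushes an RGB-only policy to rely on task-relevant structure rather than memorized appearance.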

Teacher-Student-Bootstrap

We apply GRPO fine-tuning to bootstrap the student on top of classical teacher-student distillation. We find this stage uniquely useful for challenging loco-manipulation tasks because of their partial observability. The reward is mostly a binary signal on the task-success criterion. The result: a 20-30% improvement in success rate. (A sketch of the group-relative update follows the phase list below.)

Phase 1: Train privileged teacher with PPO (1 L40S GPU, 6 hours)
Phase 2: Distill teacher onto vision student with DAgger (32 L40S GPUs, 24 hours)
Phase 3: Copy-initialize from the student and fine-tune with GRPO (64 L40S GPUs, 12 hours)
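
Since the reward here is described as a mostly binary task-success signal, the core of a GRPO update reduces to a group-relative advantage plus a PPO-style clipped surrogate. The following is a minimal sketch under that reading; tensor shapes, function names, and the clipping constant are illustrative assumptions, not the paper's implementation:

import torch

def grpo_advantages(success: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """success: (num_groups, group_size) tensor of 0/1 task outcomes,
    where each group shares one randomized door instance."""
    success = success.float()
    mean = success.mean(dim=1, keepdim=True)
    std = success.std(dim=1, keepdim=True)
    # Rollouts that succeed where their siblings fail get positive advantage;
    # groups that all succeed or all fail contribute no gradient signal.
    return (success - mean) / (std + eps)

def grpo_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     adv: torch.Tensor, clip: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate on group-relative advantages."""
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    return -torch.min(ratio * adv, clipped * adv).mean()

# Example: two randomized door instances, four rollouts each.
success = torch.tensor([[1., 0., 0., 1.],
                        [0., 0., 0., 1.]])
print(grpo_advantages(success))

Because advantages are normalized within each group of rollouts on the same randomized door, no learned value function is required, which pairs naturally with a sparse binary reward.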

Real-World Generalization

Our policy demonstrates robust generalization to diverse real-world scenarios, successfully manipulating various door types under different environmental conditions.

Diverse Handle Shapes

Diverse Visuals

Diverse Locations and Door Types

Up to 31.7% Faster than Humans

On average, DoorMan completes the door-opening task up to 7.15 seconds faster than human teleoperators, who struggle with skillful loco-manipulation of articulated objects.

Inexperienced teleoperator fails to open the door.
Teleoperator struggles to grasp door handle.
Robot fails to balance under articulation constraint and falls.

Failure Cases

While our policy demonstrates strong performance across diverse scenarios, we observe failure modes that highlight areas for future improvement. Common failure patterns include unobserved disturbances, distance estimation errors, and difficulties with unmodeled environmental states.

Stuck on door frame
Inaccurate distance estimation
Unmodeled articulation state

Abstract

Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.
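
The staged-reset exploration strategy mentioned in the abstract is not detailed on this page. As one plausible, heavily hedged reading (the stage names and schedule below are assumptions made purely for illustration, not the authors' method), episodes could reset not only to the initial state but sometimes into later task stages, so the long-horizon task is explored from every phase:

import random

STAGES = ["walk_to_door", "grasp_handle", "open_door", "walk_through"]

def sample_reset_stage(rng: random.Random, progress: float) -> str:
    """Pick which task stage an episode resets into. Early in training,
    bias resets toward later stages that are hard to reach by exploration;
    as training matures (progress -> 1.0), reset mostly from the start so
    the full horizon is practiced end to end. The schedule is assumed."""
    p_start = 0.25 + 0.75 * progress
    if rng.random() < p_start:
        return STAGES[0]
    return rng.choice(STAGES[1:])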

Citation

@misc{xue2025openingsimtorealdoorhumanoid,
  title={Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer},
  author={Haoru Xue and Tairan He and Zi Wang and Qingwei Ben and Wenli Xiao and Zhengyi Luo and Xingye Da and Fernando Castañeda and Guanya Shi and Shankar Sastry and Linxi "Jim" Fan and Yuke Zhu},
  year={2025},
  eprint={2512.01061},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2512.01061}
}
Interested in the simulation rendering / website template?
  • Website Source Code: The website source code can be found here. The CLAUDE.md file contains comprehensive documentation that enables coding agents to understand and adapt this website template for other projects.
  • Simulation Rendering with IsaacSim: All simulation rendering is done using IsaacSim. The workflow involves:
    1. Recording rollouts as USD animation files following IsaacLab's animation recording guide
    2. Opening the USD files interactively in IsaacSim
    3. Generating camera flyby trajectories using the Animation Curve extension
    4. Recording final videos using the Movie Capture extension
  • Contact: For further guidance on the website implementation or simulation rendering pipeline, please contact Haoru Xue at haoru-xue@berkeley.edu.