As artificial intelligence systems increasingly interact with humans in physical environments, understanding the causal relationship between visual perception and full-body motor responses becomes critical for safe and natural human-AI collaboration. However, existing motion datasets either lack visual context or are constrained by marker occlusion and limited capture volumes in large-scale scenarios.
We present VRMotion, a large-scale multimodal dataset that captures temporally aligned egocentric visual stimuli and corresponding full-body kinematic responses. We leverage VR environments to safely simulate diverse task scenarios, while an omnidirectional treadmill combined with a 27-sensor IMU system enables occlusion-free capture of unconstrained locomotion with consistent precision.
- Precisely aligned egocentric video (30fps) and full-body IMU data (60Hz), capturing the causal relationship between visual stimuli and motor responses (see the alignment sketch after this list).
- Omnidirectional treadmill enabling natural walking patterns without physical space constraints or marker-occlusion issues.
- Three cognitive complexity levels: Directive (Beat Saber), Suggestive (Table Tennis), and Explorative (Blade & Sorcery).
- 27 wireless IMU sensors with 0.1° rotational accuracy, capturing fine-grained finger movements and full-body kinematics.
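A minimal sketch of the temporal alignment implied by these rates, assuming simple index arithmetic (the released metadata may instead carry explicit timestamps; `imu_indices_for_frame` is a hypothetical helper):

```python
import numpy as np

VIDEO_FPS = 30   # egocentric video rate
IMU_HZ = 60      # full-body IMU rate
SAMPLES_PER_FRAME = IMU_HZ // VIDEO_FPS  # 2 IMU samples per video frame

def imu_indices_for_frame(frame_idx: int) -> np.ndarray:
    """Return the 60Hz IMU sample indices covered by one 30fps video frame."""
    start = frame_idx * SAMPLES_PER_FRAME
    return np.arange(start, start + SAMPLES_PER_FRAME)

print(imu_indices_for_frame(10))  # [20 21]: frame 10 pairs with IMU samples 20 and 21
```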
- IMU system: 27 wireless IMU sensors (gyroscope, accelerometer, magnetometer) at 60Hz with 0.1° accuracy
- HMD: Meta Quest 3 / Pico 4 Ultra with 2160×2160 per eye at 90Hz
- Omnidirectional treadmill: Virtuix Omni One capturing walking direction, speed, and acceleration at 100Hz
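A hypothetical per-timestamp record combining the three streams above; shapes and field names are illustrative, not the released schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VRMotionSample:
    """Illustrative container for one aligned capture instant (assumed layout)."""
    rgb: np.ndarray        # egocentric video frame from the HMD, e.g. (H, W, 3)
    imu: np.ndarray        # 27 sensors x 9 channels (gyro, accel, mag), e.g. (27, 9)
    treadmill: np.ndarray  # walking direction, speed, acceleration, e.g. (3,)
    timestamp: float       # shared clock used to align the streams
```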
We evaluate 10 distinct combinations of visual encoders and temporal models:
| Visual Encoder | Temporal Model | MPJPE (mm) ↓ | PA-MPJPE (mm) ↓ | Vel. Error (mm/frame) ↓ |
|---|---|---|---|---|
| Qwen2.5-VL | LSTM | 44.25 | 40.47 | 10.87 |
| OneVision | LSTM | 48.17 | 44.36 | 10.05 |
| DINOv2 | LSTM | 54.66 | 48.63 | 12.38 |
| ResNet | LSTM | 63.48 | 52.92 | 12.43 |
| VideoMAE | LSTM | 90.46 | 64.83 | 15.04 |
| Qwen2.5-VL | ST-GCN | 127.62 | 96.53 | 48.49 |
| OneVision | ST-GCN | 202.84 | 99.91 | 104.26 |
| DINOv2 | ST-GCN | 215.46 | 113.36 | 108.26 |
| VideoMAE | ST-GCN | 268.80 | 144.12 | 108.44 |
| ResNet | ST-GCN | 409.68 | 252.03 | 206.56 |
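For reference on the metrics reported above: MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, and PA-MPJPE measures the same error after a rigid Procrustes alignment (rotation, translation, scale). A minimal NumPy sketch, not the official evaluation code:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over (J, 3) arrays, in the input units."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred onto gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)   # cross-covariance between the point sets
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)

# Example with random 27-joint poses (units follow the inputs, e.g. mm).
pred, gt = np.random.randn(27, 3), np.random.randn(27, 3)
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```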
Key Finding: LVLM-based encoders (Qwen2.5-VL) achieve state-of-the-art accuracy with 44.25mm MPJPE, significantly outperforming traditional CNNs (63.48mm) and video models (90.46mm). LSTM heads consistently outperform ST-GCN across all visual backbones.
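The best-performing configurations pair a strong visual encoder with an LSTM head. A minimal PyTorch sketch of that encoder-plus-LSTM pattern, with hypothetical feature and joint dimensions (the GitHub repository below contains the reference implementations):

```python
import torch
import torch.nn as nn

class VisualLSTMBaseline(nn.Module):
    """Pre-extracted visual features -> LSTM -> per-frame 3D joint positions.
    Dimensions are illustrative, not the released configuration."""
    def __init__(self, feat_dim=1024, hidden_dim=512, num_joints=27):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_joints * 3)

    def forward(self, feats):             # feats: (B, T, feat_dim)
        h, _ = self.lstm(feats)            # (B, T, hidden_dim)
        joints = self.head(h)              # (B, T, num_joints * 3)
        return joints.view(*joints.shape[:2], -1, 3)  # (B, T, J, 3)

# Example: 2 clips of 16 frames of pre-extracted encoder features.
model = VisualLSTMBaseline()
print(model(torch.randn(2, 16, 1024)).shape)  # torch.Size([2, 16, 27, 3])
```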
Main-Joint Directional Speed Profiles: Directional speed heatmaps, normalized for comparability, reveal how tasks channel effort through specific kinematic pathways. Beat Saber concentrates velocity in hands and forearms: these segments reach about 4.3%–6.8% in normalized directional speed, while most other joints remain below 1%. The pattern reflects score-driven optimization where users favor rapid, localized hand motions and suppress extraneous body movement. Symmetry persists at the speed level, with both hands exhibiting comparable velocities.
Table Tennis departs from this symmetry. The right hand shows substantially higher average speed, while the left hand remains at roughly 3.7%–5.2% of the right, capturing a unilateral control regime. The rest of the body contributes primarily through supportive rotations and balance rather than high-speed distal actions. Blade & Sorcery displays more distributed speeds across hands and feet, indicating coupled upper–lower-limb engagement. The lower-limb activity is particularly pronounced in the directional speeds, with RightFoot and LeftFoot showing significant activity across all spatial dimensions—a direct result of natural locomotion on the omnidirectional treadmill. Similarity between forearm and hand speeds suggests frequent forward thrusts and two-handed manipulations typical of spears and casting-like gestures. These profiles offer compact cues for model design, where feature weighting can reflect task-specific kinematic concentrations.
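A minimal sketch of computing such a normalized directional speed profile from joint trajectories; normalizing by the total so entries read as shares of overall directional speed is an assumption about the paper's exact scheme:

```python
import numpy as np

def directional_speed_profile(positions, fps=60):
    """positions: (T, J, 3) joint trajectories.
    Returns a (J, 3) matrix of mean per-axis speeds, normalized to sum to 1
    so values are comparable across tasks and subjects."""
    vel = np.abs(np.diff(positions, axis=0)) * fps   # (T-1, J, 3) per-axis speed
    profile = vel.mean(axis=0)                       # (J, 3) mean directional speed
    return profile / profile.sum()

# Example with random trajectories for 27 joints over 600 frames.
profile = directional_speed_profile(np.random.randn(600, 27, 3))
print(profile.shape, round(profile.sum(), 6))  # (27, 3) 1.0
```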
Inter-Joint Velocity Correlation: Correlation structure captures coordination and functional coupling. In Beat Saber, strong correlations cluster within intra-limb groups of each arm, especially among manual and digital segments, while associations with the rest of the body are weak. The result corroborates a hand-isolated regime wherein non-manual joints maintain low-variance states. Table Tennis retains elevated intra-arm correlations but exhibits stronger associations linking the right arm, shoulders, and upper legs to the rest of the body, reflecting multi-joint synergies for stroke production, balance, and torso-driven power. Blade & Sorcery presents robust bilateral clustering within both arms and legs, indicative of coordinated whole-body dynamics under purposeful navigation and combat.
These patterns suggest opportunities for predictive modeling. Highly correlated joint groups can share representations to reduce redundancy, while task-conditioned correlation profiles can guide adaptive routing for forecasting speed changes, turning maneuvers, and compound actions. In practice, coupling structure can regularize model design and inform loss shaping to respect biomechanical synergies observed across tasks.
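A minimal sketch of the inter-joint velocity correlation analysis, using Pearson correlation of per-joint speed magnitudes (the paper's exact formulation may differ):

```python
import numpy as np

def joint_velocity_correlation(positions, fps=60):
    """positions: (T, J, 3) joint trajectories.
    Returns a (J, J) Pearson correlation matrix of per-joint speed magnitudes."""
    vel = np.diff(positions, axis=0) * fps      # (T-1, J, 3) velocities
    speed = np.linalg.norm(vel, axis=-1)        # (T-1, J) speed per joint
    return np.corrcoef(speed.T)                 # rows/cols indexed by joint

# Strongly coupled joint groups (e.g. within one arm) appear as bright
# blocks and can share representations in a downstream model.
corr = joint_velocity_correlation(np.random.randn(600, 27, 3))
print(corr.shape)  # (27, 27)
```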
Analysis Insights: Beat Saber exhibits hand-isolated motion (hand speeds 18.3× those of other body joints), Table Tennis shows asymmetric control with right-hand dominance, and Blade & Sorcery demonstrates globally coherent full-body coordination (hand speeds 2.2× those of other body joints) with substantial locomotion integration.
The VRMotion dataset is publicly available on Hugging Face:
```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("strfysy/VRMotion")

# Access the training split
train_data = dataset["train"]
```
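A small follow-up to inspect the loaded split; the printed field names come from the released schema:

```python
# Inspect the schema and a single training sample.
print(train_data.features)
sample = train_data[0]
print(sample.keys())
```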
Reference implementations are available on GitHub:
```bash
# Clone the repository
git clone https://github.com/1530442592-hue/VRMotion-Baselines.git
cd VRMotion-Baselines

# Install dependencies
pip install -r requirements.txt

# Train the Qwen2.5-VL + LSTM model
python scripts/train_qwen25_vl_lstm.py --batch_size 16

# Visualize / evaluate the trained model
python scripts/visualize_qwen25_vl_lstm.py --checkpoint path/to/checkpoint.pth
```
| Dataset | HMD Content | VR-Motion Alignment | Locomotion | VR Environment | Size (Frames) |
|---|---|---|---|---|---|
| CMU Mocap | ✗ | ✗ | ✗ | ✗ | 15.3M |
| HumanEva | ✗ | ✗ | ✗ | ✗ | 0.08M |
| Human3.6M | ✗ | ✗ | ✗ | ✗ | 3.6M |
| VR-Behavior | ✗ | ✗ | ✗ | ✓ | 26M |
| TotalCapture | ✗ | ✗ | ✗ | ✗ | 1.9M |
| EPIC-KITCHENS | ✓ | ✗ | ✗ | ✗ | 11.5M |
| DIP-IMU | ✗ | ✗ | ✗ | ✗ | 0.3M |
| EGO-CH | ✓ | ✗ | ✗ | ✗ | 0.17M |
| Egobody | ✓ | ✗ | ✗ | ✗ | 0.59M |
| VRMN-bD | ✓ | ✗ | ✗ | ✓ | 0.97M |
| Questset | ✗ | ✗ | ✗ | ✓ | N/A |
| Movement & Traffic | ✗ | ✗ | ✗ | ✓ | N/A |
| VRMotion (Ours) | ✓ | ✓ | ✓ | ✓ | 21.6M |
VRMotion is the first dataset capturing the causal relationship between VR visual stimuli and full-body responses through precise temporal alignment.
If you find our dataset or framework useful in your research, please cite our paper:
All data collection procedures were approved by the Institutional Review Board of The Chinese University of Hong Kong, Shenzhen (IRB No. CUHKSZ-D-20250059). All participants provided written informed consent prior to participation.