MoCap4D: A Synchronized Dataset Bridging Fine-grained Motion Tracking and High-Fidelity Multi-View Video

Boyan Li, Zijian Cao, Dayou Zhang, Shufang Lin, Zhicheng Liang, Fangxin Wang

CUHKSZ

Paper | Code (coming soon) | Dataset

* Note: The current dataset link provides only a partial demo. The complete dataset will be released upon acceptance of the paper.

Abstract

Existing datasets struggle to capture rich visual information and high-precision motion data at the same time. MoCap4D bridges this gap by synchronizing, at the hardware level, sub-centimeter motion capture from 27 inertial sensors with a 24-camera high-definition array.

Large-scale & Diverse

Multimodal data capturing a wide range of human activities.

Precise Alignment

Sub-millisecond hardware synchronization protocol.

Comprehensive Benchmarks

Extensive evaluations on 3D pose estimation and motion prediction.

Dataset Highlights

20 Participants

12 Activity Scenes

7.8 Hours

20.7M Frames

Comparison with Existing Datasets

Capture Setup & Annotation Pipeline

The full data collection workflow is organized into two paired stages: synchronized multi-view capture and annotation-ready post-processing.

Hardware Synchronization

Our setup features a 360-degree array of 24 cameras synchronized with a wearable optical-inertial motion capture system. We achieve hardware-level time synchronization with less than 1 ms latency, ensuring precise alignment between the high-definition video feeds and the tracked skeleton data.
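The page does not describe how video frames and skeleton samples are paired once both streams share a clock. The following is a minimal sketch of one common approach, nearest-timestamp matching. It assumes sorted timestamps in seconds on a common clock and a tolerance of 1 ms taken from the bound stated above; `match_nearest` and its parameters are illustrative names, not part of the dataset tooling.

```python
from bisect import bisect_left

def match_nearest(frame_ts, mocap_ts, tol=0.001):
    """For each video frame timestamp, return the index of the nearest mocap
    sample, or None if the nearest sample is more than `tol` seconds away.

    Both lists hold seconds on a shared clock and must be sorted ascending.
    """
    matches = []
    for t in frame_ts:
        i = bisect_left(mocap_ts, t)
        # The nearest sample is one of the two neighbours of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(mocap_ts)]
        if not candidates:
            matches.append(None)
            continue
        best = min(candidates, key=lambda j: abs(mocap_ts[j] - t))
        matches.append(best if abs(mocap_ts[best] - t) <= tol else None)
    return matches
```

Frames that return `None` have no skeleton sample within the tolerance and could be dropped or flagged for inspection.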

Figure: 24-camera circular rig with paired upper and lower viewpoints for each capture direction.

Data Annotation Pipeline

The post-processing pipeline performs timestamp normalization, human matting, and unified skeletal marker alignment across all modalities to produce clean, high-fidelity annotations for downstream tasks.
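As an illustration of the timestamp normalization step only, here is a minimal sketch. It assumes each stream carries integer microsecond timestamps on a shared clock, which the page does not confirm; the function name, the stream names, and the choice of common start are all hypothetical.

```python
def normalize_timestamps(streams_us):
    """Map raw microsecond timestamps to seconds since a common start.

    `streams_us` maps a stream name to a sorted list of integer microsecond
    timestamps on a shared clock. The common start is the latest first
    timestamp, so every non-empty stream has data from t = 0 onward.
    Samples recorded before the common start are dropped.
    """
    start = max(ts[0] for ts in streams_us.values() if ts)
    return {
        name: [(t - start) / 1_000_000 for t in ts if t >= start]
        for name, ts in streams_us.items()
    }
```

Using the latest first timestamp as the origin trims the lead-in of streams that started recording early, so every modality begins at the same instant.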

Extensive Activities

MoCap4D covers a wide spectrum of complex human motions across 6 major categories.

Basketball

Frisbee

Guitar

Wave (fundamental)

Benchmarks

3D Pose Estimation

We evaluated 19 state-of-the-art algorithms. Methods combining Transformers with GCNs, such as MotionAGFormer, achieved the best performance, with a mean per-joint position error (MPJPE) of 42.5 mm.
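For readers unfamiliar with the metric, MPJPE is the Euclidean distance between predicted and ground-truth joint positions, averaged over all joints and frames. The sketch below is a dependency-free illustration and assumes positions are given in millimetres. It applies no root alignment or other protocol-specific preprocessing, and the page does not state which evaluation protocol the benchmark uses.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error between two pose sequences.

    `pred` and `gt` are lists of frames; each frame is a list of (x, y, z)
    joint positions in millimetres. Returns the mean Euclidean distance in mm.
    """
    if len(pred) != len(gt):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for frame_p, frame_g in zip(pred, gt):
        if len(frame_p) != len(frame_g):
            raise ValueError("frames must have the same number of joints")
        for p, g in zip(frame_p, frame_g):
            total += math.dist(p, g)  # Euclidean distance for one joint
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count
```

For example, a single frame whose two joints are off by 5 mm and 12 mm has an MPJPE of 8.5 mm.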

Motion Prediction

We report baseline motion-prediction results at several prediction horizons up to 400 ms. Accuracy varies with the horizon, and these baselines serve as reference points for future research.
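Evaluating at a horizon given in milliseconds requires mapping it to a frame in the predicted sequence. The sketch below shows that conversion. The frame rate and the specific horizons are placeholders, since the page states only the 400 ms upper bound, and `horizon_frame_indices` is a hypothetical helper.

```python
def horizon_frame_indices(horizons_ms, fps):
    """Convert prediction horizons in milliseconds to 0-based frame offsets.

    Offset 0 is the first predicted frame, i.e. 1/fps seconds after the last
    observed frame. Horizons are rounded to the nearest whole frame.
    """
    indices = {}
    for h in horizons_ms:
        n_frames = round(h * fps / 1000)
        if n_frames < 1:
            raise ValueError(f"{h} ms is shorter than one frame at {fps} fps")
        indices[h] = n_frames - 1
    return indices
```

With an assumed rate of 25 fps, a 400 ms horizon corresponds to the tenth predicted frame, at offset 9. Errors such as MPJPE can then be reported per horizon by indexing the predicted and ground-truth sequences at these offsets.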

Download & Citation

Data will be provided in standardized formats including BVH and FBX for motion capture data, alongside calibrated multi-view video files.
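Because BVH is a plain-text format, the frame count and sampling interval of a recording can be read without special tooling. The sketch below parses the standard `MOTION` header (`Frames:` and `Frame Time:` lines) from the text of a BVH file. The function name is illustrative, and nothing here assumes a particular file layout for the released dataset.

```python
def read_bvh_motion_header(text):
    """Return (frame_count, frame_time_seconds) from the MOTION section of BVH text."""
    lines = [line.strip() for line in text.splitlines()]
    try:
        start = lines.index("MOTION")
    except ValueError:
        raise ValueError("no MOTION section found") from None
    frames = frame_time = None
    for line in lines[start + 1:]:
        if line.startswith("Frames:"):
            frames = int(line.split(":", 1)[1])
        elif line.startswith("Frame Time:"):
            frame_time = float(line.split(":", 1)[1])
        if frames is not None and frame_time is not None:
            return frames, frame_time
    raise ValueError("incomplete MOTION header")
```

The reciprocal of the returned frame time gives the capture rate of the motion data, which is useful when pairing it with video frames.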

BibTeX

@inproceedings{li2026mocap4d,
  title={MoCap4D: A Synchronized Dataset Bridging Fine-grained Motion Tracking and High-Fidelity Multi-View Video},
  author={Li, Boyan and Cao, Zijian and Zhang, Dayou and Lin, Shufang and Liang, Zhicheng and Wang, Fangxin},
  booktitle={Proceedings of the 34th ACM International Conference on Multimedia},
  year={2026}
}