What is it about?
Depth estimation is critical for self-driving cars, robots and AR devices. Expensive active sensors such as LiDAR and millimeter-wave radar give precise distance data, while cheap single cameras produce unreliable depth results. We created a new tool called WiViD that combines ordinary Wi-Fi signal readings and regular camera photos with a popular AI image generation technique (diffusion models) to calculate accurate depth information without costly hardware. Our system uses separate modules to extract useful information from Wi-Fi signals and camera pictures, plus a dedicated matching module that aligns and fuses the two types of data. Tests on real-world scenes show that our method predicts object distances more accurately than camera-only approaches, cutting a standard depth error measure (absolute relative error) by about one-third and improving reliability for everyday smart device use.
Why is it important?
This work addresses a gap in depth-sensing research at a time when consumer robotics, mass-market autonomous vehicles, and lightweight AR hardware all need low-cost 3D perception. Prior depth solutions face a rigid tradeoff: expensive LiDAR and mmWave sensors deliver precise distance readings but cost too much for mass deployment, while cheap monocular cameras suffer from fundamental depth ambiguity and fail on textureless surfaces, in dim lighting, and under occlusion. Existing multimodal fusion methods either ignore the full complex-valued Wi-Fi signal or lack robust cross-modal alignment, and almost no frameworks combine commodity Wi-Fi channel state information (CSI) and visual inputs with diffusion models for depth prediction. WiViD is designed to close that gap. The work is timely because of the industry-wide push to eliminate pricey LiDAR from consumer devices while retaining reliable 3D perception.
Three innovations distinguish this research from the prior state of the art:
1. We design the first Complex-Valued CSI Encoder (CCE) to extract spatio-temporal phase and amplitude features from raw Wi-Fi signals. Past Wi-Fi sensing pipelines discard complex-domain information; keeping it provides stable physical distance cues without costly radar hardware.
2. The novel Multi-Modal Alignment Bidirectional Cross-Attention (MABA) mechanism matches features in both directions between the Wi-Fi and visual streams. This fixes the misalignment seen in one-way cross-attention fusion and produces consistent, sharp depth maps.
3. We build the first diffusion-based multimodal depth estimation framework that pairs commercial off-the-shelf Wi-Fi and camera data. It uses diffusion's iterative denoising refinement to correct the blurry, error-prone outputs common to traditional regression networks.
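To make the idea of amplitude and phase features and two-way feature matching more concrete, here is a minimal NumPy sketch. It is not the CCE or MABA implementation from the paper: the function names, tensor shapes, random inputs, and the absence of learned projection weights are all simplifying assumptions made for illustration.

```python
import numpy as np

def csi_features(csi):
    """Stack amplitude and phase of complex CSI: (T, S) complex -> (T, 2*S) real.
    Hypothetical stand-in for a learned complex-valued encoder."""
    return np.concatenate([np.abs(csi), np.angle(csi)], axis=-1)

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over all context rows.
    No learned Q/K/V projections are used here (a simplification)."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context, weights

def bidirectional_cross_attention(wifi, vision):
    """Update each stream with information attended from the other (residual form)."""
    wifi_ctx, w2v = cross_attention(wifi, vision)    # Wi-Fi queries, vision context
    vision_ctx, v2w = cross_attention(vision, wifi)  # vision queries, Wi-Fi context
    return wifi + wifi_ctx, vision + vision_ctx, w2v, v2w

# Made-up inputs: 8 Wi-Fi packets x 4 subcarriers, and 16 image patches of dim 8.
rng = np.random.default_rng(0)
csi = rng.normal(size=(8, 4)) + 1j * rng.normal(size=(8, 4))
wifi_tokens = csi_features(csi)              # shape (8, 8)
vision_tokens = rng.normal(size=(16, 8))     # shape (16, 8)

wifi_out, vision_out, w2v, v2w = bidirectional_cross_attention(wifi_tokens, vision_tokens)
print(wifi_out.shape, vision_out.shape)
```

In a one-way design only one of the two `cross_attention` calls would run, so one stream would never be updated with information from the other. The sketch runs both directions, which is the general idea behind bidirectional alignment.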
Real-world tests show that WiViD reduces Absolute Relative Error by 33.3% and Square Relative Error by 9.5% compared with leading monocular depth models. The practical impact is that developers can build accurate depth-sensing systems using standard household Wi-Fi modules and basic cameras, lowering hardware budgets for delivery robots, low-end self-driving vehicles, indoor AR headsets, and smart home environmental perception. The architecture also offers a flexible, low-cost multimodal fusion template that researchers and engineers can adapt to future lightweight sensing tasks, opening an accessible route to high-precision depth estimation without premium active sensors.
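For readers unfamiliar with the two error metrics named above, the sketch below computes them as they are commonly defined in the depth-estimation literature. Whether the paper uses exactly these definitions is an assumption, and every number in the example is made up for illustration rather than taken from the paper's experiments.

```python
def abs_rel(pred, gt):
    """Absolute Relative Error: mean of |pred - gt| / gt over all pixels."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def sq_rel(pred, gt):
    """Square Relative Error: mean of (pred - gt)^2 / gt over all pixels."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)

def relative_reduction(baseline, improved):
    """Fractional drop in an error metric relative to a baseline value."""
    return (baseline - improved) / baseline

# Illustrative depths in metres (hypothetical values, not from the paper).
gt = [2.0, 4.0, 5.0]
pred = [2.2, 3.6, 5.5]

print(f"AbsRel: {abs_rel(pred, gt):.3f}")
print(f"SqRel:  {sq_rel(pred, gt):.3f}")

# A hypothetical baseline AbsRel of 0.15 falling to 0.10 is a one-third reduction.
print(f"Reduction: {relative_reduction(0.15, 0.10):.1%}")
```

Both metrics divide by the true depth, so an error of half a metre counts for more on a nearby object than on a distant one. Lower values are better for both.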
Perspectives
From my perspective, this research started from a practical pain point I repeatedly observed in real-world intelligent system experiments: high-precision depth sensing relies heavily on costly LiDAR or millimeter-wave equipment, which holds back large-scale adoption of consumer robots and lightweight AR devices, while single-camera depth prediction is unstable in complex indoor and outdoor scenes. I had long wondered whether ubiquitous, cheap Wi-Fi signals could compensate for the inherent weaknesses of monocular vision, yet most existing Wi-Fi sensing work focuses only on human detection or localization and rarely combines Wi-Fi with visual depth estimation. Designing the complex-valued CSI encoder and the bidirectional cross-modal alignment module was challenging at first: fusing two completely different physical signal modalities caused frequent feature mismatch and training instability during model debugging. Seeing the diffusion framework steadily reduce depth prediction errors after iterative optimization was incredibly rewarding. Personally, I hope this paper inspires more researchers to tap into underutilized low-cost wireless sensing resources instead of only pursuing high-end active sensors. Beyond the theoretical contributions, I hope our WiViD framework will lower the barrier to high-precision 3D perception for small device developers, and make affordable, reliable depth sensing available to ordinary smart devices rather than only high-end autonomous driving equipment. I also look forward to follow-up studies that extend this Wi-Fi-vision diffusion fusion idea to other perception tasks, and I'm glad our team's cross-modal exploration can offer a new, cost-effective research direction for the community.
Shijie Cheng
Tsinghua University
Read the Original
This page is a summary of: Multimodal Diffusion-Based Depth Estimation Framework with Wi-Fi and Vision, ACM Transactions on Sensor Networks, June 2026, ACM (Association for Computing Machinery),
DOI: 10.1145/3821417.