Zhengyang Shen

Zhengyang works on high-fidelity hand-object interaction.
Authored Publications
    ESAM++: Efficient Online 3D Perception on the Edge
    Qin Liu
    Lavisha Aggarwal
    Vikas Bahirwani
    Lin Li
    Aleksander Holynski
    Saptarashmi Bandyopadhyay
    Marc Niethammer
    Ehsan Adeli
    Andrea Colaco
    2025
    Online 3D scene perception in real time is critical for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging a 2D visual foundation model (VFM) with efficient 3D query lifting and merging. However, ESAM depends on a computationally expensive sparse 3D U-Net for point cloud feature extraction, which we identify as the primary efficiency bottleneck. In this paper, we propose a lightweight and scalable alternative for online 3D scene perception tailored to edge devices. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks (ScanNet, ScanNet200, SceneNN, and 3RScan), demonstrating that our model achieves competitive accuracy with up to 3× faster inference and a 3× smaller model size compared to ESAM, enabling practical deployment in real-world edge scenarios. Code and models will be released.
    Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images
    Danhang "Danny" Tang
    Franziska Müller
    Jonathan Taylor
    Mingsong Dou
    Sasa Petrovic
    Thabo Beeler
    Tze Ho Elden Tse
    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023), pp. 14666-14677
    We propose a novel transformer-based framework that reconstructs two high-fidelity hands from multi-view RGB images. Unlike existing hand pose estimation methods, where one typically trains a deep network to regress hand model parameters from a single RGB image, we consider a more challenging problem setting where we directly regress the absolute root poses of two hands with extended forearms at high resolution from an egocentric view. As existing datasets are either infeasible for egocentric viewpoints or lack background variations, we create a large-scale synthetic dataset with diverse scenarios and collect a real dataset from a multi-calibrated camera setup to verify our proposed multi-view image feature fusion strategy. To make the reconstruction physically plausible, we propose two strategies: (i) a coarse-to-fine spectral graph convolution decoder to smoothen the meshes during upsampling and (ii) an optimisation-based refinement stage at inference to prevent self-penetrations. Through extensive quantitative and qualitative evaluations, we show that our framework is able to produce realistic two-hand reconstructions and demonstrate the generalisation of synthetic-trained models to real data, as well as real-time AR/VR applications.