OOSTraj: Out-of-Sight Trajectory Prediction with Vision-Positioning Denoising

Haichao Zhang1
Yi Xu1
Hongsheng Lu2
Takayuki Shimizu2
Yun Fu1

1Northeastern University
2Toyota Motor North America

In CVPR 2024


Out-of-Sight Trajectory Prediction
A representative illustration of real-world out-of-sight scenarios in autonomous driving. The autonomous vehicle is equipped with a camera (capturing precise visual trajectories, indicated by green dotted arrows) and a mobile signal receiver (capturing noisy sensor trajectories, represented by red dotted arrows) for tracking pedestrians and other vehicles. Pedestrians P1 and P2 are within the camera's field of view, while P3 is entirely out of sight and P4 is obscured by other vehicles. Consequently, P3 and P4 lack captured visual trajectories and are positioned dangerously, potentially crossing into the vehicle's path, posing a risk of collision. The black dotted arrows depict the hypothesized noise-free real trajectories, ideally captured by mobile sensors, contrasting with the actual noisy sensor trajectories (red arrows). The gray area in the figure demarcates the visibility range of the mobile and visual modalities: white indicates no data captured, orange signifies the presence of visual trajectories, and blue represents the availability of mobile trajectories.

Vision-Positioning Model
Overview of the Vision-Positioning Denoising and Predicting Model architecture. This illustration highlights the processing of pedestrian data, where pedestrians P3 and P4 are detectable only by mobile receivers, while P1 and P2 are visible to both camera and mobile receivers. The Camera Parameters Estimator Module utilizes the dual-modality trajectories of in-view pedestrians (like P1 and P2) to analyze the relationship between camera and world coordinates, resulting in a camera matrix embedding. For out-of-sight pedestrians (e.g., P3, P4), their noisy mobile trajectories are first refined by the Mobile Denoising Encoder, producing a denoised signal embedding. This embedding is then merged with the matrix embedding in the Visual Positioning Projection Module, which maps the denoised trajectories into camera coordinates under the denoising loss $\mathcal{L}_{Denoise}$. Finally, the Out-of-Sight Prediction Decoder leverages the denoised visual signals to predict the trajectories of pedestrians not captured by the camera.
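The pipeline above can be sketched in simplified form: paired mobile/visual trajectories of in-view pedestrians (P1, P2) are used to estimate a world-to-camera mapping, and that mapping then projects the denoised mobile trajectory of an out-of-sight pedestrian (P3) into pixel coordinates. The sketch below is a hedged illustration, not the paper's learned modules: a least-squares affine map stands in for the Camera Parameters Estimator, and a moving average stands in for the Mobile Denoising Encoder; all data values are synthetic.

```python
import numpy as np

# Synthetic paired observations for in-view pedestrians (P1, P2):
# world-frame mobile positions and matching camera pixel positions.
rng = np.random.default_rng(0)
world_in_view = rng.uniform(0, 10, size=(20, 2))
A_true = np.array([[50.0, 5.0], [-3.0, 40.0]])   # hidden world->pixel mapping
b_true = np.array([320.0, 240.0])
pixels_in_view = world_in_view @ A_true.T + b_true

def estimate_camera_map(world, pixels):
    """Least-squares affine map world -> pixel: a linear stand-in for the
    learned Camera Parameters Estimator Module."""
    X = np.hstack([world, np.ones((len(world), 1))])  # homogeneous coords
    M, *_ = np.linalg.lstsq(X, pixels, rcond=None)    # shape (3, 2)
    return M

def project_to_camera(world_traj, M):
    """Visual Positioning Projection: map a (denoised) mobile trajectory
    into camera pixel coordinates with the estimated map."""
    X = np.hstack([world_traj, np.ones((len(world_traj), 1))])
    return X @ M

M = estimate_camera_map(world_in_view, pixels_in_view)

# Out-of-sight pedestrian P3: only a noisy mobile trajectory is observed.
# A moving average stands in for the Mobile Denoising Encoder.
noisy_mobile = np.linspace([2.0, 2.0], [8.0, 6.0], 15) + rng.normal(0, 0.3, (15, 2))
kernel = np.ones(3) / 3
denoised = np.stack(
    [np.convolve(noisy_mobile[:, d], kernel, mode="same") for d in range(2)],
    axis=1,
)

# Pixel-space trajectory for P3, ready for the prediction decoder.
visual_traj = project_to_camera(denoised, M)
```

In the actual model these stand-ins are learned jointly, and the denoising is trained without ground-truth clean sensor data; the sketch only shows the data flow from noisy mobile trajectory to camera-coordinate trajectory.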


Trajectory prediction is fundamental in autonomous driving, particularly for understanding pedestrian behavior and enabling proactive decision-making. Existing approaches in this field often assume precise and complete observational data, neglecting the challenges associated with out-of-view objects and the noise inherent in sensor data due to limited camera range, physical obstructions, and the absence of ground truth for denoised sensor data. Such oversights are critical safety concerns, as they can result in missing essential, non-visible objects. To bridge this gap, we present a novel method for out-of-sight trajectory prediction that leverages a vision-positioning technique. Our approach denoises noisy sensor observations in an unsupervised manner and precisely maps sensor-based trajectories of out-of-sight objects into visual trajectories. This method has demonstrated state-of-the-art performance in out-of-sight noisy sensor trajectory denoising and prediction on the Vi-Fi and JRDB datasets. By enhancing trajectory prediction accuracy and addressing the challenges of out-of-sight objects, our work significantly contributes to improving the safety and reliability of autonomous driving in complex environments.


@inproceedings{zhang2024oostraj,
        title={OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising},
        author={Zhang, Haichao and Xu, Yi and Lu, Hongsheng and Shimizu, Takayuki and Fu, Yun},
        booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        year={2024}
}