In this paper, we address the challenge of enabling accurate and robust perception for unmanned maritime operations. Our approach integrates data from multiple sensors, including cameras and radars, to overcome the limitations of traditional sensor fusion methods. We propose a novel cross-attention transformer-based multi-modal sensor fusion technique tailored specifically for marine navigation. The method leverages deep learning to fuse complex data modalities effectively and reconstructs a comprehensive bird's-eye view (BEV) of the environment from multi-view RGB and LWIR images. Our experimental results demonstrate the method's effectiveness in a range of challenging scenarios, contributing to the development of more advanced and reliable marine autonomous systems. The approach exploits multi-modal data, incorporates temporal fusion, and remains robust to sensor-calibration errors, marking a notable advancement in autonomous maritime technology.
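
To make the cross-attention fusion step concrete, the following is a minimal sketch of how RGB and LWIR feature tokens could be fused before projection onto a BEV grid. The module name `CrossModalFusion`, the token dimensions, and the choice of RGB tokens as queries with LWIR tokens as keys/values are illustrative assumptions, not the exact architecture described in the paper.

```python
# Minimal sketch: cross-attention fusion of RGB and LWIR feature tokens.
# Names, dimensions, and the query/key-value assignment are assumptions for illustration.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse multi-view RGB and LWIR feature tokens with cross-attention."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        # RGB tokens act as queries; LWIR tokens provide keys and values.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, rgb_tokens: torch.Tensor, lwir_tokens: torch.Tensor) -> torch.Tensor:
        # rgb_tokens, lwir_tokens: (batch, num_tokens, embed_dim)
        attended, _ = self.cross_attn(query=rgb_tokens, key=lwir_tokens, value=lwir_tokens)
        fused = self.norm1(rgb_tokens + attended)        # residual connection
        return self.norm2(fused + self.ffn(fused))       # position-wise feed-forward


if __name__ == "__main__":
    fusion = CrossModalFusion()
    rgb = torch.randn(2, 400, 256)     # flattened multi-view RGB features (hypothetical shape)
    lwir = torch.randn(2, 400, 256)    # corresponding LWIR features
    fused_tokens = fusion(rgb, lwir)   # tokens ready for lifting onto a BEV grid
    print(fused_tokens.shape)          # torch.Size([2, 400, 256])
```

Treating one modality as the query and the other as key/value lets the network attend to thermal cues where the visible-light signal is weak (e.g., glare or low light), which is one common motivation for cross-attention fusion in this setting.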