Continuous Heart Rate (HR) monitoring based on photoplethysmography (PPG) sensors is a crucial feature of almost all wrist-worn devices. However, arm movements create Motion Artifacts (MA) that degrade the performance of PPG-based HR tracking. This problem is commonly tackled by correlating the recorded accelerometer data with the PPG signal and using that correlation to clean it. Automatic fusion techniques based on Deep Learning (DL) algorithms have been proposed for this purpose, but they are considered too large and complex to be deployed on wearable devices. The current work presents a novel and lightweight DL architecture, PULSE, composed of temporal convolutions and feature-level multi-head cross-attention to improve the effectiveness of sensor fusion. Moreover, we propose a relation-based knowledge distillation mechanism to transfer PULSE's knowledge to a student network that replaces the attention module with modality-wise convolutions and mimics the teacher's performance with 5x fewer parameters. The teacher and student are evaluated on the most extensive available dataset, PPG-DaLiA, where PULSE reduces the mean absolute error (MAE) by 8.2% compared to the best state-of-the-art model while simultaneously reducing the inference latency by 1.6x. The student model is further compressed using post-training quantization and deployed on two microcontrollers, demonstrating its suitability for real-time execution: it achieves a close-to-state-of-the-art MAE of 4.81 BPM (+0.40 BPM) with a 10.9x lower memory footprint of 37.9 kB and 45.9x lower energy consumption (0.577 mJ).
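As an illustration of the fusion idea only, the following is a minimal NumPy sketch of generic multi-head cross-attention between two feature sequences. It is not the PULSE implementation: the function name `cross_attention`, the choice of PPG features as queries and accelerometer features as keys and values, the tensor shapes, and the random projection matrices are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(ppg_feat, acc_feat, w_q, w_k, w_v, num_heads):
    """Generic multi-head cross-attention (hypothetical helper).

    ppg_feat: (T_q, d) features used as queries (assumed: PPG branch).
    acc_feat: (T_k, d) features used as keys/values (assumed: accelerometer branch).
    w_q, w_k, w_v: (d, d) projection matrices; d must be divisible by num_heads.
    Returns the fused features (T_q, d) and attention weights (heads, T_q, T_k).
    """
    t_q, d = ppg_feat.shape
    t_k = acc_feat.shape[0]
    d_h = d // num_heads
    # Project, then split the feature axis into heads: (heads, T, d_h).
    q = (ppg_feat @ w_q).reshape(t_q, num_heads, d_h).transpose(1, 0, 2)
    k = (acc_feat @ w_k).reshape(t_k, num_heads, d_h).transpose(1, 0, 2)
    v = (acc_feat @ w_v).reshape(t_k, num_heads, d_h).transpose(1, 0, 2)
    # Scaled dot-product scores between every query and every key.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)
    weights = softmax(scores, axis=-1)
    # Weighted sum of values, then merge the heads back into one feature axis.
    fused = (weights @ v).transpose(1, 0, 2).reshape(t_q, d)
    return fused, weights

# Usage with made-up sizes: 8 time steps, 16 features, 4 heads.
rng = np.random.default_rng(0)
ppg = rng.standard_normal((8, 16))
acc = rng.standard_normal((8, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 16)) for _ in range(3))
fused, weights = cross_attention(ppg, acc, w_q, w_k, w_v, num_heads=4)
print(fused.shape, weights.shape)
```

In this sketch each PPG time step attends over all accelerometer time steps, so the output keeps the PPG sequence length while mixing in motion information. A trained model would learn the projection matrices rather than sampling them randomly.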