Sheekar Banerjee - Authorea

Robot vision is the technique of enabling robots to process visual data from their environment through a combination of camera hardware and computer algorithms. Advanced deep neural networks have played a vital role in helping robots make sense of complex visual data under varying conditions, especially in object detection and continuous tracking. In this research, we introduce a new backbone neural network for the conventional YOLO algorithm, which we name SBHK-Net. The network substantially improves the performance of the existing YOLO algorithm, which suggests strong potential for improving the tracking and recognition accuracy of other conventional algorithms in the robot vision industry as well. It achieves the highest accuracy, 59.2% AP, among all known real-time object detectors running at 30 FPS or above on an RTX3060 GPU, and it outperforms all other known object detectors in the range of 5 FPS to 160 FPS. We used YOLOv7 as the reference point for the core research. The SBHK-Net core object detector (56 FPS RTX3060, 56.4% AP) outperforms both the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP) and the convolutional detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS A100, 55.2% AP) in speed and accuracy. It also surpasses a number of other object detectors in speed and accuracy, including DINO-5scale-R50, ViT-Adapter-B, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, YOLOR, and YOLOX. The source code of this research is available at https://github.com/ac005sheekar/SBHKNet.
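To make the comparison with the two Cascade-Mask R-CNN baselines concrete, the quoted figures can be turned into a speed ratio and an AP difference. The short Python sketch below uses only the FPS and AP values stated in the abstract; the `Detector` class and `compare` helper are hypothetical names introduced for illustration and are not part of the SBHK-Net code. Note that the baselines were measured on an A100 and SBHK-Net on an RTX3060, so the speed ratio is not a like-for-like hardware comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    """Reported figures for one detector (values taken from the abstract)."""
    name: str
    fps: float   # frames per second on the stated GPU
    ap: float    # average precision, in percent
    gpu: str     # GPU the FPS figure was reported on


def compare(ours: Detector, other: Detector) -> tuple[float, float]:
    """Return (speed ratio, AP difference in percentage points)."""
    speed_ratio = ours.fps / other.fps
    ap_gain = ours.ap - other.ap
    return round(speed_ratio, 2), round(ap_gain, 1)


# Figures as quoted in the abstract; hardware differs between rows.
sbhk = Detector("SBHK-Net core", fps=56.0, ap=56.4, gpu="RTX3060")
swin = Detector("SWIN-L Cascade-Mask R-CNN", fps=9.2, ap=53.9, gpu="A100")
convnext = Detector("ConvNeXt-XL Cascade-Mask R-CNN", fps=8.6, ap=55.2, gpu="A100")

if __name__ == "__main__":
    for baseline in (swin, convnext):
        ratio, gain = compare(sbhk, baseline)
        print(f"{baseline.name}: {ratio}x the FPS, {gain:+} AP points")
```

On the quoted numbers this gives roughly a 6.1x frame rate with a 2.5-point AP gain over the SWIN-L baseline, and roughly a 6.5x frame rate with a 1.2-point AP gain over the ConvNeXt-XL baseline.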