Automated lip-reading systems have the potential to greatly improve speech recognition and communication for individuals with speech or hearing disabilities. People with aphasia, aphonia, dysphonia, other voice disorders, or dysphagia (difficulty swallowing) have limited speaking ability and can be assisted by lip-reading technology. Traditional approaches to lip reading have relied on hand-crafted features and statistical modeling techniques, which are limited in their ability to capture the complex spatiotemporal dynamics of lip movements. Deep learning approaches have shown promise in addressing these limitations by learning features directly from data and have achieved state-of-the-art results in various speech-related tasks. In this paper, a 3D Convolutional Neural Network (3D-CNN) is proposed as an approach to automated lip reading. The system takes a video of a person speaking and processes it through a 3D convolutional layer to extract spatiotemporal features from the video frames. It uses deep learning algorithms to learn the mapping between lip movements and their corresponding phonemes, enabling it to recognize spoken words. The approach is evaluated on MIRACL-VC1, a dataset of visual recordings of spoken words containing 10 words with multiple instances of each. The proposed model achieves 99.0% training accuracy on this dataset but only 61.3% testing accuracy, indicating overfitting and strong speaker dependency. To address this, a second dataset was created by the author from videos of a single speaker; on this dataset, the model achieved 89.0% training accuracy and 83.0% testing accuracy. Both models are then evaluated on user-input video. The proposed approach has applications in speech therapy, speech recognition, and translation for those with speech and voice disabilities. The human participant in this research is the researcher/author.
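
For illustration, a 3D-CNN word classifier of the kind described above could be sketched as follows. This is a minimal sketch only, assuming PyTorch, grayscale clips of 20 frames at 64x64 pixels, and the 10-word vocabulary of MIRACL-VC1; the layer sizes and hyperparameters are hypothetical and not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class LipReader3DCNN(nn.Module):
        """Toy 3D-CNN word classifier; all sizes are illustrative assumptions."""
        def __init__(self, num_classes=10):           # 10 words, as in MIRACL-VC1
            super().__init__()
            self.features = nn.Sequential(
                # The 3D kernel slides over (frames, height, width), so it
                # captures spatiotemporal lip-movement patterns directly.
                nn.Conv3d(1, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool spatially only
                nn.Conv3d(32, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=(2, 2, 2)),  # pool in time and space
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 10 * 16 * 16, 128),    # matches 20x64x64 inputs
                nn.ReLU(),
                nn.Dropout(0.5),                      # guards against overfitting
                nn.Linear(128, num_classes),
            )

        def forward(self, x):
            # x: (batch, channels=1, frames=20, height=64, width=64)
            return self.classifier(self.features(x))

    model = LipReader3DCNN()
    clip = torch.randn(4, 1, 20, 64, 64)              # dummy batch of 4 clips
    print(model(clip).shape)                          # -> torch.Size([4, 10])

Training such a model would typically minimize a cross-entropy loss over the word labels; the dropout layer is included here because the large gap between the reported training and testing accuracies points to overfitting.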