Introduction

Voice control is among the most intuitive and efficient modes of human-robot interaction. However, a robot that relies on onboard microphones is difficult to address verbally in a noisy environment, e.g. a service robot in a crowded airport. This is an instance of the cocktail party problem,[1,2] often regarded as the holy grail of robotic hearing and perception.[3,4] Various techniques that mimic human sensing mechanisms have been applied, e.g. speaker localization with microphone arrays[5-7] analogous to human ears,[8] voice feature extraction with algorithms[9-11] analogous to processing in the human brain,[12] and sensor fusion of audio and visual modalities.[13,14] These methods alleviate the problem within a limited scope, but the voice of a person at a distance can still be much weaker than noise from nearby sources, because sound intensity typically decreases with the square of the distance. Compared with microphones, optical detection can remotely capture vibration signals at the source while remaining unaffected by the surrounding acoustic noise. As early as 1880, the optical telecommunication pioneer Alexander Graham Bell invented the photophone,[15] the first apparatus to use modulated light to reproduce sound, at a distance of 213 meters. Since then, many intriguing applications of remote sound sensing with light have been developed, especially after the invention of the laser.[16] Laser Doppler vibrometers (LDVs) were developed to remotely probe surface vibrations by interferometry in order to acquire sound.[17-19] Owing to their interferometric nature, their sensitivity to vibrations of smooth surfaces (specular reflection) is satisfactory. However, detection is much more complicated for large scattering surfaces, where the returned light is a mixture of phases; in this case, the detection sensitivity is reduced significantly by speckle (see Supporting Information for an analysis of LDV on rough surfaces).
Additionally, the sophisticated setup of an LDV makes it too costly and bulky for mass deployment in consumer robots. On the other hand, direct intensity modulation measurements can be more sensitive and cost-effective than interferometric methods. Simpler and lower-cost laser microphones[20] have also been attempted to achieve functions similar to those of LDV by monitoring intensity modulation from specular reflections; however, the strict back-reflection requirement calls for a perpendicular, mirror-like surface, and the setup is prone to optical misalignment and fluctuations. Recently, Nassi et al. demonstrated direct passive sound recovery[21] using a photodetector, aimed through a telescope at a light bulb, that measures the minute changes in brightness caused by sound-excited vibrations of the bulb surface. However, this method requires an illuminated bulb next to the speaker. None of the existing optical techniques is suitable for voice-commanding a robot in a cocktail party environment. The REAL system we propose offers better performance on large scattering vocal surfaces, simpler and more affordable construction, and greater adaptability. We illustrate the principle and construction of REAL and demonstrate REAL signals acquired from speakers' face masks and from their throats. Furthermore, the REAL signal can be transcribed by a memory-enabled neural network to enhance voice content in a noisy cocktail party environment. To the best of our knowledge, REAL is the first optical-channel solution with the potential to solve the cocktail party problem.