Introduction
The ability to detect objects is vital for autonomous systems such as self-navigating vehicles, security surveillance, and traffic monitoring. While object recognition can be routinely achieved with convolutional neural networks (CNNs), several challenges remain. One of the most severe is occlusion, which occurs when features of an object are masked by other bodies. Such scenarios arise frequently in traffic, with cars and pedestrians constantly crossing and passing each other. While the human brain can compensate for the invisible parts of an obscured object, computers lack such scene interpretation skills. To understand why handling occlusion is challenging for machines, we briefly review how current object detection works. State-of-the-art computer vision relies on the following basic concept: an object emits or reflects light that is measured by the corresponding pixels of a camera. The pixels’ electrical outputs in response to the registered light intensities are commonly known as pixel values, so each object is perceived by the computer as a distinct stream of pixel values. For object recognition, a computer can be trained to analyse key features such as distinct edges or shapes by comparing and matching the learned and observed pixel outputs. This approach is performed very efficiently by CNNs, currently the most favoured tool\cite{recognition,tracking,acm,network}. However, for CNNs to work reliably, a very extensive training dataset is required to cover as many of the forms the object can take on as possible\cite{vision,andrew2012,tsuhan2018,database}. This strong dependence on a complete dataset brings several limitations. Firstly, the larger the dataset, the more memory is needed, increasing the cost of running object detection. Secondly, and most importantly, extensive datasets necessitate cloud computing, with the Internet serving as the vast data source instead of smaller and faster memories localized in the computer itself\cite{trackinga,data}. As a result, latency in image and video processing may occur, potentially leading to fatal accidents for fast-travelling autonomous vehicles that must continuously monitor traffic in real time for safe self-navigation. These problems are further amplified when objects are partially or completely blocked by other objects, which changes the detected pixel values and may trick the computer into recognizing a different body. Although this well-known occlusion problem can be mitigated with an even more extensive training dataset that includes images of the object under all kinds of possible obstructions, such a brute-force approach would consume immense amounts of energy, memory, and computational power, none of which is desirable for always-on, real-time operations such as computer vision\cite{review,kyun2021}.
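To make this sensitivity concrete, the short Python sketch below (purely illustrative and not part of the sensor described in this work; it assumes PyTorch is installed and uses a small, untrained network) passes a synthetic image through a CNN with and without a simulated occluder and prints the resulting class scores, showing how a change in pixel values propagates directly into the classifier's output.

```python
# Minimal sketch: how occlusion changes pixel values and, in turn, a CNN
# classifier's output. Illustrative only -- the network is untrained and
# the "image" is synthetic.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)

torch.manual_seed(0)
model = TinyCNN().eval()

# Synthetic 32x32 grayscale "object": a bright square on a dark background.
img = torch.zeros(1, 1, 32, 32)
img[:, :, 8:24, 8:24] = 1.0

# Simulated occluder: another body covering part of the object,
# i.e. the registered pixel values in that region change.
occluded = img.clone()
occluded[:, :, 8:24, 16:24] = 0.2

with torch.no_grad():
    scores_clear = model(img)
    scores_occ = model(occluded)

print("scores (clear):   ", scores_clear.numpy().round(3))
print("scores (occluded):", scores_occ.numpy().round(3))
# The shift in pixel values propagates into shifted class scores; with a
# trained network this is what can flip the predicted label.
```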
To solve this problem, sensors are needed that process visual data directly, with minimal reliance on cloud computing. For this purpose, artificial intelligence (AI) accelerating cameras have been developed. These devices contain graphics processing units (GPUs) or vision processing units (VPUs) that execute image segmentation and classification for CNNs, reducing the data analysis delegated to the cloud\cite{ramn2012}. In contrast to these existing AI accelerating cameras, our approach integrates video pre-processing directly at the sensor level, without the need for energy-hungry GPUs or VPUs. Conventional cameras usually just register light intensities and colour and are therefore unable to perform any data analysis themselves at the hardware level. Our proposed AI accelerating camera should have the following features, which are crucial for efficient detection of “faulty” objects: first, each camera pixel is a single device rather than an assembly of multiple electronic components or circuitry, minimizing pixel size for high video resolution and reducing power consumption; second, to cut energy and computational costs, fault-tolerant object detection is carried out by the camera itself, without any additional computer chips and with minimal runtime of CNN-based software (a conceptual sketch follows below). Current hardware approaches for handling the occlusion problem primarily involve, for instance, stereo vision, in which multiple cameras cover various angles of the object\cite{cameras,networka}, so that there may always exist a perspective from which the occluded body can be detected. Other sensor technologies such as LIDAR or GPS can further render the tracking of occluded objects more robust\cite{reasoning,fusion,traffic}. However, each added sensor increases materials, energy, and computational costs and creates a heavier load of data to be processed, potentially leading to latency. Therefore, our proposed vision sensor will be a faster and more cost-effective alternative for edge-computed occlusion handling. We will discuss how our sensor deals with various forms of occlusion, such as stationary objects crossed by a travelling foreground, or the tracking of moving objects blocked by another moving or stationary entity.
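As a purely conceptual illustration of such fault-tolerant matching (a sketch under assumed inputs, not the device-level mechanism presented in the Results), occluded pixels can simply be excluded when a binary pixel map is compared against stored reference patterns. In the Python sketch below, the 6 × 4 array mirrors the pixel layout of the sensor described in the Experimental Section, while the templates and the occlusion flag are hypothetical.

```python
# Conceptual sketch only: fault-tolerant matching of a sensor's binary pixel
# map against stored reference patterns, discounting pixels flagged as
# unreliable (e.g. occluded). Array shapes, templates and the flagging
# mechanism are assumptions for illustration, not the device implementation.
import numpy as np

def masked_match(pixel_map, templates, valid_mask):
    """Return (mismatch fraction, name) of the best template on valid pixels."""
    scores = []
    for name, tpl in templates.items():
        mismatches = np.logical_xor(pixel_map, tpl) & valid_mask
        scores.append((mismatches.sum() / max(valid_mask.sum(), 1), name))
    return min(scores)

# 6 x 4 binary maps (1 = illuminated pixel) as a toy reference set.
letter_e = np.array([[1, 1, 1, 1],
                     [1, 0, 0, 0],
                     [1, 1, 1, 0],
                     [1, 0, 0, 0],
                     [1, 1, 1, 1],
                     [0, 0, 0, 0]], dtype=bool)
letter_l = np.array([[1, 0, 0, 0],
                     [1, 0, 0, 0],
                     [1, 0, 0, 0],
                     [1, 0, 0, 0],
                     [1, 1, 1, 1],
                     [0, 0, 0, 0]], dtype=bool)
templates = {"E": letter_e, "L": letter_l}

observed = letter_e.copy()
observed[2, :] = 0                      # a row blocked by a foreground object
valid = np.ones_like(observed, dtype=bool)
valid[2, :] = False                     # those pixels are marked as occluded

print(masked_match(observed, templates, valid))   # -> (0.0, 'E')
```

The design choice illustrated here is that occluded pixels are ignored rather than imputed, so a partially blocked pattern can still be matched without enlarging the reference set.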
Experimental Section/Methods
Device fabrication: Patterned FTO glasses (sheet resistance 14 Ω/sq) for sensor fabrication were purchased from Latech. The electrical independence of the TiO2 photoanodes was achieved by laser-etching separate pixel islands into the FTO-covered glass. First, the patterned FTO glasses were annealed at 500 ℃ for cleaning. After cooling down to room temperature, the substrates were immersed in a 0.04 M TiCl4 solution and kept in an oven at 70 ℃ for 45 min to deposit a compact TiO2 blocking layer. This step was repeated one more time. Once the substrates had been cleaned with DI water and EtOH and dried, the 20-pixel and 6 × 4-pixel sensor photoanodes were fabricated by screen-printing mesoporous TiO2 layers of 1 mm × 1 mm squares using 30NR-D TiO2 paste (Greatcell Solar), followed by annealing at 500 ℃ for 30 min. After the substrates had cooled down, they were submerged overnight in a 0.1 mM solution of Dyenamo Yellow (DN-FN01, purchased from Dyenamo and used as received) in tert-butanol/acetonitrile (1:1 v/v). Next, the substrates were washed with acetonitrile and sandwiched with another FTO glass as the counter electrode using Surlyn. Finally, the electrolyte was injected between the photoanode and the counter electrode. The corresponding electrolyte compositions are described in the text. Cobalt complexes were purchased from Greatcell Solar and used as received.
Sensor characterization: Open-circuit voltage (VOC) measurements were carried out using a National Instruments NI PXIe-1071 chassis and a PXIe-4163 24-channel source measurement unit (SMU). A Solis-3C LED (Thorlabs) served as the white light source for the illumination intensity versus time measurements. For the occlusion case studies involving the supplementary movies, an EPSON EH-TW3200 projector was used as the light source. The movies were projected onto the sensors through an optical lens, and the pixels’ VOC outputs were probed simultaneously. For the occlusion of the letter ‘E’, the 24-pixel sensor array was used. To minimize scattered light interfering with the measurement, the sensor was exposed to the letter ‘E’ by masking the corresponding pixels with black tape. The letter ‘E’ was projected onto the sensor via a movie until the pixels’ VOC reached saturation (about 5 s); subsequently, 4 pixels were occluded for 1 s. After the occlusion, the letter ‘E’ was projected once more. The same voltage measurement conditions were used for the 24-pixel sensor.
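As an illustration of how such per-pixel VOC recordings can be screened for occlusion events, the following Python sketch (illustrative only; the sampling rate, trace values, and 50% threshold are assumptions rather than the parameters used in our measurements) extracts, for each pixel, the time interval during which the VOC drops well below its illuminated saturation level, mimicking the ‘E’-occlusion protocol described above.

```python
# Illustrative post-processing sketch (not the acquisition code used in the
# experiment): given per-pixel VOC traces sampled over time, flag the span
# where VOC falls well below the illuminated (saturated) level as occluded.
import numpy as np

def occlusion_intervals(voc, t, frac=0.5):
    """Return (t_start, t_end) per pixel where VOC < frac * saturated level."""
    intervals = []
    for trace in voc:                      # voc: (n_pixels, n_samples)
        v_sat = trace.max()
        below = trace < frac * v_sat
        if below.any():
            idx = np.flatnonzero(below)
            intervals.append((t[idx[0]], t[idx[-1]]))
        else:
            intervals.append(None)
    return intervals

# Synthetic example mimicking the protocol: ~5 s illumination to saturation,
# 1 s occlusion of some pixels, then illumination again.
fs = 100                                   # samples per second (assumed)
t = np.arange(0, 11, 1 / fs)
voc = np.full((24, t.size), 0.55)          # 24 pixels, ~0.55 V when illuminated
occluded_pixels = [5, 6, 9, 10]            # the 4 pixels covered for 1 s
for p in occluded_pixels:
    voc[p, (t >= 5) & (t < 6)] = 0.05      # VOC collapses while blocked

for p, iv in enumerate(occlusion_intervals(voc, t)):
    if iv is not None:
        print(f"pixel {p:2d}: occluded from {iv[0]:.2f} s to {iv[1]:.2f} s")
```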
Results and Discussion
1.1. Detection of stationary occluded objects