This paper reports an object tracking algorithm for a moving platform using the dynamic and active-pixel vision sensor (DAVIS). It takes advantage of both the active pixel sensor (APS) frame and dynamic vision sensor (DVS) event outputs from the DAVIS. The tracking is performed in a three-step manner: regions of interest (ROIs) are generated by cluster-based tracking of the DVS output, likely target locations are detected by applying a convolutional neural network (CNN) to the APS output to classify the ROIs as foreground or background, and finally a particle filter infers the target location from the ROIs. Performing convolutions only in the ROIs boosts the speed by a factor of 70 compared with full-frame convolutions for the 240×180 frame input from the DAVIS. The tracking accuracy on a predator and prey robot database reaches 90% with a cost of less than 20 ms/frame in Matlab on a normal PC without using a GPU.
The DAVIS is a neuromorphic camera that outputs static active pixel sensor (APS) image frames concurrently with dynamic vision sensor (DVS) temporal contrast events. DVS address-events (AEs) asynchronously signal changes of brightness in the scene. The application of convolutional neural networks (CNNs) to tracking has become widespread, and a large number of tools are available for applying this technology. Our aim in this study was to develop an object tracking system that uses DVS events to guide efficient application of CNN technology to DAVIS sensors, and to demonstrate the benefits of such a system. Object tracking has been studied for many years. In conventional frame-based visual tracking, speed is traded off against accuracy. Reported CNN-based trackers (CNTs) are usually either too slow or too computationally expensive for real-time applications. Increasing attention has been given to tracking with event-based vision sensors. These event-based tracking systems can achieve relatively high accuracy and high speed thanks to the low-latency, sparse data output of the DVS. However, these DVS trackers usually do not address the scenario of tracking a moving object on a moving background, because distinguishing the events generated by object motion from those generated by ego-motion is a hard task. One previous work proposes an algorithm for separating the two kinds of events; however, that algorithm has only been tested in simple simulated environments, and its performance in realistic scenarios is as yet unproven.

This work aimed to develop a tracker that can operate in an ego-motion scenario where both the observer and the object move around a cluttered environment. We use a sliding-window CNN that labels the input image with a likelihood score to perform classification-based detection of the object. To reduce the number of convolutions, and thus the computational cost of each frame, we use regions of interest (ROIs) generated from the DVS event output of the DAVIS camera. The detected ROIs are fed to a particle filter, which improves tracking accuracy by compensating for occasional misclassifications by the CNN.
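To make the three-stage pipeline concrete, the following sketch mirrors its structure in a heavily simplified form: DVS events are grouped into candidate ROIs (here by naive grid binning rather than the paper's cluster tracker), the APS patch in each ROI is scored by a classifier (a brightness placeholder stands in for the trained CNN), and a particle filter fuses the scored ROIs into a location estimate. All function names, parameters, and thresholds are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def events_to_rois(events, frame_shape=(180, 240), cell=30, min_count=20):
    """Stand-in for the cluster-based DVS tracker: bin event (x, y) addresses
    into a coarse grid and return cells with enough activity as ROIs."""
    h, w = frame_shape
    counts = np.zeros((h // cell, w // cell), dtype=int)
    for x, y in events:
        counts[min(int(y) // cell, h // cell - 1),
               min(int(x) // cell, w // cell - 1)] += 1
    return [(gx * cell, gy * cell, gx * cell + cell, gy * cell + cell)
            for gy, gx in zip(*np.nonzero(counts >= min_count))]

def cnn_score(aps_frame, roi):
    """Placeholder for the CNN foreground/background classifier applied to the
    APS patch inside `roi`; mean brightness is used only so the sketch runs."""
    x0, y0, x1, y1 = roi
    return float(aps_frame[y0:y1, x0:x1].mean()) / 255.0

def particle_filter_step(particles, weights, rois, scores, motion_std=5.0):
    """One predict/update/resample cycle: particles do a random walk, are
    re-weighted by the score and distance of the nearest ROI, then resampled."""
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    centers = np.array([[(x0 + x1) / 2.0, (y0 + y1) / 2.0]
                        for x0, y0, x1, y1 in rois])
    for i, p in enumerate(particles):
        d = np.linalg.norm(centers - p, axis=1)
        j = int(np.argmin(d))
        weights[i] = scores[j] * np.exp(-d[j] ** 2 / (2 * 20.0 ** 2)) + 1e-12
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    return particles, np.full(len(particles), 1.0 / len(particles)), particles.mean(axis=0)

# Toy end-to-end run on one synthetic APS frame and its DVS events.
aps = rng.integers(0, 255, size=(180, 240)).astype(np.uint8)
events = np.column_stack([rng.integers(60, 90, 500), rng.integers(60, 90, 500)])
rois = events_to_rois(events)
scores = [cnn_score(aps, r) for r in rois]
particles = rng.uniform([0.0, 0.0], [240.0, 180.0], size=(100, 2))
weights = np.full(100, 1.0 / 100)
particles, weights, estimate = particle_filter_step(particles, weights, rois, scores)
print("estimated target location (x, y):", estimate)
```

In the real system the placeholder `cnn_score` would be replaced by the sliding-window CNN evaluated only inside each ROI, which is where the reported factor-of-70 reduction in convolution cost comes from.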
This work is based on a robot predator and prey dataset which was recorded at the University of Ulster’s Intelligent Systems Research Centre (“Ulster dataset”). The DAVIS sensor was mounted on top of a Pioneer four-wheeled robotic platform (the predator robot) and followed a second Pioneer robot (the prey robot). The prey robot, which was manually controlled with a joystick, carried visual targets, similar to QR codes, suitable for tracking using available OpenCV toolboxes. By using OpenCV to track the tags, the predator robot could autonomously follow the prey (or go into a search rotation when the prey was lost) while recordings with the DAVIS sensor were made by the on-board computer running jAER [15], the software used to process DAVIS data. (No QR tag search algorithm was used by our tracking algorithm.)
The Ulster dataset consists of 20 minutes of data with 9k APS frames and 160 million DVS events. From this recording, ground truth for the position of the prey robot was hand-labelled at a time resolution of about 5 ms by capturing the cursor position in jAER while following the prey with the mouse. These locations at the various timestamps, together with the APS frames and DVS events, constitute the database on which the following experiments were performed.
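Because the ground truth is sampled roughly every 5 ms while APS frames arrive at their own rate, evaluating a tracker on this database requires matching each frame to the nearest label in time. The sketch below shows one plausible way to do that and to compute a hit-rate style accuracy; the timestamp units, the 20-pixel tolerance, and the array layout are assumptions made for illustration, not the paper's evaluation protocol.

```python
import numpy as np

def nearest_ground_truth(gt_times_us, gt_xy, frame_times_us):
    """For each APS frame timestamp, pick the hand-labelled ground-truth sample
    closest in time (labels arrive roughly every 5 ms)."""
    idx = np.clip(np.searchsorted(gt_times_us, frame_times_us), 1, len(gt_times_us) - 1)
    left, right = idx - 1, idx
    pick_left = (frame_times_us - gt_times_us[left]) <= (gt_times_us[right] - frame_times_us)
    return gt_xy[np.where(pick_left, left, right)]

def tracking_accuracy(pred_xy, gt_xy, tol_px=20):
    """Fraction of frames whose predicted location falls within `tol_px` pixels
    of ground truth (the tolerance is an assumed value, not the paper's)."""
    err = np.linalg.norm(pred_xy - gt_xy, axis=1)
    return float((err <= tol_px).mean())

# Toy usage with synthetic timestamps (microseconds) and positions.
gt_t = np.arange(0, 1_000_000, 5_000)      # ground-truth labels every ~5 ms
gt_xy = np.column_stack([np.linspace(10, 200, len(gt_t)),
                         np.linspace(10, 150, len(gt_t))])
frame_t = np.arange(0, 1_000_000, 40_000)  # synthetic APS frame timestamps
gt_at_frames = nearest_ground_truth(gt_t, gt_xy, frame_t)
predictions = gt_at_frames + np.random.default_rng(1).normal(0, 5, gt_at_frames.shape)
print("accuracy:", tracking_accuracy(predictions, gt_at_frames))
```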