Implementation of Face Detection and Tracking

Abstract

In this paper, we addressed the problem of face recognition under mismatched conditions. In the proposed system, for face representation, we leveraged state-of-the-art deep learning models trained on the VGGFace2 dataset. More specifically, we used pretrained convolutional neural network models to extract 2048-dimensional feature vectors from face images of the International Challenge on Biometric Recognition in the Wild dataset (ICB-RW 2016). In this challenge, the gallery images were collected under controlled, indoor studio settings, whereas probe images were acquired from outdoor surveillance cameras. For classification, we trained a nearest neighbor classifier using correlation as the distance metric. Experiments on the ICB-RW 2016 dataset have shown that the employed deep learning models trained on the VGGFace2 dataset provide superior performance. Even using a single model, an absolute increase of around 15% in Rank-1 correct classification rate has been achieved compared to the ICB-RW 2016 winner system. Combining individual models at the feature level improved the performance further. The ensemble of four models achieved a 91.8% Rank-1 identification rate, a 98.0% Rank-5 identification rate, and an Area Under the Curve of the Cumulative Match Score of 0.997 on the probe set. The proposed method significantly outperforms the Rank-1 and Rank-5 identification rates and the Area Under the Curve of the Cumulative Match Score of the best approach at the ICB-RW 2016 challenge, which were 69.8%, 85.3%, and 0.954, respectively.

INTRODUCTION

Biometric face identification for surveillance purposes has many applications, e.g., crime investigation and security systems, in which the features extracted from the face image of a target are compared to the features from all face images in the gallery set, as shown in Figure 1. For an efficient comparison, a compact face representation is desired. Conventionally, engineered techniques such as Fisher Vectors (FV) were used to extract these features. However, the advent of deep learning architectures and large-scale datasets, such as Labeled Faces in the Wild (LFW) and YouTube Faces (YTF), has facilitated research in the field of face recognition. Although recent advances in the field of face recognition have been significant, most of the works are evaluated on the LFW and YTF datasets, which are collected under matched conditions. In other words, both training and testing sets in LFW are collected from the web, without motion blur and in relatively high resolution, while the YTF dataset contains videos recorded under similar conditions. Nonetheless, comparable results for biometric face recognition in real-world visual surveillance scenarios have not yet been achieved.
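For illustration, the following is a minimal sketch of this gallery ranking step, assuming descriptors have already been extracted; it uses the correlation distance named in the abstract, and all array shapes, sizes, and identifiers are illustrative assumptions rather than values taken from the actual system.

    import numpy as np
    from scipy.spatial.distance import cdist

    def rank_gallery(probe_feat, gallery_feats, gallery_ids):
        """Rank gallery identities by correlation distance to a probe descriptor.

        probe_feat    : (D,) compact face descriptor of the probe image
        gallery_feats : (N, D) one descriptor per gallery identity
        gallery_ids   : list of N identity labels
        """
        # cdist with 'correlation' computes 1 - Pearson correlation per pair
        dists = cdist(probe_feat[None, :], gallery_feats, metric="correlation")[0]
        order = np.argsort(dists)  # smallest distance = best match
        return [(gallery_ids[i], dists[i]) for i in order]

    # Illustrative usage with random 2048-D descriptors (shapes only)
    rng = np.random.default_rng(0)
    gallery = rng.normal(size=(90, 2048))   # e.g., 90 gallery identities
    probe = rng.normal(size=2048)
    ranking = rank_gallery(probe, gallery, [f"id_{i:03d}" for i in range(90)])
    print(ranking[:5])  # top-5 candidate identities

The returned list is already sorted, so the first entry gives the Rank-1 decision and the first k entries give the Rank-k candidate set.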

Face recognition under matched conditions, where both training and testing images are from a similar domain, as in LFW and YTF, is considered solved, as the recent advances reported on LFW and YTF demonstrate. FaceNet [10] has been proposed as an end-to-end deep learning architecture based on the Inception model, followed by L2 normalization and a triplet loss. The model was trained on a very large-scale private dataset of 260M images and achieved a record accuracy of 99.63% on LFW and 95.12% on YTF. Sun et al. used two VGG architectures including Inception modules, and extracted features from 25 crops of different parts of each face per network. The extracted features were concatenated, and their dimension was reduced to 300 using Principal Component Analysis. Afterwards, a joint Bayesian model was learned for face recognition. Their proposed method achieved 99.54% accuracy on LFW. The SphereFace approach introduced the Angular-Softmax loss and used a ResNet architecture to learn face embeddings in the training phase. At test time, they applied a nearest neighbor classifier with cosine similarity to identify the faces. This method achieved 99.42% accuracy on LFW and 95.0% on YTF.
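As a brief illustration of the test-time protocol described for SphereFace, the sketch below performs nearest-neighbor identification with cosine similarity over precomputed embeddings; the function and variable names are assumptions for illustration, not code from the cited work.

    import numpy as np

    def cosine_nn(probe_emb, gallery_embs, gallery_ids):
        """Return the gallery identity whose embedding has the highest
        cosine similarity to the probe embedding."""
        # L2-normalize so the dot product equals cosine similarity
        p = probe_emb / np.linalg.norm(probe_emb)
        g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
        sims = g @ p                     # (N,) similarity to each identity
        best = int(np.argmax(sims))
        return gallery_ids[best], float(sims[best])

Normalizing the embeddings first means only the angle between vectors matters, which is the property the angular-margin training objective is designed to exploit.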

There are not many publicly available large-scale datasets that address the face recognition problem for surveillance scenarios. To the best of our knowledge, CoxFace, ChokePoint, and SCFace are the most cited datasets for surveillance scenarios in the literature. Although these datasets are collected under mismatched conditions, they do not encompass all the characteristics of real surveillance scenarios, e.g., occlusion, strong motion blur, pose, illumination, expression, and focus. Thus, we examined the robustness of deep face descriptors, learned with Convolutional Neural Networks (CNN), namely ResNet-50 and SENet-50, on the International Challenge on Biometric Recognition in the Wild dataset (ICB-RW 2016), which includes all of the aforementioned variations present in real-world surveillance scenarios. In the challenge paper, the highest performance was reported as 69.8% Rank-1 and 85.3% Rank-5 accuracy on the probe set, with a 0.954 Area Under the Curve (AUC) of the Cumulative Match Score curve (CMC). Our experiments, which made use of ResNet-50 [5] and SENet-50 models trained on the VGGFace2 dataset as feature extractors, markedly improved upon the previous results. We consider that our proposed method owes its achievements to the deeper CNN architectures and the larger number of images in the VGGFace2 dataset (3.31 million images), in which there are images of each subject in different poses and at different ages, enabling the models to learn a robust face representation.
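As a rough sketch of how such descriptors can be obtained and evaluated, the code below uses the third-party keras-vggface package, one of several public releases of VGGFace2-trained ResNet-50/SENet-50 weights (the paper does not state which implementation was used), to extract 2048-dimensional pooled features, and then computes Rank-k identification rates and a normalized AUC of the CMC curve. The file paths and evaluation helper are illustrative assumptions.

    import numpy as np
    from keras_vggface.vggface import VGGFace          # pip install keras-vggface
    from keras_vggface.utils import preprocess_input
    from keras.preprocessing import image

    # Global-average-pooled ResNet-50 trained on VGGFace2 -> 2048-D descriptors
    extractor = VGGFace(model="resnet50", include_top=False,
                        input_shape=(224, 224, 3), pooling="avg")

    def extract_descriptor(img_path):
        img = image.load_img(img_path, target_size=(224, 224))
        x = image.img_to_array(img)[None, ...]
        x = preprocess_input(x, version=2)   # version=2 for VGGFace2 models
        return extractor.predict(x)[0]       # shape: (2048,)

    def cmc(dist_matrix, probe_labels, gallery_labels):
        """Cumulative Match Characteristic from a (num_probes, num_gallery)
        distance matrix; curve[k-1] is the Rank-k identification rate.
        Assumes a closed set: every probe identity appears once in the gallery."""
        hits = np.zeros(dist_matrix.shape[1])
        for d, true_id in zip(dist_matrix, probe_labels):
            ranked = np.array(gallery_labels)[np.argsort(d)]
            rank = int(np.where(ranked == true_id)[0][0])  # 0-based match rank
            hits[rank:] += 1
        curve = hits / len(probe_labels)
        auc = curve.mean()  # normalized area under the CMC curve
        return curve, auc

    # Example: curve, auc = cmc(dists, probe_ids, gallery_ids)
    # Rank-1 = curve[0], Rank-5 = curve[4]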
