Face Matching Using 3D Frontalization
Facial alignment involves finding a set of landmark points on an image, each with a known semantic meaning. This semantic meaning is often lost in 2D approaches, where landmarks are either moved to visible boundaries or ignored as the pose of the face changes. To extract consistent alignment points across large poses, the 3D structure of the face must be considered in the alignment step. Extracting a 3D structure from a single 2D image, however, usually requires alignment in the first place. We present a novel approach that simultaneously extracts the 3D shape of the face and the semantically consistent 2D alignment, using a 3D Spatial Transformer Network (3DSTN) to model both the camera projection matrix and the warping parameters of a 3D model. By utilizing a generic 3D model and a Thin Plate Spline (TPS) warping function, we are able to generate subject-specific 3D shapes without the need for a large 3D shape basis. In addition, our proposed network can be trained end-to-end on entirely synthetic data from the 300W-LP dataset. Unlike other 3D methods, our approach requires only one pass through the network, resulting in faster-than-real-time alignment. Evaluations of our model on the Annotated Facial Landmarks in the Wild (AFLW) and AFLW2000-3D datasets show that our method achieves state-of-the-art performance over other 3D approaches to alignment.
Following the same principles used in the design of the original Spatial Transformer Networks, we design a network capable of embedding a 3D understanding of the face directly into the alignment step. Unlike 3D Morphable Model (3DMM) approaches, the use of TPS warps allows any face shape to be modelled, not just combinations of the training data. The network estimates both the TPS warping parameters and the camera parameters from the input image. These are then used with a generic 3D model to generate a subject-specific model and to find the 2D projection of that model on the image. A final regression step then moves the landmarks of interest to their final positions.
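The geometric part of this pipeline (warping a generic 3D model with a TPS, then projecting it with a camera matrix) can be illustrated with a minimal NumPy sketch. It is not the network itself: the function names `tps_warp_3d` and `project`, the parameter layout, the TPS kernel U(r) = r, and the 3x4 camera matrix with a perspective divide are all assumptions made for illustration. In the described method, the warp and camera parameters would be predicted from the input image; the feature extraction and the final landmark regression step are omitted here.

```python
import numpy as np


def tps_warp_3d(points, control_points, affine, weights):
    """Warp (N, 3) model points with a 3D thin plate spline.

    control_points: (K, 3) TPS centers (hypothetical layout)
    affine:         (4, 3) affine part, rows = [bias, x, y, z] coefficients
    weights:        (K, 3) kernel weights, one column per output axis
    """
    # Pairwise Euclidean distances between points and centers: (N, K)
    dists = np.linalg.norm(
        points[:, None, :] - control_points[None, :, :], axis=-1
    )
    # Assumed radial kernel for the 3D case: U(r) = r
    kernel = dists
    # Prepend a column of ones so the bias row of `affine` applies: (N, 4)
    homog = np.hstack([np.ones((points.shape[0], 1)), points])
    return homog @ affine + kernel @ weights


def project(points_3d, camera):
    """Project (N, 3) points to 2D with an assumed 3x4 camera matrix."""
    # Append a column of ones to form homogeneous coordinates: (N, 4)
    homog = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    proj = homog @ camera.T  # (N, 3)
    # Perspective divide by the third homogeneous coordinate
    return proj[:, :2] / proj[:, 2:3]


if __name__ == "__main__":
    # Toy "generic model": two 3D points, with two TPS centers
    generic_model = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
    centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])

    # Identity affine part and zero kernel weights leave the model unchanged
    affine = np.vstack([np.zeros((1, 3)), np.eye(3)])
    weights = np.zeros((2, 3))
    subject_model = tps_warp_3d(generic_model, centers, affine, weights)

    # A camera that keeps x and y and sets the homogeneous scale to 1
    camera = np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])
    print(project(subject_model, camera))
```

With the identity affine part and zero kernel weights, the warp returns the generic model unchanged; non-zero weights deform the model non-rigidly around the control points, which is how a single generic model can be bent toward a subject-specific shape without a learned shape basis.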
The alignment was evaluated on the AFLW and AFLW2000-3D datasets for both 2D and 3D alignment accuracy, using both the AlexNet and VGG-16 architectures as the "Shared Feature Extraction Network" component. Our results show that this method achieves state-of-the-art accuracy among 3D alignment approaches while running faster than real time.