Deep Learning Approaches for Classification of Emotion Recognition based on Facial Expressions

Automated emotion recognition (AER) plays a crucial role in numerous industries that depend on understanding human emotional responses, such as advertising, technology, and human-robot interaction, particularly within the Information Technology (IT) field. However, current systems often fall short of comprehensively understanding an individual's emotions, as prior research has mainly focused on assessing facial expressions and categorizing them into seven primary emotions, including neutrality. In this study, we present several deep Convolutional Neural Network (CNN) models designed specifically for the task of facial emotion recognition, utilizing the FER2013 and RAF-DB datasets. A baseline CNN model is established through a trial-and-error method, and its results are compared with those of more complex deep learning architectures: ResNet18, VGGNet16, VGGNet19, and EfficientNet-B0. Among these, the VGGNet19 model achieved the best result on the FER2013 dataset with a test accuracy of 71.02%, while the ResNet18 model outperformed all other models on the RAF-DB dataset with a test accuracy of 86.02%. These results underscore the potential for advancing automated emotion recognition through complex deep learning techniques.


INTRODUCTION
Artificial intelligence and computer vision researchers have been fascinated by automatic facial expression recognition for the last ten years. They have achieved remarkable results in identifying emotions from facial expressions (e.g., Bartlett et al., 2004). Moreover, they have explored movement and gesture as essential modes of nonverbal communication in human-human and human-computer interactions (Hudlicka, 2003). However, most previous studies on automatic emotion recognition have concentrated on the face; only recently have some researchers begun to incorporate gestures to express emotions in human-computer interactions.
With technological advancements, automated emotion recognition has evolved into a pivotal component of industries that depend on human emotional responses. This encompasses fields such as advertising, technology, and human-robot interaction, and it holds particular importance within the Information Technology sector. However, cultural and gender differences make the task of automated emotion recognition more difficult (Pease & Pease, 2004).
Researchers have been exploring innovative approaches to address these challenges and further advance the field of automated emotion recognition. The substantial progress achieved in facial emotion recognition using deep learning is a particularly promising development in this ongoing effort.
In this study, our primary objective is to evaluate facial emotion recognition accuracy using sophisticated deep learning models: ResNet18, VGGNet16, VGGNet19, and EfficientNet-B0. To contrast the performance of these complex models with a baseline CNN model, we utilized the FER2013 and RAF-DB datasets, categorizing facial expressions into seven primary emotions: anger, disgust, fear, happiness, sadness, surprise, and neutrality. The experimental results show that the VGGNet19 model achieved 71.02% accuracy on the FER2013 dataset, while the ResNet18 model achieved 86.02% accuracy on the RAF-DB dataset. These findings emphasize that more advanced automated emotion recognition systems can be developed by applying more complex deep learning techniques. The remainder of the paper is organized as follows: Section 2 discusses related work, Section 3 describes the methodology of our study, and Section 4 presents and discusses the study's results. Finally, we offer our conclusions and future work.

RELATED WORKS
In recent years, algorithms based on Deep Neural Networks (DNNs) have found extensive use in image and video analysis. Convolutional Neural Networks (CNNs), including Residual Networks (ResNet), VGGNet, and AlexNet, have proven successful in image classification and feature extraction (Simonyan & Zisserman, 2014; Krizhevsky et al., 2017).
Automatic facial behavior analysis, involving tasks such as recognizing basic expressions, estimating continuous emotions, and detecting facial action units, has advanced significantly thanks to large-scale datasets and deep neural networks. For example, Kollias et al. (2019) introduced FaceBehaviorNet, a comprehensive multi-task, multi-domain, and multi-label network that outperforms single-task networks. They also proposed strategies for coupling tasks during training, highlighting the advantages of this approach. Furthermore, they released the Aff-Wild2 dataset, the first extensive in-the-wild dataset to provide detailed annotations for all three primary behavior tasks (Kollias & Zafeiriou, 2019).
For the classification of basic facial emotions, some researchers exclusively utilize deep Convolutional Neural Networks, while others integrate an additional memory cell that aggregates multiple frames to predict current arousal and valence.
The following studies used the FER2013 dataset. Zhang et al. (2015) adopted a multi-merged dataset approach for Facial Expression Recognition (FER), expanding their training data with additional datasets, including AFLW and CelebFaces, alongside FER2013; these databases contain facial attributes with associated labels. To share features across these datasets, they introduced a bridging layer that connected the output with the FER2013 dataset. Their method achieved an accuracy of 70.6% in facial expression recognition. Devries et al. (2014) developed a method for improving facial expression recognition by accurately assessing the location and shape of facial landmarks. Their model comprises three CNN layers, a fully connected layer with ReLU activation, and an output layer employing an L2-SVM objective. To enhance their results, they incorporated data augmentation techniques such as mirroring, rotation, zooming, and random photo rearrangement, attaining a facial expression recognition accuracy of 67.21%.
The RAF-DB dataset, which we also utilize, has been employed in several studies. For example, Li et al. (2017) proposed a Deep Locality-Preserving CNN (DLP-CNN) algorithm designed to maximize inter-class scatter while preserving locality closeness to enhance the discriminative ability of deep features, achieving an accuracy of 74.2% on RAF-DB. Li, Yong, et al. (2018) introduced a Convolutional Neural Network with an Attention Mechanism (ACNN) designed to detect occluded parts of the human face and emphasize relevant unoccluded regions. ACNN utilized adaptive weights based on the importance and unobstructedness of different facial regions, offering two variants, gACNN and pACNN, that consider various regions of interest (ROIs). The gACNN variant achieved the highest accuracy of 85.07% on the RAF database by integrating global and local representations. Wang et al. (2020) presented a Region Attention Network (RAN) to adaptively assess the significance of facial areas for emotion recognition, particularly in the presence of occlusion and pose variations. By aggregating regional features and employing a region-biased loss, they achieved a notable accuracy of 86.90% on the RAF-DB dataset, showcasing the effectiveness of their RAN model in facial expression recognition.

METHODOLOGY
This section provides a comprehensive overview of our study design, encompassing details regarding our datasets, the deep learning models under consideration, and the data augmentation techniques employed.

Datasets
In our experimental study, we utilized two distinct datasets, FER2013 and RAF-DB, to train and test models for categorizing facial emotions into anger, disgust, fear, happiness, sadness, surprise, and neutrality.
The FER2013 dataset consists of grayscale face images, each measuring 48x48 pixels. These images are automatically aligned so that each face is roughly centered and occupies a consistent portion of the frame. The dataset contains 28,709 examples in the training set and 7,178 examples in the test set. The objective is to categorize each face into one of seven emotion classes based on the expression it conveys.
The RAF-DB dataset contains 29,672 real-world facial photos annotated with basic and compound expressions. In our study, we specifically utilized the subset labeled with basic emotions: 12,271 images for training and 3,008 images for testing.
The framework of the proposed approach is presented in Figure 1. Two open-source datasets, FER2013 and RAF-DB, are used, applying the same pipeline to each dataset independently. First, each dataset is loaded and split into training and validation sets; the test set is provided with each dataset. A robust data loader is employed that can read the dataset and shuffle it when required for training. Different augmentation pipelines are employed to enrich the training data and decrease the risk of overfitting. In this section, we describe the augmentation pipeline and discuss the practical application of each hyperparameter we utilized.
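The loading-and-splitting step above can be sketched as follows. This is a minimal illustration: the synthetic random tensors stand in for the real image folders, whose on-disk layout is not specified in the paper, and the 90/10 split ratio and batch size of 64 are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for a loaded dataset: 1,000 random 48x48 grayscale images with
# labels in {0..6}. In practice these come from reading FER2013 or RAF-DB
# from disk (the actual file layout is not specified here).
images = torch.rand(1000, 1, 48, 48)
labels = torch.randint(0, 7, (1000,))
full_train = TensorDataset(images, labels)

# Hold out 10% of the training images for validation; the test split is
# provided separately with each dataset.
n_val = int(0.1 * len(full_train))
train_set, val_set = random_split(full_train, [len(full_train) - n_val, n_val])

# Shuffle only the training loader, as the pipeline requires.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False)

print(len(train_set), len(val_set))  # 900 100
```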

Deep Learning Approaches and CNN Architectures
To obtain a robust model with reasonable accuracy, several deep learning models are employed: a CNN as a base model, ResNet18 (He et al., 2016), VGGNet16 and VGGNet19 (Simonyan & Zisserman, 2014), and EfficientNet-B0 (Tan & Le, 2019). A baseline model is also created from scratch to serve as a reference for future development and to investigate the impact of transfer learning and the utilization of widely used deep learning architectures.
Convolutional Neural Networks are effective models for a variety of applications (Wayman et al., 2005; Minaee et al., 2019; Yang et al., 2016). VGGNets (Simonyan & Zisserman, 2014) and Inception models (Szegedy et al., 2015) demonstrated that increasing network depth improves model quality and learning capacity. By controlling the distribution of each layer's inputs, Batch Normalization (Ioffe & Szegedy, 2015) stabilized learning in deep networks and produced smoother optimization surfaces. ResNets (He et al., 2016) demonstrated that identity-based skip connections enable the training of deeper and more robust networks.
The framework of the proposed approach is shown in Figure 1; the same pipeline is applied to the FER2013 and RAF-DB datasets independently. In this research, we apply multiple deep learning architectures, including ResNet, EfficientNet, and VGGNet. Because the VGGNet models achieve the best results on FER2013, we illustrate the VGGNet architecture in the following subsection.
VGGNet: VGG16 is a Convolutional Neural Network architecture introduced by Simonyan and Zisserman (2014) in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition." VGG16 scored 92.7% top-5 accuracy on ImageNet, a database of more than fourteen million images divided into 1,000 classes, and was one of the most notable models submitted to the ILSVRC-2014 competition. It improves on AlexNet by replacing the large kernel-sized filters (11x11 and 5x5 in the first and second convolutional layers, respectively) with stacks of smaller 3x3 kernel-sized filters. Training VGG16 took weeks on NVIDIA Titan Black GPUs. Figure 2 presents the architecture of VGGNet16. The input is an RGB image of fixed size 224x224, which serves as input to the first convolutional layer. The image is run through a stack of convolutional layers with 3x3 kernels, the smallest size that captures the notions of left/right, up/down, and center in the input image. The 1x1 convolutional layers may be considered linear transformations of the input channels. Spatial padding of 1 pixel with a stride of 1 pixel preserves the spatial resolution after each 3x3 convolution, and spatial pooling uses max-pooling layers with a 2x2 kernel and a stride of 2. The stack of convolutional layers, whose depth varies across configurations, is followed by three fully connected (FC) layers: the first two have 4,096 neurons each, and the third has 1,000 neurons, one per ImageNet class. A softmax layer at the end of the model outputs class probabilities ranging from 0 to 1.
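The parameter savings behind this 3x3 substitution can be checked directly: two stacked 3x3 convolutions cover the same 5x5 receptive field as one 5x5 convolution but with fewer weights (2·3·3·C·C vs. 5·5·C·C for C input/output channels) and an extra nonlinearity in between. A short sketch, with C = 64 chosen only for illustration:

```python
import torch.nn as nn

# Two stacked 3x3 convolutions: same 5x5 receptive field as a single
# 5x5 convolution, but with fewer parameters.
C = 64
stacked = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)
single = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked), count(single))  # 73728 102400
```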
Figure 3 shows an overview of the VGGNet19 architecture, a convolutional neural network used for image classification. VGGNet19 comprises 19 weight layers: 16 convolutional layers and three fully connected layers. The convolutional layers are arranged in groups of two or four, with max-pooling layers in between. This very deep architecture allows VGGNet19 to capture complex image features. The figure highlights the input and output sizes of each layer in the network, along with the number of filters and kernel sizes used in each convolutional layer. Overall, VGGNet19 is a robust CNN architecture that has been used in a wide range of computer vision tasks.
Our augmentation pipeline comprises the following transforms:
1) Random resized crop: The RandomResizedCrop method, part of the torchvision.transforms module, crops a random area of an image and resizes it to a given size. It accepts both PIL Images and Tensor Images; a tensor image is a PyTorch tensor with shape [C, H, W], where C is the number of channels and H and W are the height and width. The method returns a randomly cropped image. We use a scale of (0.8, 1.2), meaning the crop can zoom in to 80% or out to 120% while the output keeps the original size (48x48).
2) Brightness adjustment: a type of image augmentation that randomly changes an image's brightness, contrast, and saturation. We used brightness=0.5, contrast=0.5, and saturation=0.5, applied with 50% probability.
3) Random affine: This transform accommodates both PIL and Tensor Images and returns the affine-transformed version of the input image. We used degrees=0 with translate=(0.2, 0.2), applied with 50% probability.
4) Horizontal flip: Also known as image mirroring, this transform flips an image horizontally, swapping its left and right sides to create a mirror image, with 50% probability.
5) Random rotation: in the range [-45, 45] degrees.
6) TenCrop: After all the transforms above are applied, this method takes the image (PIL or Tensor) and returns a tuple of 10 cropped images. We used a crop size of (40, 40).
7) Normalization: Normalizing the images is good practice when using deep neural networks; it rescales each channel to a mean of 0.0 and a standard deviation of 1.0, ensuring that the input data has a similar scale and distribution, which can improve the model's performance. The torchvision.transforms.Normalize() method takes each channel's mean and standard deviation as input and returns a normalized tensor.
8) Random erasing: Erases the pixels of a randomly chosen rectangular region of a torch.Tensor image. We used 50% probability.
These are example settings; in training, we apply small ranges of these parameters so that the augmented images remain compatible with real-life images. This advanced data augmentation pipeline provides the model with a comprehensive learning experience, enabling it to grasp nuanced features and intricacies within the data, and continual refinement of the augmentation strategies enhances the model's adaptability and its ability to make accurate predictions across real-world scenarios. The augmentation results are illustrated in Figures 4 and 5, showcasing the effect of our pipeline on the FER2013 and RAF-DB datasets: the left images depict the original samples before augmentation, while the corresponding right images show the same examples after applying our augmentation pipeline.

RESULTS AND DISCUSSION
The performance of each model is presented independently to gain more insight into how our models perform on unseen data. The training phase and the accuracies on the training and validation sets were introduced in the Training and Experiments section; in this section, the models are evaluated on the test set.

Base Line Model
We developed a baseline model from scratch to serve as a reference for future development and to investigate the impact of transfer learning on the subsequent deep learning models. The confusion matrix of the baseline model is presented in Figure 6.a. Table 1 reports the precision, recall, and F1-score of the baseline model for each label independently. This provides insight into how the model performs on unseen data and which labels account for most of the errors, and it allows us to gauge the complexity of the problem at hand. It also highlights the value of leveraging transfer learning and employing more intricate models to generalize more effectively to the test dataset.
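A from-scratch baseline of this kind can be sketched as below. The specific layer counts, channel widths, and dropout rate here are illustrative assumptions, not the paper's reported architecture; the sketch only shows the 48x48-grayscale-in, 7-logits-out shape of the task.

```python
import torch
import torch.nn as nn

# A minimal baseline CNN for 48x48 grayscale faces and 7 emotion classes.
class BaselineCNN(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                        # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                        # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(2),                        # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = BaselineCNN()(torch.rand(4, 1, 48, 48))
print(logits.shape)  # torch.Size([4, 7])
```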

EfficientNet-B0 Model
To better understand the performance of the EfficientNet-B0 model on the FER2013 and RAF-DB datasets, various metrics were analyzed, including the confusion matrix, normalized confusion matrix, and the precision, recall, and F1-score for each label. The normalized confusion matrix is the confusion matrix divided by the number of images per label, giving the error distribution in percent. Figure 6.b shows the confusion matrix and normalized confusion matrix of the EfficientNet-B0 model on the FER2013 dataset. The happy and surprise labels achieve the best accuracy, at 84% and 79%, respectively. The model is confused among the fear, angry, neutral, and sad labels, which have the highest error percentages. Table 2 shows the classification report of the model, including how many images contribute to each result, which gives more intuition about how the EfficientNet-B0 model performs on FER2013. The surprise label achieves a 0.78 F1-score even though it has fewer images than the neutral and sad labels. Figure 7 shows the confusion matrix and normalized confusion matrix of the EfficientNet-B0 model on the RAF-DB test set.
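The row normalization described above (each confusion-matrix row divided by that label's image count) is a one-liner; a small sketch with a made-up 2-class matrix:

```python
import numpy as np

# Normalize a confusion matrix row-wise, so each entry becomes the
# fraction of that true label's images assigned to each predicted label.
def normalize_confusion(cm):
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)

cm = np.array([[80, 20],
               [10, 90]])
print(normalize_confusion(cm))
# [[0.8 0.2]
#  [0.1 0.9]]
```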
The RAF-DB results have a more favorable error distribution because the data consists of RGB images of size 100x100, which enables the model to capture more information. Table 3 shows the classification report of the model on RAF-DB; the model achieves 84.547% accuracy on the test set.
The model predicts happy images accurately, with a 0.93 F1-score, because the dataset contains many happy examples, enabling the model to predict this class more accurately. The surprise, sad, and neutral labels also achieve F1-scores above 0.8.

ResNet18 Model
To gain a better understanding of how the model performs, it was evaluated on the test set. For each dataset, the confusion matrix and normalized confusion matrix were computed, and a classification report was generated showing the precision, recall, and F1-score for each of the 7 basic emotion labels. The model predicts the happy and surprise labels with 86% and 80% accuracy, respectively. The ResNet18 model also reduces the error on the disgust label, achieving 78% accuracy.
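Such a per-label report can be produced with scikit-learn's classification_report; the labels and predictions below are synthetic stand-ins (seeded random arrays, not the paper's results), used only to show the shape of the report behind the tables.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical predictions over the 7 basic emotions: 80% of predictions
# match the ground truth, the rest are random errors.
emotions = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 7, size=200)
noise = rng.integers(0, 7, size=200)
y_pred = np.where(rng.random(200) < 0.8, y_true, noise)

# labels=range(7) forces one row per emotion even if a class is missing.
report = classification_report(y_true, y_pred,
                               labels=list(range(7)), target_names=emotions)
print(report)
```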
The error remains high on the sad and fear labels, and the model confuses these two labels with each other. Table 5 gives more detail on how our model performs on the RAF-DB test set, showing the precision, recall, and F1-score for each of the seven basic emotions. The model achieved a test accuracy of 86.02%, with a 0.94 F1-score on the happy label and F1-scores above 0.80 on four further labels: surprise, sad, angry, and neutral. Figure 8 shows the confusion matrix and normalized confusion matrix of the ResNet18 model on the RAF-DB dataset.

VGGNet16 Model
The VGGNet16 model is evaluated on the test set, but with additional customization layers. First, we change the output size of the last average pooling layer from (7, 7) to (1, 1), meaning that we use global average pooling instead. We also change the top layer to have just 7 output neurons, representing our 7 classes, and compute the confusion matrix and normalized confusion matrix. The VGGNet16 model achieves a strong result on the FER2013 dataset, with a test accuracy of 70.2%. The original VGGNet has 138M parameters, which is very large and complex compared to other models, but our customization reduces the parameter count to a more reasonable 33.6M. Figure 9 shows the confusion matrix and normalized confusion matrix of the model on the FER2013 test set. The VGG16 model reduces the error on all labels relative to the previous models, as can be seen by comparing the diagonal of the VGGNet confusion matrix with those of the other models. Table 6 shows the precision, recall, and F1 scores for all labels, along with the number of images per label contributing to these metrics; the F1-score of each label increases, and the overall test accuracy is 70.2%. We also show the confusion matrix and normalized confusion matrix of the VGGNet16 model on RAF-DB in Figure 10. The ResNet18 model achieves the best result on this dataset: VGG16 reaches 85.88% on the test set, while ResNet18 reaches 86.02%. The difference in accuracy is small, but the VGGNet model has more parameters and far higher complexity. Table 7 gives more detail on how the VGG16 model performs on each RAF-DB label, offering further intuition about its behavior on unseen data.

VGGNet19 Model
The VGGNet19 model is evaluated on the test set with the same customizations: the output size of the last average pooling layer is changed from (7, 7) to (1, 1), meaning global average pooling is used instead, and the top layer is changed to have just 7 output neurons representing our 7 classes. This customization reduces the number of parameters from over 138M to just 45.2M, which is more reasonable. The VGGNet19 model achieves the best result on the FER2013 dataset, with a test accuracy of 71.02%. It reduces the error on all labels relative to the other models, as can be seen by comparing the diagonal of the VGG19 confusion matrix with those of the other models. Table 8 shows the precision, recall, and F1-score metrics for all labels, along with the number of images per label contributing to these metrics; the F1 value of each label increases, and the overall test accuracy is 71.02%. Table 9 gives more detail on how the VGG19 model performs on each RAF-DB label, offering further intuition about its behavior on unseen data.

ROBUSTNESS OF OUR APPROACH
To confirm the robustness of our approach, we apply our pipeline to two datasets. Furthermore, we avoid depending solely on a single algorithm and instead explore different approaches and models, aiming to develop resilient models that accurately represent our study. To select each model's hyperparameters and training augmentation pipeline, we perform an exhaustive search over different hyperparameters and augmentation pipelines, obtaining robust settings that give the best result for each model independently. We thus ensure the robustness of our approach by addressing all of these issues and by applying our pipeline to two open-source datasets. Comparing our results with those of other researchers, our models perform better and achieve the best results on these datasets. Tables 12 and 13 compare our best results with other state-of-the-art results on both datasets: the accuracy of VGGNet19 on FER2013 and of ResNet18 on RAF-DB outperforms all other research models. Some researchers merge multiple datasets to construct one larger dataset; in our work, we use each dataset independently without any merging. The customization layers we employ for each model decrease the number of parameters and, consequently, the inference time relative to the original models.
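The exhaustive search described above amounts to a grid search over configurations. A toy sketch: `train_and_eval` is a placeholder that would normally run one full training job and return validation accuracy (here it just scores a synthetic preference for lr = 1e-3 with a cosine scheduler), and the grid values are illustrative, not the paper's search space.

```python
from itertools import product

# Placeholder for one full training run returning validation accuracy.
def train_and_eval(lr, batch_size, scheduler):
    return 1.0 / (abs(lr - 1e-3) + 1.0) + (0.1 if scheduler == "cosine" else 0.0)

grid = {
    "lr": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64],
    "scheduler": ["step", "cosine"],
}

# Evaluate every combination and keep the best-scoring configuration.
best = max(product(*grid.values()), key=lambda cfg: train_and_eval(*cfg))
print(dict(zip(grid, best)))  # {'lr': 0.001, 'batch_size': 32, 'scheduler': 'cosine'}
```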

CONCLUSION
This study has involved assessing the strengths and weaknesses of each model independently, followed by a comprehensive comparison offering concise discussions and distinctions among them. It has been shown that CNN architectures such as ResNet, EfficientNet, and VGGNet can achieve good results on the emotion recognition problem with different adjustments; these considerations should be taken into account for future enhancements of this study. To better understand how each model handles unobserved data, the models were evaluated on the test set, and for each dataset the confusion matrix and normalized confusion matrix were computed. On the FER2013 dataset, the customized VGGNet models produce the strongest results, with VGGNet16 scoring a test accuracy of 70.2%. The 138M parameters of the original VGGNet make it overly complex and large in comparison to other models; however, our customization reduces the number of parameters to 33.6M, which is more reasonable. A CNN baseline is also employed on FER2013, and residual learning blocks have been shown to increase the performance of deep learning models.
We selected the number of layers and neurons based on our intuition from applying CNNs to different datasets.
To determine the best learning rate and scheduler, we used a trial-and-error technique. Because it is computationally inexpensive, our proposed model can be combined with other models to increase the accuracy of face recognition systems that utilize the FER2013 dataset. Although FER2013 is a very complicated dataset with few samples in some classes, accuracy could be improved by increasing the number of samples in those classes appropriately. To solve the problem, we used transfer learning to retrain VGGNet16, VGGNet19, ResNet18, and EfficientNet-B0 on two open-source datasets, FER2013 and RAF-DB; the pre-trained deep CNN architectures achieved strong performance on both. The VGGNet19 model performs best on FER2013, with a test accuracy of 71.02%, in contrast to the other models. In this model, we add further customization layers: we change the output size of the last average pooling layer from (7, 7) to (1, 1), meaning that we use global average pooling instead, and change the top layer to have just seven output neurons representing the 7 classes. The original VGGNet has 138M parameters, making it too complicated; however, our customization lowers this to a more reasonable 45.2M. On RAF-DB, ResNet18 outperforms all other models with 86.02% accuracy.

Figure 4. FER2013 dataset before and after applying our augmentation pipeline.

Figure 5. RAF-DB dataset before and after applying our augmentation pipeline.

Figure 6. Results for (a) the confusion matrix of the baseline model and (b) the confusion matrix and normalized confusion matrix of EfficientNet-B0 on FER2013.

TABLE 1 .
CLASSIFICATION REPORT OF BASELINE MODEL.

TABLE 2 .
THE CLASSIFICATION REPORT OF THE EFFICIENTNET-B0 MODEL ON FER2013
Figure 7. Confusion matrix and normalized confusion matrix of EfficientNet-B0 on RAF-DB.
Table 4 gives the classification report of the ResNet18 model on the FER2013 dataset; the model achieves a total accuracy of 68.07% on the test set, and the F1 score for each label is increased compared to the EfficientNet-B0 model. The last two figures show the confusion matrix and normalized confusion matrix of the ResNet18 model on the RAF-DB test set, from which we can see that the model is more robust on this dataset and reaches a reasonable accuracy compared to other research.

TABLE 5 .
CLASSIFICATION REPORT OF THE RESNET18 MODEL ON RAF-DB
Figure 8. Confusion matrix and normalized confusion matrix of ResNet18 on RAF-DB.

TABLE 6 .
CLASSIFICATION REPORT OF THE VGGNET16 MODEL ON THE FER2013 TEST SET

TABLE 7 .
CLASSIFICATION REPORT OF THE VGGNET16 MODEL ON RAF-DB
Figure 10. Confusion matrix and normalized confusion matrix of the VGGNet16 model on RAF-DB.

TABLE 8 .
CLASSIFICATION REPORT OF THE VGGNET19 MODEL ON THE FER2013 DATASET

Table 10 shows the overall results of all models on the FER2013 dataset; the deep learning models achieve better accuracy than the baseline model. The best model is highlighted in gray in the table below. VGGNet19 achieves the best test accuracy (71.02%) on this dataset; however, the number of model parameters and the model's complexity should be taken into consideration when deploying the best model. Table 11 presents the overall results on RAF-DB. Based on the FER2013 results, the three deep learning models appear promising for solving the emotion recognition problem, so we use only these models on RAF-DB. The ResNet18 model achieves the best accuracy on this dataset.

TABLE 10 .
ALL MODEL RESULTS ON FER2013 TEST SET

TABLE 13 .
COMPARISON OF THE PROPOSED APPROACH WITH OTHER STATE-OF-THE-ART RESULTS ON RAF-DB