08.01.2020

This study validates automated emotion and action unit (AU) coding, using FaceReader 7 on a dataset of standardized facial expressions of six basic emotions (Standardized and Motivated Facial Expressions of Emotion, SMoFEE). Percentages of correctly and incorrectly classified expressions are reported. The validity of AU coding is assessed via correlations between the automated analysis and manual Facial Action Coding System (FACS) scoring for 20 AUs. On average, 80% of the emotional facial expressions are classified correctly. The overall validity of AU coding is moderate, with the highest validity indicators for AUs 1, 5, 9, 17 and 27. These results are compared to the performance of FaceReader 6 in previous research and yield comparable validity coefficients.

Practical implications and limitations of the automated method are discussed.

The Standardized and Motivated Facial Expression of Emotion dataset

The original acquisition of the video material used in this research was approved by Friedrich-Alexander University's legal review department, which, at the time the material was recorded, constituted the equivalent of an Institutional Review Board or Research Ethics Committee for the behavioral sciences that independently evaluated human subject research ethics. Informed consent was obtained from all research participants. Participants were treated in accordance with the ethical principles outlined in the 1964 Declaration of Helsinki.

Our study was based on the Standardized and Motivated Facial Expression of Emotion (SMoFEE) stimulus set. The dataset contains static (pictures) and dynamic (movies) facial expressions enacted by 80 Caucasian individuals (36 men; M = 22.63 years, SD = 2.43). Each individual enacted the emotions happiness, sadness, anger, disgust, fear and surprise, as well as two neutral expressions (mouth open and closed), in a prototypical fashion patterned after the Japanese and Caucasian Facial Expressions of Emotion (JACFEE) and Neutral Faces (JACNeuF) slide set. In addition, they freely enacted each emotion for motivational contexts related to power, achievement and affiliation, as prompted by prerecorded narrative vignettes.

In total, each individual encoded 25 expressions, resulting in 2,000 videos and stills. Because several participants who enacted the emotions indicated on their informed consent form that they did not want their pictures and movie clips to be made publicly available on the internet, we cannot provide a link to the stimulus set. However, provided that potential users agree in writing to honor this stipulation, the SMoFEE stimulus set can be obtained upon request from the second or third author.


For examples of the picture set, please see the supplementary materials posted on the Open Science Framework. For all pictures, FACS codings of the intensities of AU activations and classifications of the prototypicality of the emotion expressions are included. For more information on the coding and validation of the SMoFEE stimulus set, please see the original SMoFEE publication. In the following, we will focus only on the subset of photographs showing standardized emotional expressions, because these represent a common validity standard in research on facial emotions and allow direct comparisons with other studies employing pictures of prototypical emotions.

Procedure

The recording of the standardized expressions took 60 minutes on average. To achieve a high level of standardization, participants were asked to take off their jewelry and to move their hair out of their face. A black hairdressing cape was used to cover visible clothing. For the depiction of the basic emotions, participants were asked to mimic the respective expression in the JACFEE reference pictures as accurately and naturally as possible.

One of the two experimenters present during the recording of the videos was a certified FACS trainer; the other experimenter had been instructed by the first. Participants were given a mirror to examine and practice their facial expressions. For the recording, participants were instructed to start with a neutral expression, then perform the emotion expression, and to end with a neutral expression again. Each emotion condition was repeated at least twice: the experimenter first provided performance instructions for encoding the emotion, then checked the first encoding and gave feedback to help participants improve their performance. The subsequent second encoding was recorded.

If an expression still failed to satisfactorily replicate the JACFEE template, participants were asked to repeat encoding the expression until a satisfactory result was achieved.

Data processing and FACS coding

The recording sessions used in the present research yielded a total data pool of 640 standardized video sequences (80 participants x 8 conditions, resulting from 6 emotions, one neutral condition with mouth open and one neutral condition with mouth closed). The software Picture Motion Browser by SONY was employed to view the videos. Subsequent cutting, editing, and the creation of still images were carried out using the software Adobe Premiere 3.0. Microsoft Windows Photo Gallery was used to view the stills.

Creating coding templates

For every recorded video sequence, both a static and a dynamic coding template of the maximal emotion expression were created for subsequent FACS coding. The first step was selecting the video sequence with the maximal expression intensity.

Next, clips were cut following predefined criteria for the beginning and end of an emotion. These clips thus started with a brief depiction of an initially neutral expression and then showed the waxing and waning of the emotional expression itself. In a third step, still pictures were created by identifying the frame depicting the maximal emotion expression. The person editing the videos was trained in FACS coding and determined the maximal expression according to expression intensity and prototypicality as defined in the FACS manual.

FACS coding

All 640 still images were FACS coded for the intensity of the activated AUs on a six-point scale (0 = none, 1 = trace, 2 = slight, 3 = marked, 4 = severe, 5 = maximum). Two coders certified for FACS each coded half of the pictures. Fifty stills were double-coded by both coders to determine reliability.

Interrater reliability of coding was .83 (agreement index), exceeding the criterion for FACS certification (.70) and indicating good reliability. For the coding of an emotion expression, the participant's neutral expression and the respective dynamic emotion sequence were used for reference. Coding a single expression took 10–15 min, coding all 8 expressions of a participant took 80–120 min, and the entire picture pool therefore required roughly 133 h (100 min x 80 participants = 8,000 min).
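The agreement index itself is not spelled out above. The following is a minimal sketch, assuming the standard FACS formula of twice the number of AUs scored by both coders divided by the total number of AUs scored by the two coders; the example codings are hypothetical and not taken from the SMoFEE data.

```python
# Minimal sketch of a FACS-style inter-coder agreement index, assuming the formula
# 2 * (number of AUs scored by both coders) / (total AUs scored by both coders).

def agreement_index(aus_coder_a, aus_coder_b):
    """Agreement ratio for one image, based on the sets of AUs each coder scored."""
    a, b = set(aus_coder_a), set(aus_coder_b)
    if not a and not b:
        return 1.0  # neither coder scored any AU: treat as perfect agreement
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical example: coder A scores AUs 6, 12, 25; coder B scores AUs 6, 12, 26.
print(agreement_index([6, 12, 25], [6, 12, 26]))  # 0.67 for this single image

# Across a set of double-coded stills, the per-image indices would be averaged.
```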

Video preparation

The SMoFEE video clips of the standardized emotion expressions happiness, sadness, anger, disgust, fear and surprise, as well as the neutral expressions with mouth closed, were chosen as source material for FR analysis. Prior to the analysis, the video clips had to be converted for further processing by FR. For this step, the software PlayMemories Home by SONY was used. The target data format was mp4 with 1920 x 1080 resolution and 25 frames per second. The final dataset thus consisted of 480 emotion sequences (80 participants x 6 emotions) and 80 recordings of the closed-mouth neutral expressions.

FR settings

The basic FR settings were modified for the analysis. Automatic continuous calibration and frame-to-frame smoothing of classification values were deactivated to ensure high-accuracy raw data, and also because we made the calibration against a neutral face an explicit, separate step of our analysis (see below).

The general default face model, which had been trained on a wide variety of images and, according to the handbook, works best for most people, was selected because it fit SMoFEE's Caucasian adult participants best. The sample rate was set to every frame. The optional classification of the contempt expression was excluded from the analysis, as it was not featured in the SMoFEE dataset. AU classification was activated. Data export was set to continuous values to ensure full use of all FR output.

Data analysis with calibration

FR's calibration feature makes it possible to control for confounding effects of a recorded person's physiognomy or habitual facial expression in the evaluation of dynamic emotional expressions. If a neutral expression features aspects of an emotion expression, FR's emotion coding algorithms can be biased both when a neutral and when an emotional expression is presented by that person.

To minimize this bias, FR allows a person-specific calibration based on the features of the target person's neutral expression. The calibration is based on the analysis of two seconds of video of a neutral expression, with the algorithm identifying the image with the lowest model error and using it for calibration.

Subsequent changes in emotional expression detected by FR in emotion-expression videos then represent the deviation from the neutral-expression calibration template. Note, however, that calibration only influences emotion classification in FR, not AU coding. Calibration was carried out per participant, using that participant's video of the neutral expression with mouth closed. FR does not allow a calibration to be applied automatically to all videos of the respective participant; we therefore set up the analysis with the respective calibration manually for each participant. The subsequent batch analysis again took 90 min.
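FR's calibration algorithm is proprietary and only described at a high level above. Purely as a conceptual illustration, the sketch below treats calibration as selecting the neutral frame with the lowest model error and expressing subsequent emotion intensities as deviations from that baseline; the function names, the synthetic data, and the simple subtraction are assumptions, not FR's actual implementation.

```python
import numpy as np

# Conceptual illustration only: FR's actual calibration procedure is proprietary.
# Assumption: person-specific calibration is approximated by subtracting the emotion
# intensities of the "most neutral" calibration frame from every frame of that
# person's emotion clips.

def pick_calibration_frame(neutral_intensities, model_error):
    """Select the frame of the neutral clip with the lowest model (fit) error."""
    return neutral_intensities[np.argmin(model_error)]

def calibrate(emotion_intensities, baseline):
    """Express per-frame emotion intensities as deviations from the neutral baseline."""
    return np.clip(emotion_intensities - baseline, 0.0, 1.0)

# Hypothetical data: 50 frames x 6 emotion categories for a 2-s neutral clip ...
rng = np.random.default_rng(0)
neutral = rng.uniform(0.0, 0.2, size=(50, 6))
fit_error = rng.uniform(size=50)
baseline = pick_calibration_frame(neutral, fit_error)

# ... and 100 frames of an emotion clip from the same person.
emotion_clip = rng.uniform(0.0, 1.0, size=(100, 6))
calibrated = calibrate(emotion_clip, baseline)
```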

Data analysis

The software SYSTAT 13 was used to run all statistical analyses. Additional significance testing was conducted using SPSS Statistics 24. All source data, a SYSTAT processing and analysis script, an SPSS version of the analysis file, and future updates are available from the supplementary materials on the Open Science Framework. To extract FR's dominant emotion classification, the maximal intensity of each of the six possible emotion categorizations (happy, sad, angry, disgusted, scared and surprised) was determined for each clip. The emotion categorization with the highest maximal intensity score represents the dominant emotion classification of the video clip. The intensity scores of the categorization neutral were excluded from this analysis, as all videos begin and end with a neutral expression.

As no shots with intended neutral expressions were analyzed, the exclusion of the neutral categorization does not impair the examination of the dominant emotion expression. In the FACS evaluation, AU coding was carried out using the still of the maximal emotion expression of the respective video. To approach the FR data in a similar way, the maximum for each AU over all frames of the entire video was extracted. This method followed the assumption that the frames in which the maximal AU activations occur according to FR correspond, to a certain extent, to the manually selected frame of the maximal emotion expression in the FACS analysis. As both coding methods used the maxima of the AU activations as the basis of the evaluation, the congruence between the manual and automatic coding of the intensity of the AU activations could be examined. Only the 20 AUs that are coded by both FR and FACS were included.
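As an illustration of this aggregation step, the sketch below assumes the continuous FR export has been collected into a single per-frame table. The column labels (clip, happy, AU01, ...) and the small synthetic table are hypothetical placeholders, not FR's actual export format.

```python
import numpy as np
import pandas as pd

# Sketch of the per-clip aggregation described above; a small synthetic per-frame
# table stands in for the continuous FR export. Column names are hypothetical.

EMOTIONS = ["happy", "sad", "angry", "surprised", "scared", "disgusted"]  # "neutral" excluded
AU_COLS = [f"AU{n:02d}" for n in (1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15,
                                  17, 18, 20, 23, 24, 25, 26, 27, 43)]

rng = np.random.default_rng(1)
frames = pd.DataFrame(rng.uniform(0, 1, size=(300, len(EMOTIONS) + len(AU_COLS))),
                      columns=EMOTIONS + AU_COLS)
frames["clip"] = np.repeat([f"clip_{i}" for i in range(3)], 100)  # 3 clips x 100 frames

# Maximum intensity per clip for every emotion category and every AU.
clip_max = frames.groupby("clip")[EMOTIONS + AU_COLS].max()

# Dominant emotion = category with the highest maximal intensity (neutral excluded).
clip_max["dominant_emotion"] = clip_max[EMOTIONS].idxmax(axis=1)
print(clip_max[["dominant_emotion"]])
```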

As the calibration of data in FR does not influence the coding of AUs, the dataset without prior calibration was used for these calculations. We computed Spearman rank-order correlations as a measure of congruence between the maximum values of the AU activations from FR, assessed on a continuous scale, and the ordinal FACS scale scores of the SMoFEE dataset. Our use of Spearman correlations was also based on the observation that most of the AU variables in both the FR and SMoFEE codings were not normally distributed, according to inspection of histograms and/or Shapiro-Wilk tests for normality.

Descriptive statistics: Emotion classification without and with calibration

The dominant emotion classification of a video sequence, that is, the emotion for which FR calculated the highest likelihood, was calculated for the dataset without and with calibration. The rate of congruence between the dominant emotion classification and the intended emotion of the performance condition reflects the degree of correct emotion categorization by FR. If an emotion other than the intended one is classified as the dominant emotion expression by FR, the coding is classified as incorrect. The proportions of correct and incorrect codings by FR, depending on the emotion condition, are shown in the table below.

FaceReader classification (values in brackets: with prior calibration)

| Intended expression | Happiness | Sadness | Anger | Surprise | Fear | Disgust |
|---|---|---|---|---|---|---|
| Happiness | 100% (100%) | 0% (0%) | 0% (0%) | 0% (0%) | 0% (0%) | 0% (0%) |
| Sadness | 3.75% (3.75%) | 75% (73.75%) | 15% (13.75%) | 2.50% (3.75%) | 2.50% (3.75%) | 1.25% (1.25%) |
| Anger | 6.25% (6.25%) | 7.50% (5%) | 83.75% (86.25%) | 0% (0%) | 0% (0%) | 2.50% (2.50%) |
| Surprise | 2.50% (2.50%) | 1.25% (0%) | 1.25% (1.25%) | 87.50% (90%) | 7.50% (6.25%) | 0% (0%) |
| Fear | 2.50% (3.75%) | 3.75% (1.25%) | 7.50% (6.25%) | 32.50% (33.75%) | 51.25% (52.5%) | 2.50% (2.50%) |
| Disgust | 17.50% (17.50%) | 1.25% (0%) | 5% (6.25%) | 0% (0%) | 0% (0%) | 76.25% (76.25%) |

The intended facial expressions are listed in the rows; the FR classifications are given in the columns. The numbers in brackets show the classification with prior calibration. Correct classifications lie on the diagonal (shaded gray in the original table).

The total number of videos per emotion condition is N = 80.

Without prior calibration, FR reached a mean rate of 79% correct identifications. With calibration, the classification was correct in 80% of the cases. Both datasets showed varying degrees of correct categorization depending on the emotion condition. In both cases, expressions in the happiness condition were consistently identified correctly. They were followed, in descending order of correct classification, by the conditions surprise, anger, disgust, sadness and fear for both analysis options. Although the correct identification rate for fear was substantially above chance (16.7%), it was only about 50% in both conditions.

Fear expressions were most frequently falsely coded as surprise; this occurred in both datasets in about one third of the cases. Additionally, in both datasets around one fifth of the disgust expressions were falsely categorized as happiness by FR.
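For readers who want to reproduce this kind of cross-tabulation from their own FR output, the following sketch assumes a per-clip table holding the intended emotion and FR's dominant classification (with both coded in the same label vocabulary); the column names and the few example rows are hypothetical.

```python
import pandas as pd

# Sketch of the classification table above. "intended" and "dominant" are hypothetical
# column names; the rows below stand in for the 480 clip-level results.

clips = pd.DataFrame({
    "intended": ["happy", "happy", "scared", "scared", "scared", "disgusted"],
    "dominant": ["happy", "happy", "surprised", "scared", "scared", "happy"],
})

# Row-wise percentages: how FR distributed its classifications per intended emotion.
confusion = pd.crosstab(clips["intended"], clips["dominant"], normalize="index") * 100
print(confusion.round(2))

# Mean rate of correct classification, averaged over the emotion conditions.
correct = clips["intended"].eq(clips["dominant"]).groupby(clips["intended"]).mean()
print(f"Mean correct classification: {correct.mean():.0%}")
```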

Overall, the differences between the coding of emotion expressions without and with calibration were only marginal.

Inferential statistics: Congruence between manual and automatic AU coding

Spearman correlations between the FR and FACS coding of the 20 relevant AUs were calculated for each emotion condition.

Additional significance tests were only carried out for the congruent correlations in every emotion condition, meaning the coding of the same AU with the two different methods, as other correlations seemed less relevant for the purpose of this research. The correlations between the manual and automatic coding of AUs, structured by emotion performance condition, are reported together with the corresponding means and standard deviations. Because different AUs are relevant for each expression, we focused on the correlations of these essential AUs.
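A minimal sketch of this congruence analysis is given below, using Spearman rank-order correlations per emotion condition and essential AU. The merged table is synthetic, the AU column naming scheme (AU04_fr, AU04_facs, ...) is hypothetical, and only the conditions whose essential AUs are listed later in this section are included.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Sketch: Spearman correlations between continuous FR maxima and ordinal FACS codes,
# computed separately per emotion condition for its essential AUs. Synthetic data;
# sadness and fear are omitted because their AU lists are not given in this section.

ESSENTIAL_AUS = {                     # essential AUs per the FACS Investigator's Guide
    "happiness": [6, 12],
    "anger": [4, 5, 7, 10, 17, 23, 24, 25, 26],
    "surprise": [1, 2, 5, 25, 26, 27],
    "disgust": [9, 10, 15, 17, 25, 26],
}
ALL_AUS = sorted({au for aus in ESSENTIAL_AUS.values() for au in aus})

rng = np.random.default_rng(2)
n = 80  # clips per emotion condition
merged = pd.DataFrame({"intended": np.repeat(list(ESSENTIAL_AUS), n)})
for au in ALL_AUS:
    merged[f"AU{au:02d}_facs"] = rng.integers(0, 6, size=len(merged))  # ordinal 0-5
    merged[f"AU{au:02d}_fr"] = rng.uniform(0, 1, size=len(merged))     # continuous 0-1

for emotion, aus in ESSENTIAL_AUS.items():
    subset = merged[merged["intended"] == emotion]
    for au in aus:
        rho, p = spearmanr(subset[f"AU{au:02d}_fr"], subset[f"AU{au:02d}_facs"])
        print(f"{emotion:9s} AU{au:02d}: rho = {rho:5.2f}, p = {p:.3f}")
```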


The AU configurations according to the FACS Investigator's Guide form an appropriate guideline for the relevant AUs in each emotion condition, and the essential AUs for the respective expression are highlighted accordingly in the correlation table. The correlation matrix contains various missing values because either FR, or FACS, or both methods classified these AUs as inactive in all cases. The relevant AUs for the expression of the respective emotion are marked gray. The range for FACS values is 0–5; the range for FR values is 0–1.

Finally, additional evaluation metrics for assessing the performance of AU measurement in FR were computed. Presence refers to the frequency with which FR and FACS coding detected an AU activation in the 480 emotion sequences.

Recall gives the proportion of FACS-coded AU activations that FR also detected (i.e., activations detected by both methods divided by all FACS-coded activations). Precision indicates how often FaceReader is correct when classifying an AU as present (i.e., activations detected by both methods divided by all FR detections). F1 summarizes the trade-off between recall and precision via the formula 2 x (Precision x Recall)/(Precision + Recall). Accuracy gives the percentage of correct classifications according to the formula (correctly classified AU absences + correctly classified AU presences)/number of emotion clips. The results indicate that FR detected AUs more frequently in the full emotion clips than FACS detected them in the maximal-expression still images.
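The sketch below illustrates how these indices can be computed for a single AU from binary presence/absence decisions of FR and FACS across the emotion clips; the example vectors are hypothetical, and it assumes each method detected the AU at least once.

```python
import numpy as np

# Evaluation metrics for one AU, given two binary vectors over the emotion clips:
# whether FR detected the AU and whether FACS coded it as active.

def au_metrics(fr_present, facs_present):
    fr = np.asarray(fr_present, dtype=bool)
    facs = np.asarray(facs_present, dtype=bool)
    tp = np.sum(fr & facs)            # AU present according to both methods
    recall = tp / facs.sum()          # share of FACS-coded activations found by FR
    precision = tp / fr.sum()         # share of FR detections confirmed by FACS
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = np.mean(fr == facs)    # correct presence and absence decisions
    return recall, precision, f1, accuracy

# Hypothetical example over 10 clips.
fr = [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]
facs = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]
print(au_metrics(fr, facs))  # recall .80, precision .80, F1 .80, accuracy .80
```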

Results for the F1 measure, which best reflects the FACS category agreement calculation, suggest that FR measurements of AUs 1, 2, 4, 5, 6, 7, 9, 10, 12, 17, and 25 all exceeded the .70 threshold needed to pass the FACS calibration test and therefore represent sufficiently precise assessments of the activation of these AUs. FR measurements of AUs 24, 26, and 27 performed in the acceptable range (.60 to .70), and FR measurements of AUs 14, 15, 18, 20, 23, and 43 performed poorly (below .60).

Happiness

AUs 6 (cheek raiser) and 12 (lip corner puller, zygomaticus major) are responsible for this expression, but a significant positive correlation between the manual and automatic coding could only be confirmed for AU 6. For AU 12 we found no substantial correlation despite good between-measure agreement (F1), perhaps due to a ceiling effect leading to restricted variance in the FR and FACS measurements and hence also restricted covariance between them. Additional significant correlations were observed for AUs 10, 25 and 26.

Anger

The activation of AUs 4 (brow lowerer), 5 (upper lid raiser), 7 (lid tightener), 10 (upper lip raiser), 17 (chin raiser), 23 (lip tightener), 24 (lip pressor), 25 (lips part) and 26 (jaw drop) is associated with the facial expression of anger. We observed congruence between the manual and automatic coding for AUs 4, 5, 7, 10, 17, 24 and 25. Only for AUs 23 and 26 did no significant correlations occur.

These AUs were also characterized by unsatisfactory performance indices (i.e., F1). Further associations appeared for AUs 6, 9, 12 and 15.

Surprise

This expression is characterized by activations of AUs 1 (inner brow raiser), 2 (outer brow raiser), 5 (upper lid raiser), 25 (lips part), 26 (jaw drop) and 27 (mouth stretch). For five of the six relevant AUs (1, 2, 5, 25, and 27) we found significant convergence between FACS and FR.

For AU 26 no significant result emerged. Consistent with the lack of convergence in our correlation analyses, AUs 26 and 27 performed poorly overall according to the F1 index of agreement. For AUs 7 and 18, manual and automatic coding also converged.

Disgust

The expression of disgust can be traced to activation of AUs 9 (nose wrinkler), 10 (upper lip raiser), 15 (lip corner depressor), 17 (chin raiser), 25 (lips part) and 26 (jaw drop). We found congruence between the FR and FACS coding for AUs 9, 17, 25 and 26. For the coding of AUs 10 and 15 we observed no significant correlations.

The latter AU was also characterized by poor between-measure agreement overall according to the F1 index. Additional significant positive correlations emerged for AUs 1, 2, 4, 6 and 20.

Discussion

Although FACS represents a pioneering method of emotion research, the coding system is very time-consuming, and this may partially impede its actual usage. Automated facial coding software largely eliminates this disadvantage by offering the promise of valid emotion classification and AU coding in a fraction of the usual time. The objective of this study was to verify the validity of such automated methods using FR version 7.

The convergence between manual and automated coding of AU intensities and the degree of correct emotion classification were determined. Overall, the results of this study support the validity of automated coding methods.

Accuracy of emotion coding

FR accomplished correct classifications of the intended emotion expressions in 79% and 80% of the cases without and with calibration, respectively. These and all following classification rates need to be compared to a baseline of 16.7%, representing chance performance. Happy expressions were consistently coded correctly. For the other emotions, FR always identified the intended expression with the highest likelihood; however, incorrect classifications occurred as well. Expressions of surprise and anger had the lowest rates of false codings.

In contrast, almost half of the fear expressions were classified incorrectly by the software. Notably, however, even in this case FR did not fare substantially worse than human coders tested across many studies (i.e., 39% to 90%), possibly due to the high similarity between the AU activation patterns of fear and surprise. For all other expressions, FR performed consistently better than the average human coder. We conclude from our findings that for the emotions happiness, surprise, anger, disgust and sadness, FR-based categorization can be rated as valid, while the identification rate for fear constitutes a limitation of FR's capacity that is similar to that of average human coders.

Another aspect we assessed was whether calibrating the software for individual physiognomic features enhanced emotion identification.

We found only minor differences between the results with and without calibration. Except for expressions of sadness and disgust, the performance of the calibrated analysis was marginally better than without calibration. However, the differences between the two analysis options seem negligible and unsystematic. Compared to the effort necessary for implementing the calibration, the resulting improvements seem minuscule and dispensable, particularly when dealing with large sets of videos.

Compared to emotion classification results obtained with FR 6, version 7 performed somewhat more poorly in certain conditions. First, these authors' mean percentage of correct classification was higher (88%) than what we achieved in our study (80%). Furthermore, the range of correct classifications across emotion conditions was more favorable in their study, at 76% to 94%, compared to 51% to 100% in our study.

Only for expressions of happiness and anger did FR 7 yield slightly better results than its predecessor. However, the differences between our study and the one by Lewinski et al. may also be due to the different stimulus materials used.

Although both studies used high-quality frontal recordings of the face, Lewinski et al.'s analysis was based on stills, while this study relied on video sequences. Perhaps the coding of videos with dynamic expressions, such as the SMoFEE dataset, is more demanding for FR and thus more prone to errors than the coding of prototypical stills.

Taking into account that the SMoFEE sequences are typical but not completely standardized emotion expressions, comparatively lower detection rates than for fully standardized stills can be expected. Nevertheless, the mean correct identification of 80% of such recordings suggests that FR 7 provides an overall valid classification of emotional expressions. Another reason for the difference between our findings and Lewinski et al.'s may be that the data basis on which FR 7 was trained was larger than the one used for FR 6, and that the general face model was improved from FR 6 to FR 7. More generally, because different software packages, and different versions of a given package, use different data sets for training their algorithms, some performance differences between applications of these packages and versions are to be expected, particularly if they are applied to different and novel sets of pictures. It would therefore be helpful for researchers interested in comparing the performance of different versions of software like FR if manufacturers made all technical information about the training, validation, and classification approach of each version permanently available on their website or an open-science resource such as the OSF.

Convergent validity of manual and automated AU coding

To evaluate convergence between manual and automated coding of specific AUs, we focused in our correlation analyses, for each AU, only on those emotions for which it was a key ingredient and therefore should show sufficient variation (i.e., the correlations marked gray in the correlation table). In addition, we proceeded on the assumption that inter-coder correlation coefficients above .40 indicate fair, and values above .60 good, agreement.

Taking these considerations into account, across emotional expressions agreement was good for AUs 1 and 2, fair for AUs 9, 17, and 20, mostly fair (i.e., with a minority of coefficients below .40) for AUs 4 and 5, and insufficient for AUs 6, 7, 10, 12, 15, 23, 24, 25, 26, and 27. AUs 14, 18, and 43 were not part of any prototypical emotion expression, and interpreting their correlation coefficients may therefore be hampered by a lack of variation (i.e., range restriction). Convergence for these latter AUs may be more meaningfully tested in future studies that specifically target expressions involving these AUs.

The low convergence coefficients for AUs 12 (lip corner puller) and 26 (jaw drop) are an unexpected result, as the features of these AUs are rather distinctive and deficient detection of their activations thus seems unlikely. The latter conclusion is also underscored by the acceptable levels of between-measure agreement according to the F1 index. For AU 12, the result may have been due in part to a ceiling effect for the expression of happiness, limiting the extent to which FACS and FR measurements could covary across their full scales. For AU 26, the difficulty of differentiating between this AU and AUs 25 and 27 might have caused the low convergence between manual and automated coding concerning the activated AU.

Taken together, 7 of the 17 AUs with meaningful variation (excluding AUs 14, 18, and 43) offer fair to good convergence according to their correlation coefficients (at or above .40), while the rest fall below this threshold.

Implications for applications

What are some of the implications of our findings for using FR 7 (versus FACS) in future research? At first blush, our results for emotion classification suggest that FR 7 is more efficient at classifying and quantifying the intensity of basic emotional expressions than FACS, particularly when it comes to video as opposed to stills. Hence, FR 7 could be applied to measuring individuals' emotional expressions in continuously filmed therapy sessions, laboratory interactions, field settings, and so on. However, how well FR 7 performs will likely depend on the quality of the video material: how well the target person is visible from the front, as opposed to being filmed from an angle or moving around, how well the face is lit, whether only one target person is continually visible or other individuals enter and leave the frame, and many other potential distortions. With regard to these factors, our study used material of optimal quality, and the accuracy we achieved both for FR 7 and for FACS therefore probably represents an upper bound. Although FR 7 is designed to also detect emotions under non-optimal conditions, there is a gradient of accuracy, with the highest levels obtained for material filmed under optimal conditions such as in our study and decreasing levels as filming conditions deteriorate.

One possible danger that we were unable to evaluate in the present study, but that should be addressed in future work, is to what extent FR 7 will be more prone to misclassifying emotional expressions under non-optimal conditions.

This is particularly likely for the fear expression, but to a lesser extent also for almost all other expressions except happiness. If left unchecked, such misclassifications could accumulate over the course of a long video, yielding an increasingly invalid aggregate assessment of the emotional expressions of a given target person.

On the other hand, it is also conceivable that if measurement errors occur by chance and do not reflect a particular bias to misclassify a given emotion systematically, aggregation across many measurements (e.g., 25 frames per second in film material) will yield a particularly accurate picture of a target person's emotional dynamics. To resolve these issues, we suggest that more work is needed in which trained actors deliberately encode series of emotional expressions in more natural ways and settings and under non-optimal filming conditions, and FR 7 and FACS codings of such film clips are then compared to the intended, encoded emotions as well as to each other.

We expect that obtaining high accuracy will be even more daunting for AU codings from non-optimal video material. As our results show, even a relatively straightforward classification of affect into positive and negative responses based on corrugator and zygomaticus activations, as proposed by Cacioppo and colleagues, can be challenging, as within-emotion convergence between FR and FACS measurements for the corresponding AUs (4, 6, and 12) falls between .08 and .53 (Spearman correlations) even under optimal conditions. However, we cannot rule out that correlation coefficients underestimate the true level of convergence, particularly when the activation of an AU or set of AUs is near a ceiling and correlation coefficients are attenuated due to restricted variances, as in the case of AU 12 in the context of a happy expression. The satisfactory overall level of agreement according to the F1 index for AU 12, and also for AUs 4 and 6, suggests that convergence assessed via correlation may not tell the whole story under conditions characterized by range restriction.

To ensure the accurate classification of AU activations, it is of course critical to capture facial features on video under optimal conditions (i.e., well-lit portrait shots with little overall body movement, captured on high-resolution HD video).

This may be easier to achieve under some circumstances (e.g., therapy sessions, lab tasks) than others (e.g., videos obtained in field settings or with webcams). Nevertheless, the more researchers pay attention to capturing the "facial action" of their research participants on video under optimal conditions, the more accurate and valid the codings resulting from an FR 7 analysis will be.


Only then can the greater efficiency of the software approach to measuring facial emotion be exploited to its full extent.

Limitations

In addition to the limitations already discussed, the sample of this study was composed of young Caucasian adults only. The reported validity of the software is thus restricted to such subjects, as features like age and ethnicity can influence the performance of FR.


This study can therefore serve only as an orientation point. For other types of samples and video material, the validity of FR 7 codings should be tested in pilot work involving test participants who deliberately encode relevant emotional expressions under typical filming conditions.

Another critical aspect is that neutral expressions were left out of the examination of emotion classification. This was due to the use of video sequences containing neutral-expression segments and the resulting risk of an inflationary categorization of expressions as neutral regardless of the emotion expressed, as stated in the method section. As a consequence, we lack validity indicators both for the coding of intentionally neutral expressions as neutral and for the erroneous coding of emotion expressions as neutral. In future analyses of video sequences, the neutral-expression segments should be separated from the intentional emotion expressions to avoid this problem.