Scientific Publications
Human Digital Double Study (ACM Siggraph MIG 2025)
Authors: Siyi Liu, Kazi Injamamul Haque and Zerrin Yumak
Abstract: Creating human digital doubles is becoming easier and much more accessible to everyone using consumer grade devices. In this work, we investigate how avatar style (realistic vs cartoon) and avatar familiarity (self, acquaintance, unknown person) affect self/other-identification, perceived realism, affinity and social presence with a controlled offline experiment. We created two styles of avatars (realistic-looking MetaHumans and cartoon-looking ReadyPlayerMe avatars) and facial animations stimuli for them using performance capture. Questionnaire responses demonstrate that higher appearance realism leads to a higher level of identification, perceived realism and social presence. However, avatars with familiar faces, especially those with high appearance realism, lead to a lower level of identification, perceived realism, and affinity. Although participants identified their digital doubles as their own, they consistently did not like their avatars, especially of realistic appearance. But they were less critical and more forgiving about their acquaintance's or an unknown person's digital double.
Benchmark Study (CGF Journal, Eurographics 2025)
Authors: Kazi Injamamul Haque, Alkiviadis Pavlou and Zerrin Yumak
Abstract: Recent advancements in the field of audio-driven 3D facial animation have accelerated rapidly, with numerous papers being published in a short span of time. This surge in research has garnered significant attention from both academia and industry with its potential applications on digital humans. Various approaches, both deterministic and non-deterministic, have been explored based on foundational advancements in deep learning algorithms. However, there remains no consensus among researchers on standardized methods for evaluating these techniques. Additionally, rather than converging on a common set of datasets and objective metrics suited for specific methods, recent works exhibit considerable variation in experimental setups. This inconsistency complicates the research landscape, making it difficult to establish a streamlined evaluation process and rendering many cross-paper comparisons challenging. Moreover, the common practice of A/B testing in perceptual studies focus only on two common metrics and not sufficient for non-deterministic and emotion-enabled approaches. The lack of correlations between subjective and objective metrics points out that there is a need for critical analysis in this space. In this study, we address these issues by benchmarking state-of-the-art deterministic and non-deterministic models, utilizing a consistent experimental setup across a carefully curated set of objective metrics and datasets. We also conduct a perceptual user study to assess whether subjective perceptual metrics align with the objective metrics. Our findings indicate that model rankings do not necessarily generalize across datasets, and subjective metric ratings are not always consistent with their corresponding objective metrics. The supplementary video, edited code scripts for training on different datasets and documentation related to this benchmark study are made publicly available.
ProbTalk3D-X (Computers & Graphics Journal 2025)
Authors: Kazi Injamamul Haque, Sichun Wu and Zerrin Yumak
Abstract: Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. That is mainly due to the lack of emotionally rich facial animation data and algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of the models are deterministic, meaning given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial to generate diverse and emotionally-rich facial animations. In this paper, we present ProbTalk3D-X by extending a prior work ProbTalk3D- a two staged VQ-VAE based non-deterministic model, by additionally incorporating prosody features for improved facial accuracy using an emotionally rich facial animation dataset, 3DMEAD. Further, we present a comprehensive comparison of non-deterministic emotion controllable models (including new extended experimental models) leveraging VQ-VAE, VAE and diffusion techniques. We provide an extensive comparative analysis of the experimental models against the recent 3D facial animation synthesis approaches, by evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. Our evaluation demonstrates that ProbTalk3D-X and original ProbTalk3D achieve superior performance compared to state-of-the-art emotion-controlled, deterministic and non-deterministic models. We recommend watching the supplementary video for visual quality judgment. The entire codebase including the extended models is publicly available.
ProbTalk3D (ACM SIGGRAPH MIG 2024)
Authors: Sichun Wu, Kazi Injamamul Haque and Zerrin Yumak
Abstract: Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. That is mainly due to the lack of emotionally rich facial animation data and algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, majority of the models are deterministic, meaning given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial to generate diverse and emotionally-rich facial animations. In this paper, we propose ProbTalk3D a non-deterministic neural network approach for emotion controllable speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and an emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against the recent 3D facial animation synthesis approaches, by evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. To our knowledge, that is the first non-deterministic 3D facial animation synthesis method incorporating a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available.
FaceXHuBERT (ACM SIGCHI ICMI 2023)
Authors: Kazi Injamamul Haque and Zerrin Yumak
Abstract: This paper presents FaceXHuBERT, a text-less speech-driven 3D facial animation generation method that generates facial cues driven by an emotional expressiveness condition. In addition, it can handle audio recorded in a variety of situations (e.g. background noise, multiple people speaking). Recent approaches employ end-to-end deep learning taking into account both audio and text as input to generate 3D facial animation. However, scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck. The resulting animations still have issues regarding accurate lip-syncing, emotional expressivity, person-specific facial cues and generalizability. In this work, we first achieve better results than state-of-the-art on the speech-driven 3D facial animation generation task by effectively employing the self-supervised pretrained HuBERT speech model that allows to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Second, we incorporate emotional expressiveness modality by guiding the network with a binary emotion condition. We carried out extensive objective and subjective evaluations in comparison to ground-truth and state-of-the-art. A perceptual user study demonstrates that expressively generated facial animations using our approach are indeed perceived more realistic and are preferred over the non-expressive ones. In addition, we show that having a strong audio encoder alone eliminates the need of a complex decoder for the network architecture, reducing the network complexity and training time significantly. We provide the code publicly and recommend watching the video.
FaceDiffuser (ACM SIGGRAPH MIG 2023)
Authors: Stefan Stan, Kazi Injamamul Haque and Zerrin Yumak
Abstract: Speech-driven 3D facial animation synthesis has been a challenging task both in industry and research. Recent methods mostly focus on deterministic deep learning methods meaning that given a speech input, the output is always the same. However, in reality, the non-verbal facial cues that residethroughout the face are non-deterministic in nature. In addition, majority of the approaches focus on 3D vertex based datasets and methods that are compatible with existing facial animation pipelines with rigged characters is scarce. To eliminate these issues, we present FaceDiffuser, a non-deterministic deep learning model to generate speech-driven facial animations that is trained with both 3D vertex and blendshape based datasets. Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input. To the best of our knowledge, we are the first to employ the diffusion method for the task of speech-driven 3D facial animation synthesis. We have run extensive objective and subjective analyses and show that our approach achieves better or comparable results in comparison to the state-of-the-art methods. We also introduce a new in-house dataset that is based on a blendshape based rigged character. We recommend watching the accompanying supplementary video.
Doctoral Consortium (SIGGRAPH Asia 2023)
Authors: Kazi Injamamul Haque
Abstract: This doctoral research focuses generating expressive 3D facial animation for digital humans by studying and employing data-driven techniques. Faces are the first point of interest during human interaction, and it is not any different for interacting with digital humans. Even minor inconsistencies in facial animation can disrupt user immersion. Traditional animation workflows prove realistic but time-consuming and labor-intensive that cannot meet the ever-increasing demand for 3D contents in recent years. Moreover, recent data-driven approaches focus on speech-driven lip synchrony, leaving out facial expressiveness that resides throughout the face. To address the emerging demand and reduce production efforts, we explore data-driven deep learning techniques for generating controllable, emotionally expressive facial animation. We evaluate the proposed models against state-of-the-art methods and ground-truth, quantitatively, qualitatively, and perceptually. We also emphasize the need for non-deterministic approaches in addition to deterministic methods in order to ensure natural randomness in the non-verbal cues of facial animation.
Appearance and Animation Realism Study (ACM SIGAI IVA 2023)
Authors: Nabila Amadou, Kazi Injamamul Haque and Zerrin Yumak
Abstract: 3D Virtual Human technology is growing with several potential applications in health, education, business and telecommunications. Investigating the perception of these virtual humans can help guide to develop better and more effective applications. Recent developments show that the appearance of the virtual humans reached to a very realistic level. However, there is not yet adequate analysis on the perception of appearance and animation realism for emotionally expressive virtual humans. In this paper, we designed a user experiment and analyzed the effect of a realistic virtual human's appearance realism and animation realism in varying emotion conditions. We found that higher appearance realism and higher animation realism leads to higher social presence and higher attractiveness ratings. We also found significant effects of animation realism on perceived realism and emotion intensity levels. Our study sheds light into how appearance and animation realism effects the perception of highly realistic virtual humans in emotionally expressive scenarios and points out to future directions.
Musicality (GALA 2019)
Authors: Nouri Khalass, Georgia Zarnomitrou, Kazi Injamamul Haque, Salim Salmi, Simon Maulini, Tanja Linkermann, Nestor Z. Salamon, J. Timothy Balint and Rafael Bidarra
Abstract: Musicality is the concept that refers to a person’s ability to perceive and reproduce music. Due to its complexity, it can be best defined by different aspects of music like pitch, harmony, etc. Scientists believe that musicality is not an inherent trait possessed only by musicians but something anyone can nurture and train in themselves. In this paper we present a new game, named Musicality, that aims at measuring and improving the musicality of any person with some interest in music. Our application offers users a fun, quick, interactive way to accomplish this goal at their own pace. Specifically, our game focuses on three of the most basic aspects of musicality: instrument recognition, tempo and tone. For each aspect we created different mini-games in order to make training a varied and attractive activity.