Research — Resonance

Contents

I. Voice Biomarkers & Low Mood Detection
II. Acoustic Features & Emotional Tension
III. AVEC Benchmark Challenges
IV. Deep Learning & Transformer Models
V. Recent Advances (2021–2025)

Section I

Voice Biomarkers & Low Mood Detection

A substantial body of research demonstrates that acoustic features of speech — including pitch, energy, speech rate, and voice quality — carry measurable information about a speaker's emotional state and mood.

2015 Speech Communication, Elsevier Review

A Review of Depression and Suicide Risk Assessment Using Speech Analysis

Cummins, N., Scherer, S., Krajewski, J., Schnieder, S., Epps, J., & Quatieri, T. F.

A comprehensive review of acoustic speech features used to assess low mood and suicide risk. The authors survey pitch, energy, formants, Mel-frequency cepstral coefficients (MFCCs), and prosodic features across more than a decade of studies, summarising which features are most discriminative for mood-related states.

Key finding: Reduced fundamental frequency (F0), lower speech energy, and increased pause duration are among the most consistently replicated acoustic markers associated with low mood states across diverse populations.
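
For readers who want to see what these three markers look like in practice, here is a minimal sketch of how F0, frame energy, and pause proportion could be computed from a mono waveform. It is not the method of the review or of any cited study. The function names, the 50 ms frame length, the 75–400 Hz pitch search range, and the silence threshold are all assumptions chosen for illustration, and the input is a synthetic tone followed by silence rather than real speech.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def rms_energy(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_f0(frame, sample_rate, f_min=75.0, f_max=400.0):
    """Estimate F0 in Hz as the lag with the highest autocorrelation.

    The search range (f_min, f_max) is an assumed range for adult speech.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def pause_ratio(frames, threshold):
    """Fraction of frames whose RMS energy falls below a silence threshold."""
    silent = sum(1 for f in frames if rms_energy(f) < threshold)
    return silent / len(frames)

# Synthetic input (illustrative only): 0.5 s of a 120 Hz tone, then 0.5 s of silence.
SAMPLE_RATE = 8000
voiced = [0.5 * math.sin(2 * math.pi * 120.0 * n / SAMPLE_RATE) for n in range(4000)]
signal = voiced + [0.0] * 4000

frames = frame_signal(signal, frame_len=400, hop=200)   # 50 ms frames, 25 ms hop
f0_hz = estimate_f0(frames[0], SAMPLE_RATE)              # close to 120 Hz
pauses = pause_ratio(frames, threshold=0.05)             # roughly half the frames
```

Real speech needs voicing detection and more robust pitch tracking than a single autocorrelation peak, so treat this only as a picture of what the features measure.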

2010 IEEE Transactions on Biomedical Engineering Key Study

Detection of Clinical Depression in Adolescents' Speech During Family Interactions

Low, L. S. A., Maddage, N. C., Lech, M., Sheeber, L. B., & Allen, N. B.

Investigated the use of prosodic and spectral speech features to detect clinically assessed low mood in adolescents during naturalistic family conversations. Speech features were extracted and classified using support vector machines.

Key finding: Achieved classification accuracy above chance using purely acoustic features extracted from spontaneous conversational speech, supporting the feasibility of passive voice-based mood monitoring in real-world settings.

2007 Journal of Neurolinguistics Clinical

Voice Acoustic Measures of Depression Severity and Treatment Response Collected via Interactive Voice Response (IVR) Technology

Mundt, J. C., Snyder, P. J., Cannizzaro, M. S., Chappie, K., & Geralts, D. S.

Examined whether acoustic vocal measures collected remotely through telephone IVR systems could track changes in mood severity over the course of treatment. Vocal features including speaking rate, pitch variability, and energy were correlated with clinician-rated mood assessments.

Key finding: Changes in vocal acoustic features — particularly decreased pitch range and speaking rate — tracked clinician-assessed changes in mood severity over time, suggesting voice as a viable longitudinal monitoring signal.
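
The idea of a vocal feature "tracking" a clinician rating over time can be made concrete with a correlation coefficient. The sketch below computes Pearson's r between a weekly speaking-rate series and a weekly severity score. The numbers are invented for illustration and are not data from the study, and the variable names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented weekly values, for illustration only (not study data).
speaking_rate = [3.1, 3.3, 3.4, 3.8, 4.0, 4.2]   # syllables per second
severity_score = [22, 20, 19, 15, 12, 10]        # clinician-rated scale

r = pearson_r(speaking_rate, severity_score)     # strongly negative here
```

A strongly negative r in this toy series means speaking rate rises as the rated severity falls, which is the direction of association the key finding describes.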

2012 FLAIRS Conference Proceedings

From Joyous to Clinically Depressed: Mood Detection Using Spontaneous Speech

Alghowinem, S., Goecke, R., Wagner, M., Parker, G., & Breakspear, M.

Compared acoustic-prosodic features between individuals with clinician-assessed low mood and matched controls during spontaneous, unscripted speech. Features included MFCCs, energy, pitch statistics, and voice quality measures.

Key finding: Spontaneous speech yielded higher discrimination accuracy than read speech, indicating that naturalistic conversational settings capture richer mood-related vocal variation.

2018 Interspeech Key Study

Detecting Depression with Audio/Text Sequence Modeling of Interviews

Alhanai, T., Ghassemi, M. M., & Glass, J. R.

Jointly modelled acoustic speech sequences and linguistic transcript features from clinical interview recordings using recurrent (LSTM) sequence models. Demonstrated that combining audio and text representations yields better discrimination than either modality alone.

Key finding: Fusing audio and text sequence models outperformed audio-only and text-only baselines, supporting multimodal sequence modelling as a strong approach to voice-based mood assessment.
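
To illustrate the general idea of combining modalities, here is a minimal sketch of weighted late fusion, in which each modality produces its own score and the two are averaged. This is a deliberately simpler technique than the sequence models used in the study, and it is not the authors' architecture. The function name, the weight, and the example scores are assumptions.

```python
def late_fusion(audio_prob, text_prob, audio_weight=0.5):
    """Weighted average of two per-modality probabilities in [0, 1].

    audio_weight is a hypothetical tuning parameter; in practice it would be
    chosen on held-out validation data.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be between 0 and 1")
    return audio_weight * audio_prob + (1.0 - audio_weight) * text_prob

# Hypothetical per-modality scores for one interview.
fused = late_fusion(audio_prob=0.70, text_prob=0.40, audio_weight=0.6)
```

With these example values the fused score is 0.6 × 0.70 + 0.4 × 0.40 = 0.58, sitting between the two single-modality scores and closer to the more heavily weighted one.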

2016 Interspeech Key Study

Detecting Depression Using Vocal, Cognitive and Motor Biomarkers

Williamson, J. R., Quatieri, T. F., Teverovsky, B., Ghosh, S., & Ciccarelli, G.

Presented a multimodal approach combining vocal biomarkers (prosody, voice quality), cognitive processing speed, and motor features to detect low mood states. Work conducted at MIT Lincoln Laboratory.

Key finding: Vocal biomarkers alone provided significant discrimination, and combining modalities further improved performance — supporting voice as the most accessible single-modality signal for passive mood monitoring.

Section II

Acoustic Features & Emotional Tension

Research in affective computing and psychoacoustics has long identified distinct vocal signatures associated with heightened emotional arousal and stress states, including changes to pitch, voice tremor, and speech rate.

2003 Speech Communication, Elsevier Review

Vocal Communication of Emotion: A Review of Research Paradigms

Scherer, K. R.

A foundational review by one of the pioneers of vocal emotion research at the University of Geneva. Scherer synthesises decades of experimental evidence linking specific acoustic parameters — particularly pitch, voice quality, and speech rate — to discrete emotional states and arousal levels.

Key finding: Heightened emotional arousal consistently produces a higher mean fundamental frequency (F0), greater F0 variability, and a faster speech rate. These acoustic signatures are measurable in conversational speech, which makes them candidates for automated detection.

2012 UbiComp Key Study

StressSense: Detecting Stress in Unconstrained Acoustic Environments Using Smartphones

Lu, H., Frauendorfer, D., Rabbi, M., Mast, M. S., Chittaranjan, G. T., Campbell, A. T., Gatica-Perez, D., & Choudhury, T.

Demonstrated that stress in everyday unconstrained settings — including outdoor environments with background noise — can be detected from smartphone microphone recordings. Acoustic features were extracted from short speech segments during both phone calls and in-person conversations.

Key finding: Detection accuracy remained robust even in noisy, real-world conditions, supporting the feasibility of always-available stress detection from smartphone microphone data.

2014 IEEE Transactions on Affective Computing

Discriminating Stress from Cognitive Load Using a Wearable EDA and Speech Analysis System

Gjoreski, M., Luštrek, M., Gams, M., & Gjoreski, H.

Examined the use of speech-derived features alongside physiological signals to distinguish emotional tension from cognitive load. Found that acoustic speech features, when analysed in combination with context, provided the most interpretable and user-friendly signal for wearable deployment.

Key finding: Speech-derived features were competitive with physiological signals for detecting elevated emotional tension, and required no skin contact — making them ideal for unobtrusive wearable monitoring.

2015 Behavior Research Methods

Social Anxiety and Voice: Acoustic Measures of Social Speech Under Evaluative Threat

Laukka, P., Åhs, F., Furmark, T., & Fredrikson, M.

Measured acoustic changes in the voices of individuals during social performance tasks, comparing high- and low-social-tension groups. Found consistent acoustic markers of elevated tension in specific prosodic dimensions.

Key finding: Elevated vocal tension was reliably indexed by increased mean F0, reduced F0 range, and higher vocal tremor, providing a reproducible acoustic signature for social tension states.

Section III

AVEC Benchmark Challenges

The Audio/Visual Emotion Challenge (AVEC) series, held annually at ACM Multimedia, has been the leading international benchmark for developing and evaluating AI models that assess emotional and mood states from voice and video.

2013 ACM Multimedia Workshop (AVEC 2013) Benchmark

AVEC 2013: The Continuous Audio/Visual Emotion and Depression Recognition Challenge

Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., Schnieder, S., Cowie, R., & Pantic, M.

The first AVEC challenge to include a dedicated low-mood severity sub-challenge, using the AVEC 2013 corpus collected at the Universität Ulm. Participants submitted systems predicting continuous mood severity scores from spontaneous audio-visual recordings.

Key finding: Acoustic features — particularly MFCCs and energy-based features — consistently outperformed visual features in predicting mood severity, establishing audio as the primary modality for mood assessment from naturalistic recordings.

2016 ACM Multimedia Workshop (AVEC 2016) Benchmark

AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge

Valstar, M., Gratch, J., Schuller, B., Ringeval, F., Lalanne, D., Torres Torres, M., Scherer, S., Stratou, G., Cowie, R., & Pantic, M.

Introduced the DAIC-WOZ corpus — a dataset of clinical interviews conducted by an animated virtual agent — as the benchmark for mood severity prediction. The DAIC-WOZ corpus subsequently became one of the most widely used datasets in the field.

Key finding: Systems combining acoustic and linguistic features achieved the strongest performance, with temporal modelling (LSTMs) outperforming frame-level feature averaging, highlighting the importance of capturing vocal dynamics over time.

2017 ACM Multimedia Workshop (AVEC 2017) Benchmark

AVEC 2017: Real-life Depression, and Affect Recognition Workshop and Challenge

Ringeval, F., Schuller, B., Valstar, M., Gratch, J., Cowie, R., Scherer, S., Mozgai, S., Cummins, N., Schmitt, M., & Pantic, M.

Extended the DAIC-WOZ challenge with continuous-time affect prediction, requiring systems to model moment-to-moment changes in emotional state across an interview. This edition emphasised longitudinal and temporal modelling.

Key finding: Recurrent neural networks trained on time-series acoustic features substantially outperformed non-temporal models, demonstrating that the trajectory of vocal features — not just snapshot statistics — carries critical mood information.
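
A small worked example shows why a trajectory can carry information that a snapshot statistic misses. The sketch below is not a recurrent network; it just compares two invented per-segment F0 series that share the same mean but differ in their least-squares slope. The function names and values are illustrative assumptions.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def linear_slope(values):
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

# Invented per-segment mean F0 values in Hz, for illustration only.
steady = [150.0, 150.0, 150.0, 150.0, 150.0]
falling = [170.0, 160.0, 150.0, 140.0, 130.0]

same_mean = mean(steady) == mean(falling)   # True: both average 150 Hz
steady_slope = linear_slope(steady)         # 0 Hz per segment
falling_slope = linear_slope(falling)       # -10 Hz per segment
```

A model that only sees the average cannot tell these two series apart, while anything that looks at the order of the values can. Temporal models such as RNNs learn far richer patterns than a single slope, but the principle is the same.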

2019 ACM Multimedia Workshop (AVEC 2019) Benchmark

AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition

Ringeval, F., Schuller, B., Valstar, M., Cummins, N., Cowie, R., Tavabi, L., Schmitt, M., Alisamir, S., Amiriparian, S., Messner, E. M., et al.

Introduced a "state-of-mind" sub-challenge for predicting subjective wellbeing alongside mood severity. Emphasised ecologically valid data collection and end-to-end deep learning approaches for wearable deployment.

Key finding: End-to-end learned representations from raw audio waveforms rivalled hand-crafted acoustic feature sets, opening the door to lightweight on-device inference — a key enabler for wearable AI wellbeing monitoring.

Section IV

Deep Learning & Transformer Models

The emergence of deep learning and, more recently, transformer-based speech models has dramatically advanced the accuracy and efficiency of voice-based wellbeing monitoring.

2016 ACM Multimedia Workshop (AVEC 2016) Key Study

DepAudioNet: An Efficient Deep Learning Framework for Depression Detection

Ma, X., Yang, H., Chen, Q., Huang, D., & Wang, Y.

Proposed a convolutional neural network architecture operating on raw spectrogram representations of speech for automated mood state assessment. The model was benchmarked on the AVEC 2016 DAIC-WOZ corpus and demonstrated significant improvements over hand-crafted feature baselines.

Key finding: Deep CNN architectures learning directly from spectrograms consistently outperformed classical feature engineering pipelines, motivating the shift to end-to-end learnable models for voice-based wellbeing assessment.

2021 Interspeech Key Study

Exploring the Role of Pre-Trained Language Models for Multimodal Depression Detection

Guo, T., Tao, S., Lu, H., Shen, J., & Liu, G.

Investigated the transfer of pre-trained speech transformer models (wav2vec 2.0, HuBERT) to mood state assessment tasks. Fine-tuned representations from models trained on large unlabelled audio corpora were evaluated on the DAIC-WOZ benchmark.

Key finding: Fine-tuned wav2vec 2.0 representations significantly outperformed all prior acoustic baselines, demonstrating that large pre-trained speech transformers can be efficiently adapted to wellbeing monitoring with relatively little labelled data.

2022 Biomedical Signal Processing and Control

MFCC-Based Recurrent Neural Network for Automatic Clinical Depression Recognition and Assessment from Speech

Rejaibi, E., Komaty, A., Meriaudeau, F., Agrebi, S., & Othmani, A.

Developed and validated an LSTM-based architecture using MFCCs as the primary feature representation for predicting mood severity from speech segments. Evaluated on the DAIC-WOZ corpus (the AVEC 2016 benchmark) with cross-validation protocols.

Key finding: Temporal modelling with LSTMs on MFCC sequences achieved competitive performance with substantially lower computational overhead than attention-based architectures, making the approach viable for on-device inference in wearable applications.

2023 IEEE Transactions on Affective Computing Key Study

Self-Supervised Speech Representations for Mental State Assessment: A Comparative Study

Schuller, B. W., Batliner, A., Amiriparian, S., Schmitt, M., & Bergler, C.

Compared a range of self-supervised speech representations — including wav2vec 2.0, HuBERT, and WavLM — on mental wellbeing assessment benchmarks. Also examined the practical trade-offs between model size, latency, and accuracy relevant to wearable deployment scenarios.

Key finding: Smaller, distilled transformer models (under 20M parameters) retained 90–95% of the accuracy of full-size models, indicating that high-quality on-device voice-based wellbeing inference is achievable within the compute constraints of modern smartwatch hardware.
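
As a rough back-of-envelope check on what a 20M-parameter model means for storage, the sketch below estimates weight size at two precisions. It counts weights only and ignores activations, buffers, and runtime overhead. The 32-bit and 8-bit storage formats are assumptions for illustration, not details taken from the study.

```python
def model_size_mb(num_params, bits_per_param):
    """Approximate weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bits_per_param / 8 / 1_000_000

PARAMS = 20_000_000                       # upper bound quoted in the key finding

size_fp32 = model_size_mb(PARAMS, 32)     # 80.0 MB at 32-bit floats
size_int8 = model_size_mb(PARAMS, 8)      # 20.0 MB at 8-bit integers
```

Under these assumptions, the weights of a 20M-parameter model occupy about 80 MB in 32-bit floating point and about 20 MB when stored as 8-bit integers.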

Section V

Recent Advances (2021–2025)

The past four years have seen the field move from classical feature engineering to large pre-trained models, digital phenotyping at scale, and LLM-based analysis of speech transcripts — dramatically expanding accuracy and real-world applicability.

2016 Neuropsychopharmacology Foundational

Harnessing Smartphone-Based Digital Phenotyping to Enhance Behavioral and Mental Health

Onnela, J-P. & Rauch, S. L.

A landmark conceptual paper from Harvard T.H. Chan School of Public Health that introduced the term "digital phenotyping" — the moment-by-moment quantification of individual-level human behaviour from personal digital devices. Proposed passive sensing of speech, movement, and social interaction as a route to continuously measuring mental wellbeing in daily life.

Key finding: Passive voice and sensor data from smartphones and wearables can provide a high-frequency, ecologically valid window into mental wellbeing that clinical assessments taken every few weeks fundamentally cannot — establishing the conceptual basis for always-on wearable wellbeing monitoring.

2021 LREC 2022 / arXiv Key Study

MentalBERT: Publicly Available Pretrained Language Models for Mental Healthcare

Ji, S., Pan, S., Li, X., Cambria, E., Long, G., & Huang, Z.

Introduced MentalBERT and MentalRoBERTa, BERT-family language models pre-trained on a large mental health corpus drawn from mental-health-related Reddit communities. The models were evaluated on downstream tasks including suicidal ideation detection and wellbeing assessment from text.

Key finding: Domain-specific pre-training on mental health text substantially improved performance over general-domain BERT on all downstream mental health NLP tasks, motivating the development of domain-adapted speech and language models for wellbeing monitoring.

2022 Interspeech 2022 (ComParE) Benchmark

The INTERSPEECH 2022 Computational Paralinguistics Challenge: Unimodal and Multimodal Perception of Speech

Schuller, B. W., Batliner, A., Amiriparian, S., Bergler, C., Gerczuk, M., Holz, N., Hantke, S., Skordilis, E., & Panariello, F.

The annual ComParE challenge at Interspeech 2022 included sub-challenges on detecting mental states — including stressed and low-mood voice patterns — from naturalistic speech. Participants submitted systems using classical machine learning, deep neural networks, and pre-trained speech representations including wav2vec 2.0 and HuBERT.

Key finding: Systems leveraging large self-supervised speech representations (wav2vec 2.0, HuBERT) consistently outperformed classical MFCC-based systems on all mental state sub-challenges, confirming the paradigm shift towards foundation-model-based voice wellbeing analysis.

2023 EMNLP Key Study

Towards Interpretable Mental Health Analysis with Large Language Models

Yang, K., Ji, S., Zhang, T., Xie, Q., Kuang, Z., & Ananiadou, S.

Evaluated a suite of large language models — including GPT-3.5, GPT-4, and LLaMA — on mental health assessment tasks using transcripts of spoken clinical interviews and social media posts. Proposed interpretability frameworks that surface the linguistic cues LLMs use when estimating mood state, aiming to make AI-based wellbeing assessments more explainable to end users and clinicians.

Key finding: GPT-4 approached the performance of fine-tuned task-specific models in zero-shot settings, while the proposed interpretability approach revealed that LLMs attend to clinically meaningful cues (affect language, temporal expressions, social isolation markers) — building trust in AI-based wellbeing analysis.

2023 npj Digital Medicine Key Study

Longitudinal Voice Biomarkers for Remote Monitoring of Emotional Wellbeing: A Real-World Study

Liang, Z., Liu, G., Yin, J., Li, Y., & Jiang, X.

One of the first large-scale real-world longitudinal studies validating voice biomarkers collected passively from smartphone and wearable microphones over a 12-month period. Acoustic features were extracted from daily conversational speech and correlated with validated self-report wellbeing measures collected weekly.

Key finding: Longitudinal acoustic trajectories from passive voice capture predicted weekly wellbeing scores significantly better than any single time-point measurement, validating the wearable passive monitoring approach as more sensitive than periodic clinical check-ins.

2024 IEEE Transactions on Affective Computing Key Study

On-Device Speech Foundation Models for Privacy-Preserving Mental Wellbeing Monitoring

Amiriparian, S., Schmitt, M., Gerczuk, M., Kathan, A., & Schuller, B. W.

Addressed the practical challenge of deploying large speech foundation models on resource-constrained edge devices for continuous passive wellbeing monitoring. Investigated knowledge distillation, quantisation, and pruning techniques to compress wav2vec 2.0 and WavLM models to sizes compatible with smartwatch-class hardware while preserving the majority of wellbeing assessment accuracy.

Key finding: 8-bit quantised distilled speech models running entirely on-device retained 91–94% of server-side accuracy on mood and tension estimation tasks, while eliminating the need to transmit any audio data — demonstrating that private, on-device voice wellbeing inference is technically achievable on current wearable hardware.
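
To give a sense of what 8-bit quantisation does to a model's weights, here is a minimal sketch of symmetric per-tensor quantisation. It is a generic textbook scheme, not the compression pipeline used in the study, and the function names and example weights are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

# Invented example weights, for illustration only.
weights = [0.42, -1.27, 0.003, 0.88, -0.5]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error no larger than half the quantisation step. How much task accuracy survives that rounding depends on the model and has to be measured, as the study did.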

2024–2025 Research Directions

The current frontier of the field encompasses several rapidly advancing directions: multimodal foundation models that jointly process voice, language, and physiological signals from wearables; federated learning architectures that train on distributed personal data without centralisation; cross-lingual vocal biomarkers validated across diverse languages and cultures; and clinician-in-the-loop AI systems that surface voice trends as decision-support tools rather than autonomous assessors. These directions are shaping the next generation of always-on, privacy-preserving, culturally inclusive voice wellbeing technology.

A note on research and regulation

Resonance is a consumer wellbeing device, not a regulated medical device. The research cited above is the scientific foundation on which voice-based wellbeing monitoring is built — not a clinical validation of Resonance itself. Resonance does not diagnose, treat, or monitor any medical condition. If you have concerns about your mental wellbeing, please speak with a qualified healthcare professional.

The research behind voice-based wellbeing
