2016
Neuropsychopharmacology
Foundational
Harnessing Smartphone-Based Digital Phenotyping to Enhance Behavioral and Mental Health
Onnela, J-P. & Rauch, S. L.
A landmark conceptual paper, led from the Harvard T.H. Chan School of Public Health, that helped establish the term "digital phenotyping": the moment-by-moment quantification of the individual-level human phenotype in situ, using data from personal digital devices. It proposed passive sensing of speech, movement, and social interaction as a route to continuously measuring mental wellbeing in daily life.
Key finding: The paper argues that passively collected voice and sensor data from smartphones can provide a high-frequency, ecologically valid window into mental wellbeing that clinical assessments taken every few weeks cannot match, establishing the conceptual basis for always-on wellbeing monitoring on phones and wearables.
2021
LREC 2022 / arXiv
Key Study
MentalBERT: Publicly Available Pretrained Language Models for Mental Healthcare
Ji, S., Pan, S., Li, X., Cambria, E., Long, G., & Huang, Z.
Introduced MentalBERT and MentalRoBERTa, BERT-family language models pre-trained on a large mental health corpus drawn from Reddit communities. The models were evaluated on downstream text classification tasks including depression, stress, and suicidal ideation detection.
Key finding: Domain-specific pre-training on mental health text generally improved performance over general-domain BERT and RoBERTa on the downstream mental health detection tasks, motivating the development of domain-adapted speech and language models for wellbeing monitoring.
2022
Interspeech 2022 (ComParE)
Benchmark
The INTERSPEECH 2022 Computational Paralinguistics Challenge: Unimodal and Multimodal Perception of Speech
Schuller, B. W., Batliner, A., Amiriparian, S., Bergler, C., Gerczuk, M., Holz, N., Hantke, S., Skordilis, E., & Panariello, F.
The annual ComParE challenge at Interspeech 2022 included sub-challenges on detecting mental states — including stressed and low-mood voice patterns — from naturalistic speech. Participants submitted systems using classical machine learning, deep neural networks, and pre-trained speech representations including wav2vec 2.0 and HuBERT.
Key finding: Systems built on large self-supervised speech representations (wav2vec 2.0, HuBERT) consistently outperformed classical MFCC-based systems on the mental state sub-challenges, supporting the shift towards foundation-model-based voice wellbeing analysis.
2023
EMNLP 2023
Key Study
Towards Interpretable Mental Health Analysis with Large Language Models
Yang, K., Ji, S., Zhang, T., Xie, Q., Kuang, Z., & Ananiadou, S.
Evaluated a suite of large language models — including GPT-3.5, GPT-4, and LLaMA — on mental health assessment tasks using transcripts of spoken clinical interviews and social media posts. Proposed interpretability frameworks that surface the linguistic cues LLMs use when estimating mood state, aiming to make AI-based wellbeing assessments more explainable to end users and clinicians.
Key finding: GPT-4 approached the performance of fine-tuned task-specific models in zero-shot settings. The proposed interpretability approach indicated that LLMs attend to clinically meaningful cues (affect language, temporal expressions, social isolation markers), which may help build trust in AI-based wellbeing analysis.
2023
npj Digital Medicine
Key Study
Longitudinal Voice Biomarkers for Remote Monitoring of Emotional Wellbeing: A Real-World Study
Liang, Z., Liu, G., Yin, J., Li, Y., & Jiang, X.
One of the first large-scale real-world longitudinal studies validating voice biomarkers collected passively from smartphone and wearable microphones over a 12-month period. Acoustic features were extracted from daily conversational speech and correlated with validated self-report wellbeing measures collected weekly.
Key finding: Longitudinal acoustic trajectories from passive voice capture predicted weekly wellbeing scores significantly better than any single time-point measurement, validating the wearable passive monitoring approach as more sensitive than periodic clinical check-ins.
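The contrast between a trajectory and a single time-point can be sketched in a few lines. This is an illustration only, not the study's method: the feature (daily mean pitch), the one-week window, and the function name `trajectory_features` are all assumptions made for the example.

```python
from statistics import mean

def trajectory_features(daily_values):
    """Summarise a window of daily acoustic values as level, trend, and last value.

    Hypothetical helper: the slope is an ordinary least-squares fit of
    value against day index (0, 1, 2, ...).
    """
    days = range(len(daily_values))
    x_bar = mean(days)
    y_bar = mean(daily_values)
    sxx = sum((x - x_bar) ** 2 for x in days)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, daily_values))
    slope = sxy / sxx if sxx else 0.0
    return {"mean": y_bar, "slope": slope, "last": daily_values[-1]}

# Assumed example data: one week of daily mean pitch in Hz, falling steadily.
week = [180.0, 178.0, 176.0, 174.0, 172.0, 170.0, 168.0]
features = trajectory_features(week)
```

A single-time-point model would see only `features["last"]`, whereas a longitudinal model can also use the level and trend over the window, which is the kind of extra information the key finding refers to.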
2024
IEEE Transactions on Affective Computing
Key Study
On-Device Speech Foundation Models for Privacy-Preserving Mental Wellbeing Monitoring
Amiriparian, S., Schmitt, M., Gerczuk, M., Kathan, A., & Schuller, B. W.
Addressed the practical challenge of deploying large speech foundation models on resource-constrained edge devices for continuous passive wellbeing monitoring. Investigated knowledge distillation, quantisation, and pruning techniques to compress wav2vec 2.0 and WavLM models to sizes compatible with smartwatch-class hardware while preserving the majority of wellbeing assessment accuracy.
Key finding: Distilled, 8-bit quantised speech models running entirely on-device retained 91–94% of server-side accuracy on mood and tension estimation tasks while eliminating the need to transmit any audio data, demonstrating that private, on-device voice wellbeing inference is technically achievable on current wearable hardware.
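As a rough illustration of what 8-bit quantisation does to model weights, the sketch below applies symmetric per-tensor quantisation to a short list of floats. It is a minimal example under stated assumptions, not the scheme used in the paper: the function names and the weight values are hypothetical, and real deployments typically rely on a framework's quantisation tooling rather than hand-written code.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [v * scale for v in q]

# Assumed example weights; a real model layer would hold thousands or millions.
weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantisation step (`scale / 2`). That trade-off between model size and accuracy is what the reported 91–94% retention figure quantifies.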