Toshiyo Tamura, Waseda University, Japan
This is an open-access article distributed under the terms of the
The integration of artificial intelligence (AI) in voice biomarker analysis presents a transformative opportunity for objective and non-invasive diagnostics in healthcare. However, clinical adoption remains limited due to challenges such as data scarcity, model generalizability, and regulatory hurdles. This perspective article explores effective and scalable methods for clinical validation of voice biomarkers, emphasizing the importance of proprietary technology, high-quality, diverse datasets, strong clinical partnerships, and regulatory compliance. We propose a multifaceted approach leveraging proprietary AI technology (Musicology AI) to enhance voice analysis, large-scale data collection initiatives to improve model robustness, and medical device certification to ensure clinical applicability. Addressing technical, ethical, and regulatory challenges is crucial for establishing trust in AI-driven diagnostics. By combining technological innovation with rigorous clinical validation, this work aims to bridge the gap between research and real-world implementation, paving the way for AI-powered voice biomarkers to become a reliable tool in digital healthcare.
Clinical validation is a critical process in healthcare that ensures medical devices, diagnostic tests, or treatments (in short, interventions) are both effective and safe when applied in real-world clinical settings. This process involves rigorous evaluation to confirm that the intervention performs as intended and delivers the expected clinical outcomes while maintaining patient safety and treatment efficacy [U.S. Food and Drug Administration (FDA), 1993–2017] (
Voice, with its unique properties, when analysed with artificial intelligence (AI) has the potential to become a powerful biomarker for various health conditions. As a non-invasive and nonintrusive approach, it enables remote diagnosis and monitoring through any phone or digital device, meeting the criteria for safety, usability, and accessibility. Additionally, it is potentially independent of variables such as language, geography, and age, supporting its applicability across diverse populations. The AI-driven analysis of voice data ensures secure, encrypted patient identification and integrates seamlessly with established clinical protocols set out by the regulatory agencies (
The use of voice as a diagnostic tool has a long history in medicine, dating back to Hippocrates, who noted that voice changes could indicate underlying health conditions. More recently, physicians observed that disorders such as Parkinson's disease, depression, and respiratory illnesses could alter speech patterns, including pitch, rhythm, and articulation. Before the advent of AI, clinical speech analysis relied on subjective auditory assessments by trained professionals and objective acoustic measurements using tools like spectrograms and frequency analyzers.
In the mid-20th century, advancements in phonetics and speech science led to more structured approaches in analyzing voice abnormalities. Researchers identified measurable vocal markers associated with various conditions, such as increased jitter and shimmer in Parkinson's disease and slowed speech rate in depression. However, traditional methods were often labor-intensive, requiring manual analysis and expert interpretation (
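The jitter and shimmer markers mentioned above can be made concrete with a minimal sketch. It assumes the commonly used "local" definitions: the mean absolute difference between consecutive glottal cycle periods (jitter) or peak amplitudes (shimmer), divided by the mean value. The function names and the sample measurements are illustrative assumptions, not values from any cited study.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two glottal cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def local_jitter(periods_s):
    """Cycle-to-cycle variation of the glottal period (dimensionless ratio)."""
    return local_perturbation(periods_s)


def local_shimmer(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (dimensionless ratio)."""
    return local_perturbation(peak_amplitudes)


# Hypothetical measurements for five consecutive glottal cycles.
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]  # seconds
amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]         # arbitrary linear units

print(f"local jitter:  {local_jitter(periods):.2%}")
print(f"local shimmer: {local_shimmer(amplitudes):.2%}")
```

In practice the cycle periods and amplitudes would come from a pitch-marking step on sustained vowel recordings, which is outside the scope of this sketch.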
The introduction of AI and machine learning in the 21st century revolutionized the field by automating voice analysis and uncovering complex patterns beyond human perception. Early AI applications in voice diagnostics focused on neurodegenerative diseases and psychiatric disorders. The emergence of deep learning further enhanced the accuracy and scalability of voice biomarkers, allowing for real-time analysis across diverse populations.
Most recently, AI-driven voice biomarkers have expanded beyond neurological and psychiatric conditions to include cardiovascular diseases, respiratory illnesses, and even COVID-19 detection (
Two recent systematic reviews by Meehan et al. (
Several factors may contribute to the situation outlined above. Martin and Rouas (
Additional challenges are highlighted by Berisha and Liss (
Also, Fagherazzi et al. (
Finally, according to a recent review by Alhuwaydi (
In summary, while voice biomarkers and AI models hold great promise for mental health diagnostics and treatment, their clinical adoption remains limited due to several interrelated challenges. Trust among clinicians, symptom heterogeneity, data scarcity, and the need for theoretically grounded models all contribute to the slow transition from research to practice. Additionally, ethical concerns, regulatory uncertainties, and the necessity for standardized, high-quality datasets further complicate implementation.
Our research and practice corroborate many of the limitations identified by the authors of the studies mentioned above. To overcome these challenges, we follow a multifaceted approach that integrates cutting-edge technology, large-scale data collection, strong clinical collaborations, and regulatory compliance. In the discussion, we explore these four key solutions: (1) leveraging proprietary technology and complex feature engineering based on knowledge from musicology (i.e., Musicology AI) to enhance voice biomarker analysis; (2) addressing the urgent need for larger, more diverse, and higher-quality medical voice datasets; (3) strengthening clinical partnerships to support large-scale studies; and (4) achieving medical device certification to ensure regulatory approval and real-world applicability.
In our view, a key factor in improving voice-based diagnostics is the reliance on proprietary technology and informed feature sets, which enable a more sophisticated approach to data analysis. Existing studies on voice biomarkers rely on a limited set of acoustic features (
Examples of acoustic features investigated across various health conditions.
| Condition | Heart failure/pulmonary hypertension | Multiple sclerosis | Lung cancer |
|---|---|---|---|
| Acoustic features | Jitter | Jitter | Jitter |
| | Shimmer | Shimmer | Shimmer |
| | Loudness | F0 mean/variability | Loudness |
| | MFCC | Tremor | F0 |
| | Pitch and formant measures | Intensity variability | MFCC |
| | Max phonation time | Formant 1–3 frequency/bandwidth | |
| | Speech/Articulation rate | Harmonic difference | |
| | Pause rate | | |
(e.g., |
(e.g., |
(e.g., |
To address these limitations, we leverage a proprietary technology that draws on the foundations of quantitative musicology (Musicology AI). Musicology is the scholarly study of music, encompassing historical, theoretical, and analytical approaches to sound. Applied to voice biomarkers, it involves analyzing speech using music theory concepts such as prosody (rhythm, intonation, stress), acoustic properties (timbre, harmonics, spectral features), and temporal organization (timing patterns, pauses, speech rate). Music theory provides a well-established yet underexplored framework for understanding sound beyond physics, making it a valuable tool for advanced voice analysis. This approach enhances AI models by improving interpretability and biological grounding. For example, speech melodic contours, such as reduced pitch variability, reflect psychomotor retardation in depression, linking to diminished dopaminergic activity (
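As an illustration of how a melodic property such as pitch variability can be quantified, the following sketch converts a fundamental frequency (F0) track to semitones relative to the speaker's median F0 and reports the standard deviation. This is a generic formulation under stated assumptions (an F0 track is already available, unvoiced frames are marked as 0 or None); it is not the proprietary Musicology AI feature set, and the function names are hypothetical.

```python
import math
from statistics import median, pstdev


def hz_to_semitones(f0_hz, ref_hz):
    """Musical interval between f0_hz and ref_hz in semitones (12 per octave)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def pitch_variability_st(f0_track_hz):
    """Standard deviation of F0 in semitones around the speaker's median F0.

    Frames with F0 of 0 or None are treated as unvoiced and skipped.
    """
    voiced = [f for f in f0_track_hz if f and f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    ref = median(voiced)
    return pstdev([hz_to_semitones(f, ref) for f in voiced])


# Hypothetical F0 tracks (Hz): a lively contour and a flattened one.
lively = [110, 140, 0, 180, 150, 120, None, 160]
flat = [118, 120, 0, 121, 119, 120, None, 120]

print(f"lively: {pitch_variability_st(lively):.2f} st")
print(f"flat:   {pitch_variability_st(flat):.2f} st")
```

Working in semitones rather than hertz makes the measure comparable across speakers with different baseline pitch, which is one reason music-theoretic units are convenient here.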
By incorporating musicology into clinical settings, we enhance the accuracy and robustness of voice-based health assessments. However, we recognize that small sample sizes and dataset biases can impact the accuracy, stability, and generalizability of results. To mitigate these challenges, we actively work on the creation of minimally biased large-scale datasets for vocal biomarker research. Additionally, our approach aligns with ensemble learning and weighted feature analysis techniques, incorporating methods such as bootstrap aggregation, boosting, and stacking. These techniques allow us to combine multiple data items, identify the most meaningful feature combinations as hypotheses, and control their influence based on overlapping relationships—enhancing the robustness of our models even in the presence of data biases. By integrating these methods, we further improve the performance, reliability, and fairness of voice-based health assessments, offering a promising path forward in digital diagnostics.
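To make one of the ensemble techniques named above tangible, the sketch below implements bootstrap aggregation (bagging) with simple one-feature threshold rules as base learners and a majority vote. It covers bagging only, not boosting or stacking, and it is a toy illustration with made-up data rather than the pipeline used in our work; all names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Pick the single-feature threshold rule with the best training accuracy."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [1 if sign * (row[j] - t) > 0 else 0 for row in X]
                acc = sum(p == label for p, label in zip(preds, y))
                if best is None or acc > best[0]:
                    best = (acc, j, t, sign)
    _, j, t, sign = best
    return lambda row: 1 if sign * (row[j] - t) > 0 else 0


def fit_bagged(X, y, n_models=25, seed=0):
    """Train each stump on a bootstrap resample; predict by majority vote."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))

    def predict(row):
        votes = Counter(m(row) for m in models)
        return votes.most_common(1)[0][0]

    return predict


# Made-up data: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5], [0.2, 3], [0.3, 4], [0.4, 1], [0.6, 2], [0.7, 5], [0.8, 3], [0.9, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

predict = fit_bagged(X, y)
print(predict([0.05, 3]), predict([0.95, 3]))
```

Because each base learner sees a different resample, an unrepresentative subset of recordings influences only some of the votes, which is the intuition behind using bagging to dampen the effect of dataset biases.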
Our most recent technological advancements integrate this knowledge with large language models (LLMs) and fundamental transformer architectures. Together with pharmaceutical and quantum hardware partners, we are also incorporating quantum AI into our core approach. Recent findings from our research (Krautz et al., in prep) further demonstrate the potential of these methods in detecting depressive disorders: musicological features significantly improve the sensitivity and specificity of voice-based depression assessments, reaching AUC = 0.80, a level that was also reached when female and male groups were analysed separately.
Furthermore, one of the fundamental challenges that we identify in the field of mental health diagnostics is the scarcity of high-quality medical voice data. Many existing studies are based on small sample sizes, and some are conducted on specific groups of depressed patients, e.g., those with post-traumatic stress disorder (PTSD), which limits model generalizability. For example, nearly 94% of published studies on depression
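For readers less familiar with the AUC metric and with subgroup reporting, the sketch below computes the area under the ROC curve as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, once overall and once per subgroup. The data are invented for illustration and do not reproduce the results cited above; the function names are hypothetical.

```python
def auc(scores, labels):
    """ROC AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(scores, labels, groups):
    """AUC computed separately within each subgroup (e.g., female and male)."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = auc([scores[i] for i in idx], [labels[i] for i in idx])
    return result


# Invented model scores, diagnoses (1 = depressed), and sex labels.
scores = [0.9, 0.4, 0.7, 0.2, 0.8, 0.6, 0.3, 0.5]
labels = [1, 0, 1, 0, 1, 0, 0, 1]
groups = ["f", "f", "f", "f", "m", "m", "m", "m"]

print("overall:", auc(scores, labels))
print("by group:", auc_by_group(scores, labels, groups))
```

Reporting the metric per subgroup guards against a model that performs well overall only because it exploits differences between groups rather than signs of the condition itself.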
To overcome these challenges, we propose a large-scale, data-driven approach that prioritizes both quantity and quality of data. This approach focuses on conducting multicenter studies across diverse linguistic and cultural populations, ensuring that voice biomarkers are validated against real-world clinical diversity. Specifically, in collaboration with one of Europe's largest clinic chains, we launched a multi-site trial spanning two European countries to collect data from a five-digit number of patients, encompassing multiple audio recordings per individual alongside comprehensive clinical background information. We currently collect data at several dozen clinics and aim to ramp up to several hundred of our partner's clinics. We expect this to create the largest dataset of its kind, significantly improving the robustness of AI models trained for medical voice analysis.
The third pillar of our approach is strong collaboration with world-class clinical partners. We have established contracts covering more than 400 clinics in Europe, enabling us to conduct large-scale, multicenter studies with a diverse patient population. This extensive clinical network ensures access to high-quality data, facilitates standardized study protocols, and enhances the reproducibility and reliability of results. Furthermore, as comparative validation against gold-standard clinical tests is essential for establishing the clinical relevance of voice biomarkers, our studies integrate voice analysis with widely accepted diagnostic tools. In this way, we ensure that AI-driven insights are anchored in clinically recognized measures.
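One way to check that a model generalizes across clinics in a multicenter dataset is leave-one-site-out evaluation, in which each clinic in turn is held out for testing while the model is trained on all others. The sketch below only produces the index splits; it is an assumption about how such a check could be set up, not a description of the trial protocol, and the site labels are invented.

```python
from collections import defaultdict


def leave_one_site_out(site_ids):
    """Yield (held_out_site, train_indices, test_indices) for each site."""
    by_site = defaultdict(list)
    for i, site in enumerate(site_ids):
        by_site[site].append(i)
    for held_out in sorted(by_site):
        test = by_site[held_out]
        train = sorted(
            i for site, idx in by_site.items() if site != held_out for i in idx
        )
        yield held_out, train, test


# Invented site label for each recording in a small dataset.
sites = ["clinic_A", "clinic_A", "clinic_B", "clinic_C", "clinic_B"]

for held_out, train, test in leave_one_site_out(sites):
    print(held_out, "train:", train, "test:", test)
```

If several recordings come from the same patient, the same grouping idea should be applied at patient level so that no individual appears in both the training and the test set.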
The final cornerstone of our approach is medical device certification. We are in the process of obtaining CE marking for our voice-based ADHD algorithms, which positions us as a front-runner in medical certification. Achieving this certification, while time- and resource-intensive, will be a significant milestone in the field of AI-driven medical diagnostics. Regulatory approval necessitates adherence to stringent data integrity, safety, and clinical efficacy standards, challenges that many AI-driven healthcare solutions have struggled to overcome. We thereby aim to pave the way for broader adoption of voice-based diagnostic tools in clinical practice. Before regulatory approval can be achieved, however, it is crucial to establish accuracy and standardization of measurements as a foundation for compliance. Accuracy levels should be aligned with established diagnostic gold standards, such as psychological assessments, and performance metrics must be tailored to specific use cases.
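What tailoring performance metrics to a use case can mean in practice is sketched below: sensitivity and specificity are computed against gold-standard labels, and the decision threshold is chosen as the highest one that still meets a required sensitivity, as might be demanded of a screening tool. The target value, data, and function names are illustrative assumptions, not regulatory requirements.

```python
def sens_spec(scores, gold, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pred = [s >= threshold for s in scores]
    tp = sum(p and g == 1 for p, g in zip(pred, gold))
    fn = sum((not p) and g == 1 for p, g in zip(pred, gold))
    tn = sum((not p) and g == 0 for p, g in zip(pred, gold))
    fp = sum(p and g == 0 for p, g in zip(pred, gold))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("gold standard must contain both classes")
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_sensitivity(scores, gold, min_sensitivity):
    """Highest observed threshold whose sensitivity meets the target."""
    for t in sorted(set(scores), reverse=True):
        sens, spec = sens_spec(scores, gold, t)
        if sens >= min_sensitivity:
            return t, sens, spec
    raise ValueError("no threshold reaches the requested sensitivity")


# Invented model scores and gold-standard diagnoses (1 = condition present).
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
gold = [1, 1, 0, 1, 0, 0]

print(threshold_for_sensitivity(scores, gold, min_sensitivity=0.9))
```

A confirmatory use case would reverse the priority and fix a minimum specificity instead, which is why a single headline accuracy figure is rarely sufficient for certification purposes.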
In sum, by combining technological innovation, data scalability, clinical collaboration, and regulatory adherence, we establish a framework for the seamless translation of AI-based voice diagnostics into clinical practice.
Looking beyond our current research and practice, we recognize that several key research areas must be explored to maximize the clinical impact of voice biomarkers. Our primary focus is initially on mental health, with plans to expand into a broader range of diseases and languages over time. While we have already made significant progress in improving the robustness and generalizability of our models, the large-scale data we are currently collecting is expected to strengthen these capabilities considerably and to improve clinical reliability. Additionally, integrating voice analysis with other biomarkers, such as physiological data from wearables or facial expressions, presents an opportunity to enhance diagnostic accuracy. This multimodal approach could further refine assessments, particularly in complex conditions like depression and anxiety.
The aforementioned approaches and feature sets aim to establish evidence for the presence of specific disease conditions based purely on content-agnostic properties of speech. Meanwhile, content-based analyses are reaching new heights of performance, driven by advances in (multilingual) speech-to-text (STT) conversion and natural language processing (NLP), mainly through the advent of generative pre-training on large corpora. Improved STT alone already boosts traditional NLP approaches that analyze utterance, word, and sentence occurrences and relationships. In combination with LLMs, a novel high-level analysis becomes feasible. For example, an LLM can be prompted or fine-tuned to assess the content of speech with respect to the symptoms of a certain psychological condition and to reenact a psychological assessment, as either assessor or assessee, based on the content presented.
Since content-based and content-agnostic information are, by definition, orthogonal to some degree, combining the two has clear potential to further improve performance and generalization. An interesting scenario arises when the content of speech points in one direction, say, negative for a psychological condition, while the content-agnostic properties of the voice hint at a positive result. To some extent, the discrepancy between the “what” and the “how” of speech can become a discriminative feature in itself.
An even deeper integration of advanced acoustic perception with pre-trained models that have large context awareness could be achieved by feeding aggregated acoustic information directly into the training and inference process. Towards this goal, we are currently exploring the use of musicologically inspired acoustic tokenization to enrich content-based tokens.
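A simple way to turn the mismatch between the “what” and the “how” into model input is to derive explicit discrepancy features from the two estimates. The sketch below assumes that a content-based model and a content-agnostic acoustic model each output a probability for the same condition; the feature names are hypothetical, and whether such features help would have to be tested empirically.

```python
def discrepancy_features(p_content, p_acoustic):
    """Combine two probability estimates and expose their disagreement."""
    for p in (p_content, p_acoustic):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    gap = p_acoustic - p_content
    return {
        "p_content": p_content,
        "p_acoustic": p_acoustic,
        "mean": (p_content + p_acoustic) / 2,
        "signed_gap": gap,    # > 0: the voice suggests more than the words
        "abs_gap": abs(gap),  # size of the disagreement, regardless of direction
    }


# Invented case: the words sound unremarkable, the voice does not.
print(discrepancy_features(p_content=0.10, p_acoustic=0.80))
```

A downstream classifier can then learn whether large gaps are themselves predictive, instead of averaging the disagreement away.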
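The idea of enriching content tokens with acoustic tokens can be illustrated with a deliberately simple scheme: the pitch movement across each word is quantized into a coarse contour token and interleaved with the word itself. This is a toy sketch under strong assumptions (word-level F0 start and end values are available, three contour classes, a 1-semitone band for "level"); it does not describe the tokenization scheme under development, and all names are hypothetical.

```python
import math


def contour_token(f0_start_hz, f0_end_hz, level_band_st=1.0):
    """Map the pitch movement across one word to a coarse contour token."""
    change_st = 12.0 * math.log2(f0_end_hz / f0_start_hz)
    if change_st > level_band_st:
        return "<RISE>"
    if change_st < -level_band_st:
        return "<FALL>"
    return "<LEVEL>"


def enrich(words, f0_pairs):
    """Interleave each word with the contour token of its pitch movement."""
    if len(words) != len(f0_pairs):
        raise ValueError("need one (start, end) F0 pair per word")
    tokens = []
    for word, (start, end) in zip(words, f0_pairs):
        tokens.extend([word, contour_token(start, end)])
    return tokens


# Invented transcript with word-level F0 start and end values in Hz.
words = ["I", "feel", "fine"]
f0_pairs = [(120, 121), (118, 119), (130, 100)]

print(enrich(words, f0_pairs))
```

A sequence model consuming such a stream sees both what was said and a compact description of how it was said, which is the kind of joint representation the paragraph above refers to.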
From an implementation perspective, AI literacy training for healthcare professionals is essential. Medical education programs should incorporate AI-based diagnostic tools, ensuring that clinicians understand their capabilities, limitations, and best practices for integrating them into patient care.
Finally, regulatory frameworks for AI-driven voice biomarkers must evolve to keep pace with technological advancements. Future research should focus on establishing standardized validation protocols, ensuring that AI models meet rigorous clinical efficacy and safety requirements. The ultimate goal is to transition voice biomarkers from promising research tools to fully integrated components of routine medical practice, bridging the gap between innovation and real-world application.
While AI-driven voice analysis holds promise as a non-invasive biomarker for various health conditions due to its accessibility and ability to facilitate remote monitoring, its adoption in clinical practice, particularly for mental health, remains limited. To fully realize the potential of AI-driven voice analysis in healthcare, a multifaceted approach is required. The integration of proprietary technologies like Musicology AI enhances the depth of voice biomarker analysis, moving beyond conventional methodologies. Addressing the scarcity of high-quality medical voice data through large-scale, multicenter studies helps ensure that AI models are trained on diverse, representative populations. Strong clinical partnerships provide access to high-quality patient data and facilitate rigorous validation against established clinical scales. Lastly, obtaining medical device certification is vital for meeting regulatory requirements in AI-driven diagnostics. Combining these elements can position voice-based AI as a transformative tool in medical diagnostics.
AK: Conceptualization, Writing – original draft, Writing – review & editing. JL: Conceptualization, Writing – review & editing. FH: Conceptualization, Writing – review & editing. JV: Conceptualization, Writing – review & editing. AH: Conceptualization, Writing – review & editing. CH: Conceptualization, Writing – original draft, Writing – review & editing.
The author(s) declare that no financial support was received for the research and/or publication of this article.
AK, JL, FH, JV, AH, and CH are employed by PeakProfiling GmbH.
The author(s) declare that no Generative AI was used in the creation of this manuscript.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
1. In this manuscript, we focus specifically on mental health and depression to make our claims more concrete.
2. Based on our own review of 67 studies that employed machine learning (ML) and voice analysis.