The ELOQUENCE project partner, Brno University of Technology (BUT), has recently contributed valuable insights through the scientific publication “Speech production under stress for machine learning: multimodal dataset of 79 cases and 8 signals”. The research focuses on the early identification of cognitive or physical overload, a critical concern in fields where human decision-making can have significant consequences for safety and property. We are sharing the abstract below and congratulate the whole team on this success!
In recent decades, Machine Learning (ML) has rapidly grown as an industry sector, driving significant advances in the recognition and classification of human speech. Although early research on speech processing dates back to the 1960s, technological limitations hindered widespread adoption until cheap and accessible Graphics Processing Units (GPUs) became available. The wide availability of GPUs opened new research avenues in Neural Networks (NNs). This progress led to a general improvement in ML performance, notably in Natural Language Processing (NLP), image processing, and speech processing. The most prominent field in speech processing is Automatic Speech Recognition (ASR), which transcribes speech from audio recordings. In addition to ASR, supplemental tasks such as Gender Identification (GID), Language Identification (LID), and Speaker Identification (SID) can be performed. The growing importance of metadata related to speech transcriptions has created a market for recognizing emotions, health, age, and other information from speech. Stress detection, in particular, is developing rapidly due to its relevance in key areas of human activity.

The concept of stress has been known since ancient Rome, but its systematic study in a physiological sense did not begin until the 19th century, with Claude Bernard’s theory of the “milieu intérieur” and Walter Cannon’s extension of this concept into a theory of homeostasis. Cannon also linked psychological and psychosomatic symptoms and proposed that prolonged exposure to fear could result in death. The fight-or-flight response, which he developed with Philip Bard, is a widely accepted theory that a mix of physiological processes prepares the body to fight or flee in response to an acute stressor.

Building on this work, further research has examined speech-based features for stress estimation and the development of multimodal datasets. John Hansen’s early work explored the features of stressed speech, but limited data hindered progress. Hansen later collected data in cooperation with the North Atlantic Treaty Organization (NATO) to establish initial stress-related databases. He identified four main features for stress estimation: intensity, pitch, duration of words, and the vocal tract spectrum. The Lombard effect needs to be taken into account with respect to these features, as it may negatively impact their descriptive power. Tet Fei Yap’s doctoral thesis explored the effects of cognitive load on speech and found that formant frequencies, although lower-dimensional than Mel-Frequency Cepstral Coefficients (MFCC), performed comparably in cognitive load classification systems. These advances help to study the very nature of how stress manifests in speech, but there is still a considerable lack of datasets to support such research efforts.
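To illustrate the kind of features the abstract refers to (intensity, pitch, and spectral descriptors such as MFCCs), the sketch below shows how they could be extracted from a recording with the open-source librosa library. This is not code from the publication or its dataset; the file path, sampling rate, and summary statistics are assumptions chosen purely for illustration.

```python
# Illustrative sketch (not from the publication): extracting stress-related
# speech features named in the abstract -- intensity, pitch (F0), and MFCCs.
# The file path and parameter values below are assumptions for demonstration.
import librosa
import numpy as np

# Load a hypothetical recording; 16 kHz is a common rate for speech analysis.
audio_path = "speech_sample.wav"  # placeholder path, not from the dataset
y, sr = librosa.load(audio_path, sr=16000)

# Intensity: frame-level root-mean-square energy of the signal.
rms = librosa.feature.rms(y=y)[0]

# Pitch: fundamental frequency estimated with the pYIN algorithm;
# unvoiced frames come back as NaN and are ignored in the summary below.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Vocal-tract spectrum: 13 Mel-Frequency Cepstral Coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Summarise each feature over the utterance (mean / std), a simple way to
# obtain a fixed-length vector for a downstream stress or load classifier.
features = np.concatenate([
    [np.mean(rms), np.std(rms)],
    [np.nanmean(f0), np.nanstd(f0)],
    mfcc.mean(axis=1),
    mfcc.std(axis=1),
])
print(f"Feature vector of length {features.shape[0]}")
```

In practice, such per-utterance feature vectors would be paired with labels (e.g. stressed vs. neutral speech) and fed to a classifier; the publication's multimodal dataset is intended to support exactly this kind of work.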
Read the full article here.