ISBN-13: 9781461451426 / English / Paperback / 2012 / 124 pp.
"Emotion Recognition Using Speech Features" provides coverage of emotion-specific features present in speech. The author also discusses suitable models for capturing emotion-specific information for distinguishing different emotions. The content of this book is important for designing and developing natural and sophisticated speech systems. In this Brief, Drs. Rao and Koolagudi lead a discussion of how emotion-specific information is embedded in speech and how to acquire emotion-specific knowledge using appropriate statistical models. Additionally, the authors provide information about exploiting multiple evidences derived from various features and models. The acquired emotion-specific knowledge is useful for synthesizing emotions. Features includes discussion of: - Global and local prosodic features at syllable, word and phrase levels, helpful for capturing emotion-discriminative information; - Exploiting complementary evidences obtained from excitation sources, vocal tract systems and prosodic features in order to enhance the emotion recognition performance; - Proposed multi-stage and hybrid models for improving the emotion recognition performance. This brief is for researchers working in areas related to speech-based products such as mobile phone manufacturing companies, automobile companies, and entertainment products as well as researchers involved in basic and applied speech processing research.
Contents

1 Introduction ... 1
  1.1 Emotion: Psychological perspective ... 2
  1.2 Emotion: Speech signal perspective ... 3
    1.2.1 Speech production mechanism ... 4
    1.2.2 Source features ... 5
    1.2.3 System features ... 5
    1.2.4 Prosodic features ... 7
  1.3 Emotional speech databases ... 8
  1.4 Applications of speech emotion recognition ... 9
  1.5 Issues in speech emotion recognition ... 10
  1.6 Objectives and scope of the work ... 11
  1.7 Main highlights of research investigations ... 12
  1.8 Brief overview of contributions to this book ... 12
    1.8.1 Emotion recognition using excitation source information ... 12
    1.8.2 Emotion recognition using vocal tract information ... 12
    1.8.3 Emotion recognition using prosodic information ... 13
  1.9 Organization of the book ... 13
2 Speech Emotion Recognition: A Review ... 17
  2.1 Introduction ... 17
  2.2 Emotional speech corpora: A review ... 18
  2.3 Excitation source features: A review ... 22
  2.4 Vocal tract system features: A review ... 24
  2.5 Prosodic features: A review ... 25
  2.6 Classification models ... 28
  2.7 Motivation for the present work ... 31
  2.8 Summary of the literature and scope for the present work ... 31
3 Emotion Recognition using Excitation Source Information ... 33
  3.1 Introduction ... 33
  3.2 Motivation ... 34
  3.3 Emotional speech corpora ... 37
    3.3.1 Indian Institute of Technology Kharagpur Simulated Emotional Speech Corpus: IITKGP-SESC ... 38
    3.3.2 Berlin Emotional Speech Database: Emo-DB ... 40
  3.4 Excitation source features for emotion recognition ... 40
    3.4.1 Higher-order relations among LP residual samples ... 41
    3.4.2 Phase of LP residual signal ... 43
    3.4.3 Parameters of the instants of glottal closure (epoch parameters) ... 44
    3.4.4 Dynamics of epoch parameters at syllable level ... 48
    3.4.5 Dynamics of epoch parameters at utterance level ... 49
    3.4.6 Glottal pulse parameters ... 50
  3.5 Classification models ... 50
    3.5.1 Auto-associative neural networks ... 50
    3.5.2 Support vector machines ... 53
  3.6 Results and discussion ... 54
  3.7 Summary ... 64
4 Emotion Recognition using Vocal Tract Information ... 67
  4.1 Introduction ... 67
  4.2 Feature extraction ... 69
    4.2.1 Linear prediction cepstral coefficients (LPCCs) ... 69
    4.2.2 Mel frequency cepstral coefficients (MFCCs) ... 70
    4.2.3 Formant features ... 71
  4.3 Classifiers ... 73
    4.3.1 Gaussian mixture models (GMM) ... 73
  4.4 Results and discussion ... 74
  4.5 Summary ... 78
5 Emotion Recognition using Prosodic Information ... 81
  5.1 Introduction ... 81
  5.2 Prosodic features: importance in emotion recognition ... 82
  5.3 Motivation ... 85
  5.4 Extraction of global and local prosodic features ... 86
  5.5 Results and discussion ... 88
  5.6 Summary ... 93
6 Summary and Conclusions ... 95
  6.1 Summary of the present work ... 95
  6.2 Contributions of the present work ... 97
  6.3 Conclusions from the present work ... 97
  6.4 Scope for future work ... 97
A Linear Prediction Analysis of Speech ... 101
  A.1 The Prediction Error Signal ... 103
  A.2 Estimation of Linear Prediction Coefficients ... 103
B MFCC Features ... 107
C Gaussian Mixture Model (GMM) ... 111
  C.1 Training the GMMs ... 112
    C.1.1 Expectation Maximization (EM) Algorithm ... 112
    C.1.2 Maximum a posteriori (MAP) Adaptation ... 113
  C.2 Testing ... 115
References ... 116

K. Sreenivasa Rao is at the Indian Institute of Technology, Kharagpur, India.
Shashidhar G. Koolagudi is at Graphic Era University, Dehradun, India.
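Chapter 4 and Appendix C of the contents above cover GMM-based classification of spectral features. As a rough illustration of that general scheme (one GMM per emotion trained with EM, classification by maximum log-likelihood), here is a short sketch using scikit-learn. It is not the authors' implementation; the feature matrices are random placeholders standing in for precomputed frame-wise features such as MFCCs, and the component count is an arbitrary choice.

```python
# Sketch of a one-GMM-per-emotion classifier: each class model is
# trained with EM, and a test utterance is assigned to the emotion
# whose GMM gives the highest average per-frame log-likelihood.
# Illustrative only; features here are random placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(train_data, n_components=8):
    """train_data: dict emotion -> (n_frames, n_dims) feature array."""
    gmms = {}
    for emotion, feats in train_data.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(feats)                      # EM training
        gmms[emotion] = gmm
    return gmms

def classify(gmms, utterance_feats):
    """Pick the emotion whose GMM scores the utterance highest.

    score() returns the mean per-sample log-likelihood, so longer
    utterances do not automatically dominate.
    """
    return max(gmms, key=lambda e: gmms[e].score(utterance_feats))

# Hypothetical usage with random 13-dimensional "features":
rng = np.random.default_rng(0)
train = {e: rng.normal(loc=i, size=(500, 13))
         for i, e in enumerate(["anger", "happiness", "neutral"])}
gmms = train_gmms(train)
test = rng.normal(loc=0, size=(120, 13))    # drawn like the "anger" data
print(classify(gmms, test))                  # expected: "anger"
```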