Introduction.- Literature Survey.- A Framework for Spontaneous Speech Emotion Recognition.- Improving Emotion Classification Accuracies.- Case Studies.- Conclusions.- Appendix.- Index.
Rupayan Chakraborty (Member, IEEE) works as a scientist at TCS Research and Innovation, Mumbai. He has been working in the area of speech and audio signal processing and recognition since 2008 and was engaged in academic research prior to joining TCS, working as a researcher at the Computer Vision and Pattern Recognition (CVPR) Unit of the Indian Statistical Institute (ISI), Kolkata. He obtained his PhD degree from the TALP Research Centre, Barcelona, Spain, in December 2013, in the area of acoustic event detection and localization using distributed sensor arrays in room environments, while working on the “Speech and Audio Recognition for Ambient Intelligence (SARAI)” project. After completing his PhD, he was a visiting scientist at the CVPR Unit of ISI, Kolkata, for one year. He has published research work in top-tier conferences and journals and is currently working in the area of speech emotion recognition and analysis.
Meghna Pandharipande received her Bachelor of Engineering (BE) in Electronics and Telecommunication in June 2002 from Amravati University, Amravati. Between September 2002 and December 2003, she was a faculty member in the Department of Electronics and Telecommunication at Shah and Anchor Kutchhi Engineering College, Mumbai. In 2004, she completed a certification in Embedded Systems at CMC, Mumbai, and then worked for a year as a Lotus Notes developer at ATS, a startup in Mumbai. She has been with TCS since June 2005, having first joined the Cognitive Systems Research Laboratory, Tata InfoTech, under Prof. P.V.S. Rao, and since 2006 she has been working as a researcher at TCS Research and Innovation, Mumbai. Her research interests lie in speech signal processing, and she has been working extensively on building systems that can process all aspects of speech. More recently, she has been researching non-linguistic aspects of speech processing, such as speaking rate and emotion detection from speech.
Sunil Kumar Kopparapu (Senior Member, IEEE; Senior Member, ACM India) obtained his doctoral degree in Electrical Engineering from the Indian Institute of Technology Bombay, Mumbai, India, in 1997. His thesis, “Modular integration for low-level and high-level vision problems in a multi-resolution framework”, provided a broad framework to enable reliable and fast vision processing.
Between 1997 and 2000 he was with the Automation Group, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Brisbane, Australia, working on practical image processing and 3D vision problems, mainly for the benefit of the Australian mining industry.
Prior to joining the Cognitive Systems Research Laboratory (CSRL), Tata InfoTech Limited, as a senior research member in 2001, he worked on developing a “virtual self” line of e-commerce products for the R&D Group at Aquila Technologies Private Limited, India.
In his current role as a principal scientist at TCS Research and Innovation, Mumbai, he is actively working in the areas of speech, script, image, and natural-language processing, with a focus on building usable systems for mass use in Indian conditions. He has co-authored a book titled “Bayesian Approach to Image Interpretation” and, more recently, a SpringerBrief on “Non-Linguistic Analysis of Call Center Conversations”.
This book captures the current challenges in automatic recognition of emotion in spontaneous speech and makes an effort to explain, elaborate, and propose possible solutions. Intelligent human–computer interaction (iHCI) systems thrive on several technologies, such as automatic speech recognition (ASR), speaker identification, language identification, image and video recognition, and affect/mood/emotion analysis and recognition, to name a few. Given the importance of spontaneity in any human–machine conversational speech, reliable recognition of emotion from naturally spoken spontaneous speech is crucial. While emotions are easy for a machine to recognize when explicitly demonstrated by an actor, the same is not true for day-to-day, naturally spoken spontaneous speech. The book explores several reasons behind this, a key one being that people, especially non-actors, do not explicitly demonstrate their emotions when they speak, making it difficult for machines to distinguish one emotion from another embedded in their speech. This short book, based on some of the authors’ previously published work in the area of audio emotion analysis, identifies the practical challenges in analyzing emotions in spontaneous speech and puts forward several possible solutions that can assist in robustly determining the emotions expressed in spontaneous speech.