1 An Introduction to Text Analytics
2 Text Preparation and Similarity Computation
3 Matrix Factorization and Topic Modeling
4 Text Clustering
5 Text Classification: Basic Models
6 Linear Models for Classification and Regression
7 Classifier Performance and Evaluation
8 Joint Text Mining with Heterogeneous Data
9 Information Retrieval and Search Engines
10 Language Modeling and Deep Learning
11 Attention Mechanisms and Transformers
12 Text Summarization
13 Information Extraction and Knowledge Graphs
14 Question Answering
15 Opinion Mining and Sentiment Analysis
16 Text Segmentation and Event Detection
Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He received his undergraduate degree in Computer Science from the Indian Institute of Technology at Kanpur in 1993 and his Ph.D. in Operations Research from the Massachusetts Institute of Technology in 1996. He has published more than 400 papers in refereed conferences and journals, and has applied for or been granted more than 80 patents. He is the author or editor of 20 books, including textbooks on linear algebra, machine learning (for text), neural networks, recommender systems, and outlier analysis. Because of the commercial value of his patents, he has thrice been designated a Master Inventor at IBM. He has received several internal and external awards, including the EDBT Test-of-Time Award (2014), the ACM SIGKDD Innovation Award (2019), and the IEEE ICDM Research Contributions Award (2015). He is also a recipient of the W. Wallace McDowell Award, which is the highest technical honor given by the IEEE Computer Society in the field of computer science. He has served as an editor-in-chief of ACM SIGKDD Explorations. He currently serves as the editor-in-chief of the ACM Transactions on Knowledge Discovery from Data and as an editor-in-chief of ACM Books. He is a fellow of SIAM, the ACM, and the IEEE, for “contributions to knowledge discovery and data mining algorithms.”
This second-edition textbook presents a coherently organized framework for text analytics that integrates material from the intersecting fields of information retrieval, machine learning, and natural language processing. Particular emphasis is placed on deep learning methods. The chapters of this book span three broad categories:
1. Basic algorithms: Chapters 1 through 7 discuss classical algorithms for text analytics, such as preprocessing, similarity computation, topic modeling, matrix factorization, clustering, classification, regression, and ensemble analysis.
2. Domain-sensitive learning and information retrieval: Chapters 8 and 9 discuss learning models in heterogeneous settings such as a combination of text with multimedia or Web links. The problem of information retrieval and Web search is also discussed in the context of its relationship with ranking and machine learning methods.
3. Natural language processing: Chapters 10 through 16 discuss various sequence-centric and natural language applications, such as feature engineering, neural language models, deep learning, transformers, pre-trained language models, text summarization, information extraction, knowledge graphs, question answering, opinion mining, text segmentation, and event detection.
Compared to the first edition, this second edition has substantially more material on deep learning and natural language processing, with significant focus on topics such as transformers, pre-trained language models, knowledge graphs, and question answering. The book is aimed mostly at advanced-level students majoring in computer science and math.