ISBN-13: 9781138626911 / English / Hardcover / 2019 / 204 pp.
Textual Statistics with R comprehensively covers the main multidimensional methods in textual statistics supported by a specially written package in R. Methods discussed include correspondence analysis, clustering, and multiple factor analysis for contingency tables. Each method is illuminated by applications. The book is aimed at researchers and students in statistics, social sciences, history, literature and linguistics. The book will be of interest to anyone from practitioners needing to extract information from texts to students in the field of massive data, where the ability to process textual data is becoming essential.
"Even though textual data science cannot be considered as the youngest sibling of other data science fields, there is still quite a big space to be filled with up-to-date textbooks describing and analyzing various methods and facets of this very interesting topic. In this book, Mónica Bécue-Bertaut tries to fill this gap, giving theoretical and practical instructions about one of the relatively little known, but powerful methods in textual data science–Correspondence Analysis (CA)... Extensive graphical images and visualizations represented by various types of plot and diagram are used throughout the material, which provides an even better aid to the reader for grasping the main ideas of the topic... separate mention should be drawn to the language used in the book. It is clear, simple, and even fun to read, providing an understandable way of covering complex topics... Mónica Bécue-Bertaut achieved a good blend of theory and practice in her book, which can be used as a handy resource for students and beginners in data science, as well as for specialists in textual data analysis."
- Gia Jgarkava, ISCB December 2019
1. Encoding: from a corpus to statistical tables
Textual and contextual data
Textual data
Contextual data
Documents and aggregate documents
Examples and notation
Choosing textual units
Graphical forms
Lemmas
Stems
Repeated segments
In practice
Preprocessing
Unique spellings
Partially-automated preprocessing
Word selection
Word and segment indexes
The Life UK corpus: preliminary results
Verbal content through word and repeated segment indexes
Univariate description of contextual variables
A note on the frequency range
Implementation with the Xplortext package
In summary
2. Correspondence analysis of textual data
Data and goals
Correspondence analysis: a tool for linguistic data analysis
Data: a small example
Objectives
Associations between documents and words
Profile comparisons
Independence of documents and words
The χ² test
Association rates between columns and words
Active row and column clouds
Row and column profile spaces
Distributional equivalence and the χ² distance
Inertia of a cloud
Fitting document and word clouds
Factorial axes
Visualizing rows and columns
Category representation
Word representation
Transition formulas
Superimposed representation of rows and columns
Interpretation aids
Eigenvalues and representation quality of the clouds
Contribution of documents and words to axis inertia
Representation quality of a point
Supplementary rows and columns
Supplementary tables
Supplementary frequency rows and columns
Supplementary quantitative and qualitative variables
Validating the visualization
Interpretation scheme for textual CA results
Implementation with Xplortext
Summary of the CA approach
3. Applications of correspondence analysis
Choosing the level of detail for analyses
Correspondence analysis on aggregate free text answers
Data and objectives
Word selection
CA on the aggregate table
Document representation
Word representation
Simultaneous interpretation of the plots
Supplementary elements
Supplementary words
Supplementary repeated segments
Supplementary categories
Implementation with Xplortext
Direct analysis
Data and objectives
The main features of direct analysis
Direct analysis of the culture question
Implementation with Xplortext
4. Clustering in textual analysis
Clustering documents
Dissimilarity measures between documents
Measuring partition quality
Document clusters in the factorial space
Partition quality
Dissimilarity measures between document clusters
The single-linkage method
The complete-linkage method
Ward's method
Agglomerative hierarchical clustering
Hierarchical tree construction algorithm
Selecting the final partition
Interpreting clusters
Direct partitioning
Combining clustering methods
Consolidating partitions
Direct partitioning followed by AHC
A procedure for combining CA and clustering
Example: joint use of CA and AHC
Data and objectives
Data preprocessing using CA
Constructing the hierarchical tree
Choosing the final partition
Contiguity-constrained hierarchical clustering
Principles and algorithm
AHC of age groups with a chronological constraint
Implementation with Xplortext
Example: clustering free text answers
Data and objectives
Data preprocessing
CA: eigenvalues and total inertia
Interpreting the first axes
AHC: building the tree and choosing the final partition
Describing cluster features
Lexical features of clusters
Describing clusters in terms of characteristic words
Describing clusters in terms of characteristic documents
Describing clusters using contextual variables
Describing clusters using contextual qualitative variables
Describing clusters using quantitative contextual variables
Implementation with Xplortext
Summary of the use of AHC on factorial coordinates coming from CA
5. Lexical characterization of parts of a corpus
Characteristic words
Characteristic words and CA
Characteristic words and clustering
Clustering based on verbal content
Clustering based on contextual variables
Hierarchical words
Characteristic documents
Example: characteristic elements and CA
Characteristic words for the categories
Characteristic words and factorial planes
Documents that characterize categories
Characteristic words in addition to clustering
Implementation with Xplortext
6. Multiple factor analysis for textual analysis
Multiple tables in textual analysis
Data and objectives
Data preprocessing
Problems posed by lemmatization
Description of the corpora data
Indexes of the most frequent words
Notation
Objectives
Introduction to MFACT
The limits of CA on multiple contingency tables
How MFACT works
Integrating contextual variables
Analysis of multilingual free text answers
MFACT: eigenvalues of the global analysis
Representation of documents and words
Superimposed representation of the global and partial configurations
Links between the axes of the global analysis and the separate analyses
Representation of the groups of words
Implementation with Xplortext
Simultaneous analysis of two open-ended questions: impact of lemmatization
Objectives
Preliminary steps
MFACT on the left and right: lemmatized or nonlemmatized
Implementation with Xplortext
Other applications of MFACT in textual analysis
MFACT summary
7. Applications and analysis workflows
General rules for presenting results
Analyzing bibliographic databases
Introduction to the lupus data
The corpus
Exploratory analysis of the corpus
CA of the documents × words table
The eigenvalues
Meta-keys and doc-keys
Analysis of the year-aggregate table
Eigenvalues and CA of the lexical table
Chronological study of drug names
Implementation with Xplortext
Conclusions from the study
Badinter's speech: a discursive strategy
Methods
Breaking up the corpus into documents
The speech trajectory unveiled by CA
Results
Argument flow
Conclusions on the study of Badinter's speech
Implementation with Xplortext
Political speeches
Data and objectives
Methodology
Results
Data preprocessing
Lexicometric characteristics of the speeches and lexical table coding
Eigenvalues and Cramér's V
Speech trajectory
Word representation
Remarks
Hierarchical structure of the corpus
Conclusions
Implementation with Xplortext
Corpus of sensory descriptions
Introduction
Data
Eight Catalan wines
Jury
Verbal categorization
Encoding the data
Objectives
Statistical methodology
MFACT and constructing the mean configuration
Determining consensual words
Results
Data preprocessing
Some initial results
Individual configurations
MFACT: directions of inertia common to the majority of groups
MFACT: representing words and documents on the first plane
Word contributions
MFACT: group representation
Consensual words
Conclusion