ISBN-13: 9781119526810 / English / Hardcover / 2019 / 256 pages
Preface xi
About the Authors xv
Acknowledgements xvii

Chapter 1 Introduction to Data Science 1
  1.1 Why Data Science? 1
  1.2 What is Data Science? 1
  1.3 The Data Science Methodology 2
  1.4 Data Science Tasks 5
    1.4.1 Description 6
    1.4.2 Estimation 6
    1.4.3 Classification 6
    1.4.4 Clustering 7
    1.4.5 Prediction 7
    1.4.6 Association 7
  Exercises 8

Chapter 2 The Basics of Python and R 9
  2.1 Downloading Python 9
  2.2 Basics of Coding in Python 9
    2.2.1 Using Comments in Python 9
    2.2.2 Executing Commands in Python 10
    2.2.3 Importing Packages in Python 11
    2.2.4 Getting Data into Python 12
    2.2.5 Saving Output in Python 13
    2.2.6 Accessing Records and Variables in Python 14
    2.2.7 Setting Up Graphics in Python 15
  2.3 Downloading R and RStudio 17
  2.4 Basics of Coding in R 19
    2.4.1 Using Comments in R 19
    2.4.2 Executing Commands in R 20
    2.4.3 Importing Packages in R 20
    2.4.4 Getting Data into R 21
    2.4.5 Saving Output in R 23
    2.4.6 Accessing Records and Variables in R 24
  References 26
  Exercises 26

Chapter 3 Data Preparation 29
  3.1 The Bank Marketing Data Set 29
  3.2 The Problem Understanding Phase 29
    3.2.1 Clearly Enunciate the Project Objectives 29
    3.2.2 Translate These Objectives into a Data Science Problem 30
  3.3 Data Preparation Phase 31
  3.4 Adding an Index Field 31
    3.4.1 How to Add an Index Field Using Python 31
    3.4.2 How to Add an Index Field Using R 32
  3.5 Changing Misleading Field Values 33
    3.5.1 How to Change Misleading Field Values Using Python 34
    3.5.2 How to Change Misleading Field Values Using R 34
  3.6 Reexpression of Categorical Data as Numeric 36
    3.6.1 How to Reexpress Categorical Field Values Using Python 36
    3.6.2 How to Reexpress Categorical Field Values Using R 38
  3.7 Standardizing the Numeric Fields 39
    3.7.1 How to Standardize Numeric Fields Using Python 40
    3.7.2 How to Standardize Numeric Fields Using R 40
  3.8 Identifying Outliers 40
    3.8.1 How to Identify Outliers Using Python 41
    3.8.2 How to Identify Outliers Using R 42
  References 43
  Exercises 44

Chapter 4 Exploratory Data Analysis 47
  4.1 EDA Versus HT 47
  4.2 Bar Graphs with Response Overlay 47
    4.2.1 How to Construct a Bar Graph with Overlay Using Python 49
    4.2.2 How to Construct a Bar Graph with Overlay Using R 50
  4.3 Contingency Tables 51
    4.3.1 How to Construct Contingency Tables Using Python 52
    4.3.2 How to Construct Contingency Tables Using R 53
  4.4 Histograms with Response Overlay 53
    4.4.1 How to Construct Histograms with Overlay Using Python 55
    4.4.2 How to Construct Histograms with Overlay Using R 58
  4.5 Binning Based on Predictive Value 58
    4.5.1 How to Perform Binning Based on Predictive Value Using Python 59
    4.5.2 How to Perform Binning Based on Predictive Value Using R 62
  References 63
  Exercises 63

Chapter 5 Preparing to Model the Data 69
  5.1 The Story So Far 69
  5.2 Partitioning the Data 69
    5.2.1 How to Partition the Data in Python 70
    5.2.2 How to Partition the Data in R 71
  5.3 Validating Your Partition 72
  5.4 Balancing the Training Data Set 73
    5.4.1 How to Balance the Training Data Set in Python 74
    5.4.2 How to Balance the Training Data Set in R 75
  5.5 Establishing Baseline Model Performance 77
  References 78
  Exercises 78

Chapter 6 Decision Trees 81
  6.1 Introduction to Decision Trees 81
  6.2 Classification and Regression Trees 83
    6.2.1 How to Build CART Decision Trees Using Python 84
    6.2.2 How to Build CART Decision Trees Using R 86
  6.3 The C5.0 Algorithm for Building Decision Trees 88
    6.3.1 How to Build C5.0 Decision Trees Using Python 89
    6.3.2 How to Build C5.0 Decision Trees Using R 90
  6.4 Random Forests 91
    6.4.1 How to Build Random Forests in Python 92
    6.4.2 How to Build Random Forests in R 92
  References 93
  Exercises 93

Chapter 7 Model Evaluation 97
  7.1 Introduction to Model Evaluation 97
  7.2 Classification Evaluation Measures 97
  7.3 Sensitivity and Specificity 99
  7.4 Precision, Recall, and Fβ Scores 99
  7.5 Method for Model Evaluation 100
  7.6 An Application of Model Evaluation 100
    7.6.1 How to Perform Model Evaluation Using R 103
  7.7 Accounting for Unequal Error Costs 104
    7.7.1 Accounting for Unequal Error Costs Using R 105
  7.8 Comparing Models with and without Unequal Error Costs 106
  7.9 Data-Driven Error Costs 107
  Exercises 109

Chapter 8 Naïve Bayes Classification 113
  8.1 Introduction to Naïve Bayes 113
  8.2 Bayes Theorem 113
  8.3 Maximum a Posteriori Hypothesis 114
  8.4 Class Conditional Independence 114
  8.5 Application of Naïve Bayes Classification 115
    8.5.1 Naïve Bayes in Python 121
    8.5.2 Naïve Bayes in R 123
  References 125
  Exercises 126

Chapter 9 Neural Networks 129
  9.1 Introduction to Neural Networks 129
  9.2 The Neural Network Structure 129
  9.3 Connection Weights and the Combination Function 131
  9.4 The Sigmoid Activation Function 133
  9.5 Backpropagation 134
  9.6 An Application of a Neural Network Model 134
  9.7 Interpreting the Weights in a Neural Network Model 136
  9.8 How to Use Neural Networks in R 137
  References 138
  Exercises 138

Chapter 10 Clustering 141
  10.1 What is Clustering? 141
  10.2 Introduction to the K-Means Clustering Algorithm 142
  10.3 An Application of K-Means Clustering 143
  10.4 Cluster Validation 144
  10.5 How to Perform K-Means Clustering Using Python 145
  10.6 How to Perform K-Means Clustering Using R 147
  Exercises 149

Chapter 11 Regression Modeling 151
  11.1 The Estimation Task 151
  11.2 Descriptive Regression Modeling 151
  11.3 An Application of Multiple Regression Modeling 152
  11.4 How to Perform Multiple Regression Modeling Using Python 154
  11.5 How to Perform Multiple Regression Modeling Using R 156
  11.6 Model Evaluation for Estimation 157
    11.6.1 How to Perform Estimation Model Evaluation Using Python 159
    11.6.2 How to Perform Estimation Model Evaluation Using R 160
  11.7 Stepwise Regression 161
    11.7.1 How to Perform Stepwise Regression Using R 162
  11.8 Baseline Models for Regression 162
  References 163
  Exercises 164

Chapter 12 Dimension Reduction 167
  12.1 The Need for Dimension Reduction 167
  12.2 Multicollinearity 168
  12.3 Identifying Multicollinearity Using Variance Inflation Factors 171
    12.3.1 How to Identify Multicollinearity Using Python 172
    12.3.2 How to Identify Multicollinearity in R 173
  12.4 Principal Components Analysis 175
  12.5 An Application of Principal Components Analysis 175
  12.6 How Many Components Should We Extract? 176
    12.6.1 The Eigenvalue Criterion 176
    12.6.2 The Proportion of Variance Explained Criterion 177
  12.7 Performing PCA with K = 4 178
  12.8 Validation of the Principal Components 178
  12.9 How to Perform Principal Components Analysis Using Python 179
  12.10 How to Perform Principal Components Analysis Using R 181
  12.11 When is Multicollinearity Not a Problem? 183
  References 184
  Exercises 184

Chapter 13 Generalized Linear Models 187
  13.1 An Overview of General Linear Models 187
  13.2 Linear Regression as a General Linear Model 188
  13.3 Logistic Regression as a General Linear Model 188
  13.4 An Application of Logistic Regression Modeling 189
    13.4.1 How to Perform Logistic Regression Using Python 190
    13.4.2 How to Perform Logistic Regression Using R 191
  13.5 Poisson Regression 192
  13.6 An Application of Poisson Regression Modeling 192
    13.6.1 How to Perform Poisson Regression Using Python 193
    13.6.2 How to Perform Poisson Regression Using R 194
  Reference 195
  Exercises 195

Chapter 14 Association Rules 199
  14.1 Introduction to Association Rules 199
  14.2 A Simple Example of Association Rule Mining 200
  14.3 Support, Confidence, and Lift 200
  14.4 Mining Association Rules 202
    14.4.1 How to Mine Association Rules Using R 203
  14.5 Confirming Our Metrics 207
  14.6 The Confidence Difference Criterion 208
    14.6.1 How to Apply the Confidence Difference Criterion Using R 208
  14.7 The Confidence Quotient Criterion 209
    14.7.1 How to Apply the Confidence Quotient Criterion Using R 210
  References 211
  Exercises 211

Appendix Data Summarization and Visualization 215
  Part 1: Summarization 1: Building Blocks of Data Analysis 215
  Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data 217
  Part 3: Summarization 2: Measures of Center, Variability, and Position 222
  Part 4: Summarization and Visualization of Bivariate Relationships 225

Index 231
CHANTAL D. LAROSE, PhD, is an Assistant Professor of Statistics & Data Science at Eastern Connecticut State University (ECSU). She has co-authored three books on data science and predictive analytics and helped develop data science programs at ECSU and SUNY New Paltz. Her PhD dissertation, Model-Based Clustering of Incomplete Data, addresses the persistent problem of doing data science with incomplete data.

DANIEL T. LAROSE, PhD, is a Professor of Data Science and Statistics and Director of the Data Science programs at Central Connecticut State University. He has published many books on data science, data mining, predictive analytics, and statistics. His consulting clients include The Economist magazine, Forbes magazine, the CIT Group, and Microsoft.