ISBN-13: 9781119909248 / English / Paperback / 2023 / 400 pages
Introduction xxiii
Chapter 1 AWS Data Lakes and Analytics Technology Overview 1
Why AWS? 1
What Does a Data Lake Look Like in AWS? 2
Analytics on AWS 3
Skills Required to Build and Maintain an AWS Analytics Pipeline 3
Chapter 2 The Path to Analytics: Setting Up a Data and Analytics Team 5
The Data Vision 6
Support 6
DA Team Roles 7
Early Stage Roles 7
Team Lead 8
Data Architect 8
Data Engineer 8
Data Analyst 9
Maturity Stage Roles 9
Data Scientist 9
Cloud Engineer 10
Business Intelligence (BI) Developer 10
Machine Learning Engineer 10
Business Analyst 11
Niche Roles 11
Analytics Flow at a Process Level 12
Workflow Methodology 12
The DA Team Mantra: "Automate Everything" 14
Analytics Models in the Wild: Centralized, Distributed, Center of Excellence 15
Centralized 15
Distributed 16
Center of Excellence 16
Summary 17
Chapter 3 Working on AWS 19
Accessing AWS 20
Everything Is a Resource 21
S3: An Important Exception 21
IAM: Policies, Roles, and Users 22
Policies 22
Identity-Based Policies 24
Resource-Based Policies 25
Roles 25
Users and User Groups 25
Summarizing IAM 26
Working with the Web Console 26
The AWS Command-Line Interface 29
Installing AWS CLI 29
Linux Installation 30
macOS Installation 30
Windows 31
Configuring AWS CLI 31
A Note on Region 33
Setting Individual Parameters 33
Using Profiles and Configuration Files 33
Final Notes on Configuration 36
Using the AWS CLI 36
Using Skeletons and File Inputs 39
Cleaning Up! 43
Infrastructure-as-Code: CloudFormation and Terraform 44
CloudFormation 44
CloudFormation Stacks 46
CloudFormation Template Anatomy 47
CloudFormation Changesets 52
Getting Stack Information 55
Cleaning Up Again 57
CloudFormation Conclusions 58
Terraform 58
Coding Style 58
Modularity 59
Limitations 59
Terraform vs. CloudFormation 60
Infrastructure-as-Code: CDK, Pulumi, Cloudcraft, and Other Solutions 60
AWS CDK 60
Pulumi 62
Cloudcraft 62
Infrastructure Management Conclusions 63
Chapter 4 Serverless Computing and Data Engineering 65
Serverless vs. Fully Managed 65
AWS Serverless Technologies 66
AWS Lambda 67
Pricing Model 67
Laser Focus on Code 68
The Lambda Paradigm Shift 69
Virtually Infinite Scalability 70
Geographical Distribution 70
A Lambda Hello World 71
Lambda Configuration 74
Runtime 74
Container-Based Lambdas 75
Architectures 75
Memory 75
Networking 76
Execution Role 76
Environment Variables 76
AWS EventBridge 77
AWS Fargate 77
AWS DynamoDB 77
AWS SNS 77
Amazon SQS 78
AWS CloudWatch 78
Amazon QuickSight 78
AWS Step Functions 78
Amazon API Gateway 79
Amazon Cognito 79
AWS Serverless Application Model (SAM) 79
Ephemeral Infrastructure 80
AWS SAM Installation 80
Configuration 80
Creating Your First AWS SAM Project 81
Application Structure 83
SAM Resource Types 85
SAM Lambda Template 86
!! Recursive Lambda Invocation !! 88
Function Metadata 88
Outputs 89
Implicitly Generated Resources 89
Other Template Sections 90
Lambda Code 90
Building Your First SAM Application 93
Testing the AWS SAM Application Locally 96
Deployment 99
Cleaning Up 104
Summary 104
Chapter 5 Data Ingestion 105
AWS Data Lake Architecture 106
Serverless Data Lake Architecture Structure 106
Ingestion 106
Storage and Processing 108
Cataloging, Governance, and Search 108
Security and Monitoring 109
Consumption 109
Sample Processing Architecture: Cataloging Images into DynamoDB 109
Use Case Description 109
SAM Application Creation 110
S3-Triggered Lambda 111
Adding DynamoDB 119
Lambda Execution Context 121
Inserting into DynamoDB 121
Cleaning Up 123
Serverless Ingestion 124
AWS Fargate 124
AWS Lambda 124
Example Architecture: Fargate-Based Periodic Batch Import 125
The Basic Importer 125
ECS CLI 128
AWS Copilot CLI 128
Clean Up 136
AWS Kinesis Ingestion 136
Example Architecture: Two-Pronged Delivery 137
Fully Managed Ingestion with AppFlow 146
Operational Data Ingestion with Database Migration Service 151
DMS Concepts 151
DMS Instance 151
DMS Endpoints 152
DMS Tasks 152
Summary of the Workflow 152
Common Use of DMS 153
Example Architecture: DMS to S3 154
DMS Instance 154
DMS Endpoints 156
DMS Task 162
Summary 167
Chapter 6 Processing Data 169
Phases of Data Preparation 170
What Is ETL? Why Should I Care? 170
ETL Job vs. Streaming Job 171
Overview of ETL in AWS 172
ETL with AWS Glue 172
ETL with Lambda Functions 172
ETL with Hadoop/EMR 173
Other Ways to Perform ETL 173
ETL Job Design Concepts 173
Source Identification 174
Destination Identification 174
Mappings 174
Validation 174
Filter 175
Join, Denormalization, Relationalization 175
AWS Glue for ETL 176
Really, It's Just Spark 176
Visual 176
Spark Script Editor 177
Python Shell Script Editor 177
Jupyter Notebook 177
Connectors 177
Creating Connections 178
Creating Connections with the Web Console 178
Creating Connections with the AWS CLI 179
Creating ETL Jobs with AWS Glue Visual Editor 184
ETL Example: Format Switch from Raw (JSON) to Cleaned (Parquet) 184
Job Bookmarks 187
Transformations 188
Apply Mapping 189
Filter 189
Other Available Transforms 190
Run the Edited Job 191
Visual Editor with Source and Target Conclusions 192
Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target) 192
Creating ETL Jobs with the Spark Script Editor 192
Developing ETL Jobs with AWS Glue Notebooks 193
What Is a Notebook? 194
Notebook Structure 194
Step 1: Load Code into a DynamicFrame 196
Step 2: Apply Field Mapping 197
Step 3: Apply the Filter 197
Step 4: Write to S3 in Parquet Format 198
Example: Joining and Denormalizing Data from Two S3 Locations 199
Conclusions for Manually Authored Jobs with Notebooks 203
Creating ETL Jobs with AWS Glue Interactive Sessions 204
It's Magic 205
Development Workflow 206
Streaming Jobs 207
Differences with a Standard ETL Job 208
Streaming Sources 208
Example: Process Kinesis Streams with a Streaming Job 208
Streaming ETL Jobs Conclusions 217
Summary 217
Chapter 7 Cataloging, Governance, and Search 219
Cataloging with AWS Glue 219
AWS Glue and the AWS Glue Data Catalog 219
Glue Databases and Tables 220
Databases 220
The Idea of Schema-on-Read 221
Tables 222
Create Table Manually 223
Creating a Table from an Existing Schema 225
Creating a Table with a Crawler 225
Summary on Databases and Tables 226
Crawlers 226
Updating or Not Updating? 230
Running the Crawler 231
Creating a Crawler from the AWS CLI 231
Retrieving Table Information from the CLI 233
Classifiers 235
Classifier Example 236
Crawlers and Classifiers Summary 237
Search with Amazon Athena: The Heart of Analytics in AWS 238
A Bit of History 238
Interface Overview 238
Creating Tables Manually 239
Athena Data Types 240
Complex Types 241
Running a Query 242
Connecting with JDBC and ODBC 243
Query Stats 243
Recent Queries and Saved Queries 243
The Power of Partitions 244
Athena Pricing Model 244
Automatic Naming 245
Athena Query Output 246
Athena Peculiarities (SQL and Not) 246
Computed Fields Gotcha and WITH Statement Workaround 246
Lowercase! 247
Query Explain 248
Deduplicating Records 249
Working with JSON, Flattening, and Unnesting 250
Athena Views 251
Create Table as Select (CTAS) 252
Saving Queries and Reusing Saved Queries 253
Running Parameterized Queries 254
Athena Federated Queries 254
Athena Lambda Connectors 255
Note on Connection Errors 256
Performing Federated Queries 257
Creating a View from a Federated Query 258
Governing: Athena Workgroups, Lake Formation, and More 258
Athena Workgroups 259
Fine-Grained Athena Access with IAM 262
Recap of Athena-Based Governance 264
AWS Lake Formation 265
Registering a Location in Lake Formation 266
Creating a Database in Lake Formation 268
Assigning Permissions in Lake Formation 269
LF-Tags and Permissions in Lake Formation 271
Data Filters 277
Governance Conclusions 279
Summary 280
Chapter 8 Data Consumption: BI, Visualization, and Reporting 283
QuickSight 283
Signing Up for QuickSight 284
Standard Plan 284
Enterprise Plan 284
Users and User Groups 285
Managing Users and Groups 285
Managing QuickSight 286
Users and Groups 287
Your Subscriptions 287
SPICE Capacity 287
Account Settings 287
Security and Permissions 287
VPC Connections 288
Mobile Settings 289
Domains and Embedding 289
Single Sign-On 289
Data Sources and Datasets 289
Creating an Athena Data Source 291
Creating Other Data Sources 292
Creating a Data Source from the AWS CLI 292
Creating a Dataset from a Table 294
Creating a Dataset from a SQL Query 295
Duplicating Datasets 296
Note on Creating Datasets 297
QuickSight Favorites, Recent, and Folders 297
SPICE 298
Manage SPICE Capacity 298
Refresh Schedule 299
QuickSight Data Editor 299
QuickSight Data Types 302
Change Data Types 302
Calculated Fields 303
Joining Data 305
Excluding Fields 309
Filtering Data 309
Removing Data 310
Geospatial Hierarchies and Adding Fields to Hierarchies 310
Unsupported Format Dates 311
Visualizing Data: QuickSight Analysis 312
Adding a Title and a Description to Your Analysis 313
Renaming the Sheet 314
Your First Visual with AutoGraph 314
Field Wells 314
Visuals Types 315
Saving and Autosaving 316
A First Example: Pie Chart 316
Renaming a Visual 317
Filtering Data 318
Adding Drill-Downs 320
Parameters 321
Actions 324
Insights 328
ML-Powered Insights 330
Sharing an Analysis 335
Dashboards 335
Dashboard Layouts and Themes 335
Publishing a Dashboard 336
Embedding Visuals and Dashboards 337
Data Consumption: Not Only Dashboards 337
Summary 338
Chapter 9 Machine Learning at Scale 339
Machine Learning and Artificial Intelligence 339
What Are ML/AI Use Cases? 340
Types of ML Models 340
Overview of ML/AI AWS Solutions 341
Amazon SageMaker 341
SageMaker Domains 342
Adding a User to the Domain 344
SageMaker Studio 344
SageMaker Example Notebook 346
Step 1: Prerequisites and Preprocessing 346
Step 2: Data Ingestion 347
Step 3: Data Inspection 348
Step 4: Data Conversion 349
Step 5: Upload Training Data 349
Step 6: Train the Model 349
Step 7: Set Up Hosting and Deploy the Model 351
Step 8: Validate the Model 352
Step 9: Use the Model 353
Inference 353
Real Time 354
Asynchronous 354
Serverless 354
Batch Transform 354
Data Wrangler 356
SageMaker Canvas 357
Summary 358
Appendix Example Data Architectures in AWS 359
Modern Data Lake Architecture 360
ETL in a Lake House 361
Consuming Data in the Lake House 361
The Modern Data Lake Architecture 362
Batch Processing 362
Stream Processing 363
Architecture Design Recommendations 364
Automate Everything 365
Build on Events 365
Performance = Cost Savings 365
AWS Glue Catalog and Athena-Centric Workflow 365
Design Flexible 365
Pick Your Battles 365
Parquet 366
Summary 366
Index 367
GIONATA "JOE" MINICHINO is Principal Software Engineer and Data Architect on the Data & Analytics Team at Teamwork. He specializes in cloud computing, machine/deep learning, and artificial intelligence, and he designs end-to-end Amazon Web Services pipelines that move large quantities of diverse data for analysis and visualization.