Spark: Big Data Cluster Computing in Production

ISBN-13: 9781119254010 / English / Paperback / 2016 / 216 pages

Ganelin, Ilya

Price: 195.86
(net: 186.53, VAT: 5%)

Lowest price in the last 30 days: 193.46

Order fulfillment time:
approx. 30 business days.

Free delivery!

Production-targeted Spark guidance with real-world use cases

Spark: Big Data Cluster Computing in Production goes beyond general Spark overviews to provide targeted guidance on running fast big-data cluster computing in production. Written by an expert team well known in the big data community, this book walks you through the challenges of moving from proof-of-concept or demo Spark applications to live Spark in production. Real use cases provide insight into common problems, limitations, challenges, and opportunities, while expert tips and tricks help you get the most out of Spark performance. Coverage includes Spark SQL, Tachyon, Kerberos, MLlib, YARN, and Mesos, with clear, actionable guidance on resource scheduling, database connectors, streaming, security, and much more.

Spark has become the tool of choice for many big data problems, with more active contributors than any other Apache Software Foundation project. General introductory books abound, but this book is the first to provide deep insight and real-world advice on using Spark in production. Specific guidance, expert tips, and foresight make this guide a useful resource for real production settings.

Review Spark hardware requirements and estimate cluster size
Gain insight from real-world production use cases
Tighten security, schedule resources, and fine-tune performance
Overcome common problems encountered using Spark in production

Spark works with other big data tools including MapReduce and Hadoop, and uses languages you already know like Java, Scala, Python, and R. Lightning speed makes Spark too good to pass up, but understanding limitations and challenges in advance goes a long way toward easing actual production implementation. Spark: Big Data Cluster Computing in Production tells you everything you need to know, with real-world production insight and expert guidance, tips, and tricks.

Categories: Computer Science, Databases

BISAC categories: Computers > Data Science - Data Warehousing

Publisher: John Wiley & Sons

Language: English

ISBN-13: 9781119254010

Year of publication: 2016

Number of pages: 216

Weight: 0.41 kg

Dimensions: 23.37 x 18.54 x 1.27 cm

Binding: Paperback

Number of volumes:

Additional information: Illustrated edition

Introduction

Chapter 1 Finishing Your Spark Job

Installation of the Necessary Components

Native Installation Using a Spark Standalone Cluster

The History of Distributed Computing That Led to Spark

Enter the Cloud

Understanding Resource Management

Using Various Formats for Storage

Text Files

Sequence Files

Avro Files

Parquet Files

Making Sense of Monitoring and Instrumentation

Spark UI

Spark Standalone UI

Metrics REST API

Metrics System

External Monitoring Tools

Summary

Chapter 2 Cluster Management

Background

Spark Components

Driver

Workers and Executors

Configuration

Spark Standalone

Architecture

Single-Node Setup Scenario

Multi-Node Setup

YARN

Architecture

Dynamic Resource Allocation

Scenario

Mesos

Setup

Architecture

Dynamic Resource Allocation

Basic Setup Scenario

Comparison

Summary

Chapter 3 Performance Tuning

Spark Execution Model

Partitioning

Controlling Parallelism

Partitioners

Shuffling Data

Shuffling and Data Partitioning

Operators and Shuffling

Shuffling Is Not That Bad After All

Serialization

Kryo Registrators

Spark Cache

Spark SQL Cache

Memory Management

Garbage Collection

Shared Variables

Broadcast Variables

Accumulators

Data Locality

Summary

Chapter 4 Security

Architecture

Security Manager

Setup Configurations

ACL

Configuration

Job Submission

Web UI

Network Security

Encryption

Event logging

Kerberos

Apache Sentry

Summary

Chapter 5 Fault Tolerance or Job Execution

Lifecycle of a Spark Job

Spark Master

Spark Driver

Spark Worker

Job Lifecycle

Job Scheduling

Scheduling within an Application

Scheduling with External Utilities

Fault Tolerance

Internal and External Fault Tolerance

Service Level Agreements (SLAs)

Resilient Distributed Datasets (RDDs)

Batch versus Streaming

Testing Strategies

Recommended Configurations

Summary

Chapter 6 Beyond Spark

Data Warehousing

Spark SQL CLI

Thrift JDBC/ODBC Server

Hive on Spark

Machine Learning

DataFrame

MLlib and ML

Mahout on Spark

Hivemall on Spark

External Frameworks

Spark Package

XGBoost

spark-jobserver

Future Works

Integration with the Parameter Server

Deep Learning

Enterprise Usage

Collecting User Activity Log with Spark and Kafka

Real-Time Recommendation with Spark

Real-Time Categorization of Twitter Bots

Summary

Index

Ilya Ganelin is a data engineer at the Capital One Data Innovation Lab. Ilya is an active contributor to the core components of Apache Spark and a committer to Apache Apex.

Ema Orhian is a Big Data Engineer interested in scaling algorithms. She is the main committer on jaws-spark-sql-rest, a data warehouse explorer on top of Spark SQL.

Kai Sasaki is a software engineer working in distributed computing and machine learning. He is a Spark contributor who works mainly on the MLlib and ML libraries.

Brennon York has been a core contributor to Apache Spark since 2014, including development on GraphX and the core build environment.

TIPS, TRICKS, AND SOLUTIONS FOR USING SPARK IN PRODUCTION

Spark's popularity means the field is expanding in terms of both use and capability. Faster than Hadoop and MapReduce, but compatible with Java®, Scala, Python®, and R, this open source clustering framework is becoming a must-have skill. Spark: Big Data Cluster Computing in Production goes beyond the basics to show you how to bring Spark to real-world production environments. With expert instruction, real-life use cases, and frank discussion, this guide helps you move past the challenges and bring proof-of-concept Spark applications live.

Fine-tune your Spark app to run on production data
Manage resources, organize storage, and master monitoring
Learn about potential problems from real-world use cases, and see where Spark fits best
Estimate cluster size and nail down hardware requirements
Tune up performance with memory management, partitioning, shuffling, and more
Ensure data security with Kerberos
Head off Spark Streaming problems in production
Integrate Spark with YARN, Mesos, Tachyon, and more
