Advanced Analytics with Spark. Patterns for Learning from Data at Scale. 2nd Edition - Helion
ISBN: 978-14-919-7290-8
Pages: 280, Format: ebook
Publication date: 2017-06-12
Bookstore: Helion
Book price: 152.15 zł (previously: 176.92 zł)
You save: 14% (-24.77 zł)
In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming.
You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance.
If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.
With this book, you will:
- Familiarize yourself with the Spark programming model
- Become comfortable within the Spark ecosystem
- Learn general approaches in data science
- Examine complete implementations that analyze large public data sets
- Discover which machine learning tools make sense for particular problems
- Acquire code that can be adapted to many uses
Table of Contents
- Foreword
- Preface
- What's in This Book
- The Second Edition
- Using Code Examples
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
- 1. Analyzing Big Data
- The Challenges of Data Science
- Introducing Apache Spark
- About This Book
- The Second Edition
- 2. Introduction to Data Analysis with Scala and Spark
- Scala for Data Scientists
- The Spark Programming Model
- Record Linkage
- Getting Started: The Spark Shell and SparkContext
- Bringing Data from the Cluster to the Client
- Shipping Code from the Client to the Cluster
- From RDDs to Data Frames
- Analyzing Data with the DataFrame API
- Fast Summary Statistics for DataFrames
- Pivoting and Reshaping DataFrames
- Joining DataFrames and Selecting Features
- Preparing Models for Production Environments
- Model Evaluation
- Where to Go from Here
- 3. Recommending Music and the Audioscrobbler Data Set
- Data Set
- The Alternating Least Squares Recommender Algorithm
- Preparing the Data
- Building a First Model
- Spot Checking Recommendations
- Evaluating Recommendation Quality
- Computing AUC
- Hyperparameter Selection
- Making Recommendations
- Where to Go from Here
- 4. Predicting Forest Cover with Decision Trees
- Fast Forward to Regression
- Vectors and Features
- Training Examples
- Decision Trees and Forests
- Covtype Data Set
- Preparing the Data
- A First Decision Tree
- Decision Tree Hyperparameters
- Tuning Decision Trees
- Categorical Features Revisited
- Random Decision Forests
- Making Predictions
- Where to Go from Here
- 5. Anomaly Detection in Network Traffic with K-means Clustering
- Anomaly Detection
- K-means Clustering
- Network Intrusion
- KDD Cup 1999 Data Set
- A First Take on Clustering
- Choosing k
- Visualization with SparkR
- Feature Normalization
- Categorical Variables
- Using Labels with Entropy
- Clustering in Action
- Where to Go from Here
- 6. Understanding Wikipedia with Latent Semantic Analysis
- The Document-Term Matrix
- Getting the Data
- Parsing and Preparing the Data
- Lemmatization
- Computing the TF-IDFs
- Singular Value Decomposition
- Finding Important Concepts
- Querying and Scoring with a Low-Dimensional Representation
- Term-Term Relevance
- Document-Document Relevance
- Document-Term Relevance
- Multiple-Term Queries
- Where to Go from Here
- 7. Analyzing Co-Occurrence Networks with GraphX
- The MEDLINE Citation Index: A Network Analysis
- Getting the Data
- Parsing XML Documents with Scala's XML Library
- Analyzing the MeSH Major Topics and Their Co-Occurrences
- Constructing a Co-Occurrence Network with GraphX
- Understanding the Structure of Networks
- Connected Components
- Degree Distribution
- Filtering Out Noisy Edges
- Processing EdgeTriplets
- Analyzing the Filtered Graph
- Small-World Networks
- Cliques and Clustering Coefficients
- Computing Average Path Length with Pregel
- Where to Go from Here
- 8. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data
- Getting the Data
- Working with Third-Party Libraries in Spark
- Geospatial Data with the Esri Geometry API and Spray
- Exploring the Esri Geometry API
- Intro to GeoJSON
- Preparing the New York City Taxi Trip Data
- Handling Invalid Records at Scale
- Geospatial Analysis
- Sessionization in Spark
- Building Sessions: Secondary Sorts in Spark
- Where to Go from Here
- 9. Estimating Financial Risk Through Monte Carlo Simulation
- Terminology
- Methods for Calculating VaR
- Variance-Covariance
- Historical Simulation
- Monte Carlo Simulation
- Our Model
- Getting the Data
- Preprocessing
- Determining the Factor Weights
- Sampling
- The Multivariate Normal Distribution
- Running the Trials
- Visualizing the Distribution of Returns
- Evaluating Our Results
- Where to Go from Here
- 10. Analyzing Genomics Data and the BDG Project
- Decoupling Storage from Modeling
- Ingesting Genomics Data with the ADAM CLI
- Parquet Format and Columnar Storage
- Predicting Transcription Factor Binding Sites from ENCODE Data
- Querying Genotypes from the 1000 Genomes Project
- Where to Go from Here
- 11. Analyzing Neuroimaging Data with PySpark and Thunder
- Overview of PySpark
- PySpark Internals
- Overview and Installation of the Thunder Library
- Loading Data with Thunder
- Thunder Core Data Types
- Categorizing Neuron Types with Thunder
- Where to Go from Here
- Index