Agile Data Science 2.0. Building Full-Stack Data Analytics Applications with Spark - Helion


Author: Russell Jurney
ISBN: 978-14-919-6006-6
Pages: 352, Format: ebook
Publication date: 2017-06-07
Bookstore: Helion

Price: 126,65 zł (previously: 147,27 zł)
You save: 14% (-20,62 zł)


Tags: Agile - Programming

Data science teams looking to turn research into useful analytics applications require not only the right tools, but also the right approach if they’re to succeed. With the revised second edition of this hands-on guide, up-and-coming data scientists will learn how to use the Agile Data Science development methodology to build data applications with Python, Apache Spark, Kafka, and other tools.

Author Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications with Apache Kafka, MongoDB, Elasticsearch, d3.js, scikit-learn, and Apache Airflow. You’ll learn an iterative approach that lets you quickly change the kind of analysis you’re doing, depending on what the data is telling you. Publish data science work as a web application, and effect meaningful change in your organization.

- Build value from your data in a series of agile sprints, using the data-value pyramid
- Extract features for statistical models from a single dataset
- Visualize data with charts, and expose different aspects through interactive reports
- Use historical data to predict the future via classification and regression
- Translate predictions into actions
- Get feedback from users after each sprint to keep your project on track


Table of Contents

Preface
- Agile Data Science Mailing List
- Data Syndrome, Product Analytics Consultancy
  - Live Training
- Who This Book Is For
- How This Book Is Organized
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Safari
- How to Contact Us
I. Setup
1. Theory
- Introduction
- Definition
  - Methodology as Tweet
  - Agile Data Science Manifesto
    - Iterate, iterate, iterate
    - Ship intermediate output
    - Prototype experiments over implementing tasks
    - Integrate the tyrannical opinion of data
    - Climb up and down the data-value pyramid
    - Discover and pursue the critical path to a killer product
    - Get meta
    - Synthesis
- The Problem with the Waterfall
  - Research Versus Application Development
- The Problem with Agile Software
  - Eventual Quality: Financing Technical Debt
  - The Pull of the Waterfall
- The Data Science Process
  - Setting Expectations
  - Data Science Team Roles
  - Recognizing the Opportunity and the Problem
  - Adapting to Change
    - Harnessing the power of generalists
    - Leveraging agile platforms
    - Sharing intermediate results
- Notes on Process
  - Code Review and Pair Programming
  - Agile Environments: Engineering Productivity
    - Collaboration space
    - Private space
    - Personal space
  - Realizing Ideas with Large-Format Printing
2. Agile Tools
- Scalability = Simplicity
- Agile Data Science Data Processing
- Local Environment Setup
  - System Requirements
  - Setting Up Vagrant
  - Downloading the Data
- EC2 Environment Setup
  - Downloading the Data
- Getting and Running the Code
  - Getting the Code
  - Running the Code
  - Jupyter Notebooks
- Touring the Toolset
  - Agile Stack Requirements
  - Python 3
    - Anaconda and Miniconda
    - Jupyter notebooks
  - Serializing Events with JSON Lines and Parquet
    - JSON for Python
  - Collecting Data
  - Data Processing with Spark
    - Hadoop required
    - Processing data with Spark
  - Publishing Data with MongoDB
    - Booting Mongo
    - Pushing data to MongoDB from PySpark
  - Searching Data with Elasticsearch
    - Elasticsearch and PySpark
      - Making PySpark data searchable
      - Searching our data
    - Python and Elasticsearch with pyelasticsearch
  - Distributed Streams with Apache Kafka
    - Starting up Kafka
    - Topics, console producer, and console consumer
    - Realtime versus batch computing with Spark
    - Kafka in Python with kafka-python
  - Processing Streams with PySpark Streaming
  - Machine Learning with scikit-learn and Spark MLlib
    - Why scikit-learn as well as Spark MLlib?
  - Scheduling with Apache Airflow (Incubating)
    - Installing Airflow
    - Preparing a script for use with Airflow
      - Conditionally initializing PySpark
      - Parameterizing scripts on the command line
    - Creating an Airflow DAG in Python
    - Complete scripts for Airflow
    - Testing a task in Airflow
    - Running a DAG in Airflow
    - Backfilling data in Airflow
    - The power of Airflow
  - Reflecting on Our Workflow
  - Lightweight Web Applications
    - Python and Flask
      - Flask echo microservice
      - Python and Mongo with pymongo
      - Displaying executives in Flask
  - Presenting Our Data
    - Booting Bootstrap
    - Visualizing data with D3.js
- Conclusion
3. Data
- Air Travel Data
  - Flight On-Time Performance Data
  - OpenFlights Database
- Weather Data
- Data Processing in Agile Data Science
  - Structured Versus Semistructured Data
- SQL Versus NoSQL
  - SQL
  - NoSQL and Dataflow Programming
  - Spark: SQL + NoSQL
  - Schemas in NoSQL
  - Data Serialization
  - Extracting and Exposing Features in Evolving Schemas
- Conclusion
II. Climbing the Pyramid
4. Collecting and Displaying Records
- Putting It All Together
- Collecting and Serializing Flight Data
- Processing and Publishing Flight Records
  - Publishing Flight Records to MongoDB
- Presenting Flight Records in a Browser
  - Serving Flights with Flask and pymongo
  - Rendering HTML5 with Jinja2
- Agile Checkpoint
- Listing Flights
  - Listing Flights with MongoDB
  - Paginating Data
    - Reinventing the wheel?
    - Serving paginated data
    - Prototyping back from HTML
- Searching for Flights
  - Creating Our Index
  - Publishing Flights to Elasticsearch
  - Searching Flights on the Web
- Conclusion
5. Visualizing Data with Charts and Tables
- Chart Quality: Iteration Is Essential
- Scaling a Database in the Publish/Decorate Model
  - First Order Form
  - Second Order Form
  - Third Order Form
  - Choosing a Form
- Exploring Seasonality
  - Querying and Presenting Flight Volume
    - Iterating on our first chart
- Extracting Metal (Airplanes [Entities])
  - Extracting Tail Numbers
    - Data processing: batch or realtime?
    - Grouping and sorting data in Spark
    - Publishing airplanes with Mongo
    - Serving airplanes with Flask
    - Ensuring database performance with indexes
    - Linking back in to our new entity
    - Information architecture
  - Assessing Our Airplanes
- Data Enrichment
  - Reverse Engineering a Web Form
  - Gathering Tail Numbers
  - Automating Form Submission
  - Extracting Data from HTML
  - Evaluating Enriched Data
- Conclusion
6. Exploring Data with Reports
- Extracting Airlines (Entities)
  - Defining Airlines as Groups of Airplanes Using PySpark
  - Querying Airline Data in Mongo
  - Building an Airline Page in Flask
  - Linking Back to Our Airline Page
  - Creating an All Airlines Home Page
- Curating Ontologies of Semi-structured Data
- Improving Airlines
  - Adding Names to Carrier Codes
  - Incorporating Wikipedia Content
  - Publishing Enriched Airlines to Mongo
  - Enriched Airlines on the Web
- Investigating Airplanes (Entities)
  - SQL Subqueries Versus Dataflow Programming
  - Dataflow Programming Without Subqueries
  - Subqueries in Spark SQL
  - Creating an Airplanes Home Page
  - Adding Search to the Airplanes Page
    - Code versus configuration
    - Configuring a search widget
    - Building an Elasticsearch query programmatically
  - Creating a Manufacturers Bar Chart
  - Iterating on the Manufacturers Bar Chart
  - Entity Resolution: Another Chart Iteration
    - Entity resolution in 30 seconds
    - Resolving manufacturers in PySpark
    - Updating our chart
    - Boeing versus Airbus revisited
    - Cleanliness: Benefits of entity resolution
- Conclusion
7. Making Predictions
- The Role of Predictions
- Predict What?
- Introduction to Predictive Analytics
  - Making Predictions
    - Features
    - Regression
    - Classification
- Exploring Flight Delays
- Extracting Features with PySpark
- Building a Regression with scikit-learn
  - Loading Our Data
  - Sampling Our Data
  - Vectorizing Our Results
  - Preparing Our Training Data
  - Vectorizing Our Features
  - Sparse Versus Dense Matrices
  - Preparing an Experiment
  - Training Our Model
  - Testing Our Model
  - Conclusion
- Building a Classifier with Spark MLlib
  - Loading Our Training Data with a Specified Schema
  - Addressing Nulls
  - Replacing FlightNum with Route
  - Bucketizing a Continuous Variable for Classification
    - Determining arrival delay buckets
      - Iterative visualization with histograms
      - Bucket quest conclusion
    - Bucketizing with a DataFrame UDF
    - Bucketizing with pyspark.ml.feature.Bucketizer
  - Feature Vectorization with pyspark.ml.feature
    - Vectorizing categorical columns with Spark ML
    - Vectorizing continuous variables and indexes with Spark ML
  - Classification with Spark ML
    - Test/train split with DataFrames
    - Creating and fitting a model
    - Evaluating a model
    - Conclusion
- Conclusion
8. Deploying Predictive Systems
- Deploying a scikit-learn Application as a Web Service
  - Saving and Loading scikit-learn Models
    - Saving and loading objects using pickle
    - Saving and loading models using sklearn.externals.joblib
  - Groundwork for Serving Predictions
  - Creating Our Flight Delay Regression API
    - Filling in the predict_utils API
  - Testing Our API
  - Pulling Our API into Our Product
- Deploying Spark ML Applications in Batch with Airflow
  - Gathering Training Data in Production
  - Training, Storing, and Loading Spark ML Models
  - Creating Prediction Requests in Mongo
    - Feeding Mongo recommendation tasks from a Flask API
    - A frontend for generating prediction requests
    - Making a prediction request
  - Fetching Prediction Requests from MongoDB
  - Making Predictions in a Batch with Spark ML
    - Loading Spark ML models in PySpark
    - Making predictions with Spark ML
  - Storing Predictions in MongoDB
  - Displaying Batch Prediction Results in Our Web Application
  - Automating Our Workflow with Apache Airflow (Incubating)
    - Setting up Airflow
    - Creating a DAG for creating our model
    - Creating a DAG for operating our model
    - Using Airflow to manage and execute DAGs and tasks
      - Linking our Airflow script to the Airflow DAGs directory
      - Executing our Airflow setup script
      - Querying Airflow from the command line
      - Testing tasks in Airflow
      - Testing DAGs in Airflow
      - Monitoring tasks in the Airflow web interface
  - Conclusion
- Deploying Spark ML via Spark Streaming
  - Gathering Training Data in Production
  - Training, Storing, and Loading Spark ML Models
  - Sending Prediction Requests to Kafka
    - Setting up Kafka
      - Start Zookeeper
      - Start the Kafka server
      - Create a topic
      - Verify our new prediction request topic
    - Feeding Kafka recommendation tasks from a Flask API
    - A frontend for generating prediction requests
      - Polling requests and LinkedIn InMaps
      - A controller for the page
      - An API controller for serving prediction responses
      - Creating a template with a polling form
    - Making a prediction request
  - Making Predictions in Spark Streaming
  - Testing the Entire System
    - Overall system summary
    - Rubber meets road
    - Paydirt!
- Conclusion
9. Improving Predictions
- Fixing Our Prediction Problem
- When to Improve Predictions
- Improving Prediction Performance
  - Experimental Adhesion Method: See What Sticks
  - Establishing Rigorous Metrics for Experiments
    - Defining our classification metrics
    - Feature importance
    - Implementing a more rigorous experiment
    - Comparing experiments to determine improvements
    - Inspecting changes in feature importance
    - Conclusion
  - Time of Day as a Feature
- Incorporating Airplane Data
  - Extracting Airplane Features
  - Incorporating Airplane Features into Our Classifier Model
- Incorporating Flight Time
- Conclusion
A. Manual Installation
- Installing Hadoop
- Installing Spark
- Installing MongoDB
- Installing the MongoDB Java Driver
- Installing mongo-hadoop
  - Building mongo-hadoop
  - Installing pymongo_spark
- Installing Elasticsearch
- Installing Elasticsearch for Hadoop
- Setting Up Our Spark Environment
- Installing Kafka
- Installing scikit-learn
- Installing Zeppelin
Index