Programming Pig. Dataflow Scripting with Hadoop. 2nd Edition - Helion

Authors: Alan Gates, Daniel Dai
ISBN: 978-14-919-3704-4
Pages: 368, Format: ebook
Publication date: 2016-11-09
Bookstore: Helion

For many organizations, Hadoop is the first step for dealing with massive amounts of data. The next step? Processing and analyzing datasets with the Apache Pig scripting platform. With Pig, you can batch-process data without having to create a full-fledged application, making it easy to experiment with new datasets.

Updated with use cases and programming examples, this second edition is the ideal learning tool for new and experienced users alike. You’ll find comprehensive coverage of key features such as the Pig Latin scripting language and the Grunt shell. When you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig.

- Delve into Pig’s data model, including scalar and complex data types
- Write Pig Latin scripts to sort, group, join, project, and filter your data
- Use Grunt to work with the Hadoop Distributed File System (HDFS)
- Build complex data processing pipelines with Pig’s macros and modularity features
- Embed Pig Latin in Python for iterative processing and other advanced tasks
- Use Pig with Apache Tez to build high-performance batch and interactive data processing applications
- Create your own load and store functions to handle data formats and storage mechanisms

Table of Contents

Preface
- Who Should Read This Book
- What’s New in This Edition
- Conventions Used in This Book
- Code Examples in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- Acknowledgments from the First Edition (Alan Gates)
- Second Edition Acknowledgments (Alan Gates and Daniel Dai)
1. What Is Pig?
- Pig Latin, a Parallel Data Flow Language
  - Comparing Query and Data Flow Languages
- Pig on Hadoop
  - MapReduce’s Hello World
  - How Pig Differs from MapReduce
- What Is Pig Useful For?
- The Pig Philosophy
- Pig’s History
2. Installing and Running Pig
- Downloading and Installing Pig
  - Downloading the Pig Package from Apache
  - Installation and Setup
  - Downloading Pig Artifacts from Maven
  - Downloading the Source
  - Downloading Pig from Distributions
    - Downloading Pig from Hortonworks
    - Downloading Pig from Cloudera
    - Downloading Pig from MapR
- Running Pig
  - Running Pig Locally on Your Machine
  - Running Pig on Your Hadoop Cluster
  - Running Pig in the Cloud
    - Amazon Elastic MapReduce
    - Microsoft HDInsight
    - Google Cloud Platform
  - Command-Line and Configuration Options
  - Return Codes
- Grunt
  - Entering Pig Latin Scripts in Grunt
  - HDFS Commands in Grunt
  - Controlling Pig from Grunt
  - Running External Commands
  - Others
3. Pig’s Data Model
- Types
  - Scalar Types
  - Complex Types
    - Map
    - Tuple
    - Bag
  - Nulls
- Schemas
  - Casts
4. Introduction to Pig Latin
- Preliminary Matters
  - Case Sensitivity
  - Comments
- Input and Output
  - load
  - store
  - dump
- Relational Operations
  - foreach
    - Expressions in foreach
    - UDFs in foreach
    - Generating complex data
    - Naming fields in foreach
    - CASE expressions
  - filter
  - group
  - order by
  - distinct
  - join
  - limit
  - sample
  - parallel
- User-Defined Functions
  - Registering Java UDFs
  - Registering UDFs in Scripting Languages
  - define and UDFs
  - Calling Static Java Functions
  - Calling Hive UDFs
5. Advanced Pig Latin
- Advanced Relational Operations
  - Advanced Features of foreach
    - flatten
    - Nested foreach
  - Casting a Relation to a Scalar
  - Using Different Join Implementations
    - Joining small to large data
    - Joining skewed data
    - Joining sorted data
  - cogroup
  - union
    - union onschema
  - cross
  - More on Nested foreach
  - rank
  - cube
  - assert
- Integrating Pig with Executables and Native Jobs
  - stream
  - native
- split and Nonlinear Data Flows
- Controlling Execution
  - set
  - Setting the Partitioner
- Pig Latin Preprocessor
  - Parameter Substitution
  - Macros
  - Including Other Pig Latin Scripts
6. Developing and Testing Pig Latin Scripts
- Development Tools
  - Syntax Highlighting and Checking
  - describe
  - explain
  - illustrate
  - Pig Statistics
  - Job Status
  - Debugging Tips
- Testing Your Scripts with PigUnit
7. Making Pig Fly
- Writing Your Scripts to Perform Well
  - Filter Early and Often
  - Project Early and Often
  - Set Up Your Joins Properly
  - Use Multiquery When Possible
  - Choose the Right Data Type
  - Select the Right Level of Parallelism
- Writing Your UDFs to Perform
- Tuning Pig and Hadoop for Your Job
- Using Compression in Intermediate Results
- Data Layout Optimization
- Map-Side Aggregation
- The JAR Cache
- Processing Small Jobs Locally
- Bloom Filters
- Schema Tuple Optimization
- Dealing with Failures
8. Embedding Pig
- Embedding Pig Latin in Scripting Languages
  - Compiling
  - Binding
    - Binding multiple sets of variables
  - Running
    - Running multiple bindings
  - Utility Methods
- Using the Pig Java APIs
  - PigServer
    - Instantiating PigServer
    - Setting Pig properties
    - Launching Pig jobs
    - Auxiliary methods
  - PigRunner
    - Notification
9. Writing Evaluation and Filter Functions
- Writing an Evaluation Function in Java
  - Where Your UDF Will Run
  - Evaluation Function Basics
    - Interacting with Pig values
  - Input and Output Schemas
  - Error Handling and Progress Reporting
  - Constructors and Passing Data from Frontend to Backend
    - Loading the distributed cache
    - UDFContext
  - Overloading UDFs
  - Variable-Length Input Schema
  - Memory Issues in Eval Funcs
  - Compile-Time Evaluation
  - Shipping JARs Automatically
- The Algebraic Interface
- The Accumulator Interface
- Writing Filter Functions
- Writing Evaluation Functions in Scripting Languages
  - Jython UDFs
  - JavaScript UDFs
  - JRuby UDFs
  - Groovy UDFs
  - Streaming Python UDFs
  - Comparing Scripting Language UDF Features
10. Writing Load and Store Functions
- Load Functions
  - Frontend Planning Functions
    - Determining the InputFormat
    - Determining the location
    - Getting the casting functions
  - Passing Information from the Frontend to the Backend
  - Backend Data Reading
    - Getting ready to read
    - Reading records
  - Additional Load Function Interfaces
    - Loading metadata
    - Using partitions
    - Casting bytearrays
    - Pushing down projections
    - Predicate pushdown
- Store Functions
  - Store Function Frontend Planning
    - Determining the OutputFormat
    - Setting the output location
    - Checking the schema
  - Store Functions and UDFContext
  - Writing Data
    - Preparing to write
    - Writing records
  - Failure Cleanup
  - Storing Metadata
- Shipping JARs Automatically
- Handling Bad Records
11. Pig on Tez
- What Is Tez?
- Running Pig on Tez
- Potential Differences When Running on Tez
  - UDFs
  - Using PigRunner
  - Testing and Debugging
    - Tez execution plan
    - Tez UI
    - Other changes
- Pig on Tez Internals
  - Multiple Backends in Pig
  - The Tez Optimizer
  - Operators and Implementation
    - order by
    - Skew join
    - rank
    - Merge join
  - Automatic Parallelism
    - Operator-dependent parallelism estimation
    - Deferred parallelism estimation
    - order by and skew joins
    - Dynamic parallelism
12. Pig and Other Members of the Hadoop Community
- Pig and Hive
  - HCatalog
  - WebHCat
- Cascading
- Spark
- NoSQL Databases
  - HBase
  - Accumulo
  - Cassandra
- DataFu
- Oozie
13. Use Cases and Programming Examples
- Sparse Tuples
- k-Means
- intersect and except
- Pig at Yahoo!
  - Apache Pig Use Cases at Yahoo!
  - Large-Scale ETL with Apache Pig
  - Features That Make Pig Attractive
    - Multiquery optimization
    - Macros
    - Skew joins and distributed order by
    - Nested foreach
    - Jython UDFs
    - Public availability of UDFs
    - Data formats
    - HCatalog integration
    - Scale and stability
  - Pig on Tez
  - Moving Forward
- Pig at Particle News
  - Compute Arrival Rate and Conversion Rate
  - Compute Sessions Triggered by a Push
A. Built-in User Defined Functions and PiggyBank
- Built-in UDFs
  - Built-in Load and Store Functions
  - Built-in Evaluation and Filter Functions
    - Built-in math UDFs
    - Built-in aggregate UDFs
    - Built-in chararray and bytearray UDFs
    - Built-in datetime UDFs
    - Built-in complex type UDFs
    - Built-in filter functions
    - Miscellaneous built-in UDFs
- PiggyBank
Index