Big Data for Chimps. A Guide to Massive-Scale Data Processing in Practice - Helion
ISBN: 978-14-919-2390-0
Pages: 220, Format: ebook
Publication date: 2015-09-28
Bookstore: Helion
Book price: 126.65 zł (previously: 147.27 zł)
You save: 14% (-20.62 zł)
Finding patterns in massive event streams can be difficult, but learning how to find them doesn’t have to be. This unique hands-on guide shows you how to solve this and many other problems in large-scale data processing with simple, fun, and elegant tools that leverage Apache Hadoop. You’ll gain a practical, actionable view of big data by working with real data and real problems.
Perfect for beginners, this book’s approach will also appeal to experienced practitioners who want to brush up on their skills. Part I explains how Hadoop and MapReduce work, while Part II covers many analytic patterns you can use to process any data. As you work through several exercises, you’ll also learn how to use Apache Pig to process data.
- Learn the necessary mechanics of working with Hadoop, including how data and computation move around the cluster
- Dive into map/reduce mechanics and build your first map/reduce job in Python
- Understand how to run chains of map/reduce jobs in the form of Pig scripts
- Use a real-world dataset—baseball performance statistics—throughout the book
- Work with examples of several analytic patterns, and learn when and where you might use them
Table of Contents
- Preface
- What This Book Covers
- Who This Book Is For
- Who This Book Is Not For
- What This Book Does Not Cover
- Theory: Chimpanzee and Elephant
- Practice: Hadoop
- Example Code
- A Note on Python and MrJob
- Helpful Reading
- Feedback
- Conventions Used in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- I. Introduction: Theory and Tools
- 1. Hadoop Basics
- Chimpanzee and Elephant Start a Business
- Map-Only Jobs: Process Records Individually
- Pig Latin Map-Only Job
- Setting Up a Docker Hadoop Cluster
- Run the Job
- Wrapping Up
- 2. MapReduce
- Chimpanzee and Elephant Save Christmas
- Trouble in Toyland
- Chimpanzees Process Letters into Labeled Toy Forms
- Pygmy Elephants Carry Each Toy Form to the Appropriate Workbench
- Example: Reindeer Games
- UFO Data
- Group the UFO Sightings by Reporting Delay
- Mapper
- Reducer
- Plot the Data
- Reindeer Conclusion
- Hadoop Versus Traditional Databases
- The MapReduce Haiku
- Map Phase, in Light Detail
- Group-Sort Phase, in Light Detail
- Reduce Phase, in Light Detail
- Wrapping Up
- 3. A Quick Look into Baseball
- The Data
- Acronyms and Terminology
- The Rules and Goals
- Performance Metrics
- Wrapping Up
- 4. Introduction to Pig
- Pig Helps Hadoop Work with Tables, Not Records
- Wikipedia Visitor Counts
- Fundamental Data Operations
- Control Operations
- Pipelinable Operations
- Structural Operations
- LOAD Locates and Describes Your Data
- Simple Types
- Complex Type 1, Tuples: Fixed-Length Sequence of Typed Fields
- Complex Type 2, Bags: Unbounded Collection of Tuples
- Defining the Schema of a Transformed Record
- STORE Writes Data to Disk
- Development Aid Commands
- DESCRIBE
- DUMP
- SAMPLE
- ILLUSTRATE
- EXPLAIN
- Pig Functions
- Piggybank
- Apache DataFu
- Wrapping Up
- II. Tactics: Analytic Patterns
- 5. Map-Only Operations
- Pattern in Use
- Eliminating Data
- Selecting Records That Satisfy a Condition: FILTER and Friends
- Selecting Records That Satisfy Multiple Conditions
- Selecting or Rejecting Records with a null Value
- Selecting Records That Match a Regular Expression (MATCHES)
- Pattern in use
- Matching Records Against a Fixed List of Lookup Values
- Pattern in use
- Project Only Chosen Columns by Name
- Using a FOREACH to Select, Rename, and Reorder fields
- Pattern in use
- Extracting a Random Sample of Records
- Pattern in use
- Extracting a Consistent Sample of Records by Key
- Pattern in use
- Sampling Carelessly by Only Loading Some part- Files
- Selecting a Fixed Number of Records with LIMIT
- Other Data Elimination Patterns
- Transforming Records
- Transforming Records Individually Using FOREACH
- A Nested FOREACH Allows Intermediate Expressions
- Formatting a String According to a Template
- Assembling Literals with Complex Types
- Parsing a date
- Assembling a bag
- Manipulating the Type of a Field
- Ints and Floats and Rounding, Oh My!
- Calling a User-Defined Function from an External Package
- Operations That Break One Table into Many
- Directing Data Conditionally into Multiple Dataflows (SPLIT)
- Demonstration in Pig
- Operations That Treat the Union of Several Tables as One
- Treating Several Pig Relation Tables as a Single Table (Stacking Rowsets)
- Wrapping Up
- 6. Grouping Operations
- Grouping Records into a Bag by Key
- Pattern in Use
- Counting Occurrences of a Key
- Pattern in use
- Representing a Collection of Values with a Delimited String
- Pattern in use
- Representing a Complex Data Structure with a Delimited String
- Pattern in use
- Representing a Complex Data Structure with a JSON-Encoded String
- Pattern in use
- Does God hate Cleveland?
- Group and Aggregate
- Aggregating Statistics of a Group
- Pattern in use
- Completely Summarizing a Field
- Pattern in use
- Summarizing Aggregate Statistics of a Full Table
- Pattern in use
- Summarizing a String Field
- Pattern in use
- Calculating the Distribution of Numeric Values with a Histogram
- Pattern in Use
- Binning Data for a Histogram
- Histogram of career games played
- Choosing a Bin Size
- Bin size too large
- Bin size too small
- Bin size just right
- Interpreting Histograms and Quantiles
- Games played: linear
- Games played: log-log plot
- Binning Data into Exponentially Sized Buckets
- Pattern in use
- Creating Pig Macros for Common Stanzas
- Distribution of Games Played
- Extreme Populations and Confounding Factors
- Distribution of birth and death day of year
- Baseball player deaths
- Baseball player births
- Don't Trust Distributions at the Tails
- Calculating a Relative Distribution Histogram
- Pattern in use
- Reinjecting Global Values
- Calculating a Histogram Within a Group
- Pattern in use
- Dumping Readable Results
- Pattern in use
- The Summing Trick
- Counting Conditional Subsets of a Group: The Summing Trick
- Summarizing Multiple Subsets of a Group Simultaneously
- Pattern in use
- Testing for Absence of a Value Within a Group
- Pattern in use
- Wrapping Up
- References
- 7. Joining Tables
- Matching Records Between Tables (Inner Join)
- Joining Records in a Table with Directly Matching Records from Another Table (Direct Inner Join)
- Disambiguating field names with ::
- Body type versus slugging average
- How a Join Works
- A Join Is a COGROUP+FLATTEN
- A Join Is a MapReduce Job with a Secondary Sort on the Table Name
- Pattern in use
- Handling nulls and Nonmatches in Joins and Groups
- Pattern in use: inner join
- Enumerating a Many-to-Many Relationship
- Joining a Table with Itself (Self-Join)
- Joining Records Without Discarding Nonmatches (Outer Join)
- Pattern in Use
- Joining Tables That Do Not Have a Foreign-Key Relationship
- Pattern in use
- Joining on an Integer Table to Fill Holes in a List
- Pattern in use
- Selecting Only Records That Lack a Match in Another Table (Anti-Join)
- Selecting Only Records That Possess a Match in Another Table (Semi-Join)
- An Alternative to Anti-Join: Using a COGROUP
- Wrapping Up
- 8. Ordering Operations
- Preparing Career Epochs
- Sorting All Records in Total Order
- Sorting by Multiple Fields
- Sorting on an Expression (You Can't)
- Sorting Case-Insensitive Strings
- Dealing with nulls When Sorting
- Floating Values to the Top or Bottom of the Sort Order
- Pattern in use
- Sorting Records Within a Group
- Pattern in Use
- Selecting Rows with the Top-K Values for a Field
- Top K Within a Group
- Numbering Records in Rank Order
- Finding Records Associated with Maximum Values
- Shuffling a Set of Records
- Wrapping Up
- 9. Duplicate and Unique Records
- Handling Duplicates
- Eliminating Duplicate Records from a Table
- Eliminating Duplicate Records from a Group
- Eliminating All But One Duplicate Based on a Key
- Selecting Records with Unique (or with Duplicate) Values for a Key
- Set Operations
- Set Operations on Full Tables
- Distinct Union
- Distinct Union (Alternative Method)
- Set Intersection
- Set Difference
- Symmetric Set Difference: (A-B)+(B-A)
- Set Equality
- Set Operations Within Groups
- Constructing a Sequence of Sets
- Set Operations Within a Group
- Wrapping Up
- Index