Learning Apache Drill. Query and Analyze Distributed Data Sources with SQL - Helion

ebook

Authors: Charles Givre, Paul Rogers
ISBN: 978-14-920-3275-5
Pages: 332, Format: ebook
Publication date: 2018-11-02
Bookstore: Helion

Book price: 186,15 zł (previously: 216,45 zł)
You save: 14% (-30,30 zł)

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3, and works with Hive metastores, distributed databases such as HBase and MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster.

In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you’ll learn how Drill helps you analyze data more effectively to drive down time to insight.

- Use Drill to clean, prepare, and summarize delimited data for further analysis
- Query file types including logfiles, Parquet, JSON, and other complex formats
- Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL
- Connect to Drill programmatically using a variety of languages
- Use Drill even with challenging or ambiguous file formats
- Perform sophisticated analysis by extending Drill’s functionality with user-defined functions
- Facilitate data analysis for network security, image metadata, and machine learning

Table of Contents

Learning Apache Drill. Query and Analyze Distributed Data Sources with SQL eBook: table of contents

Preface
- Who Should Read This Book
- Why We Wrote This Book
- Navigating This Book
- Online Resources
- Conventions Used in This Book
- Using Code Examples
- O’Reilly Safari
- How to Contact Us
- Acknowledgments
- Special Thanks from Charles
- Special Thanks from Paul
1. Introduction to Apache Drill
- What Is Apache Drill?
  - Drill Is Versatile
  - Drill Is Easy to Use
    - Drill does not require you to define a schema
  - A Word About Drill’s Performance
  - A Very Brief History of Big Data
    - Hadoop
  - Drill in the Big Data Ecosystem
  - Comparing Drill with Similar Tools
2. Installing and Running Drill
- Preparing Your Machine for Drill
  - Special Configuration Instructions for Windows Installations
- Installing Drill on Windows
  - Starting Drill on a Windows Machine
- Installing Drill in Embedded Mode on macOS or Linux
  - Starting Drill on macOS or Linux in Embedded Mode
- Installing Drill in Distributed Mode on macOS or Linux
  - Preparing Your Cluster for Drill
  - Starting Drill in Distributed Mode
- Connecting to the Cluster
- Conclusion
3. Overview of Apache Drill
- The Apache Hadoop Ecosystem
  - Drill Is a Low-Latency Query Engine
  - Distributed Processing with HDFS
  - Elements of a Drill System
  - Drill Operation: The 30,000-Foot View
  - Drill Is a Query Engine, Not a Database
- Drill Operation Overview
  - Drill Components
  - SQL Session State
  - Statement Preparation
    - Parsing and semantic analysis
    - Logical and physical plans
    - Distribution
  - Statement Execution
    - Data representation
  - Low-Latency Features
    - Long-lived Drillbits
    - Code generation
    - Network exchanges
- Conclusion
4. Querying Delimited Data
- Ways of Querying Data with Drill
  - Other Interfaces
- Drill SQL Query Format
  - Choosing a Data Source
  - Defining a Workspace
  - Specifying a Default Data Source
  - Accessing Columns in a Query
  - Delimited Data with Column Headers
  - Table Functions
  - Querying Directories
    - Directory functions
- Understanding Drill Data Types
- Cleaning and Preparing Data Using String Manipulation Functions
  - Complex Data Conversion Functions
    - Reformatting numbers
- Working with Dates and Times in Drill
  - Converting Strings to Dates
  - Reformatting Dates
  - Date Arithmetic and Manipulation
  - Date and Time Functions in Drill
- Creating Views
- Data Analysis Using Drill
  - Summarizing Data with Aggregate Functions
    - Other analytic functions: Window functions
    - Comparison of aggregate and window analytic functions
- Common Problems in Querying Delimited Data
  - Spaces in Column Names
  - Illegal Characters in Column Headers
  - Reserved Words in Column Names
- Conclusion
5. Analyzing Complex and Nested Data
- Arrays and Maps
  - Arrays in Drill
  - Accessing Maps (Key-Value Pairs) in Drill
  - Querying Nested Data
    - Data types in JSON files
    - Formats of nested data
      - Querying record-oriented files
      - Using the FLATTEN() function to query split JSON files
      - Querying column-oriented JSON files with KVGEN()
- Analyzing Log Files with Drill
  - Configuring Drill to Read HTTPD Web Server Logs
  - Querying Web Server Logs
    - Analyzing user agent strings
    - Analyzing URLs and query strings
  - Other Log Analysis with Drill
- Conclusion
6. Connecting Drill to Data Sources
- Querying Multiple Data Sources
  - Configuring a New Storage Plug-in
  - Connecting Drill to a Relational Database
    - Configuring Drill to query an RDBMS
      - Microsoft SQL Server
      - MySQL
      - Oracle
      - PostgreSQL
      - SQLite
    - Querying an RDBMS from Drill
    - Other uses of the Drill JDBC storage plug-in
  - Querying Data in Hadoop from Drill
  - Connecting to and Querying HBase from Drill
    - Querying data from HBase
  - Querying Hive Data from Drill
    - Connecting Drill to Hive
      - Connecting to Hive with a remote metastore
  - Connecting to and Querying Streaming Data with Drill and Kafka
    - Querying streaming data
    - Improving the performance of Kafka queries
  - Connecting to and Querying Kudu
  - Connecting to and Querying MongoDB from Drill
  - Connecting Drill to Cloud Storage
    - Querying data on Amazon S3
      - Getting access credentials for S3
    - Querying Minio datastores from Drill
    - Connecting to other cloud storage services
  - Querying Time Series Data from Drill and OpenTSDB
    - Special considerations for time series data
- Conclusion
7. Connecting to Drill
- Understanding Drill’s Interfaces
  - JDBC and Drill
  - ODBC and Drill
    - Installing the ODBC driver
      - Configuring ODBC on Linux or macOS
      - Configuring ODBC on Windows
  - Drill’s REST Interface
- Connecting to Drill with Python
  - Using drillpy to Query Drill
  - Connecting to Drill Using pydrill
    - Other functionality of pydrill
  - Other Ways of Connecting to Drill from Python
- Connecting to Drill Using R
  - Querying Drill from R Using sergeant
    - Accessing other functionality in R
- Connecting to Drill Using Java
- Querying Drill with PHP
  - Using the Connector
  - Querying Drill from PHP
  - Interacting with Drill from PHP
- Querying Drill Using Node.js
- Using Drill as a Data Source in BI Tools
  - Exploring Data with Apache Zeppelin and Drill
    - Configuring Zeppelin to query Drill
    - Querying Drill from a Zeppelin notebook
    - Adding interactivity in Zeppelin
  - Exploring Data with Apache Superset
    - Configuring Superset to work with Drill
    - Building a demonstration visualization using Drill and Superset
- Conclusion
8. Data Engineering with Drill
- Schema-on-Read
  - The SQL Relational Model
  - Data Life Cycle: Data Exploration to Production
  - Schema Inference
- Data Source Inference
  - Storage Plug-ins
  - Storage Configurations
  - Workspaces
  - Querying Directories
  - Default Schema
- File Type Inference
  - Format Plug-ins and Format Configuration
  - Format Inference
  - File Format Variations
- Schema Inference Overview
- Distributed File Scans
  - Schema Inference for Delimited Data
    - CSV with header
    - Explicit projection
    - TypeOf functions
    - Casts to specify types
  - CSV Summary
    - CSV without a header row
    - Explicit projection
  - Schema Inference for JSON
    - JSON column names
    - JSON scalar types
  - Ambiguous Numeric Schemas
    - Mixed string and number types
    - Missing values
    - Leading null values
    - Null versus missing values in JSON output
- Aligning Schemas Across Files
- JSON Objects
  - JSON Lists in Drill
  - JSON Summary
- Using Drill with the Parquet File Format
  - Schema Evolution in Parquet
- Partitioning Data Directories
  - Defining a Table Workspace
- Working with Queries in Production
  - Capturing Schema Mapping in Views
  - Running Challenging Queries in Scripts
- Conclusion
9. Deploying Drill in Production
- Installing Drill
  - Prerequisites
  - Production Installation
    - Creating a Site Directory
  - Configuring ZooKeeper
    - Advanced ZooKeeper configuration
  - Configuring Memory
  - Configuring Logging
  - Testing the Installation
  - Distributing Drill Binaries and Configuration
    - Installing clush
    - Distributing Drill files
  - Starting the Drill Cluster
- Configuring Storage
  - Working with Apache Hadoop HDFS
    - Simple HDFS integration
    - Full HDFS integration
  - Working with Amazon S3
    - Access keys with Hadoop
    - Standalone Drill
    - Distributing the configuration
    - Defining the Amazon S3 storage configuration
    - Troubleshooting
- Admission Control
- Additional Configuration
  - User-Defined Functions and Custom Plug-ins
  - Security
  - Logging Levels
  - Controlling CPU Usage
- Monitoring
  - Monitoring the Drill Process
  - Monitoring JMX Metrics
  - Monitoring Queries
- Other Deployment Options
  - MapR Installer
  - Drill-on-YARN
  - Docker
- Conclusion
10. Setting Up Your Development Environment
- Installing Maven
- Creating the Drill Build Environment
  - Setting Up Git and Getting the Source Code
  - Building Drill from Source
- Installing the IDE
- Conclusion
11. Writing Drill User-Defined Functions
- Use Case: Finding and Filtering Valid Credit Card Numbers
- How User-Defined Functions Work in Drill
- Structure of a Simple Drill UDF
  - The pom.xml File
    - Including dependencies
  - The Function File
    - Defining input parameters
    - Setting the output value
    - Accessing data in holder objects
  - The Simple Function API
  - Putting It All Together
- Building and Installing Your UDF
  - Statically Installing a UDF
  - Dynamically Installing a UDF
- Complex Functions: UDFs That Return Maps or Arrays
  - Example: Extracting User Agent Metadata
  - The ComplexWriter
- Writing Aggregate User-Defined Functions
  - The Aggregate Function API
  - Example Aggregate UDF: Kendall’s Rank Correlation Coefficient
- Conclusion
12. Writing a Format Plug-in
- The Example Regex Format Plug-in
- Creating the Easy Format Plug-in
  - Creating the Maven pom.xml File
  - Creating the Plug-in Package
  - Drill Module Configuration
  - Format Plug-in Configuration
  - Cautions Before Getting Started
- Creating the Regex Plug-in Configuration Class
  - Copyright Headers and Code Format
  - Testing the Configuration
  - Fixing Configuration Problems
  - Troubleshooting
- Creating the Format Plug-in Class
  - Creating a Test File
  - Configuring RAT
  - Efficient Debugging
  - Creating the Unit Test
  - How Drill Finds Your Plug-in
- The Record Reader
  - Testing the Reader Shell
  - Logging
  - Error Handling
  - Setup
  - Regex Parsing
  - Defining Column Names
  - Projection
  - Column Projection Accounting
  - Project None
  - Project All
  - Project Some
  - Opening the File
  - Record Batches
  - Drill’s Columnar Structure
  - Defining Vectors
  - Reading Data
  - Loading Data into Vectors
  - Releasing Resources
- Testing the Reader
  - Testing the Wildcard Case
  - Testing Explicit Projection
  - Testing Empty Projection
  - Scaling Up
- Additional Details
  - File Chunks
  - Default Format Configuration
  - Next Steps
  - Production Build
  - Contributing to Drill: The Pull Request
  - Maintaining Your Branch
  - Create a Plug-In Project
- Conclusion
13. Unique Uses of Drill
- Finding Photos Taken Within a Geographic Region
- Drilling Excel Files
  - The pom.xml File
  - The Excel Custom Record Reader
  - Using the Excel Format Plug-in
- Network Packet Analysis (PCAP) with Drill
  - Examples of Queries Using PCAP Data Files
    - Automating the process using an aggregate function
- Analyzing Twitter Data with Drill
- Using Drill in a Machine Learning Pipeline
  - Making Predictions Within Drill
  - Building and Serializing a Model
  - Writing the UDF Wrapper
  - Making Predictions Using the UDF
- Conclusion
A. List of Drill Functions
- Aggregate and Window Functions
  - Window Functions
- Cryptological and Hashing Functions
- Data Conversion Functions
- Geospatial Functions
- Math and Trigonometric Functions
- Networking Functions
- Null Handling Functions
- String Manipulation Functions
- Approximate String Matching Functions
  - Phonetic Functions
  - String Distance Functions
B. Drill Formatting Strings
Index