Architecting Modern Data Platforms. A Guide to Enterprise Hadoop at Scale

Authors: Jan Kunigk, Ian Buss, Paul Wilkinson
ISBN: 978-14-919-6922-9
Pages: 636, Format: ebook
Publication date: 2018-12-05
Bookstore: Helion

Book price: 271,15 zł (previously: 319,00 zł)
You save: 15% (-47,85 zł)

There’s a lot of information about big data technologies, but splicing these technologies into an end-to-end enterprise data platform is a daunting task not widely covered. With this practical book, you’ll learn how to build big data infrastructure both on-premises and in the cloud and successfully architect a modern data platform.

Ideal for enterprise architects, IT managers, application architects, and data engineers, this book shows you how to overcome the many challenges that emerge during Hadoop projects. You’ll explore the vast landscape of tools available in the Hadoop and big data realm in a thorough technical primer before diving into:

- Infrastructure: Look at all component layers in a modern data platform, from the server to the data center, to establish a solid foundation for data in your enterprise
- Platform: Understand aspects of deployment, operation, security, high availability, and disaster recovery, along with everything you need to know to integrate your platform with the rest of your enterprise IT
- Taking Hadoop to the cloud: Learn the important architectural aspects of running a big data platform in the cloud while maintaining enterprise security and high availability

Table of Contents

Foreword
Preface
- Some Misconceptions
- Some General Trends
  - Horizontal Scaling
  - Adoption of Open Source
  - Embracing Cloud Compute
  - Decoupled Compute and Storage
- What Is This Book About?
- Who Should Read This Book?
- The Road Ahead
- Conventions Used in This Book
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
1. Big Data Technology Primer
- A Tour of the Landscape
  - Core Components
    - HDFS
    - YARN
    - Apache ZooKeeper
    - Apache Hive Metastore
    - Going deeper
  - Computational Frameworks
    - Hadoop MapReduce
    - Apache Spark
      - Going deeper
  - Analytical SQL Engines
    - Apache Hive
      - Going deeper
    - Apache Impala
      - Going deeper
      - Also consider
  - Storage Engines
    - Apache HBase
      - Going deeper
      - Also consider
    - Apache Kudu
      - Going deeper
    - Apache Solr
      - Going deeper
      - Also consider
    - Apache Kafka
      - Going deeper
  - Ingestion
  - Orchestration
    - Apache Oozie
    - Also consider
- Summary
I. Infrastructure
2. Clusters
- Reasons for Multiple Clusters
  - Multiple Clusters for Resiliency
    - Sizing resilient clusters
  - Multiple Clusters for Software Development
    - Variation in cluster sizing
  - Multiple Clusters for Workload Isolation
    - Sizing multiple clusters for workload isolation
  - Multiple Clusters for Legal Separation
  - Multiple Clusters and Independent Storage and Compute
- Multitenancy
  - Requirements for Multitenancy
- Sizing Clusters
  - Sizing by Storage
    - Sizing HDFS by storage
    - Sizing Kafka by storage
    - Sizing Kudu by storage
  - Sizing by Ingest Rate
  - Sizing by Workload
- Cluster Growth
  - The Drivers of Cluster Growth
  - Implementing Cluster Growth
- Data Replication
  - Replication for Software Development
  - Replication and Workload Isolation
- Summary
3. Compute and Storage
- Computer Architecture for Hadoop
  - Commodity Servers
  - Server CPUs and RAM
    - The role of the x86 architecture
    - Threads and cores in Hadoop
  - Nonuniform Memory Access
    - Why is NUMA important for big data?
  - CPU Specifications
  - RAM
- Commoditized Storage Meets the Enterprise
  - Modularity of Compute and Storage
  - Everything Is Java
  - Replication or Erasure Coding?
  - Alternatives
- Hadoop and the Linux Storage Stack
  - User Space
  - Important System Calls
  - The Linux Page Cache
  - Short-Circuit and Zero-Copy Reads
  - Filesystems
- Erasure Coding Versus Replication
  - Discussion
    - Network performance
    - Write performance
    - Locality optimization
    - Read performance
  - Guidance
- Low-Level Storage
  - Storage Controllers
    - RAID?
    - Controller cache
      - Read-ahead caching
      - Write-back caching
    - Guidelines
  - Disk Layer
    - SAS, Nearline SAS, or SATA (or SSDs)?
    - Disk sizes
    - Disk cache
- Server Form Factors
  - Form Factor Comparison
  - Guidance
- Workload Profiles
- Cluster Configurations and Node Types
  - Master Nodes
  - Worker Nodes
  - Utility Nodes
  - Edge Nodes
  - Small Cluster Configurations
  - Medium Cluster Configurations
  - Large Cluster Configurations
- Summary
4. Networking
- How Services Use a Network
  - Remote Procedure Calls (RPCs)
    - Implementations and architectures
    - Platform services and their RPCs
    - Process control
    - Latency
      - Latency and cluster services
  - Data Transfers
    - Replication
    - Shuffles
  - Monitoring
  - Backup
  - Consensus
- Network Architectures
  - Small Cluster Architectures
    - Single switch
      - Implementation
  - Medium Cluster Architectures
    - Stacked networks
      - Resiliency
      - Performance
      - Determining oversubscription in stacked networks
      - Stacked network cabling considerations
      - Implementation
    - Fat-tree networks
      - Scalability
      - Resiliency
      - Implementation
  - Large Cluster Architectures
    - Modular switches
    - Spine-leaf networks
      - Scalability
      - Resilient spine-leaf networks
      - Implementation
- Network Integration
  - Reusing an Existing Network
  - Creating an Additional Network
    - Edge-connected networks
- Network Design Considerations
  - Layer 1 Recommendations
  - Layer 2 Recommendations
  - Layer 3 Recommendations
- Summary
5. Organizational Challenges
- Who Runs It?
- Is It Infrastructure, Middleware, or an Application?
- Case Study: A Typical Business Intelligence Project
  - The Traditional Approach
  - Typical Team Setup
    - Architect
    - Analyst
    - Software developer
    - Administrator
    - Systems engineer
  - Compartmentalization of IT
  - Revised Team Setup for Hadoop in the Enterprise
    - Big data architect
    - Data scientist
    - Big data engineer
  - Solution Overview with Hadoop
  - New Team Setup
  - Split Responsibilities
  - Do I Need DevOps?
  - Do I Need a Center of Excellence/Competence?
- Summary
6. Datacenter Considerations
- Why Does It Matter?
- Basic Datacenter Concepts
  - Cooling
  - Power
  - Network
  - Rack Awareness and Rack Failures
  - Failure Domain Alignment
- Space and Racking Constraints
- Ingest and Intercluster Connectivity
  - Software
  - Hardware
- Replacements and Repair
  - Operational Procedures
- Typical Pitfalls
  - Networking
  - Cluster Spanning
    - Nonstandard use of rack awareness
    - Bandwidth impairment
    - Quorum spanning with two datacenters
    - Quorum spanning with three datacenters
    - Alternative solutions
- Summary
II. Platform
7. Provisioning Clusters
- Operating Systems
  - OS Choices
  - OS Configuration for Hadoop
  - Automated Configuration Example
- Service Databases
  - Required Databases
  - Database Integration Options
  - Database Considerations
- Hadoop Deployment
  - Hadoop Distributions
  - Installation Choices
  - Distribution Architecture
  - Installation Process
- Summary
8. Platform Validation
- Testing Methodology
- Useful Tools
- Hardware Validation
  - CPU
    - Validation approaches
  - Disks
    - Sequential I/O performance
    - Disk health
  - Network
    - Measuring latency
    - Latency under load
    - Measuring throughput
    - Throughput under load
- Hadoop Validation
  - HDFS Validation
    - Single writes and reads
    - Distributed writes and reads
  - General Validation
    - TeraGen
      - Disk-only tests
      - Disk and network tests
    - TeraSort
- Validating Other Components
  - Operations Validation
- Summary
9. Security
- In-Flight Encryption
  - TLS Encryption
    - TLS and Java
    - TLS and non-Java processes
    - X.509
  - SASL Quality of Protection
  - Enabling in-Flight Encryption
- Authentication
  - Kerberos
    - Principals
    - Accessing services
    - Keytabs
    - Kerberos over HTTP
    - Cross-realm trusts
  - LDAP Authentication
  - Delegation Tokens
  - Impersonation
- Authorization
  - Group Resolution
  - Superusers and Supergroups
    - Restricting superusers
    - Supergroups
  - Hadoop Service Level Authorization
  - Centralized Security Management
  - HDFS
  - YARN
  - ZooKeeper
  - Hive
  - Impala
  - HBase
  - Solr
  - Kudu
  - Oozie
  - Hue
  - Kafka
  - Sentry
- At-Rest Encryption
  - Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server
  - HDFS Transparent Data Encryption
    - Encrypting and decrypting files in encryption zones
    - Authorizing key operations
    - KMS implementations
  - Encrypting Temporary Files
- Summary
10. Integration with Identity Management Providers
- Integration Areas
- Integration Scenarios
  - Scenario 1: Writing a File to HDFS
  - Scenario 2: Submitting a Hive Query
  - Scenario 3: Running a Spark Job
- Integration Providers
- LDAP Integration
  - Background
  - LDAP Security
  - Load Balancing
  - Application Integration
  - Linux Integration
    - SSSD
- Kerberos Integration
  - Kerberos Clients
  - KDC Integration
    - Setting up cross-realm trusts
      - One-way trust between MIT KDC and AD
      - One-way trusts between MIT KDCs
    - Local cluster KDC
    - Local cluster KDC and corporate user KDC
    - Corporate KDC
- Certificate Management
  - Signing Certificates
  - Converting Certificates
  - Wildcard Certificates
  - Automation
- Summary
11. Accessing and Interacting with Clusters
- Access Mechanisms
  - Programmatic Access
  - Command-Line Access
  - Web UIs
- Access Topologies
  - Interaction Patterns
  - Proxy Access
    - HTTP proxies
    - SOCKS proxies
    - Service proxies
  - Load Balancing
  - Edge Node Interactions
    - HDFS
    - YARN
    - MapReduce
    - Spark
    - Hive
    - Impala
    - HBase
    - Solr
    - Oozie
    - Kudu
- Access Security
  - Administration Gateways
- Workbenches
  - Hue
  - Notebooks
- Landing Zones
- Summary
12. High Availability
- High Availability Defined
  - Lateral/Service HA
  - Vertical/Systemic HA
- Measuring Availability
  - Percentages
  - Percentiles
- Operating for HA
  - Monitoring
  - Playbooks and Postmortems
- HA Building Blocks
  - Quorums
  - Load Balancing
    - DNS round robin
    - Virtual IP
    - Dedicated load balancers
      - Session persistence
      - Hardware versus software
    - Security considerations
  - Database HA
    - Clustering
    - Replication
    - Supported databases
  - Ancillary Services
    - Essentials
    - Identity management providers
- General Considerations
  - Separation of Master and Worker Processes
  - Separation of Identical Service Roles
  - Master Servers in Separate Failure Domains
  - Balanced Master Configurations
  - Optimized Server Configurations
- High Availability of Cluster Services
  - ZooKeeper
    - Failover
    - Deployment considerations
  - HDFS
    - HA configurations
    - Manual failover
    - Automatic failover
    - Quorum Journal Manager mode
    - Security
    - Deployment recommendations
  - YARN
    - Manual failover
    - Automatic failover
    - Deployment recommendations
  - HBase
    - HMaster HA
    - Region replication
    - Deployment considerations
  - KMS
    - Deployment considerations
  - Hive
    - Metastore
    - HiveServer2
    - HA architecture
    - Deployment considerations
  - Impala
    - Impala daemons
    - Catalog server
    - Statestore
    - Architecting for HA
    - Deployment considerations
  - Solr
    - Deployment considerations
  - Kafka
    - Deployment considerations
  - Oozie
    - Deployment considerations
  - Hue
    - Deployment options
  - Other Services
  - Autoconfiguration
- Summary
13. Backup and Disaster Recovery
- Context
  - Many Distributed Systems
  - Policies and Objectives
  - Failure Scenarios
  - Suitable Data Sources
  - Strategies
    - Replication
    - Snapshots
    - Backups
    - Rack awareness and high availability
  - Data Types
  - Consistency
  - Validation
  - Summary
- Data Replication
  - HBase
  - Cluster Management Tools
  - Kafka
  - Summary
- Hadoop Cluster Backups
  - Databases
  - Subsystems
    - Cloudera Manager
    - Apache Ambari
    - HDFS
    - Hive Metastore
    - HBase
    - YARN
    - Oozie
    - Apache Sentry
    - Apache Ranger
    - Hue
  - Case Study: Automating Backups with Oozie
    - Introduction
    - Subflow: HDFS
    - Subflow: HBase
    - Subflow: Database
    - Backup workflow
- Restore
- Summary
III. Taking Hadoop to the Cloud
14. Basics of Virtualization for Hadoop
- Compute Virtualization
  - Virtual Machine Distribution
  - Anti-Affinity Groups
- Storage Virtualization
  - Virtualizing Local Storage
  - SANs
  - Object Storage and Network-Attached Storage
    - Network-attached storage
    - Object storage
- Network Virtualization
- Cluster Life Cycle Models
- Summary
15. Solutions for Private Clouds
- OpenStack
  - Automation and Integration
  - Life Cycle and Storage
  - Isolation
  - Summary
- OpenShift
  - Automation
  - Life Cycle and Storage
  - Isolation
  - Summary
- VMware and Pivotal Cloud Foundry
- Do It Yourself?
  - Automation
  - Isolation
  - Life Cycle Model
  - Summary
- Object Storage for Private Clouds
  - EMC Isilon
  - Ceph
    - Object storage
    - CephFS
    - Remote block storage
    - Summary
- Summary
16. Solutions in the Public Cloud
- Key Things to Know
- Cloud Providers
  - AWS
    - AWS instance types
    - AWS storage options
    - Amazon Elastic MapReduce
    - Caveats and service limits
  - Microsoft Azure
    - Azure instance types
    - Azure storage options
    - HDInsight
    - Caveats and service limits
  - Google Cloud Platform
    - Instance types
    - Storage options
    - Cloud Dataproc
    - Caveats and service limits
- Implementing Clusters
  - Instances
    - CPU-heavy instances
    - Balanced instances
    - Memory-heavy instances
    - Instances summary
  - Storage and Life Cycle Models
    - Suspendable clusters
    - One-off clusters
    - Sticky clusters
    - Storage compatibility
    - Storage and life cycle summary
  - Network Architecture
  - High Availability
    - The requirement for HA
    - Compute availability
      - Cluster availability
      - Instance availability
    - Data availability
    - Network availability
    - Service availability
      - Databases
      - Load balancers
- Summary
17. Automated Provisioning
- Long-Lived Clusters
  - Configuration and Templating
  - Deployment Phases
    - Environment configuration
    - Instance provisioning
    - Instance configuration
    - Cluster installation and configuration
    - Post-install tasks
  - Vendor Solutions
    - Cloudera Director
    - Ongoing management
  - One-Click Deployments
  - Homegrown Automation
  - Hooking Into a Provisioning Life Cycle
  - Scaling Up and Down
  - Deploying with Security
    - Integrating with a Kerberos KDC
    - TLS
- Transient Clusters
- Sharing Metadata Services
- Summary
18. Security in the Cloud
- Assessing the Risk
- Risk Model
  - Environmental Risks
    - Mitigation
  - Deployment Risks
    - Mitigation
- Identity Provider Options for Hadoop
  - Option A: Cloud-Only Self-Contained ID Services
  - Option B: Cloud-Only Shared ID Services
  - Option C: On-Premises ID Services
- Object Storage Security and Hadoop
  - Identity and Access Management
  - Amazon Simple Storage Service
    - Hadoop integration
      - Temporary security credentials
      - Persistent credentials
      - Environment variables
      - Instance roles
      - Anonymous access
    - Further information
  - GCP Cloud Storage
    - Hadoop integration
      - Service account
      - User account
      - Further information
  - Microsoft Azure
    - Disk storage
    - Blob storage
    - ADLS
    - Hadoop integration
      - Azure Blob storage
      - ADLS
    - Further information
- Auditing
- Encryption for Data at Rest
  - Requirements for Key Material
  - Options for Encryption in the Cloud
  - On-Premises Key Persistence
  - Encryption via the Cloud Provider
    - Cloud Key Management Services
    - Server-side and client-side encryption
    - BYOK
    - Encryption in AWS
    - Encryption in Microsoft Azure
    - Encryption in GCP
  - Encryption Feature and Interoperability Summary
  - Recommendations and Summary for Cloud Encryption
- Encrypting Data in Flight in the Cloud
- Perimeter Controls and Firewalling
  - GCP
    - Example implementation
  - AWS
    - Example implementation
  - Azure
    - Use case implementation
- Summary
A. Backup Onboarding Checklist
- Backup Onboarding Checklist
  - Backup
- Services
  - Cloudera Manager
  - HDFS
  - HBase
  - Hive/Impala
  - Sqoop
  - Oozie
  - Hue
  - Sentry
Index