Foundations for Architecting Data Solutions. Managing Successful Data Projects

Authors: Ted Malaska, Jonathan Seidman
ISBN: 978-14-920-3869-6
Pages: 190, Format: ebook
Publication date: 2018-08-29
Bookstore: Helion

Tags: Databases

While many companies ponder implementation details such as distributed processing engines and algorithms for data analysis, this practical book takes a much wider view of big data development, starting with initial planning and moving diligently toward execution. Authors Ted Malaska and Jonathan Seidman guide you through the major components necessary to start, architect, and develop successful big data projects.

Everyone from CIOs and COOs to lead architects and developers will explore a variety of big data architectures and applications, from massive data pipelines to web-scale applications. Each chapter addresses a piece of the software development life cycle and identifies patterns to maximize long-term success throughout the life of your project.

- Start the planning process by considering the key data project types
- Use guidelines to evaluate and select data management solutions
- Reduce risk related to technology, your team, and vague requirements
- Explore system interface design using APIs, REST, and pub/sub systems
- Choose the right distributed storage system for your big data system
- Plan and implement metadata collections for your data architecture
- Use data pipelines to ensure data integrity from source to final storage
- Evaluate the attributes of various engines for processing the data you collect

Table of Contents

Preface
- Who This Book Is For
- Navigating This Book
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
1. Key Data Project Types and Considerations
- Major Data Project Types
- Data Pipelines and Data Staging
  - Primary Considerations and Risk Management
    - Source data consumption
      - Embedded code
      - Agents
      - Interfaces
    - Risk management for data consumption
      - Version management
      - Impacts from source failures
      - Protection from sources that behave poorly
    - Data delivery guarantees
    - Data management and governance
      - Data model management
      - Regulatory concerns
    - Latency and delivery confirmation
      - Latency
      - Delivery confirmation
    - Risk management for data delivery
    - Access patterns
      - Batch jobs with large scans
      - Streaming jobs with large scans
      - Point requests
      - Searchable access
    - Risk management for access patterns
  - Pipeline and Staging Team Makeup
- Data Processing and Analysis
  - Primary Considerations and Risk Management
    - Defining the problems to be solved
    - Risk management for problem definition
    - Implementing and operationalizing solutions
      - Building a robust solution
      - Operationalizing solutions
  - Data Processing and Analytics Team Makeup
- Application Development
  - Primary Considerations and Risk Management
    - Latency and throughput
      - Race conditions
      - Asynchronous versus synchronous operations
      - Performance consistency
    - Risk management for latency
      - State locality
      - Client
      - Server
      - Datacenter
      - Multidatacenter
    - Risk management for locality
    - Availability
    - Risk management for availability
  - Application Development Team Makeup
- Summary
2. Evaluating and Selecting Data Management Solutions
- Stages of Open Source Projects
  - Private Incubation Stage
  - Release Stage
  - Curing Cancer Stage
  - Broken Promises Stage
  - Hardening Stage
  - Enterprise Stage
  - Decline and Slow Death Stage
- Common Life Cycles for Open Source Projects
  - Open Sourcing a Dead Product
  - The Follower
- Evaluating Benchmarks
- Considerations for Technology Selection
  - Understanding the Building Blocks
  - Looking to a Guide for Advice
  - Using Analysts
  - Looking to Market Trends
- Summary
3. Managing Risk in Data Projects
- Categories of Risk
  - Technology Risk
  - Team Risk
  - Requirements Risk
- Managing Risk
  - Categorizing Risk in Your Architecture
  - Technology Risk
  - Strength of the Team
  - Other Teams
  - Requirements Risk
  - Tying This All Together
    - Assigning risk weightings
    - Minimizing risk
- Using Prototypes and Proofs of Concept
  - Build Two to Three Ways
  - Build PoCs and Then Throw Them Away
  - Deployment Considerations
- Using Interfaces
- Start Building Early
- Test Often and Keep Records
- Monitoring and Alerting
- Communicating Risk
  - Collaborate and Gain Buy-In
  - Share the Risk
- Using Risk as a Negotiation Tool
- Summary
4. Interface Design
- The Human Body
  - The Human Body Versus a Data Architecture
    - Peripheral nervous system
    - Central nervous system
    - Senses
    - Controllable systems
    - Human parts summary
  - Decoupling
  - Decoupling Considerations
  - Specialization
- What Makes a Good Interface Design
  - The Contract
  - The Abstraction
    - Nonprogramming language interface
    - Code interface implementations
  - Versioning
  - Being Defensive
  - Documentation and Naming for Interfaces
- Nonfunctional Considerations
  - Availability
  - Response-Time Guarantees
  - Load Capacity
  - Using Testing to Determine SLAs
- Common Interface Examples
  - Publish-Subscribe
    - Enterprise Service Bus
  - Request-Response Asynchronous Example
  - Request-Response Synchronous Example
- Summary
5. Distributed Storage Systems
- Attributes of Distributed Storage Systems
  - Storage System Genealogy
  - Partitioning
    - Centralized partitioning
    - Range partitioning
    - Hash partitioning
  - Mutation Options
    - Append only
    - File versus record
    - Record size
    - Mutation latency
  - Read Paths
    - Indexing
    - Row-based versus columnar storage
    - Partitioning
  - Availability Versus Consistency
    - The CAP theorem
    - Choosing availability: eventual consistency
    - Choosing consistency: strongly consistent
  - Primary Use Cases
    - Large scans
    - Random access to data
    - Cubing
    - Time series
    - High mutability
- Storage System Breakdown
  - HDFS
    - Genealogy
    - Partitioning
    - Mutation options
    - Optimal read path
    - Primary use case
  - S3 and Object Stores
    - Genealogy
    - Partitioning
    - Mutation options
    - Optimal read path
    - Primary use case
  - Apache HBase
    - Genealogy
    - Partitioning
    - Mutation options
    - Optimized read path
    - Availability versus consistency
    - Primary use case
  - Apache Cassandra
    - Genealogy
    - Partitioning
    - Mutation options
    - Optimized read path
    - Primary use case
  - Elasticsearch and Apache Solr
    - Genealogy
    - Partitioning
    - Mutation options
    - Optimized read path
    - Primary use case
  - Newcomers: Apache Kudu and CockroachDB
    - Kudu
    - CockroachDB
  - In-Memory Storage Systems
    - Druid.io
    - Redis
- Summary
6. The Meta of Enterprise Data
- Reasons to Care About Metadata
  - Visibility
  - Relationships
  - Regulation
    - Regulatory responses
      - Right to personal information
      - Right to be forgotten
      - Restriction on applications of data
      - Exposure impact assessment
- Types of Metadata in a Data Architecture
  - Data at Rest
  - Data in Motion
    - Batch delivery
    - Streaming or microbatching
    - Application operations
    - Post transformation
    - Metadata to capture for data in motion
      - Paths
      - Sources
      - Data movement
      - Transformations
      - Destinations
  - Metadata for Source Data
  - Metadata About Data Processing
  - Reports and Dashboards
- Metadata Collection
  - Declarative Metadata Collection
    - What metadata should you declare?
  - Discovery of Metadata
    - How to handle the undocumented
- Metadata Management in Practice
- Summary
7. Ensuring Data Integrity
- Examples of Building Data Pipelines to Ensure Data Integrity
  - Predefined Data Pipelines
    - Batch pipeline
    - Streaming pipeline
- Validation of Data Pipelines
  - Row Counts
  - Distinct Count
  - Full-Byte Comparison
  - Checksum Comparison
- Summary
8. Data Processing
- Attributes of Processing Engines
  - DAG Management
    - External DAG management
    - Internal DAG management
  - Compute Isolation
    - Node-level isolation
    - Container-level isolation
    - Task-level isolation
    - Hidden isolation
    - Isolation considerations
  - Performance
  - Fault Tolerance
    - No fault tolerance
    - Executor recovery
    - Full job recovery
  - Interaction Model
  - Batch and/or Streaming
- Data Processing over Time
- Summary
Index