The Enterprise Big Data Lake. Delivering the Promise of Big Data and Data Science - Helion

Author: Alex Gorelik
ISBN: 978-14-919-3150-9
Pages: 224, Format: ebook
Publication date: 2019-02-21
Bookstore: Helion

Book price: 211.65 zł (previously: 246.10 zł)
You save: 14% (-34.45 zł)

Enterprises are experimenting with using Hadoop to build Big Data Lakes, but many projects are stalling or failing because the approaches that worked at Internet companies have to be adapted for the enterprise. This practical handbook guides managers and IT professionals from the initial research and decision-making process through planning, choosing products, and implementing, maintaining, and governing the modern data lake.

You'll explore various approaches to starting and growing a Data Lake, including Data Warehouse off-loading, analytical sandboxes, and "Data Puddles." Author Alex Gorelik shows you methods for setting up different tiers of data, from raw untreated landing areas to carefully managed and summarized data. You'll learn how to enable self-service to help users find, understand, and provision data; how to provide different interfaces to users with different skill levels; and how to do all of that in compliance with enterprise data governance policies.

Table of Contents

Preface
- Who Should Read This Book?
- Conventions Used in This Book
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
1. Introduction to Data Lakes
- Data Lake Maturity
  - Data Puddles
  - Data Ponds
- Creating a Successful Data Lake
  - The Right Platform
  - The Right Data
  - The Right Interface
    - Providing data at the right level of expertise
    - Getting to the data
  - The Data Swamp
- Roadmap to Data Lake Success
  - Standing Up a Data Lake
  - Organizing the Data Lake
  - Setting Up the Data Lake for Self-Service
    - Finding and understanding the data
    - Accessing and provisioning the data
    - Preparing the data
    - Analysis and visualization
- Data Lake Architectures
  - Data Lakes in the Public Cloud
  - Logical Data Lakes
    - Virtualization versus a catalog-based logical data lake
- Conclusion
2. Historical Perspective
- The Drive for Self-Service Data: The Birth of Databases
- The Analytics Imperative: The Birth of Data Warehousing
- The Data Warehouse Ecosystem
  - Storing and Querying the Data
    - Dimensional modeling and star schemas
    - Slowly changing dimensions
    - Massively parallel processing (MPP) systems
    - Data warehouse (DW) appliances
    - Columnar stores
    - In-memory databases
  - Loading the Data: Data Integration Tools
    - ETL
    - ETL versus ELT
    - Federation, EII, and data virtualization tools
  - Organizing and Managing the Data
    - Data quality tools
    - MDM systems
    - Data modeling tools
    - Metadata repositories
    - Data governance tools
  - Consuming the Data
    - Advanced analytics
- Conclusion
3. Introduction to Big Data and Data Science
- Hadoop Leads the Historic Shift to Big Data
  - The Hadoop File System
  - How Processing and Storage Interact in a MapReduce Job
  - Schema on Read
  - Hadoop Projects
- Data Science
- What Should Your Analytics Organization Focus On?
- Machine Learning
  - Explainability
  - Change Management
- Conclusion
4. Starting a Data Lake
- The What and Why of Hadoop
- Preventing Proliferation of Data Puddles
- Taking Advantage of Big Data
  - Leading with Data Science
  - Strategy 1: Offload Existing Functionality
  - Strategy 2: Data Lakes for New Projects
  - Strategy 3: Establish a Central Point of Governance
  - Which Way Is Right for You?
- Conclusion
5. From Data Ponds/Big Data Warehouses to Data Lakes
- Essential Functions of a Data Warehouse
  - Dimensional Modeling for Analytics
  - Integrating Data from Disparate Sources
  - Preserving History Using Slowly Changing Dimensions
  - Limitations of the Data Warehouse as a Historical Repository
- Moving to a Data Pond
  - Keeping History in a Data Pond
  - Implementing Slowly Changing Dimensions in a Data Pond
    - Denormalizing attributes to preserve state
    - Preserving state using snapshots
- Growing Data Ponds into a Data Lake: Loading Data That's Not in the Data Warehouse
  - Raw Data
  - External Data
  - Internet of Things (IoT) and Other Streaming Data
- Real-Time Data Lakes
- The Lambda Architecture
- Data Transformations
- Target Systems
  - Data Warehouses
  - Operational Data Stores
  - Real-Time Applications and Data Products
- Conclusion
6. Optimizing for Self-Service
- The Beginnings of Self-Service
- Business Analysts
  - Finding and Understanding Data: Documenting the Enterprise
  - Establishing Trust
    - Data quality
    - Lineage (provenance)
    - Stewardship
  - Provisioning
  - Preparing Data for Analysis
- Data Wrangling in the Data Lake
  - Situating Data Preparation in Hadoop
  - Common Use Cases for Data Preparation
    - Use case: Self-service automation for analytics or business applications
    - Customer example
    - Use case: Preparation for IT operationalization
    - Customer example
    - Use case: Exploratory analytics and machine learning
    - Customer example
- Analyzing and Visualizing
- The New World of Self-Service Business Intelligence
  - The New Analytic Workflow
  - Gatekeepers to Shopkeepers
  - Governing Self-Service
- Conclusion
7. Architecting the Data Lake
- Organizing the Data Lake
  - Landing or Raw Zone
  - Gold Zone
  - Work Zone
  - Sensitive Zone
    - Deidentification
- Multiple Data Lakes
  - Advantages of Keeping Data Lakes Separate
  - Advantages of Merging the Data Lakes
- Cloud Data Lakes
- Virtual Data Lakes
  - Data Federation
  - Big Data Virtualization
  - Eliminating Redundancy
- Conclusion
8. Cataloging the Data Lake
- Organizing the Data
  - Technical Metadata
    - Data profiling
    - Profiling hierarchical data
  - Business Metadata
    - Glossaries, taxonomies, and ontologies
    - Industry ontologies
    - Folksonomies
- Tagging
  - Automated Cataloging
- Logical Data Management
  - Sensitive Data Management and Access Control
    - Automated and manual vetting
  - Data Quality
    - Tag-based data quality rules
    - Annotation quality
    - Curation quality
    - Data set quality
- Relating Disparate Data
- Establishing Lineage
- Data Provisioning
- Tools for Building a Catalog
  - Tool Comparison
- The Data Ocean
- Conclusion
9. Governing Data Access
- Authorization or Access Control
- Tag-Based Data Access Policies
- Deidentifying Sensitive Data
  - Data Sovereignty and Regulatory Compliance
- Self-Service Access Management
  - Provisioning Data
- Conclusion
10. Industry-Specific Perspectives
- Big Data in Financial Services
  - Consumers, Digitization, and Data Are Changing Finance as We Know It
  - Saving the Bank
  - New Opportunities Offered by New Data
  - Key Processes in Making Use of the Data Lake
    - Data inventory and cataloging
    - Entity resolution and fuzzy matching
    - Analytics and modeling
- Value Added by Data Lakes in Financial Services
- Data Lakes in the Insurance Industry
- Smart Cities
- Big Data in Medicine
Index