Databricks Certified Data Engineer Associate Study Guide
ebook
Author: Derar Alhussein
ISBN: 9781098166793
Pages: 408, Format: ebook
Publication date: 2024-04-24
Bookstore: Helion

Price: 237,15 zł (previously: 285,72 zł)
You save: 17% (-48,57 zł)


Data engineers proficient in Databricks are currently in high demand. As organizations gather more data than ever before, skilled data engineers on platforms like Databricks become critical to business success. The Databricks Data Engineer Associate certification is proof that you have a complete understanding of the Databricks platform and its capabilities, as well as the essential skills to effectively execute various data engineering tasks on the platform.

In this comprehensive study guide, you will build a strong foundation in all topics covered on the certification exam, including the Databricks Lakehouse and its tools and benefits. You'll also learn to develop ETL pipelines in both batch and streaming modes. Moreover, you'll discover how to orchestrate data workflows and design dashboards while maintaining data governance. Finally, you'll dive into the finer points of exactly what's on the exam and learn to prepare for it with mock tests.

Author Derar Alhussein not only teaches you the fundamental concepts but also provides hands-on exercises to reinforce your understanding. From setting up your Databricks workspace to deploying production pipelines, each chapter is carefully crafted to equip you with the skills needed to master the Databricks Platform. By the end of this book, you'll have everything you need to pass the Databricks Data Engineer Associate certification exam with flying colors and start your career as a Databricks-certified data engineer!

You'll learn how to:

  • Use the Databricks Platform and Delta Lake effectively
  • Perform advanced ETL tasks using Apache Spark SQL
  • Design multi-hop architecture to process data incrementally
  • Build production pipelines using Delta Live Tables and Databricks Jobs
  • Implement data governance using Databricks SQL and Unity Catalog
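As a rough illustration of the kind of hands-on work these topics involve (a minimal sketch, not an excerpt from the book), the short PySpark snippet below ingests hypothetical raw JSON files into a table with a CTAS statement; the file path, view name, and table name are assumptions:

    from pyspark.sql import SparkSession

    # On a Databricks cluster a SparkSession named `spark` is already available;
    # this line only makes the sketch runnable outside a notebook as well.
    spark = SparkSession.builder.getOrCreate()

    # Read hypothetical raw JSON files and expose them as a temporary view.
    spark.read.json("/mnt/raw/orders/").createOrReplaceTempView("orders_raw")

    # Persist the data with CTAS; on Databricks this creates a Delta table by default.
    spark.sql("CREATE TABLE IF NOT EXISTS orders_bronze AS SELECT * FROM orders_raw")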

Derar Alhussein is a senior data engineer with a master's degree in data mining. He has over a decade of hands-on experience in software and data projects, including large-scale projects on Databricks. He currently holds eight certifications from Databricks, showcasing his proficiency in the field. Derar is also an experienced instructor, with a proven track record of success in training thousands of data engineers, helping them to develop their skills and obtain professional certifications.


Table of Contents

  • Preface
    • Why I Wrote This Book
    • Who This Book Is For
    • What You Will Learn
    • What Not to Expect
    • GitHub Repository and Community
    • Conventions Used in This Book
    • Using Code Examples
    • O'Reilly Online Learning
    • How to Contact Us
    • How to Contact the Author
    • Acknowledgments
  • 1. Getting Started with Databricks
    • Introducing the Databricks Platform
      • Understanding the Databricks Platform
      • High-Level Architecture of the Databricks Lakehouse
      • Deployment of Databricks Resources
      • Apache Spark on Databricks
      • Databricks File System (DBFS)
    • Setting Up a Databricks Workspace
    • Exploring the Databricks Workspace
      • Overview of the Workspace Interface
        • Sidebar
        • Top bar
      • Navigating the Workspace Browser
      • Importing Book Materials
        • Option 1: Git folders
        • Option 2: DBC files
    • Creating Clusters
      • All-Purpose Clusters
      • Job Clusters
      • Databricks Pools
        • Understanding cluster pools
        • Cost considerations
      • Creating All-Purpose Clusters
        • 1. Navigating to the Compute tab
        • 2. Initiating the cluster creation
        • 3. Naming your cluster
        • 4. Setting the cluster policy
        • 5. Configuring the cluster: Single-node versus multi-node
        • 6. Configuring the access mode
        • 7. Performance: Selecting the Databricks Runtime version
        • 8. Enabling Photon
        • 9. Configuring worker nodes
        • 10. Configuring the driver node
        • 11. Enabling auto-termination
        • 12. Reviewing the cluster configuration
        • 13. Creating the cluster
      • Managing Your Cluster
        • Controlling your cluster
        • Managing your cluster
    • Working with Notebooks
      • Creating a New Notebook
      • Setting the Notebook Language
      • Executing Code
        • Running code cells
        • Managing cells
      • Magic Commands
        • Language magic command
        • Markdown magic command
        • Enhancing notebook navigation with Markdown
        • Run magic command
        • FS magic command
      • Databricks Utilities
        • Displaying the output
        • Comparison: %fs magic command versus dbutils
      • Download Notebooks
      • Notebook Versioning
        • Accessing version history
        • Restoring a previous version
    • Versioning with Git
      • Setting Up Git Integration
        • Prerequisites
        • Configuring Git integration
      • Creating Git Folders
      • Managing Git Branches
      • Committing and Pushing Changes
      • Pulling Changes from GitHub
        • Synchronizing with merged pull requests
        • Pulling changes
    • Conclusion
    • Sample Exam Questions
      • Conceptual Questions
      • Code-Based Questions
  • 2. Managing Data with Delta Lake
    • Introducing Delta Lake
      • What Is Delta Lake?
      • Delta Lake Transaction Log
      • Understanding Delta Lake Functionality
        • Writing and reading scenario
          • Write operation by Alice
          • Read operation by Bob
        • Updating scenario
        • Concurrent writes and reads scenario
        • Failed writes scenario
      • Delta Lake Advantages
    • Working with Delta Lake Tables
      • Creating Tables
      • Catalog Explorer
      • Inserting Data
      • Exploring the Table Directory
      • Updating Delta Lake Tables
      • Exploring Table History
    • Exploring Delta Time Travel
      • Querying Older Versions
        • Querying by timestamp
        • Querying by version number
      • Rolling Back to Previous Versions
    • Optimizing Delta Lake Tables
      • Z-Order Indexing
    • Vacuuming
      • Vacuuming in Action
    • Dropping Delta Lake Tables
    • Conclusion
    • Sample Exam Questions
      • Conceptual Question
      • Code-Based Question
  • 3. Mastering Relational Entities in Databricks
    • Understanding Relational Entities
      • Databases in Databricks
        • Default database
        • Creating databases
        • Custom-location databases
      • Tables in Databricks
        • Managed tables
        • External tables
    • Putting Relational Entities into Practice
      • Working in the Default Schema
        • Creating managed tables
        • Creating external tables
        • Dropping tables
      • Working in a New Schema
        • Creating a new database
        • Creating tables in the new database
        • Dropping tables
      • Working in a Custom-Location Schema
        • Creating the database
        • Creating tables
        • Dropping tables
    • Setting Up Delta Tables
      • CTAS Statements
      • Comparing CREATE TABLE and CTAS
        • Schema declaration
        • Populating data
      • Table Constraints
      • Cloning Delta Lake Tables
        • Deep cloning
        • Shallow cloning
        • Data integrity in cloning
    • Exploring Views
      • View Types
        • Stored views
        • Temporary views
        • Global temporary views
      • Comparison of View Types
        • Creation syntax
        • Accessibility
        • Lifetime
      • Dropping Views
    • Conclusion
    • Sample Exam Questions
      • Conceptual Question
      • Code-Based Question
  • 4. Transforming Data with Apache Spark
    • Querying Data Files
      • Querying JSON Format
      • Querying Using the text Format
      • Querying Using binaryFile Format
      • Querying Non-Self-Describing Formats
      • Registering Tables from Files with CTAS
      • Registering Tables on Foreign Data Sources
        • Example 1: CSV
        • Example 2: database
        • Limitation
        • Impact of not having a Delta table
        • Hybrid approach
    • Writing to Tables
      • Replacing Data
        • 1. CREATE OR REPLACE TABLE statement
        • 2. INSERT OVERWRITE
      • Appending Data
      • Merging Data
    • Performing Advanced ETL Transformations
      • Dealing with Nested JSON Data
      • Parsing JSON into Struct Type
      • Interacting with Struct Types
      • Flattening Struct Types
      • Leveraging the explode Function
      • Aggregating Unique Values
      • Mastering Join Operations in Spark SQL
      • Exploring Set Operations in Spark SQL
        • Union operation
        • Intersect operation
        • Minus operation
      • Changing Data Perspectives
    • Working with Higher-Order Functions
      • Filter Function
      • Transform Function
    • Developing SQL UDFs
      • Creating UDFs
      • Applying UDFs
      • Understanding UDFs
      • Complex Logic UDFs
      • Dropping UDFs
    • Conclusion
    • Sample Exam Questions
      • Conceptual Question
      • Code-Based Question
  • 5. Processing Incremental Data
    • Streaming Data with Apache Spark
      • What Is a Data Stream?
      • Spark Structured Streaming
        • The append-only requirement of streaming sources
        • Delta Lake as streaming source
          • DataStreamReader
          • DataStreamWriter
      • Streaming Query Configurations
        • Trigger Intervals
          • Continuous mode: Near-real-time processing
          • Triggered mode: Incremental batch processing
        • Output Modes
          • Append mode
          • Complete mode
        • Checkpointing
      • Structured Streaming Guarantees
        • Fault recovery
        • Exactly-once semantics
      • Unsupported Operations
    • Implementing Structured Streaming
      • Streaming Data Manipulations in SQL
        • Applying transformations
        • Persisting streaming data
      • Streaming Data Manipulations in Python
    • Incremental Data Ingestion
      • Introducing Data Ingestion
      • COPY INTO Command
      • Auto Loader
        • Implementation
        • Schema management
      • Comparison of Ingestion Mechanisms
        • File volume
        • Efficiency
      • Auto Loader in Action
        • Setting up Auto Loader
        • Observing Auto Loader
        • Exploring table history
        • Cleaning up
    • Medallion Architecture
      • Introducing Medallion Architecture
        • The layered approach
          • Bronze layer
          • Silver layer
          • Gold layer
        • Benefits of Medallion Architectures
      • Building Medallion Architectures
        • Establishing the bronze layer
          • Configuring Auto Loader
          • Creating a static lookup table
        • Transitioning to the silver layer
        • Advancing to the gold layer
        • Stopping active streams
    • Conclusion
    • Sample Exam Questions
      • Conceptual Question
      • Code-Based Question
  • 6. Building Production Pipelines
    • Exploring Delta Live Tables
      • Introducing Delta Live Tables
        • Benefits of Delta Live Tables
        • Comparison of DLT and Spark Structured Streaming
          • Syntax
          • SQL support
          • Data quality control
        • DLT object types
          • Streaming tables
          • Materialized views
          • Live views
      • DLT Expectations
      • Implementing DLT Pipelines
        • Bronze layer
          • Creating a streaming table
          • Creating a materialized view
        • Silver layer
        • Gold layer
      • Configuring DLT Pipelines
        • General configurations
          • Source code
          • Destination
          • Compute
        • Advanced configurations
        • Running DLT pipelines
          • Production mode
          • Development mode
          • Data quality metrics
        • Modifying DLT pipelines
          • Full refresh
        • Examining DLT pipelines
    • Capturing Data Changes
      • Definition
      • CDC Feed
      • CDC Sources
        • Databases with built-in CDC features
        • CDC agents
      • CDC Feed Delivery
      • CDC in DLT
        • APPLY CHANGES INTO command
        • Advantages of APPLY CHANGES INTO
        • Disadvantages of APPLY CHANGES INTO
      • Processing Change Data Capture
      • Extending DLT Pipelines with New Notebooks
    • Orchestrating Workflows
      • Introducing Databricks Jobs
      • Creating Databricks Jobs
        • Task 1: Landing data
        • Task 2: DLT pipeline
        • Task 3: Output exploration
      • Configuring Job Settings
        • Scheduling the job
        • Setting job notifications
        • Managing permissions
      • Running the Job
        • Reviewing task results
        • Task 1: Landing data
        • Task 2: DLT pipeline
        • Task 3: Output exploration
      • Debugging Jobs
        • Repairing runs
    • Conclusion
    • Sample Exam Questions
      • Conceptual Question
      • Code-Based Question
  • 7. Exploring Databricks SQL
    • What Is Databricks SQL?
    • Creating SQL Warehouses
      • Configuring a SQL Warehouse
      • SQL Endpoints
    • Designing Dashboards
      • Creating a New Dashboard
        • Creating data sources
        • Designing visualizations
        • Defining filters
      • Sharing a Dashboard
      • Publishing a Dashboard
      • Republishing a New Version
    • Managing SQL Queries
      • Writing a SQL Query
      • Saving a Query
      • Scheduling a Query
      • Browsing Saved Queries
    • Setting Up Alerts
      • Creating an Alert
      • Scheduling the Alert
    • Conclusion
    • Sample Exam Questions
      • Conceptual Questions
  • 8. Implementing Data Governance
    • What Is Data Governance?
    • Managing Data Security in the Hive Metastore
      • Granting Permissions
        • Data object types
        • Object privileges
          • SELECT privilege
          • MODIFY privilege
          • CREATE privilege
          • READ_METADATA privilege
          • USAGE privilege
          • ALL PRIVILEGES
        • Granting privileges by role
      • Advanced Privilege Management
        • REVOKE operation
        • DENY operation
        • SHOW GRANTS operation
      • Managing Permissions with Databricks SQL
        • Adding users
        • Adding groups
        • Creating data objects
        • Configuring object permissions
          • Granting privileges to a group
          • Granting privileges to an individual user
          • Reviewing assigned permissions
        • Managing permissions in Catalog Explorer
          • Reviewing and modifying permissions
          • Granting new permissions
          • Revoking permissions
          • Managing permissions for database objects
          • Limitations of the Catalog Explorer
          • Query History
    • Governing Data with Unity Catalog
      • What Is Unity Catalog?
      • Unity Catalog Architecture
      • Key Architectural Changes
      • UC Three-Level Namespace
      • Data Object Hierarchy
      • Detailed Hierarchical Structure
      • Identity Management
        • Users
        • Service principals
        • Groups
        • Identity federation
      • UC Security Model
      • Accessing the Hive Metastore
      • Unity Catalog Features
      • Unity Catalog in Action
        • Enabling workspaces for Unity Catalog
          • Verifying Unity Catalog enablement
          • Manual enabling of Unity Catalog
            • Accessing account console
            • Creating a new metastore
            • Assigning existing metastore
        • Running Unity Catalog workloads
          • Creating a UC-compliant cluster
          • Managing data catalogs
            • Creating a new catalog
            • Verifying the created catalog
            • Granting permissions
            • Creating schemas
            • Managing Delta tables
            • Dropping tables
    • Conclusion
    • Sample Exam Questions
      • Conceptual Question
      • Code-Based Question
  • 9. Certification Overview
    • Exploring the Exam Format
      • Key Topics Covered
      • Out-of-Scope Topics
      • Code Snippet Language
    • Registering for the Exam
      • Registration Fee
      • Exam Platform Overview
      • Scheduling the Exam
    • Troubleshooting and Support
    • Getting Ready for the Assessment
      • Exam Proctoring
      • Exam Result
    • Practice Exams
      • Official Databricks Practice Exam
      • Interactive Practice Exams
    • Seeking Assistance
    • Final Thoughts
  • A. Signing Up for Databricks
    • Deploying Databricks on Microsoft Azure
    • Deploying Databricks on Amazon Web Services
    • Additional Workspaces and Account Management
    • Deploying Databricks on Google Cloud Platform
  • B. Databricks Community Edition
  • C. Answers to Sample Exam Questions
    • Chapter 1: Getting Started with Databricks
    • Chapter 2: Managing Data with Delta Lake
    • Chapter 3: Mastering Relational Entities in Databricks
    • Chapter 4: Transforming Data with Apache Spark
    • Chapter 5: Processing Incremental Data
    • Chapter 6: Building Production Pipelines
    • Chapter 7: Exploring Databricks SQL
    • Chapter 8: Implementing Data Governance
  • Index
