AWS Certified Data Engineer Associate Study Guide. In-Depth Guidance and Practice

ISBN: 9781098170035
Pages: 476, Format: ebook
Release date: 2025-08-25
Bookstore: Helion
Book price: 169.14 zł (previously 198.99 zł)
You save: 15% (-29.85 zł)
There's no better time to become a data engineer. And acing the AWS Certified Data Engineer Associate (DEA-C01) exam will help you tackle the demands of modern data engineering and secure your place in the technology-driven future.
Authors Sakti Mishra, Dylan Qu, and Anusha Challa equip you with the knowledge and sought-after skills you need to manage data effectively and excel in your career. Whether you're a data engineer, data analyst, or machine learning engineer, you'll find the in-depth guidance, practical exercises, sample questions, and expert advice you need to leverage AWS services effectively and achieve certification. By reading, you'll learn how to:
- Ingest, transform, and orchestrate data pipelines effectively
- Select the ideal data store, design efficient data models, and manage data lifecycles
- Analyze data rigorously and maintain high data quality standards
- Implement robust authentication, authorization, and data governance protocols
- Prepare thoroughly for the DEA-C01 exam with targeted strategies and practices
Customers who bought "AWS Certified Data Engineer Associate Study Guide. In-Depth Guidance and Practice" also chose:
- Cisco CCNA 200-301. Kurs video. Podstawy sieci komputerowych i konfiguracji. Część 1: 747.50 zł, now 29.90 zł (-96%)
- Cisco CCNP Enterprise 350-401 ENCOR. Kurs video. Sieci przedsiębiorstwa: 427.14 zł, now 29.90 zł (-93%)
- Jak zhakować: 125.00 zł, now 10.00 zł (-92%)
- Windows Media Center. Domowe centrum rozrywki: 66.67 zł, now 8.00 zł (-88%)
- Deep Web bez tajemnic. Kurs video. Pozyskiwanie ukrytych danych: 186.88 zł, now 29.90 zł (-84%)
Table of Contents
- Preface
- What This Book Isn't
- What This Book Is About
- Who Should Read This Book
- How This Book Is Organized
- Accessing the Book's Images Online
- Conventions Used in This Book
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Certification Essentials
- Who Is a Data Engineer?
- Becoming an AWS Data Engineer Associate
- Exam Topics
- Exam Format
- Registering for the Exam
- Exam-Style Questions
- Think Like an AWS Solutions Architect: Translating a Real-World Problem-Solving Framework into Certification
- The Solutions Architect's Problem-Solving Framework
- Real-World Example: Designing a Serverless Stream Analytics Platform to Detect Fraud
- How This Thought Process Applies to Certification Questions
- Study Plan
- Conclusion
- 2. Prerequisite Knowledge for Aspiring Data Engineers
- Databases and Types of Databases
- What Is a Database?
- What Is a Database Management System?
- Types of Databases
- Hierarchical Databases
- Relational Databases
- NoSQL Databases
- OLTP Versus OLAP
- Overview of Big Data
- Distributed Processing Frameworks for Big Data
- MapReduce
- Spark
- Flink
- Hive
- Presto
- Trino
- What Is a Data Lake?
- What Is a Data Warehouse?
- Data Warehouse Versus Data Lake
- ETL Versus ELT
- Different Ways to Process Data
- Batch Processing Pipeline
- Real-Time Stream Processing
- Event-Driven Processing
- High-Level Architecture Overview of Data Processing Pipelines
- Working with Code Repositories
- What Is a Code Repository?
- How to Work with Code Repositories
- CI/CD
- Cloud Computing and AWS
- What Is Cloud Computing?
- An Overview of Amazon Web Services
- Getting Started with AWS
- How to Set Up an AWS Account
- Configure Access with AWS IAM
- Create an IAM User for Authentication
- Add Permissions to Authorize the User
- What Is an IAM Policy?
- What Is an IAM Role?
- Best Practices to Follow with AWS IAM
- Conclusion
- Resources
- 3. Overview of AWS Analytics and Auxiliary Services
- AWS Analytics Services
- Amazon Kinesis Data Streams
- Amazon Data Firehose
- Amazon Managed Service for Apache Flink
- Amazon Managed Streaming for Apache Kafka
- Reference Architecture: Streaming Analytics Pattern with Apache Flink and MSK
- AWS Glue
- AWS Glue DataBrew
- Amazon Athena
- Amazon EMR
- Amazon Redshift
- Amazon QuickSight
- Reference Architecture: Lakehouse with Glue, Redshift, and Athena
- Amazon OpenSearch Service
- Amazon DataZone
- AWS Lake Formation
- Auxiliary Services for Analytics
- Application Integration
- Compute and Containers
- Database
- Storage
- Machine Learning
- Migration and Transfer
- Networking and Content Delivery
- Security, Identity, and Compliance
- Management and Governance
- Developer Tools
- Cloud Financial Management
- AWS Well-Architected Tool
- Conclusion
- Additional Resources
- 4. Data Ingestion and Transformation
- Data Ingestion
- Real-Time Streaming Data Ingestion
- Kinesis Data Streams Versus Amazon MSK
- Sample Streaming Ingestion Use Cases
- Ingesting streaming data from IoT devices into a data lake
- Ingesting click streams into a data warehouse for real-time reporting
- Streaming Amazon DynamoDB data into a centralized data lake
- Ingesting AWS logs into log analytics solutions
- Ingesting Data Using Zero-ETL Integrations
- Ingesting Data from Databases with CDC Using AWS Database Migration Service
- Supported Sources for AWS DMS
- Supported Targets for AWS DMS
- Sample Use Cases
- Ingesting data into an Amazon S3 data lake using DMS
- Ingesting data into Amazon Redshift using DMS
- Converting schema using DMS Schema Conversion
- Ingesting files from on premises
- Ingesting third-party datasets
- Best Practices for Data Ingestion
- Best Practices for Streaming Ingestion
- Best Practices for Choosing Data Stream Capacity Mode
- Best Practices for Sharding
- Best Practices for Consuming Data from KDS
- Best Practices for Amazon MSK
- Amazon MSK provisioned cluster versus serverless
- Amazon MSK serverless cluster
- General practices when using Amazon MSK
- Best Practices for Amazon Data Firehose
- Best Practices for AWS DMS Replication Instances and Tasks
- Best Practices for AWS DMS Tasks with Amazon Redshift Target
- Data Transformation
- Batch Data Transformation
- Streaming Data Transformation
- Data Transformation Using AWS Glue
- Glue Connectors
- Glue Bookmarks
- Data Processing Units
- Worker Type
- Glue Jobs
- Data Sources and Destinations
- Glue Studio
- Glue Studio notebooks
- AWS Glue interactive sessions
- Best Practices for AWS Glue
- Data Transformation Using Amazon EMR
- Storage
- Deployment Options
- Instance Types
- Best Practices for Amazon EMR
- AWS Glue Versus Amazon EMR Options
- SQL-Based Data Transformation Using Amazon Redshift
- Amazon Redshift Compute
- Amazon Redshift Storage
- SQL Data Transformations
- Amazon Redshift materialized views
- Amazon Redshift stored procedures
- Amazon Managed Service for Apache Flink
- Amazon Data Firehose for Transformation
- AWS Lambda for Transformation
- Choosing the Right Streaming Transformation Service
- Choosing the Right Batch Transformation Service
- Data Preparation for Nontechnical Personas
- Fill Missing Values
- Identify Duplicate Records
- Formatting Functions
- Integrating Data from Multiple Sources
- Nesting and Unnesting Data Structures
- Protecting Sensitive Data
- Other Data Preparation Transformations
- Orchestrating Data Pipelines
- AWS Step Functions
- Managed Workflows for Apache Airflow
- Sample Use Case
- AWS Glue Workflows
- Sample Use Case
- Amazon Redshift Scheduler
- Amazon EventBridge
- Sample Use Case
- Choosing the Right Orchestration Service
- Conclusion
- Practice Questions
- Additional Resources
- 5. Data Store Management
- Choosing a Data Store
- AWS Core Storage Services
- AWS Cloud Databases
- Data Storage Formats for Data Lakes
- Row-Based File Formats
- Column-Based File Formats
- Table Formats
- Building a Data Strategy with Multiple Data Stores
- Data Cataloging Systems
- Components of Metadata and Data Catalogs
- Populating an AWS Glue Data Catalog
- Using Glue crawlers
- Defining metadata manually
- Integrating with other AWS services
- Migrating from an existing Hive catalog
- Data Catalog Best Practices
- Establish a consistent naming convention
- Secure the Data Catalog
- Manage schema changes effectively
- Monitor schema changes
- Use crawlers effectively
- Optimize performance with Glue Data Catalog
- Enriching Data Catalogs with Data Classification
- Managing the Lifecycle of Data
- Selecting Storage Solutions for Hot and Cold Data
- Example: Building a Petabyte-Scale Log Analytics Solution on AWS
- Storage Tier Decisions for Different Access Patterns
- Defining Data Retention Policy and Archiving Strategies
- Performing COPY and UNLOAD Operations to Move Data Between Amazon S3 and Amazon Redshift
- Optimizing Data Management with Amazon S3
- Overview of S3 Storage Classes
- Frequently accessed storage classes
- Infrequently accessed storage classes
- Rarely accessed storage classes
- Storage class for changing or unknown access patterns
- Choosing the Right Storage Class
- S3 Intelligent-Tiering
- Managing the Data Lifecycle with Amazon S3 Lifecycle
- Monitoring the Amazon S3 Data Lifecycle
- S3 Storage Lens
- Storage Class Analysis
- AWS Cost Explorer
- Expiring Snapshots from Open Table Formats
- Archiving Data from Amazon DynamoDB to Amazon S3
- Ensuring S3 Data Resiliency with S3 Versioning
- Enabling Versioning on an S3 Bucket
- S3 Versioning and Object Lifecycle Management
- Designing Data Models and Schema
- Introduction to Data Modeling
- Data Modeling Strategies for Amazon Redshift
- Common schema design patterns
- Logical data modeling in Amazon Redshift
- Physical data modeling in Amazon Redshift: Choosing the best distribution style
- Physical data modeling in Amazon Redshift: Choosing the best sort key
- Additional best practices for data modeling with Amazon Redshift
- Data Modeling Strategies for Amazon DynamoDB
- NoSQL versus relational data modeling
- Example use case: Ecommerce website
- Core concepts of DynamoDB
- Selecting the right partition key
- Selecting the right sort key
- Utilizing global secondary indexes and local secondary indexes
- Common use cases and considerations
- Data Modeling Strategies for Data Lakes
- Raw data layer: The landing zone for raw data
- Stage data layer: Cleansed and conformed data
- Analytics data layer: Curated and aggregated data
- Amazon S3 Data Lake Best Practices
- Partition your data
- Bucket your data
- Use compression
- Optimize file size
- Use columnar file formats
- Use open table formats
- Conclusion
- Practice Questions
- Additional Resources
- 6. Data Operations and Support
- Amazon QuickSight
- Data Sources
- Datasets
- Refreshing SPICE Datasets
- Visualizations
- Presentation Formats
- QuickSight GenBI Capabilities (QuickSight Q)
- Generate stories
- Create executive summaries
- Enhanced dashboard Q&A
- SQL Analytics Using Amazon Athena
- Choice of Querying Engine
- Trino SQL
- Spark SQL/PySpark
- Workgroups
- Capacity Reservations
- Athena Federated SQL
- Use Cases
- DDL Capabilities
- Best Practices When Using Amazon Athena
- SQL Analytics Using Amazon Redshift
- SQL Functions
- Semi-Structured Data Analysis
- Geospatial Data Analysis
- Query Data from Data Lake
- Analyzing Data from Operational Data Stores Using Amazon Redshift
- Redshift ML and Generative AI
- User-Defined Functions
- Analyzing Data Using Notebooks
- AWS Glue Interactive Sessions
- Amazon EMR Notebooks
- Data Pipeline Resiliency
- Monitoring
- Monitoring metrics using CloudWatch
- CloudWatch dashboards
- Monitoring API calls with CloudTrail
- Monitoring logs and traces
- Monitoring using system tables
- Alerting
- CloudWatch Alarms
- Alarm state
- Notifications
- Event-Driven Pipeline Maintenance with EventBridge
- Ensuring Data Quality and Reliability: Deequ and DQDL
- AWS Glue Data Quality
- AWS Glue Data Quality DQDL syntax
- Composite rules
- Using Deequ with Amazon EMR
- Automated Data Quality Checks and Error Handling
- Troubleshooting and Performance Tuning
- Connection timed out errors
- Access denied exceptions
- Throttling errors
- Resource constraints
- CI/CD Pipelines
- Continuous integration (CI)
- Continuous deployment (CD)
- Version Control and Collaboration
- Infrastructure as Code
- AWS CloudFormation
- AWS Serverless Application Model
- AWS Cloud Development Kit (AWS CDK)
- Choosing the right IaC solution
- Disaster Recovery and High Availability
- HA for Amazon EMR clusters on EC2
- HA for Amazon Redshift provisioned clusters
- Availability Zone (AZ) failure recovery
- Backup and restore
- Region failure recovery
- HA for Amazon MSK
- HA for Amazon OpenSearch
- Cost Optimization for Data Pipelines
- Leveraging Serverless Services
- Autoscaling
- Tiered Storage
- Columnar Formats
- Monitor and Control Data Transfer Costs
- Follow Cost Optimization Best Practices
- Conclusion
- Practice Questions
- Additional Resources
- 7. Data Security and Governance
- Network Security
- Amazon VPC Overview
- Security Groups Overview
- Best Practices for Configuring Security Groups for Your Workloads
- Configuring a VPC and Security Group for an Amazon EMR Cluster
- Managed Services Versus Unmanaged Services
- VPC Endpoints Overview
- Redshift-managed VPC endpoints
- OpenSearch Service-managed VPC endpoints
- User Authentication and Authorization
- Authenticating Users with IAM Credentials
- IAM Role-Based Authentication and Authorization
- Service-Linked Roles
- Managed Versus Self-Managed Policies
- Enable Single Sign-on with AWS IAM Identity Center
- IAM Identity Center integration with AWS Lake Formation
- IAM Identity Center integration with Amazon DataZone
- Data Security and Privacy
- Secure Data in Amazon S3
- Manage Database Credentials
- Data Encryption and Decryption and Managing the Encryption Keys
- Managing Encryption Keys with AWS KMS
- Enabling encryption and managing keys in AWS
- Best practices for managing keys with AWS KMS
- Enabling Encryption in AWS Analytics Services
- AWS Glue
- Amazon EMR
- Amazon Redshift
- Sensitive Data Detection and Redaction
- Integrating Amazon Macie for data at rest
- Integrating AWS Glue sensitive data detection
- Fine-Grained Access Control with AWS Lake Formation
- Register the data lake location
- Granting permission to Glue Data Catalog databases, tables, and views
- Name-based access control
- Tag-based access control
- Row- and column-based data filtering
- Best practices to integrate AWS Lake Formation
- Best practices for cross-account sharing
- Best practices for tag-based access control
- Database Security in Amazon Redshift
- Manage permissions with GRANT and REVOKE
- Role-based access control
- Row-level security
- Dynamic data masking
- Fine-Grained Access Control in Amazon QuickSight
- Access control with IAM policies
- Access control with Lake Formation
- Data Governance
- Metadata Management and Technical Catalog
- AWS Glue Data Catalog
- AWS Glue crawler
- Amazon DataZone business glossary
- Data Sharing
- Share within a single AWS account
- Multiaccount, hub-and-spoke model for data sharing
- Data mesh with centralized governance
- Cross-organization or business-to-business data sharing
- Exposing data as a product in a data marketplace
- Data Quality
- Data Profiling
- Data Lifecycle Management
- Data Lineage
- Amazon DataZone
- Building lineage solutions with AWS Glue, Amazon Neptune, and Spline
- Amazon SageMaker ML Lineage Tracking
- Logging and Auditing
- Amazon CloudWatch
- Amazon OpenSearch Service
- Amazon S3
- Logging and auditing in Amazon Redshift
- Amazon Managed Service for Prometheus and Grafana
- AWS CloudTrail to audit actions or API invocations
- Analyzing CloudTrail logs using CloudTrail Lake
- Analyzing Logs Using AWS Services
- Amazon Athena
- Amazon CloudWatch Logs Insights
- AWS CloudTrail Insights
- Amazon OpenSearch Dashboards
- Processing logs with Amazon EMR or AWS Glue
- Auditing AWS configuration changes with AWS Config
- Conclusion
- Practice Questions
- Additional Resources
- 8. Implementing Batch and Streaming Pipelines
- Data Processing Pipeline
- Implementing a Batch Processing Pipeline
- Use Case and Architecture Overview
- Overview of Input Dataset
- Step-by-Step Implementation Guide
- Create Amazon S3 buckets
- Create Amazon Redshift cluster
- Create Glue data connection for the Redshift cluster
- Create AWS Glue PySpark ETL job
- Create Amazon QuickSight execution role using AWS IAM
- Sign up for and manage Amazon QuickSight
- Create Amazon QuickSight visualization
- Best Practices and Optimization Techniques
- Implementing a Real-Time Streaming Pipeline
- Use Case and Architecture Overview
- Step-by-Step Implementation Guide
- Creating a Kinesis data stream
- Setting up Amazon Kinesis Data Generator
- Create Amazon S3 buckets for an Iceberg data lake and a streaming checkpoint
- Creating an EMR Studio and EMR Serverless application
- Creating VPC endpoints for Kinesis Data Streams, Amazon S3, and EMR Serverless
- Submitting the Spark Streaming job to the EMR Serverless application
- Conclusion
- Resources
- 9. Practice Exam
- 10. What's New in AWS for Data Engineers
- Amazon SageMaker Unified Studio
- Amazon SageMaker Catalog
- Amazon SageMaker Lakehouse
- Amazon SageMaker AI
- Amazon S3 Tables
- Amazon S3 Metadata
- Improving the Developer Experience with Generative AI
- Generative AI-Powered Code Generation with Amazon Q Developer
- Automated Script Upgrade in AWS Glue
- GenAI-Powered Troubleshooting for Spark in AWS Glue
- Conclusion
- Resources
- A. Solutions to the Practice Questions
- Chapter 4
- Chapter 5
- Chapter 6
- Chapter 7
- Chapter 9
- Index





