AWS Certified Data Engineer Associate Study Guide. In-Depth Guidance and Practice

ISBN: 9781098170035
Pages: 476, Format: ebook
Release date: 2025-08-25
Bookstore: Helion
Book price: 169.14 zł (previously 198.99 zł)
You save: 15% (-29.85 zł)
There's no better time to become a data engineer. And acing the AWS Certified Data Engineer Associate (DEA-C01) exam will help you tackle the demands of modern data engineering and secure your place in the technology-driven future.
Authors Sakti Mishra, Dylan Qu, and Anusha Challa equip you with the knowledge and sought-after skills you need to manage data effectively and excel in your career. Whether you're a data engineer, data analyst, or machine learning engineer, you'll find the in-depth guidance, practical exercises, sample questions, and expert advice you need to leverage AWS services effectively and achieve certification. By reading, you'll learn how to:
- Ingest, transform, and orchestrate data pipelines effectively
- Select the ideal data store, design efficient data models, and manage data lifecycles
- Analyze data rigorously and maintain high data quality standards
- Implement robust authentication, authorization, and data governance protocols
- Prepare thoroughly for the DEA-C01 exam with targeted strategies and practices
Customers who bought "AWS Certified Data Engineer Associate Study Guide. In-Depth Guidance and Practice" also chose:
- Cisco CCNA 200-301. Kurs video. Podstawy sieci komputerowych i konfiguracji. Część 1: 747.50 zł, now 29.90 zł (-96%)
- Cisco CCNP Enterprise 350-401 ENCOR. Kurs video. Sieci przedsiębiorstwa: 427.14 zł, now 29.90 zł (-93%)
- Jak zhakować: 125.00 zł, now 10.00 zł (-92%)
- Windows Media Center. Domowe centrum rozrywki: 66.67 zł, now 8.00 zł (-88%)
- Deep Web bez tajemnic. Kurs video. Pozyskiwanie ukrytych danych: 186.88 zł, now 29.90 zł (-84%)
Table of Contents
- Preface
- What This Book Isn't
- What This Book Is About
- Who Should Read This Book
- How This Book Is Organized
- Accessing the Book's Images Online
- Conventions Used in This Book
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Certification Essentials
- Who Is a Data Engineer?
- Becoming an AWS Data Engineer Associate
- Exam Topics
- Exam Format
- Registering for the Exam
- Exam-Style Questions
- Think Like an AWS Solutions Architect: Translating a Real-World Problem-Solving Framework into Certification
- The Solutions Architect's Problem-Solving Framework
- Real-World Example: Designing a Serverless Stream Analytics Platform to Detect Fraud
- How This Thought Process Applies to Certification Questions
- Study Plan
- Conclusion
- 2. Prerequisite Knowledge for Aspiring Data Engineers
- Databases and Types of Databases
- What Is a Database?
- What Is a Database Management System?
- Types of Databases
- Hierarchical Databases
- Relational Databases
- NoSQL Databases
- OLTP Versus OLAP
- Overview of Big Data
- Distributed Processing Frameworks for Big Data
- MapReduce
- Spark
- Flink
- Hive
- Presto
- Trino
- What Is a Data Lake?
- What Is a Data Warehouse?
- Data Warehouse Versus Data Lake
- ETL Versus ELT
- Different Ways to Process Data
- Batch Processing Pipeline
- Real-Time Stream Processing
- Event-Driven Processing
- High-Level Architecture Overview of Data Processing Pipelines
- Working with Code Repositories
- What Is a Code Repository?
- How to Work with Code Repositories
- CI/CD
- Cloud Computing and AWS
- What Is Cloud Computing?
- An Overview of Amazon Web Services
- Getting Started with AWS
- How to Set Up an AWS Account
- Configure Access with AWS IAM
- Create an IAM User for Authentication
- Add Permissions to Authorize the User
- What Is an IAM Policy?
- What Is an IAM Role?
- Best Practices to Follow with AWS IAM
- Conclusion
- Resources
- 3. Overview of AWS Analytics and Auxiliary Services
- AWS Analytics Services
- Amazon Kinesis Data Streams
- Amazon Data Firehose
- Amazon Managed Service for Apache Flink
- Amazon Managed Streaming for Apache Kafka
- Reference Architecture: Streaming Analytics Pattern with Apache Flink and MSK
- AWS Glue
- AWS Glue DataBrew
- Amazon Athena
- Amazon EMR
- Amazon Redshift
- Amazon QuickSight
- Reference Architecture: Lakehouse with Glue, Redshift, and Athena
- Amazon OpenSearch Service
- Amazon DataZone
- AWS Lake Formation
- Auxiliary Services for Analytics
- Application Integration
- Compute and Containers
- Database
- Storage
- Machine Learning
- Migration and Transfer
- Networking and Content Delivery
- Security, Identity, and Compliance
- Management and Governance
- Developer Tools
- Cloud Financial Management
- AWS Well-Architected Tool
- Conclusion
- Additional Resources
- 4. Data Ingestion and Transformation
- Data Ingestion
- Real-Time Streaming Data Ingestion
- Kinesis Data Streams Versus Amazon MSK
- Sample Streaming Ingestion Use Cases
- Ingesting streaming data from IoT devices into a data lake
- Ingesting click streams into a data warehouse for real-time reporting
- Streaming Amazon DynamoDB data into a centralized data lake
- Ingesting AWS logs into log analytics solutions
- Ingesting Data Using Zero-ETL Integrations
- Ingesting Data from Databases with CDC Using AWS Database Migration Service
- Supported Sources for AWS DMS
- Supported Targets for AWS DMS
- Sample Use Cases
- Ingesting data into an Amazon S3 data lake using DMS
- Ingesting data into Amazon Redshift using DMS
- Converting schema using DMS Schema Conversion
- Ingesting files from on premises
- Ingesting third-party datasets
- Best Practices for Data Ingestion
- Best Practices for Streaming Ingestion
- Best Practices for Choosing Data Stream Capacity Mode
- Best Practices for Sharding
- Best Practices for Consuming Data from KDS
- Best Practices for Amazon MSK
- Amazon MSK provisioned cluster versus serverless
- Amazon MSK serverless cluster
- General practices when using Amazon MSK
- Best Practices for Amazon Data Firehose
- Best Practices for AWS DMS Replication Instances and Tasks
- Best Practices for AWS DMS Tasks with Amazon Redshift Target
- Data Transformation
- Batch Data Transformation
- Streaming Data Transformation
- Data Transformation Using AWS Glue
- Glue Connectors
- Glue Bookmarks
- Data Processing Units
- Worker Type
- Glue Jobs
- Data Sources and Destinations
- Glue Studio
- Glue Studio notebooks
- AWS Glue interactive sessions
- Best Practices for AWS Glue
- Data Transformation Using Amazon EMR
- Storage
- Deployment Options
- Instance Types
- Best Practices for Amazon EMR
- AWS Glue Versus Amazon EMR Options
- SQL-Based Data Transformation Using Amazon Redshift
- Amazon Redshift Compute
- Amazon Redshift Storage
- SQL Data Transformations
- Amazon Redshift materialized views
- Amazon Redshift stored procedures
- Amazon Managed Service for Apache Flink
- Amazon Data Firehose for Transformation
- AWS Lambda for Transformation
- Choosing the Right Streaming Transformation Service
- Choosing the Right Batch Transformation Service
- Data Preparation for Nontechnical Personas
- Fill Missing Values
- Identify Duplicate Records
- Formatting Functions
- Integrating Data from Multiple Sources
- Nesting and Unnesting Data Structures
- Protecting Sensitive Data
- Other Data Preparation Transformations
- Orchestrating Data Pipelines
- AWS Step Functions
- Managed Workflows for Apache Airflow
- Sample Use Case
- AWS Glue Workflows
- Sample Use Case
- Amazon Redshift Scheduler
- Amazon EventBridge
- Sample Use Case
- Choosing the Right Orchestration Service
- Conclusion
- Practice Questions
- Additional Resources
- 5. Data Store Management
- Choosing a Data Store
- AWS Core Storage Services
- AWS Cloud Databases
- Data Storage Formats for Data Lakes
- Row-Based File Formats
- Column-Based File Formats
- Table Formats
- Building a Data Strategy with Multiple Data Stores
- Data Cataloging Systems
- Components of Metadata and Data Catalogs
- Populating an AWS Glue Data Catalog
- Using Glue crawlers
- Defining metadata manually
- Integrating with other AWS services
- Migrating from an existing Hive catalog
- Data Catalog Best Practices
- Establish a consistent naming convention
- Secure the Data Catalog
- Manage schema changes effectively
- Monitor schema changes
- Use crawlers effectively
- Optimize performance with Glue Data Catalog
- Enriching Data Catalogs with Data Classification
- Managing the Lifecycle of Data
- Selecting Storage Solutions for Hot and Cold Data
- Example: Building a Petabyte-Scale Log Analytics Solution on AWS
- Storage Tier Decisions for Different Access Patterns
- Defining Data Retention Policy and Archiving Strategies
- Performing COPY and UNLOAD Operations to Move Data Between Amazon S3 and Amazon Redshift
- Optimizing Data Management with Amazon S3
- Overview of S3 Storage Classes
- Frequently accessed storage classes
- Infrequently accessed storage classes
- Rarely accessed storage classes
- Storage class for changing or unknown access patterns
- Choosing the Right Storage Class
- S3 Intelligent-Tiering
- Managing the Data Lifecycle with Amazon S3 Lifecycle
- Monitoring the Amazon S3 Data Lifecycle
- S3 Storage Lens
- Storage Class Analysis
- AWS Cost Explorer
- Expiring Snapshots from Open Table Formats
- Archiving Data from Amazon DynamoDB to Amazon S3
- Ensuring S3 Data Resiliency with S3 Versioning
- Enabling Versioning on an S3 Bucket
- S3 Versioning and Object Lifecycle Management
- Designing Data Models and Schema
- Introduction to Data Modeling
- Data Modeling Strategies for Amazon Redshift
- Common schema design patterns
- Logical data modeling in Amazon Redshift
- Physical data modeling in Amazon Redshift: Choosing the best distribution style
- Physical data modeling in Amazon Redshift: Choosing the best sort key
- Additional best practices for data modeling with Amazon Redshift
- Data Modeling Strategies for Amazon DynamoDB
- NoSQL versus relational data modeling
- Example use case: Ecommerce website
- Core concepts of DynamoDB
- Selecting the right partition key
- Selecting the right sort key
- Utilizing global secondary indexes and local secondary indexes
- Common use cases and considerations
- Data Modeling Strategies for Data Lakes
- Raw data layer: The landing zone for raw data
- Stage data layer: Cleansed and conformed data
- Analytics data layer: Curated and aggregated data
- Amazon S3 Data Lake Best Practices
- Partition your data
- Bucket your data
- Use compression
- Optimize file size
- Use columnar file formats
- Use open table formats
- Conclusion
- Practice Questions
- Additional Resources
- 6. Data Operations and Support
- Amazon QuickSight
- Data Sources
- Datasets
- Refreshing SPICE Datasets
- Visualizations
- Presentation Formats
- QuickSight GenBI Capabilities (QuickSight Q)
- Generate stories
- Create executive summaries
- Enhanced dashboard Q&A
- SQL Analytics Using Amazon Athena
- Choice of Querying Engine
- Trino SQL
- Spark SQL/PySpark
- Workgroups
- Capacity Reservations
- Athena Federated SQL
- Use Cases
- DDL Capabilities
- Best Practices When Using Amazon Athena
- SQL Analytics Using Amazon Redshift
- SQL Functions
- Semi-Structured Data Analysis
- Geospatial Data Analysis
- Query Data from Data Lake
- Analyzing Data from Operational Data Stores Using Amazon Redshift
- Redshift ML and Generative AI
- User-Defined Functions
- Analyzing Data Using Notebooks
- AWS Glue Interactive Sessions
- Amazon EMR Notebooks
- Data Pipeline Resiliency
- Monitoring
- Monitoring metrics using CloudWatch
- CloudWatch dashboards
- Monitoring API calls with CloudTrail
- Monitoring logs and traces
- Monitoring using system tables
- Alerting
- CloudWatch Alarms
- Alarm state
- Notifications
- Event-Driven Pipeline Maintenance with EventBridge
- Ensuring Data Quality and Reliability: Deequ and DQDL
- AWS Glue Data Quality
- AWS Glue Data Quality DQDL syntax
- Composite rules
- Using Deequ with Amazon EMR
- Automated Data Quality Checks and Error Handling
- Troubleshooting and Performance Tuning
- Connection timed out errors
- Access denied exceptions
- Throttling errors
- Resource constraints
- CI/CD Pipelines
- Continuous integration (CI)
- Continuous deployment (CD)
- Version Control and Collaboration
- Infrastructure as Code
- AWS CloudFormation
- AWS Serverless Application Model
- AWS Cloud Development Kit (AWS CDK)
- Choosing the right IaC solution
- Disaster Recovery and High Availability
- HA for Amazon EMR clusters on EC2
- HA for Amazon Redshift provisioned clusters
- Availability Zone (AZ) failure recovery
- Backup and restore
- Region failure recovery
- HA for Amazon MSK
- HA for Amazon OpenSearch
- Cost Optimization for Data Pipelines
- Leveraging Serverless Services
- Autoscaling
- Tiered Storage
- Columnar Formats
- Monitor and Control Data Transfer Costs
- Follow Cost Optimization Best Practices
- Conclusion
- Practice Questions
- Additional Resources
- 7. Data Security and Governance
- Network Security
- Amazon VPC Overview
- Security Groups Overview
- Best Practices for Configuring Security Groups for Your Workloads
- Configuring a VPC and Security Group for an Amazon EMR Cluster
- Managed Services Versus Unmanaged Services
- VPC Endpoints Overview
- Redshift-managed VPC endpoints
- OpenSearch Service-managed VPC endpoints
- User Authentication and Authorization
- Authenticating Users with IAM Credentials
- IAM Role-Based Authentication and Authorization
- Service-Linked Roles
- Managed Versus Self-Managed Policies
- Enable Single Sign-on with AWS IAM Identity Center
- IAM Identity Center integration with AWS Lake Formation
- IAM Identity Center integration with Amazon DataZone
- Data Security and Privacy
- Secure Data in Amazon S3
- Manage Database Credentials
- Data Encryption and Decryption and Managing the Encryption Keys
- Managing Encryption Keys with AWS KMS
- Enabling encryption and managing keys in AWS
- Best practices for managing keys with AWS KMS
- Enabling Encryption in AWS Analytics Services
- AWS Glue
- Amazon EMR
- Amazon Redshift
- Sensitive Data Detection and Redaction
- Integrating Amazon Macie for data at rest
- Integrating AWS Glue sensitive data detection
- Fine-Grained Access Control with AWS Lake Formation
- Register the data lake location
- Granting permission to Glue Data Catalog databases, tables, and views
- Name-based access control
- Tag-based access control
- Row- and column-based data filtering
- Best practices to integrate AWS Lake Formation
- Best practices for cross-account sharing
- Best practices for tag-based access control
- Database Security in Amazon Redshift
- Manage permissions with GRANT and REVOKE
- Role-based access control
- Row-level security
- Dynamic data masking
- Fine-Grained Access Control in Amazon QuickSight
- Access control with IAM policies
- Access control with Lake Formation
- Data Governance
- Metadata Management and Technical Catalog
- AWS Glue Data Catalog
- AWS Glue crawler
- Amazon DataZone business glossary
- Data Sharing
- Share within a single AWS account
- Multiaccount, hub-and-spoke model for data sharing
- Data mesh with centralized governance
- Cross-organization or business-to-business data sharing
- Exposing data as a product in a data marketplace
- Data Quality
- Data Profiling
- Data Lifecycle Management
- Data Lineage
- Amazon DataZone
- Building lineage solutions with AWS Glue, Amazon Neptune, and Spline
- Amazon SageMaker ML Lineage Tracking
- Logging and Auditing
- Amazon CloudWatch
- Amazon OpenSearch Service
- Amazon S3
- Logging and auditing in Amazon Redshift
- Amazon Managed Service for Prometheus and Grafana
- AWS CloudTrail to audit actions or API invocations
- Analyzing CloudTrail logs using CloudTrail Lake
- Analyzing Logs Using AWS Services
- Amazon Athena
- Amazon CloudWatch Logs Insights
- AWS CloudTrail Insights
- Amazon OpenSearch Dashboards
- Processing logs with Amazon EMR or AWS Glue
- Auditing AWS configuration changes with AWS Config
- Conclusion
- Practice Questions
- Additional Resources
- 8. Implementing Batch and Streaming Pipelines
- Data Processing Pipeline
- Implementing a Batch Processing Pipeline
- Use Case and Architecture Overview
- Overview of Input Dataset
- Step-by-Step Implementation Guide
- Create Amazon S3 buckets
- Create Amazon Redshift cluster
- Create Glue data connection for the Redshift cluster
- Create AWS Glue PySpark ETL job
- Create Amazon QuickSight execution role using AWS IAM
- Sign up for and manage Amazon QuickSight
- Create Amazon QuickSight visualization
- Best Practices and Optimization Techniques
- Implementing a Real-Time Streaming Pipeline
- Use Case and Architecture Overview
- Step-by-Step Implementation Guide
- Creating a Kinesis data stream
- Setting up Amazon Kinesis Data Generator
- Create Amazon S3 buckets for an Iceberg data lake and a streaming checkpoint
- Creating an EMR Studio and EMR Serverless application
- Creating VPC endpoints for Kinesis Data Streams, Amazon S3, and EMR Serverless
- Submitting the Spark Streaming job to the EMR Serverless application
- Conclusion
- Resources
- 9. Practice Exam
- 10. What's New in AWS for Data Engineers
- Amazon SageMaker Unified Studio
- Amazon SageMaker Catalog
- Amazon SageMaker Lakehouse
- Amazon SageMaker AI
- Amazon S3 Tables
- Amazon S3 Metadata
- Improving the Developer Experience with Generative AI
- Generative AI-Powered Code Generation with Amazon Q Developer
- Automated Script Upgrade in AWS Glue
- GenAI-Powered Troubleshooting for Spark in AWS Glue
- Conclusion
- Resources
- A. Solutions to the Practice Questions
- Chapter 4
- Chapter 5
- Chapter 6
- Chapter 7
- Chapter 9
- Index





