97 Things Every Data Engineer Should Know - Helion

ebook

Author: Tobias Macey
ISBN: 9781492062363
Pages: 264, Format: ebook
Publication date: 2021-06-11
Bookstore: Helion

Price: 143,65 zł (previously: 167,03 zł)
You save: 14% (-23,38 zł)

Take advantage of today's sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and LinkedIn share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges.

Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.

Topics include:

The Importance of Data Lineage - Julien Le Dem
Data Security for Data Engineers - Katharine Jarmul
The Two Types of Data Engineering and Data Engineers - Jesse Anderson
Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
The End of ETL as We Know It - Paul Singman
Building a Career as a Data Engineer - Vijay Kiran
Modern Metadata for the Modern Data Stack - Prukalpa Sankar
Your Data Tests Failed! Now What? - Sam Bail

Table of contents

97 Things Every Data Engineer Should Know eBook -- table of contents

Preface
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
1. A (Book) Case for Eventual Consistency
- Denise Koessler Gosnell, PhD
2. A/B and How to Be
- Sonia Mehta
3. About the Storage Layer
- Julien Le Dem
4. Analytics as the Secret Glue for Microservice Architectures
- Elias Nema
5. Automate Your Infrastructure
- Christiano Anderson
6. Automate Your Pipeline Tests
- Tom White
  - Build an End-to-End Test of the Whole Pipeline
  - Use a Small Amount of Representative Data
  - Prefer Textual Data Formats over Binary
  - Ensure That Tests Can Be Run Locally
  - Make Tests Deterministic
  - Make It Easy to Add More Tests
7. Be Intentional About the Batching Model in Your Data Pipelines
- Raghotham Murthy
  - Data Time Window Batching Model
  - Arrival Time Window Batching Model
  - ATW and DTW Batching in the Same Pipeline
8. Beware of Silver-Bullet Syndrome
- Thomas Nield
9. Building a Career as a Data Engineer
- Vijay Kiran
10. Business Dashboards for Data Pipelines
- Valliappa (Lak) Lakshmanan
11. Caution: Data Science Projects Can Turn into the Emperor's New Clothes
- Shweta Katre
12. Change Data Capture
- Raghotham Murthy
13. Column Names as Contracts
- Emily Riederer
14. Consensual, Privacy-Aware Data Collection
- Katharine Jarmul
  - Attach Consent Metadata
  - Track Data Provenance
  - Drop or Encrypt Sensitive Fields
15. Cultivate Good Working Relationships with Data Consumers
- Ido Shlomo
  - Don't Let Consumers Solve Engineering Problems
  - Adapt Your Expectations
  - Understand Consumers' Jobs
16. Data Engineering != Spark
- Jesse Anderson
  - Batch and Real-Time Systems
  - Computation Component
  - Storage Component
  - NoSQL Databases
  - Messaging Component
17. Data Engineering for Autonomy and Rapid Innovation
- Jeff Magnusson
  - Implement Reusable Patterns in the ETL Framework
  - Choose a Framework and Tool Set Accessible Within the Organization
  - Move the Logic to the Edges of the Pipelines
  - Create and Support Staging Tables
  - Bake Data-Flow Logic into Tooling and Infrastructure
18. Data Engineering from a Data Scientist's Perspective
- Bill Franks
  - Database Administration, ETL, and Such
  - Why the Need for Data Engineers?
  - What's the Future?
19. Data Pipeline Design Patterns for Reusability and Extensibility
- Mukul Sood
20. Data Quality for Data Engineers
- Katharine Jarmul
21. Data Security for Data Engineers
- Katharine Jarmul
  - Learn About Security
  - Monitor, Log, and Test Access
  - Encrypt Data
  - Automate Security Tests
  - Ask for Help
22. Data Validation Is More Than Summary Statistics
- Emily Riederer
23. Data Warehouses Are the Past, Present, and Future
- James Densmore
24. Defining and Managing Messages in Log-Centric Architectures
- Boris Lublinsky
25. Demystify the Source and Illuminate the Data Pipeline
- Meghan Kwartler
26. Develop Communities, Not Just Code
- Emily Riederer
27. Effective Data Engineering in the Cloud World
- Dipti Borkar
  - Disaggregated Data Stack
  - Orchestrate, Orchestrate, Orchestrate
  - Copying Data Creates Problems
  - S3 Compatibility
  - SQL and Structured Data Are Still In
28. Embrace the Data Lake Architecture
- Vinoth Chandar
  - Common Pitfalls
  - Data Lakes
  - Advantages
  - Implementation
29. Embracing Data Silos
- Bin Fan and Amelia Wong
  - Why Data Silos Exist
  - Embracing Data Silos
30. Engineering Reproducible Data Science Projects
- Dr. Tianhui Michael Li
31. Five Best Practices for Stable Data Processing
- Christian Lauer
  - Prevent Errors
  - Set Fair Processing Times
  - Use Data-Quality Measurement Jobs
  - Ensure Transaction Security
  - Consider Dependency on Other Systems
  - Conclusion
32. Focus on Maintainability and Break Up Those ETL Tasks
- Chris Moradi
33. Friends Don't Let Friends Do Dual-Writes
- Gunnar Morling
34. Fundamental Knowledge
- Pedro Marcelino
35. Getting the Structured Back into SQL
- Elias Nema
36. Give Data Products a Frontend with Latent Documentation
- Emily Riederer
37. How Data Pipelines Evolve
- Chris Heinzmann
38. How to Build Your Data Platform like a Product
- Barr Moses and Atul Gupte
  - Align Your Product's Goals with the Goals of the Business
  - Gain Feedback and Buy-in from the Right Stakeholders
  - Prioritize Long-Term Growth and Sustainability over Short-Term Gains
  - Sign Off on Baseline Metrics for Your Data and How You Measure It
39. How to Prevent a Data Mutiny
- Sean Knapp
40. Know the Value per Byte of Your Data
- Dhruba Borthakur
41. Know Your Latencies
- Dhruba Borthakur
42. Learn to Use a NoSQL Database, but Not like an RDBMS
- Kirk Kirkconnell
43. Let the Robots Enforce the Rules
- Anthony Burdi
44. Listen to Your Users, but Not Too Much
- Amanda Tomlinson
45. Low-Cost Sensors and the Quality of Data
- Dr. Shivanand Prabhoolall Guness
46. Maintain Your Mechanical Sympathy
- Tobias Macey
47. Metadata ≥ Data
- Jonathan Seidman
48. Metadata Services as a Core Component of the Data Platform
- Lohit VijayaRenu
  - Discoverability
  - Security Control
  - Schema Management
  - Application Interface and Service Guarantee
49. Mind the Gap: Your Data Lake Provides No ACID Guarantees
- Einat Orr
50. Modern Metadata for the Modern Data Stack
- Prukalpa Sankar
  - Data Assets > Tables
  - Complete Data Visibility, Not Piecemeal Solutions
  - Built for Metadata That Itself Is Big Data
  - Embedded Collaboration at Its Heart
51. Most Data Problems Are Not Big Data Problems
- Thomas Nield
52. Moving from Software Engineering to Data Engineering
- John Salinas
53. Observability for Data Engineers
- Barr Moses
  - How Good Data Turns Bad
  - Introducing Data Observability
54. Perfect Is the Enemy of Good
- Bob Haffner
55. Pipe Dreams
- Scott Haines
56. Preventing the Data Lake Abyss
- Scott Haines
  - Establishing Data Contracts
  - From Generic Data Lake to Data Structure Store
57. Prioritizing User Experience in Messaging Systems
- Jowanza Joseph
58. Privacy Is Your Problem
- Stephen Bailey, PhD
59. QA and All Its Sexiness
- Sonia Mehta
60. Seven Things Data Engineers Need to Watch Out for in ML Projects
- Dr. Sandeep Uttamchandani
61. Six Dimensions for Picking an Analytical Data Warehouse
- Gleb Mezhanskiy
  - Scalability
  - Price Elasticity
  - Interoperability
  - Querying and Transformation Features
  - Speed
  - Zero Maintenance
62. Small Files in a Big Data World
- Adi Polak
  - What Are Small Files, and Why Are They a Problem?
  - Why Does It Happen?
  - Detect and Mitigate
  - Conclusion
  - References
63. Streaming Is Different from Batch
- Dean Wampler, PhD
64. Tardy Data
- Ariel Shaqed
65. Tech Should Take a Back Seat for Data Project Success
- Andrew Stevenson
66. Ten Must-Ask Questions for Data-Engineering Projects
- Haidar Hadi
  - Question 1: What Are the Touch Points?
  - Question 2: What Are the Granularities?
  - Question 3: What Are the Input and Output Schemas?
  - Question 4: What Is the Algorithm?
  - Question 5: Do You Need Backfill Data?
  - Question 6: When Is the Project Due Date?
  - Question 7: Why Was That Due Date Set?
  - Question 8: Which Hosting Environment?
  - Question 9: What Is the SLA?
  - Question 10: Who Will Be Taking Over This Project?
67. The Data Pipeline Is Not About Speed
- Rustem Feyzkhanov
68. The Dos and Don'ts of Data Engineering
- Christopher Bergh
  - Don't Be a Hero
  - Don't Rely on Hope
  - Don't Rely on Caution
  - Do DataOps
69. The End of ETL as We Know It
- Paul Singman
  - Replacing ETL with Intentional Data Transfer
  - Agreeing on a Data Model Contract
  - Removing Data Processing Latencies
  - Taking the First Steps
70. The Haiku Approach to Writing Software
- Mitch Seymour
  - Understand the Constraints Up Front
  - Start Strong Since Early Decisions Can Impact the Final Product
  - Keep It as Simple as Possible
  - Engage the Creative Side of Your Brain
71. The Hidden Cost of Data Input/Output
- Lohit VijayaRenu
  - Data Compression
  - Data Format
  - Data Serialization
72. The Holy War Between Proprietary and Open Source Is a Lie
- Paige Roberts
73. The Implications of the CAP Theorem
- Paul Doran
74. The Importance of Data Lineage
- Julien Le Dem
75. The Many Meanings of Missingness
- Emily Riederer
76. The Six Words That Will Destroy Your Career
- Bartosz Mikulski
77. The Three Invaluable Benefits of Open Source for Testing Data Quality
- Tom Baeyens
78. The Three Rs of Data Engineering
- Tobias Macey
  - Reliability
  - Reproducibility
  - Repeatability
  - Conclusion
79. The Two Types of Data Engineering and Data Engineers
- Jesse Anderson
  - Types of Data Engineering
  - Types of Data Engineers
  - Why These Differences Matter to You
80. The Yin and Yang of Big Data Scalability
- Paul Brebner
81. Threading and Concurrency in Data Processing
- Matthew Housley, PhD
  - Operating System Threading
  - Threading Overhead
  - Solving the C10K Problem
  - Scaling Is Not a Magic Bullet
  - Further Reading
82. Three Important Distributed Programming Concepts
- Adi Polak
  - MapReduce Algorithm
  - Distributed Shared Memory Model
  - Message Passing/Actors Model
  - Conclusions
83. Time (Semantics) Won't Wait
- Marta Paes Moreira and Fabian Hueske
84. Tools Don't Matter, Patterns and Practices Do
- Bas Geerdink
85. Total Opportunity Cost of Ownership
- Joe Reis
86. Understanding the Ways Different Data Domains Solve Problems
- Matthew Seal
87. What Is a Data Engineer? Clue: We're Data Science Enablers
- Lewis Gavin
  - AI and Machine Learning Models Require Data
  - Clean Data == Better Model
  - Finally Building a Model
  - A Model Is Useful Only If Someone Will Use It
  - So What Am I Getting At?
88. What Is a Data Mesh, and How Not to Mesh It Up
- Barr Moses and Lior Gavish
  - Why Use a Data Mesh?
  - The Final Link: Observability
89. What Is Big Data?
- Ami Levin
90. What to Do When You Don't Get Any Credit
- Jesse Anderson
91. When Our Data Science Team Didn't Produce Value
- Joel Nantais
92. When to Avoid the Naive Approach
- Nimrod Parasol
93. When to Be Cautious About Sharing Data
- Thomas Nield
94. When to Talk and When to Listen
- Steven Finkelstein
95. Why Data Science Teams Need Generalists, Not Specialists
- Eric Colson
96. With Great Data Comes Great Responsibility
- Lohit VijayaRenu
  - Put Yourself in the User's Shoes
  - Ensure Ethical Use of User Information
  - Watch Your Data Footprint
97. Your Data Tests Failed! Now What?
- Sam Bail, PhD
  - System Response
  - Logging and Alerting
  - Alert Response
  - Stakeholder Communication
  - Root Cause Identification
  - Issue Resolution
Contributors
Index