97 Things Every Data Engineer Should Know - Helion
ISBN: 9781492062363
Pages: 264, Format: ebook
Publication date: 2021-06-11
Bookstore: Helion
Book price: 143,65 zł (previously: 167,03 zł)
You save: 14% (-23,38 zł)
Take advantage of today's sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and LinkedIn share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges.
Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.
Topics include:
- The Importance of Data Lineage - Julien Le Dem
- Data Security for Data Engineers - Katharine Jarmul
- The Two Types of Data Engineering and Data Engineers - Jesse Anderson
- Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
- The End of ETL as We Know It - Paul Singman
- Building a Career as a Data Engineer - Vijay Kiran
- Modern Metadata for the Modern Data Stack - Prukalpa Sankar
- Your Data Tests Failed! Now What? - Sam Bail
Table of contents
97 Things Every Data Engineer Should Know eBook -- table of contents
- Preface
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. A (Book) Case for Eventual Consistency
- Denise Koessler Gosnell, PhD
- 2. A/B and How to Be
- Sonia Mehta
- 3. About the Storage Layer
- Julien Le Dem
- 4. Analytics as the Secret Glue for Microservice Architectures
- Elias Nema
- 5. Automate Your Infrastructure
- Christiano Anderson
- 6. Automate Your Pipeline Tests
- Tom White
- Build an End-to-End Test of the Whole Pipeline
- Use a Small Amount of Representative Data
- Prefer Textual Data Formats over Binary
- Ensure That Tests Can Be Run Locally
- Make Tests Deterministic
- Make It Easy to Add More Tests
- 7. Be Intentional About the Batching Model in Your Data Pipelines
- Raghotham Murthy
- Data Time Window Batching Model
- Arrival Time Window Batching Model
- ATW and DTW Batching in the Same Pipeline
- 8. Beware of Silver-Bullet Syndrome
- Thomas Nield
- 9. Building a Career as a Data Engineer
- Vijay Kiran
- 10. Business Dashboards for Data Pipelines
- Valliappa (Lak) Lakshmanan
- 11. Caution: Data Science Projects Can Turn into the Emperor's New Clothes
- Shweta Katre
- 12. Change Data Capture
- Raghotham Murthy
- 13. Column Names as Contracts
- Emily Riederer
- 14. Consensual, Privacy-Aware Data Collection
- Katharine Jarmul
- Attach Consent Metadata
- Track Data Provenance
- Drop or Encrypt Sensitive Fields
- 15. Cultivate Good Working Relationships with Data Consumers
- Ido Shlomo
- Don't Let Consumers Solve Engineering Problems
- Adapt Your Expectations
- Understand Consumers' Jobs
- 16. Data Engineering != Spark
- Jesse Anderson
- Batch and Real-Time Systems
- Computation Component
- Storage Component
- NoSQL Databases
- Messaging Component
- 17. Data Engineering for Autonomy and Rapid Innovation
- Jeff Magnusson
- Implement Reusable Patterns in the ETL Framework
- Choose a Framework and Tool Set Accessible Within the Organization
- Move the Logic to the Edges of the Pipelines
- Create and Support Staging Tables
- Bake Data-Flow Logic into Tooling and Infrastructure
- 18. Data Engineering from a Data Scientist's Perspective
- Bill Franks
- Database Administration, ETL, and Such
- Why the Need for Data Engineers?
- What's the Future?
- 19. Data Pipeline Design Patterns for Reusability and Extensibility
- Mukul Sood
- 20. Data Quality for Data Engineers
- Katharine Jarmul
- 21. Data Security for Data Engineers
- Katharine Jarmul
- Learn About Security
- Monitor, Log, and Test Access
- Encrypt Data
- Automate Security Tests
- Ask for Help
- 22. Data Validation Is More Than Summary Statistics
- Emily Riederer
- 23. Data Warehouses Are the Past, Present, and Future
- James Densmore
- 24. Defining and Managing Messages in Log-Centric Architectures
- Boris Lublinsky
- 25. Demystify the Source and Illuminate the Data Pipeline
- Meghan Kwartler
- 26. Develop Communities, Not Just Code
- Emily Riederer
- 27. Effective Data Engineering in the Cloud World
- Dipti Borkar
- Disaggregated Data Stack
- Orchestrate, Orchestrate, Orchestrate
- Copying Data Creates Problems
- S3 Compatibility
- SQL and Structured Data Are Still In
- 28. Embrace the Data Lake Architecture
- Vinoth Chandar
- Common Pitfalls
- Data Lakes
- Advantages
- Implementation
- 29. Embracing Data Silos
- Bin Fan and Amelia Wong
- Why Data Silos Exist
- Embracing Data Silos
- 30. Engineering Reproducible Data Science Projects
- Dr. Tianhui Michael Li
- 31. Five Best Practices for Stable Data Processing
- Christian Lauer
- Prevent Errors
- Set Fair Processing Times
- Use Data-Quality Measurement Jobs
- Ensure Transaction Security
- Consider Dependency on Other Systems
- Conclusion
- 32. Focus on Maintainability and Break Up Those ETL Tasks
- Chris Moradi
- 33. Friends Don't Let Friends Do Dual-Writes
- Gunnar Morling
- 34. Fundamental Knowledge
- Pedro Marcelino
- 35. Getting the Structured Back into SQL
- Elias Nema
- 36. Give Data Products a Frontend with Latent Documentation
- Emily Riederer
- 37. How Data Pipelines Evolve
- Chris Heinzmann
- 38. How to Build Your Data Platform like a Product
- Barr Moses and Atul Gupte
- Align Your Product's Goals with the Goals of the Business
- Gain Feedback and Buy-in from the Right Stakeholders
- Prioritize Long-Term Growth and Sustainability over Short-Term Gains
- Sign Off on Baseline Metrics for Your Data and How You Measure It
- 39. How to Prevent a Data Mutiny
- Sean Knapp
- 40. Know the Value per Byte of Your Data
- Dhruba Borthakur
- 41. Know Your Latencies
- Dhruba Borthakur
- 42. Learn to Use a NoSQL Database, but Not like an RDBMS
- Kirk Kirkconnell
- 43. Let the Robots Enforce the Rules
- Anthony Burdi
- 44. Listen to Your Users, but Not Too Much
- Amanda Tomlinson
- 45. Low-Cost Sensors and the Quality of Data
- Dr. Shivanand Prabhoolall Guness
- 46. Maintain Your Mechanical Sympathy
- Tobias Macey
- 47. Metadata ≥ Data
- Jonathan Seidman
- 48. Metadata Services as a Core Component of the Data Platform
- Lohit VijayaRenu
- Discoverability
- Security Control
- Schema Management
- Application Interface and Service Guarantee
- 49. Mind the Gap: Your Data Lake Provides No ACID Guarantees
- Einat Orr
- 50. Modern Metadata for the Modern Data Stack
- Prukalpa Sankar
- Data Assets > Tables
- Complete Data Visibility, Not Piecemeal Solutions
- Built for Metadata That Itself Is Big Data
- Embedded Collaboration at Its Heart
- 51. Most Data Problems Are Not Big Data Problems
- Thomas Nield
- 52. Moving from Software Engineering to Data Engineering
- John Salinas
- 53. Observability for Data Engineers
- Barr Moses
- How Good Data Turns Bad
- Introducing Data Observability
- 54. Perfect Is the Enemy of Good
- Bob Haffner
- 55. Pipe Dreams
- Scott Haines
- 56. Preventing the Data Lake Abyss
- Scott Haines
- Establishing Data Contracts
- From Generic Data Lake to Data Structure Store
- 57. Prioritizing User Experience in Messaging Systems
- Jowanza Joseph
- 58. Privacy Is Your Problem
- Stephen Bailey, PhD
- 59. QA and All Its Sexiness
- Sonia Mehta
- 60. Seven Things Data Engineers Need to Watch Out for in ML Projects
- Dr. Sandeep Uttamchandani
- 61. Six Dimensions for Picking an Analytical Data Warehouse
- Gleb Mezhanskiy
- Scalability
- Price Elasticity
- Interoperability
- Querying and Transformation Features
- Speed
- Zero Maintenance
- 62. Small Files in a Big Data World
- Adi Polak
- What Are Small Files, and Why Are They a Problem?
- Why Does It Happen?
- Detect and Mitigate
- Conclusion
- References
- 63. Streaming Is Different from Batch
- Dean Wampler, PhD
- 64. Tardy Data
- Ariel Shaqed
- 65. Tech Should Take a Back Seat for Data Project Success
- Andrew Stevenson
- 66. Ten Must-Ask Questions for Data-Engineering Projects
- Haidar Hadi
- Question 1: What Are the Touch Points?
- Question 2: What Are the Granularities?
- Question 3: What Are the Input and Output Schemas?
- Question 4: What Is the Algorithm?
- Question 5: Do You Need Backfill Data?
- Question 6: When Is the Project Due Date?
- Question 7: Why Was That Due Date Set?
- Question 8: Which Hosting Environment?
- Question 9: What Is the SLA?
- Question 10: Who Will Be Taking Over This Project?
- 67. The Data Pipeline Is Not About Speed
- Rustem Feyzkhanov
- 68. The Dos and Don'ts of Data Engineering
- Christopher Bergh
- Don't Be a Hero
- Don't Rely on Hope
- Don't Rely on Caution
- Do DataOps
- 69. The End of ETL as We Know It
- Paul Singman
- Replacing ETL with Intentional Data Transfer
- Agreeing on a Data Model Contract
- Removing Data Processing Latencies
- Taking the First Steps
- 70. The Haiku Approach to Writing Software
- Mitch Seymour
- Understand the Constraints Up Front
- Start Strong Since Early Decisions Can Impact the Final Product
- Keep It as Simple as Possible
- Engage the Creative Side of Your Brain
- 71. The Hidden Cost of Data Input/Output
- Lohit VijayaRenu
- Data Compression
- Data Format
- Data Serialization
- 72. The Holy War Between Proprietary and Open Source Is a Lie
- Paige Roberts
- 73. The Implications of the CAP Theorem
- Paul Doran
- 74. The Importance of Data Lineage
- Julien Le Dem
- 75. The Many Meanings of Missingness
- Emily Riederer
- 76. The Six Words That Will Destroy Your Career
- Bartosz Mikulski
- 77. The Three Invaluable Benefits of Open Source for Testing Data Quality
- Tom Baeyens
- 78. The Three Rs of Data Engineering
- Tobias Macey
- Reliability
- Reproducibility
- Repeatability
- Conclusion
- 79. The Two Types of Data Engineering and Data Engineers
- Jesse Anderson
- Types of Data Engineering
- Types of Data Engineers
- Why These Differences Matter to You
- 80. The Yin and Yang of Big Data Scalability
- Paul Brebner
- 81. Threading and Concurrency in Data Processing
- Matthew Housley, PhD
- Operating System Threading
- Threading Overhead
- Solving the C10K Problem
- Scaling Is Not a Magic Bullet
- Further Reading
- 82. Three Important Distributed Programming Concepts
- Adi Polak
- MapReduce Algorithm
- Distributed Shared Memory Model
- Message Passing/Actors Model
- Conclusions
- 83. Time (Semantics) Won't Wait
- Marta Paes Moreira and Fabian Hueske
- 84. Tools Don't Matter, Patterns and Practices Do
- Bas Geerdink
- 85. Total Opportunity Cost of Ownership
- Joe Reis
- 86. Understanding the Ways Different Data Domains Solve Problems
- Matthew Seal
- 87. What Is a Data Engineer? Clue: We're Data Science Enablers
- Lewis Gavin
- AI and Machine Learning Models Require Data
- Clean Data == Better Model
- Finally Building a Model
- A Model Is Useful Only If Someone Will Use It
- So What Am I Getting At?
- 88. What Is a Data Mesh, and How Not to Mesh It Up
- Barr Moses and Lior Gavish
- Why Use a Data Mesh?
- The Final Link: Observability
- 89. What Is Big Data?
- Ami Levin
- 90. What to Do When You Don't Get Any Credit
- Jesse Anderson
- 91. When Our Data Science Team Didn't Produce Value
- Joel Nantais
- 92. When to Avoid the Naive Approach
- Nimrod Parasol
- 93. When to Be Cautious About Sharing Data
- Thomas Nield
- 94. When to Talk and When to Listen
- Steven Finkelstein
- 95. Why Data Science Teams Need Generalists, Not Specialists
- Eric Colson
- 96. With Great Data Comes Great Responsibility
- Lohit VijayaRenu
- Put Yourself in the User's Shoes
- Ensure Ethical Use of User Information
- Watch Your Data Footprint
- 97. Your Data Tests Failed! Now What?
- Sam Bail, PhD
- System Response
- Logging and Alerting
- Alert Response
- Stakeholder Communication
- Root Cause Identification
- Issue Resolution
- Contributors
- Index