Scaling Python with Dask - Helion

ebook

Authors: Holden Karau, Mika Kimmins
ISBN: 9781098119836
Pages: 226, Format: ebook
Publication date: 2023-07-19
Bookstore: Helion

Book price: 245,65 zł (previously: 285,64 zł)
You save: 14% (-39,99 zł)

Modern systems contain multi-core CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source library for parallel computing provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn.

Authors Holden Karau and Mika Kimmins show you how to use Dask computations in local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA.

With this book, you'll learn:

- What Dask is, where you can use it, and how it compares with other tools
- How to use Dask for batch data parallel processing
- Key distributed system concepts for working with Dask
- Methods for using Dask with higher-level APIs and building blocks
- How to work with integrated libraries such as scikit-learn, pandas, and PyTorch
- How to use Dask with GPUs

Table of Contents

Scaling Python with Dask eBook -- table of contents

Preface
- A Note on Responsibility
- Conventions Used in This Book
- Online Figures
- License
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
1. What Is Dask?
- Why Do You Need Dask?
- Where Does Dask Fit in the Ecosystem?
  - Big Data
  - Data Science
  - Parallel to Distributed Python
  - Dask Community Libraries
    - Accelerated Python
    - SQL engines
    - Workflow scheduling
- What Dask Is Not
- Conclusion
2. Getting Started with Dask
- Installing Dask Locally
- Hello Worlds
  - Task Hello World
    - Sleepy task
    - Nested tasks
  - Distributed Collections
    - Dask arrays
    - Dask bags and a word count
  - Dask DataFrame (Pandas/What People Wish Big Data Was)
- Conclusion
3. How Dask Works: The Basics
- Execution Backends
  - Local Backends
  - Distributed (Dask Client and Scheduler)
    - Auto-scaling
    - Important limitations with the Dask client
    - Libraries and dependencies in distributed clusters
- Dask's Diagnostics User Interface
- Serialization and Pickling
- Partitioning/Chunking Collections
  - Dask Arrays
  - Dask Bags
  - Dask DataFrames
  - Shuffles
  - Partitions During Load
- Tasks, Graphs, and Lazy Evaluation
  - Lazy Evaluation
  - Task Dependencies
  - visualize
  - Intermediate Task Results
  - Task Sizing
  - When Task Graphs Get Too Large
  - Combining Computation
  - Persist, Caching, and Memoization
- Fault Tolerance
- Conclusion
4. Dask DataFrame
- How Dask DataFrames Are Built
- Loading and Writing
  - Formats
  - Filesystems
- Indexing
- Shuffles
  - Rolling Windows and map_overlap
  - Aggregations
  - Full Shuffles and Partitioning
    - Partitioning
- Embarrassingly Parallel Operations
- Working with Multiple DataFrames
  - Multi-DataFrame Internals
  - Missing Functionality
- What Does Not Work
- What's Slower
- Handling Recursive Algorithms
- Re-computed Data
- How Other Functions Are Different
- Data Science with Dask DataFrame: Putting It Together
  - Deciding to Use Dask
  - Exploratory Data Analysis with Dask
  - Loading Data
  - Plotting Data
  - Inspecting Data
- Conclusion
5. Dask's Collections
- Dask Arrays
  - Common Use Cases
  - When Not to Use Dask Arrays
  - Loading/Saving
  - What's Missing
  - Special Dask Functions
- Dask Bags
  - Common Use Cases
  - Loading and Saving Dask Bags
  - Loading Messy Data with a Dask Bag
  - Limitations
- Conclusion
6. Advanced Task Scheduling: Futures and Friends
- Lazy and Eager Evaluation Revisited
- Use Cases for Futures
- Launching Futures
- Future Life Cycle
- Fire-and-Forget
- Retrieving Results
- Nested Futures
- Conclusion
7. Adding Changeable/Mutable State with Dask Actors
- What Is the Actor Model?
- Dask Actors
  - Your First Actor (It's a Bank Account)
  - Scaling Dask Actors
  - Limitations
- When to Use Dask Actors
- Conclusion
8. How to Evaluate Dask's Components and Libraries
- Qualitative Considerations for Project Evaluation
  - Project Priorities
  - Community
  - Dask-Specific Best Practices
  - Up-to-Date Dependencies
  - Documentation
  - Openness to Contributions
  - Extensibility
- Quantitative Metrics for Open Source Project Evaluation
  - Release History
  - Commit Frequency (and Volume)
  - Library Usage
  - Code and Best Practices
- Conclusion
9. Migrating Existing Analytic Engineering
- Why Dask?
- Limitations of Dask
- Migration Road Map
  - Types of Clusters
  - Development: Considerations
    - DataFrame performance
    - Porting SQL to Dask
  - Deployment Monitoring
- Conclusion
10. Dask with GPUs and Other Special Resources
- Transparent Versus Non-transparent Accelerators
- Understanding Whether GPUs or TPUs Can Help
- Making Dask Resource-Aware
- Installing the Libraries
- Using Custom Resources Inside Your Dask Tasks
  - Decorators (Including Numba)
  - GPUs
- GPU Acceleration Built on Top of Dask
  - cuDF
  - BlazingSQL
  - cuStreamz
- Freeing Accelerator Resources
- Design Patterns: CPU Fallback
- Conclusion
11. Machine Learning with Dask
- Parallelizing ML
- When to Use Dask-ML
- Getting Started with Dask-ML and XGBoost
  - Feature Engineering
  - Model Selection and Training
  - When There Is No Dask-ML Equivalent
  - Use with Dask's joblib
  - XGBoost with Dask
- ML Models with Dask-SQL
- Inference and Deployment
  - Distributing Data and Models Manually
  - Large-Scale Inferences with Dask
- Conclusion
12. Productionizing Dask: Notebooks, Deployment, Tuning, and Monitoring
- Factors to Consider in a Deployment Option
- Building Dask on a Kubernetes Deployment
- Dask on Ray
- Dask on YARN
- Dask on High-Performance Computing
  - Setting Up Dask in a Remote Cluster
  - Connecting a Local Machine to an HPC Cluster
- Dask JupyterLab Extension and Magics
  - Installing JupyterLab Extensions
  - Launching Clusters
  - UI
  - Watching Progress
- Understanding Dask Performance
  - Metrics in Distributed Computing
  - The Dask Dashboard
    - Task stream
    - Memory
    - Task progress
    - Task graph
  - Saving and Sharing Dask Metrics/Performance Logs
  - Advanced Diagnostics
- Scaling and Debugging Best Practices
  - Manual Scaling
  - Adaptive/Auto-scaling
  - Persist and Delete Costly Data
  - Dask Nanny
  - Worker Memory Management
  - Cluster Sizing
  - Chunking, Revisited
  - Avoid Rechunking
- Scheduled Jobs
- Deployment Monitoring
- Conclusion
A. Key System Concepts for Dask Users
- Testing
  - Manual Testing
  - Unit Testing
  - Integration Testing
  - Test-Driven Development
  - Property Testing
  - Working with Notebooks
  - Out-of-Notebook Testing
  - In-Notebook Testing: In-Line Assertions
- Data and Output Validation
- Peer-to-Peer Versus Centralized Distributed
- Methods of Parallelism
  - Task Parallelism
  - Data Parallelism
    - Shuffles and narrow versus wide transformations
    - Limitations
  - Load Balancing
- Network Fault Tolerance and CAP Theorem
- Recursion (Tail and Otherwise)
- Versioning and Branching: Code and Data
- Isolation and Noisy Neighbors
- Machine Fault Tolerance
- Scalability (Up and Down)
- Cache, Memory, Disk, and Networking: How the Performance Changes
- Hashing
- Data Locality
- Exactly Once Versus At Least Once
- Conclusion
B. Scalable DataFrames: A Comparison and Some History
- Tools
  - One Machine Only
    - Pandas
    - H2O's DataTable
    - Polars
  - Distributed
    - ASF Spark DataFrame
    - SparklingPandas
    - Spark Koalas/Spark pandas DataFrames
    - Cylon
    - Ibis
    - Modin
    - Vanilla Dask DataFrame
    - cuDF
- Conclusion
C. Debugging Dask
- Using Debuggers
- General Debugging Tips with Dask
- Native Errors
- Some Notes on Official Advice for Handling Bad Records
- Dask Diagnostics
- Conclusion
D. Streaming with Streamz and Dask
- Getting Started with Streamz on Dask
- Streaming Data Sources and Sinks
- Word Count
- GPU Pipelines on Dask Streaming
- Limitations, Challenges, and Workarounds
- Conclusion
Index