Genomics in the Cloud. Using Docker, GATK, and WDL in Terra
ISBN: 978-1-491-97514-5
Pages: 496, Format: ebook
Publication date: 2020-04-02
Bookstore: Helion
Price: 254,15 zł (previously: 299,00 zł)
You save: 15% (-44,85 zł)
Data in the genomics field is booming. In just a few years, organizations such as the National Institutes of Health (NIH) will host 50+ petabytes—or over 50 million gigabytes—of genomic data, and they’re turning to cloud infrastructure to make that data available to the research community. How do you adapt analysis tools and protocols to access and analyze that volume of data in the cloud?
With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Geraldine Van der Auwera, longtime custodian of the GATK user community, and Brian O’Connor of the UC Santa Cruz Genomics Institute, guide you through the process. You’ll learn by working with real data and genomics algorithms from the field.
This book covers:
- Essential genomics and computing technology background
- Basic cloud computing operations
- Getting started with GATK, plus three major GATK Best Practices pipelines
- Automating analysis with scripted workflows using WDL and Cromwell
- Scaling up workflow execution in the cloud, including parallelization and cost optimization
- Interactive analysis in the cloud using Jupyter notebooks
- Secure collaboration and computational reproducibility using Terra
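As a taste of the workflow scripting the book teaches (Chapter 8 opens with a WDL "Hello World"), a minimal WDL workflow might look like the sketch below; the `HelloWorld` and `WriteGreeting` names are illustrative, not taken from the book:

```wdl
version 1.0

# Minimal Workflow Description Language (WDL) example:
# a workflow that calls a single task, which echoes a greeting
# and captures stdout as the task's output file.
workflow HelloWorld {
    call WriteGreeting
}

task WriteGreeting {
    command {
        echo "Hello World"
    }
    output {
        File greeting = stdout()
    }
    runtime {
        docker: "ubuntu:20.04"
    }
}
```

A workflow like this would typically be executed with the Cromwell engine, e.g. `java -jar cromwell.jar run hello.wdl`, which the book walks through on a Google Cloud VM.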
Customers who bought "Genomics in the Cloud. Using Docker, GATK, and WDL in Terra" also chose:
- Docker. Kurs video. Zostań administratorem systemów IT 119,00 zł (53,55 zł, -55%)
- Docker. Kurs video. Praca z systemem konteneryzacji i Docker Swarm 89,00 zł (40,05 zł, -55%)
- DevOps dla zdesperowanych. Praktyczny poradnik przetrwania 67,00 zł (33,50 zł, -50%)
- DevOps w praktyce. Kurs video. Jenkins, Ansible, Terraform i Docker 198,98 zł (99,49 zł, -50%)
- Kubernetes. Tworzenie niezawodnych systemów rozproszonych 69,00 zł (34,50 zł, -50%)
Table of Contents
- Foreword
- Preface
- Purpose, Scope, and Intended Audience of This Book
- What You Will Learn from This Book
- What Computational Experience Is Needed for the Exercises?
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Introduction
- The Promises and Challenges of Big Data in Biology and Life Sciences
- Infrastructure Challenges
- Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
- Cloud-Hosted Data and Compute
- Platforms for Research in the Life Sciences
- Standardization and Reuse of Infrastructure
- Being FAIR
- Wrap-Up and Next Steps
- 2. Genomics in a Nutshell: A Primer for Newcomers to the Field
- Introduction to Genomics
- The Gene as a Discrete Unit of Inheritance (Sort Of)
- The Central Dogma of Biology: DNA to RNA to Protein
- The Origins and Consequences of DNA Mutations
- Genomics as an Inventory of Variation in and Among Genomes
- The Challenge of Genomic Scale, by the Numbers
- Genomic Variation
- The Reference Genome as Common Framework
- Physical Classification of Variants
- Single-nucleotide variants
- Insertions and deletions
- Copy-number variants
- Structural variants
- Germline Variants Versus Somatic Alterations
- Germline
- Somatic
- High-Throughput Sequencing Data Generation
- From Biological Sample to Huge Pile of Read Data
- High-throughput sequencing data formats
- Types of DNA Libraries: Choosing the Right Experimental Design
- Amplicon preparation
- Whole genome preparation
- Target enrichment: gene panels and exomes
- Whole genome versus exome
- Data Processing and Analysis
- Mapping Reads to the Reference Genome
- Variant Calling
- Data Quality and Sources of Error
- Contamination and sample swaps
- Biochemical, physical, and software artifacts
- Functional Equivalence Pipeline Specification
- Wrap-Up and Next Steps
- 3. Computing Technology Basics for Life Scientists
- Basic Infrastructure Components and Performance Bottlenecks
- Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
- Levels of Compute Organization: Core, Node, Cluster, and Cloud
- Low level: core
- Mid level: node/machine
- Top level: cluster and cloud
- Addressing Performance Bottlenecks
- Data storage and I/O operations: hard drive versus solid state
- Memory: cache or crash
- Specialized hardware and code optimizations: navigating the trade-offs
- Parallel Computing
- Parallelizing a Simple Analysis
- From Cores to Clusters and Clouds: Many Levels of Parallelism
- Trade-Offs of Parallelism: Speed, Efficiency, and Cost
- Pipelining for Parallelization and Automation
- Workflow Languages
- Popular Pipelining Languages for Genomics
- Workflow Management Systems
- Virtualization and the Cloud
- VMs and Containers
- Introducing the Cloud
- Clouds are not fluffy
- Evolution of cloud infrastructure and services
- Pros and cons of the cloud
- Categories of Research Use Cases for Cloud Services
- Lightweight development: Google Cloud Shell
- Intermediate development and analysis: single VM
- Batch analysis: multiple VMs via batch services
- Framework analysis: multiple VMs via framework services
- Wrap-Up and Next Steps
- 4. First Steps in the Cloud
- Setting Up Your Google Cloud Account and First Project
- Creating a Project
- Checking Your Billing Account and Activating Free Credits
- Running Basic Commands in Google Cloud Shell
- Logging in to the Cloud Shell VM
- Using gsutil to Access and Manage Files
- Pulling a Docker Image and Spinning Up the Container
- Mounting a Volume to Access the Filesystem from Within the Container
- Setting Up Your Own Custom VM
- Creating and Configuring Your VM Instance
- Name your VM
- Choose a region (important!) and zone (not so important)
- Select a machine type
- Specify a container? (nope)
- Customize the boot disk
- Logging into Your VM by Using SSH
- Checking Your Authentication
- Copying the Book Materials to Your VM
- Installing Docker on Your VM
- Setting Up the GATK Container Image
- Stopping Your VM to Stop It from Costing You Money
- Configuring IGV to Read Data from GCS Buckets
- Wrap-Up and Next Steps
- 5. First Steps with GATK
- Getting Started with GATK
- Operating Requirements
- Command-Line Syntax
- Multithreading with Spark
- Running GATK in Practice
- Docker setup and test invocation
- Running a real GATK command
- Running a Picard command within GATK4
- Getting Started with Variant Discovery
- Calling Germline SNPs and Indels with HaplotypeCaller
- HaplotypeCaller in a nutshell
- Running HaplotypeCaller and examining the output
- Generating an output BAM to troubleshoot a surprising call
- Filtering Based on Variant Context Annotations
- Understanding variant context annotations
- Plotting variant context annotation data
- Applying hard filters to germline SNPs and indels
- Introducing the GATK Best Practices
- Best Practices Workflows Covered in This Book
- Other Major Use Cases
- Wrap-Up and Next Steps
- 6. GATK Best Practices for Germline Short Variant Discovery
- Data Preprocessing
- Mapping Reads to the Genome Reference
- Marking Duplicates
- Recalibrating Base Quality Scores
- Joint Discovery Analysis
- Overview of the Joint Calling Workflow
- Calling Variants per Sample to Generate GVCFs
- Consolidating GVCFs
- Applying Joint Genotyping to Multiple Samples
- Filtering the Joint Callset with Variant Quality Score Recalibration
- Refining Genotype Assignments and Adjusting Genotype Confidence
- Next Steps and Further Reading
- Single-Sample Calling with CNN Filtering
- Overview of the CNN Single-Sample Workflow
- Applying 1D CNN to Filter a Single-Sample WGS Callset
- Applying 2D CNN to Include Read Data in the Modeling
- Wrap-Up and Next Steps
- 7. GATK Best Practices for Somatic Variant Discovery
- Challenges in Cancer Genomics
- Somatic Short Variants (SNVs and Indels)
- Overview of the Tumor-Normal Pair Analysis Workflow
- Creating a Mutect2 PoN
- Running Mutect2 on the Tumor-Normal Pair
- Estimating Cross-Sample Contamination
- Filtering Mutect2 Calls
- Annotating Predicted Functional Effects with Funcotator
- Somatic Copy-Number Alterations
- Overview of the Tumor-Only Analysis Workflow
- Collecting Coverage Counts
- Creating a Somatic CNA PoN
- Applying Denoising
- Performing Segmentation and Calling CNAs
- Additional Analysis Options
- Tumor-Normal pair analysis
- Allelic copy ratio analysis
- Wrap-Up and Next Steps
- 8. Automating Analysis Execution with Workflows
- Introducing WDL and Cromwell
- Installing and Setting Up Cromwell
- Your First WDL: Hello World
- Learning Basic WDL Syntax Through a Minimalist Example
- Running a Simple WDL with Cromwell on Your Google VM
- Interpreting the Important Parts of Cromwell's Logging Output
- Adding a Variable and Providing Inputs via JSON
- Adding Another Task to Make It a Proper Workflow
- Your First GATK Workflow: Hello HaplotypeCaller
- Exploring the WDL
- Generating the Inputs JSON
- Running the Workflow
- Breaking the Workflow to Test Syntax Validation and Error Messaging
- Introducing Scatter-Gather Parallelism
- Exploring the WDL
- Generating a Graph Diagram for Visualization
- Wrap-Up and Next Steps
- 9. Deciphering Real Genomics Workflows
- Mystery Workflow #1: Flexibility Through Conditionals
- Mapping Out the Workflow
- Generating the graph diagram
- Identifying the code that corresponds to the diagram components
- Reverse Engineering the Conditional Switch
- How is the conditional logic set up?
- Does the conditional interfere with any assumptions we're making anywhere else?
- How does the next task know what to run on?
- Can we use conditionals to manage default settings?
- Mystery Workflow #2: Modularity and Code Reuse
- Mapping Out the Workflow
- Generating the graph diagram
- Identifying the code that corresponds to the diagram components
- Unpacking the Nesting Dolls
- What is the structure of a subworkflow?
- Where are the tasks defined?
- How is the subworkflow wired up?
- Wrap-Up and Next Steps
- 10. Running Single Workflows at Scale with Pipelines API
- Introducing the GCP Genomics Pipelines API Service
- Enabling Genomics API and Related APIs in Your Google Cloud Project
- Directly Dispatching Cromwell Jobs to PAPI
- Configuring Cromwell to Communicate with PAPI
- Running Scattered HaplotypeCaller via PAPI
- Monitoring Workflow Execution on Google Compute Engine
- Understanding and Optimizing Workflow Efficiency
- Granularity of Operations
- Balance of Time Versus Money
- Suggested Cost-Saving Optimizations
- Dynamic sizing for resource allocation
- File streaming to GATK4 tools
- Preemptible VM instances
- Platform-Specific Optimization Versus Portability
- Wrapping Cromwell and PAPI Execution with WDL Runner
- Setting Up WDL Runner
- Running the Scattered HaplotypeCaller Workflow with WDL Runner
- Monitoring WDL Runner Execution
- Wrap-Up and Next Steps
- 11. Running Many Workflows Conveniently in Terra
- Getting Started with Terra
- Creating an Account
- Creating a Billing Project
- Cloning the Preconfigured Workspace
- Running Workflows with the Cromwell Server in Terra
- Running a Workflow on a Single Sample
- Running a Workflow on Multiple Samples in a Data Table
- Monitoring Workflow Execution
- Locating Workflow Outputs in the Data Table
- Running the Same Workflow Again to Demonstrate Call Caching
- Running a Real GATK Best Practices Pipeline at Full Scale
- Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
- Examining the Preloaded Data
- Selecting Data and Configuring the Full-Scale Workflow
- Launching the Full-Scale Workflow and Monitoring Execution
- Options for Downloading Output Data (or Not)
- Wrap-Up and Next Steps
- 12. Interactive Analysis in Jupyter Notebook
- Introduction to Jupyter in Terra
- Jupyter Notebooks in General
- How Jupyter Notebooks Work in Terra
- Overview
- Accessing data
- Saving, stopping, and restarting
- Customizing your notebook's computing environment
- Sharing and collaboration
- Getting Started with Jupyter in Terra
- Inspecting and Customizing the Notebook Runtime Configuration
- Opening a Notebook in Edit Mode and Checking the Kernel
- Running the Hello World Cells
- Python Hello World
- R Hello World using Python magic methods
- Command-line tool Hello World using Python magic methods
- Using gsutil to Interact with Google Cloud Storage Buckets
- Setting Up a Variable Pointing to the Germline Data in the Book Bucket
- Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
- Visualizing Genomic Data in an Embedded IGV Window
- Setting Up the Embedded IGV Browser
- Adding Data to the IGV Browser
- Setting Up an Access Token to View Private Data
- Running GATK Commands to Learn, Test, or Troubleshoot
- Running a Basic GATK Command: HaplotypeCaller
- Loading the Data (BAM and VCF) into IGV
- Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
- Visualizing Variant Context Annotation Data
- Exporting Annotations of Interest with VariantsToTable
- Loading R Script to Make Plotting Functions Available
- Making Density Plots for QUAL by Using makeDensityPlot
- Making a Scatter Plot of QUAL Versus DP
- Making a Scatter Plot Flanked by Marginal Density Plots
- Wrap-Up and Next Steps
- 13. Assembling Your Own Workspace in Terra
- Managing Data Inside and Outside of Workspaces
- The Workspace Bucket as Data Repository
- Accessing Private Data That You Manage Outside of Terra
- Accessing Data in the Terra Data Library
- Re-Creating the Tutorial Workspace from Base Components
- Creating a New Workspace
- Adding the Workflow to the Methods Repository and Importing It into the Workspace
- Creating a Configuration Quickly with a JSON File
- Adding the Data Table
- Filling in the Workspace Resource Data Table
- Creating a Workflow Configuration That Uses the Data Tables
- Adding the Notebook and Checking the Runtime Environment
- Documenting Your Workspace and Sharing It
- Starting from a GATK Best Practices Workspace
- Cloning a GATK Best Practices Workspace
- Examining GATK Workspace Data Tables to Understand How the Data Is Structured
- Getting to Know the 1000 Genomes High Coverage Dataset
- Copying Data Tables from the 1000 Genomes Workspace
- Using TSV Load Files to Import Data from the 1000 Genomes Workspace
- Running a Joint-Calling Analysis on the Federated Dataset
- Building a Workspace Around a Dataset
- Cloning the 1000 Genomes Data Workspace
- Importing a Workflow from Dockstore
- Configuring the Workflow to Use the Data Tables
- Wrap-Up and Next Steps
- 14. Making a Fully Reproducible Paper
- Overview of the Case Study
- Computational Reproducibility and the FAIR Framework
- Original Research Study and History of the Case Study
- Assessing the Available Information and Key Challenges
- Designing a Reproducible Implementation
- Generating a Synthetic Dataset as a Stand-In for the Private Data
- Overall Methodology
- Retrieving the Variant Data from 1000 Genomes Participants
- Creating Fake Exomes Based on Real People
- Mutating the Fake Exomes
- Generating the Definitive Dataset
- Re-Creating the Data Processing and Analysis Methodology
- Mapping and Variant Discovery
- Variant Effect Prediction, Prioritization, and Variant Load Analysis
- Analytical Performance of the New Implementation
- The Long, Winding Road to FAIRness
- Final Conclusions
- Glossary
- Index