Practical Synthetic Data Generation. Balancing Privacy and the Broad Availability of Data
ISBN: 978-14-920-7269-0
Pages: 166, Format: ebook
Publication date: 2020-05-19
Bookstore: Helion
Price: 194.65 zł (previously: 226.34 zł)
You save: 14% (-31.69 zł)
Building and testing machine learning models requires access to large and diverse data. But where can you find usable datasets without running into privacy issues? This practical book introduces techniques for generating synthetic data—fake data generated from real data—so you can perform secondary analysis to do research, understand customer behaviors, develop new products, or generate new revenue.
Data scientists will learn how synthetic data generation provides a way to make such data broadly available for secondary purposes while addressing many privacy concerns. Analysts will learn the principles and steps for generating synthetic data from real datasets. And business leaders will see how synthetic data can help accelerate time to a product or solution.
This book describes:
- Steps for generating synthetic data using multivariate normal distributions
- Methods for distribution fitting covering different goodness-of-fit metrics
- How to replicate the simple structure of original data
- An approach for modeling data structure to consider complex relationships
- Multiple approaches and metrics you can use to assess data utility
- How analysis performed on real data can be replicated with synthetic data
- Privacy implications of synthetic data and methods to assess identity disclosure
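The first item in the list above can be illustrated with a minimal sketch. It is not taken from the book: the function names `synthesize_mvn` and `max_correlation_gap` are hypothetical, the "real" dataset is simulated here only so the example runs, and the approach assumes the variables are roughly jointly normal.

```python
import numpy as np


def synthesize_mvn(real, n_samples, seed=0):
    """Fit a multivariate normal to `real` (rows are records) and sample from it."""
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)          # per-column means
    cov = np.cov(real, rowvar=False)  # covariance between columns
    return rng.multivariate_normal(mean, cov, size=n_samples)


def max_correlation_gap(real, synthetic):
    """Largest absolute difference between the two correlation matrices."""
    real_corr = np.corrcoef(real, rowvar=False)
    synth_corr = np.corrcoef(synthetic, rowvar=False)
    return float(np.max(np.abs(real_corr - synth_corr)))


# Stand-in for a real dataset (an assumption for illustration only):
# two correlated numeric columns.
rng = np.random.default_rng(42)
real = rng.multivariate_normal(
    [50.0, 120.0], [[25.0, 30.0], [30.0, 100.0]], size=5000
)

synthetic = synthesize_mvn(real, n_samples=5000, seed=1)
gap = max_correlation_gap(real, synthetic)
print(f"largest correlation difference: {gap:.3f}")
```

The correlation gap is one very simple utility check. A synthetic dataset that preserves means and correlations may still miss skewed or multimodal marginals, which is why other fitting methods and utility metrics are needed as well.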
Table of contents
- Preface
- Conventions Used in This Book
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Introducing Synthetic Data Generation
- Defining Synthetic Data
- Synthesis from Real Data
- Synthesis Without Real Data
- Synthesis and Utility
- The Benefits of Synthetic Data
- Efficient Access to Data
- Enabling Better Analytics
- Synthetic Data as a Proxy
- Learning to Trust Synthetic Data
- Synthetic Data Case Studies
- Manufacturing and Distribution
- Healthcare
- Data for cancer research
- Evaluating innovative digital health technologies
- Financial Services
- Synthetic data benchmarks
- Software testing
- Transportation
- Microsimulation models
- Data synthesis for autonomous vehicles
- Summary
- 2. Implementing Data Synthesis
- When to Synthesize
- Identifiability Spectrum
- Trade-Offs in Selecting PETs to Enable Data Access
- Decision Criteria
- PETs Considered
- Decision Framework
- Examples of Applying the Decision Framework
- Data Synthesis Projects
- Data Synthesis Steps
- Data Preparation
- The Data Synthesis Pipeline
- Synthesis Program Management
- Summary
- 3. Getting Started: Distribution Fitting
- Framing Data
- How Data Is Distributed
- Fitting Distributions to Real Data
- Generating Synthetic Data from a Distribution
- Measuring How Well Synthetic Data Fits a Distribution
- The Overfitting Dilemma
- A Little Light Weeding
- Summary
- 4. Evaluating Synthetic Data Utility
- Synthetic Data Utility Framework: Replication of Analysis
- Synthetic Data Utility Framework: Utility Metrics
- Comparing Univariate Distributions
- Comparing Bivariate Statistics
- Comparing Multivariate Prediction Models
- Distinguishability
- Summary
- 5. Methods for Synthesizing Data
- Generating Synthetic Data from Theory
- Sampling from a Multivariate Normal Distribution
- Inducing Correlations with Specified Marginal Distributions
- Copulas with Known Marginal Distributions
- Generating Realistic Synthetic Data
- Fitting Real Data to Known Distributions
- Using Machine Learning to Fit the Distributions
- Hybrid Synthetic Data
- Machine Learning Methods
- Deep Learning Methods
- Synthesizing Sequences
- Summary
- 6. Identity Disclosure in Synthetic Data
- Types of Disclosure
- Identity Disclosure
- Learning Something New
- Attribute Disclosure
- Inferential Disclosure
- Meaningful Identity Disclosure
- Defining Information Gain
- Bringing It All Together
- Unique Matches
- How Privacy Law Impacts the Creation and Use of Synthetic Data
- Issues Under the GDPR
- Is the use of the original (real) dataset to generate and/or evaluate a synthetic dataset restricted or regulated under the GDPR?
- Is sharing the original dataset with a third-party service provider to generate the synthetic dataset restricted or regulated under the GDPR?
- Does the GDPR regulate or otherwise affect (if at all) the resulting synthetic dataset?
- Issues Under the CCPA
- Is the use of the original (real) dataset to generate and/or evaluate a synthetic dataset restricted or regulated under the CCPA?
- Is sharing the original dataset with a third-party service provider to generate the synthetic dataset restricted or regulated under the CCPA?
- Does the CCPA regulate or otherwise affect (if at all) the resulting synthetic dataset?
- Issues Under HIPAA
- Is the use of the original (real) dataset to generate and/or evaluate a synthetic dataset restricted or regulated under HIPAA?
- Is sharing the original dataset with a third-party service provider to generate the synthetic dataset restricted or regulated under HIPAA?
- Does HIPAA regulate or otherwise affect (if at all) the resulting synthetic dataset?
- Article 29 Working Party Opinion
- Singling out
- Linkability
- Inference
- Closing comments on the Article 29 opinion
- Summary
- 7. Practical Data Synthesis
- Managing Data Complexity
- For Every Pre-Processing Step There Is a Post-Processing Step
- Field Types
- The Need for Rules
- Not All Fields Have to Be Synthesized
- Synthesizing Dates
- Synthesizing Geography
- Lookup Fields and Tables
- Missing Data and Other Data Characteristics
- Partial Synthesis
- Organizing Data Synthesis
- Computing Capacity
- A Toolbox of Techniques
- Synthesizing Cohorts Versus Full Datasets
- Continuous Data Feeds
- Privacy Assurance as Certification
- Performing Validation Studies to Get Buy-In
- Motivated Intruder Tests
- Who Owns Synthetic Data?
- Conclusions
- Index