Hands-On Entity Resolution - Helion
ISBN: 9781098148447
Pages: 198, Format: ebook
Publication date: 2024-02-01
Bookstore: Helion
Book price: 211,65 zł (previously: 246,10 zł)
You save: 14% (-34,45 zł)
Entity resolution is a key analytic technique that enables you to identify multiple data records that refer to the same real-world entity. With this hands-on guide, product managers, data analysts, and data scientists will learn how to add value to data by cleansing, analyzing, and resolving datasets using open source Python libraries and cloud APIs.
Author Michael Shearer shows you how to scale up your data matching processes and improve the accuracy of your reconciliations. You'll be able to remove duplicate entries within a single source and join disparate data sources together when common keys aren't available. Using real-world data examples, this book helps you gain practical understanding to accelerate the delivery of real business value.
With entity resolution, you'll build rich and comprehensive data assets that reveal relationships for marketing and risk management purposes, key to harnessing the full potential of ML and AI. This book covers:
- Challenges in deduplicating and joining datasets
- Extracting, cleansing, and preparing datasets for matching
- Text matching algorithms to identify equivalent entities
- Techniques for deduplicating and joining datasets at scale
- Matching datasets containing persons and organizations
- Evaluating data matches
- Optimizing and tuning data matching algorithms
- Entity resolution using cloud APIs
- Matching using privacy-enhancing technologies
Table of Contents
- Preface
- Who Should Read This Book
- Why I Wrote This Book
- Navigating This Book
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Introduction to Entity Resolution
- What Is Entity Resolution?
- Why Is Entity Resolution Needed?
- Main Challenges of Entity Resolution
- Lack of Unique Names
- Inconsistent Naming Conventions
- Data Capture Inconsistencies
- Worked Example
- Deliberate Obfuscation
- Match Permutations
- Blind Matching?
- The Entity Resolution Process
- Data Standardization
- Record Blocking
- Attribute Comparison
- Match Classification
- Clustering
- Canonicalization
- Worked Example
- Measuring Performance
- Getting Started
- 2. Data Standardization
- Sample Problem
- Environment Setup
- Acquiring Data
- Wikipedia Data
- TheyWorkForYou Data
- Adding Facebook links
- Cleansing Data
- Wikipedia
- TheyWorkForYou
- Attribute Comparison
- Constituency
- Measuring Performance
- Sample Calculation
- Summary
- 3. Text Matching
- Edit Distance Matching
- Levenshtein Distance
- Jaro Similarity
- Jaro-Winkler Similarity
- Phonetic Matching
- Metaphone
- Match Rating Approach
- Comparing the Techniques
- Sample Problem
- Full Similarity Comparison
- Measuring Performance
- Summary
- 4. Probabilistic Matching
- Sample Problem
- Single Attribute Match Probability
- First Name Match Probability
- Last Name Match Probability
- Multiple Attribute Match Probability
- Probabilistic Models
- Bayes' Theorem
- m Value
- u Value
- Lambda (λ) Value
- Bayes Factor
- Fellegi-Sunter Model
- Match Weight
- Expectation-Maximization Algorithm
- Iteration 1
- Iteration 2
- Iteration 3
- Introducing Splink
- Configuring Splink
- Splink Performance
- Summary
- 5. Record Blocking
- Sample Problem
- Data Acquisition
- Wikipedia Data
- UK Companies House Data
- Data Standardization
- Wikipedia Data
- UK Companies House Data
- Record Blocking and Attribute Comparison
- Record Blocking with Splink
- Attribute Comparison
- Match Classification
- Measuring Performance
- Summary
- 6. Company Matching
- Sample Problem
- Data Acquisition
- Data Standardization
- Companies House Data
- Maritime and Coastguard Agency Data
- Record Blocking and Attribute Comparison
- Match Classification
- Measuring Performance
- Matching New Entities
- Summary
- 7. Clustering
- Simple Exact Match Clustering
- Approximate Match Clustering
- Sample Problem
- Data Acquisition
- Data Standardization
- Record Blocking and Attribute Comparison
- Data Analysis
- Expectation-Maximization Blocking Rules
- Match Classification and Clustering
- Cluster Visualization
- Cluster Analysis
- Summary
- 8. Scaling Up on Google Cloud
- Google Cloud Setup
- Setting Up Project Storage
- Creating a Dataproc Cluster
- Configuring a Dataproc Cluster
- Entity Resolution on Spark
- Measuring Performance
- Tidy Up!
- Summary
- 9. Cloud Entity Resolution Services
- Introduction to BigQuery
- Enterprise Knowledge Graph API
- Schema Mapping
- Reconciliation Job
- Result Processing
- Entity Reconciliation Python Client
- Measuring Performance
- Summary
- 10. Privacy-Preserving Record Linkage
- An Introduction to Private Set Intersection
- How PSI Works
- PSI Protocol Based on ECDH
- Bloom Filters
- Bloom filter example
- Golomb-Coded Sets
- GCS example
- Example: Using the PSI Process
- Environment Setup
- Google Cloud setup
- Option 1: Prebuilt PSI package
- Option 2: Build PSI package
- Server install
- Server Code
- Client Code
- Using raw encrypted server values
- Using Bloom filter-encoded encrypted server values
- Using GCS-encoded encrypted server values
- Full MCA and Companies House Sample Example
- Environment Setup
- Summary
- 11. Further Considerations
- Data Considerations
- Unstructured Data
- Data Quality
- Temporal Equivalence
- Attribute Comparison
- Set Matching
- Geocoding Location Matching
- Aggregating Comparisons
- Post Processing
- Graphical Representation
- Real-Time Considerations
- Performance Evaluation
- Pairwise Approach
- Cluster-Based Approach
- Future of Entity Resolution
- Index