Training Data for Machine Learning - Helion
ISBN: 9781492094470
Pages: 332, Format: ebook
Publication date: 2023-11-08
Bookstore: Helion
Price: 203.15 zł (previously: 236.22 zł)
You save: 14% (-33.07 zł)
Your training data has as much to do with the success of your data project as the algorithms themselves because most failures in AI systems relate to training data. But while training data is the foundation for successful AI and machine learning, there are few comprehensive resources to help you ace the process.
In this hands-on guide, author Anthony Sarkis, lead engineer for the Diffgram AI training data software, shows technical professionals, managers, and subject matter experts how to work with and scale training data, while illuminating the human side of supervising machines. Engineering leaders, data engineers, and data science professionals alike will gain a solid understanding of the concepts, tools, and processes they need to succeed with training data.
With this book, you'll learn how to:
- Work effectively with training data including schemas, raw data, and annotations
- Transform your work, team, or organization to be more AI/ML data-centric
- Clearly explain training data concepts to other staff, team members, and stakeholders
- Design, deploy, and ship training data for production-grade AI applications
- Recognize and correct new training-data-based failure modes such as data bias
- Confidently use automation to more effectively create training data
- Successfully maintain, operate, and improve training data systems of record
Training Data for Machine Learning eBook: Table of Contents
- Preface
- Who Should Read This Book?
- For the Technical Professional and Engineer
- For the Manager and Director
- For the Subject Matter Expert and Data Annotation Specialist
- For the Data Scientist
- Why I Wrote This Book
- How This Book Is Organized
- Themes
- The Basics and Getting Started
- Concepts and Theories
- Putting It All Together
- Conventions Used in This Book
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Training Data Introduction
- Training Data Intents
- What Can You Do With Training Data?
- What Is Training Data Most Concerned With?
- Schema
- Raw data
- Annotations
- Quality
- Integrations
- The human role
- Training Data Opportunities
- Business Transformation
- Training Data Efficiency
- Tooling Proficiency
- Process Improvement Opportunities
- Why Training Data Matters
- ML Applications Are Becoming Mainstream
- The Foundation of Successful AI
- Training Data Is Here to Stay
- Training Data Controls the ML Program
- New Types of Users
- Training Data in the Wild
- What Makes Training Data Difficult?
- The Art of Supervising Machines
- A New Thing for Data Science
- ML Program Ecosystem
- Raw data media types
- Data-Centric Machine Learning
- Failures
- History of Development Affects Training Data Too
- What Training Data Is Not
- Generative AI
- Human Alignment Is Human Supervision
- Summary
- 2. Getting Up and Running
- Introduction
- Getting Up and Running
- Installation
- Tasks Setup
- Annotator Setup
- Portal (default)
- Embedded
- Data Setup
- Workflow Setup
- Data Catalog Setup
- Initial Usage
- Optimization
- Tools Overview
- Training Data for Machine Learning
- Growing Selection of Tools
- People, Process, and Data
- Embedded Supervision
- Human Computer Supervision
- Separation of End Concerns
- Standards
- Many Personas
- A Paradigm to Deliver Machine Learning Software
- Trade-Offs
- Costs
- Installed Versus Software as a Service
- Development System
- Sequentially dependent discoveries
- Scale
- Why is it useful to define scale?
- Transitioning from small to medium scale
- Large-scale thoughts
- Installation Options
- Packaging
- Storage
- Database
- Data configuration
- Annotation Interfaces
- Modeling Integration
- Multi-User versus Single-User Systems
- Integrations
- Scope
- Platform and suite solutions
- Decision-making process
- Cautions
- Point solutions
- Tools in between
- Hidden Assumptions
- Security
- Security architecture
- Attack surface
- Security configuration
- Security benefits
- User access
- Data science access
- Root-level access
- Open Source and Closed Source
- Choose an open source tool to get up and running quickly
- See the forest from the trees
- Capability over optimizations
- Ease of use in different flows
- Vastly different assumptions
- Look at settings, not first impressions
- Is it easy to use, or just lacking features?
- Customization is the name of the game
- History
- Open Source Standards
- Realizing the Need for Dedicated Tooling
- More usage, more demands
- Advent of new standards
- Summary
- 3. Schema
- Schema Deep Dive Introduction
- Labels and Attributes: What Is It?
- What Do We Care About?
- Introduction to Labels
- Attributes Introduction
- Attribute concepts
- Schema complexity trade-off
- Attribute depth
- Attribute Complexity Exceeds Spatial Complexity
- The hidden background case
- Example of sharing attributes between labels
- Technical Overview
- Example of an attribute in relation to an instance
- Data representations for engineering
- Examples of attributes
- Technical example of an attribute
- Spatial Representation: Where Is It?
- Using Spatial Types to Prevent Social Bias
- One way to avoid spatial bias
- Joint responsibility
- Trade-Offs with Types
- Computer Vision Spatial Type Examples
- Full image tag
- Box (2D)
- Polygon
- Ellipse and circle
- Cuboid
- Types with multiple uses
- Other types
- Raster mask
- Polygons and raster masks
- Keypoint geometry
- Custom spatial templates
- Complex spatial types
- Relationships, Sequences, Time Series: When Is It?
- Sequences and Relationships
- When
- Guides and Instructions
- Judgment Calls
- Relation of Machine Learning Tasks to Training Data
- Semantic Segmentation
- Image Classification (Tags)
- Object Detection
- Pose Estimation
- Relationship of Tasks to Training Data Types
- General Concepts
- Instance Concept Refresher
- Upgrading Data Over Time
- The Boundary Between Modeling and Training Data
- Raw Data Concepts
- Summary
- 4. Data Engineering
- Introduction
- Who Wants the Data?
- Annotators
- Data scientists
- ML programs
- Application engineers
- Other stakeholders
- A Game of Telephone
- When a system of record is needed
- Planning a Great System
- Naive and Training Data-Centric Approaches
- Naive approaches
- Training data-centric (system of record)
- The first steps
- Raw Data Storage
- By Reference or by Value
- Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware
- Data Storage: Where Does the Data Rest?
- External Reference Connection
- Raw Media (BLOB): Type Specific
- Images
- Video
- 3D
- Text
- Medical
- Geospatial
- Formatting and Mapping
- User-Defined Types (Compound Files)
- Defining DataMaps
- Ingest Wizards
- Organizing Data and Useful Storage
- Remote Storage
- Versioning
- Per-instance history
- Per file and per set
- Per-export snapshots
- Data Access
- Disambiguating Storage, Ingestion, Export, and Access
- File-Based Exports
- Streaming Data
- Streaming benefits
- Streaming drawbacks
- Example: Fetch and stream
- Queries Introduction
- Integrations with the Ecosystem
- Security
- Access Control
- Identity and Authorization
- Example of Setting Permissions
- Signed URLs
- Cloud connections and signed URLs
- Personally Identifiable Information
- PII-compliant data chain
- PII avoidance
- PII removal
- Pre-Labeling
- Updating Data
- Pre-labeling gotchas
- Pre-labeling data prep process
- Summary
- 5. Workflow
- Introduction
- Glue Between Tech and People
- Why Are Human Tasks Needed?
- Partnering with Non-Software Users in New Ways
- Getting Started with Human Tasks
- Basics
- Schemas' Staying Power
- User Roles
- Training
- Gold Standard Training
- Task Assignment Concepts
- Do You Need to Customize the Interface?
- How Long Will the Average Annotator Be Using It?
- Tasks and Project Structure
- Quality Assurance
- Annotator Trust
- Annotators Are Partners
- Who supervises the data
- All training data has errors
- Annotator needs
- Common Causes of Training Data Errors
- Task Review Loops
- Standard review loop
- Consensus
- Analytics
- Annotation Metrics Examples
- Data Exploration
- Data exploration tool example
- Explore processes
- Explore examples
- Similar image reduction
- Models
- Using the Model to Debug the Humans
- Distinctions Between a Dataset, Model, and Model Run
- Getting Data to Models
- Dataflow
- Overview of Streaming
- Data Organization
- Folders and static organization
- Filters and dynamic organization
- Pipelines and Processes
- The dataset connection
- Sending a single file to that set
- Relating a dataset to a template
- Putting the whole example together
- Expanding the example
- Non-linear example
- Hooks
- Direct Annotation
- Business Process Integration
- Attributes
- Depth of Labeling
- Supervising Existing Data
- Interactive Automations
- Example: Semantic Segmentation Auto Bordering
- Video
- Motion
- Examples of tracking objects through time (time series)
- Static objects
- Persistent objects: football example
- Series example
- Video events
- Detecting sequence errors
- Common issues in video annotation
- Summary
- 6. Theories, Concepts, and Maintenance
- Introduction
- Theories
- A System Is Only as Useful as Its Schema
- Who Supervises the Data Matters
- Intentionally Chosen Data Is Best
- Working with Historical Data
- Training Data Is Like Code
- Surface Assumptions Around Usage of Your Training Data
- Use definitions and processes to protect against assumptions
- Human Supervision Is Different from Classic Datasets
- Discovery versus automation
- Discovery
- General Concepts
- Data Relevancy
- Overall system design
- Raw data collection
- Need for Both Qualitative and Quantitative Evaluations
- Iterations
- Prioritization: What to Label
- Transfer Learning's Relation to Datasets (Fine-Tuning)
- Per-Sample Judgment Calls
- Ethical and Privacy Considerations
- Bias
- Bias Is Hard to Escape
- Metadata
- Preventing Lost Metadata
- Train/Val/Test Is the Cherry on Top
- Sample Creation
- Simple Schema for a Strawberry Picking System
- Geometric Representations
- Binary Classification
- Let's Manually Create Our First Set
- Upgraded Classification
- Where Is the Traffic Light?
- Maintenance
- Actions
- Increase schema depth to improve performance
- Better align the spatial type to the raw data
- Create more tasks
- Change the raw data
- Net Lift
- Levels of System Maturity of Training Data Operations
- Applied Versus Research Sets
- Training Data Management
- Quality
- Completed Tasks
- Freshness
- Maintaining Set Metadata
- Task Management
- Summary
- 7. AI Transformation and Use Cases
- Introduction
- AI Transformation
- Seeing Your Day-to-Day Work as Annotation
- The Creative Revolution of Data-centric AI
- You Can Create New Data
- You Can Change What Data You Collect
- You Can Change the Meaning of the Data
- You Can Create!
- Think Step Function Improvement for Major Projects
- Build Your AI Data to Secure Your AI Present and Future
- Appoint a Leader: The Director of AI Data
- New Expectations People Have for the Future of AI
- Sometimes Proposals and Corrections, Sometimes Replacement
- Upstream Producers and Downstream Consumers
- Producer and consumer comparison
- Producer and consumer mindset
- Why is new structure needed?
- The budget
- The AI Director's background
- Director of Training Data role
- AI-focused company modifications
- Classic company modification
- Spectrum of Training Data Team Engagement
- Dedicated Producers and Other Teams
- Organizing Producers from Other Teams
- Director of AI data responsibilities
- Training Data Evangelist
- Training Data Production Manager(s)
- Annotation Producer
- Data Engineer
- Use Case Discovery
- Rubric for Good Use Cases
- Detailed rubric
- Adds a new capability use case
- Repeating use cases
- Specialists and experts
- Evaluating a Use Case Against the Rubric
- Automatic background removal
- Evaluation example
- Conceptual Effects of Use Cases
- Ongoing impact of use cases
- The New Crowd Sourcing: Your Own Experts
- Key Levers on Training Data ROI
- What the Annotated Data Represents
- Trade-Offs of Controlling Your Own Training Data
- The Need for Hardware
- Common Project Mistakes
- Modern Training Data Tools
- Think Learning Curve, Not Perfection
- New Training and Knowledge Are Required
- Everyone
- Annotators
- Managers
- Executives
- How Companies Produce and Consume Data
- Trap to Avoid: Premature Optimization in Training Data
- No Silver Bullets
- Culture of Training Data
- New Engineering Principles
- Summary
- 8. Automation
- Introduction
- Getting Started
- Motivation: When to Use These Methods?
- Check What Part of the Schema a Method Is Designed to Work On
- What Do People Actually Use?
- Commonly used techniques
- Domain-specific
- A note on ordering
- What Kind of Results Can I Expect?
- Common Confusions
- Fully automatic labeling for novel model creation
- Proprietary automatic methods
- User Interface Optimizations
- Risks
- Trade-Offs
- Nature of Automations
- Setup Costs
- How to Benchmark Well
- How to Scope the Automation Relative to the Problem
- Correction Time
- Subject Matter Experts
- Consider How the Automations Stack
- Pre-Labeling
- Standard Pre-Labeling
- Benefits
- Caveats
- Pre-Labeling a Portion of the Data Only
- Use off-the-shelf models
- Clear separation of concerns
- The one step early trick
- How to get started pre-labeling
- Interactive Annotation Automation
- Creating Your Own
- Technical Setup Notes
- What Is a Watcher? (Observer Pattern)
- How to Use a Watcher
- Interactive Capturing of a Region of Interest
- Interactive Drawing Box to Polygon Using GrabCut
- Full Image Model Prediction Example
- Example: Person Detection for Different Attribute
- Quality Assurance Automation
- Using the Model to Debug the Humans
- Automated Checklist Example
- Domain-Specific Reasonableness Checks
- Data Discovery: What to Label
- Human Exploration
- Raw Data Exploration
- Metadata Exploration
- Adding Pre-Labeling-Based Metadata
- Augmentation
- Better Models Are Better than Better Augmentation
- To Augment or Not to Augment
- Training/runtime augmentation
- Patch and inject method (crop and inject)
- Simulation and Synthetic Data
- Simulations Still Need Human Review
- Media Specific
- What Methods Work with Which Media?
- Considerations
- Media-Specific Research
- Domain Specific
- Geometry-Based Labeling
- Multi-sensor labeling automation: spatial
- Spatial labeling
- Heuristics-Based Labeling
- Summary
- 9. Case Studies and Stories
- Introduction
- Industry
- A Security Startup Adopts Training Data Tools
- Quality Assurance at a Large-Scale Self-Driving Project
- Tricky schemas should be expanded, not shrunk
- Don't justify a clearly bad schema with domain-specific assumptions
- Tracking spatial quality and errors per image
- Regression and focused effort do not always solve specific problems
- Overfocus on complex instructions instead of fixing the schema
- Trade-offs of attempting to achieve perfection in nuanced domain-specific cases
- Understanding nuanced cases
- Learning from mistakes
- Define occlusion well
- Expand schemas
- Remember the null case
- Missing assumptions for language barriers
- Don't overfocus on spatial information
- Big-Tech Challenges
- Two annotation software teams
- Confusing the media types
- Non-queryable
- Different teams for annotations and raw media
- Moving toward a system of record
- Missing the big picture
- Solution
- Let's address loops
- Human in the loop
- The case for aligning teams around training data
- Insurance Tech Startup Lessons
- Will the production data match the training data?
- Too late to bring in training data software
- Stories
- Static Schema Prevented Innovation at Self Driving Firm
- Startup Didn't Change Schema and Wasted Effort
- Accident Prevention Startup Missed Data-Centric Approach
- Sports Startup Successfully Used Pre-Labeling
- An Academic Approach to Training Data
- Kaggle TSA Competition
- Keying in on training data
- How focusing on training data reveals commercial efficiencies
- Learning lessons and mistakes
- Summary
- Index