Natural Language Annotation for Machine Learning - Helion

Authors: James Pustejovsky, Amber Stubbs
ISBN: 978-14-493-5976-8
Pages: 342, Format: ebook
Publication date: 2012-10-11
Bookstore: Helion

Price: 118,15 zł (previously: 137,38 zł)
You save: 14% (-19,23 zł)

Create your own natural language training corpus for machine learning. Whether you’re working with English, Chinese, or any other natural language, this hands-on book guides you through a proven annotation development cycle—the process of adding metadata to your training corpus to help ML algorithms work more efficiently. You don’t need any programming or linguistics experience to get started.

Using detailed examples at every step, you’ll learn how the MATTER Annotation Development Process helps you Model, Annotate, Train, Test, Evaluate, and Revise your training corpus. You also get a complete walkthrough of a real-world annotation project.

- Define a clear annotation goal before collecting your dataset (corpus)
- Learn tools for analyzing the linguistic content of your corpus
- Build a model and specification for your annotation project
- Examine the different annotation formats, from basic XML to the Linguistic Annotation Framework
- Create a gold standard corpus that can be used to train and test ML algorithms
- Select the ML algorithms that will process your annotated data
- Evaluate the test results and revise your annotation task
- Learn how to use lightweight software for annotating texts and adjudicating the annotations

This book is a perfect companion to O’Reilly’s Natural Language Processing with Python.

Table of Contents

Natural Language Annotation for Machine Learning
Preface
- Natural Language Annotation for Machine Learning
- Audience
- Organization of This Book
- Software Requirements
- Conventions Used in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- Acknowledgments
  - James Adds:
  - Amber Adds:
1. The Basics
- The Importance of Language Annotation
  - The Layers of Linguistic Description
  - What Is Natural Language Processing?
- A Brief History of Corpus Linguistics
  - What Is a Corpus?
  - Early Use of Corpora
  - Corpora Today
  - Kinds of Annotation
- Language Data and Machine Learning
  - Classification
  - Clustering
  - Structured Pattern Induction
- The Annotation Development Cycle
  - Model the Phenomenon
  - Annotate with the Specification
  - Train and Test the Algorithms over the Corpus
  - Evaluate the Results
  - Revise the Model and Algorithms
- Summary
2. Defining Your Goal and Dataset
- Defining Your Goal
  - The Statement of Purpose
  - Refining Your Goal: Informativity Versus Correctness
    - The scope of the annotation task
    - What will the annotation be used for?
    - What will the overall outcome be?
    - Where will the corpus come from?
    - How will the result be achieved?
- Background Research
  - Language Resources
  - Organizations and Conferences
  - NLP Challenges
- Assembling Your Dataset
  - The Ideal Corpus: Representative and Balanced
  - Collecting Data from the Internet
  - Eliciting Data from People
    - Read speech
    - Spontaneous speech
- The Size of Your Corpus
  - Existing Corpora
  - Distributions Within Corpora
- Summary
3. Corpus Analytics
- Basic Probability for Corpus Analytics
  - Joint Probability Distributions
  - Bayes Rule
- Counting Occurrences
  - Zipf's Law
  - N-grams
- Language Models
- Summary
4. Building Your Model and Specification
- Some Example Models and Specs
  - Film Genre Classification
  - Adding Named Entities
  - Semantic Roles
- Adopting (or Not Adopting) Existing Models
  - Creating Your Own Model and Specification: Generality Versus Specificity
  - Using Existing Models and Specifications
  - Using Models Without Specifications
- Different Kinds of Standards
  - ISO Standards
    - Annotation format standards
    - Annotation specification standards
  - Community-Driven Standards
  - Other Standards Affecting Annotation
- Summary
5. Applying and Adopting Annotation Standards
- Metadata Annotation: Document Classification
  - Unique Labels: Movie Reviews
  - Multiple Labels: Film Genres
- Text Extent Annotation: Named Entities
  - Inline Annotation
  - Stand-off Annotation by Tokens
  - Stand-off Annotation by Character Location
- Linked Extent Annotation: Semantic Roles
- ISO Standards and You
- Summary
6. Annotation and Adjudication
- The Infrastructure of an Annotation Project
- Specification Versus Guidelines
- Be Prepared to Revise
- Preparing Your Data for Annotation
  - Metadata
  - Preprocessed Data
  - Splitting Up the Files for Annotation
- Writing the Annotation Guidelines
  - Example 1: Single Labels - Movie Reviews
  - Example 2: Multiple Labels - Film Genres
  - Example 3: Extent Annotations - Named Entities
  - Example 4: Link Tags - Semantic Roles
- Annotators
- Choosing an Annotation Environment
- Evaluating the Annotations
  - Cohen's Kappa (κ)
  - Fleiss's Kappa (κ)
  - Interpreting Kappa Coefficients
  - Calculating κ in Other Contexts
- Creating the Gold Standard (Adjudication)
- Summary
7. Training: Machine Learning
- What Is Learning?
- Defining Our Learning Task
- Classifier Algorithms
  - Decision Tree Learning
  - Gender Identification
  - Naïve Bayes Learning
    - Movie genre identification
    - Sentiment classification
  - Maximum Entropy Classifiers
  - Other Classifiers to Know About
- Sequence Induction Algorithms
- Clustering and Unsupervised Learning
- Semi-Supervised Learning
- Matching Annotation to Algorithms
- Summary
8. Testing and Evaluation
- Testing Your Algorithm
- Evaluating Your Algorithm
  - Confusion Matrices
  - Calculating Evaluation Scores
    - Percentage accuracy
    - Precision and recall
    - F-measure
    - Other evaluation metrics
  - Interpreting Evaluation Scores
- Problems That Can Affect Evaluation
  - Dataset Is Too Small
  - Algorithm Fits the Development Data Too Well
  - Too Much Information in the Annotation
- Final Testing Scores
- Summary
9. Revising and Reporting
- Revising Your Project
  - Corpus Distributions and Content
  - Model and Specification
  - Annotation
    - Guidelines
    - Annotators
    - Tools
  - Training and Testing
- Reporting About Your Work
  - About Your Corpus
  - About Your Model and Specifications
  - About Your Annotation Task and Annotators
  - About Your ML Algorithm
  - About Your Revisions
- Summary
10. Annotation: TimeML
- The Goal of TimeML
- Related Research
- Building the Corpus
- Model: Preliminary Specifications
  - Times
  - Signals
  - Events
  - Links
- Annotation: First Attempts
- Model: The TimeML Specification Used in TimeBank
  - Time Expressions
  - Events
  - Signals
  - Links
  - Confidence
- Annotation: The Creation of TimeBank
- TimeML Becomes ISO-TimeML
- Modeling the Future: Directions for TimeML
  - Narrative Containers
  - Expanding TimeML to Other Domains
  - Event Structures
- Summary
11. Automatic Annotation: Generating TimeML
- The TARSQI Components
  - GUTime: Temporal Marker Identification
  - EVITA: Event Recognition and Classification
  - GUTenLINK
  - Slinket
  - SputLink
  - Machine Learning in the TARSQI Components
- Improvements to the TTK
  - Structural Changes
  - Improvements to Temporal Entity Recognition: BTime
  - Temporal Relation Identification
  - Temporal Relation Validation
  - Temporal Relation Visualization
- TimeML Challenges: TempEval-2
  - TempEval-2: System Summaries
  - Overview of Results
- Future of the TTK
  - New Input Formats
  - Narrative Containers/Narrative Times
  - Medical Documents
  - Cross-Document Analysis
- Summary
12. Afterword: The Future of Annotation
- Crowdsourcing Annotation
  - Amazon's Mechanical Turk
  - Games with a Purpose (GWAP)
  - User-Generated Content
- Handling Big Data
  - Boosting
  - Active Learning
  - Semi-Supervised Learning
- NLP Online and in the Cloud
  - Distributed Computing
  - Shared Language Resources
  - Shared Language Applications
- And Finally...
A. List of Available Corpora and Specifications
- Corpora
- Specifications, Guidelines, and Other Resources
- Representation Standards
B. List of Software Resources
- Annotation and Adjudication Software
  - Multipurpose Tools
  - Corpus Creation and Exploration Tools
  - Manual Annotation Tools
  - Automated Annotation Tools
    - Multipurpose tools
    - Phonetic annotation
    - Part-of-speech taggers/syntactic parsers
    - Tokenizers/chunkers/stemmers
    - Other
- Machine Learning Resources
C. MAE User Guide
- Installing and Running MAE
- Loading Tasks and Files
  - Loading a Task
  - Loading a File
  - Annotating Entities
    - Attribute information
    - Nonconsuming tags
  - Annotating Links
  - Deleting Tags
- Saving Files
- Defining Your Own Task
  - Task Name
  - Elements (a.k.a. Tags)
  - Attributes
    - id attributes
    - start attribute
    - Attribute types
    - Default attribute values
- Frequently Asked Questions
D. MAI User Guide
- Installing and Running MAI
- Loading Tasks and Files
  - Loading a Task
  - Loading Files
- Adjudicating
  - The MAI Window
  - Adjudicating a Tag
  - Extent Tags
  - Link Tags
  - Nonconsuming Tags
  - Adding New Tags
  - Deleting Tags
- Saving Files
E. Bibliography
- References for Using Amazon's Mechanical Turk/Crowdsourcing
Index
About the Authors
Colophon
Copyright