Accumulo. Application Development, Table Design, and Best Practices - Helion

Authors: Aaron Cordova, Billie Rinaldi, Michael Wall
ISBN: 978-14-919-4692-3
Pages: 552, Format: ebook
Publication date: 2015-07-01
Bookstore: Helion

Get up to speed on Apache Accumulo, the flexible, high-performance key/value store created by the National Security Agency (NSA) and based on Google’s BigTable data storage system. Written by former NSA team members, this comprehensive tutorial and reference covers Accumulo architecture, application development, table design, and cell-level security.

With clear information on system administration, performance tuning, and best practices, this book is ideal for developers seeking to write Accumulo applications, administrators charged with installing and maintaining Accumulo, and other professionals interested in what Accumulo has to offer. You will find everything you need to use this system fully.

- Get a high-level introduction to Accumulo’s architecture and data model
- Take a rapid tour through single- and multiple-node installations, data ingest, and query
- Learn how to write Accumulo applications for several use cases, based on examples
- Dive into Accumulo internals, including information not available in the documentation
- Get detailed information for installing, administering, tuning, and measuring performance
- Learn best practices based on successful implementations in the field
- Find answers to common questions that every new Accumulo user asks

Table of Contents

Foreword
Preface
- Goals and Audience
- Conventions Used in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- Acknowledgments
1. Architecture and Data Model
- Recent Trends
- The Role of Databases
- Distributed Applications
- Fast Random Access
  - Accessing Sorted Versus Unsorted Data
- Versions
- History
- Data Model
  - Rows and Columns
  - Data Modification and Timestamps
- Advanced Data Model Components
  - Column Families
  - Column Visibility
  - Full Data Model
- Tables
- Introduction to the Client API
  - Approach to Rows
  - Exploiting Sort Order
- Architecture Overview
  - ZooKeeper
  - Hadoop
  - Accumulo
    - Tablet servers
    - Master
    - Garbage collector
    - Monitor
    - Client
    - Thrift proxy
  - A Typical Cluster
- Additional Features
  - Automatic Data Partitioning
  - High Consistency
  - Automatic Load Balancing
  - Massive Scalability
  - Failure Tolerance and Automatic Recovery
  - Support for Analysis: Iterators
  - Support for Analysis: MapReduce Integration
  - Data Lifecycle Management
  - Compression
  - Robust Timestamps
- Accumulo and Other Data Management Systems
  - Comparisons to Relational Databases
    - SQL
    - Transactions
    - Normalization
  - Comparisons to Other NoSQL Databases
    - Data model
    - Key ordering
    - Tight Hadoop integration
    - High versus eventual consistency
    - Column visibility and access control
    - Iterators
    - Dynamic column families and locality groups
    - Support for very large rows
    - Parallelized BatchScanners
    - Namespaces
- Use Cases Suited for Accumulo
  - A New Kind of Flexible Analytical Warehouse
  - Building the Next Gmail
  - Massive Graph or Machine-Learning Problems
  - Relieving Relational Databases
  - Massive Search Applications
  - Applications with a Long History of Versioned Data
2. Quick Start
- Demo of the Shell
  - The help Command
  - Creating a Table and Inserting Some Data
  - Scanning for Data
  - Using Authorizations
  - Using a Simple Iterator
- Demo of Java Code
  - Creating a Table and Inserting Some Data
  - Scanning for Data
  - Using Authorizations
  - Using a Simple Iterator
- A More Complete Installation
- Other Important Resources
- One Last Example with a Unit Test
- Additional Resources
3. Basic API
- Development Environment
  - Obtaining the Client Library
  - Using Maven
    - Using Maven with an IDE
  - Configuring the Classpath
- Introduction to the Example Application: Wikipedia Pages
  - Wikipedia Data
  - Data Modeling
  - Obtaining Example Code
  - Downloading Sample Wikipedia Pages
  - Downloading All English Wikipedia Articles
- Connect
- Insert
  - Committing Mutations
  - Handling Errors
  - Insert Example
  - Using Lexicoders
  - Writing to Multiple Tables
- Lookups and Scanning
  - Lookup Example
  - Crafting Ranges
  - Grouping by Rows
  - Reusing Scanners
  - Isolated Row Views
  - Tuning Scanners
- Batch Scanning
- Update: Overwrite
  - Overwrite Example
  - Allowing Multiple Versions
- Update: Appending or Incrementing
- Update: Read-Modify-Write and Conditional Mutations
  - Conditional Mutation API
  - Conditional Mutation Batch API
  - Conditional Mutation Example
- Delete
  - Deleting and Reinserting
  - Removing Deleted Data from Disk
  - Batch Deleter
- Testing
  - MockAccumulo
  - MiniAccumuloCluster
4. Table API
- Basic Table Operations
  - Creating Tables
    - Options for creating tables
  - Renaming
  - Deleting Tables
  - Deleting Ranges of Rows
  - Deleting Entries Returned from a Scan
  - Configuring Table Properties
  - Locality Groups
    - Locality groups example
  - Bloom Filters
    - Key functors
  - Caching
  - Tablet Splits
    - Quickly and automatically splitting
    - Merging tablets
  - Compacting
    - Compaction properties
  - Additional Properties
  - Online Status
  - Cloning
    - Using cloning as a snapshotting mechanism
  - Importing and Exporting Tables
  - Additional Administrative Methods
- Table Namespaces
  - Creating
  - Renaming
  - Setting Namespace Properties
  - Deleting
  - Configuring Iterators
  - Configuring Constraints
  - Testing Class Loading for a Namespace
- Instance Operations
  - Setting Properties
    - Configuration
  - Cluster Information
  - Precedence of Properties
5. Security API
- Authentication
- Permissions
  - System Permissions
  - Namespace Permissions
  - Table Permissions
- Authorizations
  - Column Visibilities
  - Limiting Authorizations Written
  - An Example of Using Authorizations
  - Using a Default Visibility
  - Making Authorizations Work
- Auditing Security Operations
- Custom Authentication, Permissions, and Authorization
  - Custom Authentication Example
- Other Security Considerations
  - Using an Application Account for Multiple Users
  - Network
  - Disk Encryption
6. Server-Side Functionality and External Clients
- Constraints
  - Constraint Configuration API
  - Constraint Configuration Example
  - Creating Custom Constraints
  - Custom Constraint Example
- Iterators
  - Iterator Configuration API
  - VersioningIterator
  - Iterator Configuration Example
  - Adding Iterators by Setting Properties
  - Filtering Iterators
    - Built-in filters
    - Custom filters
    - Custom filtering iterator example
  - Combiners
    - Combiners for incrementing or appending updates
    - Built-in combiners
    - Custom combiners
    - Custom combiner example
  - Other Built-in Iterators
    - WholeRowIterator example
    - Low-level iterator API
- Thrift Proxy
  - Starting a Proxy
  - Python Example
  - Generating Client Code
- Language-Specific Clients
- Integration with Other Tools
  - Apache Hive
    - Table options
    - Serializing values
    - Additional options
    - Hive example
    - Optimizing Hive queries
  - Apache Pig
    - Pig example
  - Apache Kafka
- Integration with Analytical Tools
7. MapReduce API
- Formats
- Writing Worker Classes
- MapReduce Example
- MapReduce over Underlying RFiles
  - Example of Running a MapReduce Job over RFiles
- Delivering Rows to Map Workers
- Ingesters and Combiners as MapReduce Computations
- MapReduce and Bulk Import
  - Bulk Ingest to Avoid Duplicates
8. Table Design
- Single-Table Designs
  - Implementing Paging
- Secondary Indexing
  - Index Partitioned by Term
  - Querying a Term-Partitioned Index
    - Combining query terms
    - Querying for a term in a specific field
  - Maintaining Consistency Across Tables
    - Using MultiTableBatchWriter for consistency
  - Index Partitioned by Document
  - Querying a Document-Partitioned Index
  - Indexing Data Types
    - Using Lexicoders in indexing
    - Custom Lexicoder example: Inet4AddressLexicoder
- Full-Text Search
  - wikipediaMetadata
  - wikipediaIndex
  - wikipedia
  - wikipediaReverseIndex
  - Ingesting WikiSearch Data
  - Querying the WikiSearch Data
- Designing Row IDs
  - Lexicoders
  - Composite Row IDs
  - Key Size
  - Avoiding Hotspots
  - Designing Row IDs for Consistent Updates
- Designing Values
  - Storing Files and Large Values
  - Human-Readable Versus Binary Values and Formatters
- Designing Authorizations
- Designing Column Visibilities
9. Advanced Table Designs
- Time-Ordered Data
- Graphs
  - Building an Example Graph: Twitter
  - Traversing Graph Tables
  - Traversing the Example Twitter Graph
    - Blueprints for Accumulo
    - Titan
- Semantic Triples
  - Semantic Triples Example
- Spatial Data
  - Open Source Projects
  - Space-Filling Curves
- Multidimensional Data
- D4M and Matlab
  - D4M Example
    - Adding D4M to Octave or Matlab
    - Loading example data
    - Load example data using Java
- Machine Learning
  - Storing Feature Vectors
  - A Machine-Learning Example
- Approximating Relational and SQL Database Properties
  - Schema Constraints
  - SQL Operations
    - SELECT
    - WHERE
    - JOIN, GROUP BY, and ORDER BY
    - Strategies for Joins
    - GROUP BY and ORDER BY
10. Internals
- Tablet Server
  - Write Path
  - Read Path
  - Resource Manager
    - Minor compaction
    - Major compaction
    - Merging minor compaction
    - Splits
  - Write-Ahead Logs
    - Recovery
  - File Formats
    - RFile optimizations
    - Relative key encoding
    - Locality groups
    - Bloom filters
  - Caching
- Master
  - FATE
  - Load Balancer
- Garbage Collector
- Monitor
- Tracer
- Client
  - Locating Keys
- Metadata Table
- Uses of ZooKeeper
- Accumulo and the CAP Theorem
11. Administration: Setup
- Preinstallation
  - Operating Systems
  - Kernel Tweaks
    - Swappiness
    - Number of open files
  - Native Libraries
  - User Accounts
  - Linux Filesystem
  - System Services
  - Software Dependencies
    - Apache Hadoop
    - Apache ZooKeeper
- Installation
  - Tarball Distribution Install
  - Installing on Cloudera’s CDH
  - Installing on Hortonworks HDP
  - Installing on MapR
  - Running via Amazon Web Services
  - Building from Source
    - Building a tarball distribution
    - Building native libraries
- Configuration
  - File Permissions
  - Server Configuration Files
    - accumulo-env.sh
    - accumulo-site.xml
  - Client Configuration
  - Deploying JARs
    - Using lib/ext/
    - Custom JAR loading example
    - Using HDFS
  - Setting Up Automatic Failover
  - Initialization
    - To reinitialize
    - Multiple instances
- Running Very Large-Scale Clusters
  - Networking
  - Limits
  - Metadata Table
  - Tablet Sizing
  - File Sizing
  - Using Multiple HDFS Volumes
    - Handling NameNode hostname changes
- Security
  - Column Visibilities and Accumulo Clients
  - Supporting Software Security
  - Network Security
    - Configuring SSL
  - Encryption of Data at Rest
  - Kerberized Hadoop
  - Application Permissions
12. Administration: Running
- Starting Accumulo
  - Via the start-all.sh Script
  - Via init.d Scripts
- Stopping Accumulo
  - Via the stop-all.sh Script
  - Via init.d Scripts
  - Stopping Individual Processes
- Starting After a Crash
- Monitoring
  - Monitor Web Service
    - Overview
    - Master Server View
    - Tablet Servers View
    - Server Activity View
    - Garbage Collector View
    - Tables View
    - Recent Traces View
    - Documentation View
    - Recent Logs View
  - JMX Metrics
  - Logging
  - Tracing
    - Tracing in the shell
- Cluster Changes
  - Adding New Worker Nodes
  - Removing Worker Nodes
  - Adding New Control Nodes
  - Removing Control Nodes
- Table Operations
  - Changing Settings
    - Altering load balancing
    - Configuring iterators
    - Safely deploying custom iterators
  - Changing Online Status
  - Cloning
    - Altering cloned table properties
    - Cloning for MapReduce
  - Import, Export, and Backups
    - Exporting a table
    - Importing an exported table
    - Bulk-loading files from a MapReduce job
- Data Lifecycle
  - Versioning
  - Data Age-off
    - Ensuring that deletes are removed from tables
  - Compactions
    - Using major compaction to apply changes
    - Compacting specific ranges
  - Merging Tablets
  - Garbage Collection
- Failure Recovery
  - Typical Failures
    - Single machine failure
    - Single machine unresponsiveness
    - Network partitions
  - More-Serious Failures
    - All NameNodes failing simultaneously
    - All ZooKeeper servers failing simultaneously
    - Power loss to the data center
    - Loss of all replicas of an HDFS data block
  - Tips for Restoring a Cluster
    - Replay data
    - Back up NameNode metadata
    - Back up table configuration, users, and split points
    - Turn on HDFS trash
    - Create an empty RFile
    - Take Hadoop out of safe mode manually
  - Troubleshooting
    - Ensure that processes are running
    - Check log messages
    - Understand network partitions
    - Exception when scanning a table in the shell
    - Graphs on the monitor are blocky
    - Tablets not balancing across tablet servers
    - Calculate the size of changes to a cloned table
    - Unexpected or unexplained query results
    - Slow queries
    - Look at ZooKeeper
    - Use the listscans command
    - Look at user-initiated compactions
    - Inspect RFiles
13. Performance
- Understanding Read Performance
- Understanding Write Performance
  - BatchWriters
  - Bulk Loading
- Hardware Selection
  - Storage Devices
    - Hard disk drives
    - Storage-area networks
    - Solid-state disks
  - Networking
  - Virtualization
  - Running in a Public Cloud Environment
- Cluster Sizing
  - Modeling Required Write Performance
  - Cluster Planning Example
    - Estimated total volume of data
    - Types of user requests and indexes required
    - Compactions
    - Rate of incoming data
    - Age-off strategy
- Analyzing Performance
  - Using Tracing
  - Using the Monitor
  - Using Local Logs
- Tablet Server Tuning
  - External Settings
    - HDFS threads used to transfer data
    - HDFS durable sync
  - Memory Settings
    - tserver.memory.maps.max
    - tserver.memory.maps.native.enabled
    - Cache settings
    - Java heap size
    - tserver.mutation.queue.max
  - Write-Ahead Log Settings
    - tserver.wal.replication
    - tserver.wal.sync
    - tserver.wal.sync.method
  - Resource Settings
    - tserver.compaction.major.concurrent.max
    - tserver.compaction.minor.concurrent.max
    - tserver.readahead.concurrent.max
  - Timeouts
  - Scaling Vertically
- Cluster Tuning
  - Splitting Tables
  - Balancing Tablets
  - Balancing Reads and Writes
  - Data Locality
  - Sharing ZooKeeper
A. Shell Commands Quick Reference
- Debugging
- Exiting
- Help
- Iterator
- Permissions Administration
- Shell Execution
- Shell State
- Table Administration
- Table Control
- User Administration
- Writing, Reading, and Removing Data
B. Metadata Table
- Row ID
- File Column Family
- Scan Column Family
- future, last, and loc Column Families
- log Column Family
- srv Column Family
- ~tab:~pr Column
- Other Columns
C. Data Stored in ZooKeeper
- masters, tservers, gc, monitor, and tracers Nodes
- problems/problem_info Nodes
- root_tablet Node
- tables/table_id Nodes
- config/system_property_name Node
- users/username Nodes
- Other Nodes
Index