Learning Data Science - Helion
ISBN: 9781098112950
Pages: 596, Format: ebook
Publication date: 2023-09-15
Bookstore: Helion
Price: 288.15 zł (previously: 339.00 zł)
You save: 15% (-50.85 zł)
As an aspiring data scientist, you appreciate why organizations rely on data for important decisions, whether it's for companies designing websites, cities deciding how to improve services, or scientists discovering how to stop the spread of disease. And you want the skills required to distill a messy pile of data into actionable insights. We call this the data science lifecycle: the process of collecting, wrangling, analyzing, and drawing conclusions from data.
Learning Data Science is the first book to cover foundational skills in both programming and statistics that encompass this entire lifecycle. It's aimed at those who wish to become data scientists or who already work with data scientists, and at data analysts who wish to cross the "technical/nontechnical" divide. If you have a basic knowledge of Python programming, you'll learn how to work with data using industry-standard tools like pandas.
- Refine a question of interest to one that can be studied with data
- Pursue data collection that may involve text processing, web scraping, etc.
- Glean valuable insights about data through data cleaning, exploration, and visualization
- Learn how to use modeling to describe the data
- Generalize findings beyond the data
Table of Contents
- Preface
- Expected Background Knowledge
- Organization of the Book
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- I. The Data Science Lifecycle
- 1. The Data Science Lifecycle
- The Stages of the Lifecycle
- Examples of the Lifecycle
- Summary
- 2. Questions and Data Scope
- Big Data and New Opportunities
- Example: Google Flu Trends
- Target Population, Access Frame, and Sample
- Example: What Makes Members of an Online Community Active?
- Example: Who Will Win the Election?
- Example: How Do Environmental Hazards Relate to an Individual's Health?
- Instruments and Protocols
- Measuring Natural Phenomena
- Example: What Is the Level of CO2 in the Air?
- Accuracy
- Types of Bias
- Types of Variation
- Summary
- 3. Simulation and Data Design
- The Urn Model
- Sampling Designs
- Sampling Distribution of a Statistic
- Simulating the Sampling Distribution
- Simulation with the Hypergeometric Distribution
- Example: Simulating Election Poll Bias and Variance
- The Pennsylvania Urn Model
- An Urn Model with Bias
- Conducting Larger Polls
- Example: Simulating a Randomized Trial for a Vaccine
- Scope
- The Urn Model for Random Assignment
- Example: Measuring Air Quality
- Summary
- 4. Modeling with Summary Statistics
- The Constant Model
- Minimizing Loss
- Mean Absolute Error
- Mean Squared Error
- Choosing Loss Functions
- Summary
- 5. Case Study: Why Is My Bus Always Late?
- Question and Scope
- Data Wrangling
- Exploring Bus Times
- Modeling Wait Times
- Summary
- II. Rectangular Data
- 6. Working with Dataframes Using pandas
- Subsetting
- Data Scope and Question
- Dataframes and Indices
- Slicing
- Filtering Rows
- Example: How Recently Has Luna Become a Popular Name?
- Aggregating
- Basic Group-Aggregate
- Example: Using .value_counts()
- Grouping on Multiple Columns
- Custom Aggregation Functions
- Pivoting
- Joining
- Inner Joins
- Left, Right, and Outer Joins
- Example: Popularity of NYT Name Categories
- Transforming
- Apply
- Example: Popularity of L Names
- The Price of Apply
- How Are Dataframes Different from Other Data Representations?
- Dataframes and Spreadsheets
- Dataframes and Matrices
- Dataframes and Relations
- Summary
- 7. Working with Relations Using SQL
- Subsetting
- SQL Basics: SELECT and FROM
- What's a Relation?
- Slicing
- Filtering Rows
- Example: How Recently Has Luna Become a Popular Name?
- Aggregating
- Basic Group-Aggregate Using GROUP BY
- Grouping on Multiple Columns
- Other Aggregation Functions
- Joining
- Inner Joins
- Left and Right Joins
- Example: Popularity of NYT Name Categories
- Transforming and Common Table Expressions
- SQL Functions
- Multistep Queries Using a WITH Clause
- Example: Popularity of L Names
- Summary
- III. Understanding The Data
- 8. Wrangling Files
- Data Source Examples
- Drug Abuse Warning Network (DAWN) Survey
- San Francisco Restaurant Food Safety
- File Formats
- Delimited Format
- Fixed-Width Format
- Hierarchical Formats
- Loosely Formatted Text
- File Encoding
- File Size
- The Shell and Command-Line Tools
- Table Shape and Granularity
- Granularity of Restaurant Inspections and Violations
- DAWN Survey Shape and Granularity
- Summary
- 9. Wrangling Dataframes
- Example: Wrangling CO2 Measurements from the Mauna Loa Observatory
- Quality Checks
- Addressing Missing Data
- Reshaping the Data Table
- Quality Checks
- Quality Based on Scope
- Quality of Measurements and Recorded Values
- Quality Across Related Features
- Quality for Analysis
- Fixing the Data or Not
- Missing Values and Records
- Transformations and Timestamps
- Transforming Timestamps
- Piping for Transformations
- Modifying Structure
- Example: Wrangling Restaurant Safety Violations
- Narrowing the Focus
- Aggregating Violations
- Extracting Information from Violation Descriptions
- Summary
- 10. Exploratory Data Analysis
- Feature Types
- Example: Dog Breeds
- Transforming Qualitative Features
- Relabel categories
- Collapse categories
- Convert quantitative to ordinal
- The Importance of Feature Types
- What to Look For in a Distribution
- What to Look For in a Relationship
- Two Quantitative Features
- One Qualitative and One Quantitative Variable
- Two Qualitative Features
- Comparisons in Multivariate Settings
- Guidelines for Exploration
- Example: Sale Prices for Houses
- Understanding Price
- What Next?
- Examining Other Features
- Delving Deeper into Relationships
- Fixing Location
- EDA Discoveries
- Summary
- 11. Data Visualization
- Choosing Scale to Reveal Structure
- Filling the Data Region
- Including Zero
- Revealing Shape Through Transformations
- Banking to Decipher Relationships
- Revealing Relationships Through Straightening
- Smoothing and Aggregating Data
- Smoothing Techniques to Uncover Shape
- Smoothing Techniques to Uncover Relationships and Trends
- Smoothing Techniques Need Tuning
- Reducing Distributions to Quantiles
- When Not to Smooth
- Facilitating Meaningful Comparisons
- Emphasize the Important Difference
- Ordering Groups
- Avoid Stacking
- Selecting a Color Palette
- Guidelines for Comparisons in Plots
- Incorporating the Data Design
- Data Collected Over Time
- Observational Studies
- Unequal Sampling
- Geographic Data
- Adding Context
- Example: 100m Sprint Times
- Creating Plots Using plotly
- Figure and Trace Objects
- Modifying Layout
- Plotting Functions
- Annotations
- Other Tools for Visualization
- matplotlib
- Grammar of Graphics
- Summary
- 12. Case Study: How Accurate Are Air Quality Measurements?
- Question, Design, and Scope
- Finding Collocated Sensors
- Wrangling the List of AQS Sites
- Wrangling the List of PurpleAir Sites
- Matching AQS and PurpleAir Sensors
- Wrangling and Cleaning AQS Sensor Data
- Checking Granularity
- Removing Unneeded Columns
- Checking the Validity of Dates
- Checking the Quality of PM2.5 Measurements
- Wrangling PurpleAir Sensor Data
- Checking the Granularity
- Visualizing timestamps
- Checking the sampling rate
- Handling Missing Values
- Exploring PurpleAir and AQS Measurements
- Creating a Model to Correct PurpleAir Measurements
- Summary
- IV. Other Data Sources
- 13. Working with Text
- Examples of Text and Tasks
- Convert Text into a Standard Format
- Extract a Piece of Text to Create a Feature
- Transform Text into Features
- Text Analysis
- String Manipulation
- Converting Text to a Standard Format with Python String Methods
- String Methods in pandas
- Splitting Strings to Extract Pieces of Text
- Regular Expressions
- Concatenation of Literals
- Character classes
- Wildcard character
- Negated character classes
- Shorthands for character classes
- Anchors and boundaries
- Escaping metacharacters
- Quantifiers
- Alternation and Grouping to Create Features
- Reference Tables
- Text Analysis
- Summary
- 14. Data Exchange
- NetCDF Data
- JSON Data
- HTTP
- REST
- XML, HTML, and XPath
- Example: Scraping Race Times from Wikipedia
- XPath
- Example: Accessing Exchange Rates from the ECB
- Summary
- V. Linear Modeling
- 15. Linear Models
- Simple Linear Model
- Example: A Simple Linear Model for Air Quality
- Interpreting Linear Models
- Assessing the Fit
- Fitting the Simple Linear Model
- Multiple Linear Model
- Fitting the Multiple Linear Model
- Example: Where Is the Land of Opportunity?
- Explaining Upward Mobility Using Commute Time
- Relating Upward Mobility Using Multiple Variables
- Feature Engineering for Numeric Measurements
- Feature Engineering for Categorical Measurements
- Summary
- 16. Model Selection
- Overfitting
- Example: Energy Consumption
- Train-Test Split
- Cross-Validation
- Regularization
- Model Bias and Variance
- Summary
- 17. Theory for Inference and Prediction
- Distributions: Population, Empirical, Sampling
- Basics of Hypothesis Testing
- Example: A Rank Test to Compare Productivity of Wikipedia Contributors
- Example: A Test of Proportions for Vaccine Efficacy
- Bootstrapping for Inference
- Basics of Confidence Intervals
- Basics of Prediction Intervals
- Example: Predicting Bus Lateness
- Example: Predicting Crab Size
- Example: Predicting the Incremental Growth of a Crab
- Probability for Inference and Prediction
- Formalizing the Theory for Average Rank Statistics
- General Properties of Random Variables
- Probability Behind Testing and Intervals
- Probability Behind Model Selection
- Summary
- 18. Case Study: How to Weigh a Donkey
- Donkey Study Question and Scope
- Wrangling and Transforming
- Exploring
- Modeling a Donkey's Weight
- A Loss Function for Prescribing Anesthetics
- Fitting a Simple Linear Model
- Fitting a Multiple Linear Model
- Bringing Qualitative Features into the Model
- Model Assessment
- Summary
- VI. Classification
- 19. Classification
- Example: Wind-Damaged Trees
- Modeling and Classification
- A Constant Model
- Examining the Relationship Between Size and Windthrow
- Modeling Proportions (and Probabilities)
- A Logistic Model
- Log Odds
- Using a Logistic Curve
- A Loss Function for the Logistic Model
- From Probabilities to Classification
- The Confusion Matrix
- Precision Versus Recall
- Summary
- 20. Numerical Optimization
- Gradient Descent Basics
- Minimizing Huber Loss
- Convex and Differentiable Loss Functions
- Variants of Gradient Descent
- Stochastic Gradient Descent
- Mini-Batch Gradient Descent
- Newton's Method
- Summary
- 21. Case Study: Detecting Fake News
- Question and Scope
- Obtaining and Wrangling the Data
- Exploring the Data
- Exploring the Publishers
- Exploring Publication Date
- Exploring Words in Articles
- Modeling
- A Single-Word Model
- Multiple-Word Model
- Predicting with the tf-idf Transform
- Summary
- Additional Material
- Data Sources
- Index