Doing Data Science. Straight Talk from the Frontline - Helion
ISBN: 978-14-493-6389-5
Pages: 408, Format: ebook
Publication date: 2013-10-09
Bookstore: Helion
Price: 169.15 zł (previously 196.69 zł)
You save: 14% (-27.54 zł)
Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.
In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.
Topics include:
- Statistical inference, exploratory data analysis, and the data science process
- Algorithms
- Spam filters, Naive Bayes, and data wrangling
- Logistic regression
- Financial modeling
- Recommendation engines and causality
- Data visualization
- Social networks and data journalism
- Data engineering, MapReduce, Pregel, and Hadoop
Doing Data Science is a collaboration between course instructor Rachel Schutt, Senior VP of Data Science at News Corp, and data science consultant Cathy O’Neil, a senior data scientist at Johnson Research Labs, who attended and blogged about the course.
Table of contents
- Doing Data Science
- Dedication
- Preface
- Motivation
- Origins of the Class
- Origins of the Book
- What to Expect from This Book
- How This Book Is Organized
- How to Read This Book
- How Code Is Used in This Book
- Who This Book Is For
- Prerequisites
- Supplemental Reading
- About the Contributors
- Conventions Used in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- Acknowledgments
- 1. Introduction: What Is Data Science?
- Big Data and Data Science Hype
- Getting Past the Hype
- Why Now?
- Datafication
- The Current Landscape (with a Little History)
- Data Science Jobs
- A Data Science Profile
- Thought Experiment: Meta-Definition
- OK, So What Is a Data Scientist, Really?
- In Academia
- In Industry
- 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
- Statistical Thinking in the Age of Big Data
- Statistical Inference
- Populations and Samples
- Populations and Samples of Big Data
- Big Data Can Mean Big Assumptions
- Can N=ALL?
- Data is not objective
- Modeling
- What is a model?
- Statistical modeling
- But how do you build a model?
- Probability distributions
- Fitting a model
- Overfitting
- Exploratory Data Analysis
- Philosophy of Exploratory Data Analysis
- Exercise: EDA
- Sample code
- The Data Science Process
- A Data Scientist's Role in This Process
- Thought Experiment: How Would You Simulate Chaos?
- Case Study: RealDirect
- How Does RealDirect Make Money?
- Exercise: RealDirect Data Strategy
- Sample R code
- 3. Algorithms
- Machine Learning Algorithms
- Three Basic Algorithms
- Linear Regression
- Start by writing something down
- Fitting the model
- Extending beyond least squares
- Adding in modeling assumptions about the errors
- Adding other predictors
- Transformations
- Review
- Exercise
- k-Nearest Neighbors (k-NN)
- Example with credit scores
- Similarity or distance metrics
- Training and test sets
- Pick an evaluation metric
- Putting it all together
- Choosing k
- What are the modeling assumptions?
- k-means
- 2D version
- Exercise: Basic Machine Learning Algorithms
- Solutions
- Sample R code: Linear regression on the housing dataset
- Sample R code: K-NN on the housing dataset
- Summing It All Up
- Thought Experiment: Automated Statistician
- 4. Spam Filters, Naive Bayes, and Wrangling
- Thought Experiment: Learning by Example
- Why Won't Linear Regression Work for Filtering Spam?
- How About k-nearest Neighbors?
- Naive Bayes
- Bayes' Law
- A Spam Filter for Individual Words
- A Spam Filter That Combines Words: Naive Bayes
- Fancy It Up: Laplace Smoothing
- Comparing Naive Bayes to k-NN
- Sample Code in bash
- Scraping the Web: APIs and Other Tools
- Jake's Exercise: Naive Bayes for Article Classification
- Sample R Code for Dealing with the NYT API
- 5. Logistic Regression
- Thought Experiments
- Classifiers
- Runtime
- You
- Interpretability
- Scalability
- M6D Logistic Regression Case Study
- Click Models
- The Underlying Math
- Estimating α and β
- Newton's Method
- Stochastic Gradient Descent
- Implementation
- Evaluation
- Media 6 Degrees Exercise
- Sample R Code
- 6. Time Stamps and Financial Modeling
- Kyle Teague and GetGlue
- Timestamps
- Exploratory Data Analysis (EDA)
- Metrics and New Variables or Features
- What's Next?
- Cathy O'Neil
- Thought Experiment
- Financial Modeling
- In-Sample, Out-of-Sample, and Causality
- Preparing Financial Data
- Log Returns
- Example: The S&P Index
- Working out a Volatility Measurement
- Exponential Downweighting
- The Financial Modeling Feedback Loop
- Why Regression?
- Adding Priors
- A Baby Model
- Exercise: GetGlue and Timestamped Event Data
- Exercise: Financial Data
- 7. Extracting Meaning from Data
- William Cukierski
- Background: Data Science Competitions
- Background: Crowdsourcing
- The Kaggle Model
- A Single Contestant
- Their Customers
- Thought Experiment: What Are the Ethical Implications of a Robo-Grader?
- Feature Selection
- Example: User Retention
- Filters
- Wrappers
- Selecting an algorithm
- Selection criterion
- In practice
- Embedded Methods: Decision Trees
- Entropy
- The Decision Tree Algorithm
- Handling Continuous Variables in Decision Trees
- Random Forests
- User Retention: Interpretability Versus Predictive Power
- David Huffaker: Google's Hybrid Approach to Social Research
- Moving from Descriptive to Predictive
- Social at Google
- Privacy
- Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
- 8. Recommendation Engines: Building a User-Facing Data Product at Scale
- A Real-World Recommendation Engine
- Nearest Neighbor Algorithm Review
- Some Problems with Nearest Neighbors
- Beyond Nearest Neighbor: Machine Learning Classification
- The Dimensionality Problem
- Singular Value Decomposition (SVD)
- Important Properties of SVD
- Principal Component Analysis (PCA)
- Theorem: The resulting latent features will be uncorrelated
- Alternating Least Squares
- Theorem with no proof: The preceding algorithm will converge if your prior is large enough
- Fix V and Update U
- Last Thoughts on These Algorithms
- Thought Experiment: Filter Bubbles
- Exercise: Build Your Own Recommendation System
- Sample Code in Python
- 9. Data Visualization and Fraud Detection
- Data Visualization History
- Gabriel Tarde
- Mark's Thought Experiment
- What Is Data Science, Redux?
- Processing
- Franco Moretti
- A Sample of Data Visualization Projects
- Mark's Data Visualization Projects
- New York Times Lobby: Moveable Type
- Project Cascade: Lives on a Screen
- Cronkite Plaza
- eBay Transactions and Books
- Public Theater Shakespeare Machine
- Goals of These Exhibits
- Data Science and Risk
- About Square
- The Risk Challenge
- Detecting suspicious activity using machine learning
- The Trouble with Performance Estimation
- Defining the error metric
- Defining the labels
- Challenges in features and learning
- Model Building Tips
- Code readability and reusability
- Get a pair!
- Productionizing machine learning models
- Data Visualization at Square
- Ian's Thought Experiment
- Data Visualization for the Rest of Us
- Data Visualization Exercise
- 10. Social Networks and Data Journalism
- Social Network Analysis at Morningside Analytics
- Case-Attribute Data versus Social Network Data
- Social Network Analysis
- Terminology from Social Networks
- Centrality Measures
- The Industry of Centrality Measures
- Thought Experiment
- Morningside Analytics
- How Visualizations Help Us Find Schools of Fish
- More Background on Social Network Analysis from a Statistical Point of View
- Representations of Networks and Eigenvalue Centrality
- A First Example of Random Graphs: The Erdos-Renyi Model
- A Second Example of Random Graphs: The Exponential Random Graph Model
- Inference for ERGMs
- Further examples of random graphs: latent space models, small-world networks
- Data Journalism
- A Bit of History on Data Journalism
- Writing Technical Journalism: Advice from an Expert
- 11. Causality
- Correlation Doesn't Imply Causation
- Asking Causal Questions
- Confounders: A Dating Example
- OK Cupid's Attempt
- The Gold Standard: Randomized Clinical Trials
- A/B Tests
- Second Best: Observational Studies
- Simpson's Paradox
- The Rubin Causal Model
- Visualizing Causality
- Definition: The Causal Effect
- Three Pieces of Advice
- 12. Epidemiology
- Madigan's Background
- Thought Experiment
- Modern Academic Statistics
- Medical Literature and Observational Studies
- Stratification Does Not Solve the Confounder Problem
- What Do People Do About Confounding Things in Practice?
- Is There a Better Way?
- Research Experiment (Observational Medical Outcomes Partnership)
- Closing Thought Experiment
- 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
- Claudia's Data Scientist Profile
- The Life of a Chief Data Scientist
- On Being a Female Data Scientist
- Data Mining Competitions
- How to Be a Good Modeler
- Data Leakage
- Market Predictions
- Amazon Case Study: Big Spenders
- A Jewelry Sampling Problem
- IBM Customer Targeting
- Breast Cancer Detection
- Pneumonia Prediction
- How to Avoid Leakage
- Evaluating Models
- Accuracy: Meh
- Probabilities Matter, Not 0s and 1s
- Choosing an Algorithm
- A Final Example
- Parting Thoughts
- 14. Data Engineering: MapReduce, Pregel, and Hadoop
- About David Crawshaw
- Thought Experiment
- MapReduce
- Word Frequency Problem
- Enter MapReduce
- Other Examples of MapReduce
- What Can't MapReduce Do?
- Pregel
- About Josh Wills
- Thought Experiment
- On Being a Data Scientist
- Data Abundance Versus Data Scarcity
- Designing Models
- Mind the gap
- Economic Interlude: Hadoop
- A Brief Introduction to Hadoop
- Cloudera
- Back to Josh: Workflow
- So How to Get Started with Hadoop?
- 15. The Students Speak
- Process Thinking
- Naive No Longer
- Helping Hands
- Your Mileage May Vary
- Bridging Tunnels
- Some of Our Work
- 16. Next-Generation Data Scientists, Hubris, and Ethics
- What Just Happened?
- What Is Data Science (Again)?
- What Are Next-Gen Data Scientists?
- Being Problem Solvers
- Cultivating Soft Skills
- Being Question Askers
- Being an Ethical Data Scientist
- Career Advice
- Index
- About the Authors
- Colophon
- Copyright