Web Scraping with Python. Collecting More Data from the Modern Web. 2nd Edition - Helion

ebook

Autor: Ryan Mitchell
ISBN: 978-14-919-8552-6
stron: 308, Format: ebook
Data wydania: 2018-03-21
Księgarnia: Helion

Cena książki: 139,00 zł

Osoby, które kupiły tę książkę, wybierały także »

Tagi: Python - Programowanie

If programming is magic then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. The expanded edition of this practical book not only introduces you web scraping, but also serves as a comprehensive guide to scraping almost every type of data from the modern web.

Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server’s response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you’re likely to encounter.

Parse complicated HTML pages
Develop crawlers with the Scrapy framework
Learn methods to store data you scrape
Read and extract data from documents
Clean and normalize badly formatted data
Read and write natural languages
Crawl through forms and logins
Scrape JavaScript and crawl through APIs
Use and write image-to-text software
Avoid scraping traps and bot blockers
Use scrapers to test your website

Osoby które kupowały "Web Scraping with Python. Collecting More Data from the Modern Web. 2nd Edition", wybierały także:

Django 4. Praktyczne tworzenie aplikacji sieciowych. Wydanie IV 125,48 zł, (38,90 zł -69%)
Django. Kurs video. Aplikacje webowe w Pythonie 117,35 zł, (39,90 zł -66%)
Sztuczna inteligencja w Azure. Kurs video. Uczenie maszynowe i Azure Machine Learning Service 199,00 zł, (69,65 zł -65%)
Web scraping w Data Science. Kurs video. Uczenie maszynowe i architektura splotowych sieci neuronowych 178,97 zł, (62,64 zł -65%)
Data Science w Pythonie. Kurs video. Algorytmy uczenia maszynowego 199,00 zł, (69,65 zł -65%)

Spis treści

Web Scraping with Python. Collecting More Data from the Modern Web. 2nd Edition eBook -- spis treści

Preface
- What Is Web Scraping?
- Why Web Scraping?
- About This Book
- Conventions Used in This Book
- Using Code Examples
- OReilly Safari
- How to Contact Us
- Acknowledgments
I. Building Scrapers
1. Your First Web Scraper
- Connecting
- An Introduction to BeautifulSoup
  - Installing BeautifulSoup
  - Running BeautifulSoup
  - Connecting Reliably and Handling Exceptions
2. Advanced HTML Parsing
- You Dont Always Need a Hammer
- Another Serving of BeautifulSoup
  - find() and find_all() with BeautifulSoup
  - Other BeautifulSoup Objects
  - Navigating Trees
    - Dealing with children and other descendants
    - Dealing with siblings
    - Dealing with parents
- Regular Expressions
- Regular Expressions and BeautifulSoup
- Accessing Attributes
- Lambda Expressions
3. Writing Web Crawlers
- Traversing a Single Domain
- Crawling an Entire Site
  - Collecting Data Across an Entire Site
- Crawling Across the Internet
4. Web Crawling Models
- Planning and Defining Objects
- Dealing with Different Website Layouts
- Structuring Crawlers
  - Crawling Sites Through Search
  - Crawling Sites Through Links
  - Crawling Multiple Page Types
- Thinking About Web Crawler Models
5. Scrapy
- Installing Scrapy
  - Initializing a New Spider
- Writing a Simple Scraper
- Spidering with Rules
- Creating Items
- Outputting Items
- The Item Pipeline
- Logging with Scrapy
- More Resources
6. Storing Data
- Media Files
- Storing Data to CSV
- MySQL
  - Installing MySQL
  - Some Basic Commands
  - Integrating with Python
  - Database Techniques and Good Practice
  - Six Degrees in MySQL
- Email
II. Advanced Scraping
7. Reading Documents
- Document Encoding
- Text
  - Text Encoding and the Global Internet
    - A history of text encoding
    - Encodings in action
- CSV
  - Reading CSV Files
- PDF
- Microsoft Word and .docx
8. Cleaning Your Dirty Data
- Cleaning in Code
  - Data Normalization
- Cleaning After the Fact
  - OpenRefine
    - Installation
    - Using OpenRefine
      - Filtering
      - Cleaning
9. Reading and Writing Natural Languages
- Summarizing Data
- Markov Models
  - Six Degrees of Wikipedia: Conclusion
- Natural Language Toolkit
  - Installation and Setup
  - Statistical Analysis with NLTK
  - Lexicographical Analysis with NLTK
- Additional Resources
10. Crawling Through Forms and Logins
- Python Requests Library
- Submitting a Basic Form
- Radio Buttons, Checkboxes, and Other Inputs
- Submitting Files and Images
- Handling Logins and Cookies
  - HTTP Basic Access Authentication
- Other Form Problems
11. Scraping JavaScript
- A Brief Introduction to JavaScript
  - Common JavaScript Libraries
    - jQuery
    - Google Analytics
    - Google Maps
- Ajax and Dynamic HTML
  - Executing JavaScript in Python with Selenium
  - Additional Selenium Webdrivers
- Handling Redirects
- A Final Note on JavaScript
12. Crawling Through APIs
- A Brief Introduction to APIs
  - HTTP Methods and APIs
  - More About API Responses
- Parsing JSON
- Undocumented APIs
  - Finding Undocumented APIs
  - Documenting Undocumented APIs
  - Finding and Documenting APIs Automatically
- Combining APIs with Other Data Sources
- More About APIs
13. Image Processing and Text Recognition
- Overview of Libraries
  - Pillow
  - Tesseract
    - Installing Tesseract
    - pytesseract
  - NumPy
- Processing Well-Formatted Text
  - Adjusting Images Automatically
  - Scraping Text from Images on Websites
- Reading CAPTCHAs and Training Tesseract
  - Training Tesseract
- Retrieving CAPTCHAs and Submitting Solutions
14. Avoiding Scraping Traps
- A Note on Ethics
- Looking Like a Human
  - Adjust Your Headers
  - Handling Cookies with JavaScript
  - Timing Is Everything
- Common Form Security Features
  - Hidden Input Field Values
  - Avoiding Honeypots
- The Human Checklist
15. Testing Your Website with Scrapers
- An Introduction to Testing
  - What Are Unit Tests?
- Python unittest
  - Testing Wikipedia
- Testing with Selenium
  - Interacting with the Site
    - Drag and drop
    - Taking screenshots
- unittest or Selenium?
16. Web Crawling in Parallel
- Processes versus Threads
- Multithreaded Crawling
  - Race Conditions and Queues
  - The threading Module
- Multiprocess Crawling
  - Multiprocess Crawling
  - Communicating Between Processes
- Multiprocess CrawlingAnother Approach
17. Scraping Remotely
- Why Use Remote Servers?
  - Avoiding IP Address Blocking
  - Portability and Extensibility
- Tor
  - PySocks
- Remote Hosting
  - Running from a Website-Hosting Account
  - Running from the Cloud
- Additional Resources
18. The Legalities and Ethics of Web Scraping
- Trademarks, Copyrights, Patents, Oh My!
  - Copyright Law
- Trespass to Chattels
- The Computer Fraud and Abuse Act
- robots.txt and Terms of Service
- Three Web Scrapers
  - eBay versus Bidders Edge and Trespass to Chattels
  - United States v. Auernheimer and The Computer Fraud and Abuse Act
  - Field v. Google: Copyright and robots.txt
- Moving Forward
Index