Introduction to web scraping: Resources

Key Points

Introduction: What is web scraping?
  • Humans are good at categorizing information, computers not so much.

  • Often, data on a web site is not properly structured, making its extraction difficult.

  • Web scraping is the process of automating the extraction of data from web sites.

Selecting content on a web page with XPath
  • XML and HTML are markup languages. They provide structure to documents.

  • XML and HTML documents are made out of nodes, which form a hierarchy.

  • The hierarchy of nodes inside a document is called the node tree.

  • Relationships between nodes are: parent, child, sibling.

  • XPath queries are constructed as paths going up or down the node tree.

  • XPath queries can be run in the browser using the $x() function.
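
The points above can be tried outside the browser as well. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which understands only a limited subset of XPath. The document, its tags and the names in it are invented for this illustration. In a browser console, a query such as $x("//li/a") would play the role of the first findall() call.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this example.
DOC = """
<html>
  <body>
    <h1>Members</h1>
    <ul id="members">
      <li><a href="/m/1">Ada</a></li>
      <li><a href="/m/2">Grace</a></li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(DOC)

# Going down the tree: every <a> that is a child of an <li>, anywhere below the root.
names = [a.text for a in root.findall(".//li/a")]

# Filtering on an attribute: the <ul> whose id is "members".
member_list = root.find(".//ul[@id='members']")

# Going up the tree: ".." selects the parent of each matched <a>.
parents = [node.tag for node in root.findall(".//a/..")]

print(names)
print(member_list.tag)
print(parents)
```

The two <li> elements are siblings: they share the same parent, the <ul> node.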

Manually scrape data using browser extensions
  • Data that is relatively well structured (in a table) is relatively easy to scrape.

  • More often than not, web scraping tools need to be told what to scrape.

  • XPath can be used to define what information to scrape, and how to structure it.

  • More advanced data cleaning operations are best done in a subsequent step.

Web scraping using Python and Scrapy
  • Scrapy is a Python framework that can be used to scrape content from the web.

  • A Scrapy project is a set of configuration files and pieces of code that tell Scrapy what to do.

  • In Scrapy, a “Spider” is the code that tells it what to do on a specific website.

  • A Scrapy project can have more than one spider but needs at least one.

  • With Scrapy, we can use XPath, CSS selectors and Regular Expressions to define what elements to scrape from a page.

  • Extracted data can be stored in “Item” objects. Such objects must be defined before they can be used.

  • Scrapy will automatically store extracted data in CSV, JSON or XML format based on the file extension given in the -o option.

Conclusion
  • Web scraping is, in general, legal and won’t get you into trouble.

  • There are a few things to be careful about: notably, don’t overwhelm a web server and don’t steal content.

  • Be nice. When in doubt, ask.

Resources

Glossary

CSS selectors
CSS selectors serve a similar function to XPath, in selecting parts of an HTML document, but were designed for web development (for applying styles such as colour to parts of a document). As such, they are more popular, but are limited in what they can express relative to XPath. Every CSS selector can be translated into an equivalent XPath expression, but not vice versa. CSS selectors are constructed by specifying properties of the targets combined with properties of their context. In the browser, CSS selectors can be evaluated using the document.querySelectorAll() function.
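
As a rough illustration of that translation, the sketch below pairs a few CSS selectors with XPath expressions that select the same nodes. The page fragment is invented for this example. Only the XPath side is actually evaluated, using the limited XPath subset of Python's standard xml.etree.ElementTree module; the CSS selectors are listed purely for comparison, since the standard library does not evaluate them.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for this example.
PAGE = ET.fromstring(
    "<html><body>"
    "<div id='main'><ul><li><a href='/a'>A</a></li><li><a>B</a></li></ul></div>"
    "<ul><li>C</li></ul>"
    "</body></html>"
)

# CSS selector -> an equivalent XPath expression (relative to the root element).
EQUIVALENTS = {
    "ul > li": ".//ul/li",        # child combinator
    "div li": ".//div//li",       # descendant combinator
    "a[href]": ".//a[@href]",     # attribute presence
    "#main": ".//*[@id='main']",  # id selector
}

# Count how many nodes each XPath expression selects in the fragment.
counts = {css: len(PAGE.findall(xpath)) for css, xpath in EQUIVALENTS.items()}

for css, xpath in EQUIVALENTS.items():
    print(f"{css:10} {xpath:20} {counts[css]} match(es)")
```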

Denial of service attack
If someone sends too many requests over a short span of time, they can prevent other “normal” users from accessing the site during that time, or even cause the server to run out of resources and crash. In fact, this is such an effective way to disrupt a web site that attackers often do it on purpose. Modern web servers include measures to ward off such illegitimate use of their resources, and their first line of defense often involves refusing any further requests coming from the offending IP address. Since scraping tools usually include measures to avoid inadvertently launching such an attack, the risk of causing trouble is limited.
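
One simple way a scraper can avoid flooding a server is to enforce a minimum delay between requests. The following is a minimal sketch of that idea, not a description of how any particular scraping tool does it. The PoliteFetcher class, its parameters and the fake_fetch stand-in are hypothetical names invented for this example, and no real network request is made.

```python
import time


class PoliteFetcher:
    """Call fetch(url) no more often than once every min_delay seconds."""

    def __init__(self, fetch, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch          # function that actually retrieves a URL
        self._min_delay = min_delay  # minimum number of seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last_request = None    # time of the previous request, if any

    def get(self, url):
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            if elapsed < self._min_delay:
                # Too soon after the previous request: wait out the remainder.
                self._sleep(self._min_delay - elapsed)
        self._last_request = self._clock()
        return self._fetch(url)


def fake_fetch(url):
    # Stand-in for a real HTTP request, so the sketch runs offline.
    return f"<html>{url}</html>"


fetcher = PoliteFetcher(fake_fetch, min_delay=0.1)
pages = [fetcher.get(u) for u in ("https://www.example.com/1", "https://www.example.com/2")]
print(len(pages))
```

The clock and sleep functions are passed in as parameters only so that the waiting behaviour can be checked without actually pausing.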

Visual scrapers
Visual scrapers are tools in which the user can visually select the elements to extract, and the logical order to follow in performing a sequence of extractions. They require little or no code, and assist in designing XPath or CSS selectors. These tools vary in how flexible they are, how easy to use, to what extent they help you identify and debug scraping problems, how easy it is to keep and transfer your scraper to another service, and how costly the service is.

Web scraping
Web scraping is a technique for targeted, automated extraction of information from websites. Similar extraction can be done manually but it is usually faster, more efficient and less error-prone to automate the task. Web scraping allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable, structured format, such as a .csv file or spreadsheet.
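
As a small sketch of the last step, converting scraped content into a structured format, the example below pulls the rows out of an HTML fragment and writes them as CSV using only Python's standard library. The HTML fragment and the TableRows class are invented for this illustration, and a table is used only to keep the example short; real pages are usually messier and typically need a more robust approach.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment, invented for this example.
HTML = """
<table>
  <tr><th>Name</th><th>Role</th></tr>
  <tr><td>Ada</td><td>Chair</td></tr>
  <tr><td>Grace</td><td>Secretary</td></tr>
</table>
"""

parser = TableRows()
parser.feed(HTML)

# Write the collected rows as CSV text.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```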

Web scraping code of conduct

XPath
XPath expressions specify parts of a tree-structured document, be it XML or HTML. They can be very specific about which nodes to include or exclude.