Isoxya 3 deep dive: Processors

2022-02-16 · Isoxya

This is the second of a series of deep dives into Isoxya 3, the scalable web crawler and scraper, following the previous dive into the open-source Isoxya API. This post is about Isoxya processors, which implement the data-processing logic of a crawl or scrape, customising Isoxya to a specific product or industry. After an introduction to the processor interface, two specific plugins are discussed: Isoxya plugin Crawler HTML and Isoxya plugin Spellchecker, both open-source under the BSD-3 licence and freely available on GitHub and Docker Hub.

Overview

Many crawlers are designed for a specific purpose, such as gathering the metrics needed for SEO. Some scrapers allow customisation of which data to extract, such as prices from a catalogue. Flexibility can vary widely: in the simplest implementations, it might be possible to use regular expressions to pattern-match; in more complex implementations, it might be possible to parse the DOM and use XPath or CSS selectors. Thus, the extraction syntax might be anything from a simple list of regexes to some JavaScript to run within a VM. The crawling logic itself might be controllable via some simple options, if at all.

But what if you wanted to extract data using some third-party library in Ruby? Or if you wanted to parse the page using a compiled language such as Java? What if the page wasn’t HTML at all, but an image? Could you run it through a machine learning SVM classifier using Python to try to identify it? What about scanning uploads for viruses using some third-party executable? Or checksumming URLs to detect changes? Or spellchecking paragraphs in your native language? The traditional crawling and scraping model breaks down quickly under such requirements.

Isoxya solves this by making no attempt to force implementations, libraries, or even languages on you. That is, decisions about how to parse pages (as HTML or anything else), which content to extract and how, and even which pages to crawl next are delegated to plugins, which can do whatever is required so long as they conform to a simple, well-documented JSON interface. The upshot is that Isoxya can be used to build a variety of products, without requiring each of them to implement crawl, queue, or deduplication logic directly.

Interface

The Isoxya processor interface is designed to satisfy this principle: if you can write a small program to process a single page, Isoxya can scale that to process hundreds, thousands, or even millions of pages efficiently and predictably, typically without requiring any code changes. Once Isoxya’s spiders have fetched a page, it is sent to a processor plugin along with some metadata. The plugin replies with answers to two questions:

  1. What data, if any, should be extracted from this page? (data)
  2. Which pages, if any, should also be crawled following this one? (urls)

Everything else is handled by Isoxya itself. An example plugin request is:

{
  "body": "…",
  "header": {
    "Accept-Ranges": "bytes",
    "Age": "602398",
    "Cache-Control": "max-age=604800",
    "Content-Encoding": "gzip",
    "Content-Length": "648",
    "Content-Type": "text/html; charset=UTF-8",
    "Date": "Thu, 16 Dec 2021 14:29:02 GMT",
    "Etag": "\"3147526947+ident\"",
    "Expires": "Thu, 23 Dec 2021 14:29:02 GMT",
    "Last-Modified": "Thu, 17 Oct 2019 07:18:26 GMT",
    "Server": "ECS (bsa/EB13)",
    "Vary": "Accept-Encoding",
    "X-Cache": "HIT"
  },
  "meta": {
    "config": null,
    "duration": 0.106178796,
    "error": null,
    "method": "GET",
    "status": 200,
    "url": "http://example.com:80/"
  }
}

Here, body is the Base-64-encoded body of the page crawled, header contains its HTTP headers, and meta contains metadata about the request. The processor plugin can handle this however it pleases. An example plugin response is:

{
  "data": {
    "title": "Example Domain"
  },
  "urls": []
}

Here, data contains the only thing this plugin cares about—extracting the page title—and urls contains outbound links—empty for this page. In fact, it’s not necessary to return data if the plugin doesn’t extract anything, and urls is only necessary if the plugin wants to control how the crawl itself progresses. More on this below.
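Because plugins only have to speak this JSON interface, they can be written in any language. As a minimal sketch, assuming the plugin is deployed as a small HTTP service which receives the request shown above via POST and replies in the response format shown (the port, route, and regex-based title extraction are illustrative choices, not taken from any official plugin), a Python version might look like:

# Minimal processor plugin sketch: an HTTP service accepting the JSON request
# shown above and replying with "data" and "urls". The port, route, and
# title-extraction logic are illustrative assumptions.
import base64
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer


class TitleProcessor(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        req = json.loads(self.rfile.read(length))

        # "body" is the Base-64-encoded page; decode it before parsing.
        html = base64.b64decode(req["body"]).decode("utf-8", errors="replace")
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)

        resp = {
            "data": {"title": match.group(1).strip()} if match else {},
            "urls": [],  # returning no URLs: this plugin doesn't steer the crawl
        }
        payload = json.dumps(resp).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), TitleProcessor).serve_forever()

In a real deployment, such a service would be registered with the engine as a processor, so that every page the spiders fetch is sent to it.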

Isoxya plugin Crawler HTML

Isoxya plugin Crawler HTML provides a core run loop for the crawling engine, parsing each page as static HTML, and extracting request metadata and outbound URLs. In terms of the processor interface explained above, its main purpose is to return a list of urls to crawl next. Typically, only internal URLs are followed (although Isoxya Pro can optionally validate external URLs too). The plugin also surfaces some metadata about the page and its headers, making it useful for checking links, testing page speeds, or validating cache hits. An example response is:

{
  "data": {
    "duration": 0.106178796,
    "error": null,
    "header": {
      "Accept-Ranges": "bytes",
      "Age": "602398",
      "Cache-Control": "max-age=604800",
      "Content-Encoding": "gzip",
      "Content-Length": "648",
      "Content-Type": "text/html; charset=UTF-8",
      "Date": "Thu, 16 Dec 2021 14:29:02 GMT",
      "Etag": "\"3147526947+ident\"",
      "Expires": "Thu, 23 Dec 2021 14:29:02 GMT",
      "Last-Modified": "Thu, 17 Oct 2019 07:18:26 GMT",
      "Server": "ECS (bsa/EB13)",
      "Vary": "Accept-Encoding",
      "X-Cache": "HIT"
    },
    "method": "GET",
    "status": 200
  },
  "urls": [
    "https://www.iana.org/domains/example"
  ]
}

You can use this plugin with the Isoxya engine to provide crawling functionality for HTML pages. Used as-is, it is enough to make Isoxya useful out of the box. But if you want to customise which URLs get crawled (only the first 10 per page, for example), you can adapt this plugin, or replace it entirely with your own for full control.
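As a sketch of the sort of customisation just mentioned (illustrative only, and not taken from the plugin’s actual source), the URL-selection step of an adapted plugin might cap each page at its first 10 internal links before returning them as urls:

# Illustrative sketch (not the Crawler HTML plugin's actual code): keep only
# the first 10 internal links per page before returning them to Isoxya.
from urllib.parse import urljoin, urlsplit


def select_urls(page_url, hrefs, limit=10):
    """Resolve, filter to the page's own host, de-duplicate, and cap the links."""
    page_host = urlsplit(page_url).netloc
    seen, selected = set(), []
    for href in hrefs:
        url = urljoin(page_url, href)          # resolve relative links
        if urlsplit(url).netloc != page_host:  # internal links only
            continue
        if url in seen:                        # skip duplicates
            continue
        seen.add(url)
        selected.append(url)
        if len(selected) >= limit:
            break
    return selected

Here, hrefs stands in for whatever links the plugin has already parsed out of the HTML; only the filtering itself is being illustrated.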

Isoxya plugin Spellchecker

Isoxya plugin Spellchecker spellchecks entire websites, even those with millions of pages, and supports 7 languages. It is a good example of a data-extraction plugin operating within a non-traditional problem space. In terms of the processor interface explained above, its sole purpose is to return data containing potential spelling mistakes, with no urls, since it doesn’t attempt to control the crawl itself. An example response is:

{
  "data": [
    {
      "paragraph": "Isoxya",
      "results": [
        {
          "correct": false,
          "offset": 1,
          "status": "miss",
          "suggestions": [
            "Oxyanion"
          ],
          "word": "Isoxya"
        }
      ]
    }
  ],
  "urls": []
}

The structure of data is free-form; that is, each processor plugin can define whichever schema is most appropriate, and this will flow through the engine and out to Elasticsearch or whichever service is receiving the processed data.
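To make the free-form nature of data concrete, here is a toy sketch of a data-extraction plugin’s core logic (deliberately simplistic, and not the Spellchecker plugin’s actual implementation, which uses real dictionaries): it checks each paragraph’s words against a small known-word set and builds a structure similar to the response above:

# Toy sketch only: check each paragraph against a stand-in word list and build
# a free-form "data" structure resembling the Spellchecker response above.
KNOWN_WORDS = {"example", "domain", "documents"}  # stand-in for a real dictionary


def check_paragraph(paragraph):
    results = []
    for offset, word in enumerate(paragraph.split(), start=1):
        token = word.strip(".,;:!?").lower()
        if token and token not in KNOWN_WORDS:
            results.append({
                "correct": False,
                "offset": offset,
                "status": "miss",
                "suggestions": [],  # a real checker would propose corrections
                "word": word,
            })
    return {"paragraph": paragraph, "results": results}


def process(paragraphs):
    """Build the plugin response: free-form data, and no urls to steer the crawl."""
    return {"data": [check_paragraph(p) for p in paragraphs if p.strip()],
            "urls": []}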

Combinations

In the examples above, the Spellchecker plugin returns data but no urls. You might be wondering what the effect of this is: Does it mean that, on its own, it can only spellcheck one page? (Yes.) Must it be extended before it can spellcheck an entire site? (No.) Since the Spellchecker plugin is focussed only on data extraction, it has no concept of the site graph itself. That means that when the crawl starts, the homepage will be crawled and sent to the plugin, the plugin will check for spelling mistakes and return them in data, and the crawl will complete after a single page, regardless of the size of the site!

In contrast, the Crawler HTML plugin is focussed on traversing the site graph, so it can crawl multi-page sites with wild abandon, but without any knowledge of human language or how to spellcheck. If only there were a way of combining these programs somehow…

The API documentation for the Crawl endpoint gives a hint: processors is an array of objects, not simply an object. In fact, you can specify multiple processors for a single crawl, and the outputs from these will be combined automatically! In almost all cases, the combination should be:

  • 0–1 plugin to traverse the site graph (populating urls like the Crawler HTML plugin)

  • 0+ plugins to extract useful data (populating data like the Spellchecker plugin)

What effect would different combinations of these plugins have?

  • 0 traversal, 0 extraction: impossible since the crawl could never complete

  • 1 traversal, 0 extraction: mapping the site graph plus any metadata

  • 1 traversal, 1 extraction: crawling the site and extracting from each page

  • 1 traversal, 2+ extractions: requesting each page once with multiple extractions

  • 2+ traversals: complex and not usually useful
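The example script referenced below takes care of this; as a rough sketch of the kind of request it might make (the engine address, endpoint path, and request schema are assumptions here, so consult the Crawl endpoint documentation for the exact details; the processor and streamer hrefs are copied from the response below purely for illustration):

# Sketch only: endpoint path and request schema are assumptions; see the Crawl
# endpoint documentation. The hrefs below are illustrative placeholders.
import requests

ISOXYA = "http://localhost:8000"                    # assumed engine address
SITE = "/site/aHR0cHM6Ly93d3cuaXNveHlhLmNvbTo0NDM"  # href of the site resource

crawl = {
    # one traversal plugin plus one extraction plugin, as recommended above
    "processors": [
        {"href": "/processor/719dd86b-15f9-4356-b6a9-f032f0fccde8"},  # Crawler HTML
        {"href": "/processor/52950af0-24dd-4761-a6cb-b6a782c9f57c"},  # Spellchecker
    ],
    "streamers": [
        {"href": "/streamer/ffb2309e-7453-4c86-b11a-f4480415f089"},
    ],
}

resp = requests.post(ISOXYA + SITE + "/crawl", json=crawl)
resp.raise_for_status()
print(resp.json())  # should resemble the response shown below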

Using the example script to create a new crawl with both plugins simultaneously returns:

{
  "agent": "Isoxya/3.0.4 (+https://www.isoxya.com/)",
  "began": "2022-02-16T14:55:13.272295Z",
  "depth_max": null,
  "duration": null,
  "ended": null,
  "href": "/site/aHR0cHM6Ly93d3cuaXNveHlhLmNvbTo0NDM/crawl/2022-02-16T14:55:13.272295Z",
  "list": null,
  "pages": null,
  "pages_max": null,
  "parent": null,
  "processor_config": null,
  "processors": [
    {
      "href": "/processor/719dd86b-15f9-4356-b6a9-f032f0fccde8"
    },
    {
      "href": "/processor/52950af0-24dd-4761-a6cb-b6a782c9f57c"
    }
  ],
  "progress": null,
  "site": {
    "channels": 1,
    "href": "/site/aHR0cHM6Ly93d3cuaXNveHlhLmNvbTo0NDM",
    "rate_limit": 1,
    "url": "https://www.isoxya.com:443"
  },
  "speed": null,
  "status": "pending",
  "streamers": [
    {
      "href": "/streamer/ffb2309e-7453-4c86-b11a-f4480415f089"
    }
  ],
  "validate": false
}

You might be wondering what effect this will have on the site being crawled: Isoxya’s spiders are smart enough to fetch each page only once, and to send it to each processor. The open-source edition of Isoxya interleaves these tasks within a single process. Isoxya Pro, available under commercial licence, dedicates an entirely separate queue to each processor, enabling pages to be processed in parallel across not only multiple processes but also multiple computers (using Kubernetes, for instance).

Summary

Isoxya processors implement the crawl run loop and data extraction logic, customising Isoxya to a specific product or industry. The Crawler HTML plugin can be used for a typical configuration, coupled with zero or more data extraction plugins. One such plugin is the Spellchecker plugin. Combinations of plugins can be used within a single crawl, supporting complex crawling logic. Plugins conform to a simple JSON interface, facilitating interoperability between programming languages.