Isoxya 1.1 release

2019-05-10 · Isoxya

This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.

We’re pleased to announce the release of Isoxya 1.1, the high-performance internet data processor and web crawler. It’s the first minor-version release since Isoxya 1.0 (over 5 months ago), and represents a major milestone in the maturity and functionality of the software, bringing even higher performance and simplified interfaces for connecting with other programs.

Increased Performance

Having used Isoxya 1.0 to crawl millions of pages, we identified a number of improvements we wanted to make, both to performance and to usability. In addition to fixing an issue that usually affected only websites with millions of pages, we completely overhauled the caching and storage systems, to reduce resource usage further and to provide a far smarter data-deduplication strategy.

Simplified Interfaces

Although we were very happy overall with Isoxya 1.0, we did realise that the interfaces we used to communicate with data processors (‘pickaxes’) and data streamers (‘pipelines’) were perhaps a little complex. This was because we had prioritised raw performance by using binary payloads, rather than JSON as in the rest of the Isoxya API. However, we anticipate that the actual performance cost of JSON at this layer will be very small, and easily justified by a clearer implementation (besides, we reasoned that we could always compress JSON payloads, if needed). Thus, Isoxya 1.1 presents much clearer interfaces for sending and receiving data; so clear, in fact, that we can now summarise them in this blog post!

The remainder of this post is rather technical—but if you’re using SEO web crawlers or crawling the internet for other purposes, hopefully you don’t mind that!

Data Processors (‘pickaxes’)

A pickax is a small data-processing program, concerned only with how to process data from one page. It can be written in any language, conforming to a simple interface, and hosted separately. This is how Isoxya can be applied to a specific purpose within an industry. A pickax is typically run within our infrastructure, for reasons of speed and network-traffic minimisation.

Isoxya sends full page data, including: the full page body (whether HTML, JSON, images, etc.); page headers (including information about Content-Type and caching); and metadata (an absolute URL, and the HTTP Status Code).

{
  "body": "PCFkb2N0eXBlIGh0bWw+CjxodG1sPgo8aGVhZD4KICAgIDx0aXRsZT5FeGFtcGxlIERvbWFpbjwvdGl0bGU+CgogICAgPG1ldGEgY2hhcnNldD0idXRmLTgiIC8+CiAgICA8bWV0YSBodHRwLWVxdWl2PSJDb250ZW50LXR5cGUiIGNvbnRlbnQ9InRleHQvaHRtbDsgY2hhcnNldD11dGYtOCIgLz4KICAgIDxtZXRhIG5hbWU9InZpZXdwb3J0IiBjb250ZW50PSJ3aWR0aD1kZXZpY2Utd2lkdGgsIGluaXRpYWwtc2NhbGU9MSIgLz4KICAgIDxzdHlsZSB0eXBlPSJ0ZXh0L2NzcyI+CiAgICBib2R5IHsKICAgICAgICBiYWNrZ3JvdW5kLWNvbG9yOiAjZjBmMGYyOwogICAgICAgIG1hcmdpbjogMDsKICAgICAgICBwYWRkaW5nOiAwOwogICAgICAgIGZvbnQtZmFtaWx5OiAiT3BlbiBTYW5zIiwgIkhlbHZldGljYSBOZXVlIiwgSGVsdmV0aWNhLCBBcmlhbCwgc2Fucy1zZXJpZjsKICAgICAgICAKICAgIH0KICAgIGRpdiB7CiAgICAgICAgd2lkdGg6IDYwMHB4OwogICAgICAgIG1hcmdpbjogNWVtIGF1dG87CiAgICAgICAgcGFkZGluZzogNTBweDsKICAgICAgICBiYWNrZ3JvdW5kLWNvbG9yOiAjZmZmOwogICAgICAgIGJvcmRlci1yYWRpdXM6IDFlbTsKICAgIH0KICAgIGE6bGluaywgYTp2aXNpdGVkIHsKICAgICAgICBjb2xvcjogIzM4NDg4ZjsKICAgICAgICB0ZXh0LWRlY29yYXRpb246IG5vbmU7CiAgICB9CiAgICBAbWVkaWEgKG1heC13aWR0aDogNzAwcHgpIHsKICAgICAgICBib2R5IHsKICAgICAgICAgICAgYmFja2dyb3VuZC1jb2xvcjogI2ZmZjsKICAgICAgICB9CiAgICAgICAgZGl2IHsKICAgICAgICAgICAgd2lkdGg6IGF1dG87CiAgICAgICAgICAgIG1hcmdpbjogMCBhdXRvOwogICAgICAgICAgICBib3JkZXItcmFkaXVzOiAwOwogICAgICAgICAgICBwYWRkaW5nOiAxZW07CiAgICAgICAgfQogICAgfQogICAgPC9zdHlsZT4gICAgCjwvaGVhZD4KCjxib2R5Pgo8ZGl2PgogICAgPGgxPkV4YW1wbGUgRG9tYWluPC9oMT4KICAgIDxwPlRoaXMgZG9tYWluIGlzIGVzdGFibGlzaGVkIHRvIGJlIHVzZWQgZm9yIGlsbHVzdHJhdGl2ZSBleGFtcGxlcyBpbiBkb2N1bWVudHMuIFlvdSBtYXkgdXNlIHRoaXMKICAgIGRvbWFpbiBpbiBleGFtcGxlcyB3aXRob3V0IHByaW9yIGNvb3JkaW5hdGlvbiBvciBhc2tpbmcgZm9yIHBlcm1pc3Npb24uPC9wPgogICAgPHA+PGEgaHJlZj0iaHR0cDovL3d3dy5pYW5hLm9yZy9kb21haW5zL2V4YW1wbGUiPk1vcmUgaW5mb3JtYXRpb24uLi48L2E+PC9wPgo8L2Rpdj4KPC9ib2R5Pgo8L2h0bWw+Cg==",
  "header": {
    "Cache-Control": "max-age=604800",
    "Content-Encoding": "gzip",
    "Content-Length": "606",
    "Content-Type": "text/html; charset=UTF-8",
    "Date": "Wed, 01 May 2019 06:06:49 GMT",
    "Etag": "\"1541025663+gzip\"",
    "Expires": "Wed, 08 May 2019 06:06:49 GMT",
    "Last-Modified": "Fri, 09 Aug 2013 23:54:35 GMT",
    "Server": "ECS (dcb/7F82)",
    "Vary": "Accept-Encoding",
    "X-Cache": "HIT"
  },
  "meta": {
    "status_code": 200,
    "url": "http://example.com:80/"
  }
}
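As a minimal sketch of how a pickax might read such a message, the Python below parses the JSON and decodes the body. It assumes the "body" field is standard Base64 of the raw response bytes, which is how the example above decodes; the function name parse_page_message and the shape of the returned dictionary are illustrative only, not part of the Isoxya interface.

```python
import base64
import json


def parse_page_message(raw: str) -> dict:
    """Parse a JSON page message shaped like the example above."""
    message = json.loads(raw)
    return {
        # Assumption: "body" is standard Base64 of the raw response bytes.
        "body": base64.b64decode(message["body"]),
        "header": message["header"],
        "status_code": message["meta"]["status_code"],
        "url": message["meta"]["url"],
    }


if __name__ == "__main__":
    # Build a tiny sample message rather than repeating the full page body.
    sample = json.dumps({
        "body": base64.b64encode(b"<!doctype html>").decode("ascii"),
        "header": {"Content-Type": "text/html; charset=UTF-8"},
        "meta": {"status_code": 200, "url": "http://example.com:80/"},
    })
    page = parse_page_message(sample)
    print(page["status_code"], page["url"], page["body"])
```

Keeping the body as bytes, rather than decoding it to text straight away, leaves room for non-text content such as images.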

The pickaxes mine this for useful data, potentially along with the URLs linked from the page (optional, since we can run a separate pickax simultaneously for this purpose), and send the results back to Isoxya. Isoxya deals with decisions like whether or not to crawl the URLs and how to deduplicate them, allowing the pickaxes to concentrate only on data, not crawling logic.

{
  "data": {
    "contentType": "text/html; charset=UTF-8",
    "title": "Example Domain"
  },
  "urls": [
    "http://www.iana.org/domains/example"
  ]
}
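To illustrate, here is a rough sketch of a pickax that would produce a response like the one above from an HTML page. It uses only the Python standard library, assumes the body is Base64-encoded UTF-8 HTML, and resolves relative links against the page URL; the names pickax and TitleAndLinks are hypothetical, and how the function is exposed to Isoxya (for example, behind a web server) is left out.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every <a href> on a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def pickax(message: dict) -> dict:
    """Turn one page message into extracted data plus outbound URLs."""
    # Assumption: the body is Base64-encoded HTML in UTF-8.
    html = base64.b64decode(message["body"]).decode("utf-8", errors="replace")
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    base = message["meta"]["url"]
    return {
        "data": {
            "contentType": message["header"].get("Content-Type"),
            "title": "".join(parser.title_parts).strip(),
        },
        # Relative links are made absolute; deduplication is left to Isoxya.
        "urls": [urljoin(base, href) for href in parser.hrefs],
    }
```

Note that the sketch makes no attempt to filter or deduplicate URLs, in keeping with the division of labour described above.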

Data Streamers (‘pipelines’)

A pipeline is a small data-streaming program, concerned only with how to send data to an external program. This could be a dedicated program in another API, a simple webserver logging payloads to flatfiles for use in data-warehousing applications, a process for inserting rows into an MPP database, a route in your existing API, Elasticsearch, AWS Redshift, Hadoop/Hive, or myriad other options.

Isoxya streams extracted data to external programs, along with metadata about which site-snapshot it came from, the age of the data, and the absolute URL of the page it came from.

{
  "data": {
    "contentType": "text/html; charset=UTF-8",
    "title": "Example Domain"
  },
  "site_snap": {
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/site_snap/2019-05-01T06:19:54.48295Z"
  },
  "t_retrieved": "2019-05-01T06:06:48.740524Z",
  "url": "http://example.com:80/"
}
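As one possible illustration of the flatfile option, the sketch below flattens a payload like the one above into a single row and appends it to a text stream as one line of JSON. The row layout and the names to_row and append_row are assumptions made for this example, and how the payload reaches the program is deliberately left out.

```python
import io
import json
from typing import TextIO


def to_row(payload: dict) -> dict:
    """Flatten a pipeline payload into one row for a flatfile."""
    return {
        "url": payload["url"],
        "t_retrieved": payload["t_retrieved"],
        "site_snap": payload["site_snap"]["href"],
        "data": payload["data"],
    }


def append_row(payload: dict, out: TextIO) -> None:
    """Write the payload as a single line of JSON (JSON Lines style)."""
    out.write(json.dumps(to_row(payload), sort_keys=True) + "\n")


if __name__ == "__main__":
    payload = {
        "data": {"contentType": "text/html; charset=UTF-8", "title": "Example Domain"},
        "site_snap": {
            "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/site_snap/2019-05-01T06:19:54.48295Z"
        },
        "t_retrieved": "2019-05-01T06:06:48.740524Z",
        "url": "http://example.com:80/",
    }
    buffer = io.StringIO()  # stands in for a file opened in append mode
    append_row(payload, buffer)
    print(buffer.getvalue(), end="")
```

One line per page keeps the file easy to append to and simple to bulk-load into a data warehouse later.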

Isoxya 1.1 is available now for private preview. If you’d like to talk to us about your large-scale crawling needs (whether huge sites with millions of pages, or tens of thousands of sites with only a few pages each), or about less common requirements (web crawling for performance testing, spellchecking, data extraction for machine-learning training datasets, regulatory research or auditing, security analysis or virus checking, etc.), then you can reach us here.