Isoxya plugin: Crawler HTML 1.1 release

2020-01-31 · Isoxya

This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.


It’s been 5 months since we released the open-source Isoxya plugin: Crawler HTML 1.0, and we’re pleased to announce version 1.1, which merges in link-checking from Isoxya plugin: Link Checker and now extracts headers by default. We’ve crawled millions of pages with the previous version as part of our private beta programme, and this has given us insights into how to improve.

Since Isoxya, our web crawler and data processing system, provides a flexible plugin system, we were able to develop a simple open-source plugin to extract the HTTP status code from responses and stream it into other systems through other plugins. This remains a useful reference implementation for others building Isoxya plugins. However, running two plugins for every page processed just to extract the HTTP status code seemed wasteful, so we merged the functionality and made those metrics available by default.

Headers extraction

This got us thinking: if the HTTP status code can be cost-effectively collected by default, why not the HTTP headers, too? Often it’s useful to be able to check whether a content-type is set correctly, or whether the page has been served through a cache and what the expiry is set to. Also, it’s useful to be able to check on SSL/TLS, how strong those guarantees are, and whether HSTS is working. All this can be accomplished by extracting the headers, which this plugin now does by default.
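To make the kinds of checks mentioned above a little more concrete, here is a minimal Python sketch of what a downstream consumer might do with extracted headers. It is not the plugin's own code, and the plugin's output format is not described in this post: the plain name-to-value dictionary, the function name `summarise_headers`, and the result keys are all assumptions made for illustration. Treating the presence of an `Age` header as a sign the response came through a cache is a heuristic, not a guarantee.

```python
import re


def summarise_headers(headers):
    """Pull a few commonly audited facts out of HTTP response headers.

    `headers` is assumed to be a plain dict of header name -> value;
    the real plugin output may be shaped differently.
    """
    # HTTP header names are case-insensitive, so normalise the keys first.
    h = {name.lower(): value.strip() for name, value in headers.items()}

    # Content-Type: keep the media type, drop parameters such as charset.
    content_type = h.get("content-type", "").split(";")[0].strip().lower() or None

    # Cache-Control: look for a max-age directive (in seconds).
    max_age = re.search(r"max-age=(\d+)", h.get("cache-control", ""), re.I)

    # Strict-Transport-Security: HSTS only takes effect when max-age > 0.
    hsts = re.search(r"max-age=\"?(\d+)", h.get("strict-transport-security", ""), re.I)

    return {
        "content_type": content_type,
        "cache_max_age": int(max_age.group(1)) if max_age else None,
        # Heuristic: an Age header suggests a cache served the response.
        "served_from_cache": "age" in h,
        "hsts_enabled": bool(hsts) and int(hsts.group(1)) > 0,
    }


if __name__ == "__main__":
    sample = {
        "Content-Type": "text/html; charset=UTF-8",
        "Cache-Control": "public, max-age=3600",
        "Age": "120",
        "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    }
    print(summarise_headers(sample))
```

Because the headers are collected for every page, a summary like this could be computed across a whole crawl to spot pages with a missing content type or no HSTS policy.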

Onwards and webwards

We’re continuing to extend and improve our plugins, as well as beginning a phase of work to upgrade much of Isoxya’s infrastructure. This will enable us to continue crawling and expanding in the coming year, as we bring more people on board in our private beta programme. As part of this, we’re also working towards Isoxya 1.4, which will be a big release bringing many improvements to the core crawling engine, as well as consolidating many features we added last year.