Isoxya plugin: Elasticsearch 1.0 release

2019-09-12 · computing

This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.

We’re really rather excited to announce the very first release of the open-source Isoxya plugin: Elasticsearch 1.0, a plugin which streams data from the Isoxya crawler to the Elasticsearch database. Now any type of data from Isoxya, whether link-checking, spellchecking, or something else entirely, can be streamed directly to Elasticsearch in seconds, placing the full power of Elasticsearch and related tools such as Kibana at your disposal. Docker images are available and, as with Isoxya plugin: Spellchecker, announced just a few days ago, we’ve decided to release the plugin as open source (BSD-3 licence).

Build your own product

By providing a data-streaming pipeline to Elasticsearch, we’ve cut our programs free from the traditional request-develop-release cycle. Many other web and SEO crawlers reinvent the wheel by implementing their own reporting solutions, but these typically have a fairly limited set of features, and often can only cope with the insights those companies have thought about in advance. Even when a customer requests something with clear merit, there is often a long phase of evaluation, development, and testing before, if you’re lucky, the insight or feature you need is finally released. But data analytics is a well-solved problem, and we’d rather lean on the years of expertise behind an industry-standard solution such as Elasticsearch than slowly develop our own. Coupled with Kibana, this lets you get up and running and gain useful business insights in minutes, even if you’re not a programmer or data scientist. Just take a look at Elastic’s Kibana page to see the sort of thing which is now possible.
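To give a feel for what streaming documents into Elasticsearch involves, here is a minimal sketch of building a request body for Elasticsearch’s `_bulk` endpoint, which takes newline-delimited JSON: an action line followed by a document line for each record. This is an illustration only, not the plugin’s actual code; the index name `isoxya-demo` and the document fields are assumptions made up for the example.

```python
import json


def to_bulk_ndjson(index, docs):
    """Serialise documents into the newline-delimited JSON body used by _bulk."""
    lines = []
    for doc in docs:
        # Action line: index the document on the following line into `index`.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # A bulk body must be terminated by a final newline.
    return "\n".join(lines) + "\n"


# Hypothetical per-page results; these field names are illustrative only.
pages = [
    {"url": "https://example.com/", "status": 200, "misspellings": ["teh"]},
    {"url": "https://example.com/about", "status": 404, "misspellings": []},
]

body = to_bulk_ndjson("isoxya-demo", pages)
print(body)
```

A body like this would be sent in a `POST` to `/_bulk` with the `Content-Type: application/x-ndjson` header. Once indexed, the documents can be explored in Kibana without writing any reporting code.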

The biggest spellchecker in the world (probably)

We’ve just built what is probably the largest spellchecker in the world, with minimal extra work. One of the motivations behind the design of Isoxya’s flexible data-processing and data-streaming plugin system was that we wanted to be able to spellcheck entire live websites, even if they had millions or tens of millions of pages. There are surprisingly few products in this area, and those we looked at were very slow, only able to cope with a few hundred or a few thousand pages, and exceedingly expensive. So, we used the power of open source to connect to Hunspell (the same spellchecker as is used in LibreOffice, Mozilla Firefox, Mozilla Thunderbird, and Google Chrome), and now, with this Elasticsearch plugin, we have a functional spellchecker able to drill into results from millions of pages, scaling at the click of a button.

Insights in seconds

Every component within the main Isoxya run-loop is push-based. Rather than having to wait for programs to check for pending work, allocate resources, check completion statuses, and build reports, every stage is intelligently progressed with minimal wait-times, much like an SMS being sent to your phone. Typically, once a new site-snapshot is requested, the crawl starts, takes the data from the website, processes it using a data-processing plugin, and streams it using a data-streamer plugin, all within seconds. There’s no waiting for an ‘end of crawl’-type procedure, because data is processed page-by-page right from the very first URL. Given that Isoxya is also designed to support continuous crawling, time-series views of websites are not only possible, but trivial to generate. Visualisations and dashboards built on Kibana can therefore be refreshed constantly, with data usually available within the Elasticsearch cluster for overviews or deep-dives faster than you could go and make a coffee.
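The push-based, page-by-page flow described above can be sketched in a few lines. This is a toy illustration under our own assumptions, not Isoxya’s implementation: the function names (`crawl`, `make_processor`, `make_streamer`), the word-count ‘processing’, and the in-memory `sink` standing in for Elasticsearch are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Page = Dict[str, str]
Handler = Callable[[Dict[str, object]], None]

events: List[str] = []  # records the order in which things happen


def make_streamer(sink: List[Dict[str, object]]) -> Handler:
    """Final stage: push each result to the sink as soon as it arrives."""
    def stream(result):
        sink.append(result)
        events.append(f"streamed {result['url']}")
    return stream


def make_processor(downstream: Handler) -> Callable[[Page], None]:
    """Middle stage: process one page, then push the result onwards at once."""
    def process(page):
        result = {"url": page["url"], "word_count": len(page["body"].split())}
        downstream(result)
    return process


def crawl(pages: Iterable[Page], on_page: Callable[[Page], None]) -> None:
    """First stage: hand each page downstream immediately after fetching it."""
    for page in pages:
        events.append(f"fetched {page['url']}")
        on_page(page)


sink: List[Dict[str, object]] = []  # hypothetical stand-in for Elasticsearch
crawl(
    [{"url": "/a", "body": "hello world"}, {"url": "/b", "body": "one two three"}],
    make_processor(make_streamer(sink)),
)
print(events)
```

The recorded events show the first page being streamed before the second is even fetched, which is why there is no ‘end of crawl’ step to wait for.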