Isoxya 3.0 release: crawling and scraping on Kubernetes

2022-01-24 · Computing

I’m pleased to announce the release of Isoxya 3.0, bringing Kubernetes support to the next-generation web crawler for the very first time. Arriving just a few weeks after Isoxya’s 5th birthday, Isoxya 3.0 has been rearchitected extensively, simplifying installation and dependencies, and supporting rapid prototyping and reliable scaling of crawler and scraper workloads. The core engine and many processor and streamer plugins are available open-source, whereas Isoxya Pro adds high availability, error recovery, and horizontal scaling using Kubernetes, PostgreSQL, Redis, and RabbitMQ. Output into Elasticsearch is possible using an open-source plugin, or you can design your own API endpoint to receive streamed data live.

Isoxya

Isoxya is the open-source core engine. It is intended for home, hobbyist, and research and development use. It uses SQLite as an embedded database, which supports small crawls and gets you up and running quickly. It uses the same plugin interfaces as Isoxya Pro, meaning it’s possible to start off using Isoxya for free, and then scale using the commercial edition. Or you can just keep using the open-source edition indefinitely, if you prefer. :)

Isoxya Pro

Isoxya Pro is a commercial alternative to the open-source engine. It is intended for businesses or agencies that have outgrown the hobbyist or R&D stages, and it adds high availability, error recovery, horizontal scaling, and other features such as external link validation and list crawls. It uses PostgreSQL as a database, Redis as an LRU in-memory cache, and RabbitMQ as a message broker. These dependencies can be satisfied as desired, whether containerised, managed, or standalone. Isoxya API Pro is supplemented by independently scalable Isoxya Crawler Pro spiders, Isoxya Processor Pro plugin connectors, and Isoxya Streamer Pro plugin connectors. These are deployed dynamically by Isoxya Controller Pro, which creates, scales, and deletes deployments via the Kubernetes API.

Isoxya Docs

Isoxya Docs is the new home for Isoxya’s documentation. You can use it to learn how to interact with the Isoxya API, or to understand the precise differences between Isoxya and Isoxya Pro. Unlike the previous documentation, it can now be downloaded or forked easily.

Isoxya plugin Crawler HTML

Isoxya plugin Crawler HTML provides a core run loop for the crawling engine, parsing each page as static HTML, and extracting request metadata and outbound URLs. This is used for both Isoxya and Isoxya Pro.
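
As a rough illustration of what extracting outbound URLs from static HTML involves, here is a minimal Python sketch using only the standard library. It is not the plugin's implementation: the names `LinkExtractor` and `outbound_urls`, the choice to keep only http(s) links, and the example URLs are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects raw href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outbound_urls(base_url, html):
    """Return absolute, de-duplicated http(s) URLs in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    seen = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop #fragments.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.append(url)
    return seen
```

Calling `outbound_urls("https://example.com/", page)` on a page containing a relative link to `/about` would yield `https://example.com/about` among its results. A real crawler would also need to handle things this sketch ignores, such as `<base>` tags, robots rules, and non-anchor links.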

Isoxya plugin Elasticsearch

Isoxya plugin Elasticsearch streams data into an Elasticsearch cluster, making it possible to query data using the advanced reporting features of Elasticsearch and Kibana. It’s possible to use this with both Isoxya and Isoxya Pro.

Isoxya plugin NGINX

Isoxya plugin NGINX is a simple packaged configuration for NGINX, for logging textual payloads such as JSON. This can be helpful when building custom plugins for Isoxya or Isoxya Pro.
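
To give a feel for the same idea without NGINX, here is a small Python sketch of an endpoint that accepts POSTed text and logs it as JSON when it parses. This is only an assumption-laden illustration, not the plugin itself or the Isoxya plugin interface: the names `summarise_payload`, `LoggingHandler`, and `serve`, the port 8000, and the 204 response are all hypothetical choices.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarise_payload(body):
    """Return a one-line log entry for a request body (hypothetical helper)."""
    try:
        data = json.loads(body.decode("utf-8"))
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON.
        return "invalid JSON (%d bytes)" % len(body)
    # Sorted keys make successive log lines easier to compare.
    return json.dumps(data, sort_keys=True)


class LoggingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(summarise_payload(body))
        self.send_response(204)  # acknowledge with no response body
        self.end_headers()


def serve(port=8000):
    """Run the logging endpoint on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), LoggingHandler).serve_forever()
```

Running `serve()` and pointing a plugin under development at `http://127.0.0.1:8000/` would print each payload it sends, which can help when checking what a custom plugin actually emits.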

Isoxya plugin Spellchecker

Isoxya plugin Spellchecker provides spellchecking for entire websites, even if they have millions of pages, and supports 7 languages: English, Czech, German, Spanish, Estonian, French, and Dutch. It is available free for both Isoxya and Isoxya Pro.

Availability

You can get started with Isoxya today, in just a few minutes and even fewer commands. Source code for the core engine, documentation, and various plugins is available on GitHub, and container images are published to Docker Hub. Isoxya Pro requires more setup and a commercial licence, which is based on your requirements regarding concurrent crawls, max pages per crawl, custom user agents, and max rate limits.
