Isoxya web crawler: 2.1 released, introducing open-source Community Edition

2021-03-19 · Isoxya

I’m pleased to announce the release of Isoxya web crawler 2.1, an internet data processing system representing years of research into building next-generation crawlers and scrapers. Back in September 2020, I announced that a major new version of Isoxya was coming, and that it would be splitting off part of the commercial edition into a new open-source edition. Isoxya 2.0 was released in January 2021, along with 3 open-source plugins. Today’s release of Isoxya 2.1 completes delivery of that milestone, and for the very first time, introduces an open-source mini crawler with minimal dependencies—and usable absolutely free!

Community and Pro Editions

Isoxya now comes in two editions: Community Edition (CE), a free and open-source (BSD 3-Clause) mini crawler, suitable for small crawls on a single computer; and Pro Edition (PE), a commercial and closed-source distributed crawler, suitable for small, large, and humongous crawls on high-availability clusters of multiple computers. Both editions utilise flexible plugins, allowing numerous programming languages to be used to extend the core engine via JSON interfaces. Plugins written for Isoxya CE should typically scale to Isoxya PE with minimal or no changes. More details and licences are available on request.

Features

Feature Community Edition (CE) Pro Edition (PE)
Licence open-source commercial
API
CLI scripts
Plugins 3+ 3+
·    
Authentication Tigrosa
Database SQLite PostgreSQL
Cache Redis
Message broker RabbitMQ
·    
High-availability
Horizontal scaling
Error recovery
Resource management
·    
Concurrent crawls 1 ∞¹
Pages/crawl ∞²⁺³ ∞¹
User-agents ∞¹
Rate-limit (reqs/s) 1/10³ ∞¹⁺⁴
·    
Robots.txt
Crawl max pages
Crawl max depth
List crawls
External link check
Crawl cancellation
Organisations
·    
Crawler channels 1 ∞¹
Processor channels 1 ∞¹
Streamer channels 1 ∞¹
·    
OS variant Linux Linux
Packaging container container
Support community direct
·    
Price free on request

Features and limits are indicative only, not guarantees. ∞ indicates many, not infinite! ¹ depending on licence and infrastructure. ² no hard-limit, but small as single-process. ³ not configurable. ⁴ set globally per-site; configurable for on-prem only.

Availability

Open-source programs have been published to GitHub and Docker Hub: Isoxya CE (code, download), and the Crawler HTML (code, download), Elasticsearch (code, download), and Spellchecker (code, download) plugins. In addition, there are example scripts available, either as a demo, or for use on a CLI. Documentation for the Isoxya API and interfaces is also available. There’s a hands-on walkthrough, and a short video. If you’re interested in Isoxya PE, more details and licences are available on request.