Isoxya web crawler is 5 today!

2021-11-23 · computing


5 years ago today, the first commits were made to what became Isoxya web crawler. Back then, it was called Pavouk, had a component called spion (spy), and said hi in Dutch when it booted; none of it was open-source. A lot’s changed since then: every line of code has been reviewed and rewritten multiple times, component names and greetings are more boring, and the entire core and many plugins can now be installed, examined, and improved upon as open-source software, for free. Isoxya Pro, which became the commercial edition, has also had so many bugfixes and refactors that I’ve lost count, and it’s become far faster and easier to integrate with. The last 5 years have been a long journey, and as this anniversary approached, I did a lot of thinking about what’s next for our crawling and scraping spiders. 🕷️ I think Isoxya’s birthday is a good day to make a big announcement in that regard: Isoxya 3 is coming!

Kubernetes support

Isoxya Pro has always been designed to scale. To grow beyond a single computer, the controller manages resources and boots crawlers, processors, and streamers as needed. Containers are used to accomplish this predictably. Originally, Docker Swarm was used as the container orchestrator, but after various reliability issues when testing sites with millions of URLs, this was migrated to Pacemaker and Podman. Although that solution has shown impressive stability, there’s no denying that the installation is rather complex. These days, many projects are moving to Kubernetes, which simplifies cluster operations without sacrificing scalability. Isoxya Pro will support Kubernetes fully in Isoxya 3.

Dependency simplification

Isoxya Pro was originally designed as a multi-tenant crawler. The idea was that a single installation could manage namespaced crawls from multiple organisations directly, authorising resources and interacting with billing systems. Over the years, it’s become clear that this is unlikely to be a common use-case, especially considering that multi-tenant projects likely have their own API, and so can handle this themselves. To simplify dependencies and installation, Isoxya Pro will remove Tigrosa, the authentication and authorisation layer, in Isoxya 3.

Wiki repository documentation

Isoxya’s API documentation is being rewritten from scratch, moving from a static website to a dynamic wiki repository on GitHub. This will make high-quality documentation simpler to maintain, as well as making it easier for others to contribute if desired. As part of Isoxya 3, the REST API is being reworked to simplify endpoints and rename various JSON parameters. The new wiki repository will be updated progressively, and it should be fully up-to-date in time for the launch of Isoxya 3.
