Isoxya web crawler: 2.0 released, with 3 open-source plugins

2021-01-29 · computing

 

There are many lessons to be learned from events related to the recent US election and subsequent Capitol riots, but one of them should be this: that the availability of online data and services is fragile, and that social media accounts, mobile apps, and website hosting can vanish suddenly. Whatever your own opinions about the culpability of the people and organisations involved, it’s worth remembering that the megacorporations that wield the power of the off-switch might next time use it in very different circumstances. Much of the internet as we know it is within the control of just a few people, and whilst we’ve gained a lot from economies of scale, this centralisation has come at a huge cost.

Back in October 2020, I wrote that I do not want to be part of a platform which attempts to silence those who might disagree with me. I had no idea how quickly things would escalate, and that in just a couple of months, all of the major social media platforms would censure and ban an unprecedented number of accounts, mostly for allegedly violating their content policies. Whilst these problems are complex, I do feel that at least some part of the solution is clear: we must decrease our reliance on the current implementation of social media, and favour decentralising the internet both technologically and corporately. Just a decade ago, lots of people had small sites hosted in numerous places around the world. These days, it seems that even large organisations host their content with the same handful of providers, making them subject to the whims—good or bad—of the same few tech companies.

Today, I’m announcing the release of Isoxya 2.0, along with 3 open-source plugins, all backed by extensive documentation. I first wrote about this major new version in September 2020, and following a complete overhaul and rewrite of the software, it’s finally ready. Whilst this might not immediately seem connected to recent events, the independence and flexibility of the Isoxya web crawler can not only help people and organisations to regain a small amount of control over their online lives, but also make it possible to develop innovative software to explore, monitor, and preserve online data—all without being dependent on some megacorporation. Allow me to explain with a few examples:

Although originally inspired by SEO software, Isoxya is not SEO software, per se. Whilst it’s certainly possible to use it for this purpose, Isoxya is separated into multiple components, much like a motorbike. Not only does this give the possibility of replacing or upgrading parts as needed, it also allows for ditching the motorbike design altogether—perhaps using the engine to make a tuk-tuk instead. Web crawlers are typically designed either with a very specific purpose in mind (like SEO), or as a simple download tool without much control over how data is extracted and transformed. One Isoxya plugin streams data into Elasticsearch, allowing visual exploration of a website’s structure and data at scale. Another Isoxya plugin uses Hunspell, the same spellchecker as is used in LibreOffice, Mozilla Firefox, Mozilla Thunderbird, Google Chrome, and various proprietary programs, to spellcheck even large websites. Building on these plugins would allow, for example, analysing and investigating mentions of certain terms online, or tracking the spread of disinformation throughout networks and news sites. And since both of these plugins, as well as another plugin for processing HTML pages, are open-source, anyone is able to study or build on the code themselves.

Or the monitoring theme could be taken from a journalism angle: Ever more frequently, articles include quotes or information whose availability depends on social media or websites hosted elsewhere. Indeed, it could well be that an entire report depends on sources of data found only online. The problem is that such data is easily modified. How, then, to detect that a source has been changed, or that some critical piece of information has been subtly altered? With Isoxya, this would be trivial: One of the open-source plugins could be modified, or alternatively an entirely new plugin created, which extracts data from a website and generates checksums. These checksums could then be stored, perhaps by a trusted party. Verifying whether sources are still the same would then be as simple as accessing the website again, and comparing the checksums. The nice thing about this idea is that it provides an entirely different use with minimal extra work: An alert system could be created, to email subscribers whenever a favourite webpage of theirs changes. This would be similar to subscribing to a blog for updates, but would work as a generic solution, without even requiring you to give your email to the blog, or letting them know you’re keeping an eye on things. Useful for keeping an eye on competitors, perhaps?
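As a rough illustration of the checksum idea, here is a minimal Python sketch. It is not Isoxya plugin code and does not use the Isoxya plugin interface; the function names, URLs, and sample text are all hypothetical, and it assumes the text has already been extracted from each page. It normalises whitespace, hashes the result with SHA-256, and reports which sources no longer match their stored checksum.

```python
import hashlib


def content_checksum(extracted_text: str) -> str:
    """Return a SHA-256 hex digest of the text with whitespace normalised."""
    # Collapse runs of whitespace so trivial reformatting is not flagged.
    normalised = " ".join(extracted_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def changed_sources(stored: dict[str, str], current_text: dict[str, str]) -> list[str]:
    """List URLs whose current text no longer matches the stored checksum."""
    return sorted(
        url
        for url, text in current_text.items()
        if stored.get(url) != content_checksum(text)
    )


# Checksums recorded at publication time (hypothetical URLs and text).
stored = {
    "https://example.com/quote": content_checksum("The minister said: no comment."),
    "https://example.com/report": content_checksum("Figures for 2020: 42."),
}

# Text extracted on a later crawl: one page only reflowed, one page altered.
later = {
    "https://example.com/quote": "The minister   said: no comment.",
    "https://example.com/report": "Figures for 2020: 24.",
}

print(changed_sources(stored, later))
```

The same comparison could drive the alert system: any URL returned by `changed_sources` would trigger an email to its subscribers.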

The journalistic theme also relates to another important possibility: data preservation. The Internet Archive Wayback Machine creates snapshots of websites in order to preserve their contents for future comparison, and to protect against the site going offline. These snapshots are typically infrequent, however, and use of the service depends on storing the data in their collection. What if you wrote an article, but wanted to easily make your own copy of a site for reference and in case it was changed after publication? This could be implemented with a couple of plugins for Isoxya, preserving entire pages instead of extracting data during processing, and copying the pages to a hard drive or shared storage during streaming. And this could have very different uses, too, such as downloading an entire website before it’s taken offline by its webhost, for the purposes of mirroring the site or preserving it for law enforcement. Whilst this can also be done with other tools, Isoxya is designed to handle sites with even millions of pages, and to be able to spread out that work across multiple networked computers.
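To make the storage side of that concrete, here is a small Python sketch of writing page snapshots to disk. Again, this is only an illustration under my own assumptions, not Isoxya code: the directory layout (host, then a short hash of the URL, then a UTC timestamp) and the function names are hypothetical, and the page body is assumed to have been fetched already.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlsplit


def snapshot_path(root: Path, url: str, fetched_at: datetime) -> Path:
    """Build a path like <root>/<host>/<url-hash>/<UTC timestamp>.html."""
    host = urlsplit(url).netloc or "unknown-host"
    # A short hash keeps arbitrary URLs safe to use as directory names.
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return root / host / url_hash / f"{stamp}.html"


def save_snapshot(root: Path, url: str, body: bytes, fetched_at: datetime) -> Path:
    """Write the raw page body to disk unchanged and return where it went."""
    path = snapshot_path(root, url, fetched_at)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path
```

Because each crawl writes a new timestamped file rather than overwriting the last one, earlier copies of a page stay available for comparison after the site changes.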

So what’s next for Isoxya? Work will now begin on Isoxya 2.1, which will adapt the existing Pro Edition into a new open-source Community Edition. That provides an interesting challenge: Most of Isoxya’s design so far has been focussed on allowing it to operate at extreme scale, across 5, 10, or even more computers. The Community Edition, however, will cut out many dependencies, and work as a mini-crawler suitable for a single machine and small websites. Both editions will support the same plugins, though, meaning it will be possible to develop software based on the mini-crawler, and then scale that to the high-availability, distributed crawler as needed. Stay tuned for further announcements as I progress with that. And in the meantime, if you’re a programmer or company who thinks they might like to start building on top of Isoxya, get in touch; I’m open to giving out some limited Pro Edition licences to help get those projects moving. Happy crawling! 🕷️