the future of web crawling is open-source: Isoxya 2 is coming

2020-09-17 · Isoxya

Almost 4 years ago, I wrote a line of code. Having previously had a number of roles within web development, advertising, and analytical search engine optimisation, I’d already designed two commercial web crawlers: a prototype for one company which never went anywhere, and a greenfield redesign for another company which definitely did—later going on to raise a lot of investment. Over the years that followed, I found myself thinking more and more abstractly about the web crawling problem, and realised that despite there being many good web crawlers around—many of them well established and years old—most were not very flexible when it came to uses outside of traditional SEO.

And so, I set off on my journey to design and build a next-generation web crawler. It should support multiple industries, not just SEO. It should support multiple uses, including machine learning and human language analysis. It should support multiple programming languages, to give programmers freedom to connect to other software. It should be able to handle even the largest of websites, with tens of millions of pages. It should be distributed, supporting horizontal scaling across multiple computers. It should be fault-tolerant, automatically recovering from all sorts of errors without human intervention. It should be cheap to run. And finally, it should be fast!

What followed has been an interesting—and challenging—journey.

prototyping is easy

I had a usable prototype after a few months. It would likely have been even quicker, but I chose to write Isoxya in the Haskell programming language, which, although I’d used it before, I’d never used for anything of this scale. Having researched the language’s structure, rationale, and advantages, I became convinced that Haskell would be a solid choice, giving a high degree of confidence in code correctness, and blistering speed as a compiled and heavily optimised language.

designing interfaces is harder

After the initial prototyping, I did some work improving and cleaning up the interfaces I’d designed. I wanted it to be easy to interact with the web crawler from all sorts of different systems, and to strike a reasonable balance between pure speed and usability. Initially, I leaned much more towards the former, but when writing the first extensions to the engine, I realised that this would somewhat raise the barrier to entry, since not everyone would be comfortable using and debugging binary structures. I also convinced myself that having a core engine with a plugin system was the way to go; this would allow interoperability with other languages and systems, whilst keeping the core fast and efficient. I changed the interfaces for data processing and streaming to be JSON with Base64-encoded blobs, and kept testing and optimising the crawling engine.
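In practice, that means a plugin only needs to parse a small JSON document and decode the body blob before doing its work. As a rough sketch, and assuming illustrative field names (url, body) rather than Isoxya’s actual wire format, a Haskell plugin could decode such a payload along these lines:

    {-# LANGUAGE OverloadedStrings #-}
    -- Hypothetical sketch: parse a JSON payload whose page body is carried as a
    -- Base64-encoded blob. The field names are assumptions for illustration,
    -- not Isoxya's actual interface.

    import           Data.Aeson             (FromJSON (..), eitherDecode,
                                             withObject, (.:))
    import qualified Data.ByteString        as BS
    import qualified Data.ByteString.Base64 as B64
    import qualified Data.ByteString.Lazy   as BSL
    import           Data.Text              (Text)
    import qualified Data.Text.Encoding     as TE

    data Page = Page
        { pageURL  :: Text
        , pageBody :: BS.ByteString -- raw bytes, after Base64 decoding
        }

    instance FromJSON Page where
        parseJSON = withObject "Page" $ \o -> do
            url <- o .: "url"
            b64 <- o .: "body"
            -- B64.decode returns Either String ByteString; fail the parse on error
            either fail (pure . Page url) (B64.decode (TE.encodeUtf8 b64))

    main :: IO ()
    main = do
        raw <- BSL.getContents -- e.g. a payload delivered to the plugin
        case eitherDecode raw of
            Left err   -> putStrLn ("invalid payload: " ++ err)
            Right page -> putStrLn (show (pageURL page) ++ ": "
                                    ++ show (BS.length (pageBody page)) ++ " bytes")

The same idea works in any language with a JSON parser and a Base64 decoder, which was the point of moving away from raw binary structures.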

the internet is rather large

Designing a data processing system usually requires following a formal or informal specification closely, often with little variance between structures. The internet, however, is rather large, and the sheer number of different web pages—some compliant with formal specifications, and some rather broken but somehow just about working anyway—vastly complicates this. Each time everything seemed to work fine, I found a different sort of website, and something broke. Here the choice of Haskell helped greatly, since I was able to fix, extend, and refactor extensively, with very few regressions. Over the course of the last few years, I’ve personally tested Isoxya with hundreds (thousands?) of different webpages, and whilst an incompatible website is probably only a pixel’s throw away, it does now have a pretty decent level of coverage.

think of the servers!

Building a piece of software of this scale actually takes far more than just programming. It’s easy to think that building a web crawler is as simple as putting together a few lines of code; after all, many programmers build a small web crawler of sorts at some stage in their careers. But when designing software to run at scale on multiple computers simultaneously, with high-availability protections against individual components failing, it becomes critical to consider not only the code, but also what the code is running on. I was also keen to ensure that Isoxya would be cheap to run, not requiring more CPU or RAM than really necessary—even when processing large pages or files. As a result, I’ve actually spent a significant proportion of the time working on Isoxya setting up and customising servers, testing it at scale across networks whilst carefully monitoring availability and resource consumption. At times, I was my own ‘chaos monkey’, killing programs or cutting the power entirely, to ensure that Isoxya could tolerate it. I even crawled websites with in excess of 10 million pages each, to ensure it could cope with that end of the scale. There’s definitely some performance tuning still possible there—after all, finding people with websites that big isn’t exactly common—but everything’s mostly sorted.

shoestring budgets

It’s one thing to design and test software of this size if you have a large budget, and are able to rent all the hardware you can dream of. It’s quite another thing to do it if you’re self-funded, and working on a somewhat ‘shoestring budget’—at least in comparison to the many other companies that have millions of Euros in funding. There’s also the issue of human resources: so far, I’ve written pretty much all of the core engineering single-handedly. I’ve had others help out along the way with various feedback, administrative support, and testing, but the truth of the matter is that it’s very hard to compete against companies with tens or even hundreds of employees. Whilst this means I haven’t been able to build out as many plugins or reference products as I would have liked, it has forced me to be aggressive in my prioritisation of where to invest resources—both money and time. What I’ve developed is already production-ready, having run on the internet for an extended period of time, and with many pages ‘under the bridge’. However, it’s also limited insofar as it’s still API-only, with no nice user interface to show off its capabilities. But API-first design has long been important to me, since once you get that right, graphical web applications can follow easily.

spiders in the closet

Here, however, I’ve encountered some problems. The lack of resources to take the project as far as I’d like, or to get it in front of the right people, has meant that for a while I’ve been sitting on some very powerful technology—which absolutely nobody is using. That’s like having super-smart, brightly-coloured spiders, with enough eyes between them to spot all manner of juicy things—and having them sit unheeded in the closet. It’s important to me that the software I’ve invested so much in actually sees the light of day, and gets used by people to solve real-world problems. I’m also aware that, despite not being a newcomer in this field, there’s a difference between having designed crawling and data processing systems before, and actually convincing other people to trust your software. Many other commercial crawling companies offer demos, but usually those don’t show much more than their own UIs or reporting engines. This is also a bit tricky, because I decided a while back to focus on doing the crawling and processing really well, but to leave the analytics and reporting to other established systems. Since Isoxya supports both data processing and streaming plugins, I wrote a plugin to send data into the excellent Elasticsearch and Kibana ecosystem, making possible far more uses than I can likely even imagine.
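As an example of how small such an integration can be, here is a sketch of a streamer-style plugin in Haskell; the Elasticsearch index name, endpoint, and document fields below are assumptions for illustration, not the actual plugin’s schema:

    {-# LANGUAGE OverloadedStrings #-}
    -- Hypothetical sketch: index a processed-page document into Elasticsearch
    -- over its HTTP API. Index name, endpoint, and fields are illustrative only.

    import Data.Aeson          (Value, object, (.=))
    import Network.HTTP.Simple (getResponseStatusCode, httpLBS, parseRequest,
                                setRequestBodyJSON)

    indexDocument :: Value -> IO ()
    indexDocument doc = do
        -- POST the document to a local Elasticsearch node
        req <- parseRequest "POST http://localhost:9200/crawl-results/_doc"
        res <- httpLBS (setRequestBodyJSON doc req)
        putStrLn ("Elasticsearch responded with HTTP "
                  ++ show (getResponseStatusCode res))

    main :: IO ()
    main = indexDocument $ object
        [ "url"    .= ("https://example.com/" :: String)
        , "status" .= (200 :: Int)
        , "words"  .= (1234 :: Int)
        ]

Once documents land in an index like that, Kibana searches, dashboards, and visualisations follow with no further crawler-side work.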

the open-source community

For many years, I’ve contributed various patches and projects to the open-source community. After extended discussion with various people who have been kind enough to advise me about business, and balancing the various risks and opportunities with what I’ve designed, I’ve decided that Isoxya 2 will be open-source. Not only the plugins I’ve written so far—for crawling HTML pages, spellchecking pages in various languages, and streaming data into Elasticsearch—but also a smaller version of the core engine will be licensed open-source. I’m still choosing between the BSD-3 and Apache 2.0 licences, but in either case, I hope for developers to be able to boot a simple binary or Docker container—or compile Isoxya themselves, if they choose—and run small crawls and develop plugins on their own computers. That engine will necessarily be limited to small numbers of pages and much slower speeds, whilst having minimal external dependencies requiring installation and configuration.

building a viable business

This is all very well, but I also want to build a viable business! I’ve spent a lot of time recently analysing and discussing monetisation models, and hopefully I’ve found a solution. The problem is balancing a meaningful product for the open-source community, usable within that ecosystem without being crippled by some dependency on closed-source or commercial software, against having enough scope to generate revenue to fund further development and expansion of the project. As such, I intend to offer a commercial ‘professional’ or ‘enterprise’ edition, which will contain the more complex features I’ve already developed, such as high-availability resiliency, multi-computer scaling, larger websites, and far higher speeds. The open-source ‘community’ edition should be enough for small website owners, researchers, and those wanting to experiment or develop their own plugins and integrate with other systems. The commercial edition will offer a natural next step for running those same plugins and integrations at scale, fully compatible with the open-source edition, such as would be desirable for medium- and large-website owners, or even SEO or data-science web crawling companies. This is very different to what many other commercial crawlers offer, which is usually SaaS only, with on-premise deployment reserved for the largest of customers. In addition to this, I intend to offer the possibility of support, and maybe, in the future, an additional SaaS offering for those who would prefer not to have to set up and maintain the multi-computer, multi-database infrastructure required.

the months ahead

Isoxya 1.5—the latest version—will likely be the last in that series. I’ve spent a few days analysing all the features and specifications I’ve ended up with after these years of research, and have drawn up a plan for developing Isoxya 2, split between the open-source ‘community’ edition and the commercial closed-source ‘professional’ or ‘enterprise’ edition. Since almost all of this code already exists and is running live on my servers, I expect the time-frame to be relatively short: in the region of a few months or so, with the first code open-sourced much sooner than that. I also hope to use this period as an opportunity for a final quality check prior to launching publicly for others to build on. Part of that work will involve bringing the API and interface documentation up to date, progressively explaining how to get started with Isoxya, and showing the numerous opportunities for building software on top of it. After the open-source and commercial editions are released—if I get the opportunity—I would like to build a commercial search product on top of it, which would be able to monitor keywords across a fully customisable list of websites, for brand monitoring, competitor analysis, or financial market news.

get involved

If you’re interested in web crawling, either as a programmer building your own plugins or as someone curious about the commercial possibilities, you’ll hopefully want to get involved! As well as subscribing to my newsletter about this and other projects, and connecting on Twitter or LinkedIn, you can let me know if you’d like early access or a deeper discussion about Isoxya’s design and capabilities. It’d also be great if you’d let anyone else you think might be interested know, and, when the time comes, help to promote the open-source GitHub repositories within your own networks. Many thanks to those who have shown a continued interest in Isoxya over the past few years; hopefully, more will join that community, and soon the number of products using it as their window into the internet will expand. Happy crawling! 🕷️