Major Infrastructure Upgrades

2020-04-28 · computing

This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.

We realise we’ve been rather quiet of late, but now we’re ready to let you know a little of what we’ve been up to! Having released so much new software last year, particularly for Isoxya (Web Crawler & Data Processing System), we decided to start this year with major infrastructure upgrades. After testing our snazzy spider by crawling millions of pages, we decided to bring together knowledge from other consultancy projects and build an improved, highly-available system.

Not too cloudy

These days, it’s rare to hear a company not talking about cloud computing, and moving things ‘into the cloud’ is a frequent long-term goal of many small- and medium-sized companies. This may well be a solid decision, especially for those without long-term infrastructure skills in-house. For programs with intensive processing requirements, however, moving to the cloud can come at the cost of predictability, flexibility, and even price. At Pavouk, we deploy most of our services onto dedicated servers, and optimise everything from the ground up. Being in total control of both hardware and software gives us a competitive advantage when running web crawlers, since we know exactly who the tenants are on each server, and are able to optimise the software we write to be more efficient. We’ve recently taken delivery of new AMD Ryzen processors, and are looking forward to seeing what we can do with such cutting-edge hardware.

Obsessive optimisation

It’s true that premature optimisation can be a waste of resources on a programming project. But it’s also important to balance that against the requirements of what you’re building. For many projects, increasing speed by even 20% might not be worth the effort; after all, it’s usually cheaper to increase the equipment budget than personnel costs. But when you’re running processes that execute millions of times per month, even medium-sized optimisations can amount to sizeable savings. Trying to balance these conflicting considerations is hard, and honestly, we don’t always get it right. But we’d usually rather err on the side of having something slightly more efficient than it needs to be right now, if only to help absorb the knock-on effects of growth. We obsessively tune databases, refine algorithms, and weigh the consequences of how new features are implemented, in order to be able to crawl quickly and cheaply.
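To make that trade-off a little more concrete, here is a rough back-of-the-envelope sketch. The figures are hypothetical and chosen purely for illustration (they are not measurements from Isoxya), and the function name is ours. It estimates how much processing time a per-run saving adds up to over a month.

```python
def monthly_cpu_hours_saved(runs_per_month: int,
                            seconds_per_run: float,
                            time_saved_fraction: float) -> float:
    """Estimate CPU-hours saved per month.

    time_saved_fraction is the share of each run's time that the
    optimisation removes (0.2 means each run takes 20% less time).
    """
    saved_seconds = runs_per_month * seconds_per_run * time_saved_fraction
    return saved_seconds / 3600  # convert seconds to hours


# Hypothetical workload: 5 million runs a month, half a second each,
# with an optimisation that trims 20% off every run.
hours = monthly_cpu_hours_saved(
    runs_per_month=5_000_000,
    seconds_per_run=0.5,
    time_saved_fraction=0.2,
)
print(f"{hours:.1f} CPU-hours saved per month")
```

Whether a saving of that size justifies the engineering time depends on what the hardware and the people cost, which is exactly the balance described above.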

Automate all the things

If you’re a small company like us, finding the budget to increase personnel on a project can be hard. Years ago, the start-up costs would likely have made even launching the project prohibitive. But these are the days of increased automation, and ‘working smarter’ helps us to achieve far more than we’d ever be able to do otherwise. Isoxya is a vast piece of software, consisting of over 13 individual programs at the first level, not even counting datastores, messaging systems, and caches. Keeping all that up to date and in sync would be a mammoth task (or perhaps, a tarantulic one?), were it not for Spiderbot, our continuous integration system robot. Dependency-checking, compiling, testing, and packaging are automatic, and we’re typically able to go from bugfix or feature update through to live installation just by pushing Git commits: exactly what you’d expect from the technology company behind a next-generation web crawler.

Up next

Now that we’ve come to the end of our infrastructure improvement drive, we’ll be turning our attention back to improving the software itself. We’ve got a lot planned for the coming weeks as we continue to work towards taking Isoxya into public availability, so stay tuned to be the first to hear what’s going on. And don’t forget, we’re offering 1000 pages for free to those of you who’d like a sneak preview, so get in touch if you’d like to hear more (or would just like to save some money)!