Isoxya 1.4 release

2020-05-19 · computing

This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.

We’re pleased to announce Isoxya 1.4—Web Crawler & Data Processing System. It’s been 6 months since our last major update to the core crawling engine, and in this time Isoxya has moved to our custom-designed highly-available infrastructure, as well as receiving a plethora of performance improvements and bug fixes.

Expanding our web

We continue to work towards making Isoxya generally-available later this year. In the meantime, we’re steadily increasing the number of distinct sites we’re testing, as well as inviting people to join our private beta and give us feedback. We’re now offering 1000 pages for free, working to sample websites and give a taste of the sorts of data analysis Isoxya can offer. Get in touch if you’d like to request that we include your site in this programme.

Better information

Isoxya started off a few years ago, and was very much a closely-kept secret (for the first few versions, we didn’t even have a website!). Now we’re looking to make good connections with people and companies who understand our vision, and who are excited about being involved in what we think is surely the most innovative web crawler on the market (admittedly, we might be a little biased!). We’ve recently been expanding the information available on our website, and now have a whole section dedicated to Isoxya, explaining what it can do and how the flexible plugins system works. If you haven’t yet done so, we encourage you to take a look!

Redesigned virtual resource management

One of the key strengths of Isoxya is the way it handles resource usage. When a new snapshot is taken of a site, virtual resources are allocated to the sites being crawled. Not only that, but plugins handling data processing and data streaming also use virtual resources, allowing data to be streamed from the site, through the processors, and out through the pipeline in seconds (and we mean seconds: typically, data is available even in a third-party system in under 5 seconds from the start of the crawl, and often quicker!). If external links are validated, we do this in parallel, minimising waiting whilst still crawling at a safe speed so as not to overload the sites. This release presents redesigned virtual resource management running on our highly-available backbone, which should increase performance and decrease failures.
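To illustrate the general idea of validating external links in parallel while capping how many checks are in flight, here is a minimal Python sketch. It is not Isoxya's implementation: the function names `validate_links` and `fake_check`, the `max_workers` limit, and the example URLs are all assumptions made for illustration, and the checker is a stand-in that makes no network requests.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def validate_links(
    urls: Iterable[str],
    check: Callable[[str], bool],
    max_workers: int = 4,
) -> Dict[str, bool]:
    """Check each distinct URL concurrently.

    max_workers caps the number of checks running at once, which is one
    simple way to keep the load on remote sites bounded.
    """
    # De-duplicate while keeping first-seen order, so no URL is checked twice.
    url_list = list(dict.fromkeys(urls))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, url_list))
    return dict(zip(url_list, results))


def fake_check(url: str) -> bool:
    # Hypothetical stand-in for a real HTTP request: any URL containing
    # "broken" is treated as a failed link.
    return "broken" not in url


report = validate_links(
    [
        "https://example.org/",
        "https://example.org/broken",
        "https://example.org/",
    ],
    fake_check,
)
```

In a real crawler, `fake_check` would be replaced by a function that issues an HTTP request with a timeout, and the worker cap would typically be combined with a per-site rate limit.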

Optimised performance

This release delivers numerous performance optimisations, not only at the hardware level using faster CPUs, but also at a code level. We’ve expanded the dedicated in-memory cache to offload even more from the database, whilst also reviewing every single database query. We’ve made extensive changes, especially to the statistics system, which tracks every URL in-flight with perfect accuracy—even when crawling a site with millions of pages. This should give a better experience across the board, both with very large websites, and with use-cases requiring fast processing of lots of small websites.

API Demo scripts

We’ve reworked everything in the open-sourced Isoxya API Demo scripts, as well as splitting common authentication tasks like login into newly-open-sourced Tigrosa API Demo scripts. These can be used as a reference for integrating with the Isoxya API, or alternatively, as a simple method of using Isoxya.

Gentler spiders

We’ve reassessed the default rate-limits used per-site by spiders, and have reduced them. Isoxya is able to crawl very quickly, but we understand that this is something which should be planned and ideally agreed in advance. To minimise potential disruption or accidents, we’ve set a low speed by default. This limit can be raised on request (either because you own the site involved, or because it is clear the site can handle the traffic). We also respect the robots.txt file, so if you don’t want our spiders to visit some part of your site, you can control this yourself in the usual way (we typically use the Isoxya user agent).
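For site owners wondering how a robots.txt rule is interpreted, the following sketch uses Python's standard `urllib.robotparser` module to evaluate a sample file. The robots.txt content, the example.com URLs, and the exact user-agent token `Isoxya` are assumptions for illustration; this shows how a generic parser reads such rules, not how Isoxya's spiders are implemented.

```python
import urllib.robotparser

# Hypothetical robots.txt content, parsed locally rather than fetched.
ROBOTS_TXT = """\
User-agent: Isoxya
Disallow: /private/
Crawl-delay: 2

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The "Isoxya" group applies to this agent, so only /private/ is blocked.
allowed_public = parser.can_fetch("Isoxya", "https://example.com/blog/post")
allowed_private = parser.can_fetch("Isoxya", "https://example.com/private/report")

# Crawl-delay is a non-standard but widely used directive; value is in seconds.
delay = parser.crawl_delay("Isoxya")
```

With this sample file, a crawler identifying itself with the assumed `Isoxya` token may fetch the blog page but not anything under `/private/`, while all other crawlers are excluded from the whole site.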

What’s next

Over the next few days, we’ll be working through our list of sites to visit and offering free reports; if you haven’t yet requested that we visit your site, you can do so here. After that, we’ll be continuing to work towards the next release of Isoxya, whilst gradually extending our journey through the interweb. We’re also hoping to develop some case studies or technical ‘deep dives’; if you’d like to be involved in that or have some suggestions about what might be useful, then let us know. Also, don’t forget to sign up to Pavouk Community, to stay up-to-date on what we’re making; you can also post any comments or questions there. Happy crawling! 🕷️