This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.
We’re pleased to announce the release of Isoxya 1.3—the high-performance web-crawling system with data-processing and data-streaming plugins. With the introduction of external URL validation, public plugin configurations, and full support for streaming data directly into Elasticsearch, we feel this release is a solid and creative foundation for analysing data and powering other products.
External URL Validation
Isoxya is now able to validate external URLs. Since all site-snapshots take place within the scope of a single site (i.e. not subdomains, and not other sites), Isoxya's design allowed us to approach this problem rather differently. During a site-snapshot, Isoxya automatically builds up site-lists of external URLs and links them to the site-snapshot. Then, at the end of the crawl, it launches multiple child site-snapshots with a depth-limit of 1, and processes those independently, deduplicated and in parallel.
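The core idea can be sketched in a few lines of Python. This is an illustration only, not Isoxya's actual implementation: the function name and the way duplicates are removed are assumptions, but the principle is the same, since external links are gathered during the crawl, deduplicated, and each would then seed a child site-snapshot of depth 1.

```python
from urllib.parse import urlparse

def collect_external_urls(site_url: str, discovered_urls: list[str]) -> set[str]:
    """Collect URLs outside the crawled site's host, deduplicated via a set."""
    site_host = urlparse(site_url).netloc
    return {u for u in discovered_urls if urlparse(u).netloc != site_host}

# URLs discovered during a site-snapshot of https://example.com/ :
found = [
    "https://example.com/about",
    "https://other.example.net/page",
    "https://other.example.net/page",   # duplicate, removed by the set
    "https://example.com/contact",
]
external = collect_external_urls("https://example.com/", found)
# each entry in `external` would become a child site-snapshot with depth-limit 1
```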
Although this feature is still somewhat experimental, HEAD requests are now supported in addition to the default GET. Whilst this can help to reduce load on a site, HEAD is currently treated the same as GET for billing purposes. We'd be interested to hear people's thoughts on how to develop this feature further, including whether it would be more useful to let the user choose between the two request types, or to integrate HEAD-aware conditional requests directly into the crawler.
Public Pickaxes, Pipelines, and User Agents
Previously, pickaxes, pipelines, and user-agents had to be approved and specified at the organisation level (we don't allow free control over these settings). Now, our pre-configured pickaxes, pipelines, and user-agents are marked as public and available in your organisation's options automatically.
Pickaxes and Pipelines Metadata
Pickaxes and pipelines now have tags. Not only does this make configuration clearer, but pickax tags are also sent right through to pipelines, enabling specific treatment of data from specific sources. This has enabled us to vastly optimise the Isoxya Pipeline Elasticsearch when data has been processed by the Isoxya Pickax Spellchecker, for example. The tags will also be present on financial statements.
Site-snapshots can now contain an optional configuration, passed to the pickaxes. This is free-form, and can be anything those pickaxes support. The Isoxya Pickax Spellchecker uses it to change the language used for spellchecking.
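As a hedged illustration of what such a free-form configuration might look like, here is a hypothetical site-snapshot request body written as a Python dict. The field names are illustrative assumptions, not Isoxya's actual API schema:

```python
# Hypothetical site-snapshot request body; the field names are
# illustrative, not Isoxya's actual API schema.
snapshot_request = {
    "url": "https://example.com/",
    "config": {
        # free-form: anything the configured pickaxes support, e.g. the
        # language used by a spellchecking pickax
        "spellchecker": {"language": "en_GB"},
    },
}
```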
Pipelines Fatter Droplets
Pipelines now send extra data in droplets, with useful additions such as the site URL already separated from the absolute page URL. This makes for easier searching and aggregation in external systems such as Elasticsearch using Kibana.
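The benefit of pre-separating the site URL can be sketched as follows. This is a minimal Python illustration under assumed field names, not the actual droplet schema:

```python
from urllib.parse import urlsplit

def enrich_droplet(page_url: str) -> dict:
    """Separate the site URL from the absolute page URL, so external
    systems such as Elasticsearch/Kibana can aggregate per-site without
    parsing URLs themselves. Field names here are illustrative."""
    parts = urlsplit(page_url)
    return {
        "url": page_url,                             # absolute page URL
        "site_url": f"{parts.scheme}://{parts.netloc}/",  # already separated
        "path": parts.path or "/",
    }

droplet = enrich_droplet("https://example.com/blog/isoxya-1-3")
```

With `site_url` as its own field, a Kibana terms aggregation over it groups droplets by site directly.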
Pickaxes Control vs Data Simplification
Originally, there were two types of pickax: control pickaxes, which determined the route taken through the site, and data pickaxes, which dealt with the data-extraction and processing. Some early feedback made it clear to us, however, that this distinction wasn’t well-understood, so we’ve merged the two concepts into one. Now, there is only the concept of pickax, which may emit crawl-graph or extracted data—or both.
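The merged concept can be sketched with a small Python data type. This is purely illustrative; the type and the toy pickax below are assumptions, not Isoxya's plugin interface:

```python
from dataclasses import dataclass, field

@dataclass
class PickaxResult:
    """A pickax may emit further URLs to crawl (the crawl-graph),
    extracted data, or both: the two original pickax types merged."""
    urls: list[str] = field(default_factory=list)   # crawl-graph contribution
    data: dict = field(default_factory=dict)        # extracted data

def example_pickax(page_url: str, body: str) -> PickaxResult:
    # Toy example: emit every http-prefixed token as crawl-graph,
    # and the body length as extracted data.
    urls = [word for word in body.split() if word.startswith("http")]
    return PickaxResult(urls=urls, data={"length": len(body)})
```

A purely routing pickax would leave `data` empty, and a purely extracting one would leave `urls` empty; nothing forces a plugin to do both.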
Financial Usage System
We’ve designed the first part of the financial usage system, which will be used for billing purposes. We’ve still got more work to do on this, but we’re aiming to follow a metered-billing approach, with each pickax and pipeline itemised and charged separately. This gives a lot of flexibility over potential uses, making it possible to combine different plugins whilst still keeping the total cost down.
Spiders with Better Memory
We’ve optimised memory usage in our crawling queues, giving the spiders that visit sites better memory characteristics and a more polished operational experience.
Pickaxes and Pipelines Resource Allocation
We’ve made stability improvements to how pickaxes and pipelines allocate resources at the start of a site-snapshot. They should now be more reliable when starting new crawls, even if some time has passed since site-snapshots last ran for that organisation.