Ayakashi tour

Welcome aboard!
In this section we will cover all of Ayakashi’s basic building blocks and how they all fit together to create complete and ambitious scraping systems.
We will start with scrapers and the different components that comprise a scraper, then move to scripts and finally see how we use the two in a pipeline.

scrapers

If you have already read how to run a simple scraper, you already have a rough idea of what scrapers look like.
Inside scrapers we place our page interaction logic, like navigating and interacting with page elements (clicking things, filling forms etc).
One more thing we want to do in a page is extract data from it; we accomplish this inside a scraper as well.
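
As a rough sketch (assuming the scraper module exports an async function that receives the ayakashi instance and that a goTo() navigation action is available), a scraper file could look something like this:

// scrapers/myScraper.js - a minimal sketch, prop names and the target page are hypothetical
module.exports = async function(ayakashi) {
    // navigate to the page we want to scrape
    await ayakashi.goTo("https://example.com");
    // define a prop for the element that holds our data (props are covered below)
    ayakashi.selectOne("pageTitle").where({id: {eq: "title"}});
    // extract its text and return it so later pipeline steps can use it
    const title = await ayakashi.extractFirst("pageTitle", "text");
    return {title: title};
};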

props

Ayakashi finds things in a page and makes them usable through props and domQL.
Directly inspired by the relational database world (and SQL), domQL makes DOM access easy and readable no matter how obscure the page’s structure is.
Props package domQL expressions as re-usable structures which can then be passed to actions or used as models for data extraction.

They are defined by giving them a name and a domQL query for how to find them in the page:

    ayakashi
        .selectOne("myButton")
        .where({
            and: [{
                class: {
                    eq: "btn"
                }
            }, {
                "style-background-color": {
                    eq: "rgb(40, 167, 69)"
                }
            }, {
                textContent: {
                    like: "Click me"
                }
            }]
        });

We can then use it by referencing its name (myButton).
Props are a vital component since they are used as both action input and data extraction models.
Make sure to read the domQL section for a complete reference on how to define and query them.
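
For example, the myButton prop defined above can later be passed by name to an action (actions are covered below):

await ayakashi.click("myButton");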

extractors

Extractors are used to extract data from a prop.
Consider the following element and then a prop for that element:

<div id="myDiv">hello</div>
ayakashi.select("myDivProp").where({id: {eq: "myDiv"}});

We can extract its text by using the text extractor, like this:

const result = await ayakashi.extractFirst("myDivProp", "text");
// => "hello"

This is a very basic example; there is quite a bit more to read about extractors in the data extraction section.
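
For props that match multiple elements, a sketch of extracting from every match (assuming the extract() variant described in the data extraction section, which returns one result per matched element) could look like this:

<div class="item">item 1</div>
<div class="item">item 2</div>
ayakashi.select("items").where({class: {eq: "item"}});

const results = await ayakashi.extract("items", "text");
// => something like ["item 1", "item 2"]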

actions

Actions are high level functions that are used to interact with a page. For example:

await ayakashi.click("myButtonProp");
await ayakashi.waitUntilExists("searchBox");
await ayakashi.typeIn("searchBox", "some text to search");

You can read about all of the builtin actions in the reference, as well as how to create your own.
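
As a rough sketch of creating your own (assuming custom actions are registered through a registerAction() call and then become available on the ayakashi instance; the exact file layout is covered in the reference):

// actions/clickTwice.js - a hypothetical custom action
module.exports = function(ayakashi) {
    ayakashi.registerAction("clickTwice", async function(prop) {
        // compose builtin actions into higher level behavior
        await ayakashi.click(prop);
        await ayakashi.click(prop);
    });
};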

preloaders

Preloaders are used to load a piece of code and make it available in a page that will be loaded by a scraper.
Both third-party libraries and any other code you wish can be preloaded and made available (or even executed) before any of the page’s code has begun loading.
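
As a minimal sketch (assuming a preloader is a plain module whose code is injected and run inside the page before the page’s own scripts; how preloaders are attached to a scraper is covered in the reference):

// preloaders/markPage.js - a hypothetical preloader
// this code runs inside the page, before any of the page's own code
window.__loadedByAyakashi = true;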

scripts

Scripts are really simple: they are just functions that complement our scrapers.
For example a script could:

  • cleanup/normalize the data we just extracted
  • hit an external API to enrich our data
  • save our data

The builtin saving methods (sql, json, csv) are actually implemented as scripts.
One thing to note is that scripts have no page or browser access. They are meant to be run before and/or after scrapers (which of course have page access) in a standalone manner, to keep things structured and readable.
We will see an example of how to create a script when we build our first complete project.
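
As a quick preview (a sketch assuming scripts export an async function that receives the previous step’s output and that whatever they return is passed on), the printToConsole script used in the pipeline example below might look like this:

// scripts/printToConsole.js - a minimal sketch
module.exports = async function(input) {
    // input holds whatever the previous pipeline step returned
    console.log(input);
    // whatever we return is passed to the next step
    return input;
};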

pipelines

With pipelines we can integrate scrapers and scripts together.
They are defined inside an ayakashi.config.js file, which is the file that describes and configures a complete ayakashi project.
We will see a full example in the next section, where we will build a complete project, but pipelines are pretty simple and look like this:

{
    waterfall: [{
        type: "scraper",
        module: "myScraper"
    }, {
        type: "script",
        module: "printToConsole"
    }]
}

This instructs ayakashi to first run our scraper myScraper and then pass the extracted data to our printToConsole script, which will just print our data.
The waterfall part means: “run this in a serial (waterfall) manner, one after the other, while passing data from one to the next”.
Pipelines are also used to parallelize our tasks: in addition to waterfall they have a parallel mode, and the two can be mixed and matched to create many interesting scenarios.
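
For example, a mixed pipeline could look roughly like this (a sketch assuming parallel groups can be nested inside a waterfall step; saveMyData is a hypothetical script name, and the exact supported shapes are covered in the pipelines reference):

{
    waterfall: [{
        type: "scraper",
        module: "myScraper"
    }, {
        parallel: [{
            type: "script",
            module: "printToConsole"
        }, {
            type: "script",
            module: "saveMyData"
        }]
    }]
}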