Running a simple scraper

In this section we will get our feet wet by reading and running some actual code.
If you would like to play along, head over to the installation section first to get Ayakashi on your system.

We will build a simple scraper that loads a GitHub repository page, clicks a button and extracts some data, like the clone url and the star count.

The scraper file

We will place all of our code in a single file to keep things simple.
Here is the complete code:

module.exports = async function(ayakashi) {
    //go to the page
    await ayakashi.goTo("https://github.com/ayakashi-io/ayakashi");

    //find and extract the about message
    ayakashi
        .selectOne("about")
        .where({itemprop: {eq: "about"}});
    const about = await ayakashi.extractFirst("about", "text");

    //find and extract star count
    ayakashi
        .selectOne("stars")
        .where({href: {like: "/stargazers"}});
    const stars = await ayakashi.extractFirst("stars", "number");

    //find the green button that opens the clone dialog
    ayakashi
        .selectOne("cloneDialogTrigger")
        .where({
            and: [{
                class: {
                    eq: "btn"
                }
            }, {
                "style-background-color": {
                    eq: "rgb(40, 167, 69)"
                }
            }, {
                textContent: {
                    like: "Clone"
                }
            }]
        });
    //click it
    await ayakashi.click("cloneDialogTrigger");

    //find and extract the clone url
    ayakashi
        .selectOne("cloneUrl")
        .where({"aria-label": {like: "Clone this repository at"}});
    const cloneUrl = await ayakashi.extractFirst("cloneUrl", "value");

    //return our results
    return {about, stars, cloneUrl};
};

As you can see, a scraper is a nodejs module that exports an async function (so we can use the convenient async/await syntax).

The scraper performs the following steps, in order:

  • Loads a repository page
  • Creates the about prop and then extracts its value as text
  • Creates the stars prop and then extracts its value as a number
  • Creates the cloneDialogTrigger prop which points to the Clone button
  • Clicks the cloneDialogTrigger prop
  • Creates another prop, called cloneUrl, and extracts its value
  • Returns the results

The prop construct is used to define the different entities in a page by giving each one a name and a way to find it.
Props can then be used for extraction or as input to actions (like in the click action above).
We will explore them in more detail in the tour section and learn everything about the syntax in the domQL section.
Extracting data is also fully covered in the data extraction section.
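
To make the prop construct concrete, here is a minimal sketch that defines a single prop and reuses it both as input to an action and for extraction. It uses only the APIs shown above; the page, the selector values and the prop name are hypothetical:

module.exports = async function(ayakashi) {
    //hypothetical example page, for illustration only
    await ayakashi.goTo("https://example.com");

    //define the prop once: a name ("submitButton") plus a way to find it
    ayakashi
        .selectOne("submitButton")
        .where({
            and: [{
                class: {eq: "btn"}
            }, {
                textContent: {like: "Submit"}
            }]
        });

    //the prop can now be used as input to an action...
    await ayakashi.click("submitButton");

    //...and for extraction
    const label = await ayakashi.extractFirst("submitButton", "text");

    return {label};
};

Note how defining a prop is not awaited; only the actions and extractions that use it are.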

Let’s run it

Save the scraper above as github.js and run:

ayakashi run --simple ./github.js

The first time it is run, the run command will download a recommended Chromium version.
It will then output some information about its progress and finally print a table on the console with the extracted data.

Simple mode (--simple)

The --simple flag of the run command enables simple mode.
Simple mode lets us run single-file scrapers directly.
It’s a valuable tool for quick prototyping, simple scrapers and examples.
By default, the run command looks for an Ayakashi project folder, which is what we will build in the complete scraper project section.
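
For comparison, once a project folder is in place (we will create one in the complete scraper project section), the command is run from the project's root without the --simple flag or a file argument:

ayakashi run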

Next steps

This section served as an introduction to the base API and the run command.
Next comes the tour, in which we will cover each concept in more detail, and then we will move on to build a complete project by re-using and enhancing the scraper of this section.