Querying with domQL

In this section we will see how to define props and also learn the complete syntax of domQL.
If you haven’t read it already, take a minute to check the short tour.

Operators

and

ayakashi
    .select()
    .where({
        and: [{
            class: {
                eq: "container"
            }
        }, {
            class: {
                neq: "content"
            }
        }]
    })

and works the same as in most query and programming languages.
If all of the expressions evaluate to true, then the and expression is true and we have a match.
It can contain any number of nested expressions (the example has 2).

or

ayakashi
    .select()
    .where({
        or: [{
            class: {
                eq: "container"
            }
        }, {
            class: {
                eq: "content"
            }
        }]
    })

or works the same as in most query and programming languages.
If at least one of the expressions evaluates to true, then the or expression is true and we have a match.
It can contain any number of nested expressions (the example has 2).
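Because both operators accept nested expressions, an and can sit inside an or (and the other way around). The sketch below is a plain JavaScript model of that evaluation, not ayakashi’s implementation: evaluate is a hypothetical helper, elements are reduced to plain objects of attribute values, and only eq and neq are modelled.

```javascript
// Illustrative model only: `evaluate` is a hypothetical helper, not part of ayakashi.
// An "element" here is just a plain object of attribute values.
function evaluate(element, expression) {
    // `and`: every nested expression must be true
    if (Array.isArray(expression.and)) {
        return expression.and.every((nested) => evaluate(element, nested));
    }
    // `or`: at least one nested expression must be true
    if (Array.isArray(expression.or)) {
        return expression.or.some((nested) => evaluate(element, nested));
    }
    // leaf expression: { attributeName: { eq | neq: value } }
    const [attribute] = Object.keys(expression);
    const condition = expression[attribute];
    if ("eq" in condition) return element[attribute] === condition.eq;
    if ("neq" in condition) return element[attribute] !== condition.neq;
    throw new Error("operator not covered by this sketch");
}

// id is "main", OR (id is "sidebar" AND title is not "hidden")
const query = {
    or: [
        { id: { eq: "main" } },
        { and: [{ id: { eq: "sidebar" } }, { title: { neq: "hidden" } }] }
    ]
};

const results = [
    { id: "main", title: "hidden" },
    { id: "sidebar", title: "menu" },
    { id: "sidebar", title: "hidden" }
].map((element) => evaluate(element, query));

console.log(results); // [ true, true, false ]
```

The query object has the same shape as the ones passed to where() in the examples above.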

eq

ayakashi
    .select()
    .where({
        id: {
            eq: "main"
        }
    })

The most basic matching operator.
The specified attribute on the left must equal the value on the right to have a match.

neq

ayakashi
    .select()
    .where({
        id: {
            neq: "main"
        }
    })

The reverse of eq. The specified attribute on the left must not equal the value on the right to have a match.

$neq (strict)

ayakashi
    .select()
    .where({
        id: {
            $neq: "main"
        }
    })

The strict version of neq.
The difference is that the specified attribute must exist but not be equal to our right value.
In the above example, the query will match all elements that have an id that is not equal to main.
The non-strict version would have also matched any element that doesn’t have an id at all.
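The difference between the two versions can be modelled in a few lines of plain JavaScript. This is only a sketch of the rule described above, not ayakashi’s code: neq and strictNeq are hypothetical helpers, and a missing attribute is represented by an absent key.

```javascript
// Illustrative model only; these helpers are hypothetical, not ayakashi's code.
// A missing attribute is represented by the key being absent from the object.
const neq = (element, attribute, value) => element[attribute] !== value;

const strictNeq = (element, attribute, value) =>
    attribute in element && element[attribute] !== value;

const elements = [
    { tag: "div", id: "main" },
    { tag: "div", id: "sidebar" },
    { tag: "p" } // no id at all
];

const loose = elements.filter((el) => neq(el, "id", "main")).length;
const strict = elements.filter((el) => strictNeq(el, "id", "main")).length;

console.log(loose, strict); // 2 1
```

The non-strict count includes the element without an id, while the strict count does not.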

like

ayakashi
    .select()
    .where({
        href: {
            like: "github.com"
        }
    })

like will match if the specified attribute contains (or matches if using a regex) the value on the right.
Instead of a string, a regex can also be used:

ayakashi
    .select()
    .where({
        href: {
            like: /github/
        }
    })

nlike

ayakashi
    .select()
    .where({
        href: {
            nlike: "github.com"
        }
    })

The reverse of like.
The specified attribute on the left must not contain (or not match if using a regex) the value on the right.
Instead of a string, a regex can also be used:

ayakashi
    .select()
    .where({
        href: {
            nlike: /github/
        }
    })

$nlike (strict)

ayakashi
    .select()
    .where({
        href: {
            $nlike: "github.com"
        }
    })

The strict version of nlike.
The difference is that the specified attribute must exist but not contain (or not match if using a regex) the value on the right.
In the above example, the query will match all elements that have an href that does not contain github.com.
The non-strict version would have also matched any element that doesn’t have a href at all.
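To compare like, nlike and $nlike side by side, here is a small plain JavaScript model of the rules described above. It is a sketch under assumptions, not ayakashi’s implementation: the three helpers are hypothetical, and an element without the attribute is represented by undefined.

```javascript
// Illustrative model only; these helpers are hypothetical, not ayakashi's code.
// `value` is undefined when the element does not have the attribute.
function like(value, pattern) {
    if (value === undefined) return false;
    // a regex is tested against the value, a string is searched for inside it
    return pattern instanceof RegExp ? pattern.test(value) : value.includes(pattern);
}
const nlike = (value, pattern) => !like(value, pattern);
const strictNlike = (value, pattern) => value !== undefined && !like(value, pattern);

const hrefs = ["https://github.com/", "https://example.com/", undefined];

const likeString = hrefs.map((href) => like(href, "github.com"));
const likeRegex = hrefs.map((href) => like(href, /github/));
const notLike = hrefs.map((href) => nlike(href, "github.com"));
const strictNotLike = hrefs.map((href) => strictNlike(href, "github.com"));

console.log(likeString);    // [ true, false, false ]
console.log(likeRegex);     // [ true, false, false ]
console.log(notLike);       // [ false, true, true ]
console.log(strictNotLike); // [ false, true, false ]
```

Only the last column differs between nlike and $nlike: the element with no href at all.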

in

ayakashi
    .select()
    .where({
        class: {
            in: ["header", "footer"]
        }
    })

Matches any element that has the specified attribute equal to at least one value of the list on the right.
It’s a more compact version of an or with a series of eqs.

nin

ayakashi
    .select()
    .where({
        class: {
            nin: ["header", "footer"]
        }
    })

The reverse of in.
Matches any element that has the specified attribute not equal to any value of the list on the right.
It’s a more compact version of an and with a series of neqs.
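The equivalence between in / nin and their longer forms can be written out explicitly. The helpers below are hypothetical and only build the expanded query objects; they are a sketch of the relationship described above, not something ayakashi exposes.

```javascript
// Hypothetical helpers that rewrite `in` / `nin` into the longer equivalent form.
function expandIn(attribute, values) {
    // in: [a, b]  ->  or: [{eq: a}, {eq: b}]
    return { or: values.map((value) => ({ [attribute]: { eq: value } })) };
}

function expandNin(attribute, values) {
    // nin: [a, b]  ->  and: [{neq: a}, {neq: b}]
    return { and: values.map((value) => ({ [attribute]: { neq: value } })) };
}

const asOr = expandIn("class", ["header", "footer"]);
const asAnd = expandNin("class", ["header", "footer"]);

console.log(JSON.stringify(asOr));
// {"or":[{"class":{"eq":"header"}},{"class":{"eq":"footer"}}]}
console.log(JSON.stringify(asAnd));
// {"and":[{"class":{"neq":"header"}},{"class":{"neq":"footer"}}]}
```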

$nin (strict)

ayakashi
    .select()
    .where({
        class: {
            $nin: ["header", "footer"]
        }
    })

The strict version of nin.
The difference is that the specified attribute must exist but not be equal to any value of the list on the right.
In the above example, the query will match all elements that have a class attribute but neither the header nor the footer class.
The non-strict version would have also matched any element that doesn’t have a class attribute at all.

Limit, skip and order

We can control the amount and ordering of the matches with limit, skip and order.

ayakashi
    .select()
    .where({
        class: {
            eq: "container"
        }
    })
    .limit(1)

The above example will have at most 1 match.
Instead of select() and limit(1), you can use the convenience methods selectOne() and selectFirst(). Both do the same thing, but one or the other can be more readable depending on the context.

ayakashi
    .selectFirst()
    .where({
        class: {
            eq: "container"
        }
    })

To get only the second match, use skip(1):

ayakashi
    .selectOne()
    .where({
        class: {
            eq: "container"
        }
    })
    .skip(1)

To match in reverse order, use order("desc").
This will return the last element (if any):

ayakashi
    .select()
    .where({
        class: {
            eq: "container"
        }
    })
    .order("desc")
    .limit(1)

The selectLast() method expands to the exact same query:

ayakashi
    .selectLast()
    .where({
        class: {
            eq: "container"
        }
    })
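The way limit, skip and order interact can be pictured on a plain array of matches. This is only a model: applyWindow is a hypothetical helper, and it assumes the ordering is applied first, then skip, then limit, which is consistent with the examples above but not confirmed as ayakashi’s internal behaviour.

```javascript
// Illustrative model only: `applyWindow` is a hypothetical helper.
// Assumption: ordering is applied first, then skip, then limit.
function applyWindow(matches, { order = "asc", skip = 0, limit = Infinity } = {}) {
    const ordered = order === "desc" ? [...matches].reverse() : [...matches];
    return ordered.slice(skip, skip + limit);
}

// matches in document order
const matches = ["first", "second", "third"];

const first = applyWindow(matches, { limit: 1 });                 // like selectFirst()
const second = applyWindow(matches, { skip: 1, limit: 1 });       // like selectOne().skip(1)
const last = applyWindow(matches, { order: "desc", limit: 1 });   // like selectLast()

console.log(first, second, last); // [ 'first' ] [ 'second' ] [ 'third' ]
```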

Child queries

Sometimes we need to limit the scope of a match. Child queries accomplish just that:

ayakashi
    .selectOne("listContainer")
    .where({
        class: {
            eq: "repo-list"
        }
    })

ayakashi
    .select("githubLinks")
    .where({
        href: {
            like: "github.com"
        }
    })
    .from("listContainer")

The second query will only search for links inside our first prop listContainer.

Instead of using from(), you can also chain the child queries:

ayakashi
    .selectOne("listContainer")
    .where({
        class: {
            eq: "repo-list"
        }
    })
    .selectChildren("githubLinks")
        .where({
            href: {
                like: "github.com"
            }
        })

As with the normal select(), convenience methods also exist for child selections:

  • selectChildren() (no match limit)
  • selectChild() (limit(1))
  • selectFirstChild() (same as selectChild())
  • selectLastChild() (limit(1) and order("desc"))

You can nest child queries indefinitely, using either from() or the convenience methods.

Tracking missing children

When scraping a collection of elements, some of the parents might be missing a child element that the others have.

Imagine we need to extract the links from the following html:

<div class="container"><a href="http://example.com">link1</a></div>
<div class="container">I don't have a link</div>
<div class="container"><a href="http://example2.com">link2</a></div>

Our query:

ayakashi
    .select("parentProp")
    .where({
        class: {
            eq: "container"
        }
    })
    .selectChild("childProp")
        .where({
            tagName: {
                eq: "a"
            }
        });

If we extract the childProp, we will get a result like this: ["link1", "link2"].
If we were also extracting more child props from each container, this would mess up our ordering and the final extraction result would be incorrect.
In such a case we can use the trackMissingChildren() method like this:

ayakashi
    .select("parentProp")
    .where({
        class: {
            eq: "container"
        }
    })
    .trackMissingChildren() // <-- applied on the parent prop
    .selectChild("childProp")
        .where({
            tagName: {
                eq: "A"
            }
        });

Now, our result will be: ["link1", "", "link2"].
trackMissingChildren() will add a child match placeholder if a child does not exist in a collection of parents.
This will ensure proper ordering when extracting children that might sometimes not exist.
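To see why the placeholder matters, consider pairing the links with a second value extracted from every container. The sketch below uses plain arrays in place of real extraction results; the titles array and the pair helper are hypothetical and only serve the illustration.

```javascript
// Illustrative model only: plain objects stand in for the three containers.
const containers = [
    { link: "link1" },
    { link: undefined }, // "I don't have a link"
    { link: "link2" }
];

// Without tracking, a container with no child contributes nothing.
const untracked = containers
    .filter((container) => container.link !== undefined)
    .map((container) => container.link);

// With tracking, every container contributes exactly one entry,
// using an empty string as the placeholder.
const tracked = containers.map((container) => container.link ?? "");

// A second, hypothetical child prop that every container has.
const titles = ["title1", "title2", "title3"];
const pair = (links) => titles.map((title, i) => [title, links[i]]);

console.log(untracked); // [ 'link1', 'link2' ]
console.log(tracked);   // [ 'link1', '', 'link2' ]

// untracked: title2 is wrongly paired with link2
console.log(pair(untracked)[1]); // [ 'title2', 'link2' ]
// tracked: title2 is correctly paired with the placeholder
console.log(pair(tracked)[1]);   // [ 'title2', '' ]
```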

Querying element attributes

ayakashi
    .select()
    .where({
        and: [{
            href: {
                like: "github.com"
            }
        }, {
            text: {
                eq: "Repo"
            }
        }]
    })

Any standard element attribute will work as expected.

Querying data attributes

ayakashi
    .select()
    .where({
        "data-content": {
            like: "some content"
        }
    })

The special attributes dataKey and dataValue are also available if you need to query only for keys or only for values.

ayakashi
    .select()
    .where({
        dataKey: {
            eq: "content"
        }
    })

ayakashi
    .select()
    .where({
        dataValue: {
            like: "Here is my content:"
        }
    })

For dataKey, use only the part of the name after the data-.
So data-content becomes just content.
If the attribute name contains multiple words separated by hyphens (-), the part after data- has to be camelCased. For example, data-index-name has to be written as indexName.
Data attributes behave like any normal attribute so all of the operators apply here as well.
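The naming rule for dataKey can be expressed as a small conversion function. toDataKey is a hypothetical helper written for this illustration; it is not provided by ayakashi.

```javascript
// Hypothetical helper: turns a data-* attribute name into the camelCased key form.
function toDataKey(attributeName) {
    return attributeName
        .replace(/^data-/, "") // drop the data- prefix
        .replace(/-([a-z])/g, (_, letter) => letter.toUpperCase()); // camelCase the rest
}

console.log(toDataKey("data-content"));    // content
console.log(toDataKey("data-index-name")); // indexName
```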

Querying with style

ayakashi
    .select()
    .where({
        "style-background-color": {
            eq: "#8AAAE5"
        }
    })

We can also query by style.
Just prepend style- to the fully qualified css property and you can match it like any other attribute.
The query will search the actual evaluated css properties of the element, so styles from external stylesheets, inline style attributes or a <style> element can all be matched.
Note: Make sure to use the full name of the css property, so instead of font, use font-size and instead of border or border-width, use border-bottom-width etc.