Apify SDK 0.8.7 - The scalable web crawling and scraping library for JavaScript

Apify SDK simplifies the development of web crawlers, scrapers, data extractors and web automation jobs. It provides tools to manage and automatically scale a pool of headless Chrome / Puppeteer instances, to maintain queues of URLs to crawl, store crawling results to a local filesystem or into the cloud, rotate proxies and much more. The SDK is available as the apify NPM package. It can be used either stand-alone in your own applications or in actors running on the Apify cloud platform.

Motivation

Thanks to tools like Puppeteer or cheerio, it is easy to write Node.js code to extract data from web pages. But eventually things will get complicated. For example, when you try to:

  • Perform a deep crawl of an entire website using a persistent queue of URLs.
  • Run your scraping code on a list of 100k URLs in a CSV file, without losing any data when your code crashes.
  • Rotate proxies to hide your browser origin.
  • Schedule the code to run periodically and send notification on errors.
  • Disable browser fingerprinting protections used by websites.

Python has Scrapy for these tasks, but there was no such library for JavaScript, the language of the web. The use of JavaScript is natural, since the same language is used to write the scripts as well as the data extraction code running in a browser.

The goal of the Apify SDK is to fill this gap and provide a toolbox for generic web scraping, crawling and automation tasks in JavaScript. So don't reinvent the wheel every time you need data from the web. Focus on writing code specific to the target website, rather than re-implementing common crawling functionality.

Overview

The Apify SDK is available as the apify NPM package and it provides the following tools:

  • BasicCrawler - Provides a simple framework for the parallel crawling of web pages whose URLs are fed either from a static list or from a dynamic queue of URLs. This class serves as a base for more complex crawlers (see below).
  • CheerioCrawler - Enables the parallel crawling of a large number of web pages using the cheerio HTML parser. This is the most efficient type of crawler, but it does not work on websites that require client-side JavaScript to render their content.
  • PuppeteerCrawler - Enables the parallel crawling of a large number of web pages using the headless Chrome browser and Puppeteer. The pool of Chrome browsers is automatically scaled up and down based on available system resources.
  • PuppeteerPool - Provides web browser tabs for user jobs from an automatically-managed pool of Chrome browser instances, with configurable browser recycling and retirement policies. Supports reuse of the disk cache to speed up the crawling of websites and reduce proxy bandwidth.
  • RequestList - Represents a list of URLs to crawl. The URLs can be passed in code or in a text file hosted on the web. The list persists its state so that crawling can resume when the Node.js process restarts.
  • RequestQueue - Represents a queue of URLs to crawl, which is stored either on a local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.
  • Dataset - Provides a store for structured data and enables their export to formats like JSON, JSONL, CSV, XML, Excel or HTML. The data is stored on a local filesystem or in the cloud. Datasets are useful for storing and sharing large tabular crawling results, such as a list of products or real estate offers.
  • KeyValueStore - A simple key-value store for arbitrary data records or files, along with their MIME content type. It is ideal for saving screenshots of web pages, PDFs or to persist the state of your crawlers. The data is stored on a local filesystem or in the cloud.
  • AutoscaledPool - Runs asynchronous background tasks, while automatically adjusting the concurrency based on free system memory and CPU usage. This is useful for running web scraping tasks at the maximum capacity of the system.
  • PuppeteerUtils - Provides several helper functions useful for web scraping. For example, to inject jQuery into web pages or to hide browser origin.
  • Additionally, the package provides various helper functions to simplify running your code on the Apify cloud platform and thus take advantage of its pool of proxies, job scheduler, data storage, etc. For more information, see the Apify SDK Programmer's Reference.

Getting started

The Apify SDK requires Node.js 8 or later.

Local stand-alone usage

Add Apify SDK to any Node.js project by running:

npm install apify --save

Run the following example to perform a recursive crawl of a website using Puppeteer.

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest(new Apify.Request({ url: 'https://www.iana.org/' }));
    const pseudoUrls = [new Apify.PseudoUrl('https://www.iana.org/[.*]')];

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            console.log(`Title of ${request.url}: ${title}`);
            await Apify.utils.puppeteer.enqueueLinks(page, 'a', pseudoUrls, requestQueue);
        },
        maxRequestsPerCrawl: 100,
        maxConcurrency: 10,
    });

    await crawler.run();
});

By default, Apify SDK stores data to ./apify_storage in the current working directory. You can override this behavior by setting either the APIFY_LOCAL_STORAGE_DIR or APIFY_TOKEN environment variable. For details, see Environment variables and Data storage.

Local usage with Apify command-line interface (CLI)

To avoid the need to set the environment variables manually, to create a boilerplate of your project, and to enable pushing and running your code on the Apify cloud, you can use the Apify command-line interface (CLI) tool.

Install the CLI by running:

npm -g install apify-cli

You might need to run the above command with sudo, depending on how crazy your configuration is.

Now create a boilerplate of your new web crawling project by running:

apify create my-hello-world

The CLI will prompt you to select a project boilerplate template - just pick "Hello world". The tool will create a directory called my-hello-world with Node.js project files. You can run the project as follows:

cd my-hello-world
apify run

By default, the crawling data will be stored in a local directory at ./apify_storage. For example, the input JSON file for the actor is expected to be in the default key-value store in ./apify_storage/key_value_stores/default/INPUT.json.

Now you can easily deploy your code to the Apify cloud by running:

apify login
apify push

Your script will be uploaded to the Apify cloud and built there so that it can be run in the cloud. For more information, view the Apify CLI and Apify Actor documentation.

Usage on the Apify cloud platform

You can also develop your web scraping project in an online code editor directly on the Apify cloud. You'll need to have an Apify Account. Go to the Actors page in the app, click Create new, then go to the Source tab and start writing your code, or paste one of the code examples below.

For more information, view the Apify actors quick start guide.

What is an "actor"?

When you deploy your script to the Apify cloud platform, it becomes an actor. An actor is a serverless microservice that accepts an input and produces an output. It can run for a few seconds, hours or even infinitely. An actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset.

To run an actor, you need to have an Apify Account. Actors can be shared in the Apify Library so that other people can use them. But don't worry, if you share your actor in the library and somebody uses it, it runs under their account, not yours.


Examples

An example is better than a thousand words. In the following sections you will find several examples of how to perform various web scraping and automation tasks using the Apify SDK. All the examples can be found in the examples directory in the repository.

To run the examples, just copy them into the directory where you installed the Apify SDK using npm install apify and then run them, for example, by calling:

node basic_crawler.js

Note that for production projects you should set either the APIFY_LOCAL_STORAGE_DIR or APIFY_TOKEN environment variable in order to tell the SDK how to store its data and crawling state. See Local stand-alone usage above for details.

Alternatively, if you're using the Apify CLI, you can copy and paste the source code of each of the examples into the main.js file created by the CLI. Then go to the project directory and run the example using:

apify run

Crawl several pages in raw HTML

This is the most basic example of the Apify SDK, which demonstrates some of its elementary tools, such as the BasicCrawler and RequestList classes. The script just downloads several web pages with plain HTTP requests (using the request-promise library) and stores their raw HTML and URL to the default dataset. In local configuration, the data will be stored as JSON files in ./apify_storage/datasets/default.

const Apify = require('apify');
const requestPromise = require('request-promise');

// Apify.main() function wraps the crawler logic (it is optional).
Apify.main(async () => {
    // Create and initialize an instance of the RequestList class that contains
    // a list of URLs to crawl. Here we use just a few hard-coded URLs.
    const requestList = new Apify.RequestList({
        sources: [
            { url: 'http://www.google.com/' },
            { url: 'http://www.example.com/' },
            { url: 'http://www.bing.com/' },
            { url: 'http://www.wikipedia.com/' },
        ],
    });
    await requestList.initialize();

    // Create a BasicCrawler - the simplest crawler that enables
    // users to implement the crawling logic themselves.
    const crawler = new Apify.BasicCrawler({

        // Let the crawler fetch URLs from our list.
        requestList,

        // This function will be called for each URL to crawl.
        // The 'request' option is an instance of the Request class, which contains
        // information such as URL and HTTP method, as supplied by the RequestList.
        handleRequestFunction: async ({ request }) => {
            console.log(`Processing ${request.url}...`);

            // Fetch the page HTML
            const html = await requestPromise(request.url);

            // Store the HTML and URL to the default dataset.
            await Apify.pushData({
                url: request.url,
                html,
            });
        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run();

    console.log('Crawler finished.');
});

Crawl an external list of URLs with Cheerio

This example demonstrates how to use CheerioCrawler to crawl a list of URLs from an external file, load each URL using a plain HTTP request, parse the HTML using cheerio and extract some data from it: the page title and all H1 tags.

const Apify = require('apify');

// Apify.utils contains various utilities, e.g. for logging.
// Here we turn off the logging of unimportant messages.
const { log } = Apify.utils;
log.setLevel(log.LEVELS.WARNING);

// A link to a list of Fortune 500 companies' websites available on GitHub.
const CSV_LINK = 'https://gist.githubusercontent.com/hrbrmstr/ae574201af3de035c684/raw/f1000.csv';

// Apify.main() function wraps the crawler logic (it is optional).
Apify.main(async () => {
    // Create an instance of the RequestList class that contains a list of URLs to crawl.
    // Here we download and parse the list of URLs from an external file.
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: CSV_LINK }],
    });
    await requestList.initialize();

    // Create an instance of the CheerioCrawler class - a crawler
    // that automatically loads the URLs and parses their HTML using the cheerio library.
    const crawler = new Apify.CheerioCrawler({
        // Let the crawler fetch URLs from our list.
        requestList,

        // The crawler downloads and processes the web pages in parallel, with a concurrency
        // automatically managed based on the available system memory and CPU (see AutoscaledPool class).
        // Here we define some hard limits for the concurrency.
        minConcurrency: 10,
        maxConcurrency: 50,

        // On error, retry each page at most once.
        maxRequestRetries: 1,

        // Increase the timeout for processing of each page.
        handlePageTimeoutSecs: 60,

        // This function will be called for each URL to crawl.
        // It accepts a single parameter, which is an object with the following fields:
        // - request: an instance of the Request class with information such as URL and HTTP method
        // - html: contains raw HTML of the page
        // - $: the cheerio object containing parsed HTML
        handlePageFunction: async ({ request, html, $ }) => {
            console.log(`Processing ${request.url}...`);

            // Extract data from the page using cheerio.
            const title = $('title').text();
            const h1texts = [];
            $('h1').each((index, el) => {
                h1texts.push({
                    text: $(el).text(),
                });
            });

            // Store the results to the default dataset. In local configuration,
            // the data will be stored as JSON files in ./apify_storage/datasets/default
            await Apify.pushData({
                url: request.url,
                title,
                h1texts,
                html,
            });
        },

        // This function is called if the page processing failed more than maxRequestRetries+1 times.
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed twice.`);
        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run();

    console.log('Crawler finished.');
});

Recursively crawl a website using Puppeteer

This example demonstrates how to use PuppeteerCrawler in combination with RequestList and RequestQueue to recursively scrape the Hacker News website using headless Chrome / Puppeteer. The crawler starts with a single URL, finds links to next pages, enqueues them and continues until no more desired links are available. The results are stored to the default dataset. In local configuration, the results are represented as JSON files in ./apify_storage/datasets/default.

const Apify = require('apify');

Apify.main(async () => {
    // Create and initialize an instance of the RequestList class that contains the start URL.
    const requestList = new Apify.RequestList({
        sources: [
            { url: 'https://news.ycombinator.com/' },
        ],
    });
    await requestList.initialize();

    // Apify.openRequestQueue() is a factory to get a preconfigured RequestQueue instance.
    const requestQueue = await Apify.openRequestQueue();

    // Create an instance of the PuppeteerCrawler class - a crawler
    // that automatically loads the URLs in headless Chrome / Puppeteer.
    const crawler = new Apify.PuppeteerCrawler({
        // The crawler will first fetch start URLs from the RequestList
        // and then the newly discovered URLs from the RequestQueue
        requestList,
        requestQueue,

        // Run Puppeteer in headless mode. If you set headless to false, you'll see the scraping
        // browsers showing up on your screen. This is great for debugging.
        launchPuppeteerOptions: { headless: true },

        // This function will be called for each URL to crawl.
        // Here you can write the Puppeteer scripts you are familiar with,
        // with the exception that browsers and pages are automatically managed by the Apify SDK.
        // The function accepts a single parameter, which is an object with the following fields:
        // - request: an instance of the Request class with information such as URL and HTTP method
        // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
        handlePageFunction: async ({ request, page }) => {
            console.log(`Processing ${request.url}...`);

            // A function to be evaluated by Puppeteer within the browser context.
            const pageFunction = ($posts) => {
                const data = [];

                // We're getting the title, rank and URL of each post on Hacker News.
                $posts.forEach(($post) => {
                    data.push({
                        title: $post.querySelector('.title a').innerText,
                        rank: $post.querySelector('.rank').innerText,
                        href: $post.querySelector('.title a').href,
                    });
                });

                return data;
            };
            const data = await page.$$eval('.athing', pageFunction);

            // Store the results to the default dataset.
            await Apify.pushData(data);

            // Find the link to the next page using Puppeteer functions.
            let nextHref;
            try {
                nextHref = await page.$eval('.morelink', el => el.href);
            } catch (err) {
                console.log(`${request.url} is the last page!`);
                return;
            }

            // Enqueue the link to the RequestQueue
            await requestQueue.addRequest(new Apify.Request({ url: nextHref }));
        },

        // This function is called if the page processing failed more than maxRequestRetries+1 times.
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed too many times`);
        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run();

    console.log('Crawler finished.');
});

Save page screenshots

This example demonstrates how to read and write data to the default key-value store using Apify.getValue() and Apify.setValue(). The script crawls a list of URLs using Puppeteer, captures a screenshot of each page and saves it to the store. The list of URLs is provided as actor input that is also read from the store.

In local configuration, the input is stored in the default key-value store's directory as a JSON file at ./apify_storage/key_value_stores/default/INPUT.json. You need to create the file with the following content:

{ "sources": [{ "url": "https://www.google.com" }, { "url": "https://www.duckduckgo.com" }] }

On the Apify cloud platform, the input can be either set manually in the UI app or passed as the POST payload to the Run actor API call. For more details, see Input and output in the Apify Actor documentation.

const Apify = require('apify');

Apify.main(async () => {
    // Read the actor input configuration containing the URLs for the screenshot.
    // By convention, the input is present in the actor's default key-value store under the "INPUT" key.
    const input = await Apify.getValue('INPUT');
    if (!input) throw new Error('Have you passed the correct INPUT?');

    const { sources } = input;

    const requestList = new Apify.RequestList({ sources });
    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);

            // This is a Puppeteer function that takes a screenshot of the page and returns its buffer.
            const screenshotBuffer = await page.screenshot();

            // The record key may only include the following characters: a-zA-Z0-9!-_.'()
            const key = request.url.replace(/[:/]/g, '_');

            // Save the screenshot. Choosing the right content type will automatically
            // assign the local file the right extension, in this case .png.
            // The screenshots will be stored in ./apify_storage/key_value_stores/default/
            await Apify.setValue(key, screenshotBuffer, { contentType: 'image/png' });
            console.log(`Screenshot of ${request.url} saved.`);
        },
    });

    // Run crawler.
    await crawler.run();

    console.log('Crawler finished.');
});

Open web page in Puppeteer via Apify Proxy

This example demonstrates how to load pages in headless Chrome / Puppeteer over Apify Proxy. To make it work, you'll need an Apify Account that has access to the proxy. The proxy password is available on the Proxy page in the app. Just set it to the APIFY_PROXY_PASSWORD environment variable or run the script using the CLI.

const Apify = require('apify');

Apify.main(async () => {
    // Apify.launchPuppeteer() is similar to Puppeteer's launch() function.
    // It accepts the same parameters and returns a preconfigured Puppeteer.Browser instance.
    // Moreover, it accepts several additional options, such as useApifyProxy.
    const options = {
        useApifyProxy: true,
    };
    const browser = await Apify.launchPuppeteer(options);

    console.log('Running Puppeteer script...');

    // Proceed with a plain Puppeteer script.
    const page = await browser.newPage();
    const url = 'https://en.wikipedia.org/wiki/Main_Page';
    await page.goto(url);
    const title = await page.title();

    console.log(`Page title: ${title}`);

    // Cleaning up after yourself is always good.
    await browser.close();
    console.log('Puppeteer closed.');
});

Invoke another actor

This example demonstrates how to start an Apify actor using Apify.call() and how to call the Apify API using Apify.client. The script extracts the current Bitcoin prices from Kraken.com and sends them to your email using the apify/send-mail actor.

To make the example work, you'll need an Apify Account. Go to the Account - Integrations page to obtain your API token and set it to the APIFY_TOKEN environment variable, or run the script using the CLI. If you deploy this actor to the Apify platform, you can set up a scheduler to run it early every morning. Don't miss the chance of your life to get rich!

const Apify = require('apify');

Apify.main(async () => {
    // Launch the web browser.
    const browser = await Apify.launchPuppeteer();

    console.log('Obtaining email address...');
    const user = await Apify.client.users.getUser();

    // Load Kraken.com charts and get last traded price of BTC
    console.log('Extracting data from kraken.com...');
    const page = await browser.newPage();
    await page.goto('https://www.kraken.com/charts');
    const tradedPricesHtml = await page.$eval('#ticker-top ul', el => el.outerHTML);

    // Send prices to your email. For that, you can use an actor we already
    // have available on the platform under the name: apify/send-mail.
    // The second parameter to the Apify.call() invocation is the actor's
    // desired input. You can find the required input parameters by checking
    // the actor's documentation page: https://www.apify.com/apify/send-mail
    console.log(`Sending email to ${user.email}...`);
    await Apify.call('apify/send-mail', {
        to: user.email,
        subject: 'Kraken.com BTC',
        html: `<h1>Kraken.com BTC</h1>${tradedPricesHtml}`,
    });

    console.log('Email sent. Good luck!');
});

Use an actor as an API

This example shows a quick actor that has a run time of just a few seconds. It opens a web page that contains a webcam stream from the Golden Gate Bridge, takes a screenshot of the page and saves it as output.

This actor can be invoked synchronously using a single HTTP request to directly obtain its output as a response, using the Run actor synchronously Apify API endpoint. The example is also shared as the apify/example-golden-gate-webcam actor in the Apify library, so you can test it directly there simply by sending a POST request to https://api.apify.com/v2/acts/apify~example-golden-gate-webcam/run-sync?token=[YOUR_API_TOKEN]

const Apify = require('apify');

Apify.main(async () => {
    // Launch web browser.
    const browser = await Apify.launchPuppeteer();

    // Load http://goldengatebridge75.org/news/webcam.html and get an IFRAME with the webcam stream
    console.log('Opening web page...');
    const page = await browser.newPage();
    await page.goto('http://goldengatebridge75.org/news/webcam.html');
    const iframe = (await page.frames()).pop();

    // Get webcam image element handle.
    const imageElementHandle = await iframe.$('.VideoColm img');

    // Give the webcam image some time to load.
    console.log('Waiting for page to load...');
    await Apify.utils.sleep(3000);

    // Get a screenshot of that image.
    const imageBuffer = await imageElementHandle.screenshot();
    console.log('Screenshot captured.');

    // Save the screenshot as the actor's output. By convention, similarly to "INPUT",
    // the actor's output is stored in the default key-value store under the "OUTPUT" key.
    await Apify.setValue('OUTPUT', imageBuffer, { contentType: 'image/jpeg' });
    console.log('Actor finished.');
});

Environment variables

The Apify SDK uses the following basic environment variables:

  • APIFY_LOCAL_STORAGE_DIR - Defines the path to a local directory where key-value stores, datasets and request queues store their data. Typically it is set to ./apify_storage. If omitted, you should define the APIFY_TOKEN environment variable instead.
  • APIFY_TOKEN - The API token for your Apify Account. It is used to access the Apify API, e.g. to access cloud storage or to run an actor in the cloud. You can find your API token on the Account - Integrations page. If omitted, you should define the APIFY_LOCAL_STORAGE_DIR environment variable instead.
  • APIFY_PROXY_PASSWORD - Optional password to Apify Proxy for IP address rotation. If you have an Apify Account, you can find the password on the Proxy page in the Apify app. This feature is optional. You can use your own proxies or no proxies at all.
  • APIFY_HEADLESS - If set to 1, web browsers launched by the Apify SDK will run in headless mode. You can still override this setting in the code, e.g. by passing the headless: true option to the Apify.launchPuppeteer() function. But having this setting in an environment variable allows you to develop the crawler locally in headful mode to simplify debugging, and only run it in headless mode once you deploy it to the cloud. By default, the browsers are launched in headful mode, i.e. with windows.
  • APIFY_LOG_LEVEL - Specifies the minimum log level, which can be one of the following values (in order of severity): DEBUG, INFO, WARNING, SOFT_FAIL and ERROR. By default, the log level is set to INFO, which means that DEBUG messages are not printed to the console.
  • APIFY_MEMORY_MBYTES - Sets the amount of system memory in megabytes to be used by the autoscaled pool. It is used to limit the number of concurrently running tasks. By default, the maximum amount of memory to be used is set to one quarter of total system memory, i.e. on a system with 8192 MB of memory, the autoscaling feature will only use up to 2048 MB of memory.
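
For example, on Linux or macOS you can set the variables just for a single run of your script (main.js here stands for your own file):

APIFY_LOCAL_STORAGE_DIR=./apify_storage APIFY_HEADLESS=1 node main.js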

For the full list of environment variables used by Apify SDK and the Apify cloud platform, please see the Environment variables in the Apify actor documentation.

Data storage

The Apify SDK has several data storage types that are useful for specific tasks. The data is stored either on local disk to a directory defined by the APIFY_LOCAL_STORAGE_DIR environment variable, or on the Apify cloud under the user account identified by the API token defined by the APIFY_TOKEN environment variable. If neither of these variables is defined, by default Apify SDK sets APIFY_LOCAL_STORAGE_DIR to ./apify_storage in the current working directory and prints a warning.

Typically, you will develop the code on your local computer and thus set the APIFY_LOCAL_STORAGE_DIR environment variable. Once the code is ready, you will deploy it to the Apify cloud, where the platform automatically sets the APIFY_TOKEN environment variable, so the code uses cloud storage. No code changes are needed.


Key-value store

The key-value store is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots of web pages, PDFs or to persist the state of crawlers.

Each actor run is associated with a default key-value store, which is created exclusively for the actor run. By convention, the actor run input and output is stored in the default key-value store under the INPUT and OUTPUT key, respectively. Typically the input and output is a JSON file, although it can be any other format.

In the Apify SDK, the key-value store is represented by the KeyValueStore class. In order to simplify access to the default key-value store, the SDK also provides Apify.getValue() and Apify.setValue() functions.

In local configuration, the data is stored in the directory specified by the APIFY_LOCAL_STORAGE_DIR environment variable as follows:

[APIFY_LOCAL_STORAGE_DIR]/key_value_stores/[STORE_ID]/[KEY].[EXT]

Note that [STORE_ID] is the name or ID of the key-value store. The default key value store has ID default, unless you override it by setting the APIFY_DEFAULT_KEY_VALUE_STORE_ID environment variable. The [KEY] is the key of the record and [EXT] corresponds to the MIME content type of the data value.

The following code demonstrates basic operations of key-value stores:

// Get actor input from the default key-value store
const input = await Apify.getValue('INPUT');

// Write actor output to the default key-value store.
await Apify.setValue('OUTPUT', { myResult: 123 });

// Open a named key-value store
const store = await Apify.openKeyValueStore('some-name');

// Write record. JavaScript object is automatically converted to JSON,
// strings and binary buffers are stored as they are
await store.setValue('some-key', { foo: 'bar' });

// Read record. Note that JSON is automatically parsed to a JavaScript object,
// text data returned as a string and other data is returned as binary buffer
const value = await store.getValue('some-key');

// Delete record
await store.delete('some-key');

To see a real-world example of how to get the input from the key-value store, see the screenshots.js example.

Dataset

Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine a dataset as a table, where each object is a row and its attributes are columns. A dataset is append-only storage: you can add new records to it, but you cannot modify or remove existing records.

When the dataset is stored in the Apify cloud, you can export its data to the following formats: HTML, JSON, CSV, Excel, XML and RSS. The datasets are displayed on the actor run details page and in the Storage section in the Apify app. The actual data is exported using the Get dataset items Apify API endpoint. This way you can easily share crawling results.

Each actor run is associated with a default dataset, which is created exclusively for the actor run. Typically it is used to store crawling results specific for the actor run. Its usage is optional.

In the Apify SDK, the dataset is represented by the Dataset class. In order to simplify writes to the default dataset, the SDK also provides the Apify.pushData() function.

In local configuration, the data is stored in the directory specified by the APIFY_LOCAL_STORAGE_DIR environment variable as follows:

[APIFY_LOCAL_STORAGE_DIR]/datasets/[DATASET_ID]/[INDEX].json

Note that [DATASET_ID] is the name or ID of the dataset. The default dataset has ID default, unless you override it by setting the APIFY_DEFAULT_DATASET_ID environment variable. Each dataset item is stored as a separate JSON file, where [INDEX] is a zero-based index of the item in the dataset.

The following code demonstrates basic operations of the dataset:

// Write a single row to the default dataset
await Apify.pushData({ col1: 123, col2: 'val2' });

// Open a named dataset
const dataset = await Apify.openDataset('some-name');

// Write a single row
await dataset.pushData({ foo: 'bar' });

// Write multiple rows
await dataset.pushData([
  { foo: 'bar2', col2: 'val2' },
  { col3: 123 },
]);

To see how to use the dataset to store crawler results, see the cheerio_crawler.js example.

Request queue

The request queue is a storage of URLs to crawl. The queue is used for the deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.

Each actor run is associated with a default request queue, which is created exclusively for the actor run. Typically it is used to store URLs to crawl in the specific actor run. Its usage is optional.

In Apify SDK, the request queue is represented by the RequestQueue class.

In local configuration, the request queue data is stored in the directory specified by the APIFY_LOCAL_STORAGE_DIR environment variable as follows:

[APIFY_LOCAL_STORAGE_DIR]/request_queues/[QUEUE_ID]/[STATE]/[NUMBER].json

Note that [QUEUE_ID] is the name or ID of the request queue. The default queue has ID default, unless you override it by setting the APIFY_DEFAULT_REQUEST_QUEUE_ID environment variable. Each request in the queue is stored as a separate JSON file, where [STATE] is either handled or pending, and [NUMBER] is an integer indicating the position of the request in the queue.

The following code demonstrates basic operations of the request queue:

// Open the default request queue associated with the actor run
const queue = await Apify.openRequestQueue();

// Open a named request queue
const queueWithName = await Apify.openRequestQueue('some-name');

// Enqueue few requests
await queue.addRequest(new Apify.Request({ url: 'http://example.com/aaa'}));
await queue.addRequest(new Apify.Request({ url: 'http://example.com/bbb'}));
await queue.addRequest(new Apify.Request({ url: 'http://example.com/foo/bar'}), { forefront: true });

// Get requests from queue
const request1 = await queue.fetchNextRequest();
const request2 = await queue.fetchNextRequest();
const request3 = await queue.fetchNextRequest();

// Mark a request as handled
await queue.markRequestHandled(request1);

// If processing fails then reclaim the request back to the queue, so that it's crawled again
await queue.reclaimRequest(request2);

To see how to use the request queue with a crawler, see the puppeteer_crawler.js example.

Puppeteer live view

Apify SDK enables the real-time view of launched Puppeteer browser instances and their open tabs, including screenshots of pages and snapshots of HTML. This is useful for debugging your crawlers that run in headless mode.

The live view dashboard is run on a web server that is started on a port specified by the APIFY_CONTAINER_PORT environment variable (typically 4321). To enable live view, pass the liveView: true option to Apify.launchPuppeteer():

const browser = await Apify.launchPuppeteer({ liveView: true });

or to PuppeteerCrawler constructor as follows:

const crawler = new Apify.PuppeteerCrawler({
    launchPuppeteerOptions: { liveView: true },
    // other options
});

To simplify debugging, you may also want to add the { slowMo: 300 } option to slow down all browser operations. See the Puppeteer documentation for details.
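
For example, to combine both options when launching the browser directly (a minimal sketch based on the snippets above):

const browser = await Apify.launchPuppeteer({ liveView: true, slowMo: 300 });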

Once the live view is enabled, you can open http://localhost:4321 in your web browser to see an overview of the launched browser instances and their open tabs.

Click the magnifying glass icon to view the detail of a page, showing its screenshot and raw HTML.

For more information, read the Debugging your actors with Live View article in the Apify Knowledge base.

Support

If you find any bug or issue with the Apify SDK, please submit an issue on GitHub. For questions, you can ask on Stack Overflow or contact support@apify.com.

Contributing

Your code contributions are welcome and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request.

Programmer's reference

The following sections describe all functions and properties provided by the apify package. All of them are instance members exported directly by the main module.

Members (3)

(static) events

Gets an instance of Node.js' EventEmitter class that emits various events from the SDK or the Apify platform. The event emitter is initialized by calling the Apify.main() function.

Example usage:

Apify.main(async () => {
  Apify.events.on('cpuInfo', (data) => {
    if (data.isCpuOverloaded) console.log('Oh no, the CPU is overloaded!');
  });
});

The following events are currently emitted:

  • cpuInfo - Data: { "isCpuOverloaded": Boolean }. The event is emitted approximately every second and it indicates whether the actor is using the maximum of available CPU resources. If that's the case, the actor should not add more workload. For example, this event is used by the AutoscaledPool class.
  • migrating - Data: none. Emitted when the actor running on the Apify platform is about to be migrated to another worker server. You can use it to persist the state of the actor and abort the run, in order to speed up the migration. For example, this is used by the RequestList class.
  • persistState - Data: { "isMigrating": Boolean }. Emitted at regular intervals to notify all components of the Apify SDK that it is time to persist their state, in order to avoid repeating all work when the actor restarts. This event is automatically emitted together with the migrating event, in which case the isMigrating flag is set to true; otherwise the flag is false.
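
For example, a crawler might persist its own custom state whenever the persistState event fires (a minimal sketch; the state object and its storage key are illustrative):

const Apify = require('apify');

Apify.main(async () => {
  // Hypothetical state maintained by your own crawling logic.
  const state = { processedUrls: 0 };

  // Save the state whenever the SDK signals that it should be persisted,
  // so that a restarted actor can continue where it left off.
  Apify.events.on('persistState', async () => {
    await Apify.setValue('MY-CRAWLER-STATE', state);
  });

  // ... the rest of the crawler code goes here ...
});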

client

Gets the default instance of the ApifyClient class provided by the apify-client NPM package. The instance is created automatically by the Apify SDK and it is configured using the APIFY_API_BASE_URL, APIFY_USER_ID and APIFY_TOKEN environment variables.

The instance is used for all underlying calls to the Apify API in functions such as Apify.getValue() or Apify.call(). The settings of the client can be globally altered by calling the Apify.client.setOptions() function. Beware that altering these settings might have unintended effects on the entire Apify SDK package.
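
For example, the following sketch changes the API base URL used by the default client (baseUrl is one of the options accepted by the apify-client package):

Apify.client.setOptions({ baseUrl: 'https://api.apify.com' });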

openRequestQueue

Opens a request queue and returns a promise resolving to an instance of the RequestQueue class.

RequestQueue represents a queue of URLs to crawl, which is stored either on local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.

For more details and code examples, see the RequestQueue class.

Methods (14)

call(actId, [input], [options]) → {Promise.<ActorRun>}

Runs an actor on the Apify platform using the current user account (determined by the APIFY_TOKEN environment variable), waits for the actor to finish and fetches its output.

By passing the waitSecs option you can reduce the maximum amount of time to wait for the run to finish. If the value is less than or equal to zero, the function returns immediately after the run is started.

The result of the function is an ActorRun object that contains details about the actor run and its output (if any). If the actor run fails, the function throws an ApifyCallError exception.

Example usage:

const run = await Apify.call('apify/hello-world', { myInput: 123 });
console.log(`Received message: ${run.output.body.message}`);

Internally, the call() function calls the Run actor Apify API endpoint and a few others to obtain the output.

Parameters:
  • actId ( String ) - Either username/actor-name or actor ID.
  • input ( Object | String | Buffer ) <optional> - Input for the actor. If it is an object, it will be stringified to JSON and its content type is set to application/json; charset=utf-8. Otherwise the options.contentType parameter must be provided.
  • options ( Object ) <optional> - Object with settings
    • contentType ( String ) <optional> - Content type for the input. If not specified, input is expected to be an object that will be stringified to JSON and content type set to application/json; charset=utf-8. If options.contentType is specified, then input must be a String or Buffer.
    • token ( String ) <optional> - User API token that is used to run the actor. By default, it is taken from the APIFY_TOKEN environment variable.
    • memory ( Number ) <optional> - Memory in megabytes which will be allocated for the new actor run.
    • build ( String ) <optional> - Tag or number of the actor build to run (e.g. beta or 1.2.345). If not provided, the run uses build tag or number from the default actor run configuration (typically latest).
    • waitSecs ( Number ) <optional> - Maximum time to wait for the actor run to finish, in seconds. If the limit is reached, the returned promise is resolved to a run object that will have status READY or RUNNING and it will not contain the actor run output. If waitSecs is null or undefined, the function waits for the actor to finish (default behavior).
    • fetchOutput ( Boolean ) <optional> - If false then the function does not fetch output of the actor. Defaults to true.
    • disableBodyParser ( Boolean ) <optional> - If true then the function will not attempt to parse the actor's output and will return it in a raw Buffer. Defaults to false.
Throws:

If the run did not succeed, e.g. if it failed or timed out.

Type
( ApifyCallError )
Returns:
  • ( Promise.<ActorRun> ) - Returns a promise that resolves to an instance of ActorRun. If the actor run fails, the promise is rejected with ApifyCallError.
getApifyProxyUrl(opts) → {String}

    Constructs the URL to the Apify Proxy using the specified settings. The proxy URL can be used from Apify actors, web browsers or any other HTTP proxy-enabled applications.

    For more information, see the Apify Proxy page in the app or the documentation.

    Parameters:
    • opts ( Object )
      • password ( String ) - User's password for the proxy. By default, it is taken from the APIFY_PROXY_PASSWORD environment variable, which is automatically set by the system when running the actors on the Apify cloud.
      • groups ( Array.<String> ) <optional> - Array of Apify Proxy groups to be used. If not provided, the proxy will select the groups automatically.
      • session ( String ) <optional> - Apify Proxy session identifier to be used by the Chrome browser. All HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier can only contain the following characters: 0-9, a-z, A-Z, ".", "_" and "~".
    Returns:
  • ( String ) - Returns the proxy URL, e.g. http://auto:my_password@proxy.apify.com:8000.
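
    For example, a minimal sketch that builds a proxy URL for a specific group and session (the group name and session ID are illustrative; use values available to your account):

    const proxyUrl = Apify.getApifyProxyUrl({
        groups: ['SHADER'],
        session: 'my_session_1',
    });
    // The URL can then be passed e.g. to Apify.launchPuppeteer({ proxyUrl }).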
getEnv() → {Object}

    Returns a new object which contains information parsed from the APIFY_XXX environment variables. It has the following properties:

    {
        // ID of the actor (APIFY_ACT_ID)
        actId: String,
     
        // ID of the actor run (APIFY_ACT_RUN_ID)
        actRunId: String,
     
        // ID of the user who started the actor - note that it might be
        // different than the owner of the actor (APIFY_USER_ID)
        userId: String,
     
        // Authentication token representing privileges given to the actor run,
        // it can be passed to various Apify APIs (APIFY_TOKEN).
        token: String,
     
        // Date when the actor was started (APIFY_STARTED_AT)
        startedAt: Date,
     
        // Date when the actor will time out (APIFY_TIMEOUT_AT)
        timeoutAt: Date,
     
        // ID of the key-value store where input and output data of this
        // actor is stored (APIFY_DEFAULT_KEY_VALUE_STORE_ID)
        defaultKeyValueStoreId: String,
     
        // ID of the dataset where input and output data of this
        // actor is stored (APIFY_DEFAULT_DATASET_ID)
        defaultDatasetId: String,
     
        // Amount of memory allocated for the actor,
        // in megabytes (APIFY_MEMORY_MBYTES)
        memoryMbytes: Number,
    }

    For the list of the APIFY_XXX environment variables, see Actor documentation. If some of the variables are not defined or are invalid, the corresponding value in the resulting object will be null.

    Returns:
  • ( Object )
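
    For example (a minimal sketch using the properties described above):

    const env = Apify.getEnv();
    console.log(`Actor run ${env.actRunId} has ${env.memoryMbytes} MB of memory available.`);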
main(userFunc)

    Runs the main user function that performs the job of the actor.

    Apify.main() is especially useful when you're running your code in an actor on the Apify platform. Note that its use is optional - the function is provided merely for your convenience.

    The function performs the following actions:

    1. When running on the Apify platform (i.e. APIFY_IS_AT_HOME environment variable is set), it sets up a connection to listen for platform events. For example, to get a notification about an imminent migration to another server. See Apify.events for details.
    2. It checks that either APIFY_TOKEN or APIFY_LOCAL_STORAGE_DIR environment variable is defined. If not, the function sets APIFY_LOCAL_STORAGE_DIR to ./apify_storage inside the current working directory. This is to simplify running code examples.
    3. It invokes the user function passed as the userFunc parameter.
    4. If the user function returned a promise, waits for it to resolve.
    5. If the user function throws an exception or some other error is encountered, prints error details to console so that they are stored to the log.
    6. Exits the Node.js process, with zero exit code on success and non-zero on errors.

    The user function can be synchronous:

    Apify.main(() => {
      // My synchronous function that returns immediately
      console.log('Hello world from actor!');
    });

    If the user function returns a promise, it is considered asynchronous:

    const request = require('request-promise');
    
    Apify.main(() => {
      // My asynchronous function that returns a promise
      return request('http://www.example.com').then((html) => {
        console.log(html);
      });
    });

    To simplify your code, you can take advantage of the async/await keywords:

    const request = require('request-promise');
    
    Apify.main(async () => {
      // My asynchronous function
      const html = await request('http://www.example.com');
      console.log(html);
    });
    Parameters:
    • userFunc ( function ) - User function to be executed. If it returns a promise, the promise will be awaited. The user function is called with no arguments.

getMemoryInfo() → {Promise}

    Returns memory statistics of the process and the system, which is an object with the following properties:

    {
      // Total memory available in the system or container
      totalBytes: Number,
       
      // Amount of free memory in the system or container
      freeBytes: Number,
       
      // Amount of memory used (= totalBytes - freeBytes)
      usedBytes: Number,

      // Amount of memory used by the current Node.js process
      mainProcessBytes: Number,

      // Amount of memory used by child processes of the current Node.js process
      childProcessesBytes: Number,
    }

    If the process runs inside of Docker, the getMemoryInfo gets container memory limits, otherwise it gets system memory limits.

    Beware that the function is quite inefficient because it spawns a new process. Therefore you shouldn't call it too often, like more than once per second.

    Returns:
  • ( Promise ) - Returns a promise that resolves to an object with the memory statistics described above.
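
    For instance, you can use the statistics to log the amount of free memory (a minimal sketch):

    const memoryInfo = await Apify.getMemoryInfo();
    console.log(`Free memory: ${Math.round(memoryInfo.freeBytes / (1024 * 1024))} MB`);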
getValue(key) → {Promise.<Object>}

    Gets a value from the default KeyValueStore associated with the current actor run.

    This is just a convenient shortcut for KeyValueStore.getValue(). For example, calling the following code:

    const input = await Apify.getValue('INPUT');

    is equivalent to:

    const store = await Apify.openKeyValueStore();
    await store.getValue('INPUT');

    To store a value in the default key-value store, you can use the Apify.setValue() function.

    For more information, see Apify.openKeyValueStore() and KeyValueStore.getValue().

    Parameters:
    • key ( String ) - Unique record key.
    See:
    Returns:
  • ( Promise.<Object> ) - Returns a promise that resolves to the value of the record, or null if the record does not exist.
isAtHome() → {Boolean}

    Returns true when code is running on Apify platform and false otherwise (for example locally).

    Returns:
  • ( Boolean )
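
    For example, you might enable Apify Proxy only when the code runs in the cloud (a minimal sketch):

    const launchOptions = Apify.isAtHome() ? { useApifyProxy: true } : {};
    const browser = await Apify.launchPuppeteer(launchOptions);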
isDocker() → {Promise}

    Returns a promise that resolves to true if the code is running in a Docker container, or false otherwise.

    Returns:
  • ( Promise )
launchPuppeteer([opts]) → {Promise}

    Launches headless Chrome using Puppeteer pre-configured to work within the Apify platform. The function has the same arguments and return value as puppeteer.launch(). See the Puppeteer documentation for more details.

    The launchPuppeteer() function alters the following Puppeteer options:

    • Passes the setting from the APIFY_HEADLESS environment variable to the headless option, unless it was already defined by the caller or APIFY_XVFB environment variable is set to 1. Note that Apify Actor cloud platform automatically sets APIFY_HEADLESS=1 to all running actors.
    • Takes the proxyUrl option, checks it and adds it to args as --proxy-server=XXX. If the proxy uses authentication, the function sets up an anonymous local HTTP proxy to make the authenticated proxy work with headless Chrome. For more information, read the blog post about the proxy-chain library.
    • If opts.useApifyProxy is true then the function generates a URL of Apify Proxy based on opts.apifyProxyGroups and opts.apifyProxySession and passes it as opts.proxyUrl.
    • The function adds --no-sandbox to args to enable running headless Chrome in a Docker container on the Apify platform.

    To use this function, you need to have the puppeteer NPM package installed in your project. When running on the Apify cloud platform, you can achieve that simply by using the apify/actor-node-chrome base Docker image for your actor - see Apify Actor documentation for details.

    For an example of usage, see the apify/example-puppeteer actor.

    Parameters:
    • opts ( LaunchPuppeteerOptions ) <optional> - Optional settings passed to puppeteer.launch(). Additionally, the object can contain Apify-specific fields such as proxyUrl, useApifyProxy, apifyProxyGroups and apifyProxySession, as described above.
    Returns:
  • ( Promise ) - Promise object that resolves to Puppeteer's Browser instance.
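
    For example, a minimal sketch that opens a page through Apify Proxy (the session name is illustrative and the example assumes your account has access to the proxy):

    const browser = await Apify.launchPuppeteer({
        useApifyProxy: true,
        apifyProxySession: 'my_session',
    });
    const page = await browser.newPage();
    await page.goto('https://www.example.com');
    console.log(await page.title());
    await browser.close();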
launchWebDriver([opts]) → {Promise}

    Opens a new instance of Chrome web browser controlled by Selenium WebDriver. The result of the function is the new instance of the WebDriver class.

    To use this function, you need to have Google Chrome and ChromeDriver installed in your environment. For example, you can use the apify/actor-node-chrome base Docker image for your actor - see documentation for more details.

    For an example of usage, see the apify/example-selenium actor.

    Parameters:
    • opts ( Object ) <optional> - Optional settings for the WebDriver. The object can contain the following fields:
      • proxyUrl ( String ) <optional> - URL to a proxy server. Currently only http:// scheme is supported. Port number must be specified. For example, http://example.com:1234.
      • headless ( Boolean ) <optional> - Indicates that the browser will be started in headless mode. If the option is not defined, and the APIFY_HEADLESS environment variable has value 1 and APIFY_XVFB is NOT 1, the value defaults to true, otherwise it will be false.
      • userAgent ( String ) <optional> - User-Agent for the browser. If not provided, the function sets it to a reasonable default.
    Returns:
  • ( Promise )
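
    For example, a minimal sketch using the standard Selenium WebDriver API (it assumes Chrome and ChromeDriver are available in the environment):

    const driver = await Apify.launchWebDriver();
    await driver.get('https://www.example.com');
    console.log(`Page title: ${await driver.getTitle()}`);
    await driver.quit();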
openDataset([datasetIdOrName]) → {Promise.<Dataset>}

    Opens a dataset and returns a promise resolving to an instance of the Dataset class.

    Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The actual data is stored either on the local filesystem or in the cloud.

    For more details and code examples, see the Dataset class.

    Parameters:
    • datasetIdOrName ( string ) <optional> - ID or name of the dataset to be opened. If null or undefined, the function returns the default dataset associated with the actor run.
    Returns:
  • ( Promise.<Dataset> ) - Returns a promise that resolves to an instance of the Dataset class.
openKeyValueStore([storeIdOrName]) → {Promise.<KeyValueStore>}

    Opens a key-value store and returns a promise resolving to an instance of the KeyValueStore class.

    Key-value stores are used to store records or files, along with their MIME content type. The records are stored and retrieved using a unique key. The actual data is stored either on a local filesystem or in the Apify cloud.

    For more details and code examples, see the KeyValueStore class.

    Parameters:
    • storeIdOrName ( string ) <optional> - ID or name of the key-value store to be opened. If null or undefined, the function returns the default key-value store associated with the actor run.
    Returns:
  • ( Promise.<KeyValueStore> ) - Returns a promise that resolves to an instance of the KeyValueStore class.
pushData(data) → {Promise}

    Stores an object or an array of objects to the default Dataset of the current actor run.

    This is just a convenient shortcut for Dataset.pushData(). For example, calling the following code:

    await Apify.pushData({ myValue: 123 });

    is equivalent to:

    const dataset = await Apify.openDataset();
    await dataset.pushData({ myValue: 123 });

    For more information, see Apify.openDataset() and Dataset.pushData()

    IMPORTANT: Make sure to use the await keyword when calling pushData(), otherwise the actor process might finish before the data is stored!

    Parameters:
    • data ( Object | Array ) - Object or array of objects containing data to be stored in the default dataset. The objects must be serializable to JSON and the JSON representation of each object must be smaller than 9MB.
    See:
    Returns:
  • ( Promise ) - Returns a promise that resolves once the data is saved.
setValue(key, value, [options]) → {Promise}

    Stores or deletes a value in the default KeyValueStore associated with the current actor run.

    This is just a convenient shortcut for KeyValueStore.setValue(). For example, calling the following code:

    await Apify.setValue('OUTPUT', { foo: "bar" });

    is equivalent to:

    const store = await Apify.openKeyValueStore();
    await store.setValue('OUTPUT', { foo: "bar" });

    To get a value from the default key-value store, you can use the Apify.getValue() function.

    For more information, see Apify.openKeyValueStore() and KeyValueStore.setValue().

    Parameters:
    • key ( String ) - Unique record key.
    • value ( Object | String | Buffer ) - Record data, which can be one of the following values:
      • If null, the record in the key-value store is deleted.
      • If no options.contentType is specified, value can be any JavaScript object and it will be stringified to JSON.
      • If options.contentType is specified, value is considered raw data and it must be a String or Buffer.
      For any other value an error will be thrown.
    • options ( Object ) <optional>
      • contentType ( String ) <optional> - Specifies a custom MIME content type of the record.
    See:
    Returns:
  • ( Promise ) - Returns a promise that resolves once the value is stored or deleted.
AutoscaledPool

    Manages a pool of asynchronous resource-intensive tasks that are executed in parallel. The pool only starts new tasks if there is enough free CPU and memory available and the JavaScript event loop is not blocked.

    The information about the CPU and memory usage is obtained by the Snapshotter class, which makes regular snapshots of system resources that may be either local or from the Apify cloud infrastructure in case the process is running on the Apify platform. Meaningful data gathered from these snapshots is provided to AutoscaledPool by the SystemStatus class.

    Before running the pool, you need to implement the following three functions: runTaskFunction(), isTaskReadyFunction() and isFinishedFunction().

    The auto-scaled pool is started by calling the run() function. The pool periodically queries the isTaskReadyFunction() function for more tasks, managing optimal concurrency, until the function resolves to false. The pool then queries the isFinishedFunction(). If it resolves to true, the run finishes. If it resolves to false, it assumes there will be more tasks available later and keeps querying for tasks, until finally both the isTaskReadyFunction() and isFinishedFunction() functions resolve to true. If any of the tasks throws then the run() function rejects the promise with an error.

    The pool evaluates whether it should start a new task every time one of the tasks finishes and also in the interval set by the options.maybeRunIntervalSecs parameter.

    Example usage:

    const pool = new Apify.AutoscaledPool({
        maxConcurrency: 50,
        runTaskFunction: async () => {
            // Run some resource-intensive asynchronous operation here.
        },
        isTaskReadyFunction: async () => {
            // Tell the pool whether more tasks are ready to be processed. (true / false)
        },
        isFinishedFunction: async () => {
            // Tell the pool whether it should finish or wait for more tasks to become available. (true / false)
        }
    });
    
    await pool.run();

    Constructor

    new AutoscaledPool(options)

    Parameters:
    • options ( Object )
      • runTaskFunction ( function ) - A function that performs an asynchronous resource-intensive task. The function must either be labeled async or return a promise.
      • isTaskReadyFunction ( function ) - A function that indicates whether runTaskFunction should be called. This function is called every time there is free capacity for a new task and it should indicate whether a new task should start by resolving to either true or false. Besides its obvious use, it is also useful for task throttling to save resources.
      • isFinishedFunction ( function ) - A function that is called only when there are no tasks to be processed. If it resolves to true then the pool's run finishes. Being called only when there are no tasks being processed means that as long as isTaskReadyFunction() keeps resolving to true, isFinishedFunction() will never be called. To abort a run, use the pool.abort() method.
      • minConcurrency ( Number ) <optional> - Minimum number of tasks running in parallel. Defaults to 1.
      • maxConcurrency ( Number ) <optional> - Maximum number of tasks running in parallel. Defaults to 1000.
      • desiredConcurrencyRatio ( Number ) <optional> - Minimum level of desired concurrency to reach before more scaling up is allowed. Defaults to 0.95.
      • scaleUpStepRatio ( Number ) <optional> - Defines the fractional amount of desired concurrency to be added with each scaling up. The minimum scaling step is one. Defaults to 0.05.
      • scaleDownStepRatio ( Number ) <optional> - Defines the amount of desired concurrency to be subtracted with each scaling down. The minimum scaling step is one. Defaults to 0.05.
      • maybeRunIntervalSecs ( Number ) <optional> - Indicates how often the pool should call the runTaskFunction() to start a new task, in seconds. This has no effect on starting new tasks immediately after a task completes. Defaults to 0.5.
      • loggingIntervalSecs ( Number ) <optional> - Specifies a period in which the instance logs its state, in seconds. Set to null to disable periodic logging. Defaults to 60.
      • autoscaleIntervalSecs ( Number ) <optional> - Defines in seconds how often the pool should attempt to adjust the desired concurrency based on the latest system status. Setting it lower than 1 might have a severe impact on performance. We suggest using a value from 5 to 20. Defaults to 10.
      • snapshotterOptions ( Object ) <optional> - Options to be passed down to the Snapshotter constructor. This is useful for fine-tuning the snapshot intervals and history. See Snapshotter source code for more details.
      • systemStatusOptions ( Object ) <optional> - Options to be passed down to the SystemStatus constructor. This is useful for fine-tuning the system status reports. If a custom snapshotter is set in the options, it will be used by the pool. See SystemStatus source code for more details.

    Methods (2)

    abort() → {Promise}

    Aborts the run of the auto-scaled pool, discards all currently running tasks and destroys it.

    Returns:
  • ( Promise )
  • run() → {Promise}

    Runs the auto-scaled pool. Returns a promise that gets resolved or rejected once all the tasks are finished or one of them fails.

    Returns:
  • ( Promise )
  • BasicCrawler

    Provides a simple framework for the parallel crawling of web pages, whose URLs are fed either from a static list or from a dynamic queue of URLs.

    BasicCrawler invokes the user-provided handleRequestFunction for each Request object, which corresponds to a single URL to crawl. The Request objects are fed from the RequestList or RequestQueue instances provided by the requestList or requestQueue constructor options, respectively.

    If both requestList and requestQueue are used, the instance first processes URLs from the RequestList and automatically enqueues all of them to the RequestQueue before it starts their processing. This ensures that a single URL is not crawled multiple times.

    The crawler finishes if there are no more Request objects to crawl.

    New requests are only launched if there is enough free CPU and memory available, using the functionality provided by the AutoscaledPool class. All AutoscaledPool configuration options can be passed to the autoscaledPoolOptions parameter of the BasicCrawler constructor. For user convenience, the minConcurrency and maxConcurrency options are available directly in the constructor.

    Example usage:

    const rp = require('request-promise');
    
    // Prepare a list of URLs to crawl
    const requestList = new Apify.RequestList({
      sources: [
          { url: 'http://www.example.com/page-1' },
          { url: 'http://www.example.com/page-2' },
      ],
    });
    await requestList.initialize();
    
    // Crawl the URLs
    const crawler = new Apify.BasicCrawler({
        requestList,
        handleRequestFunction: async ({ request }) => {
            // 'request' contains an instance of the Request class
            // Here we simply fetch the HTML of the page and store it to a dataset
            await Apify.pushData({
                url: request.url,
                html: await rp(request.url),
            })
        },
    });
    
    await crawler.run();

    Constructor

    new BasicCrawler(options)

    Parameters:
    • options ( Object )
      • handleRequestFunction ( function ) - User-provided function that performs the logic of the crawler. It is called for each URL to crawl.

        The function receives an object as an argument, with the following field:

        • request: the Request object representing the URL to crawl

        The function must return a promise.

      • requestList ( RequestList ) - Static list of URLs to be processed. Either RequestList or RequestQueue must be provided.
      • requestQueue ( RequestQueue ) - Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. Either RequestList or RequestQueue must be provided.
      • handleFailedRequestFunction ( function ) <optional> - Function that handles requests that failed more than options.maxRequestRetries times. See source code on GitHub for default behavior.
      • maxRequestRetries ( Number ) <optional> - How many times the request is retried if handleRequestFunction fails. Defaults to 3.
      • maxRequestsPerCrawl ( Number ) <optional> - Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
      • autoscaledPoolOptions ( Object ) <optional> - Custom options passed to the underlying AutoscaledPool instance constructor. Note that the runTaskFunction, isTaskReadyFunction and isFinishedFunction options are provided by BasicCrawler and cannot be overridden.
      • minConcurrency ( Number ) <optional> - Sets the minimum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option. Defaults to 1.
      • maxConcurrency ( Number ) <optional> - Sets the maximum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option. Defaults to 1000.

    Methods (2)

    abort() → {Promise}

    Aborts the crawler by preventing additional requests and terminating the running ones.

    Returns:
  • ( Promise )
  • run() → {Promise}

    Runs the crawler. Returns a promise that gets resolved once all the requests are processed.

    Returns:
  • ( Promise )
  • CheerioCrawler

    Provides a framework for the parallel crawling of web pages using plain HTTP requests and cheerio HTML parser.

    CheerioCrawler downloads each URL using a plain HTTP request, parses the HTML content using cheerio and then invokes the user-provided handlePageFunction to extract page data using a jQuery-like interface to parsed HTML DOM.

    The source URLs are represented using Request objects that are fed from the RequestList or RequestQueue instances provided by the requestList or requestQueue constructor options, respectively.

    If both requestList and requestQueue are used, the instance first processes URLs from the RequestList and automatically enqueues all of them to the RequestQueue before it starts their processing. This ensures that a single URL is not crawled multiple times.

    The crawler finishes if there are no more Request objects to crawl.

    By default, CheerioCrawler downloads HTML using the request-promise NPM package. You can override this behavior by setting the requestFunction option.

    New requests are only started if there is enough free CPU and memory available, using the functionality provided by the AutoscaledPool class. All AutoscaledPool configuration options can be passed to the autoscaledPoolOptions parameter of the CheerioCrawler constructor. For user convenience, the minConcurrency and maxConcurrency options are available directly.

    Example usage:

    // Prepare a list of URLs to crawl
    const requestList = new Apify.RequestList({
      sources: [
          { url: 'http://www.example.com/page-1' },
          { url: 'http://www.example.com/page-2' },
      ],
    });
    await requestList.initialize();
    
    // Crawl the URLs
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, html, request }) => {
            const data = [];

            // Do some data extraction from the page with Cheerio.
            $('.some-collection').each((index, el) => {
                data.push({ title: $(el).find('.some-title').text() });
            });

            // Save the data to dataset.
            await Apify.pushData({
                url: request.url,
                html,
                data,
            });
        },
    });
    
    await crawler.run();

    Constructor

    new CheerioCrawler(options)

    Parameters:
    • options ( Object )
      • handlePageFunction ( function ) - User-provided function that performs the logic of the crawler. It is called for each page loaded and parsed by the crawler.

        The function receives an object as an argument, with the following three fields:

        • $: the Cheerio object
        • html: the raw HTML
        • request: the Request object representing the URL to crawl

        If the function returns a promise, it is awaited.

      • requestList ( RequestList ) - Static list of URLs to be processed. Either RequestList or RequestQueue must be provided.
      • requestQueue ( RequestQueue ) - Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. Either RequestList or RequestQueue must be provided.
      • requestFunction ( function ) <optional> - Overrides the function that performs the HTTP request to get the raw HTML needed for Cheerio. See source code on GitHub for default behavior.
      • handlePageTimeoutSecs ( Number ) <optional> - Timeout in which the function passed as options.handlePageFunction needs to finish, given in seconds. Defaults to 300.
      • requestTimeoutSecs ( Number ) <optional> - Timeout in which the function passed as options.requestFunction needs to finish, given in seconds. Defaults to 30.
      • ignoreSslErrors ( Boolean ) <optional> - If set to true, SSL certificate errors will be ignored. This only applies to the default request function; if a custom request function is used, the user needs to implement this functionality themselves. Defaults to false.
      • handleFailedRequestFunction ( function ) <optional> - Function that handles requests that failed more than options.maxRequestRetries times. See source code on GitHub for default behavior.
      • maxRequestRetries ( Number ) <optional> - How many times the request is retried if either requestFunction or handlePageFunction fails. Defaults to 3.
      • maxRequestsPerCrawl ( Number ) <optional> - Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
      • autoscaledPoolOptions ( Object ) <optional> - Custom options passed to the underlying AutoscaledPool instance constructor. Note that the runTaskFunction, isTaskReadyFunction and isFinishedFunction options are provided by CheerioCrawler and cannot be overridden.
      • minConcurrency ( Number ) <optional> - Sets the minimum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option. Defaults to 1.
      • maxConcurrency ( Number ) <optional> - Sets the maximum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option. Defaults to 1000.

    Methods (2)

    abort() → {Promise}

    Aborts the crawler by preventing crawls of additional pages and terminating the running ones.

    Returns:
  • ( Promise )
  • run() → {Promise}

    Runs the crawler. Returns a promise that resolves once all the requests are processed.

    Returns:
  • ( Promise )
  • Dataset

    The Dataset class represents a store for structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine it as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - you can only add new records to it but you cannot modify or remove existing records. Typically it is used to store crawling results.

    Do not instantiate this class directly, use the Apify.openDataset() function instead.

    Dataset stores its data either on local disk or in the Apify cloud, depending on whether the APIFY_LOCAL_STORAGE_DIR or APIFY_TOKEN environment variable is set.

    If the APIFY_LOCAL_STORAGE_DIR environment variable is set, the data is stored in the local directory in the following files:

    [APIFY_LOCAL_STORAGE_DIR]/datasets/[DATASET_ID]/[INDEX].json

    Note that [DATASET_ID] is the name or ID of the dataset. The default dataset has ID default, unless you override it by setting the APIFY_DEFAULT_DATASET_ID environment variable. Each dataset item is stored as a separate JSON file, where [INDEX] is a zero-based index of the item in the dataset.

    If the APIFY_TOKEN environment variable is provided instead, the data is stored in the Apify Dataset cloud storage.

    Example usage:

    // Write a single row to the default dataset
    await Apify.pushData({ col1: 123, col2: 'val2' });
    
    // Open a named dataset
    const dataset = await Apify.openDataset('some-name');
    
    // Write a single row
    await dataset.pushData({ foo: 'bar' });
    
    // Write multiple rows
    await dataset.pushData([
      { foo: 'bar2', col2: 'val2' },
      { col3: 123 },
    ]);

    Constructor

    new Dataset()

    Methods (7)

    delete() → {Promise}

    Removes the dataset either from the Apify cloud storage or from the local directory, depending on the mode of operation.

    Returns:
  • ( Promise )
  • forEach(iteratee, opts, index) → {Promise.<undefined>}

    Iterates over dataset items, yielding each in turn to an iteratee function. Each invocation of iteratee is called with two arguments: (element, index).

    If iteratee returns a Promise then it is awaited before the next call.
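
    Example usage (a minimal sketch; the dataset name 'my-results' is only an illustrative placeholder):

    const dataset = await Apify.openDataset('my-results');

    // Print every item together with its index.
    await dataset.forEach(async (item, index) => {
        console.log(`Item ${index}:`, item);
    }, { offset: 0 });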

    Parameters:
    • iteratee ( function ) -
    • opts ( Object ) -
      • offset ( Number ) <optional> - Number of array elements that should be skipped at the start. Defaults to 0.
      • desc ( Number ) <optional> - If 1 then the objects are sorted by createdAt in descending order.
      • fields ( Array ) <optional> - If provided then returned objects will only contain specified keys.
      • unwind ( String ) <optional> - If provided then objects will be unwound based on provided field.
      • limit ( Number ) <optional> - How many items to load in one request. Defaults to 250000.
    • index ( Number )
    Returns:
  • ( Promise.<undefined> )
  • getData(options) → {Promise}

    Returns items in the dataset based on the provided parameters.

    If format is json then the function doesn't return an array of records but a PaginationList instead.
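
    Example usage (a minimal sketch; the items property of the returned PaginationList is assumed here to hold the array of records):

    const dataset = await Apify.openDataset();

    // Fetch the first 100 items of the default dataset in the default json format.
    const result = await dataset.getData({ offset: 0, limit: 100 });
    console.log(result.items);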

    Parameters:
    • options ( Object )
      • format ( String ) <optional> - Format of the items, possible values are: json, csv, xlsx, html, xml and rss. Defaults to 'json'.
      • offset ( Number ) <optional> - Number of array elements that should be skipped at the start. Defaults to 0.
      • limit ( Number ) <optional> - Maximum number of array elements to return. Defaults to 250000.
      • desc ( Boolean ) <optional> - If true then the objects are sorted by createdAt in descending order. Otherwise they are sorted in ascending order.
      • fields ( Array ) <optional> - An array of field names that will be included in the result. If omitted, all fields are included in the results.
      • unwind ( String ) <optional> - Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are.
      • disableBodyParser ( Boolean ) <optional> - If true then response from API will not be parsed.
      • attachment ( Boolean ) <optional> - If true then the response will define the Content-Disposition: attachment HTTP header, forcing a web browser to download the file rather than to display it. By default, this header is not present.
      • delimiter ( String ) <optional> - A delimiter character for CSV files, only used if format is csv. You might need to URL-encode the character (e.g. use %09 for tab or %3B for semicolon). Defaults to ','.
      • bom ( Boolean ) <optional> - All responses are encoded in UTF-8 encoding. By default, the CSV files are prefixed with the UTF-8 Byte Order Mark (BOM), while JSON, JSONL, XML, HTML and RSS files are not. If you want to override this default behavior, set the bom option to true to include the BOM, or set bom to false to skip it.
      • xmlRoot ( String ) <optional> - Overrides the default root element name of the XML output. By default, the root element is results.
      • xmlRow ( String ) <optional> - Overrides the default element name that wraps each page or page function result object in XML output. By default, the element name is page or result, depending on the value of the simplified option.
      • skipHeaderRow ( Number ) <optional> - If set to 1 then the header row is skipped in the csv format.
    Returns:
  • ( Promise )
  • getInfo(opts) → {Promise}

    Returns an object containing general information about the dataset.

    Parameters:
    • opts ( Object )
    Returns:
  • ( Promise )
  • Example
    {
      "id": "WkzbQMuFYuamGv3YF",
      "name": "d7b9MDYsbtX5L7XAj",
      "userId": "wRsJZtadYvn4mBZmm",
      "createdAt": "2015-12-12T07:34:14.202Z",
      "modifiedAt": "2015-12-13T08:36:13.202Z",
      "accessedAt": "2015-12-14T08:36:13.202Z",
      "itemsCount": 0
    }

    map(iteratee, opts, index) → {Promise.<Array>}

    Produces a new array of values by mapping each value in the list through a transformation function (iteratee). Each invocation of iteratee is called with two arguments: (element, index).

    If iteratee returns a Promise then it's awaited before the next call.
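
    Example usage (a minimal sketch; the url field of the dataset items is only an illustrative placeholder):

    const dataset = await Apify.openDataset('my-results');

    // Collect the URL of every item stored in the dataset.
    const urls = await dataset.map((item, index) => item.url, { fields: ['url'] });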

    Parameters:
    • iteratee ( function ) -
    • opts ( Object ) -
      • offset ( Number ) <optional> - Number of array elements that should be skipped at the start. Defaults to 0.
      • desc ( Number ) <optional> - If 1 then the objects are sorted by createdAt in descending order.
      • fields ( Array ) <optional> - If provided then returned objects will only contain specified keys.
      • unwind ( String ) <optional> - If provided then objects will be unwound based on provided field.
      • limit ( Number ) <optional> - How many items to load in one request. Defaults to 250000.
    • index ( Number )
    Returns:
  • ( Promise.<Array> )
  • pushData(data) → {Promise}

    Stores an object or an array of objects to the dataset. The function returns a promise that resolves when the operation finishes. It has no result, but throws on invalid args or other errors.

    IMPORTANT: Make sure to use the await keyword when calling pushData(), otherwise the actor process might finish before the data is stored!

    The size of the data is limited by the receiving API and therefore pushData will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size.

    The function internally chunks the array into separate items and pushes them sequentially. The chunking process is stable (keeps order of data), but it does not provide a transaction safety mechanism. Therefore, in the event of an uploading error (after several automatic retries), the function's promise will reject and the dataset will be left in a state where some of the items have already been saved to the dataset while other items from the source array were not. To overcome this limitation, the developer may, for example, read the last item saved in the dataset and re-attempt the save of the data from this item onwards to prevent duplicates.
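
    For example, the following sketch shows one possible recovery strategy. It assumes that items is the original array being pushed and that the dataset was empty when the call started, so that getInfo().itemsCount equals the number of items already saved:

    const dataset = await Apify.openDataset();

    try {
        await dataset.pushData(items);
    } catch (err) {
        // Some of the items might already have been stored. Find out how many
        // and retry only the remainder to avoid duplicates.
        const { itemsCount } = await dataset.getInfo();
        await dataset.pushData(items.slice(itemsCount));
    }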

    Parameters:
    • data ( Object | Array ) - Object or array of objects containing data to be stored in the default dataset. The objects must be serializable to JSON and the JSON representation of each object must be smaller than 9MB.
    Returns:
  • ( Promise ) - Returns a promise that resolves once the data is saved.
  • reduce(iteratee, memo, opts, index) → {Promise.<*>}

    Boils down a list of values into a single value.

    Memo is the initial state of the reduction, and each successive step of it should be returned by iteratee. The iteratee is passed three arguments: the memo, then the value and index of the iteration.

    If no memo is passed to the initial invocation of reduce, the iteratee is not invoked on the first element of the list. The first element is instead passed as the memo in the invocation of the iteratee on the next element in the list.

    If iteratee returns a Promise then it's awaited before the next call.
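
    Example usage (a minimal sketch; the numeric price field of the dataset items is only an illustrative placeholder):

    const dataset = await Apify.openDataset('my-results');

    // Sum the price field across all items, starting from a memo of 0.
    const totalPrice = await dataset.reduce((memo, item, index) => memo + item.price, 0, {});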

    Parameters:
    • iteratee ( function ) -
    • memo ( * ) -
    • opts ( Object ) -
      • offset ( Number ) <optional> - Number of array elements that should be skipped at the start. Defaults to 0.
      • desc ( Number ) <optional> - If 1 then the objects are sorted by createdAt in descending order.
      • fields ( Array ) <optional> - If provided then returned objects will only contain specified keys.
      • unwind ( String ) <optional> - If provided then objects will be unwound based on provided field.
      • limit ( Number ) <optional> - How many items to load in one request. Defaults to 250000.
    • index ( Number )
    Returns:
  • ( Promise.<*> )
  • KeyValueStore

    The KeyValueStore class represents a key-value store, a simple data storage that is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots, actor inputs and outputs, web pages, PDFs or to persist the state of crawlers.

    Do not instantiate this class directly, use the Apify.openKeyValueStore() function instead.

    Each actor run is associated with a default key-value store, which is created exclusively for the run. By convention, the actor input and output are stored into the default key-value store under the INPUT and OUTPUT key, respectively. Typically, input and output are JSON files, although it can be any other format. To access the default key-value store directly, you can use the Apify.getValue() and Apify.setValue() convenience functions.

    KeyValueStore stores its data either on local disk or in the Apify cloud, depending on whether the APIFY_LOCAL_STORAGE_DIR or APIFY_TOKEN environment variable is set.

    If the APIFY_LOCAL_STORAGE_DIR environment variable is set, the data is stored in the local directory in the following files:

    [APIFY_LOCAL_STORAGE_DIR]/key_value_stores/[STORE_ID]/[KEY].[EXT]

    Note that [STORE_ID] is the name or ID of the key-value store. The default key-value store has ID default, unless you override it by setting the APIFY_DEFAULT_KEY_VALUE_STORE_ID environment variable. The [KEY] is the key of the record and [EXT] corresponds to the MIME content type of the data value.

    If the APIFY_TOKEN environment variable is provided instead, the data is stored in the Apify Key-value store cloud storage.

    Example usage:

    // Get actor input from the default key-value store
    const input = await Apify.getValue('INPUT');
    
    // Write actor output to the default key-value store.
    await Apify.setValue('OUTPUT', { myResult: 123 });
    
    // Open a named key-value store
    const store = await Apify.openKeyValueStore('some-name');
    
    // Write a record. JavaScript object is automatically converted to JSON,
    // strings and binary buffers are stored as they are
    await store.setValue('some-key', { foo: 'bar' });
    
    // Read a record. Note that JSON is automatically parsed to a JavaScript object,
    // text data returned as a string and other data is returned as binary buffer
    const value = await store.getValue('some-key');
    // Delete record
    await store.delete('some-key');

    Constructor

    new KeyValueStore()

    See:
    • Apify.getValue()
    • Apify.setValue()

    Methods (3)

    delete() → {Promise}

    Removes the key-value store either from the Apify cloud storage or from the local directory, depending on the mode of operation.

    Returns:
  • ( Promise )
  • getValue(key) → {Promise.<Object>}

    Gets a value from the key-value store.

    The function returns a promise that resolves to the record value, whose JavaScript type depends on the MIME content type of the record. Records with the application/json content type are automatically parsed and returned as a JavaScript object. Similarly, records with text/plain content types are returned as a string. For all other content types, the value is returned as a raw Buffer instance.

    If the record does not exist, the function resolves to null.

    To save or delete a value in the key-value store, use the KeyValueStore.setValue() function.

    Example usage:

    const store = await Apify.openKeyValueStore('my-screenshots');
    const buffer = await store.getValue('screenshot1.png');
    Parameters:
    • key ( String ) - Key of the record.
    Returns:
  • ( Promise.<Object> ) - Returns a promise that resolves to an object, string or Buffer, depending on the MIME content type of the record.
  • setValue(key, value, [options]) → {Promise}

    Saves or deletes a record in the key-value store. The function returns a promise that resolves once the record has been saved or deleted.

    Example usage:

    const store = await Apify.openKeyValueStore('my-store');
    await store.setValue('OUTPUT', { myResult: 123 });

    By default, value is converted to JSON and stored with the application/json; charset=utf-8 MIME content type. To store the value with another content type, pass it in the options as follows:

    const store = await Apify.openKeyValueStore('my-store');
    await store.setValue('RESULTS', 'my text data', { contentType: 'text/plain' });

    If you set custom content type, value must be either a string or Buffer, otherwise an error will be thrown.

    If value is null, the record is deleted instead. Note that the setValue() function succeeds regardless of whether the record existed or not.

    To retrieve a value from the key-value store, use the KeyValueStore.getValue() function.

    IMPORTANT: Always make sure to use the await keyword when calling setValue(), otherwise the actor process might finish before the value is stored!

    Parameters:
    • key ( String ) - Unique record key.
    • value ( Object | String | Buffer ) - Record data, which can be one of the following values:
      • If null, the record in the key-value store is deleted.
      • If no options.contentType is specified, value can be any JavaScript object and it will be stringified to JSON.
      • If options.contentType is specified, value is considered raw data and it must be a String or Buffer.
      For any other value an error will be thrown.
    • options ( Object ) <optional>
      • contentType ( String ) <optional> - Specifies a custom MIME content type of the record.
    Returns:
  • ( Promise ) - Returns a promise that resolves once the value is stored or deleted.
  • PseudoUrl

    Represents a pseudo URL (PURL) - a URL pattern used by web crawlers to specify which URLs the crawler should visit. This class is used by the Apify.utils.puppeteer.enqueueLinks() function.

    A PURL is simply a URL with special directives enclosed in [] brackets. Currently, the only supported directive is [regexp], which defines a JavaScript-style regular expression to match against the URL.

    For example, a PURL http://www.example.com/pages/[(\w|-)*] will match all of the following URLs:

    • http://www.example.com/pages/
    • http://www.example.com/pages/my-awesome-page
    • http://www.example.com/pages/something

    If either [ or ] is part of the normal query string, it must be encoded as [\x5B] or [\x5D], respectively. For example, the following PURL:

    http://www.example.com/search?do[\x5B]load[\x5D]=1

    will match the URL:

    http://www.example.com/search?do[load]=1

    Example usage:

    const purl = new Apify.PseudoUrl('http://www.example.com/pages/[(\\w|-)*]');
    
    if (purl.matches('http://www.example.com/pages/my-awesome-page')) console.log('Match!');

    Constructor

    new PseudoUrl(purl, requestTemplate)

    Parameters:
    • purl ( String ) - Pseudo URL.
    • requestTemplate ( Object ) - Options for the new Request instances created for matching URLs.

    Methods (2)

    createRequest(url) → {Request}

    Creates a Request object from requestTemplate and given URL.
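
    Example usage (a minimal sketch; it assumes that requestTemplate accepts the same options as the Request constructor, such as userData):

    const purl = new Apify.PseudoUrl(
        'http://www.example.com/pages/[(\\w|-)*]',
        { userData: { label: 'PAGE' } },
    );
    const request = purl.createRequest('http://www.example.com/pages/my-awesome-page');

    // request.url === 'http://www.example.com/pages/my-awesome-page'
    // request.userData.label === 'PAGE'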

    Parameters:
    • url ( String )
    Returns:
  • ( Request )
  • matches(url) → {Boolean}

    Determines whether a URL matches this pseudo-URL pattern.

    Parameters:
    • url ( String ) - URL to be matched.
    Returns:
  • ( Boolean ) - Returns true if given URL matches pseudo URL.
  • PuppeteerCrawler

    Provides a simple framework for parallel crawling of web pages using headless Chrome with Puppeteer. The URLs of pages to visit are given by Request objects that are fed from a list (see RequestList class) or from a dynamic queue (see RequestQueue class).

    PuppeteerCrawler opens a new Chrome page (i.e. tab) for each Request object to crawl and then calls the function provided by user as the handlePageFunction option. New tasks are only started if there is enough free CPU and memory available, using the AutoscaledPool class internally.

    Example usage:

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            // This function is called to extract data from a single web page
            // 'page' is an instance of Puppeteer.Page with page.goto(request.url) already called
            // 'request' is an instance of Request class with information about the page to load
            await Apify.pushData({
                title: await page.title(),
                url: request.url,
                succeeded: true,
            })
        },
        handleFailedRequestFunction: async ({ request }) => {
            // This function is called when crawling of a request failed too many times
            await Apify.pushData({
                url: request.url,
                succeeded: false,
                errors: request.errorMessages,
            })
        },
    });
    
    await crawler.run();

    Constructor

    new PuppeteerCrawler(options)

    Parameters:
    • options ( Object )
      • handlePageFunction ( function ) - Function that is called to process each request. It is passed an object with the following fields:

        • request: an instance of the Request object with details about the URL to open, HTTP method etc.
        • page: an instance of the Puppeteer.Page class with page.goto(request.url) already called.

      • requestList ( RequestList ) - List of the requests to be processed. Either RequestList or RequestQueue must be provided. See the requestList parameter of BasicCrawler for more details.
      • requestQueue ( RequestQueue ) - Queue of the requests to be processed. Either RequestList or RequestQueue must be provided. See the requestQueue parameter of BasicCrawler for more details.
      • handlePageTimeoutSecs ( Number ) <optional> - Timeout in which the function passed as options.handlePageFunction needs to finish, in seconds. Defaults to 300.
      • gotoFunction ( function ) <optional> - Overrides the function that opens the request in Puppeteer. The function should return a result of Puppeteer's page.goto() function, i.e. a promise resolving to the Response object.

        For example, this is useful if you need to extend the page load timeout or select different criteria to determine that the navigation succeeded.

        Note that a single page object is only used to process a single request and it is closed afterwards.

        See source code on GitHub for default behavior.

      • handleFailedRequestFunction ( function ) <optional> - Function to handle requests that failed more than option.maxRequestRetries times. See the handleFailedRequestFunction parameter of Apify.BasicCrawler for details. See source code on GitHub for default behavior.
      • maxRequestRetries ( Number ) <optional> - Indicates how many times each request is retried if handlePageFunction fails. See the maxRequestRetries parameter of BasicCrawler. Defaults to 3.
      • maxRequestsPerCrawl ( Number ) <optional> - Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. See maxRequestsPerCrawl parameter of BasicCrawler.
      • maxOpenPagesPerInstance ( Number ) <optional> - Maximum number of opened tabs per browser. If this limit is reached then a new browser instance is started. See maxOpenPagesPerInstance parameter of PuppeteerPool. Defaults to 50.
      • retireInstanceAfterRequestCount ( Number ) <optional> - Maximum number of requests that can be processed by a single browser instance. After the limit is reached the browser will be retired and new requests will be handled by a new browser instance. See retireInstanceAfterRequestCount parameter of PuppeteerPool. Defaults to 100.
      • instanceKillerIntervalMillis ( Number ) <optional> - How often the launched Puppeteer instances are checked whether they can be closed. See instanceKillerIntervalMillis parameter of PuppeteerPool. Defaults to 60000.
      • killInstanceAfterMillis ( Number ) <optional> - If Puppeteer instance reaches the options.retireInstanceAfterRequestCount limit then it is considered retired and no more tabs will be opened. After the last tab is closed the whole browser is closed too. This parameter defines a time limit for inactivity after which the browser is closed even if there are pending tabs. See killInstanceAfterMillis parameter of PuppeteerPool. Defaults to 300000.
      • launchPuppeteerFunction ( function ) <optional> - Overrides the default function to launch a new Puppeteer instance. See launchPuppeteerFunction parameter of PuppeteerPool. See source code on GitHub for default behavior.
      • launchPuppeteerOptions ( LaunchPuppeteerOptions ) <optional> - Options used by Apify.launchPuppeteer() to start new Puppeteer instances. See launchPuppeteerOptions parameter of PuppeteerPool.
      • autoscaledPoolOptions ( Object ) <optional> - Custom options passed to the underlying AutoscaledPool instance constructor. Note that the runTaskFunction, isTaskReadyFunction and isFinishedFunction options are provided by PuppeteerCrawler and should not be overridden.
      • minConcurrency ( Number ) <optional> - Sets the minimum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option. Defaults to 1.
      • maxConcurrency ( Number ) <optional> - Sets the maximum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option. Defaults to 1000.

    Methods (2)

    abort() → {Promise}

    Stops the crawler by preventing crawls of additional pages and terminating the running ones.

    Returns:
  • ( Promise )
  • run() → {Promise}

    Runs the crawler. Returns a promise that resolves once all the requests are processed.

    Returns:
  • ( Promise )
  • PuppeteerPool

    Manages a pool of Chrome browser instances controlled using Puppeteer. PuppeteerPool reuses Chrome instances and tabs using specific browser rotation and retirement policies. This is useful to facilitate rotation of proxies, cookies or other settings in order to prevent detection of your web scraping bot, to access web pages from various countries, etc. Additionally, the reuse of browser instances speeds up crawling, and the retirement of instances helps mitigate effects of memory leaks in Chrome.

    PuppeteerPool is internally used by the PuppeteerCrawler class.

    Example usage:

    const puppeteerPool = new PuppeteerPool({
      launchPuppeteerFunction: () => {
        // Use a new proxy with a new IP address for each new Chrome instance
        return Apify.launchPuppeteer({
           apifyProxySession: Math.random(),
        });
      },
    });
    
    const page1 = await puppeteerPool.newPage();
    const page2 = await puppeteerPool.newPage();
    const page3 = await puppeteerPool.newPage();
    
    // ... do something with the pages ...
    
    // Close all browsers.
    await puppeteerPool.destroy();

    Constructor

    new PuppeteerPool(options)

    Parameters:
    • options.maxOpenPagesPerInstance ( Number ) <optional> - Maximum number of open pages (i.e. tabs) per browser. When this limit is reached, new pages are loaded in a new browser instance. Defaults to 50.
    • options.retireInstanceAfterRequestCount ( Number ) <optional> - Maximum number of requests that can be processed by a single browser instance. After the limit is reached, the browser is retired and new requests are handled by a new browser instance. Defaults to 100.
    • options.instanceKillerIntervalMillis ( Number ) <optional> - Indicates how often opened Puppeteer instances are checked whether they can be closed. Defaults to 60000.
    • options.killInstanceAfterMillis ( Number ) <optional> - When a Puppeteer instance reaches the options.retireInstanceAfterRequestCount limit then it is considered retired and no more tabs will be opened. After the last tab is closed the whole browser is closed too. This parameter defines a time limit for inactivity after which the browser is closed, even if there are pending open tabs. Defaults to 300000.
    • options.launchPuppeteerFunction ( function ) <optional> - Overrides the default function to launch a new Puppeteer instance. Defaults to launchPuppeteerOptions => Apify.launchPuppeteer(launchPuppeteerOptions).
    • options.launchPuppeteerOptions ( LaunchPuppeteerOptions ) <optional> - Options used by Apify.launchPuppeteer() to start new Puppeteer instances.
    • options.recycleDiskCache ( Boolean ) <optional> - Enables recycling of disk cache directories by Chrome instances. When a browser instance is closed, its disk cache directory is not deleted but it's used by a newly opened browser instance. This is useful to reduce the amount of data that needs to be downloaded, to speed up crawling and to reduce proxy usage. Note that the new browser starts with empty cookies, local storage etc. so this setting doesn't affect the anonymity of your crawler.

      Beware that the disk cache directories can consume a lot of disk space. To limit the space consumed, you can pass the --disk-cache-size=X argument to options.launchPuppeteerOptions.args, where X is the approximate maximum number of bytes for disk cache.

      IMPORTANT: Currently this feature only works in headful mode, because of a bug in Chromium.

      The options.recycleDiskCache setting should not be used together with --disk-cache-dir argument in options.launchPuppeteerOptions.args.

    Methods (3)

    destroy() → {Promise}

    Closes all the browsers.

    Returns:
  • ( Promise )
  • newPage() → {Promise.<Puppeteer.Page>}

    Opens a new tab in one of the browsers and returns a promise that resolves to its Puppeteer.Page.

    Returns:
  • ( Promise.<Puppeteer.Page> )
  • retire(browser) → {Promise}

    Manually retires a Puppeteer Browser instance from the pool. The browser will continue to process open pages so that they may gracefully finish. This is unlike browser.close() which will forcibly terminate the browser and all open pages will be closed.
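
    Example usage (a minimal sketch that builds on the puppeteerPool instance from the class example above; Puppeteer's page.browser() is used to obtain the browser that owns the page):

    const page = await puppeteerPool.newPage();

    // ... work with the page ...

    // Let the browser finish its other open pages and then close it.
    await puppeteerPool.retire(page.browser());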

    Parameters:
    • browser ( Puppeteer.Browser )
    Returns:
  • ( Promise )
  • Request

    Represents a URL to be crawled, optionally including HTTP method, headers, payload and other metadata. The Request object also stores information about errors that occurred during processing of the request.

    Each Request instance has the uniqueKey property, which is either specified manually in the constructor or generated automatically from the URL. Two requests with the same uniqueKey are considered as pointing to the same web page. This behavior applies to all Apify SDK classes, such as RequestList, RequestQueue or PuppeteerCrawler.

    Example use:

    const request = new Apify.Request({
        url: 'http://www.example.com',
        headers: { Accept: 'application/json' },
    });
    
    ...
    
    request.userData.foo = 'bar';
    request.pushErrorMessage(new Error('Request failed!'));
    
    ...
    
    const foo = request.userData.foo;

    Constructor

    new Request(opts)

    Parameters:
    • opts ( object )
      • url ( String ) - URL of the web page to crawl.
      • uniqueKey ( String ) <optional> - A unique key identifying the request. Two requests with the same uniqueKey are considered as pointing to the same URL.

        If uniqueKey is not provided, then it is automatically generated by normalizing the URL. For example, the URL HTTP://www.EXAMPLE.com/something/ produces the uniqueKey of http://www.example.com/something. The keepUrlFragment option determines whether the URL hash fragment is included in the uniqueKey or not. Beware that the HTTP method and payload are not included in the uniqueKey, so requests to the same URL but with different HTTP methods or different POST payloads are all considered equal.

        You can set the uniqueKey property to an arbitrary non-empty text value in order to override the default behavior and specify which URLs shall be considered equal.

      • method ( String ) <optional> - Defaults to 'GET'.
      • payload ( String | Buffer ) <optional> - HTTP request payload, e.g. for POST requests.
      • retryCount ( Number ) <optional> - Indicates how many times the URL was retried in a case of error. Defaults to 0.
      • errorMessages ( Array.<String> ) <optional> - An array of error messages from request processing.
      • headers ( String ) <optional> - HTTP headers. Defaults to {}.
      • userData ( Object ) <optional> - Custom user data assigned to the request. Defaults to {}.
      • keepUrlFragment ( Boolean ) <optional> - If false then hash part is removed from the URL when computing the uniqueKey property. For example, this causes the http://www.example.com#foo and http://www.example.com#bar URLs to have the same uniqueKey of http://www.example.com and thus the URLs are considered equal. Note that this option only has effect if uniqueKey is not set. Defaults to false.
      • ignoreErrors ( Boolean ) <optional> - If true then errors in processing of this request will be ignored. For example, the request won't be retried in case of an error. Defaults to false.

    Methods (1)

    pushErrorMessage(errorOrMessage)

    Stores information about an error that occurred during processing of this request.

    Parameters:
    • errorOrMessage ( Error | String ) - Error object or error message to be stored in request.

    RequestList

    Represents a static list of URLs to crawl. The URLs can be provided either in code or parsed from a text file hosted on the web.

    Each URL is represented using an instance of the Request class. The list can only contain unique URLs. More precisely, it can only contain Request instances with distinct uniqueKey properties. By default, uniqueKey is generated from the URL, but it can also be overridden. To add a single URL multiple times to the list, corresponding Request objects will need to have different uniqueKey properties. You can use the keepDuplicateUrls option to do this for you.

    Once you create an instance of RequestList, you need to call initialize() before the instance can be used. After that, no more URLs can be added to the list.

    RequestList is used by BasicCrawler, CheerioCrawler and PuppeteerCrawler as a source of URLs to crawl. Unlike RequestQueue, RequestList is static but it can contain even millions of URLs.

    RequestList has an internal state where it stores information which requests were handled, which are in progress or which were reclaimed. The state might be automatically persisted to the default key-value store by setting the persistStateKey option so that if the Node.js process is restarted, the crawling can continue where it left off. For more details, see KeyValueStore.

    Example usage:

    const requestList = new Apify.RequestList({
        sources: [
            // Separate requests
            { url: 'http://www.example.com/page-1', method: 'GET', headers: {} },
            { url: 'http://www.example.com/page-2', userData: { foo: 'bar' }},
    // Bulk load of URLs from file `http://www.example.com/my-url-list.txt`
            // Note that all URLs must start with http:// or https://
            { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } },
        ],
        persistStateKey: 'my-crawling-state'
    });
    
    // This call loads and parses the URLs from the remote file.
    await requestList.initialize();
    
    // Get requests from list
    const request1 = await requestList.fetchNextRequest();
    const request2 = await requestList.fetchNextRequest();
    const request3 = await requestList.fetchNextRequest();
    
    // Mark some of them as handled
    await requestList.markRequestHandled(request1);
    
    // If processing fails then reclaim it back to the list
    await requestList.reclaimRequest(request2);

    Constructor

    new RequestList(options)

    Parameters:
    • options ( Object )
      • sources ( Array ) - An array of sources for the RequestList. Its contents can either be plain objects defining at least the url property, or instances of the Request class. Additionally, a requestsFromUrl property may be used instead of url, which will instruct the RequestList to download the sources from the given remote location. The URLs will be parsed from the received response.
        [
            // One URL
            { method: 'GET', url: 'http://example.com/a/b' },
            // Batch import of URLs from a file hosted on the web
            { method: 'POST', requestsFromUrl: 'http://example.com/urls.txt' },
        ]
      • persistStateKey ( String ) <optional> - Identifies the key in the default key-value store under which the RequestList persists its state. If this is set then RequestList persists its state in regular intervals and loads the state from there in case it is restarted due to an error or system reboot.
      • state ( Object ) <optional> - The state object that the RequestList will be initialized from. It is in the form as returned by RequestList.getState(), such as follows:
        {
            nextIndex: 5,
            nextUniqueKey: 'unique-key-5',
            inProgress: {
                'unique-key-1': true,
                'unique-key-4': true,
            },
        }

        Note that the preferred (and simpler) way to persist the state of crawling of the RequestList is to use the persistStateKey parameter instead.

      • keepDuplicateUrls ( Boolean ) <optional> - By default, RequestList will deduplicate the provided URLs. Deduplication is based on the uniqueKey property of the passed source Request objects. If the property is not present, it is generated by normalizing the URL. If present, it is kept intact. In any case, only one request per uniqueKey is added to the RequestList, resulting in the removal of duplicate URLs / unique keys. Setting keepDuplicateUrls to true will append an additional identifier to the uniqueKey of each request that does not already include a uniqueKey, so duplicate URLs will be kept in the list. However, this does not protect the user from having duplicates in user-set uniqueKeys; it is the user's responsibility to ensure the uniqueness of their unique keys if they wish to keep more than just a single copy in the RequestList. Defaults to false.

    Methods (8)

    fetchNextRequest() → {Promise.<Request>}

    Gets the next Request to process. First, the function gets a request previously reclaimed using the reclaimRequest() function, if there is any. Otherwise it gets the next request from the sources.

    The function resolves to null if there are no more requests to process.

    Returns:
  • ( Promise.<Request> )
  • getState()

    Returns an object representing the internal state of the RequestList instance. Note that the object's fields can change in future releases.

    Returns:
  • ( Object )

    initialize() → {Promise}

    Loads all remote sources of URLs and potentially starts periodic state persistence. This function must be called before you can start using the instance in a meaningful way.

    Returns:
  • ( Promise )
  • isEmpty() → {Promise.<Boolean>}

    Resolves to true if the next call to fetchNextRequest() will return null, otherwise it resolves to false. Note that even if the list is empty, there might be some pending requests currently being processed.

    Returns:
  • ( Promise.<Boolean> )
  • isFinished() → {Promise.<boolean>}

    Returns true if all requests were already handled and there are no more left.

    Returns:
  • ( Promise.<boolean> )
  • length()

    Returns the total number of unique requests present in the RequestList.

    markRequestHandled(request) → {Promise}

    Marks request as handled after successful processing.

    Parameters:
    • request ( Request ) - Request object
    Returns:
  • ( Promise )
  • reclaimRequest(request) → {Promise}

    Reclaims a request back to the list if its processing failed. The request will become available in the next call to fetchNextRequest().

    Parameters:
    • request ( Request ) - Request object
    Returns:
  • ( Promise )
  • RequestQueue

    Represents a queue of URLs to crawl, which is used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.

    Each URL is represented using an instance of the Request class. The queue can only contain unique URLs. More precisely, it can only contain Request instances with distinct uniqueKey properties. By default, uniqueKey is generated from the URL, but it can also be overridden. To add a single URL multiple times to the queue, corresponding Request objects will need to have different uniqueKey properties.

    Do not instantiate this class directly, use the Apify.openRequestQueue() function instead.

    RequestQueue is used by BasicCrawler, CheerioCrawler and PuppeteerCrawler as a source of URLs to crawl. Unlike RequestList, RequestQueue supports dynamic adding and removing of requests. On the other hand, the queue is not optimized for operations that add or remove a large number of URLs in a batch.

    RequestQueue stores its data either on local disk or in the Apify cloud, depending on whether the APIFY_LOCAL_STORAGE_DIR or APIFY_TOKEN environment variable is set.

    If the APIFY_LOCAL_STORAGE_DIR environment variable is set, the queue data is stored in that local directory as follows:

    [APIFY_LOCAL_STORAGE_DIR]/request_queues/[QUEUE_ID]/[STATE]/[NUMBER].json

    Note that [QUEUE_ID] is the name or ID of the request queue. The default queue has ID default, unless you override it by setting the APIFY_DEFAULT_REQUEST_QUEUE_ID environment variable. Each request in the queue is stored as a separate JSON file, where [STATE] is either handled or pending, and [NUMBER] is an integer indicating the position of the request in the queue.

    If the APIFY_TOKEN environment variable is provided instead, the data is stored in the Apify Request Queue cloud storage.

    Example usage:

    // Open the default request queue associated with the actor run
    const queue = await Apify.openRequestQueue();
    
    // Open a named request queue
    const queueWithName = await Apify.openRequestQueue('some-name');
    
    // Enqueue a few requests
    await queue.addRequest(new Apify.Request({ url: 'http://example.com/aaa'}));
    await queue.addRequest(new Apify.Request({ url: 'http://example.com/bbb'}));
    await queue.addRequest(new Apify.Request({ url: 'http://example.com/foo/bar'}), { forefront: true });
    
    // Get requests from queue
    const request1 = await queue.fetchNextRequest();
    const request2 = await queue.fetchNextRequest();
    const request3 = await queue.fetchNextRequest();
    
    // Mark a request as handled
    await queue.markRequestHandled(request1);
    
    // If processing of a request fails then reclaim it back to the queue, so that it's crawled again
    await queue.reclaimRequest(request2);

    Constructor

    new RequestQueue()

    Methods (8)

    addRequest(request, [opts]) → {Promise.<RequestOperationInfo>}

    Adds a request to the queue.

    Parameters:
    • request ( Request ) - Request object
    • opts ( Object ) <optional>
      • forefront ( Boolean ) <optional> - If true, the request will be added to the foremost position in the queue.
    Returns:
  • ( Promise.<RequestOperationInfo> )
  • delete() → {Promise}

    Removes the queue either from the Apify cloud storage or from the local directory, depending on the mode of operation.

    Returns:
  • ( Promise )
  • fetchNextRequest() → {Promise.<Request>}

    Returns the next request in the queue to be processed.

    Returns:
  • ( Promise.<Request> )
  • getRequest(requestId) → {Promise.<Request>}

    Gets the request from the queue specified by ID.

    Parameters:
    • requestId ( String ) - Request ID
    Returns:
  • ( Promise.<Request> )
  • isEmpty() → {Promise.<Boolean>}

    Resolves to true if the next call to fetchNextRequest() will return null, otherwise it resolves to false. Note that even if the queue is empty, there might be some pending requests currently being processed.

    Due to the nature of distributed storage systems, the function might occasionally return a false negative, but it should never return a false positive!

    Returns:
  • ( Promise.<Boolean> )
  • isFinished() → {Promise.<Boolean>}

    Resolves to true if all requests were already handled and there are no more left. Due to the nature of distributed storage systems, the function might occasionally return a false negative, but it will never return a false positive.

    Returns:
  • ( Promise.<Boolean> )

    markRequestHandled(request) → {Promise.<RequestOperationInfo>}

    Marks a request as handled after successful processing.

    Parameters:
    • request ( Request )
    Returns:
  • ( Promise.<RequestOperationInfo> )

    reclaimRequest(request, [opts]) → {Promise.<RequestOperationInfo>}

    Reclaims a failed request back to the queue, so that it can be processed again later.

    Parameters:
    • request ( Request )
    • opts ( Object ) <optional>
      • forefront ( Boolean ) <optional> - If true, the request is returned to the beginning of the queue; otherwise it is returned to the back of the queue. Defaults to false.
    Returns:
  • ( Promise.<RequestOperationInfo> )

    SettingsRotator

    Rotates settings created by a user-provided function passed via newSettingsFunction. This is useful during web crawling to dynamically change settings and thus avoid detection of the crawler.

    This class is still work in progress, more features will be added soon.

    Constructor

    new SettingsRotator(options)

    Parameters:
    • options ( Object )
      • newSettingsFunction ( function )
      • maxUsages ( Number )

    Methods (2)

    fetchSettings() → {*}

    Fetches a settings object.

    Returns:
  • ( * )

    reclaimSettings(settings)

    Reclaims settings after use.

    Parameters:
    • settings ( * )
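
    As a rough sketch, the rotator can be used as follows. This assumes the class is exposed as Apify.SettingsRotator, like the other classes in this reference, and uses the getRandomUserAgent() utility documented below; the maxUsages value is only an example.

    // Rotate User-Agent settings, creating a fresh one after every 50 uses
    const settingsRotator = new Apify.SettingsRotator({
        newSettingsFunction: () => ({ userAgent: Apify.utils.getRandomUserAgent() }),
        maxUsages: 50,
    });

    const settings = settingsRotator.fetchSettings();
    // ... use settings.userAgent for a request or a browser instance ...
    settingsRotator.reclaimSettings(settings);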

    utils

    A namespace that contains various utilities.

    Example usage:

    const Apify = require('apify');
    
    ...
    
    // Sleep 1.5 seconds
    await Apify.utils.sleep(1500);

    Members (2)

    (static, constant) URL_NO_COMMAS_REGEX

    Default regular expression to match URLs in a string that may be plain text, JSON, CSV or other. It supports common URL characters and does not support URLs containing commas or spaces. The URLs also may contain Unicode letters (not symbols).

    (static, constant) URL_WITH_COMMAS_REGEX

    Regular expression that, in addition to the default regular expression URL_NO_COMMAS_REGEX, supports matching commas in URL path and query. Note, however, that this may prevent parsing URLs from comma delimited lists, or the URLs may become malformed.

    Methods (4)

    downloadListOfUrls(url, [encoding], [urlRegExp]) → {Promise}

    Returns a promise that resolves to an array of URLs parsed from the resource available at the provided URL. Optionally, a custom regular expression and encoding may be provided.

    Parameters:
    • url ( String ) - URL of the resource to download and parse.
    • encoding ( String ) <optional> - Defaults to 'utf8'.
    • urlRegExp ( RegExp ) <optional> - Defaults to URL_NO_COMMAS_REGEX.
    Returns:
  • ( Promise )

    extractUrls(string, [urlRegExp]) → {Array}

    Collects all URLs in an arbitrary string to an array, optionally using a custom regular expression.

    Parameters:
    • string ( String ) - The string to search for URLs.
    • urlRegExp ( RegExp ) <optional> - Defaults to URL_NO_COMMAS_REGEX.
    Returns:
  • ( Array )
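
    For illustration, the sketch below contrasts the two regular expressions described above; the sample text is made up.

    const text = 'See https://example.com/about and https://example.com/a,b,c?x=1 for details.';

    // With the default URL_NO_COMMAS_REGEX, the second URL is matched only up to the first comma
    const urls = Apify.utils.extractUrls(text);

    // URL_WITH_COMMAS_REGEX also matches the commas in the path and query
    const urlsWithCommas = Apify.utils.extractUrls(text, Apify.utils.URL_WITH_COMMAS_REGEX);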

    getRandomUserAgent() → {String}

    Returns a randomly selected User-Agent header out of a list of the most common headers.

    Returns:
  • ( String )

    sleep(millis) → {Promise}

    Returns a promise that resolves after a specific period of time. This is useful to implement waiting in your code, e.g. to prevent overloading of the target website or to avoid bot detection.

    Example usage:

    const Apify = require('apify');
    
    ...
    
    // Sleep 1.5 seconds
    await Apify.utils.sleep(1500);

    Parameters:
    • millis ( Number ) - Period of time to sleep, in milliseconds. If not a positive number, the returned promise resolves immediately.
    Returns:
  • ( Promise )

    utils.puppeteer

    A namespace that contains various Puppeteer utilities.

    Example usage:

    const Apify = require('apify');
    
    // Open https://www.example.com in Puppeteer
    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();
    await page.goto('https://www.example.com');
    
    // Inject jQuery into a page
    await Apify.utils.puppeteer.injectJQuery(page);

    Methods (7)

    blockResources(page, resourceTypes) → {Promise.<void>}

    Forces the browser tab to block loading certain page resources, using the Page.setRequestInterception(value) method. This is useful to speed up crawling of websites.

    The resource types to block can be controlled using the resourceTypes parameter, which indicates the types of resources as they are perceived by the rendering engine. The following resource types are currently supported: document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other. For more details, see Puppeteer's Request.resourceType() documentation.

    By default, the function blocks these resource types: stylesheet, font, image, media.

    Parameters:
    • page ( Page ) - Puppeteer's Page object
    • resourceTypes ( Array.<String> ) - Array of resource types to block.
    Returns:
  • ( Promise.<void> )
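
    An illustrative sketch; the page setup mirrors the namespace example above, and the list of resource types passed here matches the documented defaults.

    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();

    // Block stylesheets, fonts, images and media before navigating
    await Apify.utils.puppeteer.blockResources(page, ['stylesheet', 'font', 'image', 'media']);

    await page.goto('https://www.example.com');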

    compileScript(scriptString) → {function}

    Compiles a Puppeteer script into an async function that may be executed at any time by providing it with the following object:

    {
       page: Puppeteer.Page,
       request: Apify.Request,
    }

    The function is compiled by using the scriptString parameter as the function's body, so any limitations that apply to function bodies apply here as well. The return value of the compiled function is the return value of that function body, i.e. of the scriptString parameter.

    As a security measure, no globals such as 'process' or 'require' are accessible from within the function body. Note that the function does not provide a safe sandbox and even though globals are not easily accessible, malicious code may still execute in the main process via prototype manipulation. Therefore you should only use this function to execute sanitized or safe code.

    Parameters:
    • scriptString ( String )
    Returns:
  • ( function ) - async ({ page, request }) => { scriptString }
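
    A minimal sketch: the selector is illustrative, and the page and request variables are assumed to come from a crawler or from your own Puppeteer setup.

    // The string becomes the body of an async function receiving { page, request }
    const scriptString = `
        await page.waitForSelector('h1');
        return { url: request.url, title: await page.title() };
    `;
    const pageFunction = Apify.utils.puppeteer.compileScript(scriptString);

    // Later, with a Puppeteer page and an Apify.Request at hand:
    const result = await pageFunction({ page, request });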

    hideWebDriver(page) → {Promise}

    Hides certain Puppeteer fingerprints from the page, in order to help avoid detection of the crawler. The function should be called on a newly-created page object before navigating to the target crawled page.

    Parameters:
    • page ( Page ) - Puppeteer Page object.
    Returns:
  • ( Promise )
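
    For example (an illustrative sketch, following the pattern of the namespace example above):

    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();

    // Hide common headless-browser fingerprints before the target page is loaded
    await Apify.utils.puppeteer.hideWebDriver(page);
    await page.goto('https://www.example.com');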

    injectFile(page, filePath) → {Promise}

    Injects a JavaScript file into a Puppeteer page. Unlike Puppeteer's addScriptTag function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies.

    Parameters:
    • page ( Page ) - Puppeteer Page object.
    • filePath ( String ) - File path
    Returns:
  • ( Promise )

    injectJQuery(page) → {Promise}

    Injects jQuery library into a Puppeteer page. jQuery is often useful for various web scraping and crawling tasks, e.g. to extract data from HTML elements using CSS selectors.

    Beware that the injected jQuery object will be set to the window.$ variable and thus it might cause conflicts with libraries included by the page that use the same variable (e.g. another version of jQuery).

    Parameters:
    • page ( Page ) - Puppeteer Page object.
    Returns:
  • ( Promise )
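
    Once injected, jQuery can be used inside page.evaluate() calls. A small illustrative sketch, assuming a page object as in the namespace example above:

    await Apify.utils.puppeteer.injectJQuery(page);

    // jQuery is now available in the page context as window.$
    const headingTexts = await page.evaluate(() => {
        return $('h1, h2').map((i, el) => $(el).text()).get();
    });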

    injectUnderscore(page) → {Promise}

    Injects Underscore.js library into a Puppeteer page. Beware that the injected Underscore object will be set to the window._ variable and thus it might cause conflicts with libraries included by the page that use the same variable.

    Parameters:
    • page ( Page ) - Puppeteer Page object.
    Returns:
  • ( Promise )

    Globals

    Type Definitions

    ActorRun

    Represents information about an actor run, as returned by the Apify.call() function. The object is almost equivalent to the JSON response of the Actor run Apify API endpoint and extended with certain fields. For more details, see Runs in Apify actor documentation.

    Properties:
    • id ( String ) - Actor run ID
    • actId ( String ) - Actor ID
    • startedAt ( Date ) - Time when the actor run started
    • finishedAt ( Date ) - Time when the actor run finished. Contains null for running actors.
    • status ( String ) - Status of the run. For possible values, see Run lifecycle in Apify actor documentation.
    • meta ( Object ) - Actor run meta-data. For example:
        {
          "origin": "API",
          "clientIp": "1.2.3.4",
          "userAgent": "ApifyClient/0.2.13 (Linux; Node/v8.11.3)"
        }
    • stats ( Object ) - An object containing various actor run statistics. For example:
        {
          "inputBodyLen": 22,
          "restartCount": 0,
          "workersUsed": 1,
        }

      Beware that object fields might change in future releases.

    • options ( Object ) - Actor run options. For example:
        {
          "build": "latest",
          "timeoutSecs": 0,
          "memoryMbytes": 256,
          "diskMbytes": 512
        }
    • buildId ( String ) - ID of the actor build used for the run. For details, see Builds in Apify actor documentation.
    • buildNumber ( String ) - Number of the actor build used for the run. For example, 0.0.10.
    • exitCode ( Number ) - Exit code of the actor run process. It is null if the actor is still running.
    • defaultKeyValueStoreId ( String ) - ID of the default key-value store associated with the actor run. See KeyValueStore for details.
    • defaultDatasetId ( String ) - ID of the default dataset associated with the actor run. See Dataset for details.
    • defaultRequestQueueId ( String ) - ID of the default request queue associated with the actor run. See RequestQueue for details.
    • containerUrl ( String ) - URL on which the web server running inside actor run's Docker container can be accessed. For more details, see Container web server in Apify actor documentation.
    • output ( Object ) - Contains output of the actor run. The value is null or undefined in case the actor is still running, or if you pass false to the fetchOutput option of Apify.call().

      For example:

        {
          "contentType": "application/json; charset=utf-8",
          "body": {
            "message": "Hello world!"
          }
        }
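
    For illustration, a run object of this shape is obtained as follows; the actor name and input are only examples, and Apify.call() is documented elsewhere in this reference.

    const run = await Apify.call('apify/hello-world', { message: 'Hello world!' });

    console.log(`Run ${run.id} finished with status ${run.status}`);
    console.log(run.output && run.output.body);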

    ApifyCallError

    The class represents exceptions thrown by the Apify.call() function.

    Properties:
    • message ( String ) - Error message
    • name ( String ) - Contains ApifyCallError
    • run ( ActorRun ) - Object representing the failed actor run.
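
    A sketch of how the error might be handled; the actor name is illustrative.

    try {
        await Apify.call('apify/hello-world', { message: 'Hello world!' });
    } catch (err) {
        if (err.name === 'ApifyCallError') {
            // The called actor run failed; details about the run are in err.run
            console.log(`Actor run ${err.run.id} ended with status ${err.run.status}`);
        } else {
            throw err;
        }
    }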

    LaunchPuppeteerOptions

    Represents options passed to the Apify.launchPuppeteer() function.

    Properties:
    • opts.proxyUrl ( String ) - URL of an HTTP proxy server. It must define the port number, and it might also contain proxy username and password.

      For example: http://bob:pass123@proxy.example.com:1234.

    • opts.userAgent ( String ) - The User-Agent HTTP header used by the browser. If not provided, the function sets User-Agent to a reasonable default to reduce the chance of detection of the crawler.
    • opts.useChrome ( Boolean ) - If true and opts.executablePath is not set, Puppeteer will launch the full Google Chrome browser available on the machine rather than the bundled Chromium. The path to the Chrome executable is taken from the APIFY_CHROME_EXECUTABLE_PATH environment variable if provided, or defaults to the typical Google Chrome executable location for the operating system. Defaults to false.
    • opts.useApifyProxy ( Boolean ) - If set to true, Puppeteer will be configured to use Apify Proxy for all connections. For more information, see the documentation. Defaults to false.
    • opts.apifyProxyGroups ( Array.<String> ) - An array of proxy groups to be used by the Apify Proxy. Only applied if the useApifyProxy option is true.
    • opts.apifyProxySession ( String ) - Apify Proxy session identifier to be used by all the Chrome browsers. All HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier can only contain the following characters: 0-9, a-z, A-Z, ".", "_" and "~". Only applied if the useApifyProxy option is true.
    • opts.liveView ( Boolean ) - If set to true, a PuppeteerLiveViewServer will be started to enable screenshot and html capturing of visited pages using PuppeteerLiveViewBrowser. Defaults to false.
    • opts.liveViewOptions ( Object ) - Settings for PuppeteerLiveViewBrowser started using launchPuppeteer().
      • id ( String ) - Custom ID of a browser instance in live view.
      • screenshotTimeoutMillis ( Number ) - Time in milliseconds before a screenshot capturing will time out and the actor continues with execution. Screenshot capturing pauses execution within the given page.
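
    For illustration, some of the options might be combined as follows; the proxy credentials and the User-Agent string are only examples.

    const browser = await Apify.launchPuppeteer({
        // Use a custom HTTP proxy server
        proxyUrl: 'http://bob:pass123@proxy.example.com:1234',
        // Launch full Google Chrome instead of the bundled Chromium
        useChrome: true,
        // Override the default User-Agent header
        userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    });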

    PaginationList

    Represents one page of data items from the Dataset. For more details, see Dataset.getData().

    Properties:
    • items ( Array ) - Array of returned items.
    • total ( Number ) - Total number of objects.
    • offset ( Number ) - Number of objects that were skipped at the start.
    • count ( Number ) - Number of returned objects.
    • limit ( Number ) - Requested limit on the number of items.
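
    For illustration, an object of this shape is returned by Dataset.getData(). The sketch below assumes the Apify.openDataset() helper documented elsewhere in this reference; the pagination parameters are only examples.

    const dataset = await Apify.openDataset();

    // Fetch the second page of 10 items
    const { items, total, offset, count, limit } = await dataset.getData({ offset: 10, limit: 10 });
    console.log(`Received ${count} of ${total} items (offset ${offset}, limit ${limit})`);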

    RequestOperationInfo

    A helper class that is used to report results of RequestQueue operations such as addRequest(), markRequestHandled() and reclaimRequest(), as well as of the Apify.utils.puppeteer.enqueueLinks() function (see the sketch below).

    Properties:
    • wasAlreadyPresent ( Boolean ) - Indicates if request was already present in the queue.
    • wasAlreadyHandled ( Boolean ) - Indicates if request was already marked as handled.
    • requestId ( String ) - The ID of the added request
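
    A short illustrative sketch, assuming the queue from the RequestQueue example above:

    const operationInfo = await queue.addRequest(new Apify.Request({ url: 'http://example.com/aaa' }));

    if (operationInfo.wasAlreadyPresent) {
        console.log(`Request ${operationInfo.requestId} was already in the queue`);
    }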