Web scraping and automation SDK 0.5.42

The apify NPM package enables the development of web scrapers, crawlers and web automation projects, either locally or on Apify Actor - a serverless computing platform that enables the execution of arbitrary code in the cloud.

The package provides helper functions to launch web browsers with proxies, access storage, etc. Note that the use of the package is optional; you can create acts on the Apify platform without it.

For more information about the Apify Actor platform, please see https://www.apify.com/docs/actor

Common use-cases

The main goal of this package is to help with the implementation of web scraping and automation projects. Some of the most common use cases are:

  • If you want to crawl a website using, for example, the Request package, take a look at BasicCrawler in combination with RequestList for a fixed list of URLs or RequestQueue for a recursive crawl.
  • If you want to crawl a website using a real browser, use PuppeteerCrawler, which uses Puppeteer (headless/non-headless Chrome browser). PuppeteerCrawler supports both RequestList for a fixed list of URLs and RequestQueue for a recursive crawl.
  • If you need to process a high volume of asynchronous tasks in parallel, take a look at AutoscaledPool. This class executes defined tasks in a pool whose size is scaled based on available memory and CPU.
  • If you want to automate the filling of forms or any other web interaction, you can use Puppeteer (headless/non-headless Chrome browser).

If you deploy your code to the Apify platform, you can set up a scheduler or execute your code via the web API.

Quick start

To use the Apify SDK you must have Node.js (version 7.0.0 or newer) and NPM installed. If you have both, the easiest way to start is to use the Apify CLI (command-line tool).

Install the tool with:

npm -g install apify-cli

and create your project with:

apify create my_hello_world

cd my_hello_world

The Apify CLI asks you to choose a template and then creates a directory my_hello_world containing:

  • package.json with Apify SDK as dependency
  • main.js containing basic code for your project
  • apify_local directory containing local emulation of Apify storage types
  • files needed for optional deployment to Apify platform (Dockerfile, apify.json)
  • node_modules directory containing all the required NPM packages

If you chose the Puppeteer template, then the main.js file looks like this:

const Apify = require('apify');

Apify.main(async () => {
    const input = await Apify.getValue('INPUT');

    if (!input || !input.url) throw new Error('INPUT must contain a url!');

    console.log('Launching Puppeteer...');
    const browser = await Apify.launchPuppeteer();

    console.log(`Opening page ${input.url}...`);
    const page = await browser.newPage();
    await page.goto(input.url);
    const title = await page.title();
    console.log(`Title of the page "${input.url}" is "${title}".`);

    console.log('Closing Puppeteer...');
    await browser.close();

    console.log('Done.');
});

It simply takes the url field of its input, opens that page using Puppeteer in a Chrome browser and prints its title. The input is always stored in the default key-value store of the run. Its local emulation can be found in the directory apify_local/key-value-stores/default. To create an input, simply create a file apify_local/key-value-stores/default/INPUT.json containing:

{
  "url": "https://news.ycombinator.com"
}

You can then run your code with:

apify run

and see the following output:

Launching Puppeteer...

Opening page https://news.ycombinator.com...

Title of the page "https://news.ycombinator.com" is "Hacker News".

Closing Puppeteer...

Done.

Check the examples below to see what you can do with the Apify SDK. Once you are done with your code, you can deploy your project to the Apify platform with the following two steps:

apify login
apify push

Puppeteer

For those who are using Puppeteer (headless/non-headless Chrome browser), we provide a few helper classes and functions:

  • launchPuppeteer() function starts a new instance of the Puppeteer browser and returns its browser object.
  • PuppeteerPool helps to maintain a pool of Puppeteer instances. This is useful when you need to restart the browser after a certain number of requests, for example to rotate proxy servers.
  • PuppeteerCrawler helps to crawl a RequestList or RequestQueue in parallel using an autoscaled pool.

For example, launching Puppeteer and opening a page looks like this:

const url = 'https://news.ycombinator.com';

const browser = await Apify.launchPuppeteer();
const page = await browser.newPage();
await page.goto(url);
const title = await page.title();

console.log(`Title of the page "${url}" is "${title}".`);

For more information on Puppeteer, see its documentation.

Local usage

The easiest way to use apify locally is with the Apify CLI, as shown in the quick start section. Another way is to manually define the required environment variables:

  • APIFY_LOCAL_EMULATION_DIR - Directory where the apify package locally emulates Apify storages - key-value store and dataset. Key-value stores will be emulated in the directory [APIFY_LOCAL_EMULATION_DIR]/key-value-stores/[STORE_ID] and datasets in the directory [APIFY_LOCAL_EMULATION_DIR]/datasets/[DATASET_ID].
  • APIFY_DEFAULT_KEY_VALUE_STORE_ID - ID of the default key-value store.
  • APIFY_DEFAULT_DATASET_ID - ID of the default dataset.
  • APIFY_DEFAULT_REQUEST_QUEUE_ID - ID of the default request queue.

Apify will then store key-value store records in files named [KEY].[EXT], where [KEY] is the record key and [EXT] is based on the record content type. Dataset items will be stored in files named [ID].json, where [ID] is the sequence number of the dataset item.

If you want to use Apify Proxy locally, you must define the APIFY_PROXY_PASSWORD environment variable with the password you find at https://my.apify.com/proxy.
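
For example, a minimal sketch of such a manual setup (assuming the apify_local directory from the quick start exists):

// The environment variables must be set before the apify package is used.
process.env.APIFY_LOCAL_EMULATION_DIR = './apify_local';
process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID = 'default';
process.env.APIFY_DEFAULT_DATASET_ID = 'default';

const Apify = require('apify');

Apify.main(async () => {
    // Reads ./apify_local/key-value-stores/default/INPUT.json
    const input = await Apify.getValue('INPUT');
    console.dir(input);
});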

Promises vs. callbacks

By default, all asynchronous functions provided by this package return a promise. Apify uses the Bluebird promise implementation, so you can easily convert any function that returns a promise into a callback-style function. See the Bluebird documentation for more information.
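
For example, a sketch of bridging a promise-returning function to callback-style code using Bluebird's asCallback():

const Apify = require('apify');

// asCallback() is provided by the Bluebird promises returned by the package.
Apify.getValue('INPUT').asCallback((err, input) => {
    if (err) throw err;
    console.dir(input);
});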

Examples

The examples directory of this repository demonstrates different usages of this package.

Recursive crawling

The following two examples demonstrate recursive crawling of https://news.ycombinator.com. The crawler starts at https://news.ycombinator.com, and in each step it enqueues the page linked by the "more" button at the bottom and stores the posts from the opened page in a Dataset. As a queue, the crawler uses a Request Queue.

The former example crawls pages simply using the NPM Request and Cheerio packages, while the latter uses Puppeteer, which provides a full Chrome browser.

Crawling url list

These examples show how to scrape data from a fixed list of URLs using Puppeteer or Request and Cheerio.

Call to another act

This example shows how to call another act on the Apify platform - in this case apify/send-mail to send an email.

Check source code here

Act used as a synchronous API

This example shows an act with a short runtime - just a few seconds. It opens the web page http://goldengatebridge75.org/news/webcam.html, which contains a webcam stream from the Golden Gate Bridge, takes a screenshot and saves it as output. This makes the act executable on the Apify platform synchronously, with a single request that also returns its output.

The example is shared in the library at https://www.apify.com/apify/example-golden-gate-webcam, so you can easily run it with a request to https://api.apify.com/v2/acts/apify~example-golden-gate-webcam/run-sync?token=[YOUR_API_TOKEN] and get the image as the response. You can then use it directly in HTML, for example:

<img src="https://api.apify.com/v2/acts/apify~example-golden-gate-webcam/run-sync?token=[YOUR_API_TOKEN]"/>

Check source code here


Programmer's reference

The following sections describe all functions and properties provided by the apify package. All of them are instance members exported directly by the main module.

Members (2)

(static) events

Event emitter providing access to events from the Actor infrastructure. The event emitter is initialized by Apify.main(). If you don't use Apify.main(), then you must call await Apify.initializeEvents() yourself.

Example usage:

import { ACTOR_EVENT_NAMES } from 'apify/constants';

Apify.main(async () => {
    Apify.events.on(ACTOR_EVENT_NAMES.CPU_INFO, (data) => {
        if (data.isCpuOverloaded) console.log('OH NO! We are overloading CPU!');
    });
});

Event types:

  • cpuInfo (constant ACTOR_EVENT_NAMES.CPU_INFO), message { "isCpuOverloaded": true } - This event is sent every second and indicates whether the act is using the maximum amount of available CPU power. If the maximum is reached, there is no point in adding more workload.
  • migrating (constant ACTOR_EVENT_NAMES.MIGRATING), message null - This event is sent when the act is about to be migrated to another worker machine. In that case the act run will be stopped and then reinitialized on another server.
  • persistState (constant ACTOR_EVENT_NAMES.PERSIST_STATE), message { "isMigrating": false } - This event is sent at regular intervals to notify all components of the Apify SDK that it is time to persist state. This prevents the situation where an act gets restarted due to a migration to another worker machine and needs to start from scratch. The event is also sent as a result of ACTOR_EVENT_NAMES.MIGRATING, in which case the message is { "isMigrating": true }.
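
For example, a sketch of persisting custom act state whenever the persistState event fires (the myState object and the MY-STATE key are illustrative):

const myState = { processedUrls: [] }; // Illustrative state kept by the act.

Apify.events.on(ACTOR_EVENT_NAMES.PERSIST_STATE, () => {
    // Store the state so that a restarted run can continue where it left off.
    Apify.setValue('MY-STATE', myState).catch((err) => console.log(err));
});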

See the Node.js documentation for more information on event emitter usage.

client

A default instance of the ApifyClient class provided by the apify-client NPM package. The instance is created when the apify package is first imported and it is configured using the APIFY_API_BASE_URL, APIFY_USER_ID and APIFY_TOKEN environment variables.

After that, the instance is used for all underlying calls to the Apify API in functions such as Apify.getValue() or Apify.call(). The settings of the client can be globally altered by calling the Apify.client.setOptions() function. Just be careful, it might have undesired effects on other functions provided by this package.

Methods (16)

call(actId, [input], [opts]) → {Promise}

Runs another act under the current user account, waits for the act to finish and fetches its output.

By passing the waitSecs option you can reduce the maximum amount of time to wait for the run to finish. If the value is less than or equal to zero, the function returns immediately after the run is started.

The result of the function is an object that contains details about the run and potentially its output. For example:

{
    "id": "ErYkuTTsmKiXccNGT",
    "actId": "E2jjCZBezvAZnX8Rb",
    "userId": "mb7q2dycFBHDhae6A",
    "startedAt": "2017-10-25T14:23:44.376Z",
    "finishedAt": "2017-10-25T14:23:46.723Z",
    "status": "SUCCEEDED",
    "meta": { "origin": "API", "clientIp": "1.2.3.4", "userAgent": null },
    "stats": {
        "netRxBytes": 180,
        "netTxBytes": 0,
        ...
    },
    "options": {
       "build": "latest",
       "timeoutSecs": 0,
       "memoryMbytes": 512,
       "diskMbytes": 1024
    },
    "buildId": "Bwkqk59MCkdexDP34",
    "exitCode": 0,
    "defaultKeyValueStoreId": "ccFfRptZru2uqdQHP",
    "defaultDatasetId": "tZru2uqdQHPcgFtRo",
    "buildNumber": "0.1.2",
    "output": {
        "contentType": "application/json; charset=utf-8",
        "body": { "message": "Hello world!" }
    }
}

Internally, the function calls the Run act API endpoint and a few others.

Example usage:

const run = await Apify.call('apify/hello-world', { myInput: 123 });
console.log(`Received message: ${run.output.body.message}`);
Parameters:
  • actId ( String ) - Either username/act-name or act ID.
  • input ( Object | String | Buffer ) <optional> - Act input body. If it is an object, it is stringified to JSON and the content type set to application/json; charset=utf-8.
  • opts ( Object ) <optional>
    • token ( String ) <optional> - User API token. By default, it is taken from the APIFY_TOKEN environment variable.
    • build ( String ) <optional> - Tag or number of act build to be run (e.g. beta or 1.2.345). If not provided, the default build tag or number from act configuration is used (typically latest).
    • contentType ( String ) <optional> - Content type for the input. If not specified, input is expected to be an object that will be stringified to JSON and content type set to application/json; charset=utf-8. If opts.contentType is specified, then input must be a String or Buffer.
    • timeoutSecs ( Number ) <optional> - Time limit for act to finish, in seconds. If the limit is reached the resulting run will have the RUNNING status. By default, there is no timeout.
    • waitSecs ( Number ) <optional> - Maximum time to wait for the act run to finish, in seconds. If the limit is reached, the returned promise is resolved to a run object that will have status READY or RUNNING and it will not contain the act run output. If waitSecs is null or undefined, the function waits for the act to finish (default behavior).
    • fetchOutput ( Boolean ) <optional> - If false then the function does not fetch output of the act. Defaults to true.
    • disableBodyParser ( Boolean ) <optional> - If true then the function will not attempt to parse the act's output and will return it in a raw Buffer. Defaults to false.
Throws:
  • ( ApifyCallError ) - If the run doesn't succeed.
Returns:
  • ( Promise )

    getApifyProxyUrl(opts) → {String}

    Constructs the URL to the Apify Proxy using the specified settings. The proxy URL can be used from Apify Actor acts, web browsers or any other HTTP proxy-enabled applications.

    For more information, see the Apify Proxy page in the app or the documentation.

    Parameters:
    • opts ( Object )
      • password ( String ) - User's password for the proxy. By default, it is taken from the APIFY_PROXY_PASSWORD environment variable, which is automatically set by the system when running the acts on the Apify cloud.
      • groups ( Array.<String> ) <optional> - Array of Apify Proxy groups to be used. If not provided, the proxy will select the groups automatically.
      • session ( String ) <optional> - Apify Proxy session identifier to be used by the Chrome browser. All HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier can only contain the following characters: 0-9, a-z, A-Z, ".", "_" and "~".
    Returns:
  • ( String ) - Returns the proxy URL, e.g. http://auto:[email protected]:8000.
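
    Example usage (the proxy group name and session are illustrative):

    const proxyUrl = Apify.getApifyProxyUrl({
        groups: ['SOME_GROUP'], // Illustrative proxy group name.
        session: 'my_session_1',
    });

    const browser = await Apify.launchPuppeteer({ proxyUrl });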

    getEnv() → {Object}

    Returns a new object which contains information parsed from the APIFY_XXX environment variables. It has the following properties:

    {
        // ID of the act (APIFY_ACT_ID)
        actId: String,
     
        // ID of the act run (APIFY_ACT_RUN_ID)
        actRunId: String,
     
        // ID of the user who started the act - note that it might be
        // different than the owner of the act (APIFY_USER_ID)
        userId: String,
     
        // Authentication token representing privileges given to the act run,
        // it can be passed to various Apify APIs (APIFY_TOKEN).
        token: String,
     
        // Date when the act was started (APIFY_STARTED_AT)
        startedAt: Date,
     
        // Date when the act will time out (APIFY_TIMEOUT_AT)
        timeoutAt: Date,
     
        // ID of the key-value store where input and output data of this
        // act is stored (APIFY_DEFAULT_KEY_VALUE_STORE_ID)
        defaultKeyValueStoreId: String,
     
        // ID of the dataset where input and output data of this
        // act is stored (APIFY_DEFAULT_DATASET_ID)
        defaultDatasetId: String,
     
        // Amount of memory allocated for the act run,
        // in megabytes (APIFY_MEMORY_MBYTES)
        memoryMbytes: Number,
    }

    For the list of the APIFY_XXX environment variables, see the Actor documentation. If any of the variables is not defined or is invalid, the corresponding value in the resulting object will be null.

    Returns:
  • ( Object )
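
    Example usage:

    const env = Apify.getEnv();
    console.log(`Act ${env.actId} was started at ${env.startedAt} by user ${env.userId}`);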

    initializeEvents()

    Initializes the Apify.events event emitter by creating a connection to a websocket that provides the events. This is automatically called by Apify.main().

    main(userFunc)

    Runs a user function that performs the logic of the act. The Apify.main(userFunc) function does the following actions:

    1. Invokes the user function passed as the userFunc parameter
    2. If the user function returned a promise, waits for it to resolve
    3. If the user function throws an exception or some other error is encountered, prints error details to console so that they are stored to the log file
    4. Exits the process

    In the simplest case, the user function is synchronous:

    Apify.main(() => {
        // My synchronous function that returns immediately
    });

    If the user function returns a promise, it is considered as asynchronous:

    const request = require('request-promise');
    Apify.main(() => {
        // My asynchronous function that returns a promise
        return Promise.resolve()
        .then(() => {
            return request('http://www.example.com');
        })
        .then((html) => {
            console.log(html);
        });
    });

    To simplify your code, you can take advantage of the async/await keywords:

    const request = require('request-promise');
    Apify.main(async () => {
         const html = await request('http://www.example.com');
         console.log(html);
    });

    Note that the use of Apify.main() in acts is optional; the function is provided merely for user convenience and acts don't need to use it.

    Parameters:
    • userFunc ( function ) - User function to be executed

    stopEvents()

    Closes the websocket providing events from the Actor infrastructure and also stops sending the internal events of the Apify package, such as persistState. This is automatically called at the end of Apify.main().

    getMemoryInfo() → {Promise}

    Returns memory statistics of the container, which is an object with the following properties:

    {
      // Total memory available to the act
      totalBytes: Number,
       
      // Amount of free memory
      freeBytes: Number,
       
      // Amount of memory used (= totalBytes - freeBytes)
      usedBytes: Number,
      // Amount of memory used by main NodeJS process
      mainProcessBytes: Number,
      // Amount of memory used by child processes of main NodeJS process
      childProcessesBytes: Number,
    }
    Returns:
  • ( Promise ) - Returns a promise.
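
    Example usage:

    const memoryInfo = await Apify.getMemoryInfo();
    console.log(`Free memory: ${Math.round(memoryInfo.freeBytes / (1024 * 1024))} MB`);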

    getValue(key) → {Promise}

    Gets a value from the default key-value store for the current act run using the Apify API. The key-value store is created automatically for each act run and its ID is passed by the Actor platform in the APIFY_DEFAULT_KEY_VALUE_STORE_ID environment variable. It is used to store input and output of the act under keys named INPUT and OUTPUT, respectively. However, the store can be used for storage of any other values under arbitrary keys.

    Example usage:

    const input = await Apify.getValue('INPUT');
    
    console.log('My input:');
    console.dir(input);

    The result of the function is the body of the record. Bodies with the application/json content type are automatically parsed to an object. Similarly, for text/plain content types the body is parsed as String. For all other content types, the body is a raw Buffer. If the record cannot be found, the result is null.

    If the APIFY_LOCAL_EMULATION_DIR environment variable is defined, the value is read from that directory rather than the key-value store, specifically from a file that has the key as its name. If the file does not exist, the returned value is null. The file gets an extension based on its content type. This feature is useful for local development and debugging of your acts.

    Parameters:
    • key ( String ) - Key of the record.
    Returns:
  • ( Promise ) - Returns a promise.

    isAtHome() → {Boolean}

    Returns true when the code is running on the Apify platform, and false otherwise (for example locally).

    Returns:
  • ( Boolean )

    isDocker() → {Promise}

    Returns a promise that resolves to true if the code is running in a Docker container.

    Returns:
  • ( Promise )

    launchPuppeteer([opts]) → {Promise}

    Launches headless Chrome using Puppeteer pre-configured to work within the Apify platform. The function has the same argument and the return value as puppeteer.launch(). See Puppeteer documentation for more details.

    The launchPuppeteer() function alters the following Puppeteer options:

    • Passes the setting from the APIFY_HEADLESS environment variable to the headless option, unless it was already defined by the caller or APIFY_XVFB environment variable is set to 1. Note that Apify Actor cloud platform automatically sets APIFY_HEADLESS=1 to all running acts.
    • Takes the proxyUrl option, checks it and adds it to args as --proxy-server=XXX. If the proxy uses authentication, the function sets up an anonymous HTTP proxy to make the proxy work with headless Chrome. For more information, read the blog post about the proxy-chain library.
    • If opts.useApifyProxy is true then the function generates a URL of Apify Proxy based on opts.apifyProxyGroups and opts.apifyProxySession and passes it as opts.proxyUrl.
    • The function adds --no-sandbox to args to enable running headless Chrome in a Docker container on the Apify platform.

    To use this function, you need to have the puppeteer NPM package installed in your project. When running on the Apify cloud platform, you can achieve that simply by using the apify/actor-node-chrome base Docker image for your act - see Apify Actor documentation for details.

    For an example of usage, see the apify/example-puppeteer actor.

    Parameters:
    • opts ( LaunchPuppeteerOptions ) <optional> - Optional settings passed to puppeteer.launch(). Additionally the object can contain the following fields:
    Returns:
  • ( Promise ) - Promise object that resolves to Puppeteer's Browser instance.
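
    For example, a sketch of launching the browser through a custom proxy (the proxy URL is illustrative):

    const browser = await Apify.launchPuppeteer({
        proxyUrl: 'http://bob:[email protected]:8000', // Illustrative authenticated proxy.
    });
    const page = await browser.newPage();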

    launchWebDriver([opts]) → {Promise}

    Opens a new instance of Chrome web browser controlled by Selenium WebDriver. The result of the function is the new instance of the WebDriver class.

    To use this function, you need to have Google Chrome and ChromeDriver installed in your environment. For example, you can use the apify/actor-node-chrome base Docker image for your act - see documentation for more details.

    For an example of usage, see the apify/example-selenium act.

    Parameters:
    • opts ( Object ) <optional> - Optional settings for the new WebDriver instance. Additionally, the object can contain the following fields:
      • proxyUrl ( String ) <optional> - URL to a proxy server. Currently only http:// scheme is supported. Port number must be specified. For example, http://example.com:1234.
      • headless ( Boolean ) <optional> - Indicates that the browser will be started in headless mode. If the option is not defined, and the APIFY_HEADLESS environment variable has value 1 and APIFY_XVFB is NOT 1, the value defaults to true, otherwise it will be false.
      • userAgent ( String ) <optional> - User-Agent for the browser. If not provided, the function sets it to a reasonable default.
    Returns:
  • ( Promise )
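
    A minimal usage sketch, assuming Google Chrome and ChromeDriver are available in the environment:

    const webDriver = await Apify.launchWebDriver();
    await webDriver.get('http://www.example.com');
    const title = await webDriver.getTitle();
    console.log(`Title: ${title}`);
    await webDriver.quit();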

    openDataset(datasetIdOrName) → {Promise.<Dataset>}

    Opens a dataset and returns a promise resolving to an instance of the Dataset object.

    Dataset is an append-only storage that is useful for storing sequential or tabular results. For more information, see Dataset documentation.

    Example usage:

    const dataset = await Apify.openDataset(); // Opens the default dataset of the run.
    const datasetWithName = await Apify.openDataset('some-name'); // Opens the dataset named 'some-name'.

    // Write a single row to the dataset
    await dataset.pushData({ foo: 'bar' });
    
    // Write multiple rows
    await dataset.pushData([
      { foo: 'bar2', col2: 'val2' },
      { col3: 123 },
    ]);

    If the APIFY_LOCAL_EMULATION_DIR environment variable is set, the result of this function is an instance of the DatasetLocal class which stores the data in a local directory rather than Apify cloud. This is useful for local development and debugging of your acts.

    Parameters:
    • datasetIdOrName ( string ) - ID or name of the dataset to be opened. If no value is provided then the function opens the default dataset associated with the act run.
    Returns:
  • ( Promise.<Dataset> ) - Returns a promise that resolves to a Dataset object.

    openKeyValueStore(storeIdOrName) → {Promise.<KeyValueStore>}

    Opens a key-value store and returns a promise resolving to an instance of the KeyValueStore class.

    Key-value store is a simple storage for records, where each record has a unique key. For more information, see Key-value store documentation.

    Example usage:

    const store = await Apify.openKeyValueStore('my-store-id');
    await store.setValue('some-key', { foo: 'bar' });

    If the APIFY_LOCAL_EMULATION_DIR environment variable is set, the result of this function is an instance of the KeyValueStoreLocal class which stores the records in a local directory rather than Apify cloud. This is useful for local development and debugging of your acts.

    Parameters:
    • storeIdOrName ( string ) - ID or name of the key-value store to be opened. If no value is provided then the function opens the default key-value store associated with the act run.
    Returns:
  • ( Promise.<KeyValueStore> ) - Returns a promise that resolves to a KeyValueStore object.

    pushData(data) → {Promise}

    Stores an object or an array of objects in the default dataset for the current act run using the Apify API. The ID of the default dataset is taken from the APIFY_DEFAULT_DATASET_ID environment variable. The function has no result, but throws on invalid args or other errors.

    await Apify.pushData(data);

    The data is stored in the default dataset associated with this act.

    If the APIFY_LOCAL_EMULATION_DIR environment variable is defined, the data gets pushed into a local directory. This feature is useful for local development and debugging of your acts.

    IMPORTANT: Do not forget to use the await keyword when calling Apify.pushData(), otherwise the act process might finish before the data is stored!
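
    For example:

    await Apify.pushData({ url: 'http://www.example.com', title: 'Example Domain' });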

    Parameters:
    • data ( Object | Array ) - Object or array of objects containing data to be stored in the dataset (max. 9MB)
    Returns:
  • ( Promise ) - Returns a promise that gets resolved once data are saved.

    setValue(key, value, [options]) → {Promise}

    Stores a value in the default key-value store for the current act run using the Apify API. The data is stored in the key-value store created specifically for the act run, whose ID is defined in the APIFY_DEFAULT_KEY_VALUE_STORE_ID environment variable. The function has no result, but throws on invalid args or other errors.

    await Apify.setValue('OUTPUT', { someValue: 123 });

    By default, value is converted to JSON and stored with the application/json; charset=utf-8 content type. To store a value with another content type, pass it in the options as follows:

    await Apify.setValue('OUTPUT', 'my text data', { contentType: 'text/plain' });

    In this case, the value must be a string or Buffer.

    If the APIFY_LOCAL_EMULATION_DIR environment variable is defined, the value is written to that local directory rather than the key-value store on Apify cloud, to a file named as the key. This is useful for local development and debugging of your acts.

    IMPORTANT: Do not forget to use the await keyword when calling Apify.setValue(), otherwise the act process might finish before the value is stored!

    Parameters:
    • key - Key of the record
    • value - Value of the record:
      • If null, the record in the key-value store is deleted.
      • If no options.contentType is specified, value can be any object and it will be stringified to JSON.
      • If options.contentType is specified, value is considered raw data and it must be a String or Buffer.
      For any other value an error will be thrown.
    • options ( Object ) <optional>
      • contentType ( String ) <optional> - Sets the MIME content type of the value.
    Returns:
  • ( Promise ) - Returns a promise that resolves once the value is stored.

    AutoscaledPool

    Manages a pool of asynchronous resource-intensive tasks that are executed in parallel. The pool only starts new tasks if there is enough free CPU and memory available. The information about the CPU and memory usage is obtained either from the local system or from the Apify cloud infrastructure in case the process is running on the Apify platform.

    The auto-scaled pool is started by calling the run() function and it finishes when the last running task gets resolved and the next call to the function passed via isFinishedFunction resolves to true. If any of the tasks throws then the run() function also throws.

    The pool evaluates whether it should start a new task every time one of the tasks finishes, and also in the interval set by the options.maybeRunIntervalMillis parameter.

    Basic usage of AutoscaledPool:

    const pool = new Apify.AutoscaledPool({
        maxConcurrency: 50,
        runTaskFunction: () => {
            // Run some resource-intensive asynchronous operation here and return a promise...
        },
    });
    
    await pool.run();

    Constructor

    new AutoscaledPool(options)

    Parameters:
    • options ( Object )
      • runTaskFunction ( function ) <optional> - A function that performs an asynchronous resource-intensive task. The function must either return a promise or null if no task is currently available.
      • isFinishedFunction ( function ) <optional> - A function that is called every time there are no tasks being processed. If it resolves to true then the pool's run finishes. If isFinishedFunction is not provided then the pool is finished whenever there are no running tasks.
      • isTaskReadyFunction ( function ) <optional> - A function that indicates whether runTaskFunction should be called. By default, this function is called every time there is free capacity for a new task. By overriding it you can throttle the number of calls to runTaskFunction, or prevent calls to runTaskFunction when you know that it would return null.
      • minConcurrency ( Number ) <optional> - Minimum number of tasks running in parallel. Defaults to 1.
      • maxConcurrency ( Number ) <optional> - Maximum number of tasks running in parallel. Defaults to 1000.
      • maxMemoryMbytes ( Number ) <optional> - Maximum memory available in the system. By default the pool uses the totalMemory value provided by Apify.getMemoryInfo().
      • minFreeMemoryRatio ( Number ) <optional> - Minimum ratio of free memory kept in the system. Defaults to 0.2.
      • maybeRunIntervalMillis ( Number ) <optional> - Indicates how often should the pool try to call opts.runTaskFunction to start a new task. Defaults to 500.
      • ignoreMainProcess ( Boolean ) <optional> - If set to true then the auto-scaling manager does not consider memory consumption of the main Node.js process when scaling the pool up or down. This is mainly useful when tasks are running as separate processes (e.g. web browsers). Defaults to false.
      • loggingIntervalMillis ( Number ) <optional> - Specifies a period in which the instance logs its state, in milliseconds. Set to null to disable periodic logging. Defaults to 60000.
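
    For example, a sketch of a pool that drains a fixed queue of URLs (the fetchUrl() helper is hypothetical):

    const urls = ['http://www.example.com/a', 'http://www.example.com/b'];

    const pool = new Apify.AutoscaledPool({
        isFinishedFunction: () => Promise.resolve(urls.length === 0),
        runTaskFunction: () => {
            const url = urls.pop();
            if (!url) return null; // No task is available right now.
            return fetchUrl(url); // Hypothetical helper returning a promise.
        },
    });

    await pool.run();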

    Methods (1)

    run() → {Promise}

    Runs the auto-scaled pool. Returns a promise that gets resolved once all the tasks are finished, or rejected if any of them fails.

    Returns:
  • ( Promise )

    BasicCrawler

    Provides a simple framework for parallel crawling of web pages, whose URLs are fed either from a static list (using the RequestList class) or from a dynamic queue of URLs (using the RequestQueue class).

    BasicCrawler invokes handleRequestFunction for each Request object fetched from options.requestList or options.requestQueue, as long as any of them is not empty. New requests are only handled if there is enough free CPU and memory available, using the functionality provided by the AutoscaledPool class. Note that all AutoscaledPool configuration options can be passed to options parameter of the BasicCrawler constructor.

    If both requestList and requestQueue are used, the instance first processes URLs from the requestList and automatically enqueues all of them to the requestQueue before it starts their processing. This guarantees that a single URL is not crawled multiple times.

    Example usage:

    const rp = require('request-promise');
    
    // Prepare a list of URLs to crawl
    const requestList = new Apify.RequestList({
      sources: [
          { url: 'http://www.example.com/page-1' },
          { url: 'http://www.example.com/page-2' },
      ],
    });
    await requestList.initialize();
    
    // Crawl the URLs
    const crawler = new Apify.BasicCrawler({
        requestList,
        handleRequestFunction: async ({ request }) => {
            // 'request' contains an instance of the Request class
            // Here we simply fetch the HTML of the page and store it to a dataset
            await Apify.pushData({
                url: request.url,
                html: await rp(request.url),
            })
        },
    });
    
    await crawler.run();

    Constructor

    new BasicCrawler(options)

    Parameters:
    • options ( Object ) -
      • requestList ( RequestList ) <optional> - Static list of URLs to be processed.
      • requestQueue ( RequestQueue ) <optional> - Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites.
      • handleRequestFunction ( function ) <optional> - Function that processes a single Request object. It must return a promise.
      • handleFailedRequestFunction ( function ) <optional> - Function that handles requests that failed more than options.maxRequestRetries times. Defaults to ({ request, error }) => log.error('Request failed', _.pick(request, 'url', 'uniqueKey')).
      • maxRequestRetries ( Number ) <optional> - How many times the request is retried if handleRequestFunction failed. Defaults to 3.
      • maxRequestsPerCrawl ( Number ) <optional> - Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
      • maxMemoryMbytes ( Number ) <optional> - Maximum memory available in the system. See AutoscaledPool for details.
      • minConcurrency ( Number ) <optional> - Minimum number of requests to process in parallel. See AutoscaledPool for details. Defaults to 1.
      • maxConcurrency ( Number ) <optional> - Maximum number of requests to process in parallel. See AutoscaledPool for details. Defaults to 1000.
      • minFreeMemoryRatio ( Number ) <optional> - Minimum ratio of free memory kept in the system. See AutoscaledPool for details. Defaults to 0.2.
      • isFinishedFunction ( function ) <optional> - By default, BasicCrawler finishes when all the requests have been processed. You can override this behaviour by providing a custom isFinishedFunction. This function is called every time there are no requests being processed. If it resolves to true then the crawler's run finishes. See AutoscaledPool for details.
      • ignoreMainProcess ( Boolean ) <optional> - If set to true then the auto-scaling manager does not consider memory consumption of the main Node.js process when scaling the pool up or down. This is mainly useful when tasks are running as separate processes (e.g. web browsers). See AutoscaledPool for details. Defaults to false.

    Methods (1)

    run() → {Promise}

    Runs the crawler. Returns a promise that gets resolved once all the requests are processed.

    Returns:
  • ( Promise )

    Dataset

    The Dataset class provides a simple interface to the Apify Dataset storage. You should not instantiate this class directly, use the Apify.openDataset() function.

    Example usage:

    const dataset = await Apify.openDataset('my-dataset-id');
    await dataset.pushData({ foo: 'bar' });

    Constructor

    new Dataset(datasetId)

    Parameters:
    • datasetId ( String ) - ID of the dataset.

    Methods (6)

    delete() → {Promise}

    Deletes the dataset.

    Returns:
  • ( Promise )

    forEach(iteratee, opts, index) → {Promise.<undefined>}

    Iterates over all the dataset items, yielding each in turn to an iteratee function. Each invocation of iteratee is called with two arguments: (element, index).

    If iteratee returns a Promise then it's awaited before the next call.

    Parameters:
    • iteratee ( function ) -
    • opts ( Opts ) -
    • options.offset ( Number ) <optional> - Number of array elements that should be skipped at the start. Defaults to 0.
    • options.desc ( Number ) <optional> - If 1 then the objects are sorted by createdAt in descending order.
    • options.fields ( Array ) <optional> - If provided then returned objects will only contain specified keys
    • options.unwind ( String ) <optional> - If provided then objects will be unwound based on provided field.
    • options.limit ( Number ) <optional> - How many items to load in one request. Defaults to 250000.
    • index ( Number )
    Returns:
  • ( Promise.<undefined> )
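
    Example usage:

    const dataset = await Apify.openDataset();

    await dataset.forEach(async (item, index) => {
        console.log(`Item ${index}: ${JSON.stringify(item)}`);
    });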

    getData(options) → {Promise}

    Returns items in the dataset based on the provided parameters.

    If format is json then the function doesn't return an array of records but a PaginationList instead.

    Parameters:
    • options ( Object )
      • format ( String ) <optional> - Format of the items, possible values are: json, csv, xlsx, html, xml and rss. Defaults to 'json'.
      • offset ( Number ) <optional> - Number of array elements that should be skipped at the start. Defaults to 0.
      • limit ( Number ) <optional> - Maximum number of array elements to return. Defaults to 250000.
      • desc ( Number ) <optional> - If 1 then the objects are sorted by createdAt in descending order.
      • fields ( Array ) <optional> - If provided then returned objects will only contain specified keys
      • unwind ( String ) <optional> - If provided then objects will be unwound based on provided field.
      • disableBodyParser ( Boolean ) <optional> - If true then response from API will not be parsed
      • attachment ( Number ) <optional> - If 1 then the response will define the Content-Disposition: attachment header, forcing a web browser to download the file rather than to display it. By default this header is not present.
      • delimiter ( String ) <optional> - A delimiter character for CSV files, only used if format=csv. You might need to URL-encode the character (e.g. use %09 for tab or %3B for semicolon). Defaults to ','.
      • bom ( Number ) <optional> - All responses are encoded in UTF-8 encoding. By default, the csv files are prefixed with the UTF-8 Byte Order Mark (BOM), while json, jsonl, xml, html and rss files are not. If you want to override this default behavior, specify bom=1 query parameter to include the BOM or bom=0 to skip it.
      • xmlRoot ( String ) <optional> - Overrides default root element name of xml output. By default the root element is results.
      • xmlRow ( String ) <optional> - Overrides default element name that wraps each page or page function result object in xml output. By default the element name is page or result based on value of simplified parameter.
      • skipHeaderRow ( Number ) <optional> - If set to 1 then header row in csv format is skipped.
    Returns:
  • ( Promise )
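
    For example, a sketch fetching the 10 newest items (assuming the returned PaginationList exposes the records in its items property):

    const dataset = await Apify.openDataset();
    const paginationList = await dataset.getData({ limit: 10, desc: 1 });
    console.dir(paginationList.items); // Assumption: records are under 'items'.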

    map(iteratee, opts, index) → {Promise.<Array>}

    Produces a new array of values by mapping each value in the list through a transformation function (iteratee). Each invocation of iteratee is called with two arguments: (element, index).

    If iteratee returns a Promise then it's awaited before the next call.

    Parameters:
    • iteratee ( function ) -
    • opts ( Opts ) -
    • options.offset ( Number ) <optional> - Number of array elements that should be skipped at the start. Defaults to 0.
    • options.desc ( Number ) <optional> - If 1 then the objects are sorted by createdAt in descending order.
    • options.fields ( Array ) <optional> - If provided then returned objects will only contain specified keys
    • options.unwind ( String ) <optional> - If provided then objects will be unwound based on provided field.
    • options.limit ( Number ) <optional> - How many items to load in one request. Defaults to 250000.
    • index ( Number )
    Returns:
  • ( Promise.<Array> )

    pushData() → {Promise}

    Stores an object or an array of objects in the dataset. The function has no result, but throws on invalid args or other errors.

    Returns:
  • ( Promise ) - That resolves when data gets saved into the dataset.

    reduce(iteratee, memo, opts, index) → {Promise.<*>}

    Memo is the initial state of the reduction, and each successive step of it should be returned by iteratee. The iteratee is passed three arguments: the memo, then the value and index of the iteration.

    If no memo is passed to the initial invocation of reduce, the iteratee is not invoked on the first element of the list. The first element is instead passed as the memo in the invocation of the iteratee on the next element in the list.

    If iteratee returns a Promise then it's awaited before the next call.

    Parameters:
    • iteratee ( function ) -
    • memo ( * ) -
    • opts ( Opts ) -
    • options.offset ( Number ) <optional> - Number of array elements that should be skipped at the start. Defaults to 0.
    • options.desc ( Number ) <optional> - If 1 then the objects are sorted by createdAt in descending order.
    • options.fields ( Array ) <optional> - If provided then returned objects will only contain specified keys
    • options.unwind ( String ) <optional> - If provided then objects will be unwound based on provided field.
    • options.limit ( Number ) <optional> - How many items to load in one request. Defaults to 250000.
    • index ( Number )
    Returns:
  • ( Promise.<*> )
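
    For example, a sketch summing a numeric field across all items (the count field is illustrative):

    const dataset = await Apify.openDataset();
    const total = await dataset.reduce((memo, item) => memo + (item.count || 0), 0);
    console.log(`Total count: ${total}`);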

    KeyValueStore

    The KeyValueStore class provides a simple interface to the Apify Key-value stores. You should not instantiate this class directly, use the Apify.openKeyValueStore() function.

    Example usage:

    // Opens default key-value store of the run.
    const store = await Apify.openKeyValueStore();
    
    // Opens key-value store called 'some-name', belonging to the current Apify user account.
    const storeWithName = await Apify.openKeyValueStore('some-name');
    
    // Write and read data record
    await store.setValue('some-key', { foo: 'bar' });
    const value = await store.getValue('some-key');

    Constructor

    new KeyValueStore(storeId)

    Parameters:
    • storeId ( String ) - ID of the key-value store.

    Methods (3)

    delete() → {Promise}

    Deletes the store.

    Returns:
  • ( Promise )

    getValue(key) → {Promise}

    Gets a record from the current key-value store using its key. For more details, see Apify.getValue.

    Parameters:
    • key ( String ) - Record key.
    Returns:
  • ( Promise )

    setValue(key, value, [options]) → {Promise}

    Stores a record in the key-value store. The function has no result, but throws on invalid arguments or other errors.

    Parameters:
    • key ( String ) - Record key.
    • value ( Object | String | Buffer ) - Record value. If content type is not provided then the value is stringified to JSON.
    • options ( Object ) <optional>
      • contentType ( String ) <optional> - Content type of the record.
    Returns:
  • ( Promise )

    PseudoUrl

    Represents a pseudo URL (PURL), which is simply a URL with special directives enclosed in [] brackets. Currently, the only supported directive is [regexp], which defines a JavaScript-style regular expression to match against the URL.

    For example, a PURL http://www.example.com/pages/[(\w|-)*] will match all of the following URLs:

    • http://www.example.com/pages/
    • http://www.example.com/pages/my-awesome-page
    • http://www.example.com/pages/something

    Example use:

    const purl = new Apify.PseudoUrl('http://www.example.com/pages/[(\\w|-)*]');
    
    if (purl.matches('http://www.example.com/pages/my-awesome-page')) console.log('Match!');

    Constructor

    new PseudoUrl(purl, requestTemplate)

    Parameters:
    • purl ( String ) - Pseudo url.
    • requestTemplate ( Object ) - Request options for created requests.

    Methods (2)

    createRequest(url) → {Request}

    Creates a Request object from requestTemplate and given URL.

    Parameters:
    • url ( String )
    Returns:
  • ( Request )

    matches(url) → {Boolean}

    Determines whether a URL matches this pseudo-URL pattern.

    Parameters:
    • url ( String ) - URL to be matched.
    Returns:
  • ( Boolean ) - Returns true if given URL matches pseudo URL.

    PuppeteerCrawler

    Provides a simple framework for parallel crawling of web pages using headless Chrome with Puppeteer. The URLs of pages to visit are given by Request objects that are fed from a list (see RequestList class) or from a dynamic queue (see RequestQueue class).

    PuppeteerCrawler opens a new Chrome page (i.e. tab) for each Request object to crawl and then calls the function provided by user as the handlePageFunction option. New tasks are only started if there is enough free CPU and memory available, using the AutoscaledPool class internally.

    Basic usage:

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            // This function is called to extract data from a single web page
            // 'page' is an instance of Puppeteer.Page with page.goto(request.url) already called
            // 'request' is an instance of Request class with information about the page to load
            await Apify.pushData({
                title: await page.title(),
                url: request.url,
                succeeded: true,
            })
        },
        handleFailedRequestFunction: async ({ request }) => {
            // This function is called when the crawling of a request failed too many times
            await Apify.pushData({
                url: request.url,
                succeeded: false,
                errors: request.errorMessages,
            })
        },
    });
    
    await crawler.run();

    Constructor

    new PuppeteerCrawler(options)

    Parameters:
    • options.requestList ( RequestList ) <optional> - List of the requests to be processed. See the requestList parameter of BasicCrawler for more details.
    • options.requestQueue ( RequestQueue ) <optional> - Queue of the requests to be processed. See the requestQueue parameter of BasicCrawler for more details.
    • options.handlePageFunction ( function ) <optional> - Function that is called to process each request. It is passed an object with the following fields: request is an instance of the Request object with details about the URL to open, HTTP method etc. page is an instance of the Puppeteer.Page class with page.goto(request.url) already called.
    • options.pageOpsTimeoutMillis ( Number ) <optional> - Timeout in which the function passed as options.handlePageFunction needs to finish. Defaults to 300000.
    • options.gotoFunction ( function ) <optional> - Overrides the function that opens the request in Puppeteer. This function should return a result of page.goto(), i.e. the Puppeteer's Response object. Note that one page is only used to process one request, and it is closed afterwards. Defaults to ({ request, page }) => page.goto(request.url).
    • options.handleFailedRequestFunction ( function ) <optional> - Function to handle requests that failed more than option.maxRequestRetries times. See the handleFailedRequestFunction parameter of Apify.BasicCrawler for details. Defaults to ({ request }) => log.error('Request failed', _.pick(request, 'url', 'uniqueKey')).
    • options.maxRequestRetries ( Number ) <optional> - Indicates how many times each request is retried if handleRequestFunction failed. See maxRequestRetries parameter of BasicCrawler. Defaults to 3.
    • options.maxRequestsPerCrawl ( Number ) <optional> - Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value. See maxRequestsPerCrawl parameter of BasicCrawler.
    • options.maxMemoryMbytes ( Number ) <optional> - Maximum memory available for crawling. See maxMemoryMbytes parameter of AutoscaledPool.
    • options.maxConcurrency ( Number ) <optional> - Maximum concurrency of request processing. See maxConcurrency parameter of AutoscaledPool. Defaults to 1000.
    • options.minConcurrency ( Number ) <optional> - Minimum concurrency of requests processing. See minConcurrency parameter of AutoscaledPool. Defaults to 1.
    • options.minFreeMemoryRatio ( Number ) <optional> - Minimum ratio of free memory kept in the system. See minFreeMemoryRatio parameter of AutoscaledPool. Defaults to 0.2.
    • options.isFinishedFunction ( function ) <optional> - By default, PuppeteerCrawler finishes when all the requests have been processed. You can override this behaviour by providing a custom isFinishedFunction. This function is called every time there are no requests being processed. If it resolves to true then the crawler's run finishes. See the isFinishedFunction parameter of AutoscaledPool.
    • options.maxOpenPagesPerInstance ( Number ) <optional> - Maximum number of opened tabs per browser. If this limit is reached then a new browser instance is started. See maxOpenPagesPerInstance parameter of PuppeteerPool. Defaults to 50.
    • options.retireInstanceAfterRequestCount ( Number ) <optional> - Maximum number of requests that can be processed by a single browser instance. After the limit is reached the browser will be retired and new requests will be handled by a new browser instance. See retireInstanceAfterRequestCount parameter of PuppeteerPool. Defaults to 100.
    • options.instanceKillerIntervalMillis ( Number ) <optional> - How often the launched Puppeteer instances are checked whether they can be closed. See instanceKillerIntervalMillis parameter of PuppeteerPool. Defaults to 60000.
    • options.killInstanceAfterMillis ( Number ) <optional> - If Puppeteer instance reaches the options.retireInstanceAfterRequestCount limit then it is considered retired and no more tabs will be opened. After the last tab is closed the whole browser is closed too. This parameter defines a time limit for inactivity after which the browser is closed even if there are pending tabs. See killInstanceAfterMillis parameter of PuppeteerPool. Defaults to 300000.
    • options.puppeteerConfig ( Object ) <optional> - Default options for each new Puppeteer instance. See puppeteerConfig parameter of PuppeteerPool. Defaults to { dumpio: process.env.NODE_ENV !== 'production', slowMo: 0, args: []}.
    • options.launchPuppeteerFunction ( function ) <optional> - Overrides the default function to launch a new Puppeteer instance. See launchPuppeteerFunction parameter of PuppeteerPool. Defaults to launchPuppeteerOptions => Apify.launchPuppeteer(launchPuppeteerOptions).
    • options.launchPuppeteerOptions ( LaunchPuppeteerOptions ) <optional> - Options used by Apify.launchPuppeteer() to start new Puppeteer instances. See launchPuppeteerOptions parameter of PuppeteerPool.
    • options.pageCloseTimeoutMillis ( Number ) <optional> - Timeout for page.close() in milliseconds. Defaults to 30000.

    Methods (1)

    run() → {Promise}

    Runs the crawler. Returns a promise that gets resolved once all the requests are processed.

    Returns:
  • ( Promise )

    PuppeteerPool

    Manages a pool of Chrome browser instances controlled by Puppeteer. PuppeteerPool rotates Chrome instances to change proxies and other settings, in order to prevent detection of your web scraping bot, access web pages from various countries etc.

    Example usage:

    const puppeteerPool = new Apify.PuppeteerPool({
      launchPuppeteerFunction: () => {
        // Use a new proxy with a new IP address for each new Chrome instance
        return Apify.launchPuppeteer({
           apifyProxySession: Math.random(),
        });
      },
    });
    
    const page1 = await puppeteerPool.newPage();
    const page2 = await puppeteerPool.newPage();
    const page3 = await puppeteerPool.newPage();
    
    // ... do something with pages ...
    
    // Close all browsers.
    await puppeteerPool.destroy();

    Constructor

    new PuppeteerPool(options)

    Parameters:
    • options.maxOpenPagesPerInstance ( Number ) <optional> - Maximum number of open pages (i.e. tabs) per browser. When this limit is reached, new pages are loaded in a new browser instance. Defaults to 50.
    • options.retireInstanceAfterRequestCount ( Number ) <optional> - Maximum number of requests that can be processed by a single browser instance. After the limit is reached, the browser is retired and new requests are handled by a new browser instance. Defaults to 100.
    • options.instanceKillerIntervalMillis ( Number ) <optional> - Indicates how often opened Puppeteer instances are checked whether they can be closed. Defaults to 60000.
    • options.killInstanceAfterMillis ( Number ) <optional> - When Puppeteer instance reaches the options.retireInstanceAfterRequestCount limit then it is considered retired and no more tabs will be opened. After the last tab is closed the whole browser is closed too. This parameter defines a time limit for inactivity after which the browser is closed even if there are pending open tabs. Defaults to 300000.
    • options.launchPuppeteerFunction ( function ) <optional> - Overrides the default function to launch a new Puppeteer instance. Defaults to launchPuppeteerOptions => Apify.launchPuppeteer(launchPuppeteerOptions).
    • options.launchPuppeteerOptions ( LaunchPuppeteerOptions ) <optional> - Options used by Apify.launchPuppeteer() to start new Puppeteer instances.

    Methods (2)

    destroy()

    Closes all the browsers.

    newPage() → {Promise.<Puppeteer.Page>}

    Opens a new tab in one of the browsers and returns a promise that resolves to its Puppeteer.Page.

    Returns:
  • ( Promise.<Puppeteer.Page> )

    Request

    The Request class defines a web request to be processed and stores info about errors that occurred during its processing.

    Example use:

    const request = new Apify.Request({
        url: 'http://example.com',
        headers: { Accept: 'application/json' },
    });
    
    ...
    
    request.userData.foo = 'bar';
    request.pushErrorMessage(new Error('Request failed!'));
    
    ...
    
    const foo = request.userData.foo;

    Constructor

    new Request(opts)

    Parameters:
    • opts ( object )
      • url ( String ) -
      • uniqueKey ( String ) <optional> - Unique key identifying the request. If not provided, it is computed from the normalized URL.
      • method ( String ) <optional> - Defaults to 'GET'.
      • payload ( String | Buffer ) <optional> - Request payload. If method='GET' then the payload is not allowed.
      • retryCount ( Number ) <optional> - How many times the URL was retried in case of an exception. Defaults to 0.
      • errorMessages ( Array ) <optional> - Array of error messages from the request processing.
      • headers ( Object ) <optional> - HTTP headers. Defaults to {}.
      • userData ( Object ) <optional> - Custom data that the user can assign to the request. Defaults to {}.
      • keepUrlFragment ( Boolean ) <optional> - If false then the hash part is removed from the URL when computing uniqueKey. Defaults to false.
      • ignoreErrors ( Boolean ) <optional> - If set to true then errors in the processing of this request will be ignored (e.g. the request won't be retried in case of an error). Defaults to false.

    Methods (1)

    pushErrorMessage(errorOrMessage)

    Stores information about processing error of this request.

    Parameters:
    • errorOrMessage ( Error | String ) - Error object or error message to be stored in request.

    RequestList

    Provides a way to handle a list of URLs to be crawled. Each URL is represented by an instance of the Request class.

    RequestList has an internal state where it remembers handled requests, requests in progress and also reclaimed requests. The state might be persisted in a key-value store as shown in the example below so if an act is restarted (due to internal error or restart of the host machine) then the crawling can continue where it left off.

    Basic usage of RequestList:

    const requestList = new Apify.RequestList({
        sources: [
            // Separate requests
            { url: 'http://www.example.com/page-1', method: 'GET', headers: {} },
            { url: 'http://www.example.com/page-2', userData: { foo: 'bar' }},
            // Bulk load of URLs from file `http://www.example.com/my-url-list.txt`
            // Note that all URLs must start with http:// or https://
            { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } },
        ],
        persistStateKey: 'my-crawling-state'
    });
    
    await requestList.initialize(); // Load requests.
    
    // Get requests from list
    const request1 = await requestList.fetchNextRequest();
    const request2 = await requestList.fetchNextRequest();
    const request3 = await requestList.fetchNextRequest();
    
    // Mark some of them as handled
    await requestList.markRequestHandled(request1);
    
    // If processing fails then reclaim it back to the list
    await requestList.reclaimRequest(request2);

    Constructor

    new RequestList(options)

    Parameters:
    • options ( Object )
      • sources ( Array ) - An array of sources of requests. Each element is either a plain object defining a single request, or an object with a requestsFromUrl property pointing to a remote file containing a list of URLs to be loaded in bulk. For example:
        [
            // One URL
            { method: 'GET', url: 'http://example.com/a/b' },
            // Batch import of URLs from a file hosted on the web
            { method: 'POST', requestsFromUrl: 'http://example.com/urls.txt' },
        ]
      • persistStateKey ( String ) <optional> - Key-value store key under which the RequestList persists its state. If this is set then the RequestList persists its state in regular intervals and loads the state from there in case the act is restarted due to an error or a migration to another worker machine.
      • state ( Object ) <optional> - The state object that the RequestList will be initialized from. It is in the form returned by requestList.getState(), such as follows:
        {
            nextIndex: 5,
            nextUniqueKey: 'unique-key-5',
            inProgress: {
                'unique-key-1': true,
                'unique-key-4': true,
            },
        }

        Note that the preferred (and simpler) way to persist the state of crawling of the RequestList is to use the persistStateKey parameter instead.
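
    For illustration, a hedged sketch of handling the state manually via getState() and the key-value store helpers Apify.getValue()/Apify.setValue() (persistStateKey remains the simpler option):

    // Restore a previously persisted state (null on the first run).
    const state = await Apify.getValue('my-crawling-state');
    
    const requestList = new Apify.RequestList({
        sources: [{ url: 'http://www.example.com/page-1' }],
        state,
    });
    await requestList.initialize();
    
    // Later, persist the current state manually.
    await Apify.setValue('my-crawling-state', requestList.getState());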

    Methods (8)

    fetchNextRequest() → {Promise.<Request>}

    Returns the next request, which is a previously reclaimed one if available, otherwise the next upcoming request.

    Returns:
  • ( Promise.<Request> )

    getState()

    Returns an object representing the state of the RequestList instance. Do not alter the resulting object!

    Returns:
  • ( Object )

    initialize() → {Promise}

    Loads all sources specified.

    Returns:
  • ( Promise )

    isEmpty() → {Promise.<boolean>}

    Returns true if the next call to fetchNextRequest() will return null, otherwise it returns false. Note that even if the list is empty, there might be some pending requests currently being processed.

    Returns:
  • ( Promise.<boolean> )

    isFinished() → {Promise.<boolean>}

    Returns true if all requests were already handled and there are no more left.

    Returns:
  • ( Promise.<boolean> )

    length()

    Returns the total number of unique requests present in the RequestList.

    markRequestHandled(request) → {Promise}

    Marks the request as handled after successful processing.

    Parameters:
    • request ( Request )
    Returns:
  • ( Promise )

    reclaimRequest(request) → {Promise}

    Reclaims the request back to the list if its processing failed. The request will then become available in a subsequent call to fetchNextRequest().

    Parameters:
    • request ( Request )
    Returns:
  • ( Promise )
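
    To tie these methods together, a hedged sketch of a typical processing loop (the error handling and sleep interval are illustrative):

    while (!(await requestList.isFinished())) {
        const request = await requestList.fetchNextRequest();
    
        if (!request) {
            // The list is empty, but some requests may still be in progress.
            await Apify.utils.sleep(500);
            continue;
        }
    
        try {
            // ... process request.url here ...
            await requestList.markRequestHandled(request);
        } catch (err) {
            request.pushErrorMessage(err);
            await requestList.reclaimRequest(request);
        }
    }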

    RequestQueue

    Provides a simple interface to the Apify Request Queue storage, which is used to manage a dynamic queue of web pages to crawl.

    You should not instantiate this class directly, but use the Apify.openRequestQueue() function.

    Example usage:

    // Opens default request queue of the run.
    const queue = await Apify.openRequestQueue();
    
    // Opens request queue called 'some-name'.
    const queueWithName = await Apify.openRequestQueue('some-name');
    
    // Enqueue a few requests
    await queue.addRequest(new Apify.Request({ url: 'http://example.com/aaa' }));
    await queue.addRequest(new Apify.Request({ url: 'http://example.com/bbb' }));
    await queue.addRequest(new Apify.Request({ url: 'http://example.com/foo/bar' }), { forefront: true });
    
    // Get requests from the queue
    const request1 = await queue.fetchNextRequest();
    const request2 = await queue.fetchNextRequest();
    const request3 = await queue.fetchNextRequest();
    
    // Mark some of them as handled
    await queue.markRequestHandled(request1);
    
    // If processing fails then reclaim the request back to the queue
    await queue.reclaimRequest(request2);

    Constructor

    new RequestQueue(queueId)

    Parameters:
    • queueId ( String ) - ID of the request queue.

    Methods (8)

    addRequest(request, opts) → {RequestOperationInfo}

    Adds a request to the queue.

    Parameters:
    • request ( Request ) - Request object
    • opts ( Object ) <optional>
      • forefront ( Boolean ) <optional> - If true, the request will be added to the foremost position in the queue.
    Returns:
  • ( RequestOperationInfo )
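
    The returned RequestOperationInfo (see Type Definitions below) can be used to detect duplicates; a brief hedged sketch:

    const info = await queue.addRequest(new Apify.Request({ url: 'http://example.com/aaa' }));
    
    if (info.wasAlreadyPresent) {
        console.log('This URL was already enqueued before; the queue deduplicated it.');
    }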

    delete() → {Promise}

    Deletes the queue.

    Returns:
  • ( Promise )

    fetchNextRequest() → {Request}

    Returns the next upcoming request.

    Returns:
  • ( Request )

    getRequest(requestId) → {Request}

    Gets a request from the queue.

    Parameters:
    • requestId ( String ) - Request ID
    Returns:
  • ( Request )

    isEmpty() → {boolean}

    Returns true if the next call to fetchNextRequest() will return null, otherwise it returns false. Note that even if the queue is empty, there might be some pending requests currently being processed.

    The function might occasionally return a false negative, but it should never return a false positive!

    Returns:
  • ( boolean )

    isFinished() → {boolean}

    Returns true if all requests were already handled and there are no more left. The function might occasionally return a false negative, but it should never return a false positive!

    Returns:
  • ( boolean )

    markRequestHandled(request) → {RequestOperationInfo}

    Marks the request as handled after successful processing.

    Parameters:
    • request ( Request )
    Returns:
  • ( RequestOperationInfo )

    reclaimRequest(request, opts) → {RequestOperationInfo}

    Reclaims a request after unsuccessful processing. The request is returned to the queue.

    Parameters:
    • request ( Request )
    • opts ( Object ) <optional>
      • forefront ( Boolean ) <optional> - If true then the request gets returned to the beginning of the queue; otherwise it goes to the back. Defaults to false.
    Returns:
  • ( RequestOperationInfo )

    SettingsRotator

    Rotates settings created by a user-provided function passed via newSettingsFunction. This is useful during web crawling to dynamically change settings and thus avoid detection of the crawler.

    This class is still work in progress, more features will be added soon.

    Constructor

    new SettingsRotator(options)

    Parameters:
    • options ( Object )
      • newSettingsFunction ( function ) - Function that creates a new settings object.
      • maxUsages ( Number ) - Maximum number of times a single settings object may be used before it is rotated out.

    Methods (2)

    fetchSettings() → {*}

    Fetches a settings object.

    Returns:
  • ( * )

    reclaimSettings(settings)

    Reclaims settings after use.

    Parameters:
    • settings ( * )
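
    As an illustrative, hedged sketch, rotating User-Agent settings (the settings shape and values are examples only):

    const rotator = new Apify.SettingsRotator({
        newSettingsFunction: () => ({
            userAgent: `MyCrawler/1.0 (session ${Date.now()})`, // illustrative value
        }),
        maxUsages: 20,
    });
    
    // Fetch a settings object, use it, then reclaim it.
    const settings = rotator.fetchSettings();
    // ... use settings.userAgent for a request ...
    rotator.reclaimSettings(settings);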

    utils

    A namespace that contains various utilities.

    Example usage:

    const Apify = require('apify');
    
    ...
    
    // Sleep 1.5 seconds
    await Apify.utils.sleep(1500);

    Methods (1)

    sleep(millis) → {Promise}

    Returns a promise that resolves after a specific period of time. This is useful to implement waiting in your code, e.g. to prevent overloading of target website or to avoid bot detection.

    Example usage:

    const Apify = require('apify');
    
    ...
    
    // Sleep 1.5 seconds
    await Apify.utils.sleep(1500);
    Parameters:
    • millis ( Number ) - Period of time to sleep, in milliseconds. If not a positive number, the returned promise resolves immediately.
    Returns:
  • ( Promise )

    utils.puppeteer

    A namespace that contains various Puppeteer utilities.

    Example usage:

    const Apify = require('apify');
    
    // Open https://www.example.com in Puppeteer
    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();
    await page.goto('https://www.example.com');
    
    // Inject jQuery into a page
    await Apify.utils.puppeteer.injectJQuery(page);

    Methods (4)

    hideWebDriver(page) → {Promise}

    Hides certain Puppeteer fingerprints from the page, in order to help avoid detection of the crawler. The function should be called on a newly-created page object before navigating to the target crawled page.

    Parameters:
    • page ( Page ) - Puppeteer Page object.
    Returns:
  • ( Promise )
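
    For example, a hedged sketch of the intended call order (the URL is illustrative):

    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();
    
    // Hide the fingerprints before navigating to the target page.
    await Apify.utils.puppeteer.hideWebDriver(page);
    await page.goto('http://www.example.com');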

    injectJQuery(page) → {Promise}

    Injects jQuery library into a Puppeteer page. jQuery is often useful for various web scraping and crawling tasks, e.g. to extract data from HTML elements using CSS selectors.

    Beware that the injected jQuery object will be set to the window.$ variable and thus it might cause conflicts with libraries included by the page that use the same variable (e.g. another version of jQuery).

    Parameters:
    • page ( Page ) - Puppeteer Page object.
    Returns:
  • ( Promise )
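
    A hedged sketch of using the injected jQuery from page.evaluate() (the selector is illustrative):

    await Apify.utils.puppeteer.injectJQuery(page);
    
    const headline = await page.evaluate(() => {
        // After injection, jQuery is available as window.$.
        return $('h1').text();
    });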

    injectUnderscore(page) → {Promise}

    Injects Underscore.js library into a Puppeteer page. Beware that the injected Underscore object will be set to the window._ variable and thus it might cause conflicts with libraries included by the page that use the same variable.

    Parameters:
    • page ( Page ) - Puppeteer Page object.
    Returns:
  • ( Promise )

    Globals

    Members (1)

    (constant) createTimeoutPromise

    Creates a promise that gets rejected with a given error after a given time.
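
    For context, a minimal sketch of the general pattern such a helper implements (illustrative code only, not the SDK's actual implementation or signature):

    // Illustrative only: reject with the given error after millis milliseconds.
    const createTimeoutPromise = (millis, error) => new Promise((resolve, reject) => {
        setTimeout(() => reject(error), millis);
    });
    
    // Typical use with Promise.race() to time-limit an operation.
    await Promise.race([
        Apify.utils.sleep(10000),
        createTimeoutPromise(5000, new Error('Operation timed out!')),
    ]);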

    Methods (2)

    injectFile(page, filePath) → {Promise}

    Injects a JavaScript file into a Puppeteer page. Unlike Puppeteer's addScriptTag function, this function works on pages with arbitrary Cross-Origin Resource Sharing (CORS) policies.

    Parameters:
    • page ( Page ) - Puppeteer Page object.
    • filePath ( String ) - File path to the JavaScript file to inject.
    Returns:
  • ( Promise )
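
    For example, a hedged one-liner (the file path is hypothetical):

    // Inject a local script into the page; './my-script.js' is a hypothetical path.
    await Apify.utils.puppeteer.injectFile(page, './my-script.js');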

    parsePurl()

    Parses a PURL (pseudo-URL) into a regular expression string.

    Type Definitions

    ApifyCallError

    Properties:
    • message ( String ) - Error message. Defaults to the literal string "Apify.call() wasn't succeed.".
    • name ( String ) - Defaults to APIFY_CALL_ERROR.
    • run ( Object ) - Object describing the failed run.

    LaunchPuppeteerOptions

    Properties:
    • opts.proxyUrl ( String ) - URL to a HTTP proxy server. It must define the port number, and it might also contain proxy username and password. For example: http://bob:[email protected]:1234.
    • opts.userAgent ( String ) - HTTP User-Agent header used by the browser. If not provided, the function sets User-Agent to a reasonable default to reduce the chance of detection of the crawler.
    • opts.useChrome ( Boolean ) - If true and opts.executablePath is not set, Puppeteer will launch full Google Chrome browser available on the machine rather than the bundled Chromium. The path to Chrome executable is taken from the APIFY_CHROME_EXECUTABLE_PATH environment variable if provided, or defaults to the typical Google Chrome executable location specific for the operating system. By default, this option is false. Defaults to false.
    • opts.useApifyProxy ( Boolean ) - If set to true, Puppeteer will be configured to use Apify Proxy for all connections. For more information, see the documentation. Defaults to false.
    • opts.apifyProxyGroups ( Array.<String> ) - An array of proxy groups to be used by the Apify Proxy. Only applied if the useApifyProxy option is true.
    • opts.apifyProxySession ( String ) - Apify Proxy session identifier to be used by all the Chrome browsers. All HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier can only contain the following characters: 0-9, a-z, A-Z, ".", "_" and "~". Only applied if the useApifyProxy option is true.
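
    A hedged example of passing these options to Apify.launchPuppeteer() (the proxy URL is illustrative, matching the format above):

    const browser = await Apify.launchPuppeteer({
        proxyUrl: 'http://bob:[email protected]:1234',
        useChrome: true,
    });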

    PaginationList

    Properties:
    • items ( Array ) - List of returned objects
    • total ( Number ) - Total number of objects
    • offset ( Number ) - Number of Request objects that were skipped at the start.
    • count ( Number ) - Number of returned objects
    • limit ( Number ) - Requested limit

    RequestOperationInfo

    Properties:
    • wasAlreadyPresent ( Boolean ) - Indicates whether the request was already present in the queue.
    • wasAlreadyHandled ( Boolean ) - Indicates whether the request was already marked as handled.
    • requestId ( String ) - The ID of the added request.