Apify Actors

Actors run on the Apify serverless computing platform and enable the execution of arbitrary pieces of code. Unlike traditional serverless platforms, the run of an actor is not limited to the lifetime of a single HTTP transaction. It can run for as long as necessary, even forever. The actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset.

A single isolated actor consists of source code and various settings. You can think of an actor as a cloud app or service.

Quick start

Go to the Actor section in the app, create a new actor and go to the Source tab. Paste the following Node.js code into the Source code editor:

const Apify = require('apify');

Apify.main(async () => {
   console.log('Hello world from actor!');
});

Click Quick run to build and run your actor. After the run is finished you should see something like:

Congratulations, you have successfully created and run your first actor!

Let's try something a little more complicated. We will change the actor to accept input and generate output (see Input and output for more details):

const Apify = require('apify');

Apify.main(async () => {
    // Get input and print it
    const input = await Apify.getValue('INPUT');
    console.log('My input:');
    console.dir(input);

    // Save output
    const output = { message: 'Hello world!' };
    await Apify.setValue('OUTPUT', output);
});

Save your actor by clicking Save and then rebuild it by clicking Build. After the build is finished, go to Console and set Input to:

{ "hello": 123 }

Then set Content type to application/json; charset=utf-8 and click Run. You will see something like:

Excellent, you have just created your first actor to accept input and store output! Now you can start adding some magic.

Note that the above actor is also available in the library as apify/hello-world. It uses the apify NPM package, which provides various helper functions to simplify the development of actors. For example, the Apify.main() function invokes a user function, waits for it to finish, logs exception details, etc. Note that the apify package is optional and actors do not need to use it at all.

For more complicated actors, you'll probably prefer to host the source code on Git. To do that, follow these steps:

  1. Create a new Git repository
  2. Copy the boilerplate actor code from the apify/quick-start actor
  3. Set Source type to Git repository for your actor in the app
  4. Paste the Git repo link to Git URL, save changes and build your actor.
  5. That's it, now you can develop your actor locally on your computer and run it in the Apify cloud!

For more information, go to the Git repository section.

Source code

The Source type setting determines the location of the source code for the actor. It can have one of the following values: Hosted source, Git repository, Zip file or GitHub Gist.

Hosted source

The source code of the actor can be hosted directly on Apify. All the code needs to be in a single file and written in JavaScript / Node.js. The version of Node.js is determined by the Base image setting - see Base images for the description of possible options.

The hosted source is especially useful for simple actors. The source code can require arbitrary NPM packages. For example:

const _ = require('underscore');
const request = require('request');

During the build process, the source code is scanned for occurrences of the require() function and the corresponding NPM dependencies are automatically added to the package.json file by running:

npm install underscore request --save --only=prod --no-optional
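The scanning step can be pictured with the rough sketch below. It is not the platform's actual implementation, only an illustration of the idea; the helper names findRequiredPackages and buildInstallCommand and the regular expression are assumptions made for this example.

```javascript
// Node.js built-in modules never need to be installed from NPM.
const BUILTIN_MODULES = new Set(require('module').builtinModules);

// Hypothetical helper: collect NPM package names from require() calls in a source string.
function findRequiredPackages(sourceCode) {
    const pattern = /require\(\s*['"]([^'"]+)['"]\s*\)/g;
    const packages = new Set();
    let match;
    while ((match = pattern.exec(sourceCode)) !== null) {
        const name = match[1];
        // Skip relative or absolute paths and built-in modules.
        const isPath = name.startsWith('.') || name.startsWith('/');
        if (isPath || name.startsWith('node:') || BUILTIN_MODULES.has(name)) continue;
        // Keep only the package part of names such as 'lodash/fp' or '@scope/pkg/lib'.
        const parts = name.split('/');
        packages.add(name.startsWith('@') ? parts.slice(0, 2).join('/') : parts[0]);
    }
    return [...packages];
}

// Hypothetical helper: format the install command shown above.
function buildInstallCommand(packages) {
    return `npm install ${packages.join(' ')} --save --only=prod --no-optional`;
}

const source = `
const _ = require('underscore');
const request = require('request');
const fs = require('fs');
`;
console.log(buildInstallCommand(findRequiredPackages(source)));
```

A simple pattern like this only sees require() calls with a literal string argument, which is one reason to prefer a Git repository with an explicit package.json for larger projects.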

Note that certain NPM packages need additional tools for their installation, such as a C compiler or Python interpreter. If these tools are not available in the base Docker image, the build will fail. If that happens, try changing the base image to Node.js 8 + Puppeteer on Debian, because it contains many more tools than the other images. Alternatively, the source code can be hosted in a Git repository, where it is possible to specify a custom Dockerfile with arbitrary dependencies.

Git repository

If the source code of the actor is hosted externally in a Git repository, it can consist of multiple files and directories, use its own Dockerfile to control the build process (see Custom Dockerfile for details) and have its library description fetched from the README.md file. The location of the repository is specified by the Git URL setting, which can be an https, git or ssh URL.

To help you get started quickly, you can use the apify/quick-start actor which contains all the boilerplate necessary when creating a new actor hosted on Git. The source code is available on GitHub.

To specify a Git branch or tag to check out, add a URL fragment to the URL. For example, to check out the develop branch, specify a URL such as https://github.com/jancurn/act-analyse-pages.git#develop

Optionally, the second part of the fragment in the Git URL (separated by a colon) specifies the context directory for the Docker build. For example, https://github.com/jancurn/act-analyse-pages.git#develop:some/dir will check out the develop branch and set some/dir as a context directory for the Docker build.
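To make the structure of such a URL concrete, here is a small sketch that splits it into the repository URL, the branch or tag, and the context directory. The function name parseGitUrl and the shape of the returned object are assumptions for illustration; the platform's own parsing may differ in edge cases.

```javascript
// Hypothetical helper: split a Git URL setting into its three parts.
function parseGitUrl(gitUrl) {
    const hashIndex = gitUrl.indexOf('#');
    if (hashIndex === -1) {
        return { repoUrl: gitUrl, branchOrTag: null, contextDir: null };
    }

    const repoUrl = gitUrl.slice(0, hashIndex);
    const fragment = gitUrl.slice(hashIndex + 1);

    // The optional second part of the fragment is separated by a colon.
    const colonIndex = fragment.indexOf(':');
    const branchOrTag = colonIndex === -1 ? fragment : fragment.slice(0, colonIndex);
    const contextDir = colonIndex === -1 ? '' : fragment.slice(colonIndex + 1);

    return {
        repoUrl,
        branchOrTag: branchOrTag || null,
        contextDir: contextDir || null,
    };
}

console.log(parseGitUrl('https://github.com/jancurn/act-analyse-pages.git#develop:some/dir'));
```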

Note that you can easily set up an integration where the actor is automatically rebuilt on every commit to the Git repository. For more details, see GitHub integration.

Zip file

The source code for the actor can also be located in a Zip archive hosted on an external URL. This option enables integration with arbitrary source code or continuous integration systems. As with the Git repository option, the source code can consist of multiple files and directories, it can contain a custom Dockerfile and the actor description is taken from README.md.

GitHub Gist

Sometimes having a full Git repository or a hosted Zip file might be overly complicated for your small project, but you still want to have the source code in multiple files. In this case, you can simply put your source code into a GitHub Gist. For example:

https://gist.github.com/jancurn/2dbe83fea77c439b1119fb3f118513e7

Then set the Source Type to GitHub Gist and paste the Gist URL as follows:

Note that the example actor is available in the library as apify/example-act-in-gist.

As with the Git repository option, the source code can consist of multiple files and directories, it can contain a custom Dockerfile and the actor description is taken from README.md.

Custom Dockerfile

Internally, Apify uses Docker to build and run actors. To control the build of the actor, you can create a custom Dockerfile in the root of the Git repository or Zip directory. Note that this option is not available for the Hosted source option. If the Dockerfile is missing, the system uses the following default:

FROM apify/actor-node-basic
ENV NODE_ENV=production
COPY . ./
RUN npm install --production --no-optional

For more information about Dockerfile syntax and commands, see the Dockerfile reference.

Note that apify/actor-node-basic is a base Docker image provided by Apify. There are other base images with other features available. You can also use arbitrary Docker images as the base for your actors, although using the Apify images has some performance advantages. See Base images for details.

GitHub integration

If the source code of an actor is hosted in a Git repository, it is possible to set up an integration so that the actor is automatically rebuilt on every push to the Git repository. For that, you only need to set up a webhook in your Git source control system that will invoke the Build actor API endpoint on every push to the Git repository.

For example, for repositories on GitHub it can be done using the following steps. First, go to the actor detail page, open the API tab and copy the Build actor API endpoint URL. It should look something like this:

https://api.apify.com/v2/acts/apify~hello-world/builds?token=<API_TOKEN>&version=0.1

Then go to your GitHub repository, click Settings, select the Webhooks tab and click Add webhook. Paste the API URL into the Payload URL field as follows:

And that's it! Now your actor should automatically rebuild on every push to the GitHub repository.
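If you prefer to assemble the webhook URL in a script rather than copying it from the API tab, the sketch below shows one way to do it. It follows the URL format shown above; the function name buildActorBuildUrl is hypothetical, and the replacement of the slash in the actor name with a tilde is inferred from the example URL rather than taken from the API reference.

```javascript
// Hypothetical helper: compose the Build actor API endpoint URL.
function buildActorBuildUrl(actorName, apiToken, version) {
    // 'apify/hello-world' appears as 'apify~hello-world' in the example URL.
    const actorId = actorName.replace('/', '~');
    const url = new URL(`https://api.apify.com/v2/acts/${actorId}/builds`);
    url.searchParams.set('token', apiToken);
    url.searchParams.set('version', version);
    return url.toString();
}

// '<API_TOKEN>' is a placeholder; the token is a secret and should not be committed to the repository.
console.log(buildActorBuildUrl('apify/hello-world', 'MY_API_TOKEN', '0.1'));
```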

Custom environment variables

The actor owner can specify custom environment variables that are set to the actor's process during the run. Sensitive environment variables such as passwords or API tokens can be protected by setting the Secret option. With this option enabled, the value of the environment variable is encrypted and it will not be visible in the app or APIs, and the value is redacted from actor logs to avoid the accidental leakage of sensitive data.

Note that the custom environment variables are fixed during the build of the actor and cannot be changed later. See the Build section for details.

To access environment variables in Node.js, use the process.env object, for example:

console.log(process.env.SMTP_HOST);

The actor runtime sets additional environment variables for the actor process during the run. See Environment variables for details.

Versioning

In order to enable active development, the actor can have multiple versions of the source code and associated settings, such as the Base image and Environment. Each version is denoted by a version number of the form MAJOR.MINOR; the version numbers should adhere to the Semantic Versioning logic.

For example, the actor can have a production version 1.1, a beta version 1.2 that contains new features but is still backwards compatible, and a development version 2.0 that contains breaking changes.

The versions of the actors are built and run separately. For details, see Build and Run.

Local development

It is possible to develop actors locally on your computer and then only deploy them to the Apify cloud when they are ready. This is especially useful if you're using Git integration. See Git repository for more details. The boilerplate for creating an actor in a Git repository is available on GitHub.

In order to test the input and output of your actors on your local machine, you can define the APIFY_DEV_KEY_VALUE_STORE_DIR environment variable, which will cause the apify NPM package to emulate the key-value store locally using files in a directory. For more details, please see the apify package documentation.

Unfortunately, not all features of the Apify platform can be emulated locally, therefore you might still need to let the apify NPM package use your API token in order to interact with the Apify platform. The simplest way to achieve that is by setting the APIFY_TOKEN environment variable on your local development machine.

Build

Before the actor can be run, it first needs to be built. The build effectively creates a snapshot of a specific version of the actor's settings such as the Source code and Environment variables, and creates a Docker image that contains everything the actor needs for its run, including necessary NPM packages, web browsers, etc.

Each build is assigned a unique build number of the form MAJOR.MINOR.BUILD (e.g. 1.2.345), where MAJOR.MINOR corresponds to the actor version number (see Versioning) and BUILD is an automatically incremented number starting at 1.
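As an illustration of the build number format, the following sketch splits a build number into the actor version and the build counter. The function name parseBuildNumber and the returned object are assumptions for this example only.

```javascript
// Hypothetical helper: parse a build number of the form MAJOR.MINOR.BUILD.
function parseBuildNumber(buildNumber) {
    const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(buildNumber);
    if (!match) throw new Error(`Invalid build number: ${buildNumber}`);

    const [, major, minor, build] = match;
    return {
        version: `${major}.${minor}`, // the actor version the build belongs to
        major: Number(major),
        minor: Number(minor),
        build: Number(build), // automatically incremented, starting at 1
    };
}

console.log(parseBuildNumber('1.2.345'));
```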

By default, the build has a timeout of 300 seconds and consumes 1024 MB of memory from the user's memory limit. See the Resource limits section for more details.

Tags

When running the actor, the caller needs to specify which actor build should actually be used. To simplify this process, the builds can be associated with a tag such as latest or beta, which can be used instead of the version number when running the actor. The tags are unique - only one build can be associated with a specific tag.

To set a tag for builds of a specific actor version, set the Build tag property. Whenever a new build of the version is successfully finished, it is automatically assigned the tag. By default, builds are assigned the latest tag.

Base images

Apify provides the following Docker images that can be used as a base for user actors:

All images come in two versions: the latest tag corresponds to the stable version and beta to images where we test new features. Use the beta version at your own risk.

Note that all Apify Docker images are pre-cached on Apify servers in order to speed up the actor builds and runs. The source code used to generate the images is available in the apify-actor-docker GitHub repository.

Cache

By default, the build process pulls the latest copies of all necessary Docker images and builds each new layer of the Docker image from scratch. To speed up the build process, the user can invoke the build using the Quick run option in the Source tab, or by passing the useCache parameter in the API. See API reference for more details.

Lifecycle

Each build starts with the initial status READY and goes through one or more transitional statuses to one of the terminal statuses.

Status       Type           Description
READY        initial        Started but not allocated to any worker yet
RUNNING      transitional   Executing on a worker
SUCCEEDED    terminal       Finished successfully
FAILED       terminal       Build failed
TIMING-OUT   transitional   Timing out now
TIMED-OUT    terminal       Timed out
ABORTING     transitional   Being aborted by user
ABORTED      terminal       Aborted by user
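Code that waits for a build often only needs to know whether a status is terminal. The sketch below encodes the table above as a lookup; the constant and function names are hypothetical and not part of the apify package.

```javascript
// Status types copied from the lifecycle table.
const STATUS_TYPES = {
    READY: 'initial',
    RUNNING: 'transitional',
    SUCCEEDED: 'terminal',
    FAILED: 'terminal',
    'TIMING-OUT': 'transitional',
    'TIMED-OUT': 'terminal',
    ABORTING: 'transitional',
    ABORTED: 'terminal',
};

// Hypothetical helper: returns true once no further status change is expected.
function isTerminalStatus(status) {
    const type = STATUS_TYPES[status];
    if (!type) throw new Error(`Unknown status: ${status}`);
    return type === 'terminal';
}

console.log(isTerminalStatus('RUNNING'));
console.log(isTerminalStatus('TIMED-OUT'));
```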

Run

The actor can be invoked in a number of ways. One option is to start the actor manually in Console in the app:

The following table describes the run settings:

Build          Tag or number of the build to run (e.g. latest or 1.2.34).
Timeout        Timeout for the actor run in seconds. Zero value means there is no timeout.
Memory         Amount of memory allocated for the actor run, in megabytes.
Input          Input data for the actor. The maximum length is 1M characters.
Content type   Indicates what kind of data is in the input (e.g. application/json).

The owner of the actor can specify default values for all the above settings in the Default run configuration section in the app. If the actor caller does not specify a particular setting, the default value is used.

The actor can also be invoked using the Apify API by sending an HTTP POST request to the Run actor API endpoint, such as:

https://api.apify.com/v2/acts/apify~hello-world/runs?token=<YOUR_API_TOKEN>

The actor's input and its content type can be passed as a payload of the POST request and additional options can be specified using URL query parameters. For more details, see the Run actor section in the API reference.
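The sketch below shows how such a request could be assembled without sending it. The function name createRunRequest is hypothetical, and the query parameter names used in the example (memory, timeout) are assumptions that mirror the run settings above; check the Run actor section of the API reference for the authoritative names.

```javascript
// Hypothetical helper: describe the POST request that starts an actor run.
function createRunRequest(actorName, apiToken, input, options = {}) {
    const actorId = actorName.replace('/', '~');
    const url = new URL(`https://api.apify.com/v2/acts/${actorId}/runs`);
    url.searchParams.set('token', apiToken);

    // Additional run options are passed as URL query parameters.
    for (const [name, value] of Object.entries(options)) {
        url.searchParams.set(name, String(value));
    }

    return {
        method: 'POST',
        url: url.toString(),
        // The content type tells the actor how to interpret the input.
        headers: { 'Content-Type': 'application/json; charset=utf-8' },
        body: JSON.stringify(input),
    };
}

console.log(createRunRequest('apify/hello-world', 'MY_API_TOKEN', { hello: 123 }, { memory: 1024, timeout: 60 }));
```

The returned object can be handed to any HTTP client of your choice.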

Actors can also be invoked programmatically from other actors using the call() function provided by the apify NPM package. For example:

const run = await Apify.call('apify/hello-world', { message: 'Hello!' });
console.dir(run.output);

The newly started actor runs under the same user account as the initial actor and therefore all resources consumed are charged to the same user account. This allows more complex actors to be built using simpler actors built and owned by other users.

Internally, the call() function takes the user's API token from the APIFY_TOKEN environment variable, then it invokes the Run actor API endpoint, waits for the actor to finish and reads its output using the Get record API endpoint.

Input and output

As demonstrated in the hello world example above, actors can accept input and generate output. Both input and output are stored in a key-value store that is created when the actor is started, under the INPUT and OUTPUT keys, respectively. Note that the actor can store other values under arbitrary keys, for example crawling results or screenshots of web pages.

The key-value store associated with the actor run can be conveniently accessed using the getValue() and setValue() functions provided by the apify NPM package. Internally, these functions read the ID of the key-value store from the APIFY_DEFAULT_KEY_VALUE_STORE_ID environment variable and then access the store using the Apify API. For more details about the key-value stores, go to the Storage section.
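For a rough idea of what accessing a record through the API looks like, the sketch below builds a record URL from the store ID. The URL pattern is the one printed by the Puppeteer example in the Examples section; the function name getRecordUrl and the fallback store ID are placeholders invented for this illustration.

```javascript
// Hypothetical helper: compose the API URL of a record in a key-value store.
function getRecordUrl(storeId, key) {
    const base = 'https://api.apify.com/v2/key-value-stores';
    return `${base}/${encodeURIComponent(storeId)}/records/${encodeURIComponent(key)}`;
}

// Inside an actor run the store ID comes from the environment;
// 'MY_STORE_ID' is only a placeholder for running the sketch locally.
const defaultStoreId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID || 'MY_STORE_ID';
console.log(getRecordUrl(defaultStoreId, 'INPUT'));
console.log(getRecordUrl(defaultStoreId, 'OUTPUT'));
```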

The input can be passed to the actor either manually in the Console or using a POST payload when running the actor using the API. See the Run section for details.

Environment variables

Aside from custom environment variables, the actor's process has several environment variables set to provide it with context:

APIFY_ACT_ID                       ID of the actor.
APIFY_ACT_RUN_ID                   ID of the actor run.
APIFY_ACTOR_EVENTS_WS_URL          WebSocket URL where the actor may listen for events from the Actor platform. See documentation for more information.
APIFY_DEFAULT_DATASET_ID           ID of the dataset where you can push the data.
APIFY_DEFAULT_KEY_VALUE_STORE_ID   ID of the key-value store where the actor's input and output data is stored.
APIFY_DEFAULT_REQUEST_QUEUE_ID     ID of the request queue that stores and handles requests that you enqueue.
APIFY_HEADLESS                     If set to 1, the web browsers inside the actor should run in the headless mode because there is no windowing system available.
APIFY_IS_AT_HOME                   Set to 1 if the actor is running on Apify servers.
APIFY_MEMORY_MBYTES                Indicates the size of memory allocated for the actor run, in megabytes. It can be used by actors to optimize their memory usage.
APIFY_PROXY_PASSWORD               The Apify Proxy password of the user who started the actor.
APIFY_STARTED_AT                   Date when the actor was started.
APIFY_TIMEOUT_AT                   Date when the actor will time out.
APIFY_TOKEN                        The API token of the user who started the actor.
APIFY_USER_ID                      ID of the user who started the actor. Note that it might be different from the owner of the actor.
APIFY_CONTAINER_PORT               TCP port on which the actor can start an HTTP server to receive messages from the outside world. See Container web server section for more details.
APIFY_CONTAINER_URL                A unique public URL under which the actor run web server is accessible from the outside world. See Container web server section for more details.

Dates are always in the UTC timezone and are represented in simplified extended ISO format (ISO 8601), e.g. 2017-10-13T14:23:37.281Z

To access environment variables in Node.js, use the process.env object, for example:

console.log(process.env.APIFY_USER_ID);
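Because all environment variables are strings, it can be handy to convert them to proper types in one place. The following is a minimal sketch under that assumption; the function name readActorEnv and the property names of the returned object are made up for this example and are not part of the apify package.

```javascript
// Hypothetical helper: read selected APIFY_* variables and convert them to typed values.
function readActorEnv(env = process.env) {
    // Dates are ISO 8601 strings in UTC, which the Date constructor can parse.
    const toDate = (value) => (value ? new Date(value) : null);

    return {
        actorId: env.APIFY_ACT_ID || null,
        runId: env.APIFY_ACT_RUN_ID || null,
        userId: env.APIFY_USER_ID || null,
        isAtHome: env.APIFY_IS_AT_HOME === '1',
        headless: env.APIFY_HEADLESS === '1',
        memoryMbytes: env.APIFY_MEMORY_MBYTES ? parseInt(env.APIFY_MEMORY_MBYTES, 10) : null,
        startedAt: toDate(env.APIFY_STARTED_AT),
        timeoutAt: toDate(env.APIFY_TIMEOUT_AT),
    };
}

// Outside the platform most values are simply null or false.
console.log(readActorEnv());
```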

Resource limits

Actors run inside a Docker container whose resources are limited. When invoking the actor, the caller has to specify the amount of memory allocated for the actor. Additionally, each user has a certain total limit of memory for running actors. The sum of memory allocated for all running actors and builds needs to fit into this limit, otherwise the user cannot start a new actor. For more details, see Limits.

The share of CPU is computed automatically from the memory as follows: for each 4096 MB of memory, the actor gets 1 full CPU core. For other amounts of memory the number of CPU cores is computed fractionally. For example, an actor with 1024 MB of memory will get 1/4 of a CPU core. Note that CPU throttling is only applied if the system is under load; if there is free CPU capacity, the actors are not throttled as long as the Use spare CPU capacity setting is enabled.

The actor has hard disk space limited by twice the amount of memory. For example, an actor with 1024 MB of memory will have 2048 MB of disk available.
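The two rules above amount to simple arithmetic, sketched below. The function name computeResources is hypothetical; the constants come directly from the text.

```javascript
// Hypothetical helper: derive the CPU share and disk space from the run memory.
function computeResources(memoryMbytes) {
    return {
        cpuCores: memoryMbytes / 4096, // 1 full CPU core per 4096 MB of memory
        diskMbytes: memoryMbytes * 2, // disk space is twice the amount of memory
    };
}

console.log(computeResources(1024)); // 1/4 of a core and 2048 MB of disk
console.log(computeResources(4096)); // 1 full core and 8192 MB of disk
```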

State persistence

Unlike traditional serverless platforms, actors have no limits on the duration of an actor run. However, that means that an actor might need to be restarted from time to time, e.g. when the server it's running on is to be shut down. Actors need to account for this possibility. For short-running actors, the chance of a restart is quite low and the cost of repeated runs is low, so restarts can be ignored. However, for long-running actors a restart might be very costly and therefore such actors should periodically persist their state, possibly to the key-value store associated with the actor run. On start, actors should first check whether there is some state stored and if so they should continue where they left off.
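The pattern can be sketched as a loop that checkpoints its position. To keep the example runnable anywhere, it uses a synchronous in-memory stand-in for the key-value store; in a real actor the reads and writes would be asynchronous calls to the store associated with the run. All names here (createMemoryStore, processAll, the STATE key) are assumptions for illustration.

```javascript
// In-memory stand-in for a key-value store (an assumption for this sketch).
function createMemoryStore() {
    const data = new Map();
    return {
        getValue: (key) => (data.has(key) ? data.get(key) : null),
        setValue: (key, value) => { data.set(key, value); },
    };
}

// Process items in order, persisting the position every few items.
function processAll(items, store, handleItem, checkpointEvery = 10) {
    // Resume from the persisted position, or start from the beginning.
    const state = store.getValue('STATE') || { nextIndex: 0 };

    for (let i = state.nextIndex; i < items.length; i++) {
        handleItem(items[i]);
        state.nextIndex = i + 1;

        // Persist progress periodically and once more at the very end.
        const isLast = state.nextIndex === items.length;
        if (isLast || state.nextIndex % checkpointEvery === 0) {
            store.setValue('STATE', { nextIndex: state.nextIndex });
        }
    }
    return state.nextIndex;
}

const demoStore = createMemoryStore();
const demoItems = ['a', 'b', 'c', 'd', 'e'];
processAll(demoItems, demoStore, (item) => console.log(`Processing ${item}`), 2);
console.log(demoStore.getValue('STATE'));
```

After a restart, items handled since the last checkpoint are processed again, so the handler should tolerate repeats. A smaller checkpoint interval reduces repeated work at the cost of more writes.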

Lifecycle

Each run starts with the initial status READY and goes through one or more transitional statuses to one of the terminal statuses.

Status       Type           Description
READY        initial        Started but not allocated to any worker yet
RUNNING      transitional   Executing on a worker
SUCCEEDED    terminal       Finished successfully
FAILED       terminal       Run failed
TIMING-OUT   transitional   Timing out now
TIMED-OUT    terminal       Timed out
ABORTING     transitional   Being aborted by user
ABORTED      terminal       Aborted by user

Container web server

Each actor run is assigned a unique hard-to-guess URL (e.g. http://kmdo7wpzlshygi.runs.apify.net), which enables HTTP access to an optional web server running inside the actor run's Docker container. The URL is available in the following places:

  • In the web application, on the actor run details page as the Container URL field.
  • In the API as the containerUrl property of the Run object.
  • In the actor run's container as the APIFY_CONTAINER_URL environment variable.

The web server running inside the container must listen on the port defined by the APIFY_CONTAINER_PORT environment variable (typically 4321). If you want to use another port, simply define the APIFY_CONTAINER_PORT environment variable with the desired port number in your actor version configuration - see Custom environment variables for details.

The following example demonstrates how to start a simple web server in your actor:

const Apify = require('apify');
const express = require('express');

const app = express();
const port = process.env.APIFY_CONTAINER_PORT;

app.get('/', (req, res) => {
    res.send('Hello World!');
});

app.listen(port, () => console.log(`Web server is listening and can be accessed at ${process.env.APIFY_CONTAINER_URL}!`));

Apify.main(async () => {
    // Let the actor run for an hour.
    await Apify.sleep(60 * 60 * 1000);
});

Publishing

Actors can be private or public. Private actors can only be accessed and started by their owner, while public actors are shown in the library and can be run by anyone. Each public actor has a globally unique identifier that consists of the owner's username and the actor name, e.g. apify/hello-world.

To publish your actor, go to Settings → Permissions on the actor detail page and click the Publish button. You'll need to have a username set. This can be done on the Profile page.

The short actor description shown in the library is taken from Settings → Description. Additionally, if the actor's source code is hosted in a Git repository, Zip file or GitHub Gist, you can add a long description in the Markdown language to the README.md or README files in the root of the source code directory. To see an example of how this looks, go to apify/crawl-url-list.

IMPORTANT: Note that if your actor is public and used by other people, its usage is not charged towards your account. The user running the actor is always the one who pays for the computational resources consumed by an actor's execution.

Examples

This section provides examples of actors using various features of the Apify platform. All these examples and many more are also available in the library.

Puppeteer

This example demonstrates how to use headless Chrome with Puppeteer to open a web page, determine its dimensions, save a screenshot and print it to PDF. The actor can be found in the Apify library as apify/example-puppeteer.

const Apify = require('apify');

Apify.main(async () => {
   const input = await Apify.getValue('INPUT');

   if (!input || !input.url) throw new Error('Invalid input, must be a JSON object with the "url" field!');

   console.log('Launching Puppeteer...');
   const browser = await Apify.launchPuppeteer();

   console.log(`Opening URL: ${input.url}`);
   const page = await browser.newPage();
   await page.goto(input.url);

   // Get the "viewport" of the page, as reported by the page.
   console.log('Determining page dimensions...');
   const dimensions = await page.evaluate(() => ({
       width: document.documentElement.clientWidth,
       height: document.documentElement.clientHeight,
       deviceScaleFactor: window.devicePixelRatio
   }));
   console.log(`Dimension: ${JSON.stringify(dimensions)}`);

   // Grab a screenshot
   console.log('Saving screenshot...');
   const screenshotBuffer = await page.screenshot();
   await Apify.setValue('screenshot.png', screenshotBuffer, { contentType: 'image/png' });

   console.log('Saving PDF snapshot...');
   const pdfBuffer = await page.pdf({ format: 'A4'});
   await Apify.setValue('page.pdf', pdfBuffer, { contentType: 'application/pdf' });

   console.log('Closing Puppeteer...');
   await browser.close();

   console.log('Done.');
   console.log('You can check the output in the key-value store at the following URLs:');
   const storeId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID;
   console.log(`- https://api.apify.com/v2/key-value-stores/${storeId}/records/screenshot.png`);
   console.log(`- https://api.apify.com/v2/key-value-stores/${storeId}/records/page.pdf`);
});

The code above uses the launchPuppeteer() function provided by the apify NPM package. The function launches Puppeteer with several settings that enable it to run in an actor. Note that the actor needs to have Base image set to Node.js 8 + Puppeteer on Debian in order to run Puppeteer.

Custom Dockerfile

This example demonstrates how to create an actor written in PHP using a custom Dockerfile. For more information, see the Custom Dockerfile section. The Dockerfile is based on the php:7.0-cli Docker image that contains everything needed to run PHP in a terminal.

Apart from the FROM instruction that selects the base image, the Dockerfile contains only two commands. The first copies the source code into the container and the second executes main.php.

The actor can be found in the Apify library as apify/example-php.

Dockerfile

FROM php:7.0-cli
COPY ./* ./
CMD [ "php", "./main.php" ]

main.php

<?php
print "Starting ...\n";
print "ENV vars:\n";
print_r($_ENV);
print "Fetching http://example.com ...\n";
$exampleComHtml = file_get_contents('http://example.com');
print "Searching for <h1> tag contents ...\n";
preg_match_all('/<h1>(.*?)<\/h1>/', $exampleComHtml, $matches);
print "Found: " . $matches[1][0] . "\n";
print "I am done!\n";

State persistence

This actor demonstrates how to persist state, so that on restart the actor can continue where it left off. For more information, see the State persistence section. The actor simply counts up from one, printing one number in each run. Its state (the counter position) is stored in a named key-value store called example-counter. You will find it in the Storage section of the app after you run the actor.

The actor can be found in the Apify library as apify/example-counter.

const Apify = require('apify');

Apify.main(async () => {
    const keyValueStores = Apify.client.keyValueStores;

    // Get store with name 'example-counter'.
    const store = await keyValueStores.getOrCreateStore({
        storeName: 'example-counter',
    });

    // Get counter state record from store.
    const record = await keyValueStores.getRecord({
        key: 'counter',
        storeId: store.id,
    });

    // If there is no such record then start from zero.
    // The body is stored as a string, so convert it to a number.
    let counter = record ? Number(record.body) : 0;

    // Increase counter, print and set as output.
    counter++;
    console.log(`Counter: ${counter}`);
    await Apify.setValue('OUTPUT', counter);

    // Save increased value back to store.
    await keyValueStores.putRecord({
        storeId: store.id,
        key: 'counter',
        body: counter.toString(), // Record body must be a string or buffer!
    });
});

Limits

This section describes various resource limits of the Apify platform. Do you need to increase any of them? Please contact us.

Description                                                     Value
Build memory size                                               1024 MB
Build timeout                                                   600 secs
Build/run disk size                                             2x run task memory limit
Run minimum memory                                              128 MB
Run maximum memory                                              16384 MB
Maximum combined memory of all running tasks (free accounts)    2048 MB
Maximum combined memory of all running tasks (paid accounts)    16384 MB
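To show how the memory limits interact, the sketch below checks whether a new run with a given memory size would fit. The numbers are copied from the table above; the names LIMITS_MBYTES and checkRunMemory are hypothetical, and the platform performs its own checks regardless of what client code does.

```javascript
// Values copied from the limits table, in megabytes.
const LIMITS_MBYTES = {
    runMin: 128,
    runMax: 16384,
    combinedFree: 2048,
    combinedPaid: 16384,
};

// Hypothetical helper: would a run with the requested memory fit into the limits?
function checkRunMemory(requestedMbytes, inUseMbytes, isPaidAccount) {
    if (requestedMbytes < LIMITS_MBYTES.runMin) {
        return { ok: false, reason: 'below the run minimum' };
    }
    if (requestedMbytes > LIMITS_MBYTES.runMax) {
        return { ok: false, reason: 'above the run maximum' };
    }

    // Memory of all running actors and builds counts towards the combined limit.
    const combinedLimit = isPaidAccount ? LIMITS_MBYTES.combinedPaid : LIMITS_MBYTES.combinedFree;
    if (inUseMbytes + requestedMbytes > combinedLimit) {
        return { ok: false, reason: 'exceeds the combined memory limit' };
    }
    return { ok: true, reason: null };
}

console.log(checkRunMemory(1024, 1536, false));
console.log(checkRunMemory(1024, 1536, true));
```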