Actor

Actor is a serverless computing platform that enables the execution of arbitrary pieces of code in the Apify cloud. Unlike traditional serverless platforms, the run of an act is not limited to the lifetime of a single HTTP transaction. It can run for as long as necessary, even forever. The act can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset.

A single isolated Actor job is called an act, and it consists of source code and various settings. You can think of an act as a cloud app or service.

Quick start

Go to the Actor section in the app, create a new act and go to the Source tab. Paste the following Node.js code into the Source code editor:

const Apify = require('apify');

Apify.main(async () => {
   console.log('Hello world from Actor!');
});

Click Quick run to build and run your act. After the run is finished you should see something like:

Congratulations, you have successfully created and run your first act!

Let's try something a little more complicated. We will change the act to accept input and generate output (see Input and output for more details):

const Apify = require('apify');

Apify.main(async () => {
    // Get input and print it
    const input = await Apify.getValue('INPUT');
    console.log('My input:');
    console.dir(input);

    // Save output
    const output = { message: 'Hello world!' };
    await Apify.setValue('OUTPUT', output);
});

Save your act by clicking Save and then rebuild it by clicking Build. After the build is finished, go to Console and set Input to:

{ "hello": 123 }

Then set Content type to application/json; charset=utf-8 and click Run. You will see something like:

Excellent, you have just created your first act that accepts input and stores output! Now you can start adding some magic.

Note that the above act is also available in the library as apify/hello-world. It uses the apify NPM package, which provides various helper functions to simplify the development of acts. For example, the Apify.main() function invokes a user function, waits for it to finish and logs the details of any exception. Note that the apify package is optional and acts do not need to use it at all.

For more complicated acts, you'll probably prefer to host the source code on Git. To do that, follow these steps:

  1. Create a new Git repository
  2. Copy the boilerplate act code from the apify/quick-start act
  3. Set Source type to Git repository for your act in the app
  4. Paste the Git repo link to Git URL, save changes and build your act.
  5. That's it, now you can develop your act locally on your computer and run it on the Apify cloud!

For more information, go to the Git repository section.

Source code

The Source type setting determines the location of the source code for the act. It can have one of the following values: Hosted source, Git repository, Zip file or GitHub Gist.

Hosted source

The source code of the act can be hosted directly on Apify. All the code needs to be in a single file and written in JavaScript / Node.js. The version of Node.js is determined by the Base image setting - see Base images for the description of possible options.

The hosted source is especially useful for simple acts. The source code can require arbitrary NPM packages. For example:

const _ = require('underscore');
const request = require('request');

During the build process, the source code is scanned for occurrences of the require() function and the corresponding NPM dependencies are automatically added to the package.json file by running:

npm install underscore request --save --only=prod --no-optional
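The dependency detection described above can be illustrated with a rough sketch. This is not Apify's actual build code: the function name findRequiredPackages and the regular expression are hypothetical, and the sketch only handles require() calls whose argument is a string literal.

```javascript
// Hypothetical sketch: collect NPM package names from require() calls.
// This is an illustration only, not the platform's implementation.
const { builtinModules } = require('module');

function findRequiredPackages(sourceCode) {
    // Match require('name') or require("name") with a string-literal argument.
    const pattern = /require\(\s*['"]([^'"]+)['"]\s*\)/g;
    const packages = new Set();
    let match;
    while ((match = pattern.exec(sourceCode)) !== null) {
        const name = match[1];
        // Skip local files and Node.js built-in modules such as 'fs'.
        if (name.startsWith('.') || name.startsWith('/')) continue;
        if (name.startsWith('node:') || builtinModules.includes(name)) continue;
        // Keep only the package part: 'lodash/fp' -> 'lodash', '@scope/pkg/x' -> '@scope/pkg'.
        const parts = name.split('/');
        packages.add(name.startsWith('@') ? parts.slice(0, 2).join('/') : parts[0]);
    }
    return [...packages];
}

const example = "const _ = require('underscore');\nconst request = require('request');";
console.log(`npm install ${findRequiredPackages(example).join(' ')}`);
```

A scan like this cannot see dynamically computed module names, which is one reason a Git repository with an explicit package.json may suit larger acts better.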

Note that certain NPM packages need additional tools for their installation, such as a C compiler or Python interpreter. If these tools are not available in the base Docker image, the build will fail. If that happens, try to change the base image to Node.js 8 + Puppeteer on Debian, because it contains many more tools than the other images. Alternatively, the source code can be hosted in a Git repository, where it is possible to specify a custom Dockerfile with arbitrary dependencies.

Git repository

If the source code of the act is hosted externally in a Git repository, it can consist of multiple files and directories, use its own Dockerfile to control the build process (see Custom Dockerfile for details) and have a user description in the library fetched from the README.md file. The location of the repository is specified by the Git URL setting, which can be an https, git or ssh URL.

To help you get started quickly, you can use the apify/quick-start act which contains all the boilerplate necessary when creating a new act hosted on Git. The source code is available on GitHub.

To specify a Git branch or tag to check out, add a URL fragment to the URL. For example, to check out the develop branch, specify a URL such as https://github.com/jancurn/act-analyse-pages.git#develop

Optionally, the second part of the fragment in the Git URL (separated by colon) specifies the context directory for the Docker build. For example, https://github.com/jancurn/act-analyse-pages.git#develop:some/dir will check out the develop branch and set some/dir as a context directory for the Docker build.
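To make the fragment syntax concrete, here is a minimal sketch that splits a Git URL setting into the repository URL, the branch or tag, and the context directory. The helper name parseGitUrl and the shape of the returned object are hypothetical; the platform's own parsing may differ in edge cases.

```javascript
// Hypothetical helper: split "<repo>#<branch-or-tag>:<context-dir>" into parts.
function parseGitUrl(gitUrl) {
    const hashIndex = gitUrl.indexOf('#');
    if (hashIndex === -1) {
        // No fragment: no explicit branch, tag or context directory.
        return { repoUrl: gitUrl, ref: null, contextDir: null };
    }
    const repoUrl = gitUrl.slice(0, hashIndex);
    const fragment = gitUrl.slice(hashIndex + 1);
    const colonIndex = fragment.indexOf(':');
    if (colonIndex === -1) {
        return { repoUrl, ref: fragment || null, contextDir: null };
    }
    return {
        repoUrl,
        ref: fragment.slice(0, colonIndex) || null,
        contextDir: fragment.slice(colonIndex + 1) || null,
    };
}

console.log(parseGitUrl('https://github.com/jancurn/act-analyse-pages.git#develop:some/dir'));
```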

Note that you can easily set up an integration where the act is automatically rebuilt on every commit to the Git repository. For more details, see GitHub integration.

Zip file

The source code for the act can also be located in a Zip archive hosted on an external URL. This option enables integration with arbitrary source code or continuous integration systems. As with the Git repository, the source code can consist of multiple files and directories, it can contain a custom Dockerfile and the act description is taken from README.md.

GitHub Gist

Sometimes having a full Git repository or a hosted Zip file might be overly complicated for your small project, but you still want to have source code in multiple files. In this case, you can simply put your source code into a GitHub Gist. For example:

https://gist.github.com/jancurn/2dbe83fea77c439b1119fb3f118513e7

Then set the Source Type to GitHub Gist and paste the Gist URL as follows:

Note that the example act is available in the library as apify/example-act-in-gist.

As with the Git repository, the source code can consist of multiple files and directories, it can contain a custom Dockerfile and the act description is taken from README.md.

Custom Dockerfile

Internally, Apify uses Docker to build and run the acts. To control the build of the act, you can create a custom Dockerfile in the root of the Git repository or Zip directory. Note that this option is not available for the Hosted source option. If the Dockerfile is missing, the system uses the following default:

FROM apify/actor-node-basic
ENV NODE_ENV=production
COPY . ./
RUN npm install --production --no-optional

For more information about Dockerfile syntax and commands, see the Dockerfile reference.

Note that apify/actor-node-basic is a base Docker image provided by Apify. There are other base images with other features available. However, you can use arbitrary Docker images as the base for your acts, although using the Apify images has some performance advantages. See Base images for details.

GitHub integration

If the source code of the act is hosted in a Git repository, it is possible to set up an integration so that the act is automatically rebuilt on every push to the Git repository. For that, you only need to set up a webhook in your Git source control system that will invoke the Build act API endpoint on every push to the Git repository.

For example, for repositories on GitHub it can be done using the following steps. First, go to the act detail page, open the API tab and copy the Build act API endpoint URL. It looks something like this:

https://api.apify.com/v2/acts/apify~hello-world/builds?token=<API_TOKEN>&version=0.1

Then go to your GitHub repository, click Settings, select the Webhooks tab and click Add webhook. Paste the API URL to the Payload URL as follows:

And that's it! Now your act should automatically rebuild on every push to the GitHub repository.

Custom environment variables

The act owner can specify custom environment variables that are set to the act's process during the run. Sensitive environment variables such as passwords or API tokens can be protected by setting the Secret option. With this option enabled, the value of the environment variable is encrypted and it will not be visible in the app or APIs, and the value is redacted from act logs to avoid the accidental leakage of sensitive data.

Note that the custom environment variables are fixed during the build of the act and cannot be changed later. See the Build section for details.

To access environment variables in Node.js, use the process.env object, for example:

console.log(process.env.SMTP_HOST);

The Actor runtime sets additional environment variables to the act process during the run. See Environment variables for details.

Versioning

In order to enable active development, the act can have multiple versions of the source code and associated settings, such as the Base image and Environment. Each version is denoted by a version number of the form MAJOR.MINOR; the version numbers should adhere to the Semantic Versioning logic.

For example, the act can have production version 1.1, a beta version 1.2 that contains new features but is still backwards compatible, and a development version 2.0 that contains breaking changes.

The versions of the acts are built and run separately. For details, see Build and Run.

Local development

It is possible to develop acts locally on your computer and then only deploy them to the Apify cloud when they are ready. This is especially useful if you're using Git integration. See Git repository for more details. The boilerplate for creating an act in a Git repository is available on GitHub.

In order to test the input and output of your acts on your local machine, you can define the APIFY_DEV_KEY_VALUE_STORE_DIR environment variable, which will cause the apify NPM package to emulate the key-value store locally using files in a directory. For more details, please see the apify package documentation.

Unfortunately, not all features of the Apify platform can be emulated locally, so you might still need to let the apify NPM package use your API token in order to interact with the Apify platform. The simplest way to achieve that is by setting the APIFY_TOKEN environment variable on your local development machine.

Build

Before the act can be run, it first needs to be built. The build effectively creates a snapshot of a specific version of the act's settings such as the Source code and Environment variables, and creates a Docker image that contains everything the act needs for its run, including necessary NPM packages, web browsers, etc.

Each build is assigned a unique build number of the form MAJOR.MINOR.BUILD (e.g. 1.2.345), where MAJOR.MINOR corresponds to the act version number (see Versioning) and BUILD is an automatically incremented number starting at 1.
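As an illustration of the numbering scheme, the following sketch splits a build number into the act version and the build counter. The function name parseBuildNumber is hypothetical and is not part of the apify package.

```javascript
// Hypothetical helper: split MAJOR.MINOR.BUILD into the act version and build counter.
function parseBuildNumber(buildNumber) {
    const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(buildNumber);
    if (!match) throw new Error(`Invalid build number: ${buildNumber}`);
    const [, major, minor, build] = match;
    return {
        version: `${major}.${minor}`, // the act version the build belongs to
        build: Number(build),         // automatically incremented, starting at 1
    };
}

console.log(parseBuildNumber('1.2.345'));
```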

By default, the build has a timeout of 300 seconds and consumes 1024 MB of memory from the user's memory limit. See the Resource limits section for more details.

Tags

When running the act, the caller needs to specify which act build should actually be used. To simplify this process, the builds can be associated with a tag such as latest or beta, which can be used instead of the build number when running the act. The tags are unique: only one build can be associated with a specific tag.

To set a tag for builds of a specific act version, set the Build tag property. Whenever a new build of the version is successfully finished, it is automatically assigned the tag. By default, builds are assigned the latest tag.
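The tag behaviour can be modelled with a small in-memory sketch. The class BuildTagRegistry and its method names are hypothetical and only mirror the rules stated above: a tag points to exactly one build, and a newly finished build takes over the tag.

```javascript
// Hypothetical in-memory model of build tags; not the platform's actual code.
class BuildTagRegistry {
    constructor() {
        this.buildByTag = new Map(); // tag -> build number
    }

    // Called when a build finishes successfully. A tag points to a single
    // build, so assigning it again moves the tag to the newer build.
    assignTag(tag, buildNumber) {
        this.buildByTag.set(tag, buildNumber);
    }

    // Accepts either a tag or an explicit MAJOR.MINOR.BUILD number.
    resolve(tagOrBuildNumber) {
        if (/^\d+\.\d+\.\d+$/.test(tagOrBuildNumber)) return tagOrBuildNumber;
        const buildNumber = this.buildByTag.get(tagOrBuildNumber);
        if (!buildNumber) throw new Error(`No build has the tag "${tagOrBuildNumber}"`);
        return buildNumber;
    }
}

const registry = new BuildTagRegistry();
registry.assignTag('latest', '1.1.7');
registry.assignTag('beta', '1.2.3');
registry.assignTag('latest', '1.1.8'); // the tag moves to the newer build
console.log(registry.resolve('latest')); // prints 1.1.8
```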

Base images

Apify provides the following Docker images that can be used as a base for user acts:

Note that all the Apify images are pre-cached on Apify servers in order to speed up the act builds and runs. The source code used to generate the images is available in the apify-actor-docker GitHub repository.

Cache

By default, the build process pulls the latest copies of all necessary Docker images and builds each new layer of the Docker image from scratch. To speed up the build process, the user can invoke the build with the cache enabled, either using the Quick run option in the Source tab or by passing the useCache parameter in the API. See API reference for more details.

Lifecycle

Each build starts with the initial status READY and goes through one or more transitional statuses to one of the terminal statuses.

Status       Type           Description
READY        initial        Started but not allocated to any worker yet
RUNNING      transitional   Executing on a worker
SUCCEEDED    terminal       Finished successfully
FAILED       terminal       Build failed
TIMING-OUT   transitional   Timing out now
TIMED-OUT    terminal       Timed out
ABORTING     transitional   Being aborted by user
ABORTED      terminal       Aborted by user
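Code that waits for a build typically only needs to know whether a status is terminal. The sketch below encodes the table above as a lookup; the names STATUS_TYPES and isTerminalStatus are hypothetical and are not taken from the apify package.

```javascript
// Status types copied from the lifecycle table above.
const STATUS_TYPES = {
    READY: 'initial',
    RUNNING: 'transitional',
    SUCCEEDED: 'terminal',
    FAILED: 'terminal',
    'TIMING-OUT': 'transitional',
    'TIMED-OUT': 'terminal',
    ABORTING: 'transitional',
    ABORTED: 'terminal',
};

// Returns true when no further status change is expected.
function isTerminalStatus(status) {
    if (!Object.prototype.hasOwnProperty.call(STATUS_TYPES, status)) {
        throw new Error(`Unknown status: ${status}`);
    }
    return STATUS_TYPES[status] === 'terminal';
}

console.log(isTerminalStatus('TIMING-OUT')); // prints false
console.log(isTerminalStatus('TIMED-OUT'));  // prints true
```

A polling loop could keep checking the status until this helper returns true.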

Run

The act can be invoked in a number of ways. One option is to start the act manually in Console in the app:

The following table describes the run settings:

Build          Tag or number of the build to run (e.g. latest or 1.2.34).
Timeout        Timeout for the act run in seconds. A value of zero means there is no timeout.
Memory         Amount of memory allocated for the act run, in megabytes.
Input          Input data for the act. The maximum length is 1M characters.
Content type   Indicates what kind of data is in the input (e.g. application/json).

The owner of the act can specify default values for all the above settings in the Default run configuration section in the app. If the act caller does not specify a particular setting, the default value is used.

The act can also be invoked using the Apify API by sending an HTTP POST request to the Run act API endpoint, such as:

https://api.apify.com/v2/acts/apify~hello-world/runs?token=<YOUR_API_TOKEN>

The act's input and its content type can be passed as a payload of the POST request and additional options can be specified using URL query parameters. For more details, see the Run act section in the API reference.
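The following sketch shows how such a request could be assembled in Node.js. The helper names buildRunActUrl and buildRunActRequest are hypothetical. Only the token query parameter is taken from the URL above; the names of any additional query parameters are an assumption to be checked against the API reference, so the sketch simply passes them through unchanged.

```javascript
// Hypothetical helper: build the Run act endpoint URL.
// Extra query parameters are passed through as given; their names are
// NOT verified here and must be checked against the API reference.
function buildRunActUrl(actId, token, queryParams = {}) {
    const url = new URL(`https://api.apify.com/v2/acts/${actId}/runs`);
    url.searchParams.set('token', token);
    for (const [name, value] of Object.entries(queryParams)) {
        url.searchParams.set(name, String(value));
    }
    return url.toString();
}

// Hypothetical helper: describe the POST request without sending it.
function buildRunActRequest(actId, token, input, queryParams) {
    return {
        url: buildRunActUrl(actId, token, queryParams),
        options: {
            method: 'POST',
            // The content type tells the act what kind of data the input is.
            headers: { 'Content-Type': 'application/json; charset=utf-8' },
            body: JSON.stringify(input),
        },
    };
}

const request = buildRunActRequest('apify~hello-world', '<YOUR_API_TOKEN>', { hello: 123 });
console.log(request.options.method, request.options.body);
```

The resulting url and options can be handed to any HTTP client, for example fetch(request.url, request.options) in Node.js versions that provide a global fetch.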

Acts can also be invoked programmatically from other acts using the call() function provided by the apify NPM package. For example:

const run = await Apify.call('apify/hello-world', { message: 'Hello!' });
console.dir(run.output);

The newly started act runs under the same user account as the initial act and therefore all resources consumed are charged to the same user account. This allows more complex acts to be built using simpler acts built and owned by other users.

Internally, the call() function takes the user's API token from the APIFY_TOKEN environment variable, then it invokes the Run act API endpoint, waits for the act to finish and reads its output using the Get record API endpoint.

Input and output

As demonstrated in the hello world example above, acts can accept input and generate output. Both input and output are stored in a key-value store that is created when the act is started, under the INPUT and OUTPUT keys, respectively. Note that the act can store other values under arbitrary keys, for example crawling results or screenshots of web pages.

The key-value store associated with the act run can be conveniently accessed using the getValue() and setValue() functions provided by the apify NPM package. Internally, these functions read the ID of the key-value store from the APIFY_DEFAULT_KEY_VALUE_STORE_ID environment variable and then access the store using the Apify API. For more details about the key-value stores, go to the Storage section.

The input can be passed to the act either manually in the Console or using a POST payload when running the act using the API. See the Run section for details.

Environment variables

Aside from custom environment variables, the act's process has several environment variables set to provide it with context:

APIFY_ACT_ID ID of the act.
APIFY_ACT_RUN_ID ID of the act run.
APIFY_USER_ID ID of the user who started the act. Note that it might be different than the owner of the act.
APIFY_TOKEN The API token of the user who started the act.
APIFY_STARTED_AT Date when the act was started.
APIFY_TIMEOUT_AT Date when the act will time out.
APIFY_DEFAULT_KEY_VALUE_STORE_ID ID of the key-value store where the act's input and output data is stored.
APIFY_HEADLESS Set to '1', which indicates that web browsers should run in headless mode.
APIFY_MEMORY_MBYTES Indicates the size of memory allocated for the act run, in megabytes. It can be used by acts to optimize their memory usage.

The dates are always in the UTC timezone and are represented in simplified extended ISO format (ISO 8601), e.g. 2017-10-13T14:23:37.281Z

To access environment variables in Node.js, use the process.env object, for example:

console.log(process.env.APIFY_USER_ID);
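Because the dates are ISO 8601 strings, they can be passed straight to the JavaScript Date constructor. The sketch below computes how much time is left before the act times out. The function name getRemainingMillis is hypothetical, and the handling of a missing variable is an assumption, since the text above does not say what APIFY_TIMEOUT_AT contains when no timeout is set.

```javascript
// Hypothetical helper: milliseconds left until the act times out.
// Returns null when APIFY_TIMEOUT_AT is missing or cannot be parsed
// (an assumption made for this sketch).
function getRemainingMillis(env, now = new Date()) {
    if (!env.APIFY_TIMEOUT_AT) return null;
    const timeoutAt = new Date(env.APIFY_TIMEOUT_AT);
    if (Number.isNaN(timeoutAt.getTime())) return null;
    // Never report a negative amount of remaining time.
    return Math.max(0, timeoutAt.getTime() - now.getTime());
}

console.log(getRemainingMillis(process.env));
```

An act could use such a value to decide whether to start another unit of work or to save its state first.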

Resource limits

Acts run inside a Docker container whose resources are limited. When invoking the act, the caller has to specify the amount of memory allocated for the act; the minimum is 512 MB and the maximum is 8192 MB. Each user has a certain total limit of memory for running acts, based on their subscription plan. For example, free accounts have a limit of 1024 MB. The sum of memory allocated for all running acts and builds needs to fit into this limit, otherwise the user cannot start a new act. See pricing for more details.

The share of CPU is computed automatically from the memory as follows: for each 4096 MB of memory the act gets 1 full CPU core. For other amounts of memory the number of CPU cores is computed fractionally. For example, an act with 1024 MB of memory will have 1/4 of CPU share. Note that CPU throttling is only applied if the system is under load; if there is free CPU capacity, the acts are not throttled.

The act has hard disk space limited by twice the amount of memory. For example, an act with 1024 MB of memory will have 2048 MB of disk available.
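The arithmetic in this section can be summarised in a short sketch. The function names computeRunResources and fitsMemoryLimit are hypothetical; the constants come from the text above.

```javascript
// Limits and ratios stated in this section.
const MIN_MEMORY_MBYTES = 512;
const MAX_MEMORY_MBYTES = 8192;
const MBYTES_PER_CPU_CORE = 4096;

// Hypothetical helper: derive CPU share and disk space from the memory setting.
function computeRunResources(memoryMbytes) {
    if (memoryMbytes < MIN_MEMORY_MBYTES || memoryMbytes > MAX_MEMORY_MBYTES) {
        throw new RangeError(`Memory must be between ${MIN_MEMORY_MBYTES} and ${MAX_MEMORY_MBYTES} MB`);
    }
    return {
        cpuCores: memoryMbytes / MBYTES_PER_CPU_CORE, // fractional share of a core
        diskMbytes: memoryMbytes * 2,                 // disk is twice the memory
    };
}

// Hypothetical helper: would a new run or build fit into the user's memory limit?
function fitsMemoryLimit(userLimitMbytes, allocatedMbytes, requestedMbytes) {
    return allocatedMbytes + requestedMbytes <= userLimitMbytes;
}

console.log(computeRunResources(1024)); // { cpuCores: 0.25, diskMbytes: 2048 }
console.log(fitsMemoryLimit(1024, 512, 1024)); // prints false
```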

State persistence

Unlike traditional serverless platforms, Actor has no limits on the duration of an act run. However, that means that the act might need to be restarted from time to time, e.g. when the server it's running on is to be shut down. The acts need to account for this possibility. For short-running acts, the chance of a restart is quite low and the cost of repeated runs is low, so restarts can be ignored. However, for long-running acts a restart might be very costly and therefore such acts should periodically persist their state, possibly to the key-value store associated with the act run. On start, the acts should first check whether there is some state stored and, if so, continue where they left off.
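The pattern can be sketched as a processing loop with periodic checkpoints. To keep the sketch runnable anywhere, it uses a hypothetical in-memory store (createMemoryStore) with getValue() and setValue() methods; in a real act, the functions of the same name from the apify package would play that role. The STATE key and the processWithCheckpoints function are also assumptions made for illustration.

```javascript
// In-memory stand-in for a key-value store, used only to keep this sketch
// runnable. Values are copied through JSON to mimic a remote store.
function createMemoryStore() {
    const data = new Map();
    return {
        getValue: async (key) => (data.has(key) ? JSON.parse(data.get(key)) : null),
        setValue: async (key, value) => { data.set(key, JSON.stringify(value)); },
    };
}

// Process items in order, saving the position every `persistEvery` items.
// After a restart, processing resumes from the last saved position, so an
// item handled after the last checkpoint may be handled again.
async function processWithCheckpoints(store, items, handleItem, persistEvery = 10) {
    const state = (await store.getValue('STATE')) || { nextIndex: 0 };
    for (let i = state.nextIndex; i < items.length; i++) {
        await handleItem(items[i]);
        state.nextIndex = i + 1;
        if (state.nextIndex % persistEvery === 0) await store.setValue('STATE', state);
    }
    await store.setValue('STATE', state);
    return state;
}

processWithCheckpoints(createMemoryStore(), ['a', 'b', 'c'], async (item) => {
    console.log(`Processing ${item}`);
});
```

Saving less often reduces storage traffic but increases the amount of work repeated after a restart, so the interval is a trade-off that depends on how expensive each item is.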

Lifecycle

Each run starts with the initial status READY and goes through one or more transitional statuses to one of the terminal statuses.

Status       Type           Description
READY        initial        Started but not allocated to any worker yet
RUNNING      transitional   Executing on a worker
SUCCEEDED    terminal       Finished successfully
FAILED       terminal       Run failed
TIMING-OUT   transitional   Timing out now
TIMED-OUT    terminal       Timed out
ABORTING     transitional   Being aborted by user
ABORTED      terminal       Aborted by user

Publishing

The act can be private or public. Private acts can only be accessed and started by their owner, while public acts are shown in the library and can be run by anyone. Each public act has a globally unique identifier that consists of the owner's username and the act name, e.g. apify/hello-world.

To publish your act, go to Settings → Permissions on the act detail page and click the Publish button. You'll need to have a username set, which can be done on the Profile page.

The short act description shown in the library is taken from Settings → Description. Additionally, if the act's source code is hosted in a Git repository, Zip file or GitHub Gist, you can add a long description in the Markdown language to the README.md or README files in the root of the source code directory. To see an example of how this looks, go to apify/crawl-url-list.

IMPORTANT: Note that if your act is public and used by other people, its usage is not charged towards your account. The user running the act is always the one who pays for the computational resources consumed by an act's execution.

Examples

This section provides examples of various features of the Actor platform. All these examples and many more are also available in the library.

Puppeteer

This example demonstrates how to use headless Chrome with Puppeteer to open a web page, determine its dimensions, save a screenshot and print it to PDF. The act can be found in the Apify library as apify/example-puppeteer.

const Apify = require('apify');

Apify.main(async () => {
   const input = await Apify.getValue('INPUT');

   if (!input || !input.url) throw new Error('Invalid input, must be a JSON object with the "url" field!');

   console.log('Launching Puppeteer...');
   const browser = await Apify.launchPuppeteer();

   console.log(`Opening URL: ${input.url}`);
   const page = await browser.newPage();
   await page.goto(input.url);

   // Get the "viewport" of the page, as reported by the page.
   console.log('Determining page dimensions...');
   const dimensions = await page.evaluate(() => ({
       width: document.documentElement.clientWidth,
       height: document.documentElement.clientHeight,
       deviceScaleFactor: window.devicePixelRatio
   }));
   console.log(`Dimension: ${JSON.stringify(dimensions)}`);

   // Grab a screenshot
   console.log('Saving screenshot...');
   const screenshotBuffer = await page.screenshot();
   await Apify.setValue('screenshot.png', screenshotBuffer, { contentType: 'image/png' });

   console.log('Saving PDF snapshot...');
   const pdfBuffer = await page.pdf({ format: 'A4'});
   await Apify.setValue('page.pdf', pdfBuffer, { contentType: 'application/pdf' });

   console.log('Closing Puppeteer...');
   await browser.close();

   console.log('Done.');
   console.log('You can check the output in the key-value store at the following URLs:');
   const storeId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID;
   console.log(`- https://api.apify.com/v2/key-value-stores/${storeId}/records/screenshot.png`);
   console.log(`- https://api.apify.com/v2/key-value-stores/${storeId}/records/page.pdf`);
});

The code above uses the launchPuppeteer() function provided by the apify NPM package. The function launches Puppeteer with several settings that enable it to run in Actor. Note that the act needs to have Base image set to Node.js 8 + Puppeteer on Debian in order to run Puppeteer.

Custom Dockerfile

This example demonstrates how to create an act written in PHP using a custom Dockerfile. For more information, see the Custom Dockerfile section. The Dockerfile is based on the php:7.0-cli Docker image that contains everything needed to run PHP in a terminal.

Apart from the FROM instruction that selects the base image, the Dockerfile contains only two commands. The first copies the source code into the container and the second executes main.php.

The act can be found in the Apify library as apify/example-php.

Dockerfile

FROM php:7.0-cli
COPY ./* ./
CMD [ "php", "./main.php" ]

main.php

<?php
print "Starting ...\n";
print "ENV vars:\n";
print_r($_ENV);
print "Fetching http://example.com ...\n";
$exampleComHtml = file_get_contents('http://example.com');
print "Searching for <h1> tag contents ...\n";
preg_match_all('/<h1>(.*?)<\/h1>/', $exampleComHtml, $matches);
print "Found: " . $matches[1][0] . "\n";
print "I am done!\n";

State persistence

This act demonstrates how to persist a state, so that on restart the act can continue where it left off. For more information, see the State persistence section. The act simply counts from one up. In each run it prints one number. Its state (counter position) is stored in a named key-value store called example-counter. You will find it in the Storage section of the app after you run the act.

The act can be found in the Apify library as apify/example-counter.

const Apify = require('apify');

Apify.main(async () => {
    const keyValueStores = Apify.client.keyValueStores;

    // Get store with name 'example-counter'.
    const store = await keyValueStores.getOrCreateStore({
        storeName: 'example-counter',
    });

    // Get counter state record from store.
    const record = await keyValueStores.getRecord({
        key: 'counter',
        storeId: store.id,
    });

    // If there is no such record then start from zero.
    // The body was saved as a string, so convert it back to a number.
    let counter = record ? parseInt(record.body, 10) : 0;

    // Increase counter, print and set as output.
    counter++;
    console.log(`Counter: ${counter}`);
    await Apify.setValue('OUTPUT', counter);

    // Save increased value back to store.
    await keyValueStores.putRecord({
        storeId: store.id,
        key: 'counter',
        body: counter.toString(), // Record body must be a string or buffer!
    });
});