Actor

Actor is a serverless computing platform that enables execution of arbitrary pieces of code in the Apify cloud. Unlike traditional serverless platforms, the run of an act is not limited to the lifetime of a single HTTP transaction. It can run for as long as necessary, even forever. The act can perform anything from simple actions, such as filling in a web form or sending an email, to complex operations, such as crawling an entire website or removing duplicates from a large dataset.

A single isolated Actor job is called an act, and it consists of source code and various settings. You can think of an act as a cloud app or service.

Quick start

Go to the Actor section in the app, create a new act and go to the Source tab. Paste the following Node.js code into the Source code editor:

const Apify = require('apify');

Apify.main(async () => {
   console.log('Hello world from Actor!');
});

Click Quick run to build and run your act. After the run is finished you should see something like:

Congratulations, you have successfully created and run your first act!

Let's try something a little more complicated. We will change the act to accept input and generate output (see Input and output for more details):

const Apify = require('apify');

Apify.main(async () => {
    // Get input and print it
    const input = await Apify.getValue('INPUT');
    console.log('My input:');
    console.dir(input);

    // Save output
    const output = { message: 'Hello world!' };
    await Apify.setValue('OUTPUT', output);
});

Save your act by clicking Save and then rebuild it by clicking Build. After the build is finished, go to Console and set Input to:

{ "hello": 123 }

Then set Content type to application/json; charset=utf-8 and click Run. You will see something like:

Excellent, you have just created your first act that accepts input and stores output! Now you can start adding some magic.

Note that the above act is also available in the library as apify/hello-world. It uses the apify NPM package, which provides various helper functions to simplify the development of acts. For example, the Apify.main() function invokes a user function, waits for it to finish and logs details of any exception. Note that the apify package is optional and acts do not need to use it at all.

For more complicated acts, you'll probably prefer to host the source code on Git. To do that, follow these steps:

  1. Create a new Git repository
  2. Copy the boilerplate act code from the apify/quick-start act
  3. Set Source type to Git repository for your act in the app
  4. Paste the Git repo link to Git URL, save changes and build your act.
  5. That's it, now you can develop your act locally on your computer and run it on Apify cloud!

For more information, go to the Git repository section.

Source code

The Source type setting determines the location of the source code for the act. It can have one of the following values: Hosted source, Git repository or Zip file.

Hosted source

The source code of the act can be hosted directly on Apify. All the code needs to be in a single file and written in JavaScript / Node.js. The version of Node.js is determined by the Base image setting - see Base images for the description of possible options.

The hosted source is especially useful for simple acts. The source code can require arbitrary NPM packages, for example:

const _ = require('underscore');
const request = require('request');

During the build process, the source code is scanned for occurrences of the require() function and the corresponding NPM dependencies are automatically added to the package.json file by running:

npm install underscore request --save --only=prod --no-optional
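The scanning step described above can be approximated with a regular expression. The following is only a rough sketch under stated assumptions: the findNpmDependencies() function is hypothetical, and the platform's actual scanner is not documented here and may handle more cases (for example dynamic require() calls).

```javascript
// Sketch only: the real build process may detect dependencies differently.
const { builtinModules } = require('module');

// Returns the NPM package names referenced by require('...') calls in `source`.
function findNpmDependencies(source) {
    const pattern = /\brequire\(\s*['"]([^'"]+)['"]\s*\)/g;
    const packages = new Set();
    let match;
    while ((match = pattern.exec(source)) !== null) {
        const id = match[1];
        // Skip relative/absolute paths and Node.js built-in modules.
        if (id.startsWith('.') || id.startsWith('/') || builtinModules.includes(id)) continue;
        // 'lodash/fp' -> 'lodash', '@scope/pkg/sub' -> '@scope/pkg'
        const parts = id.split('/');
        packages.add(id.startsWith('@') ? parts.slice(0, 2).join('/') : parts[0]);
    }
    return [...packages];
}

const source = `
const _ = require('underscore');
const request = require('request');
const fs = require('fs');
const helper = require('./helper');
`;

const deps = findNpmDependencies(source);
// Prints: npm install underscore request --save --only=prod --no-optional
console.log(`npm install ${deps.join(' ')} --save --only=prod --no-optional`);
```

Built-in modules such as fs and local files such as ./helper are skipped because they do not need to be installed from NPM.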

Note that certain NPM packages need additional tools for their installation, such as a C compiler or Python interpreter. If these tools are not available in the base Docker image, the build will fail. If that happens, try to change the base image to Node.js 8 + Puppeteer on Debian, because it contains many more tools than the other images. Alternatively, the source code can be hosted in a Git repository, where it is possible to specify a custom Dockerfile with arbitrary dependencies.

Git repository

If the source code of the act is hosted externally in a Git repository, it can consist of multiple files and directories, use its own Dockerfile to control the build process and have its description in the library fetched from the README.md file. The location of the repository is specified by the Git URL setting, which can be an https, git or ssh URL.

To help you get started quickly, you can use the apify/quick-start act which contains all the boilerplate necessary when creating a new act hosted on Git. The source code is available on GitHub.

To specify a Git branch or tag to checkout, add a URL fragment to the URL. For example, to checkout the develop branch specify a URL such as https://github.com/jancurn/act-analyse-pages.git#develop

Optionally, the second part of the fragment in the Git URL (separated by colon) specifies the context directory for the Docker build. For example, https://github.com/jancurn/act-analyse-pages.git#develop:some/dir will checkout the develop branch and set some/dir as a context directory for the Docker build.
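To make the URL format concrete, here is a minimal sketch of how such a Git URL could be split into its parts. The parseGitUrl() function and the names of the returned fields are hypothetical and only illustrate the format described above; the platform's own parsing may differ.

```javascript
// Sketch only: splits a Git URL setting into repository URL,
// branch or tag, and Docker build context directory.
function parseGitUrl(gitUrl) {
    const hashIndex = gitUrl.indexOf('#');
    if (hashIndex === -1) return { repoUrl: gitUrl, ref: null, contextDir: null };

    const repoUrl = gitUrl.slice(0, hashIndex);
    const fragment = gitUrl.slice(hashIndex + 1);

    // The fragment has the form "<branch-or-tag>[:<context-dir>]".
    const colonIndex = fragment.indexOf(':');
    if (colonIndex === -1) return { repoUrl, ref: fragment || null, contextDir: null };

    return {
        repoUrl,
        ref: fragment.slice(0, colonIndex) || null,
        contextDir: fragment.slice(colonIndex + 1) || null,
    };
}

console.log(parseGitUrl('https://github.com/jancurn/act-analyse-pages.git#develop:some/dir'));
// { repoUrl: 'https://github.com/jancurn/act-analyse-pages.git',
//   ref: 'develop',
//   contextDir: 'some/dir' }
```

The URL is split on the # character first, so the colon used in ssh-style URLs (such as git@github.com:user/repo.git) is not mistaken for the context directory separator.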

Note that you can easily set up an integration where the act is automatically rebuilt on every commit to the Git repository. For more details, see GitHub integration.

Zip file

The source code for the act can also be located in a Zip archive hosted on an external URL. This option enables integration with arbitrary source code or continuous integration systems. As with the Git repository, the source code can consist of multiple files and directories and can contain a custom Dockerfile, and the act description is taken from README.md.

Custom Dockerfile

Internally, Apify uses Docker to build and run the acts. To control the build of the act, you can create a custom Dockerfile in the root of the Git repository or Zip directory. Note that this option is not available for the Hosted source option. If the Dockerfile is missing, the system uses the following default:

FROM apify/actor-node-basic
ENV NODE_ENV=production
COPY . ./
RUN npm install --production --no-optional

For more information about Dockerfile syntax and commands, see the Dockerfile reference.

Note that apify/actor-node-basic is a base Docker image provided by Apify. There are other base images with other features available. However, you can use arbitrary Docker images as the base for your acts, although using the Apify images has some performance advantages. See Base images for details.

GitHub integration

If the source code of the act is hosted in a Git repository, it is possible to set up an integration so that the act is automatically rebuilt on every push to the repository. To do that, you only need to set up a webhook in your Git source control system that invokes the Build act API endpoint on every push.

For example, for repositories on GitHub it can be done using the following steps. First, go to the act detail page, open the API tab and copy the Build act API endpoint URL. It looks something like this:

https://api.apify.com/v2/acts/apify~hello-world/builds?token=<API_TOKEN>&version=0.1

Then go to your GitHub repository, click Settings, select Webhooks tab and click Add webhook. Paste the API URL to the Payload URL as follows:

And that's it! Now your act should automatically rebuild on every push to the GitHub repository.

Custom environment variables

The act owner can specify custom environment variables that are set for the act's process during the run. Sensitive environment variables such as passwords or API tokens can be protected by setting the Secret option. With this option enabled, the value of the environment variable is encrypted and is not visible in the app or APIs, and the value is redacted from act logs to avoid accidental leakage of sensitive data.

Note that the custom environment variables are fixed during the build of the act and cannot be changed later. See the Build section for details.

To access environment variables in Node.js, use the process.env object, for example:

console.log(process.env.SMTP_HOST);

The Actor runtime sets additional environment variables to the act process during the run. See Environment variables for details.

Versioning

In order to enable active development, the act can have multiple versions of the source code and associated settings, such as the Base image and Environment. Each version is denoted by a version number of the form MAJOR.MINOR; the version numbers should adhere to the Semantic Versioning logic.

For example, the act can have production version 1.1, a beta version 1.2 that contains new features but is still backwards compatible, and a development version 2.0 that contains breaking changes.

The versions of the acts are built and run separately. For details, see Build and Run.

Build

Before the act can be run, it first needs to be built. The build effectively creates a snapshot of a specific version of the act's settings, such as the Source code and Environment variables, and creates a Docker image that contains everything the act needs for its run, including necessary NPM packages, web browsers etc.

Each build is assigned a unique build number of the form MAJOR.MINOR.BUILD (e.g. 1.2.345), where MAJOR.MINOR corresponds to the act version number (see Versioning) and BUILD is an automatically incremented number starting at 1.

By default, the build has a timeout of 300 seconds and consumes 1024 MB of memory from the user's memory limit. See the Resource limits section for more details.
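As an illustration of the build number format, the following sketch splits a build number into the act version and the build counter. The parseBuildNumber() helper is hypothetical and is not part of the apify NPM package.

```javascript
// Sketch only: parses a build number of the form MAJOR.MINOR.BUILD.
function parseBuildNumber(buildNumber) {
    const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(buildNumber);
    if (!match) throw new Error(`Invalid build number: ${buildNumber}`);
    const [major, minor, build] = match.slice(1).map(Number);
    // MAJOR.MINOR is the act version, BUILD is the per-version counter.
    return { version: `${major}.${minor}`, major, minor, build };
}

console.log(parseBuildNumber('1.2.345'));
// { version: '1.2', major: 1, minor: 2, build: 345 }
```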

Tags

When running the act, the caller needs to specify which act build should actually be used. To simplify this process, the builds can be associated with a tag such as latest or beta, which can be used instead of the build number when running the act. The tags are unique: only one build can be associated with a specific tag.

To set a tag for builds of a specific act version, set the Build tag property. Whenever a new build of the version is successfully finished, it is automatically assigned the tag. By default, builds are assigned the latest tag.
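The tag behavior described above can be pictured with a toy in-memory model. This is only a sketch for illustration: the BuildTags class is hypothetical and says nothing about how the platform stores tags internally.

```javascript
// Toy in-memory model of build tags; not the platform's implementation.
class BuildTags {
    constructor() {
        this.tagToBuild = new Map(); // each tag points to exactly one build
    }

    // Called when a build finishes successfully: the tag moves to the new build.
    assign(tag, buildNumber) {
        this.tagToBuild.set(tag, buildNumber);
    }

    // Resolves the value given by the caller: a known tag is translated
    // to its build number, anything else is treated as a build number.
    resolve(tagOrBuildNumber) {
        return this.tagToBuild.get(tagOrBuildNumber) || tagOrBuildNumber;
    }
}

const tags = new BuildTags();
tags.assign('latest', '1.1.7');
tags.assign('beta', '1.2.3');
tags.assign('latest', '1.1.8'); // a newer build of version 1.1 takes over the tag

console.log(tags.resolve('latest')); // 1.1.8
console.log(tags.resolve('beta'));   // 1.2.3
console.log(tags.resolve('1.1.7'));  // 1.1.7
```

Because a Map holds one value per key, assigning a tag to a new build automatically removes it from the previous one, which mirrors the uniqueness rule.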

Base images

Apify provides the following Docker images that can be used as a base for user acts:

Note that all the Apify images are pre-cached on Apify servers in order to speed up the act builds and runs. The source code used to generate the images is available in the apify-actor-docker GitHub repository.

Cache

By default, the build process pulls the latest copies of all necessary Docker images and builds each new layer of the Docker image from scratch. To speed up the build process, the user can enable the use of the cache, either by invoking the build using the Quick run option in the Source tab or by passing the useCache parameter in the API. See API reference for more details.

Run

The act can be invoked in a number of ways. One option is to start the act manually in Console in the app:

The following table describes the run settings:

Build: Tag or number of the build to run (e.g. latest or 1.2.34).
Timeout: Timeout for the act run in seconds. Zero value means there is no timeout.
Memory: Amount of memory allocated for the act run, in megabytes.
Input: Input data for the act. The maximum length is 1M characters.
Content type: Indicates what kind of data is in the input (e.g. application/json).

The owner of the act can specify default values for all the above settings in the Default run configuration section in the app. If the act caller does not specify a particular setting, the default value is used.

The act can also be invoked using the Apify API by sending an HTTP POST request to the Run act API endpoint, such as:

https://api.apify.com/v2/acts/apify~hello-world/runs?token=<YOUR_API_TOKEN>

The act's input and its content type can be passed as a payload of the POST request and additional options can be specified using URL query parameters. For more details, see the Run act section in the API reference.
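The following sketch shows how such a request could be assembled in Node.js. It only builds the request and does not send it. The buildRunRequest() helper is hypothetical, MY_API_TOKEN is a placeholder, and the option names timeout and memory are assumptions used for illustration; consult the API reference for the query parameters that are actually supported.

```javascript
// Sketch only: builds the request for the Run act endpoint without sending it.
function buildRunRequest({ actId, token, input, contentType = 'application/json; charset=utf-8', options = {} }) {
    const url = new URL(`https://api.apify.com/v2/acts/${actId}/runs`);
    url.searchParams.set('token', token);
    // Additional run options are passed as URL query parameters.
    for (const [name, value] of Object.entries(options)) {
        url.searchParams.set(name, String(value));
    }
    return {
        url: url.toString(),
        method: 'POST',
        headers: { 'Content-Type': contentType },
        // The input is sent as the POST payload.
        body: typeof input === 'string' ? input : JSON.stringify(input),
    };
}

const request = buildRunRequest({
    actId: 'apify~hello-world',
    token: 'MY_API_TOKEN',                  // placeholder, not a real token
    input: { hello: 123 },
    options: { timeout: 60, memory: 512 },  // assumed parameter names
});
console.log(request.url);
console.log(request.body);
```

The returned object can be handed to any HTTP client of your choice.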

Acts can also be invoked programmatically from other acts using the call() function provided by the apify NPM package. For example:

const run = await Apify.call('apify/hello-world', { message: 'Hello!' });
console.dir(run.output);

The newly started act runs under the same user account as the initial act and therefore all resources consumed are charged to the same user account. This allows more complex acts to be built using simpler acts built and owned by other users.

Internally, the call() function takes the user's API token from the APIFY_TOKEN environment variable, then it invokes the Run act API endpoint, waits for the act to finish and reads its output using the Get record API endpoint.

Input and output

As demonstrated in the hello world example above, acts can accept input and generate output. Both input and output are stored in a key-value store that is created when the act is started, under the INPUT and OUTPUT keys, respectively. Note that the act can store other values under arbitrary keys, for example crawling results or screenshots of web pages.

The key-value store associated with the act run can be conveniently accessed using the getValue() and setValue() functions provided by the apify NPM package. Internally, these functions read the ID of the key-value store from the APIFY_DEFAULT_KEY_VALUE_STORE_ID environment variable and then access the store using the Apify API. For more details about the key-value stores, go to the Storage section.

The input can be passed to the act either manually in the Console or using a POST payload when running the act using the API. See the Run section for details.

Environment variables

Aside from custom environment variables, the act's process has several environment variables set to provide it with a context:

APIFY_ACT_ID: ID of the act.
APIFY_ACT_RUN_ID: ID of the act run.
APIFY_USER_ID: ID of the user who started the act. Note that it might be different than the owner of the act.
APIFY_TOKEN: The API token of the user who started the act.
APIFY_STARTED_AT: Date when the act was started.
APIFY_TIMEOUT_AT: Date when the act will time out.
APIFY_DEFAULT_KEY_VALUE_STORE_ID: ID of the key-value store where act's input and output data is stored.
APIFY_HEADLESS: Set to '1', which indicates that web browsers should run in headless mode.
APIFY_MEMORY_MBYTES: Indicates the size of memory allocated for the act run, in megabytes. It can be used by acts to optimize their memory usage.

The dates are always in the UTC timezone and are represented in simplified extended ISO format (ISO 8601), e.g. 2017-10-13T14:23:37.281Z

To access environment variables in Node.js, use the process.env object, for example:

console.log(process.env.APIFY_USER_ID);
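Because the dates are ISO 8601 strings, they can be passed directly to the JavaScript Date constructor. The sketch below computes how much time is left before the run times out. The msUntilTimeout() helper is hypothetical, and it assumes that APIFY_TIMEOUT_AT may be missing or empty (for example when the run has no timeout), in which case it returns null.

```javascript
// Sketch only: returns the milliseconds left until the run times out,
// or null if APIFY_TIMEOUT_AT is missing or not a valid date.
function msUntilTimeout(env = process.env, now = new Date()) {
    const timeoutAt = new Date(env.APIFY_TIMEOUT_AT);
    if (Number.isNaN(timeoutAt.getTime())) return null;
    return timeoutAt.getTime() - now.getTime();
}

// Example with explicit values instead of the real environment.
const exampleEnv = { APIFY_TIMEOUT_AT: '2017-10-13T14:28:37.281Z' };
const exampleNow = new Date('2017-10-13T14:24:37.281Z');
console.log(msUntilTimeout(exampleEnv, exampleNow)); // 240000 (4 minutes)
```

Inside a running act, calling msUntilTimeout() without arguments reads the value from process.env.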

Resource limits

Acts run inside a Docker container whose resources are limited. When invoking the act, the caller has to specify the amount of memory allocated for the act; the minimum is 512 MB and the maximum is 8192 MB. Each user has a certain total limit of memory for running acts, based on their subscription plan. For example, free accounts have a limit of 1024 MB. The sum of memory allocated for all running acts and builds needs to fit into this limit, otherwise the user cannot start a new act. See pricing for more details.

The share of CPU is computed automatically from the memory as follows: for each 4096 MB of memory the act gets 1 full CPU core. For other amounts of memory the number of CPU cores is computed fractionally. For example, an act with 1024 MB of memory will have 1/4 of CPU share. Note that the CPU throttling is only applied if the system is under load; if there is free CPU capacity, the acts are not throttled.

The act has a hard disk space limited by twice the amount of memory. For example, an act with 1024 MB of memory will have 2048 MB of disk available.
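The rules above boil down to simple arithmetic, sketched below. The computeRunResources() and fitsIntoLimit() helpers are hypothetical and merely restate the limits from this section in code.

```javascript
// Sketch only: restates the resource rules from this section.
const MIN_MEMORY_MBYTES = 512;
const MAX_MEMORY_MBYTES = 8192;

function computeRunResources(memoryMbytes) {
    if (memoryMbytes < MIN_MEMORY_MBYTES || memoryMbytes > MAX_MEMORY_MBYTES) {
        throw new RangeError(`Memory must be between ${MIN_MEMORY_MBYTES} and ${MAX_MEMORY_MBYTES} MB`);
    }
    return {
        memoryMbytes,
        cpuCores: memoryMbytes / 4096, // 1 full CPU core per 4096 MB
        diskMbytes: memoryMbytes * 2,  // disk is limited to twice the memory
    };
}

// Checks whether a new run fits into the user's total memory limit,
// given the memory of all currently running acts and builds.
function fitsIntoLimit(runningMbytes, requestedMbytes, limitMbytes) {
    const used = runningMbytes.reduce((sum, mbytes) => sum + mbytes, 0);
    return used + requestedMbytes <= limitMbytes;
}

console.log(computeRunResources(1024));
// { memoryMbytes: 1024, cpuCores: 0.25, diskMbytes: 2048 }
console.log(fitsIntoLimit([512], 512, 1024));  // true
console.log(fitsIntoLimit([512], 1024, 1024)); // false
```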

State persistence

Unlike traditional serverless platforms, Actor has no limits on the duration of an act run. However, that means that the act might need to be restarted from time to time, e.g. when the server it is running on is about to be shut down. The acts need to account for this possibility. For short-running acts, the chance of a restart is quite low and the cost of a repeated run is low, therefore the restarts can be ignored. However, for long-running acts a restart might be very costly, and therefore such acts should periodically persist their state, possibly to the key-value store associated with the act run. On start, the acts should first check whether there is some state stored and, if so, continue where they left off.
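The checkpointing pattern can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the processWithCheckpoints() and createMemoryStore() helpers and the STATE key are hypothetical, and the store is assumed to be any object with asynchronous getValue(key) and setValue(key, value) functions that returns an empty value for a missing key. The in-memory store exists only so that the sketch runs locally without the platform.

```javascript
// Sketch only: processes items in order and periodically saves its position.
async function processWithCheckpoints(items, store, handleItem, checkpointEvery = 10) {
    // Resume from the saved position, or start from the beginning.
    const state = (await store.getValue('STATE')) || { nextIndex: 0 };

    for (let i = state.nextIndex; i < items.length; i++) {
        await handleItem(items[i], i);
        state.nextIndex = i + 1;
        // Persist periodically so that a restart repeats only a few items.
        if (state.nextIndex % checkpointEvery === 0) await store.setValue('STATE', state);
    }

    await store.setValue('STATE', state);
    return state;
}

// In-memory stand-in for a key-value store, for local experiments only.
function createMemoryStore() {
    const data = new Map();
    return {
        getValue: async (key) => (data.has(key) ? JSON.parse(data.get(key)) : null),
        setValue: async (key, value) => { data.set(key, JSON.stringify(value)); },
    };
}

processWithCheckpoints([1, 2, 3], createMemoryStore(), async (n) => console.log(`Processing ${n}`))
    .then((state) => console.log(`Finished at index ${state.nextIndex}`));
```

A smaller checkpointEvery value means less repeated work after a restart at the cost of more writes to the store.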

Publishing

The act can be private or public. Private acts can only be accessed and started by their owner, while public acts are shown in the library and can be run by anyone. Each public act has a globally unique identifier that consists of the owner's username and the act name, e.g. apify/hello-world.

To publish your act, go to Settings → Permissions on the act detail page and click the Publish button. You'll need to have a username set, which can be done on the Profile page.

The short act description shown in the library is taken from Settings → Description. Additionally, if the act's source code is hosted in a Git repository or Zip file, you can add a long description in Markdown language to the README.md or README files in the root of the source code directory. To see an example of how it looks, go to apify/crawl-url-list.

IMPORTANT: Note that if your act is public and used by other people, its usage is not charged towards your account. The user running the act is always the one who pays for the computational resources consumed by the act's execution.

Examples

This section provides examples of various features of the Actor platform. All these examples and many more are also available in the library.

Puppeteer

This example demonstrates how to use headless Chrome with Puppeteer to open a web page, determine its dimensions, save a screenshot and print it to PDF. The act can be found in the Apify library as apify/example-puppeteer.

const Apify = require('apify');

Apify.main(async () => {
   const input = await Apify.getValue('INPUT');

   if (!input || !input.url) throw new Error('Invalid input, must be a JSON object with the "url" field!');

   console.log('Launching Puppeteer...');
   const browser = await Apify.launchPuppeteer();

   console.log(`Opening URL: ${input.url}`);
   const page = await browser.newPage();
   await page.goto(input.url);

   // Get the "viewport" of the page, as reported by the page.
   console.log('Determining page dimensions...');
   const dimensions = await page.evaluate(() => ({
       width: document.documentElement.clientWidth,
       height: document.documentElement.clientHeight,
       deviceScaleFactor: window.devicePixelRatio
   }));
   console.log(`Dimension: ${JSON.stringify(dimensions)}`);

   // Grab a screenshot
   console.log('Saving screenshot...');
   const screenshotBuffer = await page.screenshot();
   await Apify.setValue('screenshot.png', screenshotBuffer, { contentType: 'image/png' });

   console.log('Saving PDF snapshot...');
   const pdfBuffer = await page.pdf({ format: 'A4'});
   await Apify.setValue('page.pdf', pdfBuffer, { contentType: 'application/pdf' });

   console.log('Closing Puppeteer...');
   await browser.close();

   console.log('Done.');
   console.log('You can check the output in the key-value store at the following URLs:');
   const storeId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID;
   console.log(`- https://api.apify.com/v2/key-value-stores/${storeId}/records/screenshot.png`);
   console.log(`- https://api.apify.com/v2/key-value-stores/${storeId}/records/page.pdf`);
});

The code above uses the launchPuppeteer() function provided by the apify NPM package. The function launches Puppeteer with several settings that enable it to run in Actor. Note that the act needs to have Base image set to Node.js 8 + Puppeteer on Debian in order to run Puppeteer.

Custom Dockerfile

This example demonstrates how to create an act written in PHP using a custom Dockerfile. For more information, see the Custom Dockerfile section. The Dockerfile is based on the php:7.0-cli Docker image that contains everything needed to run PHP in a terminal.

Apart from the FROM instruction, the Dockerfile contains only two commands. The first one copies the source code into the container and the second one executes main.php.

The act can be found in the Apify library as apify/example-php.

Dockerfile

FROM php:7.0-cli
COPY ./* ./
CMD [ "php", "./main.php" ]

main.php

<?php
print "Starting ...\n";
print "ENV vars:\n";
print_r($_ENV);
print "Fetching http://example.com ...\n";
$exampleComHtml = file_get_contents('http://example.com');
print "Searching for <h1> tag contents ...\n";
preg_match_all('/<h1>(.*?)<\/h1>/', $exampleComHtml, $matches);
print "Found: " . $matches[1][0] . "\n";
print "I am done!\n";

State persistence

This act demonstrates how to persist a state, so that on restart the act can continue where it left off. For more information, see the State persistence section. The act simply counts from one up. In each run it prints one number. Its state (counter position) is stored in a named key-value store called example-counter. You will find it in the Storage section of the app after you run the act.

The act can be found in the Apify library as apify/example-counter.

const Apify = require('apify');

Apify.main(async () => {
    const keyValueStores = Apify.client.keyValueStores;

    // Get store with name 'example-counter'.
    const store = await keyValueStores.getOrCreateStore({
        storeName: 'example-counter',
    });

    // Get counter state record from store.
    const record = await keyValueStores.getRecord({
        key: 'counter',
        storeId: store.id,
    });

    // If there is no such record then start from zero.
    let counter = record ? record.body : 0;

    // Increase counter, print and set as output.
    counter++;
    console.log(`Counter: ${counter}`);
    // Wait for the write to finish so that a failure is not silently ignored.
    await Apify.setValue('OUTPUT', counter);

    // Save increased value back to store.
    await keyValueStores.putRecord({
        storeId: store.id,
        key: 'counter',
        body: counter.toString(), // Record body must be a string or buffer!
    });
});