Actor

mtrunkat/puppeteer-promise-pool-example

  • latest 0.0.11 / 2017-12-18
  • Created 2017-12-18
  • Last modified 2017-12-18
  • grade 3

Description

Example of how to run Puppeteer in parallel using the 'es6-promise-pool' npm package.


API

To run the actor, send an HTTP POST request to:

https://api.apify.com/v2/acts/mtrunkat~puppeteer-promise-pool-example/runs?token=<YOUR_API_TOKEN>

The POST payload will be passed as input for the actor. For more information, read the docs.


Example input

Content type: application/json; charset=utf-8

{ "hello": 123 }
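As a rough sketch of the request described above, the endpoint can be called from Node.js roughly like this. The helper names `buildRunRequest` and `startRun` and the `APIFY_TOKEN` environment variable are assumptions made for illustration, and the sketch assumes a runtime with a global `fetch` (for example Node.js 18 or later) and a JSON response body.

```javascript
// Builds the endpoint URL and fetch options for starting the actor.
// `buildRunRequest` is a hypothetical helper, not part of any Apify SDK.
function buildRunRequest(token, input) {
    const url = 'https://api.apify.com/v2/acts/mtrunkat~puppeteer-promise-pool-example/runs'
        + `?token=${encodeURIComponent(token)}`;

    return {
        url,
        options: {
            method: 'POST',
            headers: { 'Content-Type': 'application/json; charset=utf-8' },
            // The POST payload is passed to the actor as its input.
            body: JSON.stringify(input),
        },
    };
}

// Sends the request. Assumes a global fetch and a JSON response.
async function startRun(token, input) {
    const { url, options } = buildRunRequest(token, input);
    const response = await fetch(url, options);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return response.json();
}

// Usage (not executed here; APIFY_TOKEN is an assumed environment variable):
// startRun(process.env.APIFY_TOKEN, { hello: 123 }).then(console.log);
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and payload without actually starting a run.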

Source code

Based on the apify/actor-node-puppeteer:beta Docker image (see docs).

const Apify = require('apify');
const PromisePool = require('es6-promise-pool');

// How many URLs to process in parallel.
const CONCURRENCY = 5;

// URLs to process.
const URLS = [
    'http://example.com',
    'http://news.ycombinator.com',
    'https://news.ycombinator.com/news?p=2',
    'https://news.ycombinator.com/news?p=3',
    'https://news.ycombinator.com/news?p=4',
    'https://news.ycombinator.com/news?p=5',
    'https://www.reddit.com/',
];

let browser;
let results = [];

// Returns a promise that resolves once Puppeteer has opened the URL,
// evaluated the page content, and closed the page.
const crawlUrl = async (url) => {
    const page = await browser.newPage();

    try {
        console.log(`Opening ${url}`);
        await page.goto(url);

        console.log(`Evaluating ${url}`);
        const result = await page.evaluate(() => {
            return {
                title: document.title,
                url: window.location.href,
            };
        });

        results.push(result);
    } finally {
        // Close the page even if navigation or evaluation fails.
        console.log(`Closing ${url}`);
        await page.close();
    }
};

// Each call removes the last URL from the URLS array and returns the
// crawlUrl(url) promise. Once URLS is empty it returns null, which
// tells the pool that there is no more work.
const promiseProducer = () => {
    const url = URLS.pop();

    return url ? crawlUrl(url) : null;
};

Apify.main(async () => {
    // Start the browser.
    browser = await Apify.launchPuppeteer();

    // Run through all the URLs in a pool with the given concurrency.
    const pool = new PromisePool(promiseProducer, CONCURRENCY);
    await pool.start();

    // Print results.
    console.log('Results:');
    console.log(JSON.stringify(results, null, 2));
    
    await Apify.setValue('OUTPUT', results);
    await browser.close();
});
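To illustrate the producer pattern used above without a browser, here is a minimal sketch of how a promise pool can keep a fixed number of tasks in flight. `runPool` and `fakeCrawl` are hypothetical stand-ins written for this example; this is not the actual implementation of 'es6-promise-pool', only an approximation of the same idea.

```javascript
// Hypothetical stand-in for a promise pool: keeps at most `concurrency`
// promises in flight and asks `producer` for a new one whenever a slot frees up.
function runPool(producer, concurrency) {
    return new Promise((resolve, reject) => {
        let active = 0;
        let exhausted = false;
        let failed = false;

        const fill = () => {
            while (!failed && !exhausted && active < concurrency) {
                const promise = producer();
                if (promise === null) {
                    // The producer signals "no more work" by returning null.
                    exhausted = true;
                    break;
                }
                active++;
                promise.then(() => {
                    active--;
                    fill();
                }, (err) => {
                    failed = true;
                    reject(err);
                });
            }
            if (exhausted && active === 0) resolve();
        };

        fill();
    });
}

// Simulated work in place of crawlUrl: waits briefly, then records the task.
const tasks = ['a', 'b', 'c', 'd', 'e', 'f', 'g'];
const processed = [];
let inFlight = 0;
let maxInFlight = 0;

const fakeCrawl = async (name) => {
    inFlight++;
    maxInFlight = Math.max(maxInFlight, inFlight);
    await new Promise((done) => setTimeout(done, 10));
    processed.push(name);
    inFlight--;
};

const producer = () => {
    const name = tasks.shift();
    return name ? fakeCrawl(name) : null;
};

const demo = runPool(producer, 3).then(() => ({ processed, maxInFlight }));

demo.then((out) => {
    console.log(`Processed ${out.processed.length} tasks, peak concurrency ${out.maxInFlight}`);
});
```

This sketch uses `shift()`, so tasks start in list order, whereas the actor code above uses `pop()` and therefore takes URLs from the end of the array first.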