Actor

mtrunkat/url-list-download-html

  • latest 0.0.3 / 2018-05-10
  • Created 2018-02-16
  • Last modified 2018-09-05
  • grade 12

Description

This actor accepts a list of URLs and downloads the HTML of each page. It takes a single input parameter, "sources" (see the sources parameter of RequestList: https://www.apify.com/docs/sdk/apify-runtime-js/beta#RequestList).


API

To run the actor, send an HTTP POST request to:

https://api.apify.com/v2/acts/mtrunkat~url-list-download-html/runs?token=<YOUR_API_TOKEN>

The POST payload will be passed as input for the actor. For more information, read the docs.
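
As a minimal sketch of the request described above, the helper below only assembles the URL and the options for an HTTP client; it does not send anything. The name buildRunRequest is hypothetical, and the usage note assumes a runtime with a global fetch (for example Node.js 18 or later).

```javascript
// Hypothetical helper (not part of the actor): builds the URL and request
// options needed to start a run of mtrunkat/url-list-download-html.
function buildRunRequest(token, sources) {
    const url = 'https://api.apify.com/v2/acts/mtrunkat~url-list-download-html/runs'
        + `?token=${encodeURIComponent(token)}`;

    const options = {
        method: 'POST',
        // The POST payload becomes the actor's input, so it is sent as JSON.
        headers: { 'Content-Type': 'application/json; charset=utf-8' },
        body: JSON.stringify({ sources }),
    };

    return { url, options };
}
```

With a global fetch available, the result could be used as `const { url, options } = buildRunRequest(token, sources); const response = await fetch(url, options);`. Keeping the request construction separate from the network call makes it easy to inspect the payload before sending it.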


Example input

Content type: application/json; charset=utf-8

{
  "sources": [{
    "requestsFromUrl": "http://example.com/my-url-list.txt"
  }]
}
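
The input can be checked locally before a run is started. The sketch below assumes that each entry in "sources" is an object carrying either a "url" string (a single page) or a "requestsFromUrl" string (a remote text file of URLs, as in the example above); RequestList may accept other shapes, so treat this as an illustration rather than the definitive rule. The function name findInvalidSources is hypothetical.

```javascript
// Hypothetical pre-flight check for the actor input. Returns a list of
// human-readable problems; an empty list means nothing suspicious was found.
function findInvalidSources(sources) {
    if (!Array.isArray(sources)) return ['"sources" must be an array'];

    const problems = [];
    sources.forEach((source, index) => {
        const isObject = source !== null && typeof source === 'object';
        // Assumption: an entry is usable if it names a page or a URL list file.
        const hasUrl = isObject && typeof source.url === 'string';
        const hasListUrl = isObject && typeof source.requestsFromUrl === 'string';
        if (!hasUrl && !hasListUrl) {
            problems.push(`sources[${index}] needs a "url" or "requestsFromUrl" string`);
        }
    });
    return problems;
}
```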

Source code

Based on the apify/actor-node-puppeteer:beta Docker image (see docs).

const Apify = require('apify');

Apify.main(async () => {
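    // Read the input first: getValue() returns null when no input was provided,
    // and destructuring null would throw an unhelpful TypeError.
    const input = await Apify.getValue('INPUT');

    if (!input || !input.sources) throw new Error('input.sources is missing');
    const { sources } = input;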
  
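    // Build the list of requests from the input sources. The state is persisted
    // under the given key so progress can survive a restart of the run.
    const requestList = new Apify.RequestList({
        sources,
        persistStateKey: 'request-list-state',
    });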
    
    await requestList.initialize();
    
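    // Called for every page that loads successfully. Stores the request, a
    // timestamp, and the HTML of the page's <body> element in the dataset.
    const handlePageFunction = async ({ request, page }) => {
        await Apify.pushData({
            request,
            finishedAt: new Date(),
            html: await page.evaluate(() => document.body.outerHTML),
        });
    };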
    
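    // Called once a request has failed for good. Records the failure in the
    // dataset so every input URL has a matching result.
    const handleFailedRequestFunction = async ({ request }) => {
        await Apify.pushData({
            request,
            finishedAt: new Date(),
            isFailed: true,
        });
    };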

    const puppeteerCrawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction,
        handleFailedRequestFunction,
    });

    await puppeteerCrawler.run();
});
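
The code above pushes one dataset item per URL: successful pages carry an html field, while failed ones carry isFailed: true. As a rough sketch of how those items might be post-processed, the hypothetical helper below (summarizeResults is not part of the actor) splits them by outcome. It assumes the items have already been loaded into an array and that each stored request has a url property.

```javascript
// Hypothetical helper: groups dataset items produced by the actor into
// URLs that were downloaded and URLs that failed.
function summarizeResults(items) {
    const succeeded = [];
    const failed = [];

    for (const item of items) {
        // Failed items are marked with isFailed: true and have no html field.
        if (item.isFailed) failed.push(item.request.url);
        else succeeded.push(item.request.url);
    }

    return { succeeded, failed };
}
```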