Act

mtrunkat/xmls-to-dataset

  • Latest build: 0.0.38 (2018-02-09)
  • Created: 2018-02-09
  • Last modified: 2018-02-09
  • Grade: 2

Description

This act loads a list of URLs from INPUT.sources. Each of these links should point to an XML file. The act downloads every file, parses it and saves the result to its default dataset. The groups parameter in INPUT selects which Apify Proxy groups to use.
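
Each item pushed to the dataset contains the downloaded XML parsed to JSON by xml2js. For illustration, a hypothetical two-item feed and the shape it parses into (the exact structure depends on the source files):

const parseString = require('xml2js').parseString;

// With default options xml2js maps attributes to "$", text content
// to "_" and repeated child elements to arrays.
const xml = '<items><item id="1">foo</item><item id="2">bar</item></items>';

parseString(xml, (err, result) => {
    // result => { items: { item: [ { _: 'foo', $: { id: '1' } },
    //                              { _: 'bar', $: { id: '2' } } ] } }
    console.log(JSON.stringify(result));
});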


API

To run the act, send an HTTP POST request to:

https://api.apify.com/v2/acts/mtrunkat~xmls-to-dataset/runs?token=<YOUR_API_TOKEN>

The POST payload will be passed as input for the act. For more information, read the docs.


Example input

Content type: application/json; charset=utf-8

{
  "sources": [{ "requestsFromUrl": "http://some-text-file-with-links-to-xmls.com/testfile.txt" }],
  "groups": ["SHADER"]
}
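
Putting the endpoint and the input together, a minimal sketch of starting a run from Node.js using the same request-promise package as the act itself (the token is a placeholder):

const requestPromise = require('request-promise');

requestPromise({
    method: 'POST',
    url: 'https://api.apify.com/v2/acts/mtrunkat~xmls-to-dataset/runs?token=<YOUR_API_TOKEN>',
    body: {
        sources: [{ requestsFromUrl: 'http://some-text-file-with-links-to-xmls.com/testfile.txt' }],
        groups: ['SHADER'],
    },
    json: true, // serialize the body as JSON and parse the JSON response
}).then(run => console.log('Run started:', run));

Note that sources accepts both { url: '...' } objects and { requestsFromUrl: '...' } links pointing to a plain-text file of URLs (typically one per line).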

Source code

Based on the apify/actor-node-basic:beta Docker image (see docs).

const Apify = require('apify');
const requestPromise = require('request-promise');
const Promise = require('bluebird');
const parseString = require('xml2js').parseString;
const _ = require('underscore');


// With this set, the request package ignores SSL certificate errors.
process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0';

const parseStringPromised = Promise.promisify(parseString);
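// Generates a random string used as a proxy session id - a fresh session
// for each request makes Apify Proxy rotate IP addresses between downloads.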
const randomString = () => Math.random().toString(36).substring(2, 15);

Apify.main(async () => {
    const {
        sources,
        groups = ['DEFAULT'],
        maxConcurrency = 100,
    } = await Apify.getValue('INPUT');
    
    if (!_.isArray(sources)) throw new Error('INPUT.sources must be an array!');
    
    const requestList = new Apify.RequestList({
        sources,
        // Initialize from the previous state if the act was restarted due to an error.
        state: await Apify.getValue('request-list-state'),
    });
    
    await requestList.initialize(); // Load requests.
    
    // Save state of the request list every 5 seconds.
    setInterval(() => Apify.setValue('request-list-state', requestList.getState()), 5000);
    
    const crawler = new Apify.BasicCrawler({
        requestList,
        maxConcurrency,
        // Process each URL: download the file, parse the XML and push the result to the dataset.
        handleRequestFunction: async ({ request }) => {
            const { statusCode, body } = await requestPromise({
                url: request.url,
                resolveWithFullResponse: true, // resolve with the whole response, not just the body
                simple: false, // don't reject on non-2xx status codes; they are checked explicitly below
                proxy: Apify.getApifyProxyUrl({ groups, session: randomString() }),
            });
            
            if (statusCode >= 300) throw new Error(`Request failed with statusCode=${statusCode}`);

            await Apify.pushData({
                data: await parseStringPromised(body),
                request,
            });
        },
        // Save requests that failed too many times so the failures can be inspected later.
        handleFailedRequestFunction: async ({ request }) => {
            await Apify.pushData({
                failed: true,
                request,
            });
        },
    });
    
    await crawler.run();
});
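
Once the run finishes, the parsed results can be read from the run's default dataset. A minimal sketch, assuming the dataset ID from the run object's defaultDatasetId field (both placeholders below must be filled in):

const requestPromise = require('request-promise');

requestPromise({
    url: 'https://api.apify.com/v2/datasets/<DATASET_ID>/items?format=json&token=<YOUR_API_TOKEN>',
    json: true, // parse the returned JSON array of dataset items
}).then(items => console.log(`Fetched ${items.length} items.`));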