Act

jancurn/extract-metadata

  • latest 0.0.8 / 2018-02-09
  • Created 2018-02-08
  • Last modified 2018-02-09
  • grade 4

Description

A small, efficient act that loads a web page, parses its HTML using the Cheerio library, and extracts metadata from the <HEAD> tag, such as the page title, description, and author.


API

To run the act, send an HTTP POST request to:

https://api.apify.com/v2/acts/jancurn~extract-metadata/runs?token=<YOUR_API_TOKEN>

The POST payload will be passed as input for the act. For more information, read the docs.


Example input

Content type: application/json; charset=utf-8

{ "url": "https://www.apify.com/" }
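
As a minimal sketch, the endpoint and the example input above can be combined into a request like the one below. The helper names `buildRunRequest` and `startRun` are hypothetical and not part of any Apify library. The sketch assumes a runtime with a global `fetch` (Node.js 18+ or a browser) and that the API answers with a JSON body.

```javascript
// Hypothetical helper: assembles the POST request described above without sending it.
function buildRunRequest(token, input) {
    return {
        url: `https://api.apify.com/v2/acts/jancurn~extract-metadata/runs?token=${encodeURIComponent(token)}`,
        method: 'POST',
        headers: { 'Content-Type': 'application/json; charset=utf-8' },
        // The payload becomes the act's INPUT
        body: JSON.stringify(input),
    };
}

// Hypothetical helper: sends the request (assumes a global fetch is available).
async function startRun(token, input) {
    const { url, ...options } = buildRunRequest(token, input);
    const response = await fetch(url, options);
    if (!response.ok) {
        throw new Error(`Run request failed with HTTP ${response.status}`);
    }
    // Assumption: the response body is JSON describing the started run
    return response.json();
}

// Usage (replace the placeholder with a real token):
// startRun('<YOUR_API_TOKEN>', { url: 'https://www.apify.com/' }).then(console.log);
```

Keeping the request assembly separate from the network call makes it easy to inspect the URL and payload before anything is sent.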

Source code

Based on the apify/actor-node-basic Docker image (see docs).

const Apify = require('apify');
const request = require('request-promise');
const cheerio = require('cheerio');


Apify.main(async () => {
    // Get input of the act
    const input = await Apify.getValue('INPUT');
    if (!input || typeof input.url !== 'string') {
        throw new Error("Invalid input, it needs to contain 'url' field.");
    }
    
    // Load the web page and extract meta-data
    console.log(`Opening ${input.url}`);
    const html = await request(input.url);
    
    const $ = cheerio.load(html);
    
    // Collect <meta> tags keyed by their name attribute; tags without a name are skipped
    const meta = {};
    $('head meta').each(function () {
        const name = $(this).attr('name');
        const content = $(this).attr('content');
        if (name) meta[name] = content ? content.trim() : null;
    });
    
    const result = {
        url: input.url,
        title: ($('head title').text() || '').trim(),
        meta,
    };

    // Show and save result
    console.log('Result:');
    console.dir(result);
    await Apify.setValue('OUTPUT', result);
});
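
The rules applied by the meta-tag loop above can be illustrated without loading a page. The following is a sketch only: `collectMeta` is a hypothetical helper, and the plain `{ name, content }` records are stand-ins for the attributes Cheerio would read from each `<meta>` element.

```javascript
// Hypothetical helper: applies the act's meta-tag rules to plain
// { name, content } records standing in for Cheerio elements.
function collectMeta(tags) {
    const meta = {};
    for (const { name, content } of tags) {
        // Tags without a name attribute (for example charset-only tags,
        // or tags that carry only a property attribute) are skipped.
        if (!name) continue;
        // Missing or empty content is stored as null; otherwise it is trimmed.
        meta[name] = content ? content.trim() : null;
    }
    return meta;
}

// Illustrative sample data, not taken from a real page
const sample = collectMeta([
    { name: 'description', content: '  Example page  ' },
    { name: 'author' },
    { content: 'text/html' },
    { name: 'description', content: 'Later duplicate' },
]);
console.log(sample);
```

Because the result is a plain object keyed by name, a later tag with the same name overwrites an earlier one, so `sample.description` ends up as `'Later duplicate'` and `sample.author` as `null`.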