Actor

mtrunkat/article-text-extractor

  • Builds
  • latest 0.0.17 / 2018-03-15
  • Created 2018-03-15
  • Last modified 2018-09-05
  • grade 17

Description

Extracts the article text and other metadata from a given URL. Uses https://github.com/ageitgey/node-unfluff, which is a Node.js implementation of https://github.com/grangier/python-goose.


API

To run the actor, send an HTTP POST request to:

https://api.apify.com/v2/acts/mtrunkat~article-text-extractor/runs?token=<YOUR_API_TOKEN>

The POST payload will be passed as input for the actor. For more information, read the docs.


Example input

Content type: application/json; charset=utf-8

{ "url": "https://techcrunch.com/2018/03/15/blue-vision-labs-which-builds-collaborative-ar-emerges-from-stealth-with-14-5m-led-by-gv/" }
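
As a rough illustration of the request described above, the sketch below builds and sends the POST call from Node.js. It assumes Node.js 18 or newer (for the global fetch); the helper names buildRunRequest and startRun and the APIFY_TOKEN environment variable are hypothetical and not part of the actor.

```javascript
// Sketch only: start a run of the actor through the Apify API.
// Assumes Node.js 18+ (global fetch). Helper names are illustrative.

const ACTOR_RUNS_URL = 'https://api.apify.com/v2/acts/mtrunkat~article-text-extractor/runs';

// Builds the URL and the fetch options for the POST request.
function buildRunRequest(token, input) {
    const url = `${ACTOR_RUNS_URL}?token=${encodeURIComponent(token)}`;
    const options = {
        method: 'POST',
        headers: { 'Content-Type': 'application/json; charset=utf-8' },
        // The JSON payload becomes the actor's INPUT.
        body: JSON.stringify(input),
    };
    return { url, options };
}

// Sends the request and returns the parsed JSON response.
async function startRun(token, input) {
    const { url, options } = buildRunRequest(token, input);
    const response = await fetch(url, options);
    if (!response.ok) {
        throw new Error(`Run request failed with status ${response.status}`);
    }
    return response.json();
}

// Usage (performs a real network call, so it is left commented out):
// startRun(process.env.APIFY_TOKEN, { url: 'https://example.com/article' })
//     .then((run) => console.log(run));
```

Keeping the request construction separate from the network call makes it easy to inspect the exact URL and payload before anything is sent.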

Source code

Based on the apify/actor-node-chrome Docker image (see docs).

const Apify = require('apify');
const extractor = require('unfluff');

Apify.main(async () => {
    // Read the actor input and make sure it contains the article URL.
    const input = await Apify.getValue('INPUT');
    if (!input || !input.url) throw new Error('INPUT.url must be provided!');
    const { url } = input;

    console.log('Opening browser ...');
    const browser = await Apify.launchPuppeteer();

    console.log('Loading url ...');
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // Serialize the DOM as loaded by the browser.
    const html = await page.evaluate(() => document.documentElement.outerHTML);

    // The HTML is all we need, so the browser can be closed now.
    await browser.close();

    // Keep the raw HTML next to the result for debugging.
    await Apify.setValue('page.html', html, { contentType: 'text/html' });

    console.log('Extracting article data and saving results to key-value store ...');
    await Apify.setValue('OUTPUT', extractor(html));

    console.log('Done!');
});
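
The code above stores two records in the run's key-value store: OUTPUT (the unfluff result) and page.html (the raw HTML). The following sketch shows one way they might be read back once the run has finished. It assumes Node.js 18 or newer, the Apify API v2 endpoint for fetching a key-value store record, and a store ID taken from the run's defaultKeyValueStoreId field; the helper names recordUrl, summarize, and fetchOutput are hypothetical, and the title, author, and text fields follow the node-unfluff README.

```javascript
// Sketch only: read the records written by the actor.
// Assumes Node.js 18+ (global fetch). Helper names are illustrative.

// Builds the URL of a single record in a key-value store.
function recordUrl(storeId, key, token) {
    const base = 'https://api.apify.com/v2/key-value-stores';
    return `${base}/${encodeURIComponent(storeId)}/records/${encodeURIComponent(key)}`
        + `?token=${encodeURIComponent(token)}`;
}

// Picks a few fields from the unfluff result stored under OUTPUT.
// Assumes `title` and `text` are strings and `author` is an array.
function summarize(output) {
    const text = output.text || '';
    return {
        title: output.title || null,
        author: output.author || [],
        wordCount: text.split(/\s+/).filter(Boolean).length,
    };
}

// Downloads and parses the OUTPUT record.
async function fetchOutput(storeId, token) {
    const response = await fetch(recordUrl(storeId, 'OUTPUT', token));
    if (!response.ok) {
        throw new Error(`Fetching OUTPUT failed with status ${response.status}`);
    }
    return response.json();
}

// Usage (performs a real network call, so it is left commented out):
// fetchOutput('<STORE_ID>', process.env.APIFY_TOKEN)
//     .then((output) => console.log(summarize(output)));
```

The same recordUrl helper works for the page.html record, which is returned as text/html rather than JSON.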