Article Text Extractor

mtrunkat/article-text-extractor

Extracts the article text and other metadata from a given URL. Uses https://github.com/ageitgey/node-unfluff, which is a Node.js implementation of https://github.com/grangier/python-goose.

Dockerfile

# Dockerfile contains instructions how to build a Docker image that
# will contain all the code and configuration needed to run your actor.
# For a full Dockerfile reference,
# see https://docs.docker.com/engine/reference/builder/

# First, specify the base Docker image. Apify provides the following
# base images for your convenience:
#  apify/actor-node-basic (Node.js 10 on Alpine Linux, small and fast)
#  apify/actor-node-chrome (Node.js 10 + Chrome on Debian)
#  apify/actor-node-chrome-xvfb (Node.js 10 + Chrome + Xvfb on Debian)
# For more information, see https://apify.com/docs/actor#base-images
# Note that you can use any other image from Docker Hub.
FROM apify/actor-node-chrome

# Second, copy just package.json since it should be the only file
# that affects NPM install in the next step
COPY package.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && npm list \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, quick build will be really fast
# for most source file changes.
COPY . ./

# Optionally, specify how to launch the source code of your actor.
# By default, Apify's base Docker images define the CMD instruction
# that runs the source code using the command specified
# in the "scripts.start" section of the package.json file.
# In short, the instruction looks something like this:
# CMD npm start

INPUT_SCHEMA.json

{
    "title": "Article text extractor input",
    "description": "",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "Article URL",
            "type": "string",
            "description": "Enter the URL of the article you want to extract data from.",
            "prefill": "https://www.bbc.com/news/world-asia-china-48659073",
            "editor": "textfield"
        }
    },
    "required": ["url"]
}
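
The schema above declares a single required string property, `url`. As a minimal sketch of what that constraint means for callers, the hypothetical helper below (`validateInput` is not part of the actor) checks an input object before it is passed on. The additional checks that the value parses as a URL and uses http or https are assumptions that go beyond what the schema itself enforces.

```javascript
// Hypothetical helper, not part of the actor's source code.
// Checks an input object against the constraints in INPUT_SCHEMA.json
// and returns the normalized URL string.
function validateInput(input) {
    if (input === null || typeof input !== 'object') {
        throw new Error('Input must be an object with a "url" property.');
    }

    const { url } = input;
    // The schema marks "url" as required and of type "string".
    if (typeof url !== 'string' || url.trim() === '') {
        throw new Error('Input property "url" is required and must be a non-empty string.');
    }

    // Assumption: the value should also parse as an absolute URL.
    let parsed;
    try {
        parsed = new URL(url.trim());
    } catch (err) {
        throw new Error(`Input property "url" is not a valid URL: ${url}`);
    }

    // Assumption: only http(s) pages make sense for a browser-based extractor.
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        throw new Error(`Unsupported protocol in "url": ${parsed.protocol}`);
    }

    return parsed.href;
}
```

Calling `validateInput({ url: 'https://www.bbc.com/news/world-asia-china-48659073' })` returns the URL unchanged, while an empty object or a non-http URL throws before any browser is launched.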

main.js

const Apify = require('apify');
const extractor = require('unfluff');

Apify.main(async () => {
    // Guard against a missing INPUT record, which would otherwise break destructuring.
    const { url } = (await Apify.getValue('INPUT')) || {};

    if (!url) throw new Error('INPUT.url must be provided!');

    console.log('Opening browser ...');
    const browser = await Apify.launchPuppeteer();

    try {
        console.log('Loading url ...');
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        const html = await page.evaluate(() => document.documentElement.outerHTML);

        // Store the raw HTML next to the result so the extraction can be inspected later.
        await Apify.setValue('page.html', html, { contentType: 'text/html' });

        console.log('Extracting article data and saving results to key-value store ...');
        await Apify.setValue('OUTPUT', extractor(html));
    } finally {
        // Always release the browser, even if loading or extraction fails.
        await browser.close();
    }

    console.log('Done!');
});
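
Since main.js reads `url` from the INPUT record and writes the result to the OUTPUT record, the actor can in principle be called over HTTP. The sketch below only builds such a request and does not send it. It assumes the Apify API v2 `run-sync` endpoint and the `username~actor-name` ID format; `buildRunSyncRequest` and the `MY_TOKEN` placeholder are hypothetical names introduced for illustration.

```javascript
// Assumption: Apify API v2 base URL and the "run-sync" endpoint, which runs
// an actor with the POSTed input and responds with its OUTPUT record.
const API_BASE = 'https://api.apify.com/v2';
// Assumption: the actor ID is the store name with "/" replaced by "~".
const ACTOR_ID = 'mtrunkat~article-text-extractor';

// Hypothetical helper: builds the endpoint URL and request options
// for a single article URL. It performs no network traffic itself.
function buildRunSyncRequest(articleUrl, token) {
    const endpoint = new URL(`${API_BASE}/acts/${ACTOR_ID}/run-sync`);
    endpoint.searchParams.set('token', token);

    return {
        endpoint: endpoint.href,
        options: {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            // The body matches INPUT_SCHEMA.json: one required "url" property.
            body: JSON.stringify({ url: articleUrl }),
        },
    };
}
```

The returned `endpoint` and `options` can be passed to any HTTP client, for example the global `fetch` available in recent Node.js versions. Keeping request construction separate from sending makes it easy to inspect the exact URL and body before spending platform resources on a run.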

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^0.14.15",
        "request-promise": "latest",
        "unfluff": "latest"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}

Developer

Marek Trunkát

Actor metrics

17 monthly users
98.6% runs succeeded
0.0 days response time
Created in Mar 2018
Modified 7 months ago

Categories

News

Twitter Scraper

quacker/twitter-scraper

Scrape tweets from any Twitter user profile. Top Twitter API alternative to scrape Twitter hashtags, threads, replies, followers, images, videos, statistics, and Twitter history. Export scraped data, run the scraper via API, schedule and monitor runs or integrate with other tools.

Quacker

24.1k

Google Trends Scraper

emastra/google-trends-scraper

Scrape data from Google Trends by search terms or URLs. Specify locations, define time ranges, select categories to get interest by subregion and over time, related queries and topics, and more. Export scraped data, run the scraper via API, schedule and monitor runs, or integrate with other tools.

Emiliano Mastragostino

3.2k

Twitter URL Scraper

quacker/twitter-url-scraper

Copy any Twitter URL and extract Twitter usernames, profile photos, follower count, tweets, hashtags, favorite count, and more. Export scraped datasets, run the scraper via API, schedule and monitor runs or integrate with other tools.

Quacker

4.3k

Smart Article Extractor

lukaskrivka/article-extractor-smart

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.

Lukáš Křivka

3.2k

Reddit Scraper Lite

trudax/reddit-scraper-lite

Pay Per Result, unlimited Reddit web scraper to crawl posts, comments, communities, and users without login. Limit web scraping by number of posts or items and extract all data in a dataset in multiple formats.

Gustavo Rudiger

2.9k

Transfermarkt Scraper

curious_coder/transfermarkt

⚽ Use this free tool as an API for the Transfermarkt website. Scrape and extract data from competition, club or player pages, or almost any Transfermarkt page. Download your data as HTML table, JSON, CSV, Excel, XML, and RSS feed.

Curious Coder

2.4k

Dun & Bradstreet Scraper

epctex/dnb-scraper

Effortlessly extract valuable company information, financial projections, industry insights, and more from the extensive Dun & Bradstreet commercial database. Dive deep into the D&B Data Cloud, Business Directory, articles, companies, and industries with customized search terms.

epctex

1.6k

Dark Web Scraper

epctex/darkweb-scraper

Uncover valuable insights with our Dark Web Scraper. Extract sensitive data, including crypto wallets, API keys, emails, phone numbers, and more, from the depths of the Dark Web. You can specify search terms, and customize and retrieve OSINT data out of the box.

epctex

865