Metadata Extractor

jancurn/extract-metadata

A small, efficient actor that loads a web page, parses its HTML using the Cheerio library, and extracts metadata from the <head> element, such as the page title, description, and author.

The actor takes a list of web page URLs as input, loads the HTML of each page, and extracts the metadata from it. The result is stored as JSON in the default dataset.

For example, for https://www.apify.com, the JSON result looks as follows:

{
    "url": "https://www.apify.com/",
    "title": "Web Scraping, Data Extraction and Automation · Apify",
    "meta": {
        "X-UA-Compatible": "IE=edge,chrome=1",
        "viewport": "width=device-width,minimum-scale=1,initial-scale=1",
        "copyright": "Copyright&copy; 2019 Apify Technologies s.r.o. All rights reserved.",
        "keywords": "web scraper, web crawler, scraping, data extraction, API",
        "robots": "index,follow",
        "referrer": "origin",
        "googlebot": "index,follow",
        "description": "Apify extracts data from websites, crawls lists of URLs and automates workflows on the web. Turn any website into an API in a few minutes!",
        "twitter:card": "summary_large_image",
        "twitter:creator": "@apify",
        "fb:app_id": "1636933253245869",
        "og:url": "https://apify.com/",
        "og:type": "website",
        "og:title": "Web Scraping, Data Extraction and Automation · Apify",
        "og:description": "Apify extracts data from websites, crawls lists of URLs and automates workflows on the web. Turn any website into an API in a few minutes!",
        "og:image": "https://apify.com/img/og-image.png",
        "og:image:alt": "Apify",
        "og:image:width": "1200",
        "og:image:height": "630",
        "og:locale": "en_IE",
        "og:site_name": "Apify",
        "next-head-count": "19"
    }
}
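
To illustrate how a result of this shape can be produced, here is a minimal sketch in Python. It is not the actor's implementation: the actor uses the Cheerio library on Node.js, while this sketch swaps in Python's standard html.parser module. The names HeadMetadataParser and extract_metadata, the sample HTML, and the rule that a meta tag is keyed by its name, property, or http-equiv attribute are all assumptions inferred from the example output above.

```python
import json
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Hypothetical helper: collects <title> text and <meta> key/content pairs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = None
        self.meta = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Assumption: key each tag by name, then property (Open Graph),
            # then http-equiv, and skip tags that have no content attribute.
            key = (attributes.get("name")
                   or attributes.get("property")
                   or attributes.get("http-equiv"))
            content = attributes.get("content")
            if key and content is not None:
                self.meta[key] = content

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        # Title text may arrive in several chunks, so collect and join them.
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(url, html):
    """Return a dict with the same top-level keys as the example result."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title, "meta": parser.meta}


# Made-up sample page built from a few values of the example output.
sample_html = """<html><head>
<title>Web Scraping, Data Extraction and Automation · Apify</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="robots" content="index,follow">
<meta property="og:type" content="website">
</head><body></body></html>"""

result = extract_metadata("https://www.apify.com/", sample_html)
print(json.dumps(result, indent=4, ensure_ascii=False))
```

The sketch parses an HTML string that is already in memory; downloading each URL from the input list and writing the result to a dataset are left out. Note that html.parser decodes character references in attribute values, so its output can differ from the example above for entities such as &copy;.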

Developer

Jan Čurn

Actor metrics

37 monthly users
66.9% runs succeeded
0.0 days response time
Created in Feb 2018
Modified 7 months ago

Categories

Developer tools

Web Scraper

apify/web-scraper

Crawls arbitrary websites using the Chrome browser and extracts data from pages using provided JavaScript code. The actor supports both recursive crawling and lists of URLs and automatically manages concurrency for maximum performance. This is Apify's basic tool for web crawling and scraping.

Apify

61.5k

Website Content Crawler

apify/website-content-crawler

Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

Apify

12.3k

Cheerio Scraper

apify/cheerio-scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

Apify

4.1k

Puppeteer Scraper

apify/puppeteer-scraper

Crawls websites with headless Chrome and the Puppeteer library using provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

Apify

3.1k

Merge, Dedup & Transform Datasets

lukaskrivka/dedup-datasets

The ultimate dataset processor. Extremely fast merging, deduplications & transformations all in a single run.

Lukáš Křivka

1.6k

Actor fail manager

lukaskrivka/actor-fail-manager

Automatically triggered on a failed run to analyze if the run should be resurrected and to create an error report for the author.

Lukáš Křivka

2.4k

BeautifulSoup Scraper

apify/beautifulsoup-scraper

Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.

Apify

587

Website Screenshot Generator

apify/screenshot-url

Create a screenshot of a website based on a specified URL. The screenshot is stored as the output in a key-value store. It can be used to monitor web changes regularly after setting up the scheduler.

Apify

2.3k

Anti Captcha Recaptcha

petr_cermak/anti-captcha-recaptcha

🧰 Actor for solving Google reCAPTCHA using the anti-captcha.com service. You need to have an anti-captcha subscription.

Petr Cermak

1.3k

Page Scraping Analyzer

apify/page-analyzer

Analyzes a web page to figure out the best way to scrape its data. Provide a URL and the data points to find, and get back a detailed dashboard showing how the data can be scraped. Works with initial and rendered HTML, JavaScript variables, and dynamically loaded data.

Apify