Pricing

Pay per usage

Puppeteer Scraper

Crawls websites with headless Chrome and the Puppeteer library using provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

Rating

5.0

(20)

Developer

Apify

Actor stats

299

Bookmarked

15K

Total users

1.1K

Monthly active users

2 days

Issues response

4 days ago

Last modified

Cost of usage

You can find the average usage cost for this Actor on the pricing page under the Which plan do I need? section. Cheerio Scraper is equivalent to Simple HTML pages while Web Scraper, Puppeteer Scraper and Playwright Scraper are equivalent to Full web pages. These cost estimates are based on averages and might be lower or higher depending on how heavy the pages you scrape are.

Usage

To get started with Puppeteer Scraper, you only need a few things. First, with Start URLs, tell the scraper which web pages it should load. Then, tell it how to handle each request and extract data from each page.

The scraper starts by loading pages specified in the Start URLs input setting. You can make the scraper follow page links on the fly by setting a Link selector, Glob Patterns and/or Pseudo-URLs to tell the scraper which links it should add to the crawler's request queue. This is useful for the recursive crawling of entire websites (e.g. finding all products available in an online store).

To tell the scraper how to handle requests and extract data, you need to provide a Page function, and optionally arrays of Pre-navigation hooks and Post-navigation hooks. This is JavaScript code that is executed in the Node.js environment. Since the scraper uses the full-featured Chromium browser, client-side logic that needs to run within the context of the web page can be executed using the page object within the Page function's context.

In summary, Puppeteer Scraper works as follows:

1. Adds each URL from Start URLs to the request queue.
2. For each request:
   - Evaluates all hooks in Pre-navigation hooks
   - Executes the Page function on the loaded page
   - Optionally, finds all links from the page using Link selector. If a link matches any of the Glob Patterns and/or Pseudo-URLs and has not yet been requested, it is added to the queue.
   - Evaluates Post-navigation hooks
3. If there are more items in the queue, repeats step 2. Otherwise, finishes the crawl.
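
The loop summarized above can be modeled with a small in-memory sketch. This is only an illustration of the control flow, not the Actor's actual code: the `crawl` function and its `loadPage` and `shouldEnqueue` parameters are hypothetical stand-ins for the browser and for the Link selector plus Glob Patterns/Pseudo-URLs filter.

```javascript
// Simplified model of the crawl loop. All names here are illustrative assumptions.
async function crawl({
    startUrls,
    preNavigationHooks = [],
    postNavigationHooks = [],
    pageFunction,
    loadPage, // hypothetical: resolves to { links: string[] } for a URL
    shouldEnqueue, // hypothetical: stands in for the glob / pseudo-URL filter
}) {
    const queue = [...startUrls]; // step 1: the request queue starts with the Start URLs
    const seen = new Set(startUrls); // URLs that have already been requested
    const results = [];

    // step 3: keep going while the queue is not empty
    while (queue.length > 0) {
        const url = queue.shift();
        const context = { request: { url } };

        for (const hook of preNavigationHooks) await hook(context);

        const page = await loadPage(url);
        results.push(await pageFunction({ ...context, page }));

        // Enqueue only links that pass the filter and were not requested before.
        for (const link of page.links) {
            if (shouldEnqueue(link) && !seen.has(link)) {
                seen.add(link);
                queue.push(link);
            }
        }

        for (const hook of postNavigationHooks) await hook(context);
    }
    return results;
}
```

The `seen` set is what keeps a recursive crawl from revisiting a page that is linked from several places.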

Puppeteer Scraper has a number of other configuration settings to improve performance, set cookies for login to websites, mask the web browser, etc. See Advanced configuration below for the complete list of settings.

Limitations

The Actor employs a fully-featured Chromium web browser, which is resource-intensive and might be overkill for websites that do not render their content dynamically using client-side JavaScript. To achieve better performance when scraping such sites, you might prefer to use Cheerio Scraper, which downloads and processes raw HTML pages without the overhead of a web browser.

For non-seasoned developers, Puppeteer Scraper may be too complex. For a simpler setup process check out Web Scraper, which also uses Puppeteer under the hood.

Input Configuration

On input, the Puppeteer Scraper Actor accepts a number of configuration settings. These can be entered either manually in the user interface in Apify Console, or programmatically in a JSON object using the Apify API. For a complete list of input fields and their types, please see the outline of the Actor's Input-schema.

Start URLs

The Start URLs (startUrls) field represents the initial list of URLs of pages that the scraper will visit. You can either enter these URLs manually one by one, upload them in a CSV file or link URLs from a Google Sheet document. Note that each URL must start with either an http:// or https:// protocol prefix.

The scraper supports adding new URLs to scrape on the fly, either using the Link selector and Glob Patterns/Pseudo-URLs options, or by calling await context.enqueueRequest() inside the Page function.

Optionally, each URL can be associated with custom user data - a JSON object that can be referenced from your JavaScript code in Page function under context.request.userData. This is useful for determining which start URL is currently loaded, allowing the ability to perform some page-specific actions. For example, when crawling an online store, you might want to perform different actions on a page listing the products vs. a product detail page. For details, refer to Web scraping tutorial within the Apify documentation.
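
As a minimal sketch of this idea, the Page function below branches on the user data attached to the request. The `label` key, its `'LIST'` and `'DETAIL'` values, the `.product` selector and the `{ url, userData }` shape of the start URL entries are assumptions made for this example, not values prescribed by the scraper.

```javascript
// Hypothetical Start URLs, each tagged with custom user data.
const startUrls = [
    { url: 'https://www.example.com/products', userData: { label: 'LIST' } },
    { url: 'https://www.example.com/products/1', userData: { label: 'DETAIL' } },
];

// Page function that performs a different action depending on the page type.
async function pageFunction(context) {
    const { request, page } = context;
    const { label } = request.userData;

    if (label === 'LIST') {
        // On a listing page, count the product tiles ('.product' is a made-up selector).
        const products = await page.$$('.product');
        return { type: 'list', url: request.url, productCount: products.length };
    }

    // Anything else is treated as a product detail page.
    return { type: 'detail', url: request.url, title: await page.title() };
}
```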

Link selector

The Link selector (linkSelector) field contains a CSS selector that is used to find links to other web pages (items with href attributes, e.g. <div class="my-class" href="...">).

On every page loaded, the scraper looks for all links matching Link selector, and checks that the target URL matches one of the Glob Patterns/Pseudo-URLs. If it is a match, it then adds the URL to the request queue so that it's loaded by the scraper later on.

By default, new scrapers are created with the following selector that matches all links on any page:

a[href]

If Link selector is empty, the page links are ignored, and the scraper only loads pages that were specified in Start URLs or that were manually added to the request queue by calling await context.enqueueRequest() in Page function.

Glob Patterns

The Glob Patterns (globs) field specifies which types of URLs found by Link selector should be added to the request queue.

A glob pattern is simply a string with wildcard characters.

For example, a glob pattern http://www.example.com/pages/**/* will match all the following URLs:

http://www.example.com/pages/deeper-level/page
http://www.example.com/pages/my-awesome-page
http://www.example.com/pages/something

Note that you don't need to use the Glob Patterns setting at all, because you can completely control which pages the scraper will access by calling await context.enqueueRequest() from the Page function.
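
For orientation, here is a sketch of how Start URLs, Link selector and Glob Patterns might be combined in an input object passed through the API. The field names `startUrls`, `linkSelector` and `globs` come from this document; the `{ url: ... }` and `{ glob: ... }` entry shapes are an assumption, so check the Actor's input schema before relying on them.

```javascript
// Hypothetical input object; entry shapes are assumed, field names are from the text above.
const input = {
    // Where the crawl starts.
    startUrls: [{ url: 'http://www.example.com/pages/' }],
    // Consider every link on each loaded page...
    linkSelector: 'a[href]',
    // ...but only enqueue those that match this glob pattern.
    globs: [{ glob: 'http://www.example.com/pages/**/*' }],
};

// The API expects JSON, so the object is serialized before sending.
const body = JSON.stringify(input, null, 2);
console.log(body);
```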

Pseudo URLs

The Pseudo-URLs (pseudoUrls) field specifies which types of URLs found by Link selector should be added to the request queue.

A pseudo-URL is simply a URL with special directives enclosed in [] brackets. Currently, the only supported directive is [regexp], which defines a JavaScript-style regular expression to match against the URL.

For example, a pseudo-URL http://www.example.com/pages/[(\w|-)*] will match all the following URLs:

http://www.example.com/pages/
http://www.example.com/pages/my-awesome-page
http://www.example.com/pages/something

If either "[" or "]" is part of the normal query string, the symbol must be encoded as [\x5B] or [\x5D], respectively. For example, the following pseudo-URL:

http://www.example.com/search?do[\x5B]load[\x5D]=1

will match the URL:

http://www.example.com/search?do[load]=1

Optionally, each pseudo-URL can be associated with user data that can be referenced from your Page function using context.request.label to determine which kind of page is currently loaded in the browser.

Note that you don't need to use the Pseudo-URLs setting at all, because you can completely control which pages the scraper will access by calling await context.enqueueRequest() from the Page function.
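
To make the matching rules concrete, the sketch below turns a pseudo-URL into a JavaScript regular expression: text outside brackets is matched literally, and the content of each [] directive is used as a regular expression. The `pseudoUrlToRegExp` helper is hypothetical and simplified; it is not the scraper's own implementation.

```javascript
// Simplified, illustrative conversion of a pseudo-URL to a RegExp.
function pseudoUrlToRegExp(pseudoUrl) {
    let pattern = '^';
    let depth = 0; // how many [ ] levels we are currently inside

    for (const ch of pseudoUrl) {
        if (ch === '[') {
            depth += 1;
            // The outermost "[" opens a directive; nested ones belong to the regex itself.
            pattern += depth === 1 ? '(' : ch;
        } else if (ch === ']' && depth > 0) {
            depth -= 1;
            pattern += depth === 0 ? ')' : ch;
        } else if (depth > 0) {
            pattern += ch; // inside a directive: copy the regular expression verbatim
        } else {
            // Outside a directive: escape regex metacharacters so they match literally.
            pattern += ch.replace(/[.*+?^${}()|[\]\\\/]/, '\\$&');
        }
    }
    return new RegExp(`${pattern}$`);
}

// Backslashes are doubled here only because these are JavaScript string literals.
const pages = pseudoUrlToRegExp('http://www.example.com/pages/[(\\w|-)*]');
console.log(pages.test('http://www.example.com/pages/my-awesome-page')); // true

const search = pseudoUrlToRegExp('http://www.example.com/search?do[\\x5B]load[\\x5D]=1');
console.log(search.test('http://www.example.com/search?do[load]=1')); // true
```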

Clickable elements selector

For pages where the links you want to add to the crawler's request queue aren't included in elements with href attributes, you can pass a CSS Selector to the Clickable elements selector. This CSS selector should match elements that lead to the URL you want to queue up.

The scraper will mouse click the specified CSS selector after the page function finishes. Any triggered requests, navigations, or open tabs will be intercepted, and the target URLs will be filtered using Globs and/or Pseudo URLs. Finally, these filtered URLs will be added to the request queue. Leave this field empty to prevent the scraper from clicking in the page.

It's important to note that using this setting can impact performance.

Page function

Page function context as it appears within Page function:

const context = {
    // USEFUL DATA
    input, // Input data in JSON format.
    env, // Contains information about the run, such as actorId and runId.
    customData, // Value of the 'Custom data' scraper option.

    // EXPOSED OBJECTS
    page, // Puppeteer.Page object.
    request, // Crawlee.Request object.
    response, // Response object holding the status code and headers.
    session, // Reference to the currently used session.
    proxyInfo, // Object holding the url and other information about currently used Proxy.
    crawler, // Reference to the crawler object, with access to `browserPool`, `autoscaledPool`, and more.
    globalStore, // Represents an in memory store that can be used to share data across pageFunction invocations.
    log, // Reference to Crawlee.utils.log.
    Actor, // Reference to the Actor class of Apify SDK.
    Apify, // Alias to the Actor class for back compatibility.

    // EXPOSED FUNCTIONS
    setValue, // Reference to the Actor.setValue() function.
    getValue, // Reference to the Actor.getValue() function.
    saveSnapshot, // Saves a screenshot and full HTML of the current page to the key value store.
    skipLinks, // Prevents enqueueing more links via Glob patterns/Pseudo URLs on the current page.
    enqueueRequest, // Adds a page to the request queue.

    // PUPPETEER CONTEXT-AWARE UTILITY FUNCTIONS
    injectJQuery, // Injects the jQuery library into a Puppeteer page.
    sendRequest, // Sends request using got-scraping.
    parseWithCheerio, // Returns a Cheerio handle for page.content(), allowing you to work with the data the same way as with CheerioCrawler.
};
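
A minimal Page function using a few of these context properties might look like the sketch below. It only reads the page title; the shape of the returned object is an arbitrary choice for this example.

```javascript
// Minimal Page function: runs in Node.js and drives the browser through `page`.
async function pageFunction(context) {
    const { page, request, log } = context;

    // page.title() is a Puppeteer Page method that resolves to the document title.
    const title = await page.title();
    log.info(`Scraped "${title}" from ${request.url}`);

    // The returned object becomes one record in the dataset.
    return { url: request.url, title };
}
```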

`input`

Type	Arguments	Returns
Object	-	Input object

The Actor's input as it was received from the UI. Each pageFunction invocation gets a fresh copy. Note that the Actor's input cannot be modified by changing the values in this object.

`env`

Type	Arguments	Returns
Object	-	Return value of `Actor.getEnv()`

A map of all the relevant environment variables that you may want to use.

`customData`

Type	Arguments	Returns
Object	-	Custom data object

Since the input UI is fixed, it cannot offer a dedicated field for every use case. If you need to pass arbitrary data to the scraper, use the Custom data input field within Advanced configuration; its contents will be available under the customData context key as an object.

`page`

Type	Arguments	Returns
Object	-	Puppeteer Page object

This is a reference to the Puppeteer Page object, which enables you to use the full power of Puppeteer in your Page functions. If you are not familiar with the Page API already, you can refer to their documentation.

`request`

Type	Arguments	Returns
Object	-	Apify Request object

An object with metadata about the currently crawled page, such as its URL, headers, and the number of retries.

const request = {
    id,
    url,
    loadedUrl,
    uniqueKey,
    method,
    payload,
    noRetry,
    retryCount,
    errorMessages,
    headers,
    userData,
    handledAt
}

See the Request class for a preview of the structure and full documentation.

`response`

Type	Arguments	Returns
Object	-	Response object

The response object is produced by Puppeteer. Currently, we only pass the response's HTTP status code and headers to the response object.

`session`

Type	Arguments	Returns
Object	-	Session object

Reference to the currently used session. See the official documentation for more information.

`proxyInfo`

Type	Arguments	Returns
Object	-	ProxyInfo object

Object holding the url and other information about currently used Proxy. See the official documentation for more information.

`crawler`

Type	Arguments	Returns
Object	-	PuppeteerCrawler object

To access the current AutoscaledPool or BrowserPool instance, we can use the crawler object. This object includes the following properties:

const crawler = {
    stats,
    requestList,
    requestQueue,
    sessionPool,
    proxyConfiguration,
    browserPool,
    autoscaledPool
}

Refer to the official documentation for more information.

`globalStore`

Type	Arguments	Returns
Object	-	Global store contents

globalStore represents an instance of a very simple in-memory store that is not scoped to the individual pageFunction invocation. This enables you to easily share global data such as API responses and tokens between all requests. Since the stored data needs to cross from the browser to the Node.js process, it must be formatted into JSON stringifiable objects. You cannot store DOM objects, functions, circular objects, etc.

`log`

Type	Arguments	Returns
Object	-	Crawlee.utils.log object

This should be used instead of JavaScript's built-in console.log when logging in the Node.js context, as it automatically color-tags your logs and lets you toggle the visibility of log messages using options such as Debug log in Advanced configuration.

The most common log methods include:

context.log.info()
context.log.debug()
context.log.warning()
context.log.error()
context.log.exception()

`Actor`

Type	Arguments	Returns
Object	-	Actor class object

A reference to the full power of the Actor class of Apify SDK. See the docs for more information.

Caution: Since we're making the Actor class available with this option, and Puppeteer Scraper already runs using the Actor class, some edge case manipulations may lead to inconsistencies. Use Actor class with caution, and avoid making global changes unless you know what you're doing.

`Apify`

An alias for Actor class for back compatibility.

`setValue`

Type	Arguments	Returns
Function	(key: string, data: object, options: object)	Promise<void>

This function is async! Don't forget the await keyword!

Allows you to save data to the default key-value store. The key is the name of the item in the store (which can later be used to retrieve this stored data), and the data is an object containing all the data you want to store.

Usage:

await context.setValue('my-value', { message: 'hello' })

Refer to Key-Value store documentation for more information.

`getValue`

Type	Arguments	Returns
Function	(key: string)	Promise<object>

This function is async! Don't forget the await keyword!

Retrieve previously saved data in the key-value store via the key specified when using the setValue function.

Usage:

const { message } = await context.getValue('my-value')

Refer to Key-Value store documentation for more information.
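
As a sketch of how setValue and getValue can work together, the Page function below keeps a small counter in the key-value store. The `crawl-state` key and the `pagesSeen` field are made up for this example, and the code assumes getValue resolves to null (or undefined) when the key does not exist yet.

```javascript
// Keeps a running counter in the default key-value store.
async function pageFunction(context) {
    const { request } = context;

    // Load the previously saved state, or start fresh on the first page.
    const state = (await context.getValue('crawl-state')) ?? { pagesSeen: 0 };
    state.pagesSeen += 1;

    // Persist the updated state for later Page function invocations.
    await context.setValue('crawl-state', state);

    return { url: request.url, pagesSeen: state.pagesSeen };
}
```

Pages are typically processed concurrently, so a read-modify-write like this can lose updates; treat the counter as approximate unless concurrency is limited to one.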

`saveSnapshot`

Type	Arguments	Returns
Function	()	Promise<void>

This function is async! Don't forget the await keyword!

A helper function that enables saving a snapshot of the current page's HTML and a screenshot of the current page into the default key-value store. Each snapshot overwrites the previous one, and the pageFunction's invocations will also be throttled if saveSnapshot is invoked more than once in 2 seconds (this is a measure put in place to prevent abuse). Make sure you don't call it for every single request.

Usage:

await context.saveSnapshot()

You can find the latest screenshot under the SNAPSHOT-SCREENSHOT key and the HTML under the SNAPSHOT-BODY key.
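
Because snapshots are throttled and overwrite each other, one reasonable pattern is to save one only when something unexpected happens. The sketch below does so when an expected element is missing; the `h1` selector is just an example.

```javascript
// Saves a snapshot only when the expected element is missing, for later debugging.
async function pageFunction(context) {
    const { page, request, log } = context;

    // page.$() resolves to null when nothing matches the selector.
    const heading = await page.$('h1');
    if (!heading) {
        log.warning(`No <h1> found on ${request.url}, saving a snapshot.`);
        await context.saveSnapshot();
        return { url: request.url, heading: null };
    }

    // page.$eval() runs the callback in the browser on the first matching element.
    const text = await page.$eval('h1', (el) => el.textContent.trim());
    return { url: request.url, heading: text };
}
```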

`skipLinks`

Type	Arguments	Returns
Function	()	Promise<void>

This function is async! Don't forget the await keyword!

With each invocation of the pageFunction, the scraper attempts to extract new URLs from the page using the Link selector and Glob patterns/Pseudo-URLs provided in the input UI. If you want to prevent this behavior in certain cases, call the skipLinks function, and no URLs will be added to the queue for the given page.

Usage:

await context.skipLinks()
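
For instance, a Page function might skip link extraction on a part of the site that should not be crawled further. In this sketch the `/archive/` path prefix is a hypothetical condition chosen for illustration.

```javascript
// Stops automatic link enqueueing on pages under a (hypothetical) /archive/ path.
async function pageFunction(context) {
    const { request, page } = context;

    const { pathname } = new URL(request.url);
    const isArchive = pathname.startsWith('/archive/');

    if (isArchive) {
        // No URLs from this page will be added to the queue.
        await context.skipLinks();
    }

    return { url: request.url, title: await page.title(), linksSkipped: isArchive };
}
```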

`enqueueRequest`

Type	Arguments	Returns
Function	(request: Request | object)	Promise<void>

This function is async! Don't forget the await keyword!

To enqueue a specific URL manually instead of automatically by a combination of a Link selector and a Pseudo URL/Glob pattern, use the enqueueRequest function. It accepts a plain object as argument that needs to have the structure to construct a Request object, but frankly, you just need an object with a url key.

Usage:

await context.enqueueRequest({ url: 'https://www.example.com' })

This method is a shorthand for:

await context.crawler.requestQueue.addRequest({ url: 'https://www.example.com' })
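
A common use of enqueueRequest is following pagination by hand. The sketch below collects the targets of "next page" links and enqueues them with user data; the `a.next` selector and the `label` value are assumptions for this example.

```javascript
// Enqueues pagination links manually and tags them with user data.
async function pageFunction(context) {
    const { page, request } = context;

    // page.$$eval() runs the callback in the browser on all matching elements.
    const nextUrls = await page.$$eval('a.next', (links) => links.map((a) => a.href));

    for (const url of nextUrls) {
        await context.enqueueRequest({ url, userData: { label: 'LIST' } });
    }

    return { url: request.url, enqueued: nextUrls.length };
}
```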

`injectJQuery`

Type	Arguments	Returns
Function	()	Promise<void>

This function is async! Don't forget the await keyword!

Injects the jQuery library into a Puppeteer page. The injected jQuery will be set to the window.$ variable, and will survive page navigations and reloads. Note that injectJQuery() does not affect the Puppeteer page.$() function in any way.

Usage:

await context.injectJQuery();
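
Once jQuery is injected, it can be used inside code evaluated in the browser. The following is a sketch that collects the text of all h2 headings; the selector is arbitrary.

```javascript
// Uses the injected jQuery inside the browser context via page.evaluate().
async function pageFunction(context) {
    const { page, request } = context;

    await context.injectJQuery();

    // This callback runs in the browser, where jQuery is now available as window.$.
    const headings = await page.evaluate(() => {
        return $('h2').map((i, el) => $(el).text().trim()).get();
    });

    return { url: request.url, headings };
}
```

Note that `$` inside the callback refers to the browser's jQuery, not to anything defined in the Node.js scope.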

`sendRequest`

Type	Arguments	Returns
Function	(overrideOptions?: Partial<GotOptionsInit>)	Promise<void>

This function is async! Don't forget the await keyword!

This is a helper function that allows processing the context bound Request object through got-scraping. Some options, such as url or method could be overridden by providing overrideOptions. See the official documentation for full list of possible overrideOptions and more information.

Usage:

// Without overrideOptions
await context.sendRequest();
// With overrideOptions.url
await context.sendRequest({ url: 'https://www.example.com' });

`parseWithCheerio`

Type	Arguments	Returns
Function	()	Promise<CheerioRoot>

Returns a Cheerio handle for page.content(), allowing you to work with the data the same way as with CheerioCrawler.

Usage:

const $ = await context.parseWithCheerio();
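
As a short sketch, the returned handle can be queried with the usual Cheerio (jQuery-like) selectors from the Node.js side. The fields extracted here are arbitrary examples.

```javascript
// Parses the rendered HTML with Cheerio and queries it in Node.js.
async function pageFunction(context) {
    const { request } = context;

    const $ = await context.parseWithCheerio();

    return {
        url: request.url,
        title: $('title').text().trim(),
        linkCount: $('a[href]').length,
    };
}
```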

Proxy Configuration

The Proxy configuration (proxyConfiguration) option enables you to set proxies that will be used by the scraper in order to prevent its detection by target websites. You can use both Apify Proxy and custom HTTP or SOCKS5 proxy servers.

Proxy is required to run the scraper. The following table lists the available options of the proxy configuration setting:

Option	Description
Apify Proxy (automatic)	The scraper will load all web pages using Apify Proxy in the automatic mode. In this mode, the proxy uses all proxy groups that are available to the user, and for each new web page it automatically selects the proxy that hasn't been used in the longest time for the specific hostname, in order to reduce the chance of detection by the website. You can view the list of available proxy groups on the Proxy page in Apify Console.
Apify Proxy (selected groups)	The scraper will load all web pages using Apify Proxy with specific groups of target proxy servers.
Custom proxies	The scraper will use a custom list of proxy servers. The proxies must be specified in the `scheme://user:password@host:port` format, and multiple proxies should be separated by a space or a new line. The URL scheme can be either `http` or `socks5`. Username and password can be omitted if the proxy doesn't require authorization, but the port must always be present.

Custom proxy example:

http://bob:password@proxy1.example.com:8000
http://bobby:password123@proxy2.example.com:3001

The proxy configuration can be set programmatically when calling the Actor using the API by setting the proxyConfiguration field. It accepts a JSON object with the following structure:

{
    // Indicates whether to use Apify Proxy or not.
    "useApifyProxy": Boolean,

    // Array of Apify Proxy groups, only used if "useApifyProxy" is true.
    // If missing or null, Apify Proxy will use the automatic mode.
    "apifyProxyGroups": String[],

    // Array of custom proxy URLs, in "scheme://user:password@host:port" format.
    // If missing or null, custom proxies are not used.
    "proxyUrls": String[],
}
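
Filled in with concrete values, the three modes from the table might look like the objects below. This is a sketch: the `RESIDENTIAL` group name is only a placeholder for whichever proxy groups your account has, and the proxy URLs are the example ones from above.

```javascript
// Apify Proxy in automatic mode: no groups specified.
const automatic = { useApifyProxy: true };

// Apify Proxy with selected groups ('RESIDENTIAL' is a placeholder group name).
const selectedGroups = { useApifyProxy: true, apifyProxyGroups: ['RESIDENTIAL'] };

// Custom proxies only.
const customProxies = {
    useApifyProxy: false,
    proxyUrls: [
        'http://bob:password@proxy1.example.com:8000',
        'http://bobby:password123@proxy2.example.com:3001',
    ],
};

// One of these objects goes under the proxyConfiguration field of the Actor input.
const input = { proxyConfiguration: customProxies };
console.log(JSON.stringify(input, null, 2));
```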

Advanced Configuration

Pre-navigation hooks

This is an array of functions that will be executed BEFORE the main pageFunction is run. Each function receives a context object similar to the one passed into the pageFunction, plus a second "DirectNavigationOptions" object. Apify is an alias for the Actor class in this case.

The available options can be seen here:

preNavigationHooks: [
    async ({ id, request, session, proxyInfo, customData, Actor, Apify }, { timeout, waitUntil, referer }) => {}
]
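
As a sketch of what such a hook might do, the example below adjusts the navigation options before the page is loaded. The `navigationTimeoutMs` key in customData is hypothetical, and the timeout is assumed to be in milliseconds, as in Puppeteer's page.goto() options.

```javascript
// Hypothetical pre-navigation hook that tweaks how the page is loaded.
const preNavigationHooks = [
    async ({ request, customData }, gotoOptions) => {
        // Wait until the network is mostly idle ('networkidle2' is a Puppeteer waitUntil value).
        gotoOptions.waitUntil = 'networkidle2';

        // Optionally take a longer timeout from Custom data (assumed to be milliseconds).
        if (customData && customData.navigationTimeoutMs) {
            gotoOptions.timeout = customData.navigationTimeoutMs;
        }
    },
];
```

Mutating the second argument in place is the point of the hook: the changed options are the ones used for the navigation that follows.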

Check out the docs for Pre-navigation hooks and the PuppeteerHook type for more info regarding the objects passed into these functions. The available properties are extended with Actor (previously Apify) class and customData in this scraper.

Post-navigation hooks

An array of functions that will be executed AFTER the main pageFunction is run. The only available parameter is the PuppeteerCrawlingContext object. The available properties are extended with Actor (alternatively Apify) and customData in this scraper. Apify is an alias for the Actor class in this case.

postNavigationHooks: [
    async ({ id, request, session, proxyInfo, response, customData, Actor, Apify }) => {}
]

Check out the docs for Post-navigation hooks and the PuppeteerHook type for more info regarding the objects passed into these functions.

Debug log

boolean

When set to true, debug messages will be included in the log. Use context.log.debug('message') to log your own debug messages.

Browser log

boolean

When set to true, console messages from the browser will be included in the Actor's log. This may result in the log being flooded by error messages, warnings and other messages of little value (especially with a high concurrency).

Custom data

Custom namings

With the final three options in the Advanced configuration, you can set custom names for the following:

Dataset
Key-value store
Request queue

Leave the storage unnamed if you only want the data within it to be persisted on the Apify platform for a number of days corresponding to your plan (after which it will expire). Named storages are retained indefinitely. Additionally, using a named storage allows you to share it across multiple runs (e.g. instead of having 10 different unnamed datasets for 10 different runs, all the data from all 10 runs can be accumulated into a single named dataset). Learn more here.

Results

The scraping results returned by Page function are stored in the default dataset associated with the Actor run, from which you can export them to formats such as JSON, XML, CSV or Excel. For each object returned by the Page function, Puppeteer Scraper pushes one record into the dataset, and extends it with metadata such as the URL of the web page where the results come from.

For example, if you were scraping the HTML <title> of Apify and returning the following object from the pageFunction:

return {
  title: "Web Scraping, Data Extraction and Automation - Apify"
}

The full object stored in the dataset would look as follows (in JSON format, including the metadata fields #error and #debug):

{
  "title": "Web Scraping, Data Extraction and Automation - Apify",
  "#error": false,
  "#debug": {
    "requestId": "fvwscO2UJLdr10B",
    "url": "https://apify.com",
    "loadedUrl": "https://apify.com/",
    "method": "GET",
    "retryCount": 0,
    "errorMessages": null,
    "statusCode": 200
  }
}

To download the results, call the Get dataset items API endpoint:

https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json

[DATASET_ID] is the ID of the Actor run's dataset, which you can find in the Run object returned when starting the Actor. Alternatively, you'll find the download links for the results in Apify Console.

To skip the #error and #debug metadata fields from the results and not include empty result records, simply add the clean=true query parameter to the API URL, or select the Clean items option when downloading the dataset in Apify Console.

To get the results in other formats, set the format query parameter to xml, xlsx, csv, html, etc. For more information, see Datasets in documentation or the Get dataset items endpoint in the Apify API reference.
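
The URL variations described above can be assembled with a small helper. This is a sketch: `buildDatasetItemsUrl` and `downloadItems` are hypothetical names, the download step assumes a Node.js version with a global fetch (18 or later), and any authentication your dataset may require is left out.

```javascript
// Builds the Get dataset items URL with the format and clean query parameters.
function buildDatasetItemsUrl(datasetId, { format = 'json', clean = false } = {}) {
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', format);
    if (clean) url.searchParams.set('clean', 'true');
    return url.toString();
}

// Downloads the items as parsed JSON (not called here, since it needs network access).
async function downloadItems(datasetId) {
    const response = await fetch(buildDatasetItemsUrl(datasetId, { clean: true }));
    if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
    return response.json();
}

// 'DATASET_ID' stands for the real dataset ID of your run.
console.log(buildDatasetItemsUrl('DATASET_ID', { format: 'csv', clean: true }));
// https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&clean=true
```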

Additional Resources

That's it! You might also want to check out these other resources:

Actors documentation - Documentation for the Apify Actors cloud computing platform.
Apify SDK documentation - Learn more about the tools required to run your own Apify Actors.
Crawlee documentation - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.
Playwright Scraper - A similar web scraping Actor to Puppeteer Scraper, but using the Playwright library instead.
Web Scraper - A similar web scraping Actor to Puppeteer Scraper, but is simpler to use and only runs in the context of the browser. Uses the Puppeteer library.
Cheerio Scraper - Another web scraping Actor that downloads and processes pages in raw HTML for much higher performance.
