User

Jan Čurn (jancurn)

Co-founder and CEO of Apify. Used to be a computer scientist in a previous life.

Actor

url-to-pdf

jancurn/url-to-pdf

Opens a web page in headless Chrome using Puppeteer and prints it to PDF. The input is a JSON object such as: { "url": "https://www.wikipedia.org/", "sleepMillis": 2000, "pdfOptions": { ... } } The optional "sleepMillis" s...

jancurn
88 stars
FEATURED
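
As a rough illustration of the input format described above, the following Python sketch assembles the JSON object. The helper name `build_input` and its validation rules are assumptions made for this example, not part of the actor; the contents of "pdfOptions" are elided in the description, so they are passed through unchanged.

```python
import json
from urllib.parse import urlparse


def build_input(url, sleep_millis=None, pdf_options=None):
    """Return the actor input as a JSON string (hypothetical helper)."""
    parsed = urlparse(url)
    # Assumption: the actor expects an absolute HTTP(S) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute HTTP(S) URL: {url!r}")

    payload = {"url": url}
    if sleep_millis is not None:
        if sleep_millis < 0:
            raise ValueError("sleepMillis must be non-negative")
        payload["sleepMillis"] = int(sleep_millis)
    if pdf_options is not None:
        # Passed through as-is; the description does not list the allowed keys.
        payload["pdfOptions"] = dict(pdf_options)
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_input("https://www.wikipedia.org/", sleep_millis=2000))
```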
Crawler

Machinery trader

Downloads a list of heavy-duty construction equipment for sale or rent, such as heavy-duty trucks and trailers.

jancurn
14 downloads
Actor

pdf-to-html

jancurn/pdf-to-html

Fetches a PDF file from a specific URL and converts it to an HTML document using the pdf2htmlEX tool. The input is a JSON object such as: { "url": "https://www.example.com/some-file.pdf" } The output is an HTML file.

jancurn
13 stars
Crawler

Data, what now?

Simple example showing how to scrape a list of posts from a personal blog.

jancurn
10 downloads
Actor

analyze-domains

jancurn/analyze-domains

Crawls and downloads web pages running on a list of provided naked domains (e.g. "example.com"). The actor stores an HTML snapshot, screenshot, text body, and HTTP response headers of all the pages. It also extracts email addresses...

jancurn
8 stars
Actor

extract-metadata

jancurn/extract-metadata

A small, efficient actor that loads a web page, parses its HTML using the Cheerio library, and extracts metadata from the <HEAD> tag, such as the page title, description, and author.

jancurn
7 stars
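
The kind of extraction described above can be sketched in Python with the standard library. This is not the actor's code (which uses Cheerio); the class and function names are hypothetical, and the sketch only collects the title and `<meta name=... content=...>` pairs found inside the head.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collects the <title> text and named <meta> tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag == "title":
            self.in_title = True
        elif self.in_head and tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content is not None:
                self.metadata[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than assign.
        if self.in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html):
    """Return a dict of metadata from the head of an HTML document."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    if "title" in parser.metadata:
        parser.metadata["title"] = parser.metadata["title"].strip()
    return parser.metadata
```

Meta tags that appear outside the head are ignored here on purpose, to match the description.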
Crawler

motor-talk.de discussions

Extracts texts from a German automotive discussion portal. For example, such a data set can be used by a machine learning system for sentiment analysis to figure out how people perceive various car models.

jancurn
5 downloads
Actor

probe-page-resources

jancurn/probe-page-resources

Sequentially loads a list of URLs in headless Chrome and analyzes HTTP resources requested by each page. Source code at https://github.com/jancurn/act-probe-page-resources

jancurn
4 stars
Crawler

Download CSS files

Downloads CSS files linked from a webpage.

jancurn
4 downloads
Actor

probe-resources-plus-webhook

jancurn/probe-resources-plus-webhook

Calls jancurn/probe-page-resources and then invokes a hard-coded webhook. The actor takes the same input as jancurn/probe-page-resources.

jancurn
4 stars
Actor

send-email-on-crawler-finish

jancurn/send-email-on-crawler-finish

Fetches information about a crawler run and sends it to the user by email. For example, this actor can be used to inform the user that the crawler run finished. To do that, simply put the following URL into "Finish webhook URL" se...

jancurn
3 stars
Actor

find-broken-links

jancurn/find-broken-links

Crawls a website and finds broken links. Unlike other similar SEO analysis tools, it also reports broken URL #fragments. The results are stored in JSON and HTML reports.

jancurn
3 stars
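
To make the idea of a broken URL #fragment concrete, here is a minimal Python sketch of one way such a check could work. It assumes a fragment counts as valid when the target page contains an element with a matching `id`, or an `<a>` with a matching `name`; the function and class names are hypothetical and the actor's real logic may differ.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag


class AnchorCollector(HTMLParser):
    """Collects every id attribute and every <a name="..."> value on a page."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if value and (key == "id" or (tag == "a" and key == "name")):
                self.anchors.add(value)


def is_fragment_broken(url, target_html):
    """Return True if the URL has a #fragment that the target page lacks."""
    _, fragment = urldefrag(url)
    if not fragment:
        return False  # Nothing to check without a fragment.
    collector = AnchorCollector()
    collector.feed(target_html)
    collector.close()
    return unquote(fragment) not in collector.anchors
```

In a full crawler this check would run only after the page itself was fetched successfully, so that a failed request and a missing anchor are reported separately.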
Actor

cz-president-election

jancurn/cz-president-election

Collects voting data from the Czech Statistical Office about the 2018 Czech presidential election.

jancurn
3 stars
Actor

algolia-webcrawler

jancurn/algolia-webcrawler

Crawls a website using one or more sitemaps and imports the data to Algolia search index. The text content is identified using simple CSS selectors. The actor simply runs the algolia-webcrawler NPM package (https://www.npmjs.com/...

jancurn
3 stars
Actor

example-analyze-dom-css

jancurn/example-analyze-dom-css

Example showing how to use headless Chromium with Puppeteer to open a web page, fetch the list of DOM nodes on the page and obtain CSS styling information for each HTML element. The actor uses the Chrome DevTools Protocol to acce...

jancurn
2 stars
Actor

example-sitemap-cheerio

jancurn/example-sitemap-cheerio

An example actor that first downloads a sitemap in XML format and then crawls each page from the sitemap using the fast CheerioCrawler from the Apify SDK.

jancurn
1 star
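
The first step described above, turning a sitemap into a list of page URLs, might look like the Python sketch below. It assumes a standard sitemaps.org `<urlset>` document; the function name is hypothetical, and the crawling step (done by CheerioCrawler in the actor) is not shown.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the page URLs listed in <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://www.example.com/</loc></url>"
        "<url><loc>https://www.example.com/about</loc></url>"
        "</urlset>"
    )
    for url in sitemap_urls(sample):
        print(url)
```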
Crawler

m.novinky.cz

Downloads a list of all news articles from novinky.cz from the past week. Note that we're using the mobile version of the website, because it has a simpler structure and it's faster to load.

jancurn
1 download
Crawler

Firemni_seminare_SIS

Company seminars

jancurn
0 downloads