Crawlee Reviews: Covers your crawling and scraping end-to-end and helps you build reliable scrapers

About Crawlee

Crawlee is a web scraping and browser automation library. It helps you build reliable crawlers. Fast. Crawlee includes tools for extracting social handles or phone numbers, infinite scrolling, blocking unwanted assets, and more.

Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project’s needs.

Reliable crawling

Crawlee won’t fix broken selectors for you (yet), but it helps you build and maintain your crawlers faster.

When a website adds JavaScript rendering, you don’t have to rewrite everything, only switch to one of the browser crawlers. When you later find a great API to speed up your crawls, flip the switch back.

It keeps your proxies healthy by rotating them smartly with good fingerprints that make your crawlers look human-like. It’s not unblockable, but it will save you money in the long run.

Crawlee is built by people who scrape for a living and use it every day to scrape millions of pages.

JavaScript & TypeScript

We believe websites are best scraped in the language they’re written in. Crawlee runs on Node.js and it’s built in TypeScript to improve code completion in your IDE, even if you don’t use TypeScript yourself.

HTTP scraping

Crawlee makes HTTP requests that mimic browser headers and TLS fingerprints. It also rotates them automatically based on data about real-world traffic. The popular HTML parsers Cheerio and JSDOM are included.

Headless browsers

Switch your crawlers from HTTP to headless browsers in 3 lines of code. Crawlee builds on top of Puppeteer and Playwright and adds its own anti-blocking features and human-like fingerprints. Chrome, Firefox and more.

Automatic scaling and proxy management

Crawlee automatically manages concurrency based on available system resources and smartly rotates proxies. Proxies that often time out, return network errors, or return bad HTTP status codes such as 401 or 403 are discarded.
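
To make the proxy-discarding idea concrete, here is a minimal sketch in plain JavaScript. It is not Crawlee's implementation: the `ProxyPool` class, the `maxErrors` threshold, and the example proxy URLs are all hypothetical, and the concurrency scaling mentioned above is not modeled at all.

```javascript
// Illustrative sketch only, NOT Crawlee's internal logic.
// ProxyPool, maxErrors and the proxy URLs below are hypothetical.
class ProxyPool {
    constructor(proxyUrls, { maxErrors = 3 } = {}) {
        this.maxErrors = maxErrors;
        // Error score for every proxy that is still considered healthy.
        this.errorScores = new Map(proxyUrls.map((url) => [url, 0]));
        this.cursor = 0;
    }

    // Return the next healthy proxy in round-robin order, or null if none remain.
    next() {
        const healthy = [...this.errorScores.keys()];
        if (healthy.length === 0) return null;
        const url = healthy[this.cursor % healthy.length];
        this.cursor += 1;
        return url;
    }

    // Record the outcome of a request made through `url`.
    // Timeouts/network errors (failed) and 401/403 responses count as failures.
    report(url, { statusCode = null, failed = false } = {}) {
        if (!this.errorScores.has(url)) return;
        const blocked = statusCode === 401 || statusCode === 403;
        if (failed || blocked) {
            const score = this.errorScores.get(url) + 1;
            if (score >= this.maxErrors) this.errorScores.delete(url); // discard the proxy
            else this.errorScores.set(url, score);
        } else {
            // A successful response resets the error score.
            this.errorScores.set(url, 0);
        }
    }
}

const pool = new ProxyPool(
    ['http://proxy-a.example:8000', 'http://proxy-b.example:8000'],
    { maxErrors: 2 },
);
pool.report('http://proxy-a.example:8000', { statusCode: 403 });
pool.report('http://proxy-a.example:8000', { failed: true });
console.log(pool.next()); // prints "http://proxy-b.example:8000"
```

After two failures the first proxy is dropped, so only the second one is handed out. A real rotation strategy would likely also tie proxies to sessions and let error scores decay over time.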

Queue and Storage

You can save files, screenshots and JSON results to disk with one line of code or plug an adapter for your DB. Your URLs are kept in a queue that ensures their uniqueness and that you don’t lose progress when something fails.
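
The uniqueness guarantee of the URL queue can be pictured with a small sketch in plain JavaScript. This is an assumption-laden illustration, not Crawlee's request queue: the `UniqueUrlQueue` class and its normalization rule (dropping the URL fragment) are hypothetical, and persistence to disk is left out.

```javascript
// Illustrative sketch only, NOT Crawlee's request queue.
// UniqueUrlQueue and its normalization rule are hypothetical.
class UniqueUrlQueue {
    constructor() {
        this.pending = [];     // URLs waiting to be processed, in insertion order
        this.seen = new Set(); // normalized keys of every URL ever added
    }

    // Normalize a URL so that trivially different spellings share one key.
    // The URL parser already lowercases the host; we also drop the fragment.
    static uniqueKey(rawUrl) {
        const url = new URL(rawUrl);
        url.hash = '';
        return url.toString();
    }

    // Add a URL; returns true if it was new, false if it was a duplicate.
    add(rawUrl) {
        const key = UniqueUrlQueue.uniqueKey(rawUrl);
        if (this.seen.has(key)) return false;
        this.seen.add(key);
        this.pending.push(key);
        return true;
    }

    // Take the next URL to process (first in, first out), or undefined when empty.
    next() {
        return this.pending.shift();
    }
}

const queue = new UniqueUrlQueue();
queue.add('https://crawlee.dev/docs');
queue.add('https://crawlee.dev/docs#intro'); // duplicate once the fragment is dropped
queue.add('https://CRAWLEE.dev/docs');       // duplicate: the host is case-insensitive
console.log(queue.pending.length); // prints 1
```

Keeping the `seen` set separate from `pending` means a URL that was already processed is still rejected when it is discovered again later in the crawl.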

Helpful utils and configurability

Crawlee includes tools for extracting social handles or phone numbers, infinite scrolling, blocking unwanted assets and many more. It works great out of the box, but also provides rich configuration options.
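
As a rough idea of what extracting contact details from page text involves, the following plain JavaScript sketch pulls email addresses and Twitter handles out of a string with regular expressions. It does not use Crawlee's own utilities; the `extractContacts` function, its patterns, and the sample text are hypothetical and far simpler than a production-grade extractor.

```javascript
// Illustrative sketch only, NOT Crawlee's built-in helpers.
// extractContacts and both regular expressions are hypothetical and simplified.
function extractContacts(text) {
    // Simplified email pattern; real-world addresses can be more varied.
    const emails = text.match(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g) ?? [];
    // Capture the handle that follows "twitter.com/" (1 to 15 word characters).
    const twitterHandles = [...text.matchAll(/twitter\.com\/([A-Za-z0-9_]{1,15})/g)]
        .map((match) => match[1]);
    // De-duplicate while keeping first-seen order.
    return {
        emails: [...new Set(emails)],
        twitterHandles: [...new Set(twitterHandles)],
    };
}

const sample = 'Contact info@example.com or https://twitter.com/example_dev. Sales: info@example.com';
const contacts = extractContacts(sample);
console.log(contacts.emails);         // prints [ 'info@example.com' ]
console.log(contacts.twitterHandles); // prints [ 'example_dev' ]
```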

Try Crawlee out

The fastest way to try Crawlee out is to use the Crawlee CLI and choose the Getting started example. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.

npx crawlee create my-crawler

If you prefer adding Crawlee to your own project, try the example below. Because it uses PlaywrightCrawler, we also need to install Playwright. It’s not bundled with Crawlee, to reduce install size.

npm install crawlee playwright

import { PlaywrightCrawler } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to `./storage/datasets/default` directory.
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page and add them to the crawling queue.
        await enqueueLinks();
    },

    // Uncomment this option to see the browser window.
    // headless: false,

    // Comment out this option to scrape the full website.
    maxRequestsPerCrawl: 20,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

// Export the whole dataset to a single file in `./result.csv`.
await crawler.exportData('./result.csv');

// Or work with the data directly.
const data = await crawler.getData();
console.table(data.items);

Features

  • Single interface for HTTP and headless browser crawling
  • Persistent queue for URLs to crawl (breadth & depth first)
  • Pluggable storage of both tabular data and files
  • Automatic scaling with available system resources
  • Integrated proxy rotation and session management
  • Lifecycles customizable with hooks
  • CLI to bootstrap your projects
  • Configurable routing, error handling and retries
  • Dockerfiles ready to deploy
  • Written in TypeScript with generics

HTTP crawling

  • Zero config HTTP2 support, even for proxies
  • Automatic generation of browser-like headers
  • Replication of browser TLS fingerprints
  • Integrated fast HTML parsers: Cheerio and JSDOM
  • Yes, you can scrape JSON APIs as well

Real browser crawling

  • JavaScript rendering and screenshots
  • Headless and headful support
  • Zero-config generation of human-like fingerprints
  • Automatic browser management
  • Use Playwright and Puppeteer with the same interface
  • Chrome, Firefox, WebKit and many others
