Development Standard

BrightData

4-tier progressive scraping with automatic escalation: Tier 1 WebFetch (fast, built-in), Tier 2 curl with Chrome headers (basic bot bypass), Tier 3 agent-browser (headless JavaScript rendering via Rust CLI daemon), Tier 4 Bright Data MCP proxy (CAPTCHA, advanced bot detection, residential proxies).

Workflows: 2 · References: 0 · Triggers: 13 · Effort: medium

The Problem

A lot of the web won't let you in. Paywalled content, JavaScript-heavy SPAs that render nothing server-side, sites that fingerprint your headers and return a 403, pages behind CAPTCHAs. A bare WebFetch call works fine on simple public pages and fails silently or noisily on everything else. Without escalation logic, you either get blocked and give up, or you go straight to an expensive proxy service for content you could have fetched in 200ms.

How This Skill Approaches It

Four-tier progressive escalation: start cheap and fast, escalate only when blocked. Tier 1 is WebFetch — built-in, instant, costs nothing. Tier 2 is curl with Chrome-like headers, which clears basic user-agent checks. Tier 3 is agent-browser, a headless Rust CLI daemon that handles JavaScript rendering for dynamic SPAs. Tier 4 is the Bright Data MCP proxy — residential IPs, CAPTCHA solving, advanced bot-detection bypass — which has real usage costs and only runs when the first three tiers fail. FourTierScrape runs this escalation for a single URL. Crawl handles multi-page work: light crawl uses scrape_batch in a loop for up to 50 pages, full crawl hits the Bright Data Crawl API for entire sites. All output lands as markdown.

  • Always starts at Tier 1 and escalates only when blocked — Tier 4 has usage costs
  • Outputs URL content in markdown format
  • Playwright is banned across PAI
Not for: simple public content (use WebFetch directly); social platform scraping with named actors (use Apify); parallel headless automation with persistent auth profiles (use Browser); real-Chrome bot bypass with logged-in sessions and zero CDP fingerprint (use Interceptor).

In Action

What you say to your DA, and what the BrightData skill actually does.

  • You say "scrape this page for me — it keeps blocking my requests"
    Runs FourTierScrape: tries WebFetch, escalates to curl with Chrome headers, then agent-browser for JS rendering, then Bright Data MCP if all else fails — returns the page content as markdown.
  • You say "crawl all the pages under the docs section of this site"
    Runs Crawl: light crawl mode uses scrape_batch in a loop with link extraction, up to 50 pages; if the site is larger, escalates to the Bright Data Crawl API — returns a site map plus page contents as markdown with crawl stats.

Inside the Skill

The thinking, frameworks, and architecture that distinguish this skill from a generic version of the same task.

What It Does

Scrapes a URL or crawls a whole site, escalating through four tiers only as far as it needs to. Tier 1 is WebFetch, Tier 2 is curl with Chrome headers, Tier 3 is the agent-browser headless daemon for JavaScript-heavy pages, and Tier 4 is the Bright Data MCP proxy for CAPTCHA, advanced bot detection, and residential proxies. Two workflows: FourTierScrape for a single URL, Crawl for multi-page site mapping. Output is always markdown.

The Problem

A lot of pages won't give up their content to a simple fetch — some need JavaScript to render, some check headers, some throw a CAPTCHA at anything that looks automated. Reaching straight for the heavy proxy every time wastes money, since Tier 4 has usage costs and most pages don't need it. But guessing which tier a page needs is its own time sink. Starting cheap and escalating only when blocked gets the content with the least cost and the least latency.

How It Works

Progressive escalation, always starting at Tier 1 and stepping up only on failure:

  1. Tier 1: WebFetch — fast, built-in.
  2. Tier 2: curl with Chrome headers — bypasses basic user-agent bot detection.
  3. Tier 3: agent-browser — headless browser via the agent-browser Rust CLI daemon for JavaScript rendering. Playwright is banned across PAI.
  4. Tier 4: Bright Data MCP — proxy service that handles CAPTCHA and advanced bot detection.

Content is preserved in markdown at every tier. The Crawl workflow extends this to multiple pages — a light crawl loops the MCP batch scraper plus link extraction up to 50 pages, a full crawl uses the Bright Data Crawl API for entire sites.
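
The escalation order above can be sketched as a short loop. This is a minimal illustration, not the skill's actual implementation: the `Tier` callable type, `ScrapeResult`, and `four_tier_scrape` are hypothetical names, and each tier is assumed to return markdown on success and `None` (or raise) when blocked.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical tier signature: take a URL, return markdown, or None when blocked.
Tier = Callable[[str], Optional[str]]


@dataclass
class ScrapeResult:
    tier: int      # 1-based number of the tier that succeeded
    markdown: str  # page content as markdown


def four_tier_scrape(url: str, tiers: Sequence[Tier]) -> Optional[ScrapeResult]:
    """Try each tier in order and stop at the first one that returns content."""
    for number, fetch in enumerate(tiers, start=1):
        try:
            markdown = fetch(url)
        except Exception:
            # A raised error is treated as "blocked": fall through to the next tier.
            continue
        if markdown:
            return ScrapeResult(tier=number, markdown=markdown)
    return None  # every tier failed
```

Passing the tiers in cheapest-first order means the paid Tier 4 callable is only ever invoked after the first three have returned nothing.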

When to Activate This Skill

Direct Scraping Requests (Categories 1-4)

  • "scrape this URL", "scrape [URL]", "scrape this page"
  • "fetch this URL", "fetch [URL]", "fetch this page", "fetch content from"
  • "pull content from [URL]", "pull this page", "pull from this site"
  • "get content from [URL]", "retrieve [URL]", "retrieve this page"
  • "do scraping on [URL]", "run scraper on [URL]"
  • "basic scrape", "quick scrape", "simple fetch"
  • "comprehensive scrape", "deep scrape", "full content extraction"

Access & Bot Detection Issues (Categories 5-7)

  • "can't access this site", "site is blocking me", "getting blocked"
  • "bot detection", "CAPTCHA", "access denied", "403 error"
  • "need to bypass bot detection", "get around blocking"
  • "this URL won't load", "can't fetch this page"
  • "use Bright Data", "use the scraper", "use advanced scraping"

Result-Oriented Requests (Category 8)

  • "get me the content from [URL]"
  • "extract text from [URL]"
  • "download this page content"
  • "convert [URL] to markdown"
  • "need the HTML from this site"

Crawling Requests (Categories 9-11)

  • "crawl this site", "crawl [URL]", "spider this domain"
  • "map this website", "get all pages from [URL]", "scrape the whole site"
  • "crawl all pages under /docs", "extract all pages from", "site crawl"
  • "get every page on this site", "full site extraction"
  • "crawl depth 3", "crawl up to 50 pages"

Use Case Indicators

  • User needs web content for research or analysis
  • Standard methods (WebFetch) are failing
  • Site has bot detection or rate limiting
  • Need reliable content extraction
  • Converting web pages to structured format (markdown)
  • User needs multiple pages from a site, not just one
  • User wants to map a site's structure or extract a section

Core Capabilities

Progressive Escalation Strategy:

  1. Tier 1: WebFetch - Fast, simple, built-in Claude Code tool
  2. Tier 2: Customized Curl - Chrome-like browser headers to bypass basic bot detection
  3. Tier 3: agent-browser - Headless browser automation via agent-browser Rust CLI daemon for JavaScript-heavy sites. Playwright is banned across PAI.
  4. Tier 4: Bright Data MCP - Professional scraping service that handles CAPTCHA and advanced bot detection
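
To make the Tier 2 idea concrete, here is a rough sketch of a request that carries Chrome-like headers. The skill itself uses curl; this swaps in Python's standard `urllib` purely for illustration. The header values and the names `CHROME_HEADERS`, `build_tier2_request`, and `fetch_tier2` are assumptions, since the document does not list the exact headers the skill sends.

```python
import urllib.request

# Illustrative header set (assumed); the skill's real values are not given here.
CHROME_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_tier2_request(url: str) -> urllib.request.Request:
    """Build a GET request that presents itself like a desktop Chrome browser."""
    return urllib.request.Request(url, headers=CHROME_HEADERS)


def fetch_tier2(url: str, timeout: float = 10.0) -> str:
    """Fetch raw HTML; urlopen raises HTTPError on 4xx/5xx, signalling a block."""
    with urllib.request.urlopen(build_tier2_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Note that this returns raw HTML; converting it to markdown is a separate step that the sketch leaves out.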

Key Features:

  • Automatic fallback between tiers
  • Preserves content in markdown format
  • Handles bot detection and CAPTCHA
  • Works with any URL
  • Efficient resource usage (only escalates when needed)

Workflow Overview

FourTierScrape.md - Complete URL content scraping with four-tier fallback strategy

  • When to use: Any single URL content retrieval request
  • Process: Start with WebFetch → if it fails, use curl with Chrome headers → if that fails, use agent-browser → if that fails, use Bright Data MCP
  • Output: URL content in markdown format

Crawl.md - Multi-page crawling with link discovery and site mapping

  • When to use: Crawling multiple pages from a site, mapping site structure, extracting a section
  • Process: Light Crawl (MCP scrape_batch + link extraction loop, up to 50 pages) or Full Crawl (Bright Data Crawl API for entire sites)
  • Output: Site map + page contents in markdown, with crawl stats and cost summary
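
The light crawl loop described above might look roughly like the following. It is a sketch under stated assumptions: `scrape_batch` here is a hypothetical stand-in callable that maps a list of URLs to a dict of URL to markdown (the real MCP tool's signature is not given in this document), links are pulled from markdown with a simple regex, and the crawl is restricted to URLs under the starting prefix.

```python
import re
from collections import deque
from typing import Callable, Dict, List
from urllib.parse import urldefrag, urljoin

# Matches the target of a markdown link: [text](target)
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)")

# Hypothetical stand-in for the batch scraper: URLs in, {url: markdown} out.
ScrapeBatch = Callable[[List[str]], Dict[str, str]]


def extract_links(markdown: str, base_url: str) -> List[str]:
    """Resolve every markdown link against base_url and drop #fragments."""
    links = []
    for href in MARKDOWN_LINK.findall(markdown):
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def light_crawl(
    start_url: str,
    scrape_batch: ScrapeBatch,
    max_pages: int = 50,
    batch_size: int = 10,
) -> Dict[str, str]:
    """Breadth-first crawl under start_url, capped at max_pages pages."""
    queue = deque([start_url])
    seen = {start_url}
    pages: Dict[str, str] = {}
    while queue and len(pages) < max_pages:
        room = min(batch_size, max_pages - len(pages))
        batch = [queue.popleft() for _ in range(min(room, len(queue)))]
        for url, markdown in scrape_batch(batch).items():
            pages[url] = markdown
            for link in extract_links(markdown, url):
                # Stay inside the section being crawled and skip repeats.
                if link.startswith(start_url) and link not in seen:
                    seen.add(link)
                    queue.append(link)
    return pages
```

The returned dict doubles as a simple site map (its keys) and the page contents (its values). The `batch_size` of 10 is an arbitrary illustrative default.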

Extended Context

Integration Points:

  • WebFetch Tool - Built-in Claude Code tool for basic URL fetching
  • Bash Tool - For executing curl commands with custom headers
  • Browser Automation - agent-browser headless daemon for JavaScript rendering
  • Bright Data MCP - mcp__Brightdata__scrape_as_markdown and scrape_batch for advanced scraping
  • Bright Data Crawl API - HTTP POST to api.brightdata.com/datasets/v3/trigger for full-site crawls
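
As a rough illustration of the Crawl API integration point, the sketch below builds (but does not send) a POST to the trigger endpoint named above. Only the endpoint URL comes from this document. The `dataset_id` query parameter, the Bearer authorization header, and the `[{"url": ...}]` payload shape are assumptions to verify against Bright Data's API documentation, and `build_crawl_request` is a hypothetical helper name.

```python
import json
import urllib.parse
import urllib.request
from typing import List

TRIGGER_URL = "https://api.brightdata.com/datasets/v3/trigger"


def build_crawl_request(
    urls: List[str], dataset_id: str, api_key: str
) -> urllib.request.Request:
    """Build the trigger request; query, auth, and body shapes are assumed."""
    query = urllib.parse.urlencode({"dataset_id": dataset_id})  # assumed parameter
    body = json.dumps([{"url": u} for u in urls]).encode("utf-8")  # assumed payload
    return urllib.request.Request(
        f"{TRIGGER_URL}?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
```

Keeping request construction separate from sending makes it easy to inspect what would go over the wire before any paid crawl is triggered.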

When Each Tier Is Used:

  • Tier 1 (WebFetch): Simple sites, public content, no bot detection
  • Tier 2 (Curl): Sites with basic user-agent checking, simple bot detection
  • Tier 3 (agent-browser): Sites requiring JavaScript execution, dynamic content loading
  • Tier 4 (Bright Data): Sites with CAPTCHA, advanced bot detection, residential proxy requirements
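
Escalating "only when blocked" requires some way to recognize a block, including the silent kind where a page returns 200 with no real content. The heuristic below is one possible sketch, not the skill's actual logic: the status codes, the marker strings, the 200-character threshold, and the name `block_reason` are all illustrative assumptions.

```python
from typing import Optional

# Assumed signals; tune these for the sites actually being scraped.
CAPTCHA_MARKERS = ("captcha", "verify you are human")
BLOCK_STATUSES = {401, 403, 429}


def block_reason(status: int, body: str, min_chars: int = 200) -> Optional[str]:
    """Return a short reason if the response looks blocked, else None."""
    text = body.lower()
    if any(marker in text for marker in CAPTCHA_MARKERS):
        return "captcha"      # challenge page
    if status in BLOCK_STATUSES:
        return "blocked"      # refused or rate limited
    if status >= 400:
        return "http-error"   # other client or server error
    if len(body.strip()) < min_chars:
        return "empty"        # near-empty shell, likely needs JavaScript rendering
    return None
```

A marker-based check like this will misfire on a legitimate article that merely discusses CAPTCHAs, so it is best treated as a hint for escalation rather than a verdict.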

Configuration: Tiers 1-3 need no configuration; the tools are available by default in Claude Code. Tier 4 reads its credential, BRIGHTDATA_API_KEY, from ~/.claude/.env.


Examples

Example 1: Simple Public Website

User: "Scrape https://example.com"

Skill Response:

  1. Routes to FourTierScrape.md
  2. Attempts Tier 1 (WebFetch)
  3. Success → Returns content in markdown
  4. Total time: <5 seconds

Example 2: Site with JavaScript Requirements

User: "Can't access this site https://dynamic-site.com"

Skill Response:

  1. Routes to FourTierScrape.md
  2. Attempts Tier 1 (WebFetch) → Fails (blocked)
  3. Attempts Tier 2 (Curl with Chrome headers) → Fails (JavaScript required)
  4. Attempts Tier 3 (agent-browser) → Success
  5. Returns content in markdown
  6. Total time: ~15-20 seconds

Example 3: Site with Advanced Bot Detection

User: "Scrape https://protected-site.com"

Skill Response:

  1. Routes to FourTierScrape.md
  2. Attempts Tier 1 (WebFetch) → Fails (blocked)
  3. Attempts Tier 2 (Curl) → Fails (advanced detection)
  4. Attempts Tier 3 (agent-browser) → Fails (CAPTCHA)
  5. Attempts Tier 4 (Bright Data MCP) → Success
  6. Returns content in markdown
  7. Total time: ~30-40 seconds

Example 4: Explicit Bright Data Request

User: "Use Bright Data to fetch https://difficult-site.com"

Skill Response:

  1. Routes to FourTierScrape.md
  2. User explicitly requested Bright Data
  3. Goes directly to Tier 4 (Bright Data MCP) → Success
  4. Returns content in markdown
  5. Total time: ~5-10 seconds

Related Documentation:

  • ~/.claude/PAI/DOCUMENTATION/Skills/SkillSystem.md - Canonical structure guide
  • ~/.claude/ - Overall PAI philosophy

Last Updated: 2026-02-22

Gotchas

  • 4-tier escalation: WebFetch → curl → agent-browser → Bright Data proxy. Always start at Tier 1 and escalate only when blocked. Playwright is banned across PAI.
  • Bright Data proxy has usage costs. Don't use Tier 4 for sites accessible via Tier 1-3.
  • CAPTCHA-solving introduces latency. Allow extra time for Tier 4 responses.
  • Credentials in ~/.claude/.env — BRIGHTDATA_API_KEY.
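
For reference, reading the Tier 4 credential could be sketched as follows. This assumes the file uses plain `KEY=value` lines, optionally quoted or prefixed with `export`. The document only states the path and the variable name, so the file format and the helper names `parse_env` and `load_brightdata_key` are assumptions.

```python
from pathlib import Path
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        values[key] = value.strip().strip("'\"")  # drop surrounding quotes
    return values


def load_brightdata_key(env_path: Path = Path.home() / ".claude" / ".env") -> str:
    """Return BRIGHTDATA_API_KEY from the env file, or raise if it is missing."""
    key = parse_env(env_path.read_text(encoding="utf-8")).get("BRIGHTDATA_API_KEY")
    if not key:
        raise RuntimeError(f"BRIGHTDATA_API_KEY not set in {env_path}")
    return key
```

Failing loudly when the key is absent avoids discovering the missing credential only after Tiers 1-3 have already been exhausted.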

Workflows (2)

  1. Crawl: Workflows/Crawl.md
  2. FourTierScrape: Workflows/FourTierScrape.md

How to Invoke

Say any of these to your DA and PAI activates the BrightData skill automatically:

  • "Bright Data"
  • "scrape URL"
  • "web scraping"
  • "bot detection"
  • "crawl site"
  • "CAPTCHA"
  • "can't access"
  • "site blocking"
  • "extract page content"
  • "scrape whole site"
  • "spider domain"
  • "convert URL to markdown"
  • "getting blocked"

Or invoke explicitly:

Skill("BrightData")
