Development Standard

BrightData

4-tier progressive scraping with automatic escalation: Tier 1 WebFetch (fast, built-in), Tier 2 curl with Chrome headers (basic bot bypass), Tier 3 agent-browser (headless JavaScript rendering via Rust CLI daemon), Tier 4 Bright Data MCP proxy (CAPTCHA, advanced bot detection, residential proxies).

Effort: medium

The Problem

A lot of the web won't let you in. Paywalled content, JavaScript-heavy SPAs that render nothing server-side, sites that fingerprint your headers and return a 403, pages behind CAPTCHAs. A bare WebFetch call works fine on simple public pages and fails silently or noisily on everything else. Without escalation logic, you either get blocked and give up, or you go straight to an expensive proxy service for content you could have fetched in 200ms.

How This Skill Approaches It

Four-tier progressive escalation: start cheap and fast, escalate only when blocked. Tier 1 is WebFetch — built-in, instant, costs nothing. Tier 2 is curl with Chrome-like headers, which clears basic user-agent checks. Tier 3 is agent-browser, a headless Rust CLI daemon that handles JavaScript rendering for dynamic SPAs. Tier 4 is the Bright Data MCP proxy — residential IPs, CAPTCHA solving, advanced bot-detection bypass — which has real usage costs and only runs when the first three tiers fail. FourTierScrape runs this escalation for a single URL. Crawl handles multi-page work: light crawl uses scrape_batch in a loop for up to 50 pages, full crawl hits the Bright Data Crawl API for entire sites. All output lands as markdown.

Always starts at Tier 1 and escalates only when blocked — Tier 4 has usage costs
Outputs URL content in markdown format
Playwright is banned across PAI

Not for simple public content (use WebFetch directly), social platform scraping with named actors (use Apify), parallel headless automation with persistent auth profiles (use Browser), or real-Chrome bot bypass with logged-in sessions and zero CDP fingerprint (use Interceptor)

In Action

What you say to your DA, and what the BrightData skill actually does.

You say "scrape this page for me — it keeps blocking my requests"

Runs FourTierScrape: tries WebFetch, escalates to curl with Chrome headers, then agent-browser for JS rendering, then Bright Data MCP if all else fails, and returns the page content as markdown.

You say "crawl all the pages under the docs section of this site"

Runs Crawl: light crawl mode uses scrape_batch in a loop with link extraction, up to 50 pages; if the site is larger, escalates to the Bright Data Crawl API — returns a site map plus page contents as markdown with crawl stats.

Inside the Skill

The thinking, frameworks, and architecture that distinguish this skill from a generic version of the same task.

What It Does

Scrapes a URL or crawls a whole site, escalating through four tiers only as far as it needs to. Tier 1 is WebFetch, Tier 2 is curl with Chrome headers, Tier 3 is the agent-browser headless daemon for JavaScript-heavy pages, and Tier 4 is the Bright Data MCP proxy for CAPTCHA, advanced bot detection, and residential proxies. Two workflows: FourTierScrape for a single URL, Crawl for multi-page site mapping. Output is always markdown.

The Problem

A lot of pages won't give up their content to a simple fetch — some need JavaScript to render, some check headers, some throw a CAPTCHA at anything that looks automated. Reaching straight for the heavy proxy every time wastes money, since Tier 4 has usage costs and most pages don't need it. But guessing which tier a page needs is its own time sink. Starting cheap and escalating only when blocked gets the content with the least cost and the least latency.

How It Works

Progressive escalation, always starting at Tier 1 and stepping up only on failure:

Tier 1: WebFetch — fast, built-in.
Tier 2: curl with Chrome headers — bypasses basic user-agent bot detection.
Tier 3: agent-browser — headless browser via the agent-browser Rust CLI daemon for JavaScript rendering. Playwright is banned across PAI.
Tier 4: Bright Data MCP — proxy service that handles CAPTCHA and advanced bot detection.

Content is preserved in markdown at every tier. The Crawl workflow extends this to multiple pages — a light crawl loops the MCP batch scraper plus link extraction up to 50 pages, a full crawl uses the Bright Data Crawl API for entire sites.
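The escalation order described above can be sketched as a small loop. This is a minimal illustration, not the skill's actual implementation: the `escalate` function, the `TierResult` type, and the stub fetchers are hypothetical names introduced here, and each real tier (WebFetch, curl, agent-browser, Bright Data MCP) is assumed to be wrapped in a callable that returns markdown or `None` when blocked.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class TierResult:
    tier: int      # 1-based number of the tier that produced the content
    name: str      # label of the tier that succeeded
    markdown: str  # page content as markdown


# Hypothetical interface: a fetcher takes a URL and returns markdown,
# or None when the request is blocked or yields nothing usable.
Fetcher = Callable[[str], Optional[str]]


def escalate(url: str, tiers: Sequence[tuple[str, Fetcher]], start: int = 1) -> TierResult:
    """Try each tier in order, beginning at `start`, and return the first success."""
    for number, (name, fetch) in enumerate(tiers, start=1):
        if number < start:
            continue  # skip cheaper tiers only when the caller asks for it
        try:
            content = fetch(url)
        except Exception:
            content = None  # any error counts as "blocked": move up one tier
        if content:
            return TierResult(number, name, content)
    raise RuntimeError(f"all tiers failed for {url}")


# Stub fetchers standing in for the real tools (assumptions for the demo).
demo_tiers = [
    ("WebFetch", lambda url: None),
    ("curl", lambda url: None),
    ("agent-browser", lambda url: "# Rendered page"),
    ("Bright Data MCP", lambda url: "# Proxied page"),
]
result = escalate("https://example.com", demo_tiers)
print(result.tier, result.name)  # prints: 3 agent-browser
```

Passing `start=4` would model an explicit "use Bright Data" request, where the cheaper tiers are skipped on purpose. Because the loop stops at the first success, the paid tier is only reached after the free ones have failed.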

When to Activate This Skill

Direct Scraping Requests (Categories 1-4)

"scrape this URL", "scrape [URL]", "scrape this page"
"fetch this URL", "fetch [URL]", "fetch this page", "fetch content from"
"pull content from [URL]", "pull this page", "pull from this site"
"get content from [URL]", "retrieve [URL]", "retrieve this page"
"do scraping on [URL]", "run scraper on [URL]"
"basic scrape", "quick scrape", "simple fetch"
"comprehensive scrape", "deep scrape", "full content extraction"

Access & Bot Detection Issues (Categories 5-7)

"can't access this site", "site is blocking me", "getting blocked"
"bot detection", "CAPTCHA", "access denied", "403 error"
"need to bypass bot detection", "get around blocking"
"this URL won't load", "can't fetch this page"
"use Bright Data", "use the scraper", "use advanced scraping"

Result-Oriented Requests (Category 8)

"get me the content from [URL]"
"extract text from [URL]"
"download this page content"
"convert [URL] to markdown"
"need the HTML from this site"

Crawling Requests (Categories 9-11)

"crawl this site", "crawl [URL]", "spider this domain"
"map this website", "get all pages from [URL]", "scrape the whole site"
"crawl all pages under /docs", "extract all pages from", "site crawl"
"get every page on this site", "full site extraction"
"crawl depth 3", "crawl up to 50 pages"

Use Case Indicators

User needs web content for research or analysis
Standard methods (WebFetch) are failing
Site has bot detection or rate limiting
Need reliable content extraction
Converting web pages to structured format (markdown)
User needs multiple pages from a site, not just one
User wants to map a site's structure or extract a section

Core Capabilities

Progressive Escalation Strategy:

Tier 1: WebFetch - Fast, simple, built-in Claude Code tool
Tier 2: Customized Curl - Chrome-like browser headers to bypass basic bot detection
Tier 3: agent-browser - Headless browser automation via agent-browser Rust CLI daemon for JavaScript-heavy sites. Playwright is banned across PAI.
Tier 4: Bright Data MCP - Professional scraping service that handles CAPTCHA and advanced bot detection
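As an illustration of Tier 2, the sketch below assembles a curl command with Chrome-like headers. The exact headers the skill sends are not given in this document, so the header values, the `CHROME_HEADERS` name, and the `build_curl_command` helper are assumptions; the curl flags themselves (`--silent`, `--show-error`, `--location`, `--compressed`, `--max-time`, `-H`) are standard.

```python
import shlex

# Assumed header set: a plausible desktop Chrome profile, not the skill's own.
CHROME_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_curl_command(url: str, timeout_s: int = 30) -> list[str]:
    """Return an argv list for curl that follows redirects and sends browser-like headers."""
    cmd = [
        "curl",
        "--silent", "--show-error",  # no progress meter, but still report errors
        "--location",                # follow redirects
        "--compressed",              # request and decode compressed responses
        "--max-time", str(timeout_s),
    ]
    for name, value in CHROME_HEADERS.items():
        cmd += ["-H", f"{name}: {value}"]
    cmd.append(url)
    return cmd


# Show the command as a shell-safe string; nothing is executed here.
print(shlex.join(build_curl_command("https://example.com")))
```

Building an argv list rather than a single string keeps the URL from being interpreted by a shell, which matters when the URL comes from user input. The list can be passed to `subprocess.run` as-is.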

Key Features:

Automatic fallback between tiers
Preserves content in markdown format
Handles bot detection and CAPTCHA
Works with any URL
Efficient resource usage (only escalates when needed)

Workflow Overview

FourTierScrape.md - Complete URL content scraping with four-tier fallback strategy

When to use: Any single URL content retrieval request
Process: Start with WebFetch → if it fails, use curl with Chrome headers → if that fails, use agent-browser → if that fails, use Bright Data MCP
Output: URL content in markdown format

Crawl.md - Multi-page crawling with link discovery and site mapping

When to use: Crawling multiple pages from a site, mapping site structure, extracting a section
Process: Light Crawl (MCP scrape_batch + link extraction loop, up to 50 pages) or Full Crawl (Bright Data Crawl API for entire sites)
Output: Site map + page contents in markdown, with crawl stats and cost summary
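The light crawl loop can be pictured as a breadth-first walk that scrapes in batches and stops at a page cap. The sketch below is an assumption-heavy illustration: `scrape_batch` is modeled as a hypothetical callable that takes a list of URLs and returns a dict of URL to markdown (the real MCP tool's interface is not described here), and the same-host filter and batch size are choices made for the example.

```python
import re
from collections import deque
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse

# Markdown links of the form [text](target); the lookbehind skips images ![alt](src).
LINK_RE = re.compile(r"(?<!!)\[[^\]]*\]\(([^)\s]+)\)")


def extract_links(markdown: str, base_url: str) -> list[str]:
    """Resolve every markdown link against base_url and drop #fragments."""
    links = []
    for target in LINK_RE.findall(markdown):
        absolute, _fragment = urldefrag(urljoin(base_url, target))
        links.append(absolute)
    return links


def light_crawl(
    start_url: str,
    scrape_batch: Callable[[list[str]], dict[str, str]],  # hypothetical stand-in
    max_pages: int = 50,
    batch_size: int = 10,
) -> dict[str, str]:
    """Breadth-first crawl of one host, returning {url: markdown} for at most max_pages."""
    origin = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages: dict[str, str] = {}
    while queue and len(pages) < max_pages:
        room = min(batch_size, max_pages - len(pages), len(queue))
        batch = [queue.popleft() for _ in range(room)]
        for url, markdown in scrape_batch(batch).items():
            pages[url] = markdown
            for link in extract_links(markdown, url):
                # Stay on the starting host and never queue a URL twice.
                if urlparse(link).netloc == origin and link not in seen:
                    seen.add(link)
                    queue.append(link)
    return pages
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other, and sizing each batch by the remaining budget keeps the crawl from overshooting the 50-page cap.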

Extended Context

Integration Points:

WebFetch Tool - Built-in Claude Code tool for basic URL fetching
Bash Tool - For executing curl commands with custom headers
Browser Automation - agent-browser headless daemon for JavaScript rendering
Bright Data MCP - mcp__Brightdata__scrape_as_markdown and scrape_batch for advanced scraping
Bright Data Crawl API - HTTP POST to api.brightdata.com/datasets/v3/trigger for full-site crawls

When Each Tier Is Used:

Tier 1 (WebFetch): Simple sites, public content, no bot detection
Tier 2 (Curl): Sites with basic user-agent checking, simple bot detection
Tier 3 (agent-browser): Sites requiring JavaScript execution, dynamic content loading
Tier 4 (Bright Data): Sites with CAPTCHA, advanced bot detection, residential proxy requirements
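Deciding that a tier "failed" needs some rule for what a blocked response looks like. The document does not say how the skill detects this, so the heuristic below is purely illustrative: the status codes, marker phrases, length threshold, and the `classify_response` name are all assumptions.

```python
# Assumed signals; real sites vary widely and these lists are not exhaustive.
BLOCK_STATUS = {401, 403, 407, 429, 503}
CAPTCHA_MARKERS = ("captcha", "verify you are human", "unusual traffic")
JS_MARKERS = ("enable javascript", "javascript is required")


def classify_response(status: int, body: str, min_length: int = 200) -> str:
    """Return 'ok', 'captcha', 'blocked', or 'needs_js' for a fetched page."""
    text = body.lower()
    if any(marker in text for marker in CAPTCHA_MARKERS):
        return "captcha"   # typical of pages only Tier 4 gets past
    if status in BLOCK_STATUS:
        return "blocked"   # header checks, rate limiting, or denied access
    if any(marker in text for marker in JS_MARKERS) or len(text.strip()) < min_length:
        return "needs_js"  # empty shell that a headless browser would render
    return "ok"


print(classify_response(403, "Forbidden"))              # prints: blocked
print(classify_response(200, "<div id='root'></div>"))  # prints: needs_js
```

Anything other than `ok` would count as a failure for the current tier. The reason string is mainly useful for logging why a request moved up, since the escalation order itself stays fixed.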

Configuration: No per-use configuration is required. The tools are available by default in Claude Code; Bright Data credentials (BRIGHTDATA_API_KEY) live in ~/.claude/.env.

Examples

Example 1: Simple Public Website

User: "Scrape https://example.com"

Skill Response:

Routes to FourTierScrape.md
Attempts Tier 1 (WebFetch)
Success → Returns content in markdown
Total time: <5 seconds

Example 2: Site with JavaScript Requirements

User: "Can't access this site https://dynamic-site.com"

Skill Response:

Routes to FourTierScrape.md
Attempts Tier 1 (WebFetch) → Fails (blocked)
Attempts Tier 2 (Curl with Chrome headers) → Fails (JavaScript required)
Attempts Tier 3 (agent-browser) → Success
Returns content in markdown
Total time: ~15-20 seconds

Example 3: Site with Advanced Bot Detection

User: "Scrape https://protected-site.com"

Skill Response:

Routes to FourTierScrape.md
Attempts Tier 1 (WebFetch) → Fails (blocked)
Attempts Tier 2 (Curl) → Fails (advanced detection)
Attempts Tier 3 (agent-browser) → Fails (CAPTCHA)
Attempts Tier 4 (Bright Data MCP) → Success
Returns content in markdown
Total time: ~30-40 seconds

Example 4: Explicit Bright Data Request

User: "Use Bright Data to fetch https://difficult-site.com"

Skill Response:

Routes to FourTierScrape.md
User explicitly requested Bright Data
Goes directly to Tier 4 (Bright Data MCP) → Success
Returns content in markdown
Total time: ~5-10 seconds

Related Documentation:

~/.claude/PAI/DOCUMENTATION/Skills/SkillSystem.md - Canonical structure guide
~/.claude/ - Overall PAI philosophy

Last Updated: 2026-02-22

Gotchas

4-tier escalation: WebFetch → curl → agent-browser → Bright Data proxy. Always start at Tier 1 and escalate only when blocked. Playwright is banned across PAI.
Bright Data proxy has usage costs. Don't use Tier 4 for sites accessible via Tier 1-3.
CAPTCHA-solving introduces latency. Allow extra time for Tier 4 responses.
Credentials in ~/.claude/.env — BRIGHTDATA_API_KEY.
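For scripts that call Bright Data outside the MCP tools, the key can be read from the same file. This is a minimal sketch that assumes the file uses plain `KEY=VALUE` lines (optionally prefixed with `export` and optionally quoted); the `read_env_value` helper is a hypothetical name, and the process environment is checked first as a design choice for the example.

```python
import os
from pathlib import Path
from typing import Optional


def read_env_value(key: str, env_path: Path) -> Optional[str]:
    """Return `key` from the process environment, else from a KEY=VALUE file."""
    if key in os.environ:
        return os.environ[key]
    if not env_path.is_file():
        return None
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        name, _, value = line.partition("=")
        if name.strip().removeprefix("export ").strip() == key:
            return value.strip().strip("\"'")  # drop surrounding quotes
    return None


api_key = read_env_value("BRIGHTDATA_API_KEY", Path.home() / ".claude" / ".env")
if api_key is None:
    print("BRIGHTDATA_API_KEY not found; Tier 4 would be unavailable")
```

The value is never printed or logged here, which is worth preserving in any real script since the key is tied to a paid account.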

Workflows · 2

01 Crawl: Workflows/Crawl.md
02 FourTierScrape: Workflows/FourTierScrape.md

How to Invoke

Say any of these to your DA and PAI activates the BrightData skill automatically:

"Bright Data"
"scrape URL"
"web scraping"
"bot detection"
"crawl site"
"CAPTCHA"
"can't access"
"site blocking"
"extract page content"
"scrape whole site"
"spider domain"
"convert URL to markdown"
"getting blocked"

Or invoke explicitly:

Skill("BrightData")

References & Credits

The thinkers, books, frameworks, and research this skill is built on. The ideas belong to them — the integration belongs to PAI.

Tool: Bright Data. Residential proxy network and CAPTCHA-solving layer used at Tier 4 when standard methods are blocked.
