Pick the extraction provider for the output class, then pick the execution boundary for the authority risk. Firecrawl, ScraperAPI, and Apify answer different extraction jobs; a governed lane answers whether this agent may fetch, crawl, render, store, retry, and spend budget on this target set now.
Start with the extraction job, not the scraping catalog
A broad scraper, crawler, browser, screenshot, and platform-actor catalog is useful inventory. It is not one safe capability. The first operator decision is what output the agent needs and how narrow the target set can be.
Page-to-markdown extraction
The agent needs readable page content, docs, articles, or lightweight crawl output that can feed summarization or research without building an HTML parser first.
Boundary: URL allowlist, crawl depth, markdown size cap, source URL preservation, and freshness timestamp.
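That boundary can be written down before any fetch happens. A minimal sketch, assuming hypothetical names (`MarkdownBoundary`, `check_fetch`, `wrap_output` are illustrative, not a real provider API):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlparse

@dataclass(frozen=True)
class MarkdownBoundary:
    # Hypothetical boundary for page-to-markdown extraction.
    allowed_domains: frozenset  # URL allowlist
    max_depth: int              # crawl depth cap
    max_markdown_bytes: int     # output size cap

def check_fetch(boundary: MarkdownBoundary, url: str, depth: int) -> bool:
    # Refuse any URL outside the allowlist or any crawl past the depth cap.
    host = urlparse(url).hostname or ""
    return host in boundary.allowed_domains and depth <= boundary.max_depth

def wrap_output(url: str, markdown: str, boundary: MarkdownBoundary) -> dict:
    # Preserve the source URL and a freshness timestamp alongside the
    # content, and enforce the size cap before anything crosses the boundary.
    if len(markdown.encode()) > boundary.max_markdown_bytes:
        raise ValueError("markdown exceeds size cap")
    return {
        "source_url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "markdown": markdown,
    }
```

The point of the wrapper is that downstream summarization never sees markdown without its source URL and fetch time attached.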
Raw HTML at scale
The workflow needs reliable fetches through proxy rotation, anti-bot handling, and region choices, while your own code controls parsing and downstream schema extraction.
Boundary: Proxy pool, target domain policy, response byte cap, parser version, and retry budget.
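A raw-fetch route can pin all of those choices in one config object so they are decided per lane, not per request. A sketch under assumed names (`RawFetchRoute` and the pool label are illustrative):

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass(frozen=True)
class RawFetchRoute:
    # Hypothetical route config for raw-HTML fetching at scale.
    proxy_pool: str           # e.g. "residential-eu" (label is illustrative)
    allowed_suffixes: tuple   # target domain policy
    max_response_bytes: int   # response byte cap
    parser_version: str       # pinned so downstream extraction is reproducible
    retry_budget: int         # attempts allowed before the lane fails closed

def route_allows(route: RawFetchRoute, url: str) -> bool:
    # Domain policy check: exact match or subdomain of an allowed suffix.
    host = urlparse(url).hostname or ""
    return any(host == s or host.endswith("." + s)
               for s in route.allowed_suffixes)
```

Pinning `parser_version` in the route matters because your own code owns parsing here: a silent parser upgrade mid-lane is otherwise invisible in the output.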
Platform-specific actors
The task maps to a known site or marketplace actor where structured output matters more than generic URL fetching.
Boundary: Actor identity, input schema, dataset scope, per-run cost ceiling, and tenant-safe output storage.
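One way to keep the actor from being discovered ad hoc mid-run is to freeze its identity, input schema, and cost ceiling into a contract that is validated before dispatch. A sketch with hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActorContract:
    # Hypothetical execution contract for a marketplace actor run.
    actor_id: str               # pinned actor identity, never chosen mid-run
    required_fields: frozenset  # input schema the caller must satisfy
    max_cost_usd: float         # per-run cost ceiling
    dataset_prefix: str         # tenant-safe output storage scope

def validate_run(contract: ActorContract, actor_id: str,
                 inputs: dict, estimated_cost: float) -> list:
    """Return the list of contract violations; empty means the run may start."""
    errors = []
    if actor_id != contract.actor_id:
        errors.append("actor identity mismatch")
    missing = contract.required_fields - inputs.keys()
    if missing:
        errors.append(f"missing input fields: {sorted(missing)}")
    if estimated_cost > contract.max_cost_usd:
        errors.append("estimated cost exceeds per-run ceiling")
    return errors
```

Returning all violations at once, rather than failing on the first, gives the operator one complete picture of why a run was refused.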
Browser-state verification
The agent must verify UI state, screenshots, logged-in pages, or JavaScript-rendered content after an API or workflow step.
Boundary: Session identity, cookies, screenshot retention, navigation allowlist, and explicit human/privacy review when sensitive data is visible.
Turn extraction into a bounded route
Provider selection answers the fetch shape. Safe execution answers who may fetch, where they may go, which credential burns quota, what output can cross the boundary, and how a failed extraction recovers without widening the lane.
Name the output class
Decide whether the agent needs markdown, raw HTML, JSON entities, a screenshot, a dataset, or a verification artifact. Output class determines the safety checks more than the vendor logo does.
Constrain the target set
List allowed domains, URL patterns, robots/policy constraints, tenant boundaries, and disallowed data classes before the first crawl or browser session starts.
Pick the credential and budget rail
Attach the run to a governed key, BYOK credential, Agent Vault token, wallet/prefund rail, or direct provider credential so quota, proxy spend, and downstream enrichment cost have one owner.
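"One owner" can be enforced mechanically: every spend category draws down the same ceiling, so proxy cost, captcha spend, and enrichment calls cannot each claim their own budget. A minimal sketch, assuming a hypothetical `BudgetRail` class:

```python
class BudgetRail:
    """Hypothetical single-owner budget: proxy spend, provider quota,
    and enrichment cost all draw from one ceiling with one owner."""

    def __init__(self, owner: str, ceiling_usd: float):
        self.owner = owner
        self.ceiling = ceiling_usd
        self.spent = 0.0

    def charge(self, category: str, amount: float) -> float:
        # Refuse the charge before spending, not after: the ceiling is a
        # preflight check, not a postmortem alert.
        if self.spent + amount > self.ceiling:
            raise RuntimeError(
                f"{category} charge of {amount} would exceed "
                f"{self.owner}'s ceiling of {self.ceiling}")
        self.spent += amount
        return self.ceiling - self.spent
```

Whether the rail is backed by a governed key, a BYOK credential, or a prefunded wallet, the shape is the same: one ceiling, one owner, checked before the spend.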
Require denial and recovery proof
The unsafe neighbor — blocked domain, oversized crawl, wrong tenant, stale cookie, or out-of-scope extraction request — should fail as a typed policy outcome before the provider or browser sees it.
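"Typed policy outcome" means the denial is a value the caller can branch on, not a string buried in a log. A sketch, with the enum members and check order chosen for illustration:

```python
from enum import Enum

class Denial(Enum):
    # Typed policy outcomes: the provider or browser never sees a
    # request that maps to one of these.
    BLOCKED_DOMAIN = "blocked_domain"
    CRAWL_TOO_LARGE = "crawl_too_large"
    WRONG_TENANT = "wrong_tenant"
    OUT_OF_SCOPE = "out_of_scope"

def preflight(url_host: str, page_count: int, tenant: str, output_class: str,
              *, allowed_hosts: set, max_pages: int, run_tenant: str,
              allowed_classes: set):
    """Return a Denial before dispatch, or None if the request may proceed."""
    if url_host not in allowed_hosts:
        return Denial.BLOCKED_DOMAIN
    if page_count > max_pages:
        return Denial.CRAWL_TOO_LARGE
    if tenant != run_tenant:
        return Denial.WRONG_TENANT
    if output_class not in allowed_classes:
        return Denial.OUT_OF_SCOPE
    return None
```

Because the outcome is typed, the denial rule can be recorded in the trace and asserted in tests, which is what makes the "nearest unsafe neighbor fails" claim provable rather than assumed.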
Treat scrape.extract as a narrow workflow, not a browser license
The safe first workflow is not “let the agent browse.” It is a scoped extraction request: one output class, one target set, one credential rail, one budget ceiling, and a denial case that proves the adjacent unsafe target fails before navigation or provider dispatch.
1. Define
Output: markdown summary of docs pages, with source URLs and fetch timestamps preserved.
2. Bound
Targets: allowed domains, crawl depth, byte cap, blocked paths, and typed denial for the nearest unsafe neighbor.
3. Prove
Trace: caller, normalized URL, provider route, credential mode, output class, denial rule, and recovery checkpoint.
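The Prove step is easiest to enforce if the trace schema is closed: a record either carries every field or is rejected at write time. A minimal sketch, assuming a hypothetical `make_trace` helper:

```python
from datetime import datetime, timezone

# The fields named in the Prove step; a trace entry must carry all of them.
TRACE_FIELDS = ("caller", "normalized_url", "provider_route",
                "credential_mode", "output_class", "denial_rule",
                "recovery_checkpoint")

def make_trace(**fields) -> dict:
    """Build one trace entry; reject it if any Prove-step field is missing."""
    missing = [f for f in TRACE_FIELDS if f not in fields]
    if missing:
        raise ValueError(f"trace missing fields: {missing}")
    return {"at": datetime.now(timezone.utc).isoformat(), **fields}
```

Rejecting incomplete traces at write time is the cheap version of auditability: the operator debugging tomorrow never discovers that the one field they need was optional.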
Preflight questions before the first repeated extraction
- What exact content should the agent extract: article body, table rows, product fields, contact records, screenshots, or post-action UI evidence?
- Which URL/domain patterns are allowed, and which adjacent targets must fail closed before navigation?
- Is JavaScript execution required, or would browser automation only increase authority and privacy risk?
- What output cap prevents one large page, crawl, actor run, screenshot set, or dataset from flooding the model or blowing the budget?
- Which credential rail owns proxy cost, captcha spend, provider quota, and downstream enrichment calls?
- What trace evidence would let an operator debug the run tomorrow: normalized URL, provider route, actor id, parser version, output class, denial rule, and recovery checkpoint?
Common failure modes
Browser convenience becomes broad authority
A full browser can click, scroll, read cookies, expose logged-in state, and collect screenshots. Use it only when the workflow requires browser state; otherwise prefer narrower fetch or extraction lanes.
Actor marketplace choice hides the real capability
A platform actor can be excellent, but the actor identity, input schema, dataset, and pricing model are now part of the execution contract. The agent should not discover them ad hoc mid-run.
Raw extraction loses source evidence
Markdown, HTML, JSON, and screenshots all need source URL, fetch time, normalized target, and provider route preserved together. Otherwise a plausible summary is not auditable.
Retry loops look like scraping scale
Crawl failures, empty pages, CAPTCHAs, and parser drift can produce runaway retries. A safe lane caps attempts, output size, crawl depth, and fallback branches before the loop starts.
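Capping the loop before it starts can be as small as a bounded wrapper around the fetch call. A sketch with illustrative defaults (`fetch` is whatever callable the lane uses; the caps are assumptions, not recommendations):

```python
def fetch_with_budget(fetch, url: str, *, max_attempts: int = 3,
                      max_bytes: int = 1_000_000) -> str:
    """Bounded retry loop: attempts are capped up front, so a CAPTCHA
    wall or an empty-page bug cannot become runaway scraping scale."""
    last_error = None
    for _ in range(max_attempts):
        try:
            body = fetch(url)
        except Exception as exc:   # network error, CAPTCHA, empty page
            last_error = exc
            continue
        if len(body) > max_bytes:  # oversized output fails closed, no retry
            raise ValueError("response exceeds byte cap")
        return body
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_error}")
```

Note the asymmetry: transient failures consume the retry budget, but an oversized response fails closed immediately, because retrying it would only fetch the same too-large page again.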