Better Fetch

Paul Crossland

Headless Chrome Is Not a Fetch Strategy

Modern headless Chrome runs like real Chrome, but bot walls still score sessions. Treat browser mode, clearance, and replay as separate signals.

Modern headless Chrome is a better browser than the old headless stack, but it is not a complete fetching strategy. If your scraper assumes headless: true is the only difference between success and failure, you will misdiagnose bot walls, lose session state, and retry the wrong layer.

The practical move is to separate three questions: did the page render, did the session earn enough trust, and can the data source be replayed safely? Those are different failure modes, and they need different instrumentation.

Headless now means real Chrome behavior

Chrome changed the meaning of headless automation. The Chrome Headless documentation explains that, since Chrome 112, headless mode creates platform windows without displaying them and shares the same browser implementation as regular Chrome. The old headless implementation now ships separately as chrome-headless-shell, and Chrome notes that --dump-dom executes page scripts before serializing the DOM.

Puppeteer reflects the same shift. Its headless modes guide says Puppeteer launches modern headless mode by default, while the old headless shell is now an explicit performance-oriented option. Selenium made a similar point in its headless mode migration note, which removed the shortcut APIs and pushed users to set their browser arguments explicitly.
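To make "choose the mode explicitly" concrete, here is a minimal Python sketch that builds a command line for either mode. The function name, the mode labels, and the bare executable names chrome and chrome-headless-shell are assumptions for illustration; real installations use platform-specific paths. It only builds the argument list and does not launch anything.

```python
from typing import List


def headless_dump_dom_command(url: str, mode: str = "new") -> List[str]:
    """Build an argv list that asks a headless browser to print the DOM of `url`.

    mode="new"   -> the regular Chrome binary in modern headless mode
    mode="shell" -> the separate old-headless binary
    Executable names are placeholders (assumption); substitute real paths.
    """
    if mode == "new":
        return ["chrome", "--headless=new", "--dump-dom", url]
    if mode == "shell":
        # The headless shell is headless by construction, so no mode flag is passed.
        return ["chrome-headless-shell", "--dump-dom", url]
    raise ValueError(f"unknown headless mode: {mode!r}")


if __name__ == "__main__":
    print(" ".join(headless_dump_dom_command("https://example.com")))
```

Keeping the mode as a named parameter means the choice shows up in configuration and logs instead of hiding behind a library default.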

For builders, this is good news. Rendering JavaScript-heavy pages in automation is less special than it used to be. It also means a blanket recommendation like "use headful" or "use headless" is usually too shallow. Rendering capability is only one part of the request.

Bot systems score the whole session

Cloudflare's Bot Scores documentation describes the bot score as a number from 1 to 99 that indicates how likely it is that a request is automated: lower scores mean more likely automated, higher scores mean more likely human. Inputs can include headers, session characteristics, and browser signals. In other words, a browser is not judged only by whether it can execute JavaScript. It is judged as a sequence of requests with context.

Cloudflare's JavaScript Detections documentation makes that statefulness concrete. The detection script is injected into HTML page views, not AJAX calls. Its result is stored in a cf_clearance cookie, and enforcement is a separate step: the site owner checks the detection result in a WAF rule or in a Worker. Cloudflare specifically warns against enforcing the JavaScript detection field on the first HTML request, because the browser needs a page view before the detection can run and issue state.

That matters for scraping architecture. If your first request is an internal JSON endpoint, you may be skipping the page view that creates the state required for later requests. If your retry loop throws away cookies, region, viewport, and browser profile after every failure, you may be resetting the very signals the target expects to mature.
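One way to keep that ordering honest is to make the session object refuse to pick an API call until a page view has happened. The sketch below is illustrative only: WarmSession and its methods are hypothetical names, it performs no network I/O, and the caller is assumed to feed it the cookies observed on each page view.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WarmSession:
    """Tracks whether a session has had an HTML page view before API calls."""

    cookies: Dict[str, str] = field(default_factory=dict)
    page_views: List[str] = field(default_factory=list)

    def record_page_view(self, url: str, set_cookies: Dict[str, str]) -> None:
        # Accumulate cookies across page views; retries reuse this same object
        # instead of starting from an empty jar.
        self.page_views.append(url)
        self.cookies.update(set_cookies)

    def ready_for_api(self) -> bool:
        # An API call is only allowed after at least one HTML page view.
        return bool(self.page_views)

    def next_step(self, page_url: str, api_url: str) -> str:
        # Returns the URL the pipeline should request next.
        return api_url if self.ready_for_api() else page_url


if __name__ == "__main__":
    session = WarmSession()
    print(session.next_step("https://example.com/", "https://example.com/api/items"))
```

A retry loop that holds on to one WarmSession per target keeps cookies between attempts, which is the property the paragraph above is asking for.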

Clearance is not a portable magic cookie

Challenge clearance is also stateful. Cloudflare's clearance documentation says a cf_clearance cookie proves a visitor passed a challenge and is tied to the specific visitor and device. The same page explains that clearance levels differ: a higher-level clearance can bypass lower challenge types, while lower-level clearance may still be challenged later.

Do not treat that cookie as a generic bearer token. A production fetcher should record the browser context that produced clearance: target host, top-level URL, country, proxy class, user agent family, viewport, timestamp, challenge level if known, and the endpoint sequence that followed. If a later request fails, compare those fields before blaming the selector or parser.
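A minimal sketch of such a record, assuming a Python fetcher: the ClearanceContext type, its field names, and the context_diff helper are hypothetical, chosen to mirror the list above. The timestamp is left out of the comparison because it differs on every request.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class ClearanceContext:
    target_host: str
    top_level_url: str
    country: str
    proxy_class: str
    user_agent_family: str
    viewport: Tuple[int, int]
    acquired_at: float  # Unix timestamp when clearance was observed
    challenge_level: Optional[str] = None  # None when unknown
    endpoint_sequence: Tuple[str, ...] = ()


def context_diff(earned: ClearanceContext, used: ClearanceContext) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (earned_value, used_value)} for every field that differs.

    acquired_at is skipped: it changes on every request and says nothing
    about portability on its own.
    """
    return {
        f.name: (getattr(earned, f.name), getattr(used, f.name))
        for f in fields(earned)
        if f.name != "acquired_at"
        and getattr(earned, f.name) != getattr(used, f.name)
    }


if __name__ == "__main__":
    earned = ClearanceContext(
        "example.com", "https://example.com/", "DE",
        "residential", "Chrome", (1280, 720), 0.0,
    )
    # Same cookie, different worker: only the country changed.
    used = replace(earned, country="US", acquired_at=60.0)
    print(context_diff(earned, used))
```

An empty diff points away from session portability and back toward the selector, parser, or endpoint.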

This also changes how you scale. A session that works in one geography or browser context may not work somewhere else. Copying cookies across workers can create noisy failures that look like bot detection but are really session portability bugs.

Use a three-stage fetch ladder

A reliable browser-grade pipeline should make the stages visible:

  1. Render the human page first. Let the browser execute page JavaScript, receive cookies, and expose network calls.
  2. Discover the data source from the rendered session. Capture XHR and fetch requests, status codes, response shapes, required headers, and whether the request depends on cookies.
  3. Replay only what is stable. If a JSON endpoint works with the warmed session, use it. If it depends on short-lived clearance, region, or client hints, keep it attached to the browser context instead of pretending it is a static API.
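The third stage can be sketched as a small decision function over requests captured in stage two. Everything here is an assumption for illustration: the CapturedRequest fields, the four plan labels, and the idea that the pipeline has already probed whether a call depends on cookies or short-lived state.

```python
from dataclasses import dataclass


@dataclass
class CapturedRequest:
    url: str
    status: int
    content_type: str
    needs_cookies: bool            # the call failed when cookies were removed
    needs_short_lived_state: bool  # depends on clearance, region, or client hints


def replay_plan(req: CapturedRequest) -> str:
    """Decide how a captured request should be fetched from now on."""
    if not (200 <= req.status < 300) or "json" not in req.content_type.lower():
        return "skip"                 # not a usable data source
    if req.needs_short_lived_state:
        return "in_browser"           # keep it attached to the browser context
    if req.needs_cookies:
        return "replay_with_session"  # reuse the warmed session's cookies
    return "replay_direct"            # stable without session state


if __name__ == "__main__":
    api = CapturedRequest(
        "https://example.com/api/items", 200, "application/json", True, False
    )
    print(replay_plan(api))
```

The short-lived-state check runs before the cookie check on purpose: a request that needs both should stay in the browser.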

This is why API discovery should come after browser warmup, not before it. The page load is not just a slow way to get HTML. It is often the step that establishes the trust context for the cleaner data path.

Log outcomes as signals, not just errors

The worst production pattern is a single FETCH_FAILED bucket. Split failures into actionable categories:

  • raw network failure
  • rendered page loaded but expected selector missing
  • challenge page or interstitial detected
  • clearance acquired but replay failed
  • JSON endpoint changed shape
  • region-specific denial
  • first request blocked before browser state existed

Each bucket points to a different fix. Selector missing suggests a frontend change. Replay failed after clearance suggests a session or header dependency. First-request failure suggests your pipeline should begin with a page view, not an API call. Region-specific denial suggests routing policy, not parser code.
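As a rough sketch, the buckets can be an enum plus a classifier over whatever the pipeline observed. The Observation fields, the check order, and the classify function are assumptions made for this example; the suggested actions cover only the four buckets discussed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FetchOutcome(Enum):
    NETWORK_FAILURE = "raw network failure"
    SELECTOR_MISSING = "rendered page loaded but expected selector missing"
    CHALLENGE_DETECTED = "challenge page or interstitial detected"
    REPLAY_FAILED = "clearance acquired but replay failed"
    JSON_SHAPE_CHANGED = "JSON endpoint changed shape"
    REGION_DENIED = "region-specific denial"
    BLOCKED_BEFORE_STATE = "first request blocked before browser state existed"


NEXT_ACTION = {
    FetchOutcome.SELECTOR_MISSING: "check for a frontend change",
    FetchOutcome.REPLAY_FAILED: "inspect session and header dependencies",
    FetchOutcome.BLOCKED_BEFORE_STATE: "begin the pipeline with a page view",
    FetchOutcome.REGION_DENIED: "review routing policy",
}


@dataclass
class Observation:
    network_error: bool = False
    blocked: bool = False
    had_page_view: bool = True
    region_denied: bool = False
    challenge_seen: bool = False
    has_clearance: bool = False
    replay_ok: bool = True
    json_shape_ok: bool = True
    selector_found: bool = True


def classify(obs: Observation) -> Optional[FetchOutcome]:
    """Map one observation to a bucket; None means nothing failed.

    The check order is a design choice: earlier layers win.
    """
    if obs.network_error:
        return FetchOutcome.NETWORK_FAILURE
    if obs.blocked and not obs.had_page_view:
        return FetchOutcome.BLOCKED_BEFORE_STATE
    if obs.region_denied:
        return FetchOutcome.REGION_DENIED
    if obs.challenge_seen and not obs.has_clearance:
        return FetchOutcome.CHALLENGE_DETECTED
    if obs.has_clearance and not obs.replay_ok:
        return FetchOutcome.REPLAY_FAILED
    if not obs.json_shape_ok:
        return FetchOutcome.JSON_SHAPE_CHANGED
    if not obs.selector_found:
        return FetchOutcome.SELECTOR_MISSING
    return None


if __name__ == "__main__":
    outcome = classify(Observation(blocked=True, had_page_view=False))
    print(outcome, "->", NEXT_ACTION.get(outcome))
```

Logging the enum value rather than a free-form message keeps the buckets countable, so a spike in one category is visible without reading individual errors.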

The builder takeaway

Modern headless Chrome is a strong rendering substrate, but it does not remove the need for session-aware fetching. Treat headless mode as a browser choice, clearance as visitor state, and API replay as a separate contract to test.

If you build that separation into your logs and retries, you will spend less time toggling browser flags and more time fixing the real failure: rendering, trust, or replay.