Paul Crossland
Discover APIs Before Scraping HTML
Modern pages often expose cleaner JSON behind the UI. Capture network calls first, then scrape rendered HTML only when you must.
Most scraping failures start with the wrong target.
A team points a parser at rendered HTML, writes selectors for the visible page, ships a cron job, and then spends the next month chasing broken class names, skeleton states, consent banners, region variants, lazy-loaded lists, and partial responses. The page looked like the source of truth because it was what the browser displayed. In many modern web apps, it is only the final presentation layer.
The better first question is: what data did the browser fetch to build this page?
For production extraction, API discovery should happen before HTML scraping. Load the page like a real browser, capture the XHR and fetch traffic, identify the structured endpoint that powers the UI, and only fall back to DOM parsing when there is no cleaner data source.
The browser already knows where the data is
Modern sites commonly separate the document shell from the data it renders. The first navigation returns an application frame, then client-side JavaScript calls JSON endpoints, GraphQL APIs, search backends, product feeds, personalization services, or route data endpoints.
That is not an edge case. MDN describes the Fetch API as the browser interface for fetching resources across the network, built around request and response objects. MDN's XMLHttpRequest documentation makes the same point for the older API: web apps can make HTTP requests and update part of a page with data from the server instead of reloading the whole page.
Chrome DevTools documents the Network panel as the place developers inspect network activity. That is exactly the mental model extraction systems should copy. A human debugging a web app does not start by regexing the DOM. They open the Network panel, filter the traffic, click the request that contains the payload, and inspect the response.
Your automated fetch pipeline should do the same thing.
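One low-effort way to start is to export a HAR file from the Network panel and filter it in code. The sketch below assumes the standard HAR layout (`log.entries`, each with `request` and `response.content`) and treats `_resourceType` as an optional, Chrome-specific field. The function name, the sample URLs, and the "contains json" MIME check are illustrative assumptions, not part of any particular tool.

```python
def json_api_candidates(har: dict) -> list[dict]:
    """Return HAR entries that look like script-initiated JSON responses."""
    candidates = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        content = response.get("content", {})
        mime = (content.get("mimeType") or "").lower()
        # "_resourceType" is a Chrome extension to HAR; if it is missing,
        # fall back to the MIME type alone.
        resource_type = entry.get("_resourceType")
        script_initiated = resource_type in (None, "xhr", "fetch")
        if script_initiated and "json" in mime:
            candidates.append({
                "url": request.get("url"),
                "method": request.get("method"),
                "status": response.get("status"),
                "mime_type": mime,
                "size": content.get("size", 0),
            })
    return candidates


# Hypothetical capture: one document, one JSON fetch, one image.
sample_har = {
    "log": {
        "entries": [
            {
                "_resourceType": "document",
                "request": {"method": "GET", "url": "https://shop.example/jacket"},
                "response": {"status": 200,
                             "content": {"mimeType": "text/html", "size": 48211}},
            },
            {
                "_resourceType": "fetch",
                "request": {"method": "GET",
                            "url": "https://shop.example/api/products/123"},
                "response": {"status": 200,
                             "content": {"mimeType": "application/json; charset=utf-8",
                                         "size": 1834}},
            },
            {
                "_resourceType": "image",
                "request": {"method": "GET", "url": "https://shop.example/img/1.webp"},
                "response": {"status": 200,
                             "content": {"mimeType": "image/webp", "size": 90211}},
            },
        ]
    }
}

candidates = json_api_candidates(sample_har)
for c in candidates:
    print(c["status"], c["method"], c["url"])
```

A filter like this only narrows the list. Deciding which JSON call actually powers the page still takes a comparison against what the page shows.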
HTML is the least stable contract
Rendered HTML is optimized for browsers and users, not downstream systems. It changes whenever product, design, experimentation, accessibility, or analytics code changes.
A selector that works today can break because:
- a component library renamed classes
- an A/B test wrapped content in a new container
- infinite scroll delayed the data until after initial load
- a mobile or regional layout returned different markup
- a consent banner covered or withheld part of the page
- server components or hydration changed the delivery path
- the content moved into a script-managed state object
- the site shipped a redesign that preserved the API but changed the UI
The API behind the page is not guaranteed to be stable either, but it is often closer to the product's data contract than the visible DOM. It usually carries identifiers, timestamps, pagination cursors, normalized fields, prices, availability, review counts, images, and nested objects in a shape the frontend already depends on.
If you parse HTML first, you are reverse-engineering the presentation. If you capture the data request first, you are reverse-engineering the application boundary.
The API discovery pass
A practical discovery pass is short and repeatable:
- Navigate to the human page in a real browser context.
- Capture XHR and fetch requests during page load and the first few seconds after.
- Keep response bodies for likely JSON calls, but cap them to avoid huge logs.
- Rank calls by content type, status, response size, URL pattern, and whether the body contains the entities visible on the page.
- Re-request the best candidate directly with the same session, cookies, headers, region, and referer where appropriate.
- Compare the direct API response with the rendered page to make sure it actually explains the UI.
- Only then decide whether the job should call the API, render the page, or combine both.
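The ranking step can be sketched as a small scoring function. Everything here is an assumption made for illustration: the `CapturedCall` shape, the weights, the 512-byte threshold, and the URL hints are placeholders you would tune against your own captures.

```python
from dataclasses import dataclass


@dataclass
class CapturedCall:
    url: str
    status: int
    content_type: str
    body_size: int
    matched_entities: int  # count of visible page values found in the body


# Illustrative URL fragments that often mark data endpoints.
URL_HINTS = ("/api/", "graphql", "/search")


def score_call(call: CapturedCall) -> int:
    """Heuristic score: higher means more likely to be the data endpoint."""
    score = 0
    if "json" in call.content_type.lower():
        score += 3
    if 200 <= call.status < 300:
        score += 2
    if call.body_size >= 512:  # very small bodies are often beacons or acks
        score += 1
    if any(hint in call.url.lower() for hint in URL_HINTS):
        score += 1
    # Entity matches are the strongest signal, so weight them most.
    score += 2 * call.matched_entities
    return score


def rank_calls(calls: list[CapturedCall]) -> list[CapturedCall]:
    return sorted(calls, key=score_call, reverse=True)


calls = [
    CapturedCall("https://shop.example/collect", 204, "application/json", 0, 0),
    CapturedCall("https://shop.example/api/products/123", 200,
                 "application/json", 1834, 3),
    CapturedCall("https://shop.example/jacket", 200, "text/html", 48211, 0),
]
ranked = rank_calls(calls)
print([(score_call(c), c.url) for c in ranked])
```

The score is only a shortlist. The replay and comparison steps still decide whether the top candidate is usable.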
Better Fetch is designed around this workflow. The Better Fetch docs describe browser-grade fetching with real Chromium profiles, sticky sessions, regional routing, Cloudflare clearance handling, rendered page data, and optional network capture. Network capture is the important option for this pattern: it records matching browser network calls, defaults to XHR and fetch, and can include response bodies for API discovery and debugging.
That changes the unit of work. Instead of asking "can I scrape this page?", ask "which request produced the data I need?"
What to log from discovery
Do not store every byte forever. Store enough to explain and reproduce the decision.
A useful discovery record looks like this:
page_url
final_url
country
session_key
navigation_status
candidate_request_url
candidate_method
candidate_status
candidate_content_type
candidate_body_sample_hash
candidate_body_size
matched_entities
pagination_hint
auth_or_cookie_required
cors_observation
selected_strategy
captured_at
The matched_entities field is the quality check. If the page shows a product named "Trail Running Jacket" and the candidate JSON contains the same title, price, SKU, and image URL, you probably found the right endpoint. If the candidate only contains analytics events, recommendations for a different widget, or a preload manifest, keep looking.
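As a rough sketch, both `candidate_body_sample_hash` and `matched_entities` can be computed from the captured body with the standard library. The 4096-byte cap, the helper names, and the exact-match comparison are assumptions. Real pages format prices and titles differently from the JSON, so a production check would need normalization.

```python
import hashlib
import json

MAX_SAMPLE_BYTES = 4096  # assumed cap so logs stay small


def body_sample_hash(body: bytes, limit: int = MAX_SAMPLE_BYTES) -> str:
    """SHA-256 of the first `limit` bytes of the response body."""
    return hashlib.sha256(body[:limit]).hexdigest()


def iter_scalars(node):
    """Yield every leaf value in a nested JSON structure."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_scalars(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_scalars(value)
    else:
        yield node


def matched_entities(visible_values: list[str], body: bytes) -> list[str]:
    """Return the visible page values that appear as leaf values in the JSON."""
    try:
        payload = json.loads(body)
    except ValueError:  # not JSON, or not decodable text
        return []
    scalars = {str(v).strip().lower() for v in iter_scalars(payload) if v is not None}
    return [v for v in visible_values if v.strip().lower() in scalars]


# Hypothetical candidate body and values read off the rendered page.
body = json.dumps({
    "product": {
        "title": "Trail Running Jacket",
        "price": "129.00",
        "sku": "TRJ-001",
        "images": ["https://shop.example/img/1.webp"],
    }
}).encode("utf-8")
visible = ["Trail Running Jacket", "129.00", "TRJ-001", "Free shipping"]

matches = matched_entities(visible, body)
print(matches, body_sample_hash(body)[:12])
```

Here three of the four visible values match, which is the kind of evidence that justifies promoting a candidate. A body full of analytics events would match none.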
The selected_strategy should be explicit:
| Strategy | Use when |
|---|---|
| direct_api | JSON endpoint returns the data reliably with replayable request context |
| browser_api | Endpoint needs browser cookies, headers, session, or clearance |
| rendered_html | Data exists only in the final DOM or is easier to verify visually |
| screenshot_or_ocr | Visual state matters more than structured fields |
| blocked_or_policy | Response indicates access policy, bot wall, auth, or payment requirement |
That last row matters. API discovery is not a license to ignore access controls. If the browser observes a policy gate, authentication boundary, or bot-management decision, classify it separately from a network failure.
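A minimal sketch of that separation, assuming the only inputs are an HTTP status (or `None` when no response arrived). The status set and the outcome labels other than `blocked_or_policy` are assumptions for illustration. Many bot walls answer with a 200 and a challenge page, so status alone will miss some policy gates.

```python
from typing import Optional

# Assumed set of statuses that usually signal an access decision
# rather than a transient fault.
POLICY_STATUSES = {401, 402, 403, 429, 451}


def classify_outcome(status: Optional[int]) -> str:
    """Separate policy decisions from network and server failures."""
    if status is None:
        # No HTTP response at all: DNS failure, timeout, connection reset.
        return "network_failure"
    if status in POLICY_STATUSES:
        return "blocked_or_policy"
    if 200 <= status < 300:
        return "ok"
    if status >= 500:
        return "server_error"
    return "other_http_error"


for status in (200, 403, 429, None, 503, 404):
    print(status, "->", classify_outcome(status))
```

Keeping these labels apart matters operationally: a network failure is worth retrying, while a policy outcome should stop the job and surface for review.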
CORS is a browser signal, not the whole access story
One trap in API discovery is confusing browser behavior with server capability.
MDN's CORS guide explains that browsers restrict cross-origin requests initiated by scripts and that CORS uses HTTP headers to tell browsers which origins may read a response. That is why a frontend request can fail in the browser even when a server-to-server request can technically reach the endpoint.
For extraction infrastructure, CORS tells you about the browser security model and the site's intended client boundary. It does not automatically mean the endpoint is public infrastructure for arbitrary reuse. You still need to respect authentication, terms, robots guidance where applicable, rate limits, customer authorization, and policy signals.
A good fetch system records CORS observations, but it does not treat "I can call it from a server" as the same thing as "I should call it at scale."
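To make the `cors_observation` field concrete, here is a simplified sketch that records what the response headers would mean for a cross-origin script. It covers only the `Access-Control-Allow-Origin` and `Access-Control-Allow-Credentials` checks and ignores preflight requests. The function name and the output keys are assumptions.

```python
def cors_observation(request_origin: str,
                     response_headers: dict[str, str],
                     with_credentials: bool = False) -> dict:
    """Summarize whether a cross-origin browser script could read the response."""
    headers = {k.lower(): v.strip() for k, v in response_headers.items()}
    allow_origin = headers.get("access-control-allow-origin")
    allow_credentials = headers.get("access-control-allow-credentials") == "true"

    if allow_origin is None:
        readable = False
    elif with_credentials:
        # With credentials, the wildcard is not accepted and the server
        # must also opt in with Access-Control-Allow-Credentials: true.
        readable = allow_origin == request_origin and allow_credentials
    else:
        readable = allow_origin in ("*", request_origin)

    return {
        "allow_origin": allow_origin,
        "allow_credentials": allow_credentials,
        "browser_script_can_read": readable,
    }


origin = "https://shop.example"
print(cors_observation(origin, {"Access-Control-Allow-Origin": "*"}))
print(cors_observation(origin, {"Access-Control-Allow-Origin": "*"},
                       with_credentials=True))
print(cors_observation(origin, {"Content-Type": "application/json"}))
```

The result is an observation to log, not a permission check. A `False` here says nothing about whether a server-side call would succeed, and a `True` says nothing about whether reuse is authorized.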
Replay is the real test
Finding a promising network request is not enough. The endpoint has to replay under controlled conditions.
Replay the request in three modes:
- Same browser session, same region, same cookies.
- Fresh session, same region.
- Server-side API call with only the minimal required headers.
The differences tell you how fragile the extraction will be.
If only the original browser session works, the endpoint may be bound to short-lived cookies, bot clearance, signed URLs, or route state. You might still use it, but the job should be treated as a browser API workflow rather than a cheap HTTP client workflow.
If a fresh browser session works but a plain server call fails, the site may require realistic TLS, headers, JavaScript-set cookies, or region-specific routing. That is exactly where browser-grade fetching earns its keep.
If the minimal server-side call works, you have found the best case: skip rendering for the steady-state job, keep browser discovery as a monitor, and spend your budget on validation instead of repeated page loads.
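The three replay outcomes map fairly directly onto a strategy choice. The sketch below assumes each replay has already been run and reduced to a boolean. The `ReplayResult` type is hypothetical, and falling back to `rendered_html` when nothing replays is one reasonable default rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class ReplayResult:
    same_session_ok: bool    # same browser session, region, and cookies
    fresh_session_ok: bool   # new browser session, same region
    minimal_server_ok: bool  # plain HTTP client with minimal headers


def select_strategy(result: ReplayResult) -> str:
    """Pick the cheapest strategy that the replay evidence supports."""
    if result.minimal_server_ok:
        # Best case: no rendering needed for the steady-state job.
        return "direct_api"
    if result.fresh_session_ok or result.same_session_ok:
        # The endpoint works, but only with browser-established context.
        return "browser_api"
    # The candidate did not replay at all; parse the rendered page instead.
    return "rendered_html"


print(select_strategy(ReplayResult(True, True, True)))
print(select_strategy(ReplayResult(True, True, False)))
print(select_strategy(ReplayResult(True, False, False)))
print(select_strategy(ReplayResult(False, False, False)))
```

When only the original session works, the result is still `browser_api`, but the job is more fragile, so it is worth logging which mode succeeded alongside the chosen strategy.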
Measure the network, not just the page
API-first extraction also gives you better observability.
The PerformanceResourceTiming interface exists because resource-level timing matters: DNS, connection, TLS, request time, response time, transfer size, caching, and protocol all affect how a page behaves. You do not need to expose every browser timing field to your product, but you should carry the lesson into your fetch logs.
When a job slows down, did the document navigation slow down, or did the JSON endpoint slow down? When results go missing, did the page render without the API call, did the API return an empty list, or did the endpoint move behind a policy gate? When costs increase, are you rendering pages that could be served by a replayed JSON endpoint?
A DOM-only scraper cannot answer those questions cleanly. A network-aware fetch pipeline can.
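As an illustration, a resource timing entry can be reduced to a few phase durations that fit in a fetch log. The input keys below are the property names of the browser's `PerformanceResourceTiming` interface. The output keys and sample numbers are assumptions. Note that cross-origin resources without a `Timing-Allow-Origin` header report zeros for most of these fields, so treat zero phases with suspicion.

```python
def timing_phases(entry: dict) -> dict:
    """Reduce one resource timing entry to phase durations in milliseconds."""
    tls_start = entry.get("secureConnectionStart", 0)
    return {
        "dns_ms": entry["domainLookupEnd"] - entry["domainLookupStart"],
        # connect_ms spans the TCP and TLS handshakes together.
        "connect_ms": entry["connectEnd"] - entry["connectStart"],
        # secureConnectionStart is 0 when no TLS handshake was measured.
        "tls_ms": entry["connectEnd"] - tls_start if tls_start > 0 else 0.0,
        "request_ms": entry["responseStart"] - entry["requestStart"],
        "response_ms": entry["responseEnd"] - entry["responseStart"],
        "total_ms": entry["responseEnd"] - entry["startTime"],
    }


# Hypothetical timing entry for the JSON endpoint behind a page.
api_entry = {
    "startTime": 100.0,
    "domainLookupStart": 105.0,
    "domainLookupEnd": 125.0,
    "connectStart": 125.0,
    "secureConnectionStart": 140.0,
    "connectEnd": 170.0,
    "requestStart": 171.0,
    "responseStart": 421.0,
    "responseEnd": 451.0,
}

phases = timing_phases(api_entry)
slowest = max((k for k in phases if k != "total_ms"), key=phases.get)
print(phases)
print("slowest phase:", slowest)
```

In this made-up entry the wait for the first response byte dominates, which points at the endpoint rather than the network path. Logging the same breakdown for the document navigation lets you tell the two apart.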
Where rendered HTML still wins
API discovery should be first, not exclusive.
Rendered HTML is still the right source when the goal is visual verification, final user-facing copy, layout-sensitive data, screenshot evidence, or content that is only assembled after multiple client and server steps. Some pages also embed enough state in the document that parsing the HTML is simpler and more stable than replaying several internal requests.
The point is sequencing. Do the discovery pass before committing to selectors. If the network traffic gives you a clean data source, use it. If it does not, scrape the rendered page with a clear reason and keep the discovery evidence so future failures are easier to debug.
The builder takeaway
The modern web is not one request and one document. It is a browser session, a navigation, a set of resource requests, a policy environment, and a rendered result.
Treating the final DOM as the only artifact throws away most of the evidence you need to build reliable extraction. Treating network capture as the first-class artifact gives you a better path:
- Render once to observe the application.
- Capture the XHR and fetch calls.
- Identify the request that actually carries the data.
- Replay it with the minimum reliable context.
- Fall back to rendered HTML only when the API path is weaker.
That is the difference between a scraper that keeps breaking and a fetch pipeline that learns how the site really works.