Better Fetch

· Paul Crossland

Log Accept-Language Before Scaling

Locale can change rendered content and API payloads. Record language, Vary, and geo state before comparing fetch results.

A fetch that works in one locale can silently become a different data product in another. Before you scale a browser-grade crawler, log the language and cache-variation signals that shaped the response, then compare extraction results by representation instead of assuming every URL has one canonical body.

Locale is part of the request contract

Accept-Language is not just a browser preference header. MDN describes it as a request header that tells the server which human languages the client prefers, and it is one of the headers used in server-driven content negotiation.
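To log the header usefully, it helps to store it in parsed form rather than as an opaque string. The following is a minimal sketch, assuming the standard comma-separated syntax with optional q weights; the function name parse_accept_language is hypothetical, and treating a malformed weight as 0 is an assumption made for illustration.

```python
from typing import List, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Return (language_tag, q) pairs, most preferred first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, *params = [p.strip() for p in part.split(";")]
        q = 1.0  # a tag without a q parameter carries full weight
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: a malformed weight means "not wanted"
        # language tags are case-insensitive, so normalize for logging
        prefs.append((tag.lower(), q))
    # sorted() is stable, so tags with equal weight keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)
```

Storing the parsed list next to the raw header makes it easier to spot sessions whose first preference differs from the one the job was meant to send.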

That matters for scraping because many sites do not expose locale only in the path. The same product, booking, search, or documentation URL may return different copy, currency hints, date formats, consent flows, inventory modules, or internal API parameters depending on language, region, cookies, and previous redirects.

If your pipeline stores only url and status, two successful 200 responses can look identical while producing incompatible parsed records.

Check Vary before trusting cache or diffs

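The Vary response header is the server's signal that specific request headers influenced the selected representation. MDN's content negotiation guide notes that servers use Vary to indicate which request headers were used so caches can behave correctly. The absence of Vary does not prove the opposite: a server can negotiate on language without declaring it, so log the header as evidence rather than relying on it as a guarantee.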

For fetch pipelines, treat Vary as extraction metadata and store it with the rest of the request and response context:

url
final_url
status
accept_language
country
vary
content_language
redirect_chain
session_id
parser_version
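One way to carry these fields is a small immutable record written next to every parsed result. This is a sketch, not a prescribed schema: the class name FetchRecord, the field types, and the defaults are assumptions; only the field names come from the list above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FetchRecord:
    """Request and response context stored beside each extracted record."""
    url: str
    final_url: str
    status: int
    accept_language: Optional[str]   # header value sent, None if omitted
    country: Optional[str]           # egress or geo hint used for the fetch
    vary: Optional[str]              # raw Vary response header
    content_language: Optional[str]  # raw Content-Language response header
    redirect_chain: Tuple[str, ...] = ()
    session_id: Optional[str] = None
    parser_version: str = "unknown"  # assumption: placeholder default

    def to_json(self) -> str:
        # sort_keys keeps log lines stable and easy to diff
        return json.dumps(asdict(self), sort_keys=True)


en = FetchRecord("https://example.com/p/1", "https://example.com/p/1", 200,
                 "en-US", "US", "Accept-Language", "en")
fr = FetchRecord("https://example.com/p/1", "https://example.com/p/1", 200,
                 "fr-FR", "FR", "Accept-Language", "fr")
```

The two example records share url and status, yet they compare as different because the locale context is part of the record.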

When Vary includes Accept-Language, do not compare a US English render against a French render as a simple content diff. Compare them as separate representations with their own parser expectations.
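A sketch of that separation: group fetches by a key built from the URL plus the values of whichever request headers Vary names, and only diff within a group. The function names and the dictionary shape of a fetch are hypothetical. Treating Vary: * as "every logged request header matters" is an assumption; the wildcard really means the response may depend on anything about the request.

```python
from collections import defaultdict


def vary_fields(vary):
    """Lowercased header names listed in a Vary value."""
    if not vary:
        return set()
    return {name.strip().lower() for name in vary.split(",") if name.strip()}


def representation_key(url, request_headers, vary):
    """URL plus the request header values that Vary says selected the body."""
    lowered = {k.lower(): v for k, v in request_headers.items()}
    fields = vary_fields(vary)
    if "*" in fields:
        # assumption: fall back to every request header we logged
        fields = set(lowered)
    varied = tuple(sorted((name, lowered.get(name, "")) for name in fields))
    return (url, varied)


def group_by_representation(fetches):
    """Bucket fetch dicts so diffs only run inside one representation."""
    groups = defaultdict(list)
    for fetch in fetches:
        key = representation_key(
            fetch["url"], fetch["request_headers"], fetch["vary"]
        )
        groups[key].append(fetch)
    return dict(groups)
```

With this grouping, a US English fetch and a French fetch of the same URL land in different buckets whenever the server lists Accept-Language in Vary, and in the same bucket when it lists only headers on which the two requests agree.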

Crawlers and search engines expose the same trap

Google's international site guidance warns that if you automatically redirect or reroute users based on language settings, Google might not find every variation. The same document notes that Googlebot usually originates from the United States and sends requests without an Accept-Language header.

That is a useful warning for data teams too. If your crawler defaults to one region or one language, you may never see the variant your customers, QA team, or downstream users care about. Conversely, if a browser session inherits a non-default locale, you may accidentally build a parser around a variant you did not intend to target.

A practical capture pattern

Start every new domain with a small locale matrix before scaling:

  1. Fetch the canonical URL with your default browser context.
  2. Repeat with the target country and explicit Accept-Language values you plan to support.
  3. Record final_url, Vary, Content-Language, the language actually rendered on the page, and the API calls discovered during render.
  4. Diff the network calls, not only the HTML.
  5. Decide whether the crawler owns one representation or a set of locale-specific representations.
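Steps 1 to 3 can be sketched as a small driver that builds the matrix and records one row per cell. The fetch callable is hypothetical and injected on purpose: it stands in for whatever browser or HTTP client you use, and the shape of its return value (final_url, headers, html_lang, api_calls) is an assumption of this sketch, as is using the page's html lang attribute as a proxy for visible language.

```python
from itertools import product


def locale_matrix(countries, languages):
    """Default context first, then every country and language pair."""
    cells = [(None, None)]  # step 1: whatever the default context sends
    cells += list(product(countries, languages))  # step 2: explicit variants
    return cells


def capture(url, cells, fetch):
    """Run one fetch per cell and keep the signals that shaped the response.

    fetch(url, headers=..., country=...) is assumed to return a dict with
    'final_url', 'headers' (canonical header names), and optionally
    'html_lang' and 'api_calls'.
    """
    rows = []
    for country, language in cells:
        headers = {"Accept-Language": language} if language else {}
        result = fetch(url, headers=headers, country=country)
        rows.append({
            "country": country,
            "accept_language": language,
            "final_url": result["final_url"],
            "vary": result["headers"].get("Vary"),
            "content_language": result["headers"].get("Content-Language"),
            "html_lang": result.get("html_lang"),
            # sorted so rows from different cells are directly comparable
            "api_calls": sorted(result.get("api_calls", [])),
        })
    return rows
```

Keeping the matrix small, a handful of cells per new domain, is usually enough to reveal whether locale changes the final URL, the declared language, or the API calls made during render.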

If the internal API changes by locale, promote locale to a first-class input in your job queue. If only the rendered labels change, keep the parser language-agnostic and validate on normalized fields.
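That decision can be made mechanical by comparing the API calls each locale produced. The sketch below reduces every call to its path and sorted query parameters, then checks whether all locales produced the same set. The function names and the two returned labels are hypothetical, and the comparison assumes calls carry no per-request noise such as timestamps or cache-busting tokens, which would need stripping first.

```python
from urllib.parse import parse_qsl, urlsplit


def api_signature(call_url):
    """Path plus sorted query pairs; host and fragment are ignored."""
    parts = urlsplit(call_url)
    query = tuple(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return (parts.path, query)


def locale_strategy(api_calls_by_locale):
    """Suggest how to treat locale, given {locale: [api call URLs]}."""
    signatures = {
        locale: frozenset(api_signature(call) for call in calls)
        for locale, calls in api_calls_by_locale.items()
    }
    if len(set(signatures.values())) > 1:
        # the internal API differs by locale: queue jobs per locale
        return "locale_as_job_input"
    # same calls everywhere: only rendered labels can differ
    return "language_agnostic_parser"
```

Parameter order is normalized before comparison, so a call that merely reorders its query string does not count as a locale difference.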

What to do differently

Do not debug every locale mismatch as a flaky selector, bad proxy, or bot wall. First ask whether the browser asked for a different representation.

For production browser fetching, the safer default is simple: capture language, region, session, and Vary beside every extracted record. That makes later failures explainable, keeps cache behavior honest, and prevents a crawler from quietly mixing incompatible page variants under one URL.

Sources: MDN on Accept-Language, MDN on Vary, MDN on HTTP content negotiation, and Google Search Central on multi-regional sites.