Paul Crossland
Make Crawlers Conditional, Not Repetitive
Use ETags, If-None-Match, and 304s to avoid needless renders, reduce load, and make browser-grade crawlers easier to operate.
Most crawling systems choose between two costly options: fetch the page again in full, or skip it entirely and risk serving stale data. A better default is conditional fetching. Store validators like ETag and Last-Modified, replay them before you spend a browser render, and treat 304 Not Modified as a successful freshness signal rather than a missing page.
That sounds like old HTTP plumbing, but it matters more when a pipeline mixes cheap API calls, rendered pages, sticky sessions, and bot-wall handling. Conditional requests give the crawler a low-cost question to ask before it burns the most expensive path.
A validator is part of the extraction contract
MDN describes the ETag response header as an identifier for a specific version of a resource. When a later request sends that value in If-None-Match, the server can answer with 304 Not Modified if the representation has not changed. The HTTP semantics spec groups this under conditional requests: metadata from one response becomes a validator for a later request.
For a production data job, that means the output record should not only contain parsed fields. It should also keep the freshness metadata that made the record cheap to revisit:
{
  "url": "https://example.com/products/123",
  "etag": "\"b20a0973b226eeea30362acb81f9e0b3\"",
  "last_modified": "Wed, 28 Aug 2024 10:36:35 GMT",
  "fetched_at": "2026-06-16T08:10:00Z",
  "content_hash": "sha256:...",
  "render_mode": "api"
}
If the next run gets a 304, the crawler can keep the previous parse, update the freshness timestamp, and avoid a browser session entirely.
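As a minimal sketch of how such a record might be replayed, the Python below builds conditional request headers from the stored validators and refreshes the record after a 304. The function names `conditional_headers` and `refresh_on_304` are hypothetical, and the sketch assumes `fetched_at` doubles as the freshness timestamp; a real pipeline may prefer a separate field.

```python
from datetime import datetime, timezone
from typing import Optional


def conditional_headers(record: dict) -> dict:
    """Build request headers that replay the stored validators.

    Prefers the ETag; falls back to Last-Modified only when no ETag is stored.
    """
    headers = {}
    if record.get("etag"):
        # The ETag is sent back verbatim, including its surrounding quotes.
        headers["If-None-Match"] = record["etag"]
    elif record.get("last_modified"):
        headers["If-Modified-Since"] = record["last_modified"]
    return headers


def refresh_on_304(record: dict, now: Optional[datetime] = None) -> dict:
    """Return a copy of the record with only the freshness timestamp updated.

    Assumption: `fetched_at` is the field that tracks freshness.
    """
    now = now or datetime.now(timezone.utc)
    refreshed = dict(record)  # parsed fields and validators stay untouched
    refreshed["fetched_at"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return refreshed
```

Returning a copy rather than mutating the stored record keeps the previous state available if the write to the crawl store fails.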
Put the conditional probe before the browser
Browser-grade fetching is valuable when the page requires JavaScript, session continuity, location, or bot-wall handling. It should not be the first thing you spend on every known URL.
A useful replay ladder looks like this:
- Send a lightweight GET or HEAD with If-None-Match when you have an ETag.
- Fall back to If-Modified-Since when you only have Last-Modified.
- If the server returns 304, reuse the last parsed result and mark the row fresh.
- If the server returns 200, compare content hash and parse normally.
- Only escalate to API discovery or rendered browser fetch when the cheap path cannot answer the question.
This keeps browser work focused on pages that changed, pages that hide data behind client rendering, and pages that need a real session.
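The ladder can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: `probe` and the `Fetch` transport signature are hypothetical, the transport is injected so the logic can run without a network, and response header names are assumed to arrive in their canonical casing.

```python
import hashlib
from typing import Callable, Tuple

# Hypothetical transport: (url, request_headers) -> (status, response_headers, body)
Fetch = Callable[[str, dict], Tuple[int, dict, bytes]]


def probe(record: dict, fetch: Fetch) -> Tuple[str, dict]:
    """Walk the cheap rungs of the ladder and return (action, updated_record).

    action is "reuse", "parse", or "escalate".
    """
    headers = {}
    if record.get("etag"):
        headers["If-None-Match"] = record["etag"]
    elif record.get("last_modified"):
        headers["If-Modified-Since"] = record["last_modified"]

    status, resp_headers, body = fetch(record["url"], headers)

    if status == 304:
        # The stored representation is still valid: keep the last parse.
        return "reuse", dict(record)

    if status == 200:
        digest = "sha256:" + hashlib.sha256(body).hexdigest()
        updated = dict(record)
        # Store the validators of the new representation (None if absent).
        updated["etag"] = resp_headers.get("ETag")
        updated["last_modified"] = resp_headers.get("Last-Modified")
        updated["content_hash"] = digest
        unchanged = digest == record.get("content_hash")
        return ("reuse" if unchanged else "parse"), updated

    # Anything else is a question the cheap path cannot answer.
    return "escalate", dict(record)
```

Only the "escalate" result would hand the URL to API discovery or a rendered browser fetch; the other two results finish without opening a browser session.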
Do not confuse 304 with an empty response
A 304 Not Modified response is supposed to have no message body. MDN's 304 reference says it is sent for conditional GET or HEAD requests when the cached version is still valid. A scraper that expects every successful request to include HTML can misclassify a bodiless 304 as an extraction failure.
Log it separately:
| Outcome | Meaning | Next action |
|---|---|---|
| 200 with changed hash | Content may need re-parse | Parse and update validators |
| 200 with same hash | Server did not honor validators, but content is stable | Keep result, update seen time |
| 304 | Cached representation is still valid | Reuse previous parse |
| 403 or challenge page | Access or bot signal changed | Escalate to session-aware browser fetch |
| 429 | Rate limit or quota signal | Pace according to retry policy |
That distinction matters operationally. A 304 should improve confidence in the stored record; it should not page an engineer or trigger proxy rotation.
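One way to keep these outcomes apart in logs is a small classifier, sketched here. The `Outcome` enum, the `classify` function, and the `is_challenge` flag are hypothetical names; detecting a challenge page is left to the caller, and the fallback for unlisted statuses is an assumption that does not come from the table.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    CHANGED = "200 with changed hash"
    SAME_HASH = "200 with same hash"
    NOT_MODIFIED = "304"
    BLOCKED = "403 or challenge page"
    RATE_LIMITED = "429"
    UNKNOWN = "unclassified"  # assumption: catch-all for statuses not in the table


# Next actions copied from the table; the UNKNOWN entry is an assumption.
NEXT_ACTION = {
    Outcome.CHANGED: "Parse and update validators",
    Outcome.SAME_HASH: "Keep result, update seen time",
    Outcome.NOT_MODIFIED: "Reuse previous parse",
    Outcome.BLOCKED: "Escalate to session-aware browser fetch",
    Outcome.RATE_LIMITED: "Pace according to retry policy",
    Outcome.UNKNOWN: "Investigate",
}


def classify(
    status: int,
    new_hash: Optional[str] = None,
    stored_hash: Optional[str] = None,
    is_challenge: bool = False,
) -> Outcome:
    """Map a response to one of the table's rows."""
    # A challenge page can arrive with a 200, so check the flag first.
    if status == 403 or is_challenge:
        return Outcome.BLOCKED
    if status == 429:
        return Outcome.RATE_LIMITED
    if status == 304:
        return Outcome.NOT_MODIFIED
    if status == 200:
        if new_hash is not None and new_hash == stored_hash:
            return Outcome.SAME_HASH
        return Outcome.CHANGED
    return Outcome.UNKNOWN
```

Logging `classify(...).value` gives each row its own label, so a run full of 304s shows up as reuse rather than as failed extractions.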
Validators are not universal truth
Conditional requests work best when the origin or CDN exposes stable validators for the thing you actually care about. Many modern pages are assembled from multiple APIs. A document-level ETag may stay stable while an XHR payload changes, or it may churn because of unrelated template noise.
For dynamic sites, store validators at the narrowest useful layer:
- the JSON endpoint discovered from the page's network traffic
- the rendered document when no cleaner API exists
- embedded API responses when the document is only a shell
- a normalized content hash for the extracted fields you publish downstream
The goal is not to worship HTTP headers. The goal is to ask the cheapest reliable freshness question before doing expensive work.
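For the last layer, a normalized content hash over the extracted fields might look like the sketch below. The function name `normalized_hash` and the normalization rules (sorted keys, collapsed whitespace in top-level string values) are assumptions chosen for illustration; the right rules depend on which differences count as a change for your downstream consumers.

```python
import hashlib
import json


def normalized_hash(fields: dict) -> str:
    """Hash extracted fields so that cosmetic differences do not register as changes.

    Only top-level string values are whitespace-normalized in this sketch.
    """
    normalized = {
        key: " ".join(value.split()) if isinstance(value, str) else value
        for key, value in fields.items()
    }
    # Sorted keys and fixed separators give a stable byte representation.
    canonical = json.dumps(
        normalized, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the hash is computed over published fields rather than raw HTML, template noise that churns a document-level ETag leaves it unchanged.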
The builder takeaway
Add validator storage to your crawl state before you add another retry path. Every URL should carry its last ETag, Last-Modified, content hash, fetch mode, and last successful parse time. Then make the next run prove that a render is necessary.
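A per-URL state row along those lines could be modeled as below. This is a sketch with hypothetical names: `CrawlState`, `can_probe`, and `last_parsed_at` are illustrative, while `render_mode` follows the field name used in the sample record earlier.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CrawlState:
    """Freshness metadata carried by every known URL (hypothetical schema)."""

    url: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    content_hash: Optional[str] = None
    render_mode: Optional[str] = None      # e.g. "api" or a browser mode
    last_parsed_at: Optional[str] = None   # time of the last successful parse

    def can_probe(self) -> bool:
        """True when a conditional request is possible before any heavier fetch."""
        return bool(self.etag or self.last_modified)


state = CrawlState(
    url="https://example.com/products/123",
    etag='"b20a0973b226eeea30362acb81f9e0b3"',
    render_mode="api",
)
row = asdict(state)  # plain dict, ready to serialize into the crawl store
```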
Browser-grade fetching still solves the hard cases: dynamic rendering, geo-sensitive results, sticky sessions, and bot-wall negotiation. Conditional requests make sure you reserve that power for pages that need it.
Sources: MDN on ETag, MDN on If-None-Match, MDN on 304 Not Modified, and RFC 9110 on conditional requests.