Better Fetch

Paul Crossland

Extract JSON-LD Before You Render

Check structured data before spending a browser render. JSON-LD can turn pages into cleaner extraction contracts.

A browser render should not be the default first step for every page your crawler visits. Many ecommerce, local business, job, recipe, and product pages already publish machine-readable facts in JSON-LD. Extract that contract first, then render only when the structured data is missing, stale, or incomplete.

The point is not to trust structured data blindly. The point is to make it a first-class branch in your fetch pipeline instead of treating every dynamic page as a screenshot, DOM, or AI-parsing problem.

JSON-LD is meant to be machine-readable

Google's structured data documentation describes structured data as a standardized format for providing information about a page and classifying page content. Google recommends JSON-LD where possible because it can be placed in a script tag and does not need to be interleaved with visible HTML.

That matches the web platform model. The HTML specification treats script type="application/ld+json" as a data block: the browser does not execute it as JavaScript, but tools can read the text and parse it as data. The JSON-LD 1.1 specification defines JSON-LD as a JSON-based format for linked data, and Schema.org's getting started guide shows the vocabulary that many sites use to describe products, organizations, events, articles, and local businesses.

For a fetch system, that means JSON-LD is not decoration. It is a possible extraction surface.

Put a structured-data probe near the front

A practical crawler should not jump straight from raw HTTP to full browser automation. Use a cheap structured-data probe first:

  1. Fetch the HTML response.
  2. Parse every script block with type="application/ld+json".
  3. Normalize arrays, @graph objects, and multiple blocks into one candidate list.
  4. Select the schema types your job understands, such as Product, Offer, LocalBusiness, Article, JobPosting, or BreadcrumbList.
  5. Compare the extracted fields against the fields your downstream system requires.
  6. Render the page only when the structured data is absent, inconsistent, or insufficient.
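Steps 2 through 6 can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the names `JsonLdParser`, `flatten_nodes`, `node_types`, and `probe_jsonld` are hypothetical, the HTML is assumed to have been fetched already, and "sufficient" is reduced to "every required key is present on a matching node".

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # None means we are not inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            script_type = (dict(attrs).get("type") or "").strip().lower()
            if script_type == "application/ld+json":
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def flatten_nodes(parsed):
    """Yield dict nodes, expanding arrays and @graph containers."""
    if isinstance(parsed, list):
        for item in parsed:
            yield from flatten_nodes(item)
    elif isinstance(parsed, dict):
        if isinstance(parsed.get("@graph"), list):
            yield from flatten_nodes(parsed["@graph"])
        else:
            yield parsed


def node_types(node):
    """Return @type as a list, since it may be a string or an array."""
    value = node.get("@type")
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return [t for t in value if isinstance(t, str)]
    return []


def probe_jsonld(html, wanted_types, required_fields):
    """Probe already-fetched HTML and report whether a render still looks necessary."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()

    candidates = []
    for raw in parser.blocks:
        try:
            candidates.extend(flatten_nodes(json.loads(raw)))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the whole page

    wanted = set(wanted_types)
    matches = [n for n in candidates if wanted & set(node_types(n))]
    complete = [n for n in matches if all(f in n for f in required_fields)]
    return {
        "jsonld_found": bool(candidates),
        "matches": matches,
        "render_required": not complete,
    }
```

Calling `probe_jsonld(html, {"Product"}, ["name", "offers"])` returns the matching nodes plus a `render_required` flag, so the caller can decide whether to schedule a browser pass. The consistency check from step 6 is deliberately left out here.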

This works especially well for monitoring tasks. If your product-pricing job only needs name, price, currency, availability, and canonical URL, a valid Product plus Offer block may be a cleaner input than a fragile CSS selector buried in a client-rendered page.
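As a rough sketch of that pricing case, the function below pulls the five fields from an already-parsed Product node. The helper names `first_offer` and `pricing_fields` are hypothetical. It assumes `offers` is a single Offer or a list of Offers (an AggregateOffer with `lowPrice` is not handled), and it uses the `url` property from the markup as a stand-in for the canonical URL, which a real job would more likely read from the page's canonical link.

```python
def first_offer(product):
    """Return the first Offer dict from a Product node, or {} if there is none."""
    offers = product.get("offers")
    if isinstance(offers, list):
        offers = offers[0] if offers else None
    return offers if isinstance(offers, dict) else {}


def pricing_fields(product):
    """Return (record, missing) for a Product node that has already been parsed."""
    offer = first_offer(product)

    availability = offer.get("availability")
    if isinstance(availability, str):
        # Availability is commonly a URL such as https://schema.org/InStock;
        # keep only the final path segment.
        availability = availability.rsplit("/", 1)[-1]

    record = {
        "name": product.get("name"),
        "price": offer.get("price"),
        "currency": offer.get("priceCurrency"),
        "availability": availability,
        "url": offer.get("url") or product.get("url"),
    }
    missing = [key for key, value in record.items() if value in (None, "")]
    return record, missing
```

A non-empty `missing` list is the signal that the structured data is insufficient for this job and a heavier fetch is worth paying for.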

Still validate it against the page

Structured data is not a guarantee that the visible page agrees. Google warns that structured data should represent the page content and follow its guidelines. Builders should treat it as a strong hint, not as verified truth.

The operational pattern is simple: extract JSON-LD, then verify enough of it against the page to know whether the source is usable for your purpose. For high-value fields, compare the structured value with the rendered text, canonical URL, or discovered API response. For low-risk fields, store provenance and confidence so downstream users can decide how strict to be.
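One way to check a high-value field such as price is to look for the structured value in the page's visible text. The sketch below is illustrative only: `VisibleText`, `visible_text`, and `price_matches_page` are hypothetical names, the number matching assumes English-style formatting (comma as thousands separator, dot as decimal point), and it works on whatever HTML you hand it, so a client-rendered price would need the rendered DOM rather than the raw response.

```python
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skipping = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skipping = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skipping = False

    def handle_data(self, data):
        if not self._skipping:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def price_matches_page(jsonld_price, html):
    """True if the structured price appears as a number in the visible text."""
    try:
        expected = round(float(jsonld_price), 2)
    except (TypeError, ValueError):
        return False
    # Assumption: numbers look like 1,299.00 or 24.99 in the page text.
    tokens = re.findall(r"\d[\d,]*(?:\.\d+)?", visible_text(html))
    return any(round(float(t.replace(",", "")), 2) == expected for t in tokens)
```

Because the JSON-LD block itself lives inside a script element, it is excluded from the visible text, so the structured value cannot trivially confirm itself.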

A useful extraction record includes:

| Field | Why it matters |
| --- | --- |
| source_url | The page that supplied the data |
| schema_type | The chosen object type, such as Product or LocalBusiness |
| jsonld_path | Which block or @graph node produced the value |
| validated_against_dom | Whether visible page text matched the structured value |
| render_required | Whether a browser was needed after the probe |
| mismatch_reason | Why structured data was rejected or downgraded |
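A record with those fields could be modeled as a small dataclass. The class name `ExtractionRecord`, the field types, and the `jsonld_path` string format shown in the example are assumptions made for illustration; the article only names the fields.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExtractionRecord:
    source_url: str                 # the page that supplied the data
    schema_type: str                # e.g. "Product" or "LocalBusiness"
    jsonld_path: str                # which block or @graph node produced the value
    validated_against_dom: bool     # did visible text match the structured value?
    render_required: bool           # was a browser needed after the probe?
    mismatch_reason: Optional[str] = None  # why the data was rejected or downgraded


record = ExtractionRecord(
    source_url="https://example.com/kettle",
    schema_type="Product",
    jsonld_path="script[0].@graph[2]",  # hypothetical path notation
    validated_against_dom=False,
    render_required=True,
    mismatch_reason="price in JSON-LD differs from visible price",
)

# asdict() gives a plain dict that can be logged or stored as JSON.
row = asdict(record)
```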

That record is much easier to debug than a single opaque "scrape failed" error.

Use rendering for the hard parts

Browser-grade fetching still matters. Some sites hydrate prices after page load, hide availability behind region or session state, publish incomplete schema, or update the UI faster than their structured data. Others expose the cleanest source through an internal JSON endpoint discovered from network traffic.

The difference is sequencing. If JSON-LD answers the question, do not spend a browser session. If it gives you a partial answer, use it to guide the browser pass. If it disagrees with the rendered page, log the mismatch and prefer the source that matches your product requirements.
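Those three cases can be written as a small decision function. This is a sketch under simple assumptions: `plan_next_step` and its action names are hypothetical, fields are compared with plain equality, and choosing which source wins after a mismatch is left to the caller.

```python
def plan_next_step(required, structured, rendered=None):
    """Return (action, fields) for one page.

    required:   field names the job needs
    structured: values taken from JSON-LD
    rendered:   values from a browser pass, or None if no render has run yet
    """
    missing = [f for f in required if structured.get(f) in (None, "")]

    if rendered is None:
        if missing:
            # Partial answer: tell the browser pass which fields to look for.
            return "render", missing
        # JSON-LD answers the question: no browser session needed.
        return "accept_jsonld", []

    # A render has run: look for fields where the two sources disagree.
    conflicts = [
        f for f in required
        if f not in missing and f in rendered and rendered[f] != structured[f]
    ]
    if conflicts:
        return "log_mismatch", conflicts
    return "accept_merged", missing
```

For `accept_merged`, the returned list names the fields that had to come from the render rather than from the structured data.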

A better pipeline looks like this:

  1. Try a normal HTML fetch and parse structured data.
  2. If required fields are present, validate a small sample against the visible page or cached render.
  3. If fields are missing, run API discovery or a rendered fetch.
  4. If the page is region- or session-sensitive, repeat the probe inside the same region and session used for rendering.
  5. Store both the structured-data result and the fetch mode that produced the accepted record.
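The pipeline above might be wired together roughly as follows. Everything here is a hypothetical shape rather than a real API: `fetch_html`, `extract_fields`, `validate`, and `render_page` are callables you would supply, `render_page` stands in for either API discovery or a rendered fetch, and `context` is an assumed dict carrying region and session settings. Letting rendered values override structured ones on fallback is one possible policy, not the only one.

```python
def run_pipeline(url, required, fetch_html, extract_fields, validate,
                 render_page, context=None):
    """Run the cheap path first and fall back to the heavier one.

    fetch_html(url, context)    -> HTML string
    extract_fields(html)        -> dict of fields taken from JSON-LD
    validate(structured, html)  -> bool, a spot check against the page
    render_page(url, context)   -> dict of fields from a render or API call
    """
    # Steps 1 and 4: plain fetch, using the same region/session as any render.
    html = fetch_html(url, context)
    structured = extract_fields(html)
    missing = [f for f in required if structured.get(f) in (None, "")]

    # Step 2: accept the cheap path only if fields are present and validated.
    if not missing and validate(structured, html):
        return {
            "url": url,
            "fetch_mode": "jsonld",
            "accepted": dict(structured),
            "structured": structured,
        }

    # Step 3: heavier path. Assumption: rendered values win over structured ones.
    rendered = render_page(url, context)
    accepted = {**structured, **rendered}

    # Step 5: keep the structured result next to the mode that produced the record.
    return {
        "url": url,
        "fetch_mode": "render",
        "accepted": accepted,
        "structured": structured,
    }
```

Passing the same `context` to both callables is what keeps the probe and the render comparable on region- or session-sensitive pages.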

The builder takeaway

Treat JSON-LD as a cheap contract check before you pay for a render. It will not replace browser automation, bot-wall handling, or API discovery, but it can tell you when those heavier paths are unnecessary.

For production crawlers, add jsonld_found, schema_types, required_fields_present, validated_against_dom, and render_required to your crawl state. Over time, those fields show which sites are structured-data friendly, which ones need browser sessions, and which ones are publishing markup you should not trust.
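To see those patterns per site, the crawl state can be rolled up by hostname. The sketch below assumes each state row is a dict that also carries a `url` key, which the field list above does not include; `summarize_by_host` and the `dom_mismatch` counter are hypothetical names, and `schema_types` is left out of the rollup for brevity.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def summarize_by_host(states):
    """Aggregate per-page crawl state into per-host counters."""
    totals = defaultdict(lambda: {
        "pages": 0,
        "jsonld_found": 0,
        "render_required": 0,
        "dom_mismatch": 0,
    })
    for state in states:
        host = urlsplit(state["url"]).hostname
        counters = totals[host]
        counters["pages"] += 1
        counters["jsonld_found"] += int(bool(state["jsonld_found"]))
        counters["render_required"] += int(bool(state["render_required"]))
        # Markup that looked complete but failed the DOM check is the
        # "should not trust" signal.
        if (state["jsonld_found"] and state["required_fields_present"]
                and not state["validated_against_dom"]):
            counters["dom_mismatch"] += 1
    return dict(totals)
```

Hosts with a high `jsonld_found` count and few renders are the structured-data friendly ones, while a rising `dom_mismatch` count flags markup worth downgrading.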

Sources: Google Search Central on structured data, JSON-LD 1.1, Schema.org getting started, and the HTML specification's script element.