<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Technical SEO News</title><description>Practitioner-grade technical SEO news, guides, and analysis.</description><link>https://technicalseonews.com</link><language>en-us</language><item><title>Lighthouse 13.3 adds agentic browsing audits for AI agents</title><link>https://technicalseonews.com/latest/lighthouse-13-3-adds-agentic-browsing-audits-for-ai-agents</link><guid isPermaLink="true">https://technicalseonews.com/latest/lighthouse-13-3-adds-agentic-browsing-audits-for-ai-agents</guid><description>Lighthouse 13.3 ships four new checks including accessibility tree well-formedness, WebMCP validation, llms.txt compliance, and CLS for AI agent contexts.</description><pubDate>Sun, 10 May 2026 14:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.debugbear.com/blog/lighthouse-agentic-browsing&quot;&gt;Lighthouse 13.3&lt;/a&gt; shipped this week with a new &quot;Agentic Browsing&quot; category that audits how well websites work with AI agents. The category is still marked &quot;under development,&quot; but it runs checks grouped into four areas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Accessibility tree well-formedness.&lt;/strong&gt; Verifies that the page&apos;s accessibility tree is properly structured. The accessibility tree is a browser-native representation of the DOM that exposes roles, names, and states to assistive technologies and AI agents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WebMCP validation.&lt;/strong&gt; Checks whether HTML forms are annotated with WebMCP metadata, a declarative API that lets websites expose specific commands for agents to use within the visitor&apos;s browser session.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;llms.txt compliance.&lt;/strong&gt; Looks for an llms.txt file and flags it if the file is missing an H1 header, is too short, or contains no links.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Layout shift detection.&lt;/strong&gt; Surfaces existing CLS data in the agentic context, since agents taking screenshots can be confused by shifting content.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The category draws partly from existing Lighthouse data (accessibility and CLS) and partly from new audits (WebMCP and llms.txt). Google also published a &lt;a href=&quot;https://web.dev/articles/ai-agent-site-ux&quot;&gt;guide on building agent-friendly websites&lt;/a&gt;, which we &lt;a href=&quot;/latest/google-tells-developers-to-build-websites-for-ai-agents&quot;&gt;covered separately&lt;/a&gt;. Several of these audits overlap with that guide&apos;s recommendations.&lt;/p&gt;
&lt;p&gt;You won&apos;t fail the category just because you haven&apos;t added AI-specific features. The source notes that example.com scores a green 2/2.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The agentic browsing category creates a diagnostic layer alongside traditional SEO and accessibility audits. Sites scoring well on existing checks can still fail here if their semantic structure doesn&apos;t meet stricter agent expectations.&lt;/p&gt;
&lt;p&gt;Accessibility tree quality is where human-focused and agent-focused audits diverge most. Using &lt;code&gt;aria-label&lt;/code&gt; on a &lt;code&gt;div&lt;/code&gt; without an explicit &lt;code&gt;role&lt;/code&gt; is an ARIA spec violation that screen readers silently ignore. The agent audit catches these. A clean WCAG AA report is not equivalent to a well-formed accessibility tree.&lt;/p&gt;
&lt;p&gt;JavaScript-heavy sites face particular risk. Google&apos;s guide notes that agents interact with pages through screenshots, HTML, and the accessibility tree. On SPA frameworks like React or Next.js, forms render via hydration. If the accessibility tree snapshot captures pre-hydration state, form annotations appear missing.&lt;/p&gt;
&lt;p&gt;WebMCP is the most unfamiliar check. Unlike general MCP servers, WebMCP targets front-end interactions within the visitor&apos;s browser session. The DebugBear writeup mentions a programmatic &lt;code&gt;navigator.modelContext.registerTool&lt;/code&gt; surface alongside a declarative form-annotation pattern. WebMCP is not a ratified W3C or WHATWG standard and ships in no browser today. Treat it as a moving target. The audit checks whether forms carry annotations matching the expected schema and surfaces any registered tools. ARIA attributes are not a substitute for WebMCP metadata.&lt;/p&gt;
&lt;p&gt;The source explicitly notes llms.txt &quot;is not widely used by AI tools currently.&quot; Be aware of the spec, but don&apos;t over-invest in a standard with minimal agent adoption today. &lt;a href=&quot;/latest/semrush-launches-ai-agent-readiness-audits-for-technical-seo&quot;&gt;Semrush launched similar agent-readiness audits&lt;/a&gt; in its Site Audit tool, suggesting this class of checks is becoming standard across the industry.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Run Lighthouse 13.3 locally.&lt;/strong&gt; PageSpeed Insights and Chrome DevTools still run an older Lighthouse version. DebugBear expects them to update in the coming months. Running the CLI now lets you audit immediately:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npm install -g lighthouse@latest
lighthouse --view https://your-site.com/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check the agentic browsing category in the report output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit your accessibility tree separately from WCAG compliance.&lt;/strong&gt; The &lt;a href=&quot;https://www.w3.org/WAI/ARIA/apg/&quot;&gt;ARIA Authoring Practices Guide&lt;/a&gt; defines what &quot;well-formed&quot; means. Check role attributes on container elements and nesting hierarchies. Sites using Shadow DOM or web components should verify their accessibility trees don&apos;t fragment across boundaries.&lt;/p&gt;
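&lt;p&gt;A crude command-line first pass can flag the &lt;code&gt;aria-label&lt;/code&gt;-on-a-&lt;code&gt;div&lt;/code&gt;-without-&lt;code&gt;role&lt;/code&gt; pattern described above. This is a rough heuristic only, it assumes the markup is server-rendered, and it is no substitute for inspecting the accessibility tree in Chrome DevTools:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Extract div/span opening tags that carry aria-label but no role attribute (rough heuristic)
curl -s &quot;https://your-site.com/&quot; \
  | grep -oE &quot;&amp;lt;(div|span)[^&amp;gt;]*aria-label[^&amp;gt;]*&quot; \
  | grep -v &quot;role=&quot;
&lt;/code&gt;&lt;/pre&gt;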
&lt;p&gt;&lt;strong&gt;Review WebMCP form annotations if you have interactive forms.&lt;/strong&gt; The spec is experimental and subject to change, so check the latest version before implementing. If your site has agent-facing forms (search, filters, checkout), annotate those first. Lighthouse checks both form coverage and schema validity separately.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t rush to create an llms.txt file just for the audit score.&lt;/strong&gt; If you have one, make sure it includes an H1 header, meaningful content, and links. Those are the three things the audit checks. If you don&apos;t have one, wait for evidence that AI tools actually parse it.&lt;/p&gt;
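&lt;p&gt;If you already publish an llms.txt, a quick shell check covers those same three points. A minimal sketch: the 200-byte cutoff is an arbitrary placeholder, since Lighthouse&apos;s actual &quot;too short&quot; threshold isn&apos;t documented:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BODY=$(curl -sf &quot;https://your-site.com/llms.txt&quot;) || echo &quot;no llms.txt found&quot;
# 1. First line should be a markdown H1
echo &quot;$BODY&quot; | head -n 1 | grep -q &quot;^# &quot; || echo &quot;missing H1 header&quot;
# 2. Flag a very short file (placeholder threshold)
if [ &quot;$(echo &quot;$BODY&quot; | wc -c)&quot; -lt 200 ]; then echo &quot;file looks very short&quot;; fi
# 3. At least one markdown link
echo &quot;$BODY&quot; | grep -q &quot;](http&quot; || echo &quot;no links found&quot;
&lt;/code&gt;&lt;/pre&gt;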
&lt;p&gt;&lt;strong&gt;Check CLS in the agentic context.&lt;/strong&gt; Agents take screenshots and read the accessibility tree at page load. If content shifts after that snapshot, the agent works from stale layout data. Sites passing the 0.1 CLS threshold for human visitors may still trip agents that don&apos;t wait for visual stability.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Hydration timing on JS frameworks.&lt;/strong&gt; Test with and without JavaScript to see what the pre-hydration accessibility tree looks like. Forms rendered via client-side hydration may appear missing in the snapshot.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;WebMCP and ARIA are separate concerns.&lt;/strong&gt; Passing accessibility audits does not mean passing agentic browsing audits. WebMCP metadata uses its own schema independent of ARIA labels.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The &quot;under development&quot; label matters.&lt;/strong&gt; Don&apos;t refactor heavily based on current audit criteria. Monitor &lt;a href=&quot;https://developer.chrome.com/docs/lighthouse&quot;&gt;Lighthouse documentation&lt;/a&gt; for spec changes.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/lighthouse-13-3-adds-agentic-browsing-audits-for-ai-agents.webp" medium="image" type="image/webp"/></item><item><title>Why IA migrations break without a staging crawl</title><link>https://technicalseonews.com/latest/why-ia-migrations-break-without-a-staging-crawl</link><guid isPermaLink="true">https://technicalseonews.com/latest/why-ia-migrations-break-without-a-staging-crawl</guid><description>Redirect mapping misses consolidation relevance gaps, platform canonical conflicts, and orphaned navigation paths. A staging crawl catches these before Googlebot does.</description><pubDate>Sun, 10 May 2026 12:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A practitioner in &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1t8kqs5/when_do_you_decide_a_site_migration_needs_a_full/&quot;&gt;r/TechSEO asked when a site migration justifies a full staging environment&lt;/a&gt; versus simply mapping redirects and going live. The scenario: a mid-sized ecommerce site moving from a custom CMS to Shopify. The domain stays the same, but the URL structure, navigation paths, and information architecture are all changing. Some content is being consolidated or removed entirely.&lt;/p&gt;
&lt;p&gt;One commenter summarized the general consensus: &quot;For any site with decent traffic I&apos;d go the thorough route. It doesn&apos;t actually take that much extra time.&quot;&lt;/p&gt;
&lt;p&gt;The question itself is worth unpacking because the answer isn&apos;t binary. Redirect mapping alone works fine for simple migrations where URLs change but IA stays intact. The complexity threshold shifts when content consolidation, navigation restructuring, and platform-specific URL behaviors enter the picture.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The risk with a &quot;go live and monitor&quot; strategy is that monitoring is reactive. By the time Google Search Console reports crawl anomalies or ranking drops, the damage may already be weeks old. GSC data lags by a variable and unspecified amount, and monitoring catches problems only after they reach production. A staging crawl catches problems before they get there.&lt;/p&gt;
&lt;p&gt;Three specific failure modes make staging valuable for IA-heavy migrations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Redirect chains that don&apos;t surface in logs.&lt;/strong&gt; When old URLs redirect to intermediate URLs that then redirect again to final destinations, the chain resolves correctly in a browser and Google will follow it. Google has confirmed that redirect chains pass signals without loss. The problem is operational: chains are harder to audit, harder to maintain, and mask mapping errors where an intermediate URL was supposed to go somewhere else entirely. Staging lets you crawl the full redirect topology and catch A-B-C chains before launch, when they&apos;re still cheap to flatten (a quick curl spot-check follows this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Internal links pointing to removed or consolidated pages.&lt;/strong&gt; Content consolidation means some old URLs redirect to pages covering a broader topic. The redirect returns a 200 at the final destination, so it won&apos;t show up as a 404. But Google treats a redirect as a signal of equivalence and may assess whether the destination is a true equivalent. For low-relevance redirects, signals may not transfer at all, not just transfer at reduced strength.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Platform-specific canonical behavior.&lt;/strong&gt; Based on practitioner experience, Shopify generates canonical tags for product variants and collection URLs in ways that can conflict with your redirect rules. A carefully mapped redirect can land on a page where Shopify&apos;s auto-generated canonical tag points somewhere unexpected. You won&apos;t see that conflict in a redirect spreadsheet.&lt;/li&gt;
&lt;/ul&gt;
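&lt;p&gt;The chain check is easy to script against a sample of legacy URLs before the full staging crawl. A minimal sketch, assuming a file with one old URL per line (the filename is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Follow each legacy URL and report hop count and final destination
cat legacy-urls.txt | while read -r url; do
  echo -n &quot;$url  &quot;
  curl -s -o /dev/null -L -w &quot;hops=%{num_redirects}  final=%{url_effective}\n&quot; &quot;$url&quot;
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Anything reporting more than one hop is a chain worth flattening in the redirect map before launch.&lt;/p&gt;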
&lt;p&gt;&lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/site-move-with-url-changes&quot;&gt;Google&apos;s migration documentation&lt;/a&gt; covers planning steps for moves involving URL changes but doesn&apos;t draw a hard line on when staging becomes necessary. The practical threshold depends on how much of the site&apos;s link topology is changing, not just how many URLs are moving.&lt;/p&gt;
&lt;p&gt;For ecommerce sites with faceted navigation, the risk multiplies. Old facet URLs may redirect to filtered collection pages on Shopify. Note that Shopify&apos;s actual filtered collection URL structure varies depending on the theme and any installed search or filter apps. Confirm the URL patterns your specific Shopify implementation will produce before building redirect mappings from legacy facet URLs. If some facet combinations no longer exist, those redirects land on filtered pages with thin content. Googlebot following those chains produces nothing useful.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Decide based on IA change, not URL count.&lt;/strong&gt; If only URLs are changing and the site structure stays the same, redirect mapping with post-launch monitoring is reasonable. If navigation paths, content hierarchy, or page consolidation patterns are changing, staging is worth the investment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Crawl both environments simultaneously.&lt;/strong&gt; Use Screaming Frog or Sitebulb to crawl the staging site and compare it against a crawl of the current production site. Enable JavaScript rendering in the crawler — Screaming Frog defaults to static HTML mode, which will miss canonical tags and internal links injected by Shopify themes and apps. Look for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Redirect chains longer than one hop&lt;/li&gt;
&lt;li&gt;Internal links on the new site that point to old URLs&lt;/li&gt;
&lt;li&gt;Pages where Shopify&apos;s auto-generated canonical tag conflicts with your redirect destination, or points to a collection or variant URL you intended to consolidate&lt;/li&gt;
&lt;li&gt;Orphaned pages that exist on staging but have no internal links pointing to them&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Test Shopify&apos;s canonical and redirect behavior specifically.&lt;/strong&gt; Shopify generates canonical tags for collections and product variants in ways that may conflict with your redirect rules. On staging, visit product variant URLs directly and check the rendered canonical tag. Compare it against what your redirect map expects. Google &lt;a href=&quot;/latest/mueller-lists-nine-reasons-google-overrides-your-rel-canonical&quot;&gt;already overrides canonicals for several reasons&lt;/a&gt;; a platform-generated mismatch adds another.&lt;/p&gt;
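&lt;p&gt;As a first pass, you can pull the server-rendered canonical for a variant URL with curl and compare it against the redirect map. This only sees the initial HTML, so JavaScript-injected or app-modified canonicals still need the rendered crawl described under &quot;Watch out for&quot; below (the URL is a placeholder, and the grep assumes &lt;code&gt;rel&lt;/code&gt; precedes &lt;code&gt;href&lt;/code&gt; in the tag):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Canonical present in the raw server response for a product variant URL
curl -s &quot;https://staging-store.example.com/products/sample-product?variant=12345&quot; \
  | grep -o &apos;rel=&quot;canonical&quot; href=&quot;[^&quot;]*&quot;&apos;
&lt;/code&gt;&lt;/pre&gt;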
&lt;p&gt;&lt;strong&gt;Validate content consolidation targets.&lt;/strong&gt; For every old URL that redirects to a consolidated page, check whether the destination page actually covers the topic the old page ranked for. If the destination is a broad category page and the old page was a specific product or article, Google may not transfer ranking signals at all. Before redirecting any high-traffic or high-backlink page to a broader category page, evaluate whether that page should simply be retained or redirected to a closer equivalent. Consolidation redirects make sense for truly redundant content, not for unique pages with their own ranking signals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Set your monitoring baseline before launch.&lt;/strong&gt; Export your current crawl stats, indexed page count, and top landing pages from GSC. After launch, compare against these baselines daily for the first month, then weekly for three to six months. IA migrations surface problems on a longer tail than simple URL moves: consolidation-related ranking loss often doesn&apos;t appear until Google re-evaluates page quality signals weeks after the initial recrawl. &lt;a href=&quot;/latest/migration-traffic-drops-need-pre-defined-thresholds-not-panic&quot;&gt;Pre-define traffic drop thresholds&lt;/a&gt; so you know what counts as normal fluctuation versus a real problem. Pay special attention to pages marked &quot;Crawled, currently not indexed&quot;. This status can stem from duplicate content, soft 404s, canonical conflicts, quality signals, or Google&apos;s crawl queue prioritization, and should be investigated case by case.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Staging environments that don&apos;t match production.&lt;/strong&gt; If your staging site lacks JavaScript rendering, lazy-loading behavior, or CMS plugins present on the live site, your staging crawl results won&apos;t match what Googlebot encounters. Make sure the staging environment mirrors the production Shopify configuration as closely as possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JavaScript-injected canonical tags.&lt;/strong&gt; When using Screaming Frog or Sitebulb to validate canonical tags on Shopify staging, enable JavaScript rendering in the crawler settings. Shopify themes and apps can inject or modify canonical tags via JavaScript. A default non-JS crawl will miss these conflicts entirely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&quot;Successful&quot; redirects masking relevance loss.&lt;/strong&gt; A redirect that returns a 200 at the final destination looks clean in every audit tool. But if the destination page doesn&apos;t match the search intent of the old page, Google may not transfer signals at all, and rankings will erode over weeks. No redirect audit tool flags semantic mismatches. You need to review consolidated redirect targets manually, at least for your top-traffic pages.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/why-ia-migrations-break-without-a-staging-crawl.webp" medium="image" type="image/webp"/></item><item><title>Pre-Launch Website Audit Skill</title><link>https://technicalseonews.com/skills/pre-launch-audit</link><guid isPermaLink="true">https://technicalseonews.com/skills/pre-launch-audit</guid><description>A Claude Code skill that runs 5 parallel sub-audits before launch: technical SEO, AI accessibility, security, performance, and on-page. Detects your stack, uses whatever tools you have, degrades gracefully.</description><pubDate>Sat, 09 May 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;h2&gt;What it does&lt;/h2&gt;
&lt;p&gt;This skill turns Claude Code into a website auditor. Point it at any URL and it runs 5 sub-audits: technical SEO, AI accessibility, security, performance, and on-page SEO. It was built for pre-launch checks but works just as well on live sites.&lt;/p&gt;
&lt;p&gt;A site that launches well is one where all the parts work together. Good performance means nothing if search engines can&apos;t crawl the site. Solid SEO means nothing if the site leaks API keys. Clean security headers mean nothing if they break how Googlebot renders the page.&lt;/p&gt;
&lt;p&gt;The 5 sub-audits aren&apos;t independent checklists. They cross-reference each other, deduplicate findings, and surface shared root causes. One fix in the right place often resolves issues across multiple audits.&lt;/p&gt;
&lt;p&gt;The output is a prioritized report: P0 launch blockers (fix or don&apos;t ship), P1 launch-day items, P2/P3 backlog. Every finding includes what breaks if you ignore it, the exact file or config to change, and a command to verify the fix worked.
&lt;/p&gt;
&lt;h2&gt;Stack detection&lt;/h2&gt;
&lt;p&gt;Before running any checks, the skill fingerprints your tech stack using HTTP headers, HTML signatures, DNS records, and JavaScript bundle paths. It then tailors every check to your specific framework and hosting setup.&lt;/p&gt;
&lt;p&gt;Detection covers Next.js, Nuxt, Astro, SvelteKit, WordPress, Shopify, Webflow, Framer, Wix, Squarespace, Hugo, Jekyll, Eleventy, Drupal, and AI-generated apps (Lovable, Bolt, Base44, Replit). Each component gets a confidence score (HIGH/MEDIUM/LOW). You can correct the profile before the audit proceeds.&lt;/p&gt;
&lt;p&gt;A Next.js site on Vercel gets ISR cache validation and &lt;code&gt;NEXT_PUBLIC_*&lt;/code&gt; env var audits. A WordPress site gets plugin CVE checks and username enumeration detection. A vibe-coded Lovable app gets Supabase RLS probing and exposed endpoint sweeps.
&lt;/p&gt;
&lt;h2&gt;The 5 sub-audits&lt;/h2&gt;
&lt;p&gt;Security, AI accessibility, and performance start immediately and run in parallel. Technical SEO and on-page wait for the Screaming Frog crawl to finish (if SF is available). If SF isn&apos;t installed, all 5 run in parallel using fallback tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI Accessibility.&lt;/strong&gt; Can AI search engines (ChatGPT Search, Perplexity, Google AI Overviews) see and cite your content? Checks &lt;code&gt;robots.txt&lt;/code&gt; bot policies, &lt;code&gt;llms.txt&lt;/code&gt; presence, &lt;code&gt;ai-agent.json&lt;/code&gt;, and Cloudflare Bot Fight Mode conflicts (a common invisible killer that blocks AI crawlers without any visible error). Also runs cloaking detection by sending requests as Googlebot, GPTBot, and a normal browser to compare responses.&lt;/p&gt;
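&lt;p&gt;The cloaking comparison is straightforward to reproduce by hand. A minimal sketch that compares status code and response size across user agents (abbreviated UA strings; substitute the full published tokens, and expect the skill&apos;s own comparison to go deeper than byte counts):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;URL=&quot;https://your-site.com/&quot;
for ua in &quot;Googlebot/2.1&quot; &quot;GPTBot&quot; &quot;Mozilla/5.0&quot;; do
  code=$(curl -s -o /dev/null -A &quot;$ua&quot; -w &quot;%{http_code}&quot; &quot;$URL&quot;)
  bytes=$(curl -s -A &quot;$ua&quot; &quot;$URL&quot; | wc -c)
  echo &quot;$ua: status $code, $bytes bytes&quot;
done
&lt;/code&gt;&lt;/pre&gt;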
&lt;p&gt;&lt;strong&gt;Technical SEO.&lt;/strong&gt; Can search engines find and index your pages? Checks for indexation blockers, JS rendering issues, broken redirects, canonical conflicts, sitemap validation, structured data, internal linking, and the staging hostname leak check (hardcoded &lt;code&gt;staging.&lt;/code&gt; or &lt;code&gt;dev.&lt;/code&gt; URLs that would ship to production). This is the sub-audit where Screaming Frog adds the most value, because it crawls the full site rather than spot-checking individual URLs.&lt;/p&gt;
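&lt;p&gt;The staging hostname leak is another check that&apos;s easy to spot-check per page, assuming your pre-production hosts follow the common &lt;code&gt;staging.&lt;/code&gt;/&lt;code&gt;dev.&lt;/code&gt; subdomain pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# List hardcoded staging/dev hostnames in the served HTML (adjust the pattern to your hosts)
curl -s &quot;https://your-site.com/&quot; | grep -oE &quot;https?://(staging|dev)\.[a-z0-9.-]+&quot; | sort -u
&lt;/code&gt;&lt;/pre&gt;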
&lt;p&gt;&lt;strong&gt;On-Page SEO.&lt;/strong&gt; Is the content structured for search visibility? Title and meta description coverage, H1 structure, OG tags, image alt text, content quality signals (lorem ipsum detection, soft 404 pages returning 200 status), and faceted URL parameter sprawl.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance.&lt;/strong&gt; Will the site be fast for real users? Core Web Vitals via Lighthouse, bundle size analysis, image optimization, caching headers, font loading strategy, render-blocking resources, and third-party script impact. Checks both mobile and desktop where tools allow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Security.&lt;/strong&gt; Are there exposed secrets, missing headers, or known vulnerabilities? Checks transport security (HSTS, TLS cert validity, DMARC), security headers (CSP, X-Frame-Options, Permissions-Policy), exposed secrets in HTML and JS bundles, known framework CVEs, and the vibe-coding checklist (Supabase anon-vs-service-key, Firebase rules, GraphQL playground exposure, unprotected API routes). This is pre-launch hygiene, not a penetration test.
&lt;/p&gt;
&lt;h2&gt;Tools and costs&lt;/h2&gt;
&lt;p&gt;The skill probes for available tools at startup and tells you what it found before running. It works with whatever you have. At minimum, it needs bash.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;What it adds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;bash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free, always available&lt;/td&gt;
&lt;td&gt;HTTP headers, DNS, TLS, HTML inspection, robots.txt, sitemap, secret scanning. The baseline for every sub-audit.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Playwright MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free, ships with Claude Code&lt;/td&gt;
&lt;td&gt;Browser automation, rendered DOM snapshots, JavaScript execution checks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chrome DevTools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Accessibility tree, Lighthouse audits, console error monitoring, network analysis.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Screaming Frog MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Paid (SF license, free tier covers 500 URLs)&lt;/td&gt;
&lt;td&gt;Full site crawl with custom extractions and searches. The deepest crawl data you can get. Requires &lt;a href=&quot;https://www.screamingfrog.co.uk/&quot;&gt;Screaming Frog SEO Spider&lt;/a&gt; installed locally.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DataForSEO MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Paid (usage-based API)&lt;/td&gt;
&lt;td&gt;Technology detection, Lighthouse API, AI search volume data. Supplements other tools but can be replaced.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Without any paid tools&lt;/strong&gt;, Playwright, Chrome DevTools, and bash cover all 5 sub-audits. Screaming Frog is the biggest upgrade for technical SEO (bulk crawl data vs. spot checks). DataForSEO adds breadth to tech detection but is the most replaceable.
&lt;/p&gt;
&lt;h2&gt;Stack-specific checks&lt;/h2&gt;
&lt;p&gt;After detecting your framework, the skill injects targeted checks into the relevant sub-audits:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;What gets checked&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Next.js / Vercel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ISR cache behavior, &lt;code&gt;NEXT_PUBLIC_*&lt;/code&gt; env var exposure, server action auth, source maps in production, &lt;code&gt;/api/*&lt;/code&gt; endpoint security, RSC rendered-DOM gaps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WordPress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yoast/RankMath config, user enumeration via &lt;code&gt;/wp-json/wp/v2/users&lt;/code&gt;, plugin CVE check, &lt;code&gt;wp-config.php&lt;/code&gt; exposure, &lt;code&gt;xmlrpc.php&lt;/code&gt; open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shopify&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/products.json&lt;/code&gt; data exposure, &lt;code&gt;?variant=&lt;/code&gt; faceted URL sprawl, Liquid rendering check, app-injected script performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nuxt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server route auth (&lt;code&gt;server/api/&lt;/code&gt; is public by default), &lt;code&gt;useAsyncData&lt;/code&gt; data leaks in client bundle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SvelteKit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CSRF origin checking, loader data serialization issues, &lt;code&gt;+server.ts&lt;/code&gt; auth gaps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Astro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;set:html&lt;/code&gt; XSS risk, SSR mode attack surface, middleware header configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Webflow / Framer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Client-side rendering gaps, Cloudflare Bot Fight Mode blocking AI crawlers, redirect manager coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vibe-coded (Lovable/Bolt/Base44)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supabase RLS probe (anon key vs service_role key), IDOR sweep, GraphQL playground exposure, AI endpoint rate limiting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Wix / Squarespace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Platform ceiling flags: the skill notes which findings are unfixable on managed platforms so you don&apos;t waste time chasing them&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
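&lt;p&gt;Two of the checks in the table are simple to reproduce manually. A sketch using curl against the documented public endpoint paths (swap in your own hostnames):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Shopify: is the product catalog exposed as JSON?
curl -s -o /dev/null -w &quot;/products.json: %{http_code}\n&quot; &quot;https://your-store.com/products.json&quot;
# WordPress: can usernames be enumerated via the REST API?
curl -s -o /dev/null -w &quot;/wp-json/wp/v2/users: %{http_code}\n&quot; &quot;https://your-site.com/wp-json/wp/v2/users&quot;
&lt;/code&gt;&lt;/pre&gt;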
&lt;h2&gt;Pre-launch block handling&lt;/h2&gt;
&lt;p&gt;Most pre-launch audits trip over the fact that the site isn&apos;t live yet. Robots.txt blocks everything, noindex is on every page, and Screaming Frog reports zero indexable URLs. A naive audit flags all of this as broken.&lt;/p&gt;
&lt;p&gt;This skill classifies blocks by scope instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sitewide blocks&lt;/strong&gt; (robots.txt &lt;code&gt;Disallow: /&lt;/code&gt;, global noindex): Expected. Flagged as P0 launch-day checklist items with the production replacement config, not treated as current bugs (a quick curl check of these patterns follows this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Section blocks&lt;/strong&gt; (&lt;code&gt;/admin/&lt;/code&gt;, &lt;code&gt;/api/&lt;/code&gt;, &lt;code&gt;/draft/&lt;/code&gt; blocked): The skill asks whether these should stay blocked in production. Usually yes for admin and API routes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page-specific blocks&lt;/strong&gt; (individual noindex, odd canonicals): These get closer inspection. A page in the sitemap with noindex is always a conflict, staging or not.&lt;/li&gt;
&lt;/ul&gt;
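&lt;p&gt;For a manual version of that classification, fetch the staging robots.txt and grep for the two patterns that matter most (a sketch; the hostname and section paths are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -s &quot;https://staging.your-site.com/robots.txt&quot; -o /tmp/staging-robots.txt
# Sitewide block: expected pre-launch, swap at go-live
grep -i &quot;^Disallow: /$&quot; /tmp/staging-robots.txt
# Section blocks that may need to stay in production
grep -iE &quot;^Disallow: /(admin|api|draft)&quot; /tmp/staging-robots.txt
&lt;/code&gt;&lt;/pre&gt;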
&lt;p&gt;The audit focuses on what &lt;em&gt;would&lt;/em&gt; happen after blocks are removed. Are canonicals correct? Are there redirect chains? Is structured data valid? Is the production robots.txt ready to deploy?
&lt;/p&gt;
&lt;h2&gt;Customizing the skill&lt;/h2&gt;
&lt;p&gt;The skill is a set of markdown files that Claude reads as instructions. You can modify any of them by asking Claude directly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Swap tools.&lt;/strong&gt; &quot;Replace DataForSEO with Ahrefs MCP in the pre-launch audit.&quot; Claude will update the tool probing, fallback tables, and the relevant playbook sections to use Ahrefs API calls instead. Same approach works for swapping Chrome DevTools for Playwright, or adding any MCP tool Claude has access to.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Add framework checks.&lt;/strong&gt; &quot;Add Remix-specific checks to the pre-launch audit.&quot; Claude will add a new entry to the stack detection table and create the corresponding security and SEO checks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Change severity rules.&lt;/strong&gt; &quot;Make missing OG images a P1 instead of P2.&quot; Claude will update the severity classification in the on-page playbook.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Modify the report.&lt;/strong&gt; &quot;Add a WCAG 2.1 AA compliance section to the audit report.&quot; Claude will extend the report template and add relevant checks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Add custom searches.&lt;/strong&gt; The Screaming Frog crawl includes regex searches for staging hostname leaks, lorem ipsum, hardcoded HTTP, debug statements, and more. Ask Claude to add patterns specific to your codebase or CMS.&lt;/p&gt;
&lt;p&gt;The skill files live in your project after installation. Changes are local to you and don&apos;t affect the upstream repo. If you break something, reinstall from the marketplace to reset.
&lt;/p&gt;
&lt;h2&gt;What&apos;s in the repo&lt;/h2&gt;
&lt;p&gt;The skill is organized as a main orchestrator file with 5 audit playbooks and 5 reference docs that load on demand:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;skills/pre-launch-audit/
  SKILL.md                         # Main orchestrator: phases, stack detection, report format
  audits/
    technical-seo.md               # Crawl analysis, indexation, redirects, structured data
    ai-accessibility.md            # Bot access, llms.txt, cloaking, AI citation readiness
    security.md                    # Headers, secrets, CVEs, vibe-coding checklist
    performance.md                 # CWV, Lighthouse, caching, bundle analysis
    on-page.md                     # Titles, descriptions, headings, content quality
  references/
    ai-crawler-landscape.md        # Bot taxonomy, user-agents, Cloudflare, llms.txt
    security-checks.md             # Header specs, vibe-coding patterns, secrets regex
    sf-power-workflows.md          # Screaming Frog extractions, JS snippets, CLI usage
    performance-budgets.md         # CWV thresholds, diagnostic playbooks, caching patterns
    stack-profiles.md              # 15 stack fingerprints, bash recon commands, platform ceilings
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only the files needed for selected sub-audits get loaded into context. If you skip performance and security, those playbooks and references never get read.
&lt;/p&gt;
</content:encoded></item><item><title>Agent runtimes, not models, now control how AI reads your site</title><link>https://technicalseonews.com/latest/agent-runtimes-not-models-now-control-how-ai-reads-your-site</link><guid isPermaLink="true">https://technicalseonews.com/latest/agent-runtimes-not-models-now-control-how-ai-reads-your-site</guid><description>Cloudflare and OpenAI shipped agent runtime SDKs that fetch and parse your pages before any model sees them, breaking sites that depend on client-side JS.</description><pubDate>Fri, 08 May 2026 19:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Cloudflare and OpenAI both shipped agent runtime SDKs in mid-April. &lt;a href=&quot;https://www.searchenginejournal.com/the-agent-runtime-wars-have-begun-is-your-website-ready/574174/&quot;&gt;A Search Engine Journal analysis by Slobodan Manic&lt;/a&gt; argues this marks a structural shift. The runtime, not the model, now fetches your page, parses your HTML, executes (or skips) your JavaScript, and resolves your structured data.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers&quot;&gt;Cloudflare&apos;s April 15 release&lt;/a&gt;, called Project Think, includes durable execution with crash recovery, sub-agents, persistent sessions, and sandboxed code execution. The next day, April 16, the &lt;a href=&quot;https://developers.cloudflare.com/workers-ai/&quot;&gt;Cloudflare Workers AI platform&lt;/a&gt; added a vendor-agnostic inference layer and vector index for agent retrieval.&lt;/p&gt;
&lt;p&gt;Around the same time, OpenAI shipped an updated &lt;a href=&quot;https://platform.openai.com/docs/guides/agents&quot;&gt;Agents SDK&lt;/a&gt; with native sandbox execution. Two of the web&apos;s largest infrastructure operators answered the same question within days: how does a long-running AI agent actually run in production?&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The model no longer reads your website directly in agent architectures. The runtime fetches your page, parses it, and hands the result to the model. By the time any model sees your content, it sees the runtime&apos;s interpretation of it.&lt;/p&gt;
&lt;p&gt;This is an emerging pattern rather than a settled universal reality. Many AI products still use proprietary or model-specific crawlers. Perplexity, for example, operates its own crawling infrastructure independent of these runtime SDKs.&lt;/p&gt;
&lt;p&gt;But the Cloudflare and OpenAI releases signal a clear direction. Optimizing narrowly for individual models misses the point if the runtime upstream can&apos;t parse your site.&lt;/p&gt;
&lt;p&gt;JavaScript-heavy sites face the sharpest risk. Sandboxed execution in agent runtimes may block dynamic imports, certain fetch patterns, or browser APIs. These failures can occur even on pages that Googlebot renders correctly.&lt;/p&gt;
&lt;p&gt;Passing Google&apos;s crawl tests only confirms Googlebot compatibility. Agent runtimes operate under different and often more restrictive sandboxes. A Next.js site with lazy-loaded product attributes could render perfectly for humans while agents see incomplete data. Those failures won&apos;t appear in traditional SEO audits.&lt;/p&gt;
&lt;p&gt;Structured data practitioners face a related problem. JSON-LD that depends on JavaScript injection may not resolve at all in a runtime that skips client-side execution. Server-side JSON-LD is the recommended practice regardless of agent considerations.&lt;/p&gt;
&lt;p&gt;Multi-threaded agent sessions create infrastructure concerns too. Multiple concurrent requests from a single agent task could trigger rate limiting or bot detection systems built for single-request crawlers. Server logs may not clearly distinguish &lt;a href=&quot;/latest/managed-wordpress-hosts-silently-block-ai-crawlers&quot;&gt;agent runtime traffic from regular bot traffic&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Test your critical pages without JavaScript.&lt;/strong&gt; Disable JavaScript and check whether your most important content, structured data, and product attributes still appear in the raw HTML. If key information only exists after client-side rendering, server-side render it or pre-render it as static HTML.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit your JSON-LD independently from your rendered page.&lt;/strong&gt; Use View Page Source or curl to check whether structured data exists in the raw server response before any JavaScript executes. The Rich Results Test uses a Googlebot-equivalent rendering pipeline, so it won&apos;t catch JS-dependency issues relevant to agent runtimes. If your structured data is injected by JavaScript, add it to the initial server response instead.&lt;/p&gt;
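&lt;p&gt;A minimal curl version of that check, assuming your structured data ships as JSON-LD script blocks (this only confirms presence in the initial HTML, not validity):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Count JSON-LD blocks in the raw server response, before any JavaScript runs
curl -s &quot;https://your-site.com/sample-page&quot; | grep -c &quot;application/ld+json&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A count of zero here, with structured data visible in the browser&apos;s rendered DOM, means the markup is injected client-side and may not resolve for a runtime that skips JavaScript.&lt;/p&gt;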
&lt;p&gt;&lt;strong&gt;Check your authentication flow for multi-call sessions.&lt;/strong&gt; Auth flows built for one-shot human logins will break when an agent needs to hold state across a multi-request task. Verify that agents acting on a user&apos;s behalf can maintain a session.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Review your rate limiting and bot detection rules.&lt;/strong&gt; Agent runtimes may send multiple concurrent requests as part of a single user task. If your WAF treats rapid sequential requests as abuse, legitimate agent traffic gets blocked silently. Log agent runtime user-agent strings separately.&lt;/p&gt;
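&lt;p&gt;As a starting point for separate logging, a grep over a combined-format access log works. The token list below uses publicly documented AI crawler and agent user agents as stand-ins; the runtime SDKs discussed here may identify themselves differently, so treat it as illustrative and expand it as new strings appear (the log path is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rough breakdown of AI-related user agents hitting the site
grep -oE &quot;GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|PerplexityBot&quot; /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn
&lt;/code&gt;&lt;/pre&gt;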
&lt;p&gt;&lt;strong&gt;Build for accessibility first.&lt;/strong&gt; Semantic HTML, proper heading hierarchy, and ARIA landmarks are accessibility best practices that also make sites parseable by agent runtimes. A site that works for screen readers is already well-positioned for AI browsing agents that navigate via the accessibility tree, such as those using Playwright MCP.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consider structured agent access via WebMCP.&lt;/strong&gt; The Web Model Context Protocol is an emerging pattern (not yet a ratified standard) that lets pages register tools and structured section access for AI agents via navigator.modelContext. Rather than hoping every runtime correctly parses your HTML, WebMCP gives agents a structured interface to your content.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Serve machine-readable responses at key endpoints.&lt;/strong&gt; If your pages only make sense inside a full browser session with CSS layout, agents will struggle. Non-semantic HTML breaks agent interaction across multiple access patterns, as &lt;a href=&quot;/latest/google-tells-developers-to-build-websites-for-ai-agents&quot;&gt;covered in our analysis of Google&apos;s developer guidance&lt;/a&gt;. The minimum is well-formed semantic HTML with server-rendered JSON-LD.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Stale state after crash recovery.&lt;/strong&gt; Durable runtimes like Cloudflare&apos;s can pause mid-task and resume later using previously fetched content. Pages with time-sensitive data (pricing, inventory, event availability) are most exposed. Cache-Control headers govern CDN and browser caches, not what a runtime persists internally. Design fallbacks like re-validation steps, short-lived tokens, and server-side timestamps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Invisible failures across runtimes.&lt;/strong&gt; A site that works for Googlebot may fail silently for Cloudflare&apos;s sandbox or OpenAI&apos;s runtime. No single monitoring tool covers all agent runtimes yet. Manual testing against raw HTML output is the most reliable check.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/agent-runtimes-not-models-now-control-how-ai-reads-your-site.webp" medium="image" type="image/webp"/></item><item><title>Six weeks of 307 redirects split two identical migrations</title><link>https://technicalseonews.com/latest/six-weeks-of-307-redirects-split-two-identical-migrations</link><guid isPermaLink="true">https://technicalseonews.com/latest/six-weeks-of-307-redirects-split-two-identical-migrations</guid><description>Two identical subdomain migrations diverged because six weeks of 307 temporary redirects delayed canonical signals, and a sitemap parsing error compounded one.</description><pubDate>Fri, 08 May 2026 19:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A practitioner &lt;a href=&quot;https://www.reddit.com/r/bigseo/comments/1t7244a/indexing_issues_after_migration_one_brand/&quot;&gt;posted in r/bigseo&lt;/a&gt; describing a subdomain migration where two clothing brands were moved onto the same parent domain. Both migrations were handled identically. One brand is recovering normally. The other has nearly vanished from Google&apos;s index.&lt;/p&gt;
&lt;p&gt;The key detail: both brands initially used 307 (temporary) redirects instead of 301s. That lasted roughly six weeks before the team switched to permanent 301 redirects. The 301s have been live for about 2.5 weeks. The practitioner reports that Google Search Console is showing a &quot;site is being migrated&quot; status for both domains. The post does not specify how this notification appeared. Google&apos;s Change of Address tool generates similar notices and supports both domain and subdomain migrations, though its applicability to a consolidation onto a parent domain is unclear.&lt;/p&gt;
&lt;p&gt;Brand A is picking up new URLs and updating the index as expected. Brand B has only a handful of pages indexed on the new subdomain, with a few dozen lingering on the old domain. The practitioner also discovered that Brand B&apos;s sitemap wasn&apos;t being read correctly by Google.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The six-week 307 window is the likely root cause of the divergence. During that period, Google had no permanent redirect signal telling it which domain to treat as canonical. Google&apos;s &lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/canonicalization&quot;&gt;documentation on URL canonicalization&lt;/a&gt; lists permanent redirects (301, 308) as a stronger canonicalization signal than temporary redirects (302, 307). Six weeks on 307s therefore gave Google a weaker signal than a 301 would have, likely delaying canonical consolidation for Brand B. In that time, Google may have locked in a preference for Brand B&apos;s old domain while Brand A, for whatever reason, fared better in canonical evaluation.&lt;/p&gt;
&lt;p&gt;The sitemap failure compounded the problem silently. If Google couldn&apos;t parse Brand B&apos;s sitemap during the transition, it had no crawl queue signal pushing it toward the new subdomain URLs. Per Google&apos;s documentation, sitemaps are a weak canonicalization signal, though they do aid URL discovery, and losing that discovery signal during a migration can leave the new subdomain URLs under-crawled.&lt;/p&gt;
&lt;p&gt;Subdomains may be treated as separate properties by Google in some contexts, which can affect how crawl demand is distributed. If Brand B had lower historical crawl frequency than Brand A, the new subdomain destination may inherit that lower priority. Recovery is slower even when everything else is configured correctly.&lt;/p&gt;
&lt;p&gt;Whatever the source of the GSC &quot;migrating&quot; status, treat it as a lagging indicator at best. It tells you Google is aware a migration is happening, not that indexing has flowed to the new URLs. The URL Inspection tool shows the indexed state of a single URL, but critically it also shows the Google-selected canonical. If URL Inspection shows a new subdomain URL as a non-canonical duplicate, Google still prefers the old domain version. The migration has not resolved for that page. Practitioners who check only the top-line status without reading the canonical field may miss that most of their pages haven&apos;t transitioned.&lt;/p&gt;
&lt;p&gt;Multi-brand operators running &lt;a href=&quot;/latest/subfolder-subdomain-or-separate-com-a-composable-stack-call&quot;&gt;subdomain consolidations&lt;/a&gt; are the most exposed. Sites with tens of thousands of pages depend on sitemap discovery working correctly during the transition window. A parsing error that goes unnoticed for weeks can stall the entire migration for one property while another sails through.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Fix and resubmit the sitemap immediately.&lt;/strong&gt; The practitioner already identified that Brand B&apos;s sitemap wasn&apos;t being read correctly. Resubmit a clean sitemap in GSC containing only the new subdomain URLs. Don&apos;t wait for Google to auto-discover the corrected version.&lt;/p&gt;
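&lt;p&gt;Before resubmitting, confirm the new sitemap is fetchable and well-formed XML. A sketch assuming xmllint is installed and using a placeholder sitemap URL:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SITEMAP=&quot;https://brand-b.parent-domain.com/sitemap.xml&quot;
# Reachable, and served with a sensible content type?
curl -s -o /dev/null -w &quot;status %{http_code}, type %{content_type}\n&quot; &quot;$SITEMAP&quot;
# Well-formed XML? (xmllint prints nothing on success)
curl -s &quot;$SITEMAP&quot; | xmllint --noout -
&lt;/code&gt;&lt;/pre&gt;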
&lt;p&gt;&lt;strong&gt;Check whether the old domain&apos;s sitemap is still live.&lt;/strong&gt; If Google can still find an old sitemap pointing to old URLs on the original domain, it may treat those as canonical alternatives. Remove or update old sitemaps so they don&apos;t compete with the new ones. Google&apos;s &lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview&quot;&gt;sitemap documentation&lt;/a&gt; describes how sitemaps inform URL discovery.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use URL Inspection on a sample of Brand B pages, not just one.&lt;/strong&gt; Inspect 20–30 pages across different sections to get an accurate picture. For each, check the Google-selected canonical field, not just the indexed status. A single green result doesn&apos;t reflect the index state for the full site. Setting &lt;a href=&quot;/latest/migration-traffic-drops-need-pre-defined-thresholds-not-panic&quot;&gt;pre-defined recovery thresholds&lt;/a&gt; before a migration starts helps distinguish expected drops from stalled indexing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Verify internal links from the parent domain point to Brand B&apos;s new subdomain.&lt;/strong&gt; If the parent site links to old Brand B URLs or doesn&apos;t link to Brand B at all, the new subdomain gets no signal pushing Google to crawl it. Brand A may have recovered faster because it had stronger internal linking from the parent domain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Request indexing for high-priority Brand B pages manually.&lt;/strong&gt; For the most important category and product pages, use the URL Inspection tool to request indexing. The manual signal won&apos;t scale to thousands of pages, but it tells Google to crawl those specific URLs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Keep the 301 redirects in place for at least one year, ideally permanently.&lt;/strong&gt; This is practitioner consensus, not a directive in Google&apos;s migration documentation. Google&apos;s systems don&apos;t re-crawl old URLs frequently. Removing redirects too early forces Google to rediscover the migration from scratch. Treat 301s as long-term stabilizers, not a quick handoff.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The &quot;both domains indexed&quot; split-signal trap.&lt;/strong&gt; Brand B still has pages on the old domain and a handful on the new one. If external backlinks continue pointing to old Brand B URLs and the old domain remains crawlable, Google can&apos;t cleanly resolve which version to keep. Check that the old domain&apos;s robots.txt does not disallow the redirecting URLs. If old URLs are blocked by robots.txt, Google may drop them from crawl queues over time and stop revisiting them. Once Google stops fetching those URLs, it stops encountering the 301 redirects, and the index transfer signal is lost. This is an easily overlooked migration misconfiguration. The old redirecting URLs must remain crawlable so Google continues to follow the redirects to the new subdomain.&lt;/p&gt;
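&lt;p&gt;Both conditions are quick to verify from the command line. A sketch with placeholder URLs for the old Brand B domain:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. The old URL should answer with a 301 pointing at the new subdomain
curl -s -o /dev/null -w &quot;%{http_code} %{redirect_url}\n&quot; &quot;https://old-brand-b.com/example-page&quot;
# 2. The old domain&apos;s robots.txt should not disallow the redirecting paths
curl -s &quot;https://old-brand-b.com/robots.txt&quot; | grep -i &quot;^Disallow&quot;
&lt;/code&gt;&lt;/pre&gt;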
&lt;p&gt;&lt;strong&gt;False confidence from GSC migration notices.&lt;/strong&gt; The notification that &quot;the old site is being migrated&quot; reflects Google&apos;s awareness of the redirect, not the state of indexing. Practitioners often treat this as an all-clear signal when actual index migration can lag by weeks or months, especially after a prolonged temporary redirect window.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/six-weeks-of-307-redirects-split-two-identical-migrations.webp" medium="image" type="image/webp"/></item><item><title>Google officially deprecates FAQ rich results as of May 2026</title><link>https://technicalseonews.com/latest/google-officially-deprecates-faq-rich-results-as-of-may-2026</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-officially-deprecates-faq-rich-results-as-of-may-2026</guid><description>Google is phasing out FAQ rich results starting May 7, 2026. Teams using Search Console API integrations need to update before August cutoff.</description><pubDate>Fri, 08 May 2026 18:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google confirmed that FAQ rich results are no longer appearing in Search as of May 7, 2026. The announcement, &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1t7dite/google_faq_rich_results_are_no_longer_appearing/&quot;&gt;shared on r/TechSEO&lt;/a&gt; and sourced from &lt;a href=&quot;https://developers.google.com/search/docs/appearance/structured-data/faqpage&quot;&gt;Google&apos;s developer documentation&lt;/a&gt;, lays out a three-phase deprecation timeline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;May 7, 2026:&lt;/strong&gt; FAQ rich results stop appearing in Google Search.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;June 2026:&lt;/strong&gt; Google removes the FAQ search appearance filter, the rich result report in Search Console, and FAQ support in the Rich Results Test tool.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;August 2026:&lt;/strong&gt; Search Console API support for FAQ rich result data is removed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;a href=&quot;https://developers.google.com/search/updates#may-2026&quot;&gt;Google Search docs changelog&lt;/a&gt; also documents the update. &lt;a href=&quot;https://searchengineland.com/google-to-no-longer-support-faq-rich-results-476957&quot;&gt;Search Engine Land covered the news&lt;/a&gt; as well.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;FAQ rich results have been declining in visibility since Google restricted them to government and health authority sites in August 2023. The formal deprecation removes one more SERP feature that SEOs once used to capture additional real estate in search results.&lt;/p&gt;
&lt;p&gt;The practical impact varies by team size and tooling. Enterprise content teams managing thousands of FAQ pages face a decision about when and how to remove FAQ schema from templates. A passive template deploy means Googlebot picks up the changes as it naturally re-crawls pages on its own schedule, not as an immediate site-wide surge. Leaving the markup in place is harmless for now, but creates monitoring noise.&lt;/p&gt;
&lt;p&gt;SaaS platforms and agencies with automated reporting pipelines face a harder deadline. Any Search Console API integration that queries FAQ rich result data will break in August 2026. Teams running automated SEO health dashboards that track FAQ appearance metrics need to update those queries before the cutoff.&lt;/p&gt;
&lt;p&gt;For smaller sites, the impact is minimal. FAQ schema wasn&apos;t generating rich results for the vast majority of domains already. The markup can stay or go without meaningful consequence.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://schema.org/FAQPage&quot;&gt;FAQPage type at schema.org&lt;/a&gt; remains a valid schema definition. Google choosing not to render it does not make the markup invalid. The distinction matters if you use FAQ schema for purposes beyond Google, such as feeding other search engines or internal tooling.&lt;/p&gt;
&lt;p&gt;The deprecation raises an open question for AI search surfaces. Google is removing FAQ rich results from classical search, but FAQ structured data could still play a role in AI contexts. Retrieval pipelines used by AI search products may use structured Q&amp;amp;A pairs during document selection, even though the LLM itself does not parse schema tags. Whether AI Overviews, ChatGPT search, or Perplexity use FAQPage schema in their retrieval layers is undocumented. Practitioners deciding whether to remove FAQ markup should factor this uncertainty into the timeline.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t rush to remove FAQ schema markup.&lt;/strong&gt; The markup is inert now. Google isn&apos;t penalizing it, and removing it from thousands of pages creates unnecessary work. If you do want to clean it up, batch the removal over time rather than deploying a site-wide template change that forces re-crawling of every affected URL at once. If you run Shopify, also check for &lt;a href=&quot;/latest/yotpo-injects-duplicate-faqpage-schema-on-shopify-pages&quot;&gt;third-party FAQ schema injections&lt;/a&gt; you may not have added yourself.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit your Search Console API integrations now.&lt;/strong&gt; If you pull FAQ rich result data from the &lt;a href=&quot;https://developers.google.com/webmaster-tools/v1/api_reference_index&quot;&gt;Search Console API&lt;/a&gt;, you have until August 2026 before those calls stop returning data. Check whether your monitoring dashboards, automated reports, or alerting systems reference the FAQ search appearance. Update or remove those queries before August to avoid silent failures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Expect false positives in Search Console during the transition.&lt;/strong&gt; Between now and June, FAQ data may still appear in Search Console reports even though the rich results are gone from SERPs. Don&apos;t interpret lingering report data as evidence that FAQ results are still live.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check &lt;a href=&quot;/latest/google-drops-no-js-testing-advice-from-javascript-seo-docs&quot;&gt;JavaScript-rendered FAQ content separately&lt;/a&gt;.&lt;/strong&gt; If your FAQ schema is injected via client-side JavaScript, confirming removal after deployment is harder. Use URL Inspection&apos;s &quot;Test Live URL&quot; option, which fetches and renders the page in real time. The default view shows cached crawl state, not current output. For large-scale audits, run a renderer-enabled crawler like Screaming Frog.&lt;/p&gt;
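&lt;p&gt;For server-rendered markup, a curl spot-check confirms removal per URL. For JavaScript-injected markup this will always report zero, which is exactly why the rendered test above matters (placeholder URL):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# FAQPage references remaining in the raw HTML after the removal deploy
curl -s &quot;https://your-site.com/some-faq-page&quot; | grep -c &quot;FAQPage&quot;
&lt;/code&gt;&lt;/pre&gt;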
&lt;p&gt;&lt;strong&gt;Factor AI search surfaces into your keep-or-remove decision.&lt;/strong&gt; FAQ schema costs nothing to maintain and may have undocumented value for AI retrieval pipelines. If you have no urgent reason to remove it, leaving it in place preserves optionality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Leave FAQ content in place.&lt;/strong&gt; The schema is deprecated, not the content. FAQ-style content still has value for users and can surface in AI Overviews or standard organic results. Only the structured data rendering is affected.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Template-level removal on large sites.&lt;/strong&gt; Sites that baked FAQ schema into product or category templates should avoid force-requesting re-crawls for all affected URLs at once. Let the normal crawl schedule pick up the changes. For most sites, passive schema removal carries no meaningful crawl budget risk.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-language sites with separate crawl schedules.&lt;/strong&gt; Removing FAQ schema from one language version doesn&apos;t guarantee other hreflang variants get re-crawled on the same timeline. Verify each language variant individually.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/google-officially-deprecates-faq-rich-results-as-of-may-2026.webp" medium="image" type="image/webp"/></item><item><title>Next.js patches 13 vulnerabilities in security release</title><link>https://technicalseonews.com/latest/next-js-patches-13-vulnerabilities-in-security-release</link><guid isPermaLink="true">https://technicalseonews.com/latest/next-js-patches-13-vulnerabilities-in-security-release</guid><description>Next.js patched 13 vulnerabilities in versions 15.5.18 and 16.2.6. Middleware bypasses can expose protected content to crawlers without auth checks.</description><pubDate>Fri, 08 May 2026 12:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Vercel shipped a &lt;a href=&quot;https://vercel.com/changelog/next-js-may-2026-security-release&quot;&gt;coordinated security release&lt;/a&gt; on May 7 covering 13 advisories across Next.js and React. The vulnerabilities span five categories: middleware and proxy bypass, denial of service, server-side request forgery, cache poisoning, and cross-site scripting. One advisory tracks an upstream React Server Components vulnerability as CVE-2026-23870.&lt;/p&gt;
&lt;p&gt;Patched versions are Next.js 15.5.18 and 16.2.6. On the React side, the &lt;code&gt;react-server-dom-*&lt;/code&gt; packages (parcel, webpack, turbopack) are fixed in 19.0.6, 19.1.7, and 19.2.6. All Next.js 13.x and 14.x users must upgrade to at least 15.5.18 or 16.2.6. There is no backport. This is a major version jump that requires React 19 and may include breaking API changes, so review the official upgrade guides before deploying to production.&lt;/p&gt;
&lt;p&gt;Vercel explicitly stated that WAF rules cannot block these issues. The vulnerabilities sit deep in the request routing pipeline, not at the input validation layer.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Four of the five middleware and proxy bypass advisories are rated High; the fifth (middleware redirect cache poisoning) is rated Low. Per the &lt;a href=&quot;https://vercel.com/changelog/next-js-may-2026-security-release&quot;&gt;Vercel advisory&lt;/a&gt;, these affect any application that relies on &lt;a href=&quot;https://nextjs.org/docs/app/building-your-application/routing/middleware&quot;&gt;middleware.js or proxy.js&lt;/a&gt; for authorization. For SEO practitioners, that means sites using proxy to gate content behind paywalls, subscription checks, or bot-detection logic.&lt;/p&gt;
&lt;p&gt;The segment-prefetch bypasses are the most SEO-relevant. They allow requests to reach protected content paths without triggering the auth middleware. &lt;a href=&quot;/latest/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves&quot;&gt;Robots.txt is fetched independently&lt;/a&gt; by the crawler and would still be honored. If you&apos;ve Disallowed those paths, compliant bots like Googlebot won&apos;t request them. The real exposure is meta robots tags and other response-level signals. If your middleware served a &lt;code&gt;noindex&lt;/code&gt; header or a login redirect on protected paths, the bypass skips that logic and serves the gated content directly. Googlebot can then index pages that should have been protected.&lt;/p&gt;
&lt;p&gt;Consider an e-commerce site checking subscription status in middleware before serving &lt;code&gt;/products/premium-inventory&lt;/code&gt;. The segment-prefetch bypass lets a crawler fetch that path without hitting the auth check. Private SKUs get indexed. Because the bypass serves a normal 200 response, the exposure may only become apparent through manual crawling or by noticing protected URLs appearing in &lt;code&gt;site:&lt;/code&gt; queries.&lt;/p&gt;
&lt;p&gt;The cache poisoning advisories compound the problem. If a protected resource gets fetched via the bypass, your CDN may cache that response. Subsequent requests from any user or bot then receive the cached protected content. One advisory notes that middleware redirects can be cache-poisoned: a legitimate &lt;code&gt;/protected&lt;/code&gt; → &lt;code&gt;/login&lt;/code&gt; redirect gets replaced with an attacker&apos;s response. Crawlers hitting the poisoned cache encounter redirect chains that waste crawl budget or follow redirects to attacker-controlled URLs. If your site uses aggressive &lt;a href=&quot;https://web.dev/articles/http-cache&quot;&gt;HTTP caching&lt;/a&gt; with long TTLs, the poisoned responses persist longer.&lt;/p&gt;
&lt;p&gt;The DoS vulnerabilities matter for crawl reliability. CVE-2026-23870 is a denial-of-service vulnerability in React Server Components, while a separate Cache Components issue causes DoS via connection exhaustion. In a DoS scenario, server unavailability could cause crawlers to time out, potentially delaying indexing of new content.&lt;/p&gt;
&lt;p&gt;The two XSS advisories affect apps using CSP nonces in App Router or &lt;code&gt;beforeInteractive&lt;/code&gt; scripts consuming untrusted input. These are less directly SEO-impactful but still relevant for sites that inject nonces into server-rendered responses.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Patch immediately.&lt;/strong&gt; Update Next.js to 15.5.18 or 16.2.6. Update &lt;code&gt;react-server-dom-webpack&lt;/code&gt;, &lt;code&gt;react-server-dom-turbopack&lt;/code&gt;, and &lt;code&gt;react-server-dom-parcel&lt;/code&gt; to the latest patched version for your React release line (19.0.6, 19.1.7, or 19.2.6).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check both dependencies.&lt;/strong&gt; If your bundler pins a specific version of &lt;code&gt;react-server-dom-*&lt;/code&gt;, the Next.js upgrade alone won&apos;t fix CVE-2026-23870. Verify your lockfile includes the patched React packages.&lt;/p&gt;
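&lt;p&gt;One way to check is to read the lockfile directly. The sketch below assumes an npm &lt;code&gt;package-lock.json&lt;/code&gt; (lockfile v2/v3); pnpm and Yarn users would parse their own lockfile formats instead.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: report every next and react-server-dom-* version pinned in an
// npm lockfile, so you can confirm the patched releases actually landed in the
// dependency tree rather than only in package.json.
import { readFileSync } from &apos;node:fs&apos;;

type LockfilePackage = { version?: string };
type Lockfile = { packages?: Record&lt;string, LockfilePackage&gt; };

const lock: Lockfile = JSON.parse(readFileSync(&apos;package-lock.json&apos;, &apos;utf8&apos;));
const hits: string[] = [];

for (const [path, pkg] of Object.entries(lock.packages ?? {})) {
  if (/react-server-dom-(webpack|turbopack|parcel)/.test(path) || /node_modules\/next$/.test(path)) {
    hits.push(`${path} -&gt; ${pkg.version ?? &apos;unknown&apos;}`);
  }
}

console.log(hits.length ? hits.join(&apos;\n&apos;) : &apos;No matching packages found in the lockfile.&apos;);
&lt;/code&gt;&lt;/pre&gt;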
&lt;p&gt;&lt;strong&gt;Audit for prior exploitation.&lt;/strong&gt; Search Google using &lt;code&gt;site:yourdomain.com&lt;/code&gt; for URLs that should be behind authentication. Check Search Console&apos;s indexed pages report for &lt;a href=&quot;/latest/gsc-shows-pages-as-indexed-but-google-won-t-serve-them&quot;&gt;protected paths that shouldn&apos;t appear&lt;/a&gt;. If you find indexed protected content, request removal and purge your CDN cache for those URLs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Purge your CDN cache after patching.&lt;/strong&gt; Cache poisoning means stale or malicious responses may persist even after the code fix is deployed. If you use ISR, consider triggering a full revalidation rather than waiting for individual paths to expire.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t rely on middleware alone for auth.&lt;/strong&gt; The &lt;a href=&quot;https://react.dev/reference/react/use-server&quot;&gt;server/client boundary in RSC&lt;/a&gt; is still maturing. Defense-in-depth matters. Use signed cookies, JWTs, or HTTP-only tokens validated at the data layer, not just at the routing layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test locally before deploying.&lt;/strong&gt; Attempt to access protected routes directly to confirm the segment-prefetch bypass is closed in your patched build. Verify that protected paths return the expected auth response rather than serving gated content.&lt;/p&gt;
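&lt;p&gt;A small smoke test can make that check repeatable. The sketch below is assumption-heavy: the base URL, route list, and expected status codes are placeholders for your own application, and it only probes the plain path, not the specific bypass request shapes described in the advisories.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: request protected routes on a local build and fail loudly if any
// of them returns a 200 instead of the expected auth response (redirect or 401).
const base = process.env.BASE_URL ?? &apos;http://localhost:3000&apos;;
const protectedRoutes: Array&lt;{ path: string; expect: number[] }&gt; = [
  { path: &apos;/account&apos;, expect: [302, 307, 401] },
  { path: &apos;/products/premium-inventory&apos;, expect: [302, 307, 401] },
];

async function checkRoutes(): Promise&lt;void&gt; {
  let failed = false;
  for (const route of protectedRoutes) {
    const res = await fetch(`${base}${route.path}`, { redirect: &apos;manual&apos; });
    const ok = route.expect.includes(res.status);
    console.log(`${route.path}: ${res.status} ${ok ? &apos;OK&apos; : &apos;UNEXPECTED, gated content may be exposed&apos;}`);
    if (!ok) failed = true;
  }
  if (failed) process.exit(1);
}

checkRoutes().catch(console.error);
&lt;/code&gt;&lt;/pre&gt;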
&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; If your Next.js site uses streaming, check whether &lt;a href=&quot;/latest/next-js-streaming-metadata-fails-google-indexing&quot;&gt;streaming metadata is reaching Google&apos;s index correctly&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Silent bypass.&lt;/strong&gt; Because the segment-prefetch bypass skips middleware rather than triggering an auth failure, protected content may appear in search indexes or CDN cache entries without obvious error signals. Actively check for indexed protected URLs rather than waiting for errors to surface.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Coordinated dependency updates.&lt;/strong&gt; Patching Next.js without also updating &lt;code&gt;react-server-dom-*&lt;/code&gt; leaves the upstream RSC vulnerability (CVE-2026-23870) open. Both packages must be updated in the same deploy cycle. Check your lockfile explicitly.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/next-js-patches-13-vulnerabilities-in-security-release.webp" medium="image" type="image/webp"/></item><item><title>Subfolder, subdomain, or separate .com? A composable-stack call</title><link>https://technicalseonews.com/latest/subfolder-subdomain-or-separate-com-a-composable-stack-call</link><guid isPermaLink="true">https://technicalseonews.com/latest/subfolder-subdomain-or-separate-com-a-composable-stack-call</guid><description>An EU retailer choosing between subfolder, subdomain, and separate domain for a US store faces composable-stack routing that makes hreflang invisible to Google.</description><pubDate>Thu, 07 May 2026 19:30:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1t6ckju/subfolder_subdomain_or_separate_com_for_an_eu/&quot;&gt;practitioner post in r/TechSEO&lt;/a&gt; laid out a high-stakes architecture decision. An established EU fashion retailer with organic rankings across DACH, UK, France, and Benelux needs a URL structure for a new US storefront. The stack is composable (Next.js, Strapi CMS, multi-region commerce backend), so splitting storefronts per region is cheap on the engineering side. The SEO side is where it gets expensive.&lt;/p&gt;
&lt;p&gt;The poster listed four options: subfolder under the existing .com, subdomain (us.brand.com), a standalone .com, or .us ccTLD. Engineering has no strong preference because the composable backend handles regional splitting at the API layer. The SEO call is the deciding factor. The poster mentioned that agencies they&apos;ve consulted are softening the subfolder recommendations they gave two years ago, though that&apos;s one practitioner&apos;s read, not a documented industry shift.&lt;/p&gt;
&lt;p&gt;The core question: does hreflang still help Google serve the correct regional variant, or has its reliability eroded under AI Overviews and recent ranking volatility?&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The subfolder-with-hreflang answer has been standard for a decade. &lt;a href=&quot;https://developers.google.com/search/docs/specialty/international/managing-multi-regional-sites&quot;&gt;Google&apos;s multi-regional site documentation&lt;/a&gt; presents subfolders, subdomains, and separate domains as valid approaches with different trade-offs, but doesn&apos;t address how AI Overviews interact with multi-region structures.&lt;/p&gt;
&lt;p&gt;The basic trade-offs are familiar. Subfolders pool authority but risk geo-intent blurring. Subdomains isolate crawl entities but may be slower to associate with root-domain authority. Separate domains give the cleanest signal separation but start from zero. Google&apos;s documentation lists comparable pros and cons for each without claiming any option performs identically in search.&lt;/p&gt;
&lt;p&gt;What makes this thread worth covering is the composable-stack angle. On Next.js, locale state often lives in the routing layer rather than in clean URL paths, which makes hreflang misconfigurations easier to introduce than it might appear. If the commerce backend handles multi-region logic at the API layer but URLs don&apos;t reflect regional differences, Google sees inconsistent content from identical URLs. If you serve different content on the same URL depending on the visitor&apos;s region or language, Google and AI bots will only see one version. The other versions effectively do not exist in search. Every regional version needs its own crawlable URL path.&lt;/p&gt;
&lt;p&gt;Googlebot does not send Accept-Language headers, so header-based or cookie-based locale routing is invisible to it. The &lt;a href=&quot;https://www.w3.org/International/questions/qa-when-lang-neg&quot;&gt;W3C notes general limitations&lt;/a&gt; with content negotiation for delivering the right language version, and search engines sharpen the problem.&lt;/p&gt;
&lt;p&gt;Hreflang has always been a hint, not a directive. Google can override it based on relevance signals. If a new US subfolder has weak authority compared to the established EU parent, Google may serve the EU version for US queries anyway. The practitioner concern is real, but it reflects hreflang working as designed rather than a new degradation.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t let engineering convenience drive the URL decision.&lt;/strong&gt; The fact that composable stacks make region splitting cheap in code doesn&apos;t mean the SEO consequences are reversible. Authority flows and crawl patterns lock in faster than a codebase refactor can undo them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you choose subfolders, add structured data for regional clarity.&lt;/strong&gt; Hreflang alone may not be enough. Add &lt;code&gt;areaServed&lt;/code&gt; to an &lt;code&gt;Offer&lt;/code&gt; entity nested within your &lt;a href=&quot;https://schema.org/Product&quot;&gt;Product schema markup&lt;/a&gt; (via the &lt;code&gt;offers&lt;/code&gt; property) to reinforce which region each product page targets.&lt;/p&gt;
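&lt;p&gt;As a rough illustration, the component below emits Product markup with &lt;code&gt;areaServed&lt;/code&gt; on the nested Offer from a Next.js server component. The product values are placeholders and the snippet is a sketch rather than a recommended template; validate whatever you ship with the Rich Results Test.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: Product schema with areaServed on the nested Offer, emitted from
// a Next.js server component. All product values are placeholders.
export default function UsProductSchema() {
  const jsonLd = {
    &apos;@context&apos;: &apos;https://schema.org&apos;,
    &apos;@type&apos;: &apos;Product&apos;,
    name: &apos;Example Wool Coat&apos;,
    sku: &apos;COAT-123-US&apos;,
    offers: {
      &apos;@type&apos;: &apos;Offer&apos;,
      price: &apos;249.00&apos;,
      priceCurrency: &apos;USD&apos;,
      availability: &apos;https://schema.org/InStock&apos;,
      areaServed: { &apos;@type&apos;: &apos;Country&apos;, name: &apos;US&apos; },
    },
  };
  return (
    &lt;script
      type=&quot;application/ld+json&quot;
      dangerouslySetInnerHTML={{ __html: JSON.stringify(jsonLd) }}
    /&gt;
  );
}
&lt;/code&gt;&lt;/pre&gt;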
&lt;p&gt;&lt;strong&gt;Get canonicals and hreflang aligned, not just &quot;redundant.&quot;&lt;/strong&gt; Each regional URL needs a self-referencing canonical pointing to itself, and that canonical must match the URL listed as the self-reference inside that page&apos;s hreflang cluster. If a canonical on /en-us/ points cross-region to /en-de/, Google will treat /en-de/ as the indexable version and ignore the hreflang annotations on the /en-us/ page entirely. &lt;a href=&quot;/latest/mueller-lists-nine-reasons-google-overrides-your-rel-canonical&quot;&gt;Canonicals and hreflang are complementary signals&lt;/a&gt;, but they must be consistent. A mismatch silently breaks the whole regional cluster. (For related canonical override behavior, see &lt;a href=&quot;/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers&quot;&gt;how Cloudflare enforces canonical tags as 301s for AI crawlers&lt;/a&gt;.)&lt;/p&gt;
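&lt;p&gt;In the Next.js App Router, that alignment can be expressed in one place. The sketch below assumes the async &lt;code&gt;params&lt;/code&gt; signature of recent Next.js versions and uses a placeholder domain and locale list; the point is that the canonical always resolves to the locale being rendered and the hreflang cluster includes that same URL.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: per-locale metadata where the canonical is self-referencing and
// the hreflang cluster lists every variant, including the page&apos;s own locale.
// Domain, locale list, and route params are placeholders.
import type { Metadata } from &apos;next&apos;;

const locales = [&apos;en-us&apos;, &apos;en-gb&apos;, &apos;de-de&apos;, &apos;fr-fr&apos;];

export async function generateMetadata(
  { params }: { params: Promise&lt;{ locale: string; slug: string }&gt; }
): Promise&lt;Metadata&gt; {
  const { locale, slug } = await params;
  const url = (l: string) =&gt; `https://www.brand.com/${l}/${slug}/`;
  return {
    alternates: {
      canonical: url(locale), // self-referencing, never cross-region
      languages: Object.fromEntries(locales.map((l) =&gt; [l, url(l)])),
    },
  };
}
&lt;/code&gt;&lt;/pre&gt;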
&lt;p&gt;&lt;strong&gt;Audit your locale routing for Googlebot visibility.&lt;/strong&gt; If your Next.js front-end or Strapi API returns region-specific content based on headers, query parameters, or cookies, Google won&apos;t see the regional variants. Every regional version needs its own crawlable URL path. Verify in Search Console that US and EU variants are indexed separately.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test with a small category before committing.&lt;/strong&gt; Launch a limited set of US product pages under your chosen structure and monitor Search Console for several weeks. Check whether Google is ranking the US variant for US queries or falling back to the EU version. If cross-region issues show up in a small test, they will be worse at full scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skip .us entirely.&lt;/strong&gt; The ccTLD has negligible adoption and low user trust in the US market. ccTLDs are the strongest geo-targeting signal available to Google, but the case against .us is adoption and trust, not signal strength.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If brand independence matters, go separate domain.&lt;/strong&gt; A standalone .com gives the cleanest long-term signal separation. Budget for dedicated link building to compensate for the authority gap.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;AI Overviews may conflate regional content.&lt;/strong&gt; Some practitioners speculate that subfolder structures rolling all regional signals into one domain could let AI models pull from different regional variants in the same overview. This is an unverified hypothesis, not documented behavior. Separate domains remove the structural possibility, but subfolders require extra structured data precision to keep regional content distinct.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Canonical drift during deploys.&lt;/strong&gt; It&apos;s easy for a templating change or a CMS-side canonical override to push every regional page&apos;s canonical back to a single &quot;primary&quot; locale. The moment that ships, your hreflang cluster goes silent. Add a deploy-time check that asserts each locale&apos;s canonical resolves to itself before any URL structure change goes live.&lt;/p&gt;
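&lt;p&gt;That deploy-time check can be as simple as fetching a sample URL per locale and asserting the canonical points back at itself. The sketch below assumes the canonical is present in the server-rendered HTML and that &lt;code&gt;rel&lt;/code&gt; appears before &lt;code&gt;href&lt;/code&gt; in the tag; the sample URLs are placeholders.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: CI check that each locale variant&apos;s rel=canonical in the
// server-rendered HTML points back at the URL that was fetched.
const sampleUrls = [
  &apos;https://www.brand.com/en-us/wool-coat/&apos;,
  &apos;https://www.brand.com/en-de/wool-coat/&apos;,
];

async function assertSelfCanonicals(): Promise&lt;void&gt; {
  let failures = 0;
  for (const url of sampleUrls) {
    const html = await (await fetch(url)).text();
    const match = html.match(/&lt;link[^&gt;]+rel=[&quot;&apos;]canonical[&quot;&apos;][^&gt;]+href=[&quot;&apos;]([^&quot;&apos;]+)[&quot;&apos;]/i);
    const canonical = match?.[1];
    if (canonical !== url) {
      failures++;
      console.error(`Canonical drift on ${url}: points to ${canonical ?? &apos;(none found)&apos;}`);
    }
  }
  if (failures &gt; 0) process.exit(1);
}

assertSelfCanonicals().catch(console.error);
&lt;/code&gt;&lt;/pre&gt;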
&lt;p&gt;&lt;strong&gt;ISR path invalidation and near-duplicate content.&lt;/strong&gt; On Next.js with Incremental Static Regeneration, a product update can regenerate both /en-de/ and /en-us/ paths at once. If descriptions are identical across regions, this creates a plausible duplicate content risk. The specific Googlebot impact is not formally documented, but differentiate regional content beyond price and currency as a precaution.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/subfolder-subdomain-or-separate-com-a-composable-stack-call.webp" medium="image" type="image/webp"/></item><item><title>Google&apos;s quality sampling kills scaled content, not AI detection</title><link>https://technicalseonews.com/latest/google-s-quality-sampling-kills-scaled-content-not-ai-detection</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-s-quality-sampling-kills-scaled-content-not-ai-detection</guid><description>Google&apos;s quality sampling of new URLs causes scaled content collapse, not AI detection. Poor sample performance triggers crawl reduction across the entire batch.</description><pubDate>Thu, 07 May 2026 15:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;The viral &quot;Mt. AI&quot; traffic pattern, where scaled AI content surges then collapses, keeps showing up in LinkedIn case studies. &lt;a href=&quot;https://www.searchenginejournal.com/googles-quality-threshold-is-quietly-killing-scaled-ai-content/574071/&quot;&gt;Dan Taylor argues in Search Engine Journal&lt;/a&gt; that the collapse has nothing to do with AI detection. The real mechanism is Google&apos;s quality sampling of new URL batches combined with its shifting quality threshold.&lt;/p&gt;
&lt;p&gt;Taylor points to an ongoing brand case study shared by Martin Sean Fennon on LinkedIn. The content was scaled through AI and showed the familiar pattern: initial traffic spike followed by steep decline. Taylor then shows the same pattern occurring with non-AI content from a brand launched in January 2021, well before the current wave of AI-generated pages.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The distinction between &quot;Google penalizes AI content&quot; and &quot;Google samples new URLs and drops the ones that fail quality checks&quot; changes how practitioners should respond to traffic losses.&lt;/p&gt;
&lt;p&gt;When a site publishes a large batch of new URLs, Google increases crawl resources to process them. The URLs receive a freshness boost during initial processing. Google then selects a representative sample of those new URLs, sometimes based on URL pattern such as a subfolder, and monitors how users engage with them.&lt;/p&gt;
&lt;p&gt;If the sampled URLs perform poorly after the &lt;a href=&quot;/latest/mueller-doubts-freshness-based-sitemap-splits-speed-crawling&quot;&gt;freshness boost fades&lt;/a&gt;, the remaining scaled content struggles to gain traction. Google pulls back crawl resources and &lt;a href=&quot;/latest/indexing-api-bypasses-discovered-currently-not-indexed-queue&quot;&gt;drops pages from the index&lt;/a&gt;. The content never survived on its own merits.&lt;/p&gt;
&lt;p&gt;Taylor notes that the quality threshold is not static. It shifts over time as better content gets published across the web. Adam Gent, cited in the article, has noted this moving-target dynamic. The threshold also varies by topic, since not all queries reward freshness equally.&lt;/p&gt;
&lt;p&gt;The practical implication is that AI is just an amplifier. Sites that scaled content through freelance farms, agency content mills, or template-based generation saw the same pattern years ago. AI makes it easier to produce volume, which makes it easier to trip Google&apos;s quality sampling at scale.&lt;/p&gt;
&lt;p&gt;Google recently described the goal as &quot;non-commodity content&quot; at a Toronto event. That framing fits with the sampling model: if a batch of URLs looks interchangeable with what already exists in the index, the sample will underperform, and Google will retract resources from the whole batch.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit recent scaled content against the sample model.&lt;/strong&gt; If you published a large batch of URLs in a subfolder or content hub, check which ones Google is still crawling. Use server logs or &lt;a href=&quot;/latest/screaming-frog-log-file-analyser-7-0-verifies-ai-bot-identity&quot;&gt;Screaming Frog&apos;s log analyzer&lt;/a&gt; to see if crawl frequency dropped after an initial spike. A decline in crawl frequency across the batch suggests Google sampled and found the content wanting.&lt;/p&gt;
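&lt;p&gt;If you work from raw access logs rather than a log analysis tool, a short script can surface the pattern. The sketch below assumes combined-format logs and matches Googlebot by user-agent substring only, without the reverse-DNS verification you would want before drawing firm conclusions; the log path and URL prefix are placeholders.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: count Googlebot requests per day under one path prefix, to see
// whether crawl frequency fell away after the initial spike.
import { readFileSync } from &apos;node:fs&apos;;

const logFile = &apos;access.log&apos;;
const prefix = &apos;/guides/&apos;;
const perDay = new Map&lt;string, number&gt;();

for (const line of readFileSync(logFile, &apos;utf8&apos;).split(&apos;\n&apos;)) {
  if (!line.includes(&apos;Googlebot&apos;)) continue;
  const pathMatch = line.match(/&quot;(?:GET|POST) ([^ ]+) HTTP/);
  const dateMatch = line.match(/\[(\d{2}\/\w{3}\/\d{4})/);
  if (!pathMatch || !dateMatch || !pathMatch[1].startsWith(prefix)) continue;
  perDay.set(dateMatch[1], (perDay.get(dateMatch[1]) ?? 0) + 1);
}

for (const [day, count] of perDay) {
  console.log(`${day}: ${count} Googlebot requests under ${prefix}`);
}
&lt;/code&gt;&lt;/pre&gt;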
&lt;p&gt;&lt;strong&gt;Check indexation rates by URL pattern.&lt;/strong&gt; Group your scaled content by subfolder or template type in Google Search Console. If indexation rates are low or declining for a specific pattern, Google may be applying its sample-based quality judgment to the entire group.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stop treating volume as the metric.&lt;/strong&gt; The article argues that production scale should give way to quality maintenance at scale. Before publishing the next 500 pages, pick 20 from your last batch and ask whether they would hold up against the top 3 results for their target query. If not, the sample model predicts the batch will fail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Invest in post-publication quality signals.&lt;/strong&gt; Internal linking, distribution, and editorial updates matter after the freshness boost window closes. Pages that receive no internal links and no updates after publication are the ones most likely to fail Google&apos;s quality sampling.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t blame AI tooling when the content strategy is the problem.&lt;/strong&gt; If your keyword targeting is thin, your editing is minimal, and your internal linking is absent, swapping AI for human writers won&apos;t fix the pattern. The sampling mechanism is content-quality agnostic.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Subfolder-level penalties from bad samples.&lt;/strong&gt; If Google evaluates a representative sample from a URL pattern and the sample fails, the entire subfolder or pattern may lose crawl priority. A few poor pages can drag down hundreds of decent ones grouped under the same path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misreading the freshness boost as validation.&lt;/strong&gt; Early traffic to new content does not mean Google has judged it high quality. The freshness boost reflects initial processing of new URLs, not a quality endorsement. Give scaled content time to settle before drawing conclusions about performance.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/google-s-quality-sampling-kills-scaled-content-not-ai-detection.webp" medium="image" type="image/webp"/></item><item><title>Google AI Mode adds five link types that complicate attribution</title><link>https://technicalseonews.com/latest/google-ai-mode-adds-five-link-types-that-complicate-attribution</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-ai-mode-adds-five-link-types-that-complicate-attribution</guid><description>Google added five new link types to AI Overviews including inline links and discussion previews. Attribution now works differently than traditional search results.</description><pubDate>Thu, 07 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google announced five changes to how links appear in AI Mode and AI Overviews. The updates add subscription labels, inline links within response text, discussion and social media previews, topic suggestions after responses, and desktop hover previews. Hema Budaraju, VP of Product Management, &lt;a href=&quot;https://www.searchenginejournal.com/google-adds-more-links-link-context-to-ai-search/574008/&quot;&gt;detailed the changes in a blog post&lt;/a&gt; covered by Search Engine Journal.&lt;/p&gt;
&lt;p&gt;The five link types break down as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Subscription highlighting&lt;/strong&gt;: Links from a user&apos;s news subscriptions are now labeled in AI Mode and AI Overviews. Google previously announced this for the Gemini app in December but had not confirmed expansion to Search surfaces until now.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Topic suggestions&lt;/strong&gt;: Related content links will appear at the end of many AI responses, pointing to articles or analyses on different aspects of the topic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Discussion and social media previews&lt;/strong&gt;: AI responses will show previews from public discussions, social media, and firsthand sources, with context like creator and community names.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;More inline links&lt;/strong&gt;: Links will appear directly within AI response text, positioned next to the relevant passage. Google did not say how many more inline links users will see.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Link hover previews (desktop)&lt;/strong&gt;: Hovering over an inline link will show the site name and page title. Google said people hesitate to click links when they don&apos;t know where they lead.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Google did not share rollout details for most of these features. Geography, language, eligibility criteria, and timing are all unspecified.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;These changes signal that Google is trying to make AI-generated answers feel less like dead ends. Each feature adds a new surface where a site can earn visibility, but also a new format where &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;attribution works differently than in traditional search results&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The subscription label is the most concrete change for publishers. Google said early testing showed users were &quot;significantly more likely&quot; to click links labeled as subscriptions. The company did not share specific numbers. For news publishers with subscription models, the label creates a new click incentive tied to existing reader relationships.&lt;/p&gt;
&lt;p&gt;Inline links placed next to relevant passages could change how users interact with citations and shift click patterns for cited pages. Whether that increases or decreases total referral traffic is an open question.&lt;/p&gt;
&lt;p&gt;Separately, &lt;a href=&quot;https://www.searchenginejournal.com/seo-pulse-new-ai-search-links-core-update-winners-and-losers/574314/&quot;&gt;Amsive&apos;s analysis of 2,000+ domains&lt;/a&gt; using SISTRIX Visibility Index data found that aggregators lost US search visibility after the March core update. According to Amsive&apos;s analysis as reported by Search Engine Journal, YouTube saw the largest drop, followed by Reddit, Instagram, and X. First-party brand sites and government domains gained. Lily Ray at Amsive reads this as Google favoring original sources over discussion platforms.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;Discussion previews introduce a new competitor for visibility&lt;/a&gt;. Pages from Reddit, forums, and social media now get their own preview cards within AI responses, complete with creator and community names. Sites that rely on user-generated content may see these cards appear alongside or instead of their own pages.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://developers.google.com/search/docs/appearance/ai-features&quot;&gt;Google&apos;s AI features documentation&lt;/a&gt; still states there are &quot;no additional requirements to appear in AI Overviews or AI Mode, nor other special optimizations necessary.&quot; The practical gap between that guidance and five new link treatments is worth watching.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Start by checking whether your site&apos;s structured data is current. &lt;a href=&quot;https://developers.google.com/search/docs/appearance/structured-data/article&quot;&gt;Google&apos;s Article structured data documentation&lt;/a&gt; covers the &lt;code&gt;NewsArticle&lt;/code&gt;, &lt;code&gt;Article&lt;/code&gt;, and &lt;code&gt;BlogPosting&lt;/code&gt; types. Accurate author names, publication dates, and headlines help Google populate preview cards and attribution correctly.&lt;/p&gt;
&lt;p&gt;If you publish subscription content, review Google&apos;s developer documentation on connecting subscriptions. The subscription label only appears when Google can match a user&apos;s subscription to your publication.&lt;/p&gt;
&lt;p&gt;Set up tracking to isolate AI Mode and AI Overview referral traffic once these features roll out. &lt;a href=&quot;/latest/semrush-playbook-targets-saas-citation-failures-in-ai-search&quot;&gt;Filter by referrer or landing page to distinguish clicks from AI surfaces versus traditional results&lt;/a&gt;. Without rollout dates, there is no way to run a clean before-and-after comparison yet. Baseline your current AI referral numbers now so you have a comparison point.&lt;/p&gt;
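&lt;p&gt;For the baseline, a simple referrer classifier is enough to separate AI-assistant traffic from everything else. The hostname list below is illustrative, not exhaustive, and note the limitation: clicks from Google&apos;s own AI Mode and AI Overviews generally arrive with a standard google.com referrer, so they are not separable this way and are better measured through Search Console.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: bucket referrer hostnames into an &quot;AI surface&quot; group so current
// AI-assistant referral volume can be baselined before these features roll out.
const aiReferrerHosts = [
  &apos;chatgpt.com&apos;,
  &apos;chat.openai.com&apos;,
  &apos;perplexity.ai&apos;,
  &apos;claude.ai&apos;,
  &apos;gemini.google.com&apos;,
  &apos;copilot.microsoft.com&apos;,
];

export function classifyReferrer(referrer: string): &apos;ai-surface&apos; | &apos;other&apos; {
  try {
    const host = new URL(referrer).hostname.replace(/^www\./, &apos;&apos;);
    return aiReferrerHosts.some((h) =&gt; host === h || host.endsWith(`.${h}`))
      ? &apos;ai-surface&apos;
      : &apos;other&apos;;
  } catch {
    return &apos;other&apos;; // empty or malformed referrer
  }
}
&lt;/code&gt;&lt;/pre&gt;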
&lt;p&gt;Monitor how discussion previews affect your pages. If your content competes with Reddit or forum threads on the same topics, watch whether those discussion cards displace your links in AI responses.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;No rollout timeline means no clean test window.&lt;/strong&gt; Google did not confirm when or where most of these features will appear. Any traffic changes you see in the next few weeks may or may not be related to these updates. Avoid attributing fluctuations to these changes until you can confirm the features are live for your queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Subscription labels require publisher setup.&lt;/strong&gt; The labels will not appear automatically. Publishers need to integrate with Google&apos;s subscription linking system. If you skip the setup, your subscribers will not see the label, and you will miss the click-through lift Google described.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/google-ai-mode-adds-five-link-types-that-complicate-attribution.webp" medium="image" type="image/webp"/></item><item><title>Google tests Gemini 3 model selector in the search bar</title><link>https://technicalseonews.com/latest/google-tests-gemini-3-model-selector-in-the-search-bar</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-tests-gemini-3-model-selector-in-the-search-bar</guid><description>Google is testing Pro, Fast, and Auto model selectors for Gemini 3 in the search bar, which could fragment AI Overview citation tracking across model variants.</description><pubDate>Thu, 07 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google is testing a new search bar feature that lets users pick which Gemini 3 model they want to use. &lt;a href=&quot;https://www.seroundtable.com/google-gemini-3-model-options-search-41278.html&quot;&gt;SE Roundtable reported&lt;/a&gt; on May 6 that the options include Pro, Fast, and Auto.&lt;/p&gt;
&lt;p&gt;The test was first spotted by Sachin Patel on X, then confirmed by Gagan Ghotra with a slightly different version of the UI. Barry Schwartz noted he could not replicate the feature himself, suggesting it is a limited test.&lt;/p&gt;
&lt;p&gt;The model selector appears alongside previously seen options like &quot;create images&quot; and &quot;deep search.&quot; The new element is specifically the ability to choose a Gemini 3 model variant before running a query.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;If this feature rolls out broadly, it would give users direct control over the AI model powering their search results. That changes the dynamic for SEOs in a few ways.&lt;/p&gt;
&lt;p&gt;The &quot;Fast&quot; option likely prioritizes speed over depth, which could mean shorter or simpler AI-generated answers with fewer source citations. The &quot;Pro&quot; option may produce more detailed responses that pull from more pages. If users can toggle between these, the same query could surface different sites depending on which model is selected.&lt;/p&gt;
&lt;p&gt;For practitioners tracking AI Overview appearances, model selection adds a new variable. A page that gets cited in a Pro response might not appear in a Fast response, or vice versa. Monitoring tools would need to account for this if it ships.&lt;/p&gt;
&lt;p&gt;Google&apos;s own &lt;a href=&quot;https://developers.google.com/search/docs/appearance/ai-features&quot;&gt;AI features documentation&lt;/a&gt; states there are no special requirements to appear in AI Overviews or AI Mode beyond following fundamental SEO best practices. That guidance has not changed with this test. The documentation also notes that AI Overviews are designed to show up on queries where they add value beyond standard results, and that they drive visits to a greater diversity of websites.&lt;/p&gt;
&lt;p&gt;The model selector also signals Google is moving toward giving users more transparency about which AI is generating their results. Search has historically been a black box. Letting users pick &quot;Pro&quot; vs. &quot;Fast&quot; acknowledges that different models produce different outputs.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;No immediate action is needed. The feature is in limited testing and may never roll out.&lt;/p&gt;
&lt;p&gt;If it does ship, consider these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Check your AI Overview tracking setup.&lt;/strong&gt; If your monitoring tools report AI Overview citations, confirm whether they can &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;distinguish between model variants&lt;/a&gt;. &lt;a href=&quot;https://support.google.com/webmasters/answer/7042828&quot;&gt;Search Console&apos;s performance reports&lt;/a&gt; track impressions and clicks for search features, but do not currently break data down by AI model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test your key queries manually.&lt;/strong&gt; If you gain access to the model selector, run your &lt;a href=&quot;/latest/top-ranking-sites-still-get-skipped-in-ai-search-citations&quot;&gt;top queries through each option&lt;/a&gt; (Pro, Fast, Auto) and compare which sources get cited. Look for patterns in whether Pro favors longer-form content or more authoritative domains.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keep following standard SEO fundamentals.&lt;/strong&gt; Google&apos;s documentation is clear that no special steps are required for AI feature visibility. Creating well-structured, accurate content remains the baseline.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Inconsistent citation tracking.&lt;/strong&gt; If different Gemini 3 models cite different sources for the same query, any single-snapshot monitoring approach will miss part of the picture. You may need to run checks against multiple model settings once the feature is available.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assuming Fast means less traffic.&lt;/strong&gt; A faster, shorter AI response does not necessarily mean fewer clicks to your site. Shorter answers may actually drive more click-through if users want more detail than the summary provides.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/google-tests-gemini-3-model-selector-in-the-search-bar.webp" medium="image" type="image/webp"/></item><item><title>Managed WordPress hosts silently block AI crawlers</title><link>https://technicalseonews.com/latest/managed-wordpress-hosts-silently-block-ai-crawlers</link><guid isPermaLink="true">https://technicalseonews.com/latest/managed-wordpress-hosts-silently-block-ai-crawlers</guid><description>Managed WordPress hosts block AI crawlers by default, preventing your content from appearing in ChatGPT search and Perplexity results. Check robots.txt now.</description><pubDate>Thu, 07 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Several managed WordPress hosting platforms are blocking AI crawlers by default, according to &lt;a href=&quot;https://searchengineland.com/managed-wordpress-blocking-ai-bots-476510&quot;&gt;a report from Search Engine Land&lt;/a&gt;. Site owners on these platforms may not realize their content is invisible to AI-powered search products like ChatGPT search, Perplexity, and Claude&apos;s web features.&lt;/p&gt;
&lt;p&gt;The blocks typically happen at the server or platform level through robots.txt rules or firewall configurations that the site owner never opted into. Because the blocking is silent, many publishers only discover it after noticing their content missing from AI search results.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;AI search products now use distinct crawler user agents, each controlling a different type of access. Blocking them has real consequences that vary by bot.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://platform.openai.com/docs/bots&quot;&gt;OpenAI&apos;s documentation&lt;/a&gt; lists four bots:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPTBot&lt;/strong&gt; crawls content for training generative AI models. Blocking it prevents your content from entering training datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OAI-SearchBot&lt;/strong&gt; surfaces websites in ChatGPT search results. Sites that block OAI-SearchBot won&apos;t appear in ChatGPT search answers, though they can still show as navigational links.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OAI-AdsBot&lt;/strong&gt; crawls pages to assess ad placement quality. Relevant for sites running OpenAI-powered ad integrations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ChatGPT-User&lt;/strong&gt; fetches pages in real time when a ChatGPT user asks the model to browse a specific URL. OpenAI notes that because these fetches are user-initiated, robots.txt rules may not apply. Blocking via robots.txt is not guaranteed to prevent access.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler&quot;&gt;Anthropic uses three bots&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ClaudeBot&lt;/strong&gt; gathers content for model training.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Claude-User&lt;/strong&gt; retrieves content when a user asks Claude a question. Blocking it reduces your site&apos;s visibility in user-directed web search.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Claude-SearchBot&lt;/strong&gt; indexes content to improve search result quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.perplexity.ai/docs/resources/perplexity-crawlers&quot;&gt;Perplexity similarly documents&lt;/a&gt; separate crawler user agents with independent robots.txt controls.&lt;/p&gt;
&lt;p&gt;The distinction between training crawlers and search crawlers is the crux of the issue. A site owner might reasonably want to block training bots while keeping search bots enabled. Platform-level blocking removes that choice entirely.&lt;/p&gt;
&lt;p&gt;For publishers who depend on organic traffic, blanket AI bot blocking could mean losing visibility in a growing channel. For e-commerce sites using product feeds or content marketing, the impact compounds as AI search usage grows.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Check whether your managed WordPress host is blocking AI crawlers. The fastest method is to inspect your live robots.txt file at &lt;code&gt;yourdomain.com/robots.txt&lt;/code&gt; and look for disallow rules targeting these user agents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GPTBot&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OAI-SearchBot&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OAI-AdsBot&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ChatGPT-User&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ClaudeBot&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Claude-User&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Claude-SearchBot&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PerplexityBot&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your host manages robots.txt at the server level, you may not see these rules in your WordPress settings. Check the live file directly via browser or curl.&lt;/p&gt;
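&lt;p&gt;The same check scripts easily. The sketch below fetches the live file and reports, for each of the user agents listed above, whether it sits in a group with a site-wide disallow. Splitting groups on blank lines is a simplification of the robots.txt spec, and the script only reads robots.txt; it cannot see WAF- or CDN-level blocks.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: fetch the live robots.txt and report which AI user agents have
// their own group, and whether that group blocks the whole site.
const site = process.env.SITE ?? &apos;https://www.example.com&apos;;
const aiBots = [
  &apos;GPTBot&apos;, &apos;OAI-SearchBot&apos;, &apos;OAI-AdsBot&apos;, &apos;ChatGPT-User&apos;,
  &apos;ClaudeBot&apos;, &apos;Claude-User&apos;, &apos;Claude-SearchBot&apos;, &apos;PerplexityBot&apos;,
];

async function checkRobots(): Promise&lt;void&gt; {
  const text = await (await fetch(`${site}/robots.txt`)).text();
  const groups = text.split(/\n\s*\n/); // simplification: one group per blank-line block
  for (const bot of aiBots) {
    const group = groups.find((g) =&gt; new RegExp(`user-agent:\\s*${bot}`, &apos;i&apos;).test(g));
    if (!group) {
      console.log(`${bot}: no explicit rules (falls back to the * group)`);
    } else if (/disallow:\s*\/\s*$/im.test(group)) {
      console.log(`${bot}: blocked site-wide`);
    } else {
      console.log(`${bot}: has its own group; review its rules`);
    }
  }
}

checkRobots().catch(console.error);
&lt;/code&gt;&lt;/pre&gt;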
&lt;p&gt;Contact your host&apos;s support team if you find unexpected blocks. Ask whether they apply AI crawler restrictions at the server, CDN, or WAF level. Some platforms use &lt;a href=&quot;/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers&quot;&gt;Cloudflare or AWS WAF rules that block bot traffic&lt;/a&gt; before it reaches your robots.txt.&lt;/p&gt;
&lt;p&gt;Decide which bots you actually want to allow. A reasonable starting position for most publishers is to block training crawlers (GPTBot, ClaudeBot) while allowing search and browsing crawlers (OAI-SearchBot, Claude-SearchBot, Claude-User, PerplexityBot). Note that ChatGPT-User may not respect robots.txt because its fetches are user-initiated. OpenAI also notes that if a site allows both GPTBot and OAI-SearchBot, they may use results from a single crawl for both purposes, which means allowing OAI-SearchBot could inadvertently contribute to training data.&lt;/p&gt;
&lt;p&gt;Be aware that robots.txt changes can take up to 24 hours to propagate across both OpenAI&apos;s and Perplexity&apos;s systems.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/google-s-web-bot-auth-adds-cryptographic-bot-identity&quot;&gt;WAF-level blocks hiding behind a clean robots.txt&lt;/a&gt;.&lt;/strong&gt; Your robots.txt may look fine, but your host&apos;s web application firewall could be dropping AI crawler requests before they reach your server. &lt;a href=&quot;/latest/screaming-frog-log-file-analyser-7-0-verifies-ai-bot-identity&quot;&gt;Test by checking server access logs for AI bot user agents&lt;/a&gt;. If you see zero requests from any AI crawler, a firewall rule is likely responsible.&lt;/p&gt;
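&lt;p&gt;A rough log check along these lines, assuming plain-text access logs and matching by user-agent substring only (no IP verification), looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: count requests per AI crawler user agent in an access log. Zero
// hits for every bot over a multi-week window points to a firewall or CDN rule
// dropping the traffic before it reaches the origin. The log path is a placeholder.
import { readFileSync } from &apos;node:fs&apos;;

const bots = [&apos;GPTBot&apos;, &apos;OAI-SearchBot&apos;, &apos;ChatGPT-User&apos;, &apos;ClaudeBot&apos;,
  &apos;Claude-User&apos;, &apos;Claude-SearchBot&apos;, &apos;PerplexityBot&apos;];
const counts = new Map&lt;string, number&gt;(bots.map((b): [string, number] =&gt; [b, 0]));

for (const line of readFileSync(&apos;access.log&apos;, &apos;utf8&apos;).split(&apos;\n&apos;)) {
  for (const bot of bots) {
    if (line.includes(bot)) counts.set(bot, (counts.get(bot) ?? 0) + 1);
  }
}

for (const [bot, count] of counts) {
  console.log(`${bot}: ${count} requests`);
}
&lt;/code&gt;&lt;/pre&gt;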
&lt;p&gt;&lt;strong&gt;Platform updates re-applying blocks.&lt;/strong&gt; Even if you override your host&apos;s default settings, platform updates may reset them. Document your intended crawler access policy and audit it quarterly.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/managed-wordpress-hosts-silently-block-ai-crawlers.webp" medium="image" type="image/webp"/></item><item><title>Schema markup does not influence LLM parsing</title><link>https://technicalseonews.com/latest/schema-markup-does-not-influence-llm-parsing</link><guid isPermaLink="true">https://technicalseonews.com/latest/schema-markup-does-not-influence-llm-parsing</guid><description>Schema markup does not help LLMs parse content because transformer models read token sequences, not markup. Vendors are misrepresenting how AI engines actually work.</description><pubDate>Thu, 07 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Pedro Dias published &lt;a href=&quot;https://www.searchenginejournal.com/the-whole-point-was-the-mess/573977/&quot;&gt;an analysis in Search Engine Journal&lt;/a&gt; arguing that schema markup plays no role in how large language models parse web content. The piece takes direct aim at vendor claims that structured data &quot;ensures AI engines can parse and connect your content.&quot;&lt;/p&gt;
&lt;p&gt;Dias&apos;s central argument is architectural. Transformer models, as described in the &lt;a href=&quot;https://arxiv.org/abs/1706.03762&quot;&gt;foundational &quot;Attention Is All You Need&quot; paper&lt;/a&gt; by Vaswani et al., process language as sequences of tokens. There is no parser inside the model looking for schema tags or FAQ markup. The model reads the words. Pre-training data is the public web, and the public web has never been structured.&lt;/p&gt;
&lt;p&gt;The article names three vendors making variations of the same claim. Semrush&apos;s &quot;Technical GEO&quot; pillar presents schema and structured data as ensuring AI engines can parse content. AirOps published a graphic claiming specific percentage lifts from schema and heading changes, but those numbers trace back to its own report, creating a self-citation loop.&lt;/p&gt;
&lt;p&gt;Peec AI&apos;s GEO guide acknowledges the probabilistic nature of LLMs but lands on the same prescriptions: heading hierarchy, bullet lists, &lt;a href=&quot;/latest/google-officially-deprecates-faq-rich-results-as-of-may-2026&quot;&gt;FAQ markup, and multiple schema&lt;/a&gt; types per page.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The distinction Dias draws is between what schema actually does and what vendors claim it does. &lt;a href=&quot;https://schema.org/&quot;&gt;Schema.org&lt;/a&gt; markup has well-defined functions: powering &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;rich results in classical search&lt;/a&gt;, supporting entity disambiguation in the knowledge graph, and helping voice assistants pull structured fields. None of those functions involve LLM text comprehension.&lt;/p&gt;
&lt;p&gt;The practical risk for SEO teams is misallocated effort. If practitioners spend time layering schema types onto pages specifically to improve AI search visibility, they are working against a mechanism that does not exist in the model architecture.&lt;/p&gt;
&lt;p&gt;The argument is architecturally correct for training, but the picture differs across the four types of AI bot that access websites.&lt;/p&gt;
&lt;p&gt;Training crawlers like GPTBot and ClaudeBot process the public web as token sequences during pre-training. Schema tags are not parsed as structure. Dias is correct here.&lt;/p&gt;
&lt;p&gt;Search and retrieval bots like OAI-SearchBot and PerplexityBot use a retrieval layer to select documents before the model generates an answer. Whether that retrieval layer uses structured data for document selection is an open question the article itself acknowledges but does not develop.&lt;/p&gt;
&lt;p&gt;User-action fetchers like ChatGPT-User and Claude-User fetch pages in real time when a user asks the model to read a URL. JSON-LD containing product data, pricing, or FAQs could help these bots extract structured information from the page.&lt;/p&gt;
&lt;p&gt;AI browsers and agents like Operator, Atlas, and Mariner navigate pages to complete tasks such as purchases or form submissions. Structured data describing products, availability, and pricing has direct utility for agents that need to understand what a page offers.&lt;/p&gt;
&lt;p&gt;When evaluating any claim about schema and AI, ask which layer it affects: training, retrieval, or action. The answer differs for each.&lt;/p&gt;
&lt;p&gt;The self-citation problem Dias identifies in vendor research deserves attention. When a vendor&apos;s &quot;data-backed&quot; claims trace back to the vendor&apos;s own report, the evidence loop is closed. SEO teams building GEO strategies on those numbers are building on unverified foundations.&lt;/p&gt;
&lt;p&gt;Google recently began surfacing &lt;a href=&quot;https://searchengineland.com/google-ai-mode-traffic-data-search-console-457076&quot;&gt;AI Mode traffic data in Search Console&lt;/a&gt;, giving practitioners real performance signals. Actual click and impression data from GSC is more reliable for measuring AI search performance than vendor infographics.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Reject vendor claims that schema helps LLMs &quot;parse&quot; or &quot;understand&quot; your content. That mechanism does not exist in transformer architecture. But do not conclude schema is irrelevant to AI. Schema has documented value in classical search (rich results, knowledge graphs, voice assistants) and may have undocumented value at the retrieval and agent layers described above.&lt;/p&gt;
&lt;p&gt;Implement schema for what it demonstrably does, not for what GEO vendors claim. If you already have Product, Article, or FAQPage markup, keep it. If you are deciding where to invest next, prioritize structured data that helps agents complete tasks (product pricing, availability, specifications) over decorative schema types layered for &quot;AI visibility.&quot;&lt;/p&gt;
&lt;p&gt;Audit any GEO or AEO strategy your team has adopted. Check whether the recommended tactics trace back to independent research or to the vendor selling the solution. If the methodology leads back to the vendor&apos;s own report, weight those claims accordingly.&lt;/p&gt;
&lt;p&gt;Use Search Console&apos;s AI Mode data to measure what actually drives AI search traffic to your site. Real performance data from GSC beats vendor infographics with self-sourced percentages.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Conflating retrieval with generation.&lt;/strong&gt; Some vendor claims may hold partial truth at the retrieval layer (how a RAG system selects source documents) rather than the generation layer (how the LLM processes text). Dias acknowledges Peec AI makes this distinction. The problem is when vendors blur the two and present retrieval-layer tactics as though they affect how the model &quot;understands&quot; content.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-citing vendor research.&lt;/strong&gt; Any study claiming specific percentage lifts from schema or heading changes should be checked for methodology independence. If the vendor produced the study, ran the tests, and sells the solution, treat the numbers as marketing until independently replicated.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/schema-markup-does-not-influence-llm-parsing.webp" medium="image" type="image/webp"/></item><item><title>URL paths are semantic inputs for RAG pipelines, not just SEO</title><link>https://technicalseonews.com/latest/url-paths-are-semantic-inputs-for-rag-pipelines-not-just-seo</link><guid isPermaLink="true">https://technicalseonews.com/latest/url-paths-are-semantic-inputs-for-rag-pipelines-not-just-seo</guid><description>URL path segments now function as semantic inputs for RAG systems and LLMs, not just SEO signals. Descriptive paths improve AI retrieval and citation accuracy.</description><pubDate>Thu, 07 May 2026 10:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;URL structures are no longer just an SEO hygiene factor. They now function as semantic inputs for AI retrieval systems, according to &lt;a href=&quot;https://www.searchenginejournal.com/how-to-design-url-structures-for-ai-retrieval-not-just-rankings/571939/&quot;&gt;a new analysis from Sophie Brannon at Search Engine Journal&lt;/a&gt;. The piece argues that RAG pipelines, web-connected LLMs, and zero-shot classification models all parse URL path segments as meaningful text strings when deciding what content to retrieve and cite.&lt;/p&gt;
&lt;p&gt;The traditional SEO advice (short paths, hyphens, target keyword) still applies. But Brannon&apos;s argument is that it&apos;s incomplete for a world &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;where ChatGPT, Perplexity, Claude, and Google&apos;s AI Overviews are retrieving&lt;/a&gt; and synthesizing content differently from classic crawlers.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Traditional search engines can infer context from a page even when the URL is a meaningless ID string. &lt;a href=&quot;/latest/ai-search-scores-passages-not-pages-killing-pillar-content&quot;&gt;AI retrieval systems are less forgiving&lt;/a&gt;. Brannon outlines three mechanisms where URL structure matters to LLMs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RAG chunking and retrieval.&lt;/strong&gt; Developer-built RAG systems crawl URLs, convert page content into searchable chunks, and store them as vector embeddings. The URL path is part of the text that gets processed. A descriptive path like &lt;code&gt;/resources/seo/url-structure-ai-retrieval/&lt;/code&gt; gives the retrieval layer explicit hierarchy and topic signals before it even reads the page body.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;URL context grounding (Gemini-specific).&lt;/strong&gt; Google&apos;s Gemini uses a technique called URL context grounding to pull direct information from individual URLs without full RAG processing. The goal is to improve factual accuracy by analyzing content and data at specific URLs. Descriptive paths help Gemini understand what a URL covers before combining information from multiple sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero-shot classification.&lt;/strong&gt; Models can categorize a webpage&apos;s purpose without task-specific training data by analyzing semantic cues in the URL string itself. The model maps URL patterns to predefined categories using cosine similarity or prompt-based reasoning. A URL that communicates nothing forces the model to work harder and introduces ambiguity in categorization.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Beyond retrieval mechanics, there&apos;s a user-facing reason too. &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;When an AI system cites a source, the URL is often visible&lt;/a&gt; alongside the excerpt. A clean, descriptive path builds trust in the same way it does in a SERP snippet. A path like &lt;code&gt;/p?id=4821&lt;/code&gt; does not.&lt;/p&gt;
&lt;p&gt;The practical implication is that URL path segments now serve as a secondary content layer. They communicate hierarchy, topic, and specificity independently of the page title, H1, or other metadata.&lt;/p&gt;
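&lt;p&gt;To make the RAG mechanism concrete, here is a rough sketch of how a developer-built pipeline might fold path segments into the text it embeds. The &lt;code&gt;embed()&lt;/code&gt; function is a placeholder for whatever embedding API a given pipeline calls, and the chunking itself is assumed to have happened upstream; the point is simply that words in the path become part of the retrievable signal.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: prepend URL path context to each chunk before embedding, so the
// retrieval layer can match on hierarchy and topic words carried by the path itself.

// Placeholder: a real pipeline would call its embedding model here.
async function embed(text: string): Promise&lt;number[]&gt; {
  void text;
  return []; // dummy vector for the sketch
}

interface Chunk {
  url: string;
  pathContext: string;
  text: string;
  vector: number[];
}

export async function chunkWithPathContext(url: string, bodyChunks: string[]): Promise&lt;Chunk[]&gt; {
  // &quot;/resources/seo/url-structure-ai-retrieval/&quot; becomes
  // &quot;resources &gt; seo &gt; url structure ai retrieval&quot;
  const pathContext = new URL(url).pathname
    .split(&apos;/&apos;)
    .filter(Boolean)
    .map((segment) =&gt; segment.replace(/-/g, &apos; &apos;))
    .join(&apos; &gt; &apos;);

  return Promise.all(
    bodyChunks.map(async (text) =&gt; ({
      url,
      pathContext,
      text,
      vector: await embed(`${pathContext}\n${text}`), // path words are embedded with the chunk
    }))
  );
}
&lt;/code&gt;&lt;/pre&gt;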
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit existing URL structures for semantic clarity.&lt;/strong&gt; Check whether your most important pages have paths that communicate topic and hierarchy through readable words. A path like &lt;code&gt;/resources/seo/url-structure-ai-retrieval/&lt;/code&gt; tells both humans and machines what the page covers. A path like &lt;code&gt;/blog/post-4821&lt;/code&gt; does not.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use folder depth to signal content hierarchy.&lt;/strong&gt; Brannon frames URL hierarchy as a way to reinforce topical authority. If your domain covers SEO, structure your paths so that category and subcategory relationships are visible: &lt;code&gt;/guides/technical-seo/crawl-budget/&lt;/code&gt; rather than &lt;code&gt;/guides/crawl-budget/&lt;/code&gt;. RAG systems can use folder nesting to infer content provenance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prioritize question-based and long-tail content paths.&lt;/strong&gt; AI systems handling specific queries look for precise matches. A URL path that mirrors the query structure (e.g., &lt;code&gt;/faq/how-to-set-canonical-tags/&lt;/code&gt;) gives the retrieval system an additional relevance signal before it processes the page content.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t break existing URLs to fix this.&lt;/strong&gt; If your current URLs rank well and have backlinks, restructuring them creates redirect chains and risks losing link equity. Apply these principles to new content and new sections. For existing content, the on-page signals (title, headings, body) still carry more weight than the path alone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check your AI-facing URLs specifically.&lt;/strong&gt; Look at which URLs are being crawled by AI bots (GPTBot, ClaudeBot, PerplexityBot) in your server logs. If those bots are hitting your most opaque URL patterns, those pages are the highest-priority candidates for path improvements on future versions or redesigns.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Over-nesting URL paths.&lt;/strong&gt; Adding five folder levels for semantic clarity backfires. Deeply nested paths create crawl friction and dilute link equity. Two to three meaningful path segments is the sweet spot.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assuming URL structure alone drives AI citation.&lt;/strong&gt; URL paths are one signal among many. Page content quality, structured data, and domain authority still matter more. Treating URL restructuring as a silver bullet for AI visibility will disappoint.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/url-paths-are-semantic-inputs-for-rag-pipelines-not-just-seo.webp" medium="image" type="image/webp"/></item><item><title>Next.js streaming metadata fails Google indexing</title><link>https://technicalseonews.com/latest/next-js-streaming-metadata-fails-google-indexing</link><guid isPermaLink="true">https://technicalseonews.com/latest/next-js-streaming-metadata-fails-google-indexing</guid><description>Next.js streaming metadata gets indexed without titles and canonicals because Googlebot doesn&apos;t wait for dynamic tags to resolve in the body tag.</description><pubDate>Wed, 06 May 2026 08:05:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A practitioner building a site on Next.js &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1t4c9ox/nextjs_streaming_metadata_issues_with_rendered/&quot;&gt;reported in r/TechSEO&lt;/a&gt; that pages using streaming metadata were indexed by Google with empty &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; tags, missing meta descriptions, missing canonicals, and missing hreflang tags.&lt;/p&gt;
&lt;p&gt;Next.js&apos;s &lt;code&gt;generateMetadata&lt;/code&gt; function can stream metadata after the initial HTML shell is sent. When the metadata depends on dynamic information, the &lt;a href=&quot;https://nextjs.org/docs/app/api-reference/functions/generate-metadata&quot;&gt;resolved tags get appended to the &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt;&lt;/a&gt; rather than appearing in the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; of the initial server-rendered HTML. The framework includes a config option called &lt;a href=&quot;https://nextjs.org/docs/app/api-reference/config/next-config-js/htmlLimitedBots&quot;&gt;&lt;code&gt;htmlLimitedBots&lt;/code&gt;&lt;/a&gt; that forces blocking (non-streamed) metadata for specific user agents.&lt;/p&gt;
&lt;p&gt;The default &lt;code&gt;htmlLimitedBots&lt;/code&gt; list includes several Google crawlers like &lt;code&gt;Mediapartners-Google&lt;/code&gt;, &lt;code&gt;AdsBot-Google&lt;/code&gt;, and &lt;code&gt;Google-PageRenderer&lt;/code&gt;. It does not include &lt;code&gt;Googlebot&lt;/code&gt; itself. The presumed reasoning is that Googlebot executes &lt;a href=&quot;/latest/scoped-custom-element-registries-can-silently-break-crawlability&quot;&gt;JavaScript and can interpret the full DOM&lt;/a&gt;, though the documentation does not explicitly state this rationale.&lt;/p&gt;
&lt;p&gt;The practitioner added &lt;code&gt;Googlebot&lt;/code&gt; to the &lt;code&gt;htmlLimitedBots&lt;/code&gt; config to force blocking metadata. Google Search Console&apos;s live test showed the metadata present and correct in rendered HTML. But after submitting three test pages for indexing, the &quot;View Crawled Page&quot; results told a different story. One page had correct metadata in the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt;. The other two had an empty &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; tag in the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt;, with no meta description, no canonical, and no hreflang tags at all.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;/latest/google-drops-no-js-testing-advice-from-javascript-seo-docs&quot;&gt;Google renders JavaScript pages in two waves&lt;/a&gt;. The first wave processes the raw HTML response. The second wave runs JavaScript and inspects the rendered DOM. Pages that haven&apos;t gone through the second wave yet will be indexed based on the initial HTML alone.&lt;/p&gt;
&lt;p&gt;If &lt;a href=&quot;/latest/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves&quot;&gt;critical metadata only arrives via streaming&lt;/a&gt; (appended to &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; after the initial shell), it depends entirely on the second rendering wave to be picked up. Any delay or timeout in that second wave means Google indexes the page without titles, canonicals, or hreflang. Google has also stated that certain tags are ignored if they appear outside &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt;, which adds another layer of risk when streamed metadata lands in &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The practitioner&apos;s finding suggests that even after adding &lt;code&gt;Googlebot&lt;/code&gt; to &lt;code&gt;htmlLimitedBots&lt;/code&gt;, the config may not work reliably in all cases. One of three test pages worked correctly while two did not. Possible causes include &lt;code&gt;generateMetadata&lt;/code&gt; timing out before the response was sent, or the &lt;code&gt;htmlLimitedBots&lt;/code&gt; config not matching the exact user-agent string Googlebot sends.&lt;/p&gt;
&lt;p&gt;Sites using Next.js App Router with dynamic metadata are the most exposed. Static metadata defined via the &lt;code&gt;metadata&lt;/code&gt; object in &lt;code&gt;layout.js&lt;/code&gt; or &lt;code&gt;page.js&lt;/code&gt; is not affected because it gets included in the prerendered HTML without streaming.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Check whether your Next.js site uses &lt;code&gt;generateMetadata&lt;/code&gt; with dynamic data. If it does, your metadata may be streamed rather than included in the initial HTML.&lt;/p&gt;
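&lt;p&gt;As a rough illustration of the difference (shapes simplified; &lt;code&gt;fetchProduct&lt;/code&gt; is a hypothetical data fetch, and exact signatures depend on your Next.js version):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/products/[slug]/page.tsx (illustrative; a real route exports one of these, not both)
import type { Metadata } from &apos;next&apos;

// Static metadata: rendered into the initial HTML head, never streamed
export const metadata: Metadata = {
  title: &apos;Pricing&apos;,
  alternates: { canonical: &apos;https://example.com/pricing&apos; },
}

// Dynamic metadata: resolved at request time, so Next.js may stream it
// and append the resolved tags to the body instead of the head
export async function generateMetadata(
  { params }: { params: { slug: string } }
): Promise&amp;lt;Metadata&amp;gt; {
  const product = await fetchProduct(params.slug) // hypothetical fetch
  return { title: product.name }
}
&lt;/code&gt;&lt;/pre&gt;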
&lt;p&gt;Test your pages by viewing the raw HTML response (not the rendered DOM) with &lt;code&gt;curl&lt;/code&gt; or &lt;code&gt;wget&lt;/code&gt;. If &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;, canonical, and other critical tags are missing or empty in that response, they&apos;re being streamed.&lt;/p&gt;
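&lt;p&gt;For example (URL is a placeholder and the user-agent string is the standard Googlebot desktop token; the grep only surfaces whether the tags exist in the unrendered response):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Fetch the initial HTML (no JavaScript execution) as a Googlebot user agent
# and check whether critical head tags are present before any streaming
curl -s -A &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot; \
  &quot;https://example.com/some-page&quot; | grep -iE &quot;&amp;lt;title|canonical|hreflang&quot;
&lt;/code&gt;&lt;/pre&gt;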
&lt;p&gt;Add &lt;code&gt;Googlebot&lt;/code&gt; to your &lt;code&gt;htmlLimitedBots&lt;/code&gt; config if you haven&apos;t already:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// next.config.ts
import type { NextConfig } from &apos;next&apos;

const config: NextConfig = {
  htmlLimitedBots: /Googlebot|Mediapartners-Google|AdsBot-Google|Google-PageRenderer/,
}

export default config
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that setting &lt;code&gt;htmlLimitedBots&lt;/code&gt; overrides the default list entirely. Include the default bots alongside &lt;code&gt;Googlebot&lt;/code&gt; in your regex.&lt;/p&gt;
&lt;p&gt;If you want to eliminate streaming metadata risk completely, you can match all user agents:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;htmlLimitedBots: /.*/,
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After making changes, use Google Search Console&apos;s URL Inspection tool to request indexing on a few test pages. Compare the &quot;Crawled Page&quot; HTML against what you expect. Pay attention to whether &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;, meta description, canonical, and hreflang tags appear in the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; of the crawled HTML, not just the rendered HTML.&lt;/p&gt;
&lt;p&gt;Monitor the &lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers&quot;&gt;Google crawlers documentation&lt;/a&gt; for the exact user-agent strings Googlebot sends. Your regex needs to match these strings precisely.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Overriding defaults without including them.&lt;/strong&gt; Setting &lt;code&gt;htmlLimitedBots&lt;/code&gt; replaces the entire default list. If you only add &lt;code&gt;Googlebot&lt;/code&gt; without including &lt;code&gt;Mediapartners-Google&lt;/code&gt;, &lt;code&gt;AdsBot-Google&lt;/code&gt;, and the other defaults, those crawlers will start receiving streamed metadata instead of blocking metadata.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Live test vs. indexed page mismatch.&lt;/strong&gt; The GSC live test runs a fresh render and shows the full DOM. The &quot;View Crawled Page&quot; view shows what Google actually stored during crawling. A page can pass the live test and still be indexed with missing metadata if the initial crawl hit a timeout or skipped the second rendering wave.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/next-js-streaming-metadata-fails-google-indexing.webp" medium="image" type="image/webp"/></item><item><title>GSC reports resource failures despite 200 OK in server logs</title><link>https://technicalseonews.com/latest/gsc-reports-resource-failures-despite-200-ok-in-server-logs</link><guid isPermaLink="true">https://technicalseonews.com/latest/gsc-reports-resource-failures-despite-200-ok-in-server-logs</guid><description>A WooCommerce site logs clean 200 responses to Googlebot, but GSC flags resource failures caused by CDN interception or timing gaps between crawl and render.</description><pubDate>Tue, 05 May 2026 18:03:53 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A practitioner in &lt;a href=&quot;https://www.reddit.com/r/bigseo/comments/1t4bj03/gsc_shows_page_resources_couldnt_be_loaded_other/&quot;&gt;r/bigseo reported&lt;/a&gt; that Google Search Console is flagging pages with &quot;Page resources couldn&apos;t be loaded / Other error&quot; warnings, even though their server logs show Googlebot receiving HTTP 200 responses for all requests.&lt;/p&gt;
&lt;p&gt;The thread describes a debugging scenario many practitioners will recognize: GSC says something is broken, your logs say everything is fine, and you&apos;re left trying to reconcile the two.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Resource loading failures in GSC affect how Google renders pages. If Googlebot can&apos;t load CSS, JavaScript, or image resources during rendering, the indexed version of your page may be incomplete. Pages that rely on &lt;a href=&quot;/latest/google-drops-no-js-testing-advice-from-javascript-seo-docs&quot;&gt;client-side rendering are especially vulnerable, since missing JS&lt;/a&gt; resources can mean missing content entirely.&lt;/p&gt;
&lt;p&gt;The disconnect between server logs and GSC reports is a common source of confusion. Several factors can cause it.&lt;/p&gt;
&lt;p&gt;Server logs only record requests that reach your origin server. If a CDN, edge proxy, or firewall intercepts a request before it hits your origin, you&apos;ll see no log entry at all. Googlebot may be getting blocked or rate-limited at a layer you&apos;re not monitoring.&lt;/p&gt;
&lt;p&gt;Timing matters too. Google&apos;s &lt;a href=&quot;/latest/scoped-custom-element-registries-can-silently-break-crawlability&quot;&gt;rendering service (WRS) fetches resources separately from&lt;/a&gt; Googlebot&apos;s initial crawl. The WRS may request resources minutes or hours after the initial HTML fetch. If your server was briefly unavailable, overloaded, or returned an error during that second pass, your access logs for the original crawl would still show 200s.&lt;/p&gt;
&lt;p&gt;Another possibility is that the requests flagged in GSC didn&apos;t actually come from Googlebot. &lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot&quot;&gt;Google&apos;s documentation on verifying crawlers&lt;/a&gt; describes how to confirm whether a request genuinely originated from Google. Reverse DNS lookups should resolve to &lt;code&gt;googlebot.com&lt;/code&gt; or &lt;code&gt;google.com&lt;/code&gt; hostnames. If your logs show 200s for requests that weren&apos;t actually from Googlebot, you&apos;re looking at the wrong traffic.&lt;/p&gt;
&lt;p&gt;Google also publishes IP range JSON files for its various crawler categories: general crawlers like Googlebot, special-purpose crawlers like AdsBot, and user-triggered fetchers like the URL Inspection tool. Cross-referencing your log IPs against these published ranges can help isolate which Google system is making each request.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Start by checking whether the resource URLs flagged in GSC are actually reachable by Googlebot. Use the URL Inspection tool&apos;s &quot;View Tested Page&quot; feature to see exactly what Google&apos;s renderer received. Compare the rendered HTML against your source to spot missing resources.&lt;/p&gt;
&lt;p&gt;Check your CDN and edge layer logs, not just your origin server. If you use Cloudflare, Fastly, or similar services, look for blocked or challenged requests from Google&apos;s IP ranges. Bot management rules and rate limiting are frequent culprits.&lt;/p&gt;
&lt;p&gt;Verify that the requests in your logs are genuinely from Googlebot. Run a reverse DNS lookup on the source IPs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;host 66.249.66.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result should resolve to a &lt;code&gt;*.googlebot.com&lt;/code&gt; or &lt;code&gt;*.google.com&lt;/code&gt; hostname. Then run a forward DNS lookup on that hostname to confirm it maps back to the same IP:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;host crawl-66-249-66-1.googlebot.com
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the IPs in your logs don&apos;t resolve to Google hostnames, your 200 responses are going to a different bot, not to the real Googlebot that GSC is reporting on.&lt;/p&gt;
&lt;p&gt;Check your &lt;code&gt;robots.txt&lt;/code&gt; for rules that might block resource paths. A common mistake is disallowing &lt;code&gt;/wp-content/&lt;/code&gt; or &lt;code&gt;/assets/&lt;/code&gt; directories. Googlebot needs access to CSS and JS files to render pages properly.&lt;/p&gt;
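&lt;p&gt;A minimal illustration of the pattern to avoid, using WordPress-style paths (adapt the directories to your own asset structure; the two groups below are alternatives, not a single file):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Breaks rendering: Googlebot cannot fetch theme CSS/JS under /wp-content/
User-agent: *
Disallow: /wp-content/

# Safer: block admin paths but leave render-critical assets crawlable
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
&lt;/code&gt;&lt;/pre&gt;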
&lt;p&gt;Finally, review your server&apos;s response times under load. If the WRS requests resources during a traffic spike and gets timeouts, GSC will report failures even if the page itself loaded fine seconds earlier. Server-side caching for static resources can reduce this risk.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;CDN bot-management rules silently blocking Google.&lt;/strong&gt; Many CDN providers apply bot challenges or rate limits that don&apos;t generate origin server logs. You&apos;ll see clean 200s in your logs while Googlebot is actually getting 403s or CAPTCHAs at the edge.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Robots.txt blocking render-critical resources.&lt;/strong&gt; GSC will report the page itself as loaded but flag resource failures if your robots.txt disallows paths to CSS, JS, or font files. The URL Inspection tool&apos;s robots.txt test only checks the page URL, not every sub-resource.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/gsc-reports-resource-failures-despite-200-ok-in-server-logs.webp" medium="image" type="image/webp"/></item><item><title>Google&apos;s Web Bot Auth adds cryptographic bot identity</title><link>https://technicalseonews.com/latest/google-s-web-bot-auth-adds-cryptographic-bot-identity</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-s-web-bot-auth-adds-cryptographic-bot-identity</guid><description>Google&apos;s Web Bot Auth adds cryptographic signing to HTTP requests, replacing spoofable user-agent headers with verified bot identity that works across IP ranges.</description><pubDate>Tue, 05 May 2026 17:50:28 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google published &lt;a href=&quot;https://developers.google.com/crawling/docs/crawlers-fetchers/web-bot-auth&quot;&gt;developer documentation for Web Bot Auth&lt;/a&gt;, a new cryptographic protocol that lets bots sign their HTTP requests. The protocol is experimental. Google is testing it with some &lt;a href=&quot;/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers&quot;&gt;AI agents hosted on its own infrastructure&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.seroundtable.com/google-web-bot-auth-41270.html&quot;&gt;SE Roundtable reported the launch&lt;/a&gt; on May 5, 2026.&lt;/p&gt;
&lt;p&gt;Web Bot Auth replaces the current trust model where sites verify bots through self-reported user-agent headers and reverse DNS lookups against known IP ranges. Instead, agents &lt;a href=&quot;/latest/screaming-frog-log-file-analyser-7-0-verifies-ai-bot-identity&quot;&gt;cryptographically sign their requests&lt;/a&gt;, giving site owners a way to confirm that traffic genuinely comes from the claimed bot provider.&lt;/p&gt;
&lt;p&gt;Google&apos;s documentation describes three benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cryptographic certainty:&lt;/strong&gt; Verified identity that moves beyond spoofable headers and decouples agent identity from IP addresses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better observability:&lt;/strong&gt; Clearer data on how agents interact with your content.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Future-proofing:&lt;/strong&gt; A foundation for mutual trust between agent providers and websites.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The protocol is based on an IETF Internet Draft and builds on &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc9421&quot;&gt;HTTP Message Signatures (RFC 9421)&lt;/a&gt;, a proposed standard for signing HTTP messages. Google explicitly notes that not all Google user agents use Web Bot Auth yet, and even agents that do support it are not signing every request.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;/latest/managed-wordpress-hosts-silently-block-ai-crawlers&quot;&gt;Bot spoofing is a real and growing problem&lt;/a&gt;. The &lt;a href=&quot;https://www.imperva.com/resources/resource-library/reports/2026-bad-bot-report/&quot;&gt;Imperva 2026 Bad Bot Report&lt;/a&gt; highlights the increasing difficulty of distinguishing legitimate automated traffic from malicious bots, especially as agentic AI blurs the lines. Current verification methods have clear weaknesses. User-agent strings are trivially faked. Reverse DNS verification is more reliable but ties identity to IP ranges, which creates maintenance headaches as providers scale infrastructure.&lt;/p&gt;
&lt;p&gt;Cryptographic signing addresses both problems at once. A valid signature proves the request came from an entity holding the private key, regardless of which IP address sent it.&lt;/p&gt;
&lt;p&gt;The timing matters too. AI agents that browse the web on behalf of users are multiplying fast. Google testing the protocol with its own AI agents signals that it expects agent-to-site authentication to become standard plumbing. If Web Bot Auth or something like it gains adoption, sites could make granular access decisions per agent with high confidence in the claimed identity.&lt;/p&gt;
&lt;p&gt;For site owners who already manage crawler access through robots.txt and IP allowlists, the protocol offers a potential upgrade path. Instead of maintaining IP range lists that change when providers update infrastructure, you could verify a cryptographic signature against a published public key.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;No immediate action is required. The protocol is experimental and Google is not yet signing all requests.&lt;/p&gt;
&lt;p&gt;That said, practitioners managing bot access policies should familiarize themselves with &lt;a href=&quot;https://developers.google.com/crawling/docs/crawlers-fetchers/web-bot-auth&quot;&gt;Google&apos;s Web Bot Auth documentation&lt;/a&gt;. Understanding the request-signing flow now will save time if adoption accelerates.&lt;/p&gt;
&lt;p&gt;Keep your existing bot verification methods in place. Google&apos;s documentation explicitly says to continue using established methods like reverse DNS verification. Web Bot Auth is additive, not a replacement, during the experimental phase.&lt;/p&gt;
&lt;p&gt;If you run a site that gates content or enforces differential access for AI crawlers versus search crawlers, watch how the protocol develops. The ability to verify bot identity cryptographically would make those access controls more reliable than header-based checks.&lt;/p&gt;
&lt;p&gt;For teams building server-side middleware or WAF rules, RFC 9421 (HTTP Message Signatures) is the underlying standard worth reading. Any future implementation will involve validating signatures against that spec.&lt;/p&gt;
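&lt;p&gt;For orientation, a request signed under RFC 9421 carries two extra header fields. The values below are placeholders, and the exact components and parameters Google will sign are not specified here; this only shows the general shape a validator would parse:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /pricing HTTP/1.1
Host: example.com
Signature-Input: sig1=(&quot;@authority&quot; &quot;@method&quot; &quot;@path&quot;);created=1746400000;keyid=&quot;example-key&quot;
Signature: sig1=:PLACEHOLDER_BASE64_SIGNATURE=:
&lt;/code&gt;&lt;/pre&gt;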
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Do not drop existing verification.&lt;/strong&gt; Google warns it is not signing every request, even from agents that support the protocol. If you switch to signature-only verification now, you will block legitimate Google traffic that arrives unsigned.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Protocol scope is narrow for now.&lt;/strong&gt; Only some AI agents on Google infrastructure are participating. Googlebot for search indexing is not mentioned as a current participant. Do not assume all Google crawlers will send signed requests in the near term.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/google-s-web-bot-auth-adds-cryptographic-bot-identity.webp" medium="image" type="image/webp"/></item><item><title>Google&apos;s query fan-out splits AI queries against classic search</title><link>https://technicalseonews.com/latest/google-s-query-fan-out-splits-ai-queries-against-classic-search</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-s-query-fan-out-splits-ai-queries-against-classic-search</guid><description>Google breaks AI queries into sub-queries matched against classic search results. Your pages need to rank for decomposed fragments, not complex full questions.</description><pubDate>Tue, 05 May 2026 03:43:47 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google&apos;s Liz Reid explained on the Bloomberg Odd Lots podcast how AI Overviews and AI Mode are changing query behavior. Users are now expressing full, complex information needs instead of condensing them into short keyword phrases. &lt;a href=&quot;https://www.searchenginejournal.com/google-on-keyword-fragmentation-and-user-needs-in-ai-search/573834/&quot;&gt;Search Engine Journal&apos;s coverage by Roger Montti&lt;/a&gt; unpacks the key implications for SEO.&lt;/p&gt;
&lt;p&gt;Reid described how a user searching for &quot;restaurants New York&quot; always had a more complex need in mind. They wanted a restaurant in a specific location, for five people, not too pricey, with vegan options and kid-friendly seating. In the old model, users translated that need into &quot;keyword-ese.&quot; Now they type the full question and expect Google to do the translation.&lt;/p&gt;
&lt;p&gt;The critical detail is what happens next. Google doesn&apos;t match that long query against a single page. Instead, it uses query fan-out to decompose the complex question into smaller, specific sub-queries. Each sub-query runs against classic search. Google&apos;s AI then selects from the results across those sub-queries and synthesizes an answer.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Query fan-out means the unit of ranking hasn&apos;t changed as much as it might seem. Long, complex AI queries get broken into fragments that look a lot like traditional keyword phrases. Your page doesn&apos;t need to answer the entire complex question. It needs to be the best result for one of the decomposed sub-queries.&lt;/p&gt;
&lt;p&gt;The practical implication is that creating pages targeting hyper-specific long-tail AI queries may not be the right move. Those complex queries are often one-offs that rarely recur, so the time spent crafting content for them could be better spent elsewhere.&lt;/p&gt;
&lt;p&gt;Montti&apos;s analysis points out that because &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;AI Overviews share space among multiple sites&lt;/a&gt;, other factors gain importance. Relevant images and video content can help a site stand out within an AIO result.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.searchenginejournal.com/google-on-ai-search-why-browsy-queries-favor-full-serps/573822/&quot;&gt;A separate SEJ article on &quot;browsy&quot; queries&lt;/a&gt; adds context. Reid noted that user behavior across search surfaces is varied, not monolithic. Some queries still favor the full SERP experience, particularly what she calls &quot;browsy&quot; queries where users want to explore rather than get a single synthesized answer.&lt;/p&gt;
&lt;p&gt;Meanwhile, &lt;a href=&quot;https://www.searchenginejournal.com/google-advises-using-ai-in-best-possible-way-for-ai-search/573671/&quot;&gt;Google&apos;s Nikola Todorovic encouraged SEOs&lt;/a&gt; to use AI tools themselves to analyze data, research competition, and improve their ability to provide value. The message from Google&apos;s side is consistent: focus on being genuinely useful for specific needs.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your content against decomposed queries, not full complex questions.&lt;/strong&gt; Think about what sub-queries your pages could satisfy when a complex AI question gets broken apart. A page about &quot;kid-friendly vegan restaurants in Midtown Manhattan&quot; answers a very specific fragment of a larger query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t chase one-off long-tail AI queries.&lt;/strong&gt; If a complex query is unlikely to be repeated, building a page around it has poor ROI. Focus instead on the recurring, specific sub-queries that fan-out generates repeatedly across many different complex questions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;Strengthen your position in classic search for specific queries&lt;/a&gt;.&lt;/strong&gt; Query fan-out runs against classic search results. The fundamentals of ranking for well-defined keyword phrases still apply. If you rank well for a sub-query, you&apos;re a candidate for the synthesized AI answer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claim visual space in AI Overviews.&lt;/strong&gt; Since AIO results pull from multiple sites, differentiation matters. Include relevant images and video on key pages so Google has rich media to surface alongside your content.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/ai-search-scores-passages-not-pages-killing-pillar-content&quot;&gt;Track which sub-queries your content wins&lt;/a&gt;.&lt;/strong&gt; Google Search Console won&apos;t show you fan-out sub-queries directly, but monitoring which specific queries drive AI Overview impressions can reveal the fragments Google associates with your pages.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Misreading the signal as &quot;long-tail is back.&quot;&lt;/strong&gt; Reid&apos;s comments might tempt some practitioners to build content for verbose, conversational queries. The fan-out mechanism means Google decomposes those into shorter fragments. The winning pages answer specific sub-queries, not the full conversational prompt.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assuming AI search replaces classic search behavior.&lt;/strong&gt; Reid explicitly noted that search behavior is varied across surfaces. Some users still prefer browsing full SERPs. Treating AI Overviews as the only surface worth targeting risks neglecting traffic from traditional results.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/google-s-query-fan-out-splits-ai-queries-against-classic-search.webp" medium="image" type="image/webp"/></item><item><title>Noindex vs. robots.txt disallow for millions of stub pages</title><link>https://technicalseonews.com/latest/noindex-vs-robots-txt-disallow-for-millions-of-stub-pages</link><guid isPermaLink="true">https://technicalseonews.com/latest/noindex-vs-robots-txt-disallow-for-millions-of-stub-pages</guid><description>Noindex and robots.txt disallow have different effects on crawling and indexing. Verify you have a crawl budget problem before blocking stub pages at scale.</description><pubDate>Mon, 04 May 2026 23:07:26 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1t3vrr4/managing_crawl_budget_for_a_news_website/&quot;&gt;r/TechSEO discussion&lt;/a&gt; about managing crawl budget on a large news site sparked debate over whether tag and author stub pages should be blocked via robots.txt or handled with noindex. The original post has since been deleted, but the thread&apos;s responses reveal a common tension for news sites with millions of low-value URLs.&lt;/p&gt;
&lt;p&gt;The core question was whether tag pages and empty author pages should be disallowed in robots.txt. Practitioners in the thread offered different approaches, but converged on a key point: confirm you actually have a &lt;a href=&quot;/latest/ai-bot-traffic-starves-googlebot-of-crawl-budget-on-large-sites&quot;&gt;crawl budget problem&lt;/a&gt; before making changes.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;News sites generate stub pages at an enormous rate. Every new tag, author profile, or taxonomy page creates a URL that Googlebot will eventually discover and attempt to crawl. For sites with millions of these pages, the concern is that Googlebot spends its crawl budget on low-value URLs instead of fresh articles.&lt;/p&gt;
&lt;p&gt;The choice between noindex and &lt;a href=&quot;/latest/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves&quot;&gt;robots.txt disallow&lt;/a&gt; has real consequences. Blocking a URL via robots.txt prevents Google from crawling it, but Google can still index the URL if it finds links pointing to it. The URL may appear in search results with no snippet. Adding a noindex meta tag requires Google to crawl the page at least once to see the directive, but it reliably removes the page from the index.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://developers.google.com/search/docs/beginner/how-search-works&quot;&gt;Google&apos;s documentation on how Search works&lt;/a&gt; confirms that crawling and indexing are separate stages. A robots.txt disallow stops crawling but not indexing. If the goal is to keep stub pages out of search results, noindex is the more precise tool.&lt;/p&gt;
&lt;p&gt;One practitioner in the thread, rykef, recommended noindex for stub pages and suggested starting with a linking analysis. Their reasoning: news sites are aggressively crawled, so identifying which pages get discovered quickly matters more than broad blocking rules. Changes that affect many pages at once will move the needle more than general site health fixes.&lt;/p&gt;
&lt;p&gt;Another commenter, AbleInvestment2866, drew a distinction between tag pages and author pages. Tags should generally be blocked or noindexed. Author pages are worth keeping if they have real content, but empty ones should be handled. They also raised a critical caveat: unless you can confirm there is actually a crawl budget problem, leaving things alone may be safer. Making large-scale changes to a news site&apos;s URL structure carries risk.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Verify the problem exists before acting.&lt;/strong&gt; Check server logs to confirm Googlebot is spending disproportionate time on stub pages. The &lt;a href=&quot;https://www.screamingfrog.co.uk/log-file-analyser/&quot;&gt;Screaming Frog Log File Analyser&lt;/a&gt; can process millions of log events and show exactly which URLs bots are crawling and how frequently. Google Search Console&apos;s crawl stats report also shows crawl activity by response code and page type.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use noindex instead of robots.txt disallow when you want pages out of the index.&lt;/strong&gt; A &lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt&quot;&gt;robots.txt disallow&lt;/a&gt; prevents crawling but not indexing. If Google discovers a disallowed URL through internal links or sitemaps, it can still index the URL without a snippet. Noindex requires one crawl to process but then reliably removes the page.&lt;/p&gt;
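&lt;p&gt;Either form of the directive works: the meta tag goes in the page template, while the response-header variant is handy when editing templates across millions of stub pages is impractical.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- In the head of each tag or empty author page --&amp;gt;
&amp;lt;meta name=&quot;robots&quot; content=&quot;noindex, follow&quot;&amp;gt;

# Or sent as an HTTP response header for matching URL patterns
X-Robots-Tag: noindex
&lt;/code&gt;&lt;/pre&gt;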
&lt;p&gt;&lt;strong&gt;Audit internal linking to stub pages.&lt;/strong&gt; On news sites, tag and author pages often receive thousands of internal links from article footers and sidebars. Reducing internal link signals to empty stub pages can decrease how aggressively Googlebot crawls them. Prioritize changes that affect the largest number of pages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Distinguish between page types.&lt;/strong&gt; Empty tag pages and author profiles with no content are safe candidates for noindex. Author pages with bios, article lists, and E-E-A-T signals may be worth keeping indexed. Apply rules by page type, not with a single blanket directive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stage the rollout.&lt;/strong&gt; As one practitioner in the thread noted, there are more ways to make things worse than better on a site this size. Apply noindex to one category of stub pages first, monitor crawl behavior and indexing for two to four weeks, then expand.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Robots.txt blocking pages that already have noindex.&lt;/strong&gt; If you disallow a URL in robots.txt and also add a noindex tag, Google cannot crawl the page to see the noindex directive. The robots.txt block takes priority, and the page may remain indexed. Pick one approach per URL pattern.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/mueller-doubts-freshness-based-sitemap-splits-speed-crawling&quot;&gt;Sitemaps including noindexed URLs&lt;/a&gt;.&lt;/strong&gt; If your sitemap generator automatically includes tag or author pages, noindexed URLs will keep appearing in sitemaps. Googlebot may continue requesting them. Exclude noindexed URL patterns from sitemap generation to avoid sending mixed signals.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/noindex-vs-robots-txt-disallow-for-millions-of-stub-pages.webp" medium="image" type="image/webp"/></item><item><title>ChatGPT free tier triggers web search in only 10.8% of queries</title><link>https://technicalseonews.com/latest/chatgpt-free-tier-triggers-web-search-in-only-10-8-of-queries</link><guid isPermaLink="true">https://technicalseonews.com/latest/chatgpt-free-tier-triggers-web-search-in-only-10-8-of-queries</guid><description>Free ChatGPT uses live web search in only 10.8% of queries versus 47.4% for paid, leaving users vulnerable to stale data and misinformation.</description><pubDate>Mon, 04 May 2026 10:28:41 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Free-tier ChatGPT models triggered a live web search in just 10.8% of queries, compared to 47.4% for paid-tier models. That finding comes from a &lt;a href=&quot;https://wordlift.io/blog/en/ai-free-tiers-comparison/&quot;&gt;WordLift analysis&lt;/a&gt; of 56 ChatGPT enterprise SSE (server-sent events) traces published on May 4, 2026.&lt;/p&gt;
&lt;p&gt;The researchers classified traces by the model slug exposed in the SSE stream. Traces using &lt;code&gt;gpt-5-3-mini&lt;/code&gt; were treated as a free-tier proxy, while &lt;code&gt;gpt-5-3&lt;/code&gt; and &lt;code&gt;gpt-5-5-thinking&lt;/code&gt; served as paid-tier proxies. Only 56 of 131 total enterprise traces had usable model-slug metadata. WordLift&apos;s Andrea Volpini describes the sample as &quot;small&quot; and frames the results as &quot;a directional signal, not a final verdict.&quot;&lt;/p&gt;
&lt;p&gt;The grounding gap extended beyond search frequency. Free-tier traces produced 0.93 URLs per 1,000 characters versus 3.38 for paid-tier. Citation density was 0.14 per 1,000 characters for free versus 0.78 for paid. A composite &quot;trustworthiness proxy&quot; score came in at 49.2 for free-tier traces and 76.8 for paid.&lt;/p&gt;
&lt;p&gt;One finding the team did not expect: schema-related vocabulary appeared at nearly identical rates across both groups (35.1% free, 31.6% paid). Both model tiers could discuss structured data and machine-readability fluently. The difference was whether the model actually verified claims against live pages.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The practical gap here is &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;evidence density, not fluency&lt;/a&gt;. A free-tier response can sound equally authoritative while citing fewer sources and relying more heavily on parametric memory. For brands, that means free-tier users may receive answers built on stale or incorrect information with no easy way to tell.&lt;/p&gt;
&lt;p&gt;WordLift&apos;s analysis found that 32.4% of free-tier traces were purely parametric, meaning no web search or retrieval happened at all. Only 5.3% of paid-tier traces behaved this way. When a model skips live retrieval, it falls back on whatever its training data contains. Old product descriptions, outdated positioning, or outright fabrications all become more likely to surface.&lt;/p&gt;
&lt;p&gt;Lily Ray described the underlying feedback loop in her Substack piece on &lt;a href=&quot;https://lilyraynyc.substack.com/p/the-ai-slop-loop&quot;&gt;the AI Slop Loop&lt;/a&gt;. She documented how AI-generated misinformation enters training data and gets repeated until &quot;repetition is treated as consensus.&quot; Ray found Perplexity citing &lt;a href=&quot;/latest/fabricated-google-core-update-ranked-in-search-and-ai-overviews&quot;&gt;fabricated SEO news from AI-generated agency blog posts&lt;/a&gt;, including a nonexistent &quot;September 2025 Perspective Core Algorithm Update.&quot; When free-tier models search less, they become more vulnerable to exactly this kind of recycled misinformation.&lt;/p&gt;
&lt;p&gt;The SSE stream schemas reinforce the behavioral data. Paid-tier traces exposed a richer orchestration layer with reasoning status fields, reasoning start/end times, and deliberation stages before answer assembly. Free-tier schemas were leaner, following a simpler prompt-to-answer flow with optional search.&lt;/p&gt;
&lt;p&gt;For sites that depend on accurate brand representation in AI answers, the split matters. Most casual users are on free tiers. Those users get answers with thinner evidence trails and fewer citations back to primary sources.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;The WordLift analysis suggests the gap is not about schema vocabulary but about whether models verify claims against live sources. Your structured data still matters, but it is not sufficient on its own.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit your brand&apos;s parametric footprint.&lt;/strong&gt; Ask free-tier ChatGPT questions about your brand, products, and key claims. Compare the answers against what paid-tier models return. Document where the free tier surfaces stale or incorrect information.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengthen signals that survive without live retrieval.&lt;/strong&gt; When a model relies on training data rather than live search, the information it absorbed during training determines the answer. Consistent, accurate information across authoritative sources (your site, Wikipedia, industry publications) reduces the chance of hallucinated claims.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Keep structured data current on your pages.&lt;/strong&gt; Both tiers showed similar schema awareness in vocabulary. The paid tier was more likely to check live pages. When it does check, accurate &lt;a href=&quot;https://schema.org/&quot;&gt;structured data&lt;/a&gt; on the page gives the model a machine-readable source of truth. &lt;a href=&quot;https://developers.google.com/search/docs/appearance/structured-data/sd-policies&quot;&gt;Google&apos;s structured data policies&lt;/a&gt; remain the baseline for markup quality.&lt;/p&gt;
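&lt;p&gt;As a minimal sketch of what that machine-readable source of truth can look like (product name and values are placeholders; validate real markup against the policies linked above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;script type=&quot;application/ld+json&quot;&amp;gt;
{
  &quot;@context&quot;: &quot;https://schema.org&quot;,
  &quot;@type&quot;: &quot;Product&quot;,
  &quot;name&quot;: &quot;Example Analytics Suite&quot;,
  &quot;description&quot;: &quot;Self-serve product analytics for B2B SaaS teams.&quot;,
  &quot;offers&quot;: {
    &quot;@type&quot;: &quot;Offer&quot;,
    &quot;price&quot;: &quot;49.00&quot;,
    &quot;priceCurrency&quot;: &quot;USD&quot;
  }
}
&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;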
&lt;p&gt;&lt;strong&gt;Watch for the slop loop.&lt;/strong&gt; Monitor whether AI-generated content about your brand is entering the broader web. If fabricated claims get repeated across enough low-quality sites, they can become the parametric &quot;consensus&quot; that free-tier models fall back on.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Overreading the sample size.&lt;/strong&gt; The analysis covers 56 usable traces from enterprise accounts, not a direct comparison of free and paid consumer subscriptions. The model-slug classification is a proxy. Treat the specific percentages as directional rather than definitive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assuming schema markup alone closes the gap.&lt;/strong&gt; The study found both tiers equally fluent in schema vocabulary. The problem is that &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;free-tier models often skip the step where they would actually visit your page&lt;/a&gt; and read your markup. Schema helps when the model checks. It does not force the model to check.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/chatgpt-free-tier-triggers-web-search-in-only-10-8-of-queries.webp" medium="image" type="image/webp"/></item><item><title>Google tells developers to build websites for AI agents</title><link>https://technicalseonews.com/latest/google-tells-developers-to-build-websites-for-ai-agents</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-tells-developers-to-build-websites-for-ai-agents</guid><description>Google published a web.dev guide treating AI agents as a distinct audience, recommending semantic HTML and stable layouts that map to existing WCAG patterns.</description><pubDate>Sat, 02 May 2026 03:42:57 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google&apos;s web.dev site published a new guide titled &lt;a href=&quot;https://web.dev/articles/ai-agent-site-ux&quot;&gt;&quot;Build agent-friendly websites,&quot;&lt;/a&gt; telling developers that their sites now have &quot;a new type of visitor.&quot; The guide frames &lt;a href=&quot;/latest/agent-runtimes-not-models-now-control-how-ai-reads-your-site&quot;&gt;AI agents as a distinct audience&lt;/a&gt; alongside human users and recommends specific development practices to support them.&lt;/p&gt;
&lt;p&gt;The core argument: &lt;a href=&quot;/latest/schema-markup-does-not-influence-llm-parsing&quot;&gt;sites built with complex hover states&lt;/a&gt;, shifting layouts, and fluid motion are &quot;functionally broken for agents.&quot; &lt;a href=&quot;https://www.searchenginejournal.com/google-tells-developers-to-build-for-ai-agents-not-just-humans/573587/&quot;&gt;Search Engine Journal&apos;s coverage&lt;/a&gt; notes that most of the guidance maps directly to existing accessibility and semantic HTML practices.&lt;/p&gt;
&lt;p&gt;Google describes three ways agents interpret websites:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Screenshots:&lt;/strong&gt; Agents take a page snapshot and use vision models to identify elements visually.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Raw HTML:&lt;/strong&gt; Agents read DOM structure and hierarchy directly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accessibility tree:&lt;/strong&gt; Google calls this a &quot;high-fidelity map&quot; of interactive elements, stripped of visual noise.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The specific recommendations include using semantic HTML elements like &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; instead of styled &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; elements, keeping layouts stable across pages, linking &lt;code&gt;&amp;lt;label&amp;gt;&lt;/code&gt; tags to inputs with the &lt;code&gt;for&lt;/code&gt; attribute, and setting &lt;code&gt;cursor: pointer&lt;/code&gt; on clickable elements.&lt;/p&gt;
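&lt;p&gt;A before-and-after sketch of what those recommendations mean in markup (&lt;code&gt;submitForm()&lt;/code&gt; is a stand-in for whatever handler the page uses):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Hard for agents: no role or accessible name in the accessibility tree --&amp;gt;
&amp;lt;div class=&quot;btn&quot; onclick=&quot;submitForm()&quot;&amp;gt;Submit&amp;lt;/div&amp;gt;

&amp;lt;!-- Agent-friendly: native semantics, explicit label association, pointer cursor --&amp;gt;
&amp;lt;label for=&quot;email&quot;&amp;gt;Work email&amp;lt;/label&amp;gt;
&amp;lt;input id=&quot;email&quot; name=&quot;email&quot; type=&quot;email&quot;&amp;gt;
&amp;lt;button type=&quot;submit&quot; style=&quot;cursor: pointer&quot;&amp;gt;Submit&amp;lt;/button&amp;gt;
&lt;/code&gt;&lt;/pre&gt;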
&lt;p&gt;Google closes with a pointed summary: &quot;Everything we suggest to make a site &apos;agent-ready&apos; also makes sites better for humans.&quot;&lt;/p&gt;
&lt;p&gt;At the bottom of the guide, Google links to WebMCP, a proposed web standard for agent-website interaction. WebMCP would let websites register tools with defined input/output schemas that agents can discover and call as functions. Chrome&apos;s team describes it as an early preview program and is accepting developer sign-ups.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Semantic HTML, stable layouts, and proper &lt;a href=&quot;https://www.w3.org/WAI/fundamentals/accessibility-intro/&quot;&gt;accessibility markup&lt;/a&gt; have been web development defaults for years. The practices Google recommends are not new. What changed is the messenger and the framing.&lt;/p&gt;
&lt;p&gt;Publishing this on web.dev puts &lt;a href=&quot;/latest/semrush-launches-ai-agent-readiness-audits-for-technical-seo&quot;&gt;agent-friendliness alongside established developer guidance&lt;/a&gt; areas like accessibility and performance. Google is signaling that agent interaction is now part of its official web platform priorities.&lt;/p&gt;
&lt;p&gt;For sites already following accessibility best practices, there is likely little to change. The practical gap is for sites that rely heavily on JavaScript-rendered UI components, custom &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;-based controls, and layouts that shift between states. Those patterns already hurt accessibility scores. Now they also break agent workflows.&lt;/p&gt;
&lt;p&gt;The business case for semantic HTML now extends beyond screen readers. AI agents that browse, compare, and transact on behalf of users need the same structural clarity that assistive technology does. Sites selling products or services through multi-step flows should pay particular attention. An agent that cannot reliably identify form fields or buttons cannot complete a purchase.&lt;/p&gt;
&lt;p&gt;WebMCP is worth watching separately. If adopted, it would give sites a way to expose structured capabilities directly to agents, rather than relying on agents to parse page structure. Google I/O runs May 19–20, and the Chrome team&apos;s sessions there may bring more details on browser-based agent interactions.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your HTML semantics.&lt;/strong&gt; Check whether interactive elements use native HTML (&lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt;) rather than styled &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;span&amp;gt;&lt;/code&gt; elements with click handlers. Browser DevTools&apos; accessibility inspector can flag these quickly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check layout stability.&lt;/strong&gt; Pages with &lt;a href=&quot;https://web.dev/articles/cls&quot;&gt;high CLS scores&lt;/a&gt; are already a Core Web Vitals problem. They are now also an agent problem. Ensure layouts do not shift between interaction states.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Verify label associations.&lt;/strong&gt; Every form input should have a &lt;code&gt;&amp;lt;label&amp;gt;&lt;/code&gt; element linked via the &lt;code&gt;for&lt;/code&gt; attribute. Run an accessibility audit in Lighthouse to catch missing associations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Review your accessibility tree.&lt;/strong&gt; Open Chrome DevTools, navigate to the Accessibility panel, and check how your key pages look in the accessibility tree view. If interactive elements are missing or mislabeled, agents will struggle with them too.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consider the WebMCP early preview.&lt;/strong&gt; If your site offers transactional functionality that agents might use (bookings, purchases, comparisons), signing up for the WebMCP preview could give you early input into the standard. The sign-up is linked from Google&apos;s web.dev guide.&lt;/p&gt;
&lt;p&gt;No urgent changes are needed for sites that already pass WCAG audits and use semantic markup. The guide confirms existing best practices rather than introducing new requirements.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Custom component libraries that swallow semantics.&lt;/strong&gt; Many React, Vue, and Angular component libraries render interactive elements as nested &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; structures. Even if your source code looks clean, the rendered DOM may not expose proper semantics to the accessibility tree. Inspect the rendered output, not just the source.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cursor styling as a signal.&lt;/strong&gt; Google specifically recommends &lt;code&gt;cursor: pointer&lt;/code&gt; on clickable elements. Some CSS resets or design systems strip this style. Agents using screenshot-based interpretation may rely on visual cues like cursor changes to identify interactive targets.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/google-tells-developers-to-build-websites-for-ai-agents.webp" medium="image" type="image/webp"/></item><item><title>Semrush launches AI agent readiness audits for technical SEO</title><link>https://technicalseonews.com/latest/semrush-launches-ai-agent-readiness-audits-for-technical-seo</link><guid isPermaLink="true">https://technicalseonews.com/latest/semrush-launches-ai-agent-readiness-audits-for-technical-seo</guid><description>Semrush Site Audit now scores AI agent readiness, but the real finding is that client-side rendered pages hide pricing and product data from agent runtimes.</description><pubDate>Fri, 01 May 2026 15:09:38 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Semrush has released a set of features designed to help sites prepare for AI agent interactions. The company published &lt;a href=&quot;https://www.semrush.com/blog/agentic-search-optimization-with-semrush/&quot;&gt;a guide on agentic search optimization&lt;/a&gt; on May 1, walking through how its existing tools can audit a site&apos;s readiness for AI-driven browsing and task completion.&lt;/p&gt;
&lt;p&gt;The core addition is an &quot;AI Search Health&quot; score within Semrush&apos;s Site Audit tool. After running a crawl, users can review a score reflecting how accessible and structured their pages are for AI crawlers. A &quot;Blocked from AI Search&quot; widget shows which AI crawlers are blocked via robots.txt and which pages are affected.&lt;/p&gt;
&lt;p&gt;The Site Audit&apos;s Issues tab now includes an &quot;AI Search&quot; filter that flags problems like missing anchor text on links, pages with only one internal link, content needing optimization, and a missing llms.txt file.&lt;/p&gt;
&lt;p&gt;Semrush also recommends using its Log File Analyzer to check whether AI bots actually crawl your site. Users can filter server logs for user agents like &lt;a href=&quot;/latest/openai-crawl-activity-tripled-after-gpt-5-led-by-search-bot&quot;&gt;GPTBot, ChatGPT-User, OAI-SearchBot, and ClaudeBot&lt;/a&gt; to see which pages get bot activity, what status codes bots encounter, and whether certain pages or file types are being skipped.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The concept Semrush is calling &quot;agentic readiness&quot; goes beyond AI Overview visibility. It addresses whether an AI agent can land on your site, understand the content, and complete a task like retrieving pricing or submitting a form. The distinction matters because agent-driven workflows penalize sites differently than traditional search does.&lt;/p&gt;
&lt;p&gt;A page that ranks fine in Google but hides pricing behind a PDF or relies on heavy client-side JavaScript may work for human visitors. An AI agent encountering the same page may simply move on to a competitor. Semrush frames this as a filtering problem: agents evaluate multiple sites, extract structured information, and narrow down options. Sites that present information clearly survive the cut.&lt;/p&gt;
&lt;p&gt;The llms.txt check is worth noting. While llms.txt is still an emerging convention (not a formal standard), Semrush flagging its absence signals that the file is becoming part of the expected technical SEO baseline for AI readiness.&lt;/p&gt;
&lt;p&gt;For e-commerce and SaaS sites with pricing pages, feature comparisons, and signup flows, the practical risk is real. If agents can&apos;t extract your product details programmatically, you lose consideration before a human ever sees your brand.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Run the AI Search audit in Semrush Site Audit.&lt;/strong&gt; Launch a crawl, then check your AI Search Health score. Review the &quot;Blocked from AI Search&quot; widget to see if you&apos;re blocking GPTBot, ClaudeBot, or other AI user agents in robots.txt. Unblock them for pages you want AI agents to access.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check the AI Search filter under Issues.&lt;/strong&gt; Look for flagged problems: missing anchor text, orphan-like pages with a single internal link, and the missing llms.txt warning. Prioritize fixing access issues on your most important commercial pages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit your server logs for AI bot activity.&lt;/strong&gt; Use Semrush&apos;s Log File Analyzer or your own log analysis setup. Filter for GPTBot, ChatGPT-User, OAI-SearchBot, and ClaudeBot. If these bots aren&apos;t hitting your key pages, you have a discovery problem to solve before worrying about content structure.&lt;/p&gt;
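&lt;p&gt;If you prefer to check the logs directly, a rough pass like the one below (combined log format and log path assumed) shows which status codes a given AI bot is receiving:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Status codes served to GPTBot; repeat for ChatGPT-User, OAI-SearchBot, ClaudeBot
grep &quot;GPTBot&quot; /var/log/nginx/access.log | awk &apos;{print $9}&apos; | sort | uniq -c | sort -rn
&lt;/code&gt;&lt;/pre&gt;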
&lt;p&gt;&lt;strong&gt;Identify and review your key pages.&lt;/strong&gt; List the URLs that explain what you offer, your pricing, and your conversion paths (demo requests, signups, contact forms). For each page, confirm that the essentials are explicitly present in the HTML:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What you offer&lt;/li&gt;
&lt;li&gt;Who it&apos;s for&lt;/li&gt;
&lt;li&gt;How it&apos;s different&lt;/li&gt;
&lt;li&gt;What the next step is&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Reduce barriers to machine readability.&lt;/strong&gt; Avoid burying key information in PDFs, images without alt text, or JavaScript-rendered content that requires execution to access. Use clear headings that match the topic of each section. Break dense text into short paragraphs or lists. Keep related information grouped so sections can stand alone.&lt;/p&gt;
&lt;p&gt;These structural principles align with &lt;a href=&quot;https://schema.org/&quot;&gt;Schema.org&lt;/a&gt; best practices and &lt;a href=&quot;https://developers.google.com/search/docs/beginner/how-search-works&quot;&gt;Google&apos;s documentation on how search works&lt;/a&gt;. Clean HTML, logical heading hierarchies, and crawlable content have always mattered. AI agents just raise the penalty for getting them wrong.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/managed-wordpress-hosts-silently-block-ai-crawlers&quot;&gt;Over-blocking in robots.txt&lt;/a&gt;.&lt;/strong&gt; Many sites added blanket blocks for AI crawlers in 2024–2025 to prevent training data scraping. If you still have those blocks in place, they also prevent AI agents from accessing your content during real-time search workflows. Review your robots.txt rules and consider allowing access on commercial pages you want agents to find.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assuming AI visibility equals agent readiness.&lt;/strong&gt; Appearing in an AI Overview is not the same as being usable by an autonomous agent. A page can surface in a summary but still fail when an agent tries to extract structured pricing or navigate a signup flow. Test your key pages from the perspective of a machine reader, not just a search result.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/semrush-launches-ai-agent-readiness-audits-for-technical-seo.webp" medium="image" type="image/webp"/></item><item><title>B2B SaaS listicles and comparison pages losing rank weight</title><link>https://technicalseonews.com/latest/b2b-saas-listicles-and-comparison-pages-losing-rank-weight</link><guid isPermaLink="true">https://technicalseonews.com/latest/b2b-saas-listicles-and-comparison-pages-losing-rank-weight</guid><description>B2B SaaS companies report self-promotional listicles and competitor comparison pages losing search visibility, consistent with helpful content system signals.</description><pubDate>Fri, 01 May 2026 14:03:29 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A &lt;a href=&quot;https://www.reddit.com/r/bigseo/comments/1t0td97/is_self_promo_content_listicles_alternatives/&quot;&gt;discussion in r/bigseo&lt;/a&gt; asked whether self-promotional content formats like listicles and &quot;alternatives to X&quot; pages are losing ranking weight. The post, submitted on May 1, 2026, was removed by a moderator before gaining traction, but the question it raised reflects a pattern that B2B SaaS SEOs have been discussing for months.&lt;/p&gt;
&lt;p&gt;The core concern: pages where a SaaS company publishes a &quot;best X tools&quot; list and ranks itself first, or creates &quot;alternatives to [Competitor]&quot; pages, appear to be losing visibility in Google search results.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;These content formats have been a staple of B2B SaaS content strategy for years. Nearly every mid-market SaaS company publishes pages like &quot;Top 10 project management tools&quot; (with their own product at position one) or &quot;Best [Competitor] alternatives&quot; (with their own product as the recommended switch). The pages are designed to capture commercial-intent queries and funnel traffic toward sign-ups.&lt;/p&gt;
&lt;p&gt;The concern aligns with how Google describes its approach to content quality. &lt;a href=&quot;https://developers.google.com/search/docs/appearance/ranking-systems-guide#helpful-content-system&quot;&gt;Google&apos;s helpful content documentation&lt;/a&gt; states that ranking systems use both page-level and site-level signals to evaluate content. The documentation notes that the systems aim to surface results that are genuinely useful, not content created primarily for search engine traffic.&lt;/p&gt;
&lt;p&gt;Self-promotional listicles sit in an awkward spot under those criteria. A &quot;best tools&quot; list written by one of the tools being listed has an inherent conflict of interest. The content exists to rank for commercial queries, and the editorial judgment (which tool is &quot;best&quot;) is made by a party with a financial stake in the answer.&lt;/p&gt;
&lt;p&gt;Google&apos;s systems don&apos;t need a manual penalty to handle this. Site-level classifiers can detect patterns of self-serving content, and page-level signals like user engagement can reflect whether searchers find the content genuinely helpful or bounce back to try another result.&lt;/p&gt;
&lt;p&gt;For B2B SaaS companies that have invested heavily in this format, any downranking creates a real pipeline problem. These pages often sit at the top of the funnel and drive significant demo requests.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;The r/bigseo post was removed before substantive community discussion could develop, so there is no consensus from practitioners on specific remediation steps. That said, the signal is worth investigating if your site relies on these formats.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check your own data first.&lt;/strong&gt; Pull Google Search Console performance for your listicle and comparison URLs. Filter by the last 6–12 months and look for declining impressions or position changes on the queries these pages target. If you see drops, compare them against overall site trends to separate page-specific losses from broader algorithm shifts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit the editorial framing.&lt;/strong&gt; Pages that rank the publishing company&apos;s own product first, with thin or dismissive coverage of competitors, are the most likely to be affected. If your &quot;Top 10&quot; page gives your product 400 words and every competitor 50 words, the bias is structurally obvious to both users and quality classifiers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consider third-party validation.&lt;/strong&gt; Some SaaS companies have shifted toward sponsoring or contributing to genuinely independent reviews rather than self-publishing comparison content. Pages on &lt;a href=&quot;/latest/semrush-playbook-targets-saas-citation-failures-in-ai-search&quot;&gt;review sites like G2 or Capterra&lt;/a&gt; carry editorial distance that self-published listicles cannot replicate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t overreact to one Reddit thread.&lt;/strong&gt; The original post was removed and generated only one comment. Treat the question as a prompt to audit your own performance data, not as confirmation of a ranking change.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Conflating correlation with causation.&lt;/strong&gt; If your listicle traffic dropped after a core update, the cause might be domain authority shifts, SERP feature changes, or increased competition from review aggregators rather than a specific penalty on the format itself.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-correcting by removing pages entirely.&lt;/strong&gt; Even if these pages have lost some ranking strength, they may still convert well from direct traffic, email campaigns, or paid channels. Evaluate their full contribution before cutting them.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/b2b-saas-listicles-and-comparison-pages-losing-rank-weight.webp" medium="image" type="image/webp"/></item><item><title>LLMs misrepresent brands at training, retrieval, and generation</title><link>https://technicalseonews.com/latest/llms-misrepresent-brands-at-training-retrieval-and-generation</link><guid isPermaLink="true">https://technicalseonews.com/latest/llms-misrepresent-brands-at-training-retrieval-and-generation</guid><description>LLMs misrepresent brands across training, retrieval, and generation stages. Strengthen Schema.org markup and audit third-party references to improve entity accuracy.</description><pubDate>Fri, 01 May 2026 04:27:23 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A &lt;a href=&quot;https://searchengineland.com/how-ai-models-understand-your-brand-475993&quot;&gt;Search Engine Land analysis&lt;/a&gt; published April 30 breaks down how large language models can misrepresent brands at three distinct stages: when training data is ingested, when documents are retrieved at query time, and when the model generates its final response. The piece argues that brand distortion is not a single-point failure but a compounding problem across the full LLM pipeline.&lt;/p&gt;
&lt;p&gt;The three failure points map roughly to how modern AI search systems work. Training data shapes the model&apos;s baseline &quot;understanding&quot; of a brand. &lt;a href=&quot;https://en.wikipedia.org/wiki/Retrieval-augmented_generation&quot;&gt;Retrieval-augmented generation&lt;/a&gt; (RAG) pulls in fresher documents at query time but may surface outdated or off-brand content. The generation step then synthesizes all of that into a response, introducing further risk of hallucination or conflation with competitors.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Most brand-visibility work in AI search has focused on the output layer: checking what ChatGPT or Gemini says about a brand and trying to correct it. The Search Engine Land analysis reframes the problem as structural. If training data already contains outdated messaging, fixing the generation layer alone will not solve the issue.&lt;/p&gt;
&lt;p&gt;Training data is largely static. Models ingest web content at a point in time, and that snapshot may include old product descriptions, discontinued services, or third-party content that mischaracterizes the brand. Practitioners have limited control over what gets included.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/latest/ai-search-scores-passages-not-pages-killing-pillar-content&quot;&gt;The retrieval layer is where SEOs have more leverage&lt;/a&gt;. RAG systems pull from live or semi-live indexes. The content that ranks well in traditional search or appears in knowledge bases is likely to be retrieved. Poorly structured or ambiguous content at this stage feeds directly into the generated answer.&lt;/p&gt;
&lt;p&gt;Generation-stage distortion is the hardest to control. Even with accurate training data and clean retrieval, models can blend information from multiple entities or hallucinate details. A brand with a generic name or one that shares terminology with competitors is especially vulnerable.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your entity footprint across the web.&lt;/strong&gt; Search for your brand in major LLM-powered tools (ChatGPT, Gemini, Perplexity) and document where the response diverges from your actual positioning. Note whether errors look like stale training data or retrieval-stage problems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengthen structured data on your own properties.&lt;/strong&gt; Use &lt;a href=&quot;https://schema.org/Organization&quot;&gt;Schema.org Organization markup&lt;/a&gt; to define your brand&apos;s name, description, founding date, logos, and key attributes. Structured data gives retrieval systems unambiguous signals about your entity. Include &lt;code&gt;sameAs&lt;/code&gt; properties pointing to your Wikipedia page, LinkedIn, and other authoritative profiles.&lt;/p&gt;
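&lt;p&gt;As a minimal sketch, with placeholder names and URLs, the markup can be generated from a small JavaScript object and emitted as a JSON-LD script tag. Where possible, render the same block server-side, since many AI crawlers do not execute JavaScript:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal Organization JSON-LD; every value below is a placeholder.
const orgSchema = {
  &apos;@context&apos;: &apos;https://schema.org&apos;,
  &apos;@type&apos;: &apos;Organization&apos;,
  name: &apos;Example Analytics&apos;,
  url: &apos;https://www.example.com/&apos;,
  logo: &apos;https://www.example.com/assets/logo.png&apos;,
  description: &apos;One-sentence description of what the company does today.&apos;,
  foundingDate: &apos;2015-03-01&apos;,
  sameAs: [
    &apos;https://en.wikipedia.org/wiki/Example_Analytics&apos;,
    &apos;https://www.linkedin.com/company/example-analytics&apos;
  ]
};

// Serialize into a script tag so retrieval systems see one
// unambiguous definition of the entity.
const tag = document.createElement(&apos;script&apos;);
tag.type = &apos;application/ld+json&apos;;
tag.textContent = JSON.stringify(orgSchema);
document.head.appendChild(tag);
&lt;/code&gt;&lt;/pre&gt;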
&lt;p&gt;&lt;strong&gt;Clean up third-party references.&lt;/strong&gt; Identify high-authority pages that describe your brand inaccurately. These are likely retrieval candidates for RAG systems. Request corrections on Wikipedia, industry directories, and partner sites. Outdated press releases and old product pages on your own domain are also retrieval risks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consolidate brand messaging into a clear, crawlable &quot;about&quot; page.&lt;/strong&gt; A single authoritative page with your current positioning, product lines, and differentiators gives both training crawlers and RAG systems a definitive source. Avoid splitting brand-defining content across dozens of pages with inconsistent language.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monitor regularly.&lt;/strong&gt; LLM outputs change as models are retrained and retrieval indexes refresh. Set a recurring check, monthly at minimum, to query your brand across AI tools and compare results against your current messaging.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Brand-name ambiguity multiplies risk.&lt;/strong&gt; If your brand name is also a common word or overlaps with another company, LLMs are more likely to conflate entities at every stage. Structured data and consistent use of your full legal name help, but this is an ongoing problem without a clean fix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Old content on your own domain can work against you.&lt;/strong&gt; Archived blog posts, deprecated product pages, and outdated case studies are all fair game for training and retrieval. If you cannot remove them, add clear date signals and consider noindexing content that no longer reflects your brand.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/05/llms-misrepresent-brands-at-training-retrieval-and-generation.webp" medium="image" type="image/webp"/></item><item><title>WordPress to SvelteKit migration risks crawlability regression</title><link>https://technicalseonews.com/latest/wordpress-to-sveltekit-migration-risks-crawlability-regression</link><guid isPermaLink="true">https://technicalseonews.com/latest/wordpress-to-sveltekit-migration-risks-crawlability-regression</guid><description>SvelteKit migrations from WordPress require explicit SSR setup and manual schema markup to avoid losing crawlability. Test rendering before launch.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A practitioner &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1szkb35/migrating_a_site_with_98k_monthly_visitors_from/&quot;&gt;posted in r/TechSEO&lt;/a&gt; asking how to migrate a photography e-commerce site with 98,000 monthly visitors from WordPress (with WooCommerce) to SvelteKit without losing traffic. The post describes a site with poor existing SEO: no alt tags, unoptimized images, redundant database queries, and bad performance scores.&lt;/p&gt;
&lt;p&gt;The poster noted that the site&apos;s traffic likely comes from brand strength and location rather than technical SEO merit. Their research identified URL parity as the primary concern, including trailing slashes and sitemap consistency. Community responses, as is common with migration threads, leaned heavily toward &quot;don&apos;t do it.&quot;&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Framework migrations from server-rendered CMS platforms like WordPress to JavaScript-based frameworks carry specific crawlability risks that go beyond URL mapping. SvelteKit can render pages server-side, but the default behavior and configuration matter. A misconfigured SvelteKit deployment can serve client-side rendered pages that Googlebot handles differently than static HTML.&lt;/p&gt;
&lt;p&gt;WordPress generates server-rendered HTML by default, so every page is crawlable without JavaScript execution. SvelteKit also renders on the server by default, but SSR can be switched off per route and static site generation (SSG) requires explicit prerender configuration, so the rendering mode has to be verified rather than assumed. &lt;a href=&quot;https://kit.svelte.dev/docs/performance&quot;&gt;SvelteKit&apos;s performance documentation&lt;/a&gt; describes code-splitting and preloading as built-in features, but SEO-critical rendering choices still require deliberate setup.&lt;/p&gt;
&lt;p&gt;The e-commerce angle adds another layer. WooCommerce sites typically output Product structured data through plugins. Moving to SvelteKit means rebuilding that structured data from scratch. &lt;a href=&quot;https://schema.org/Product&quot;&gt;Schema.org&apos;s Product vocabulary&lt;/a&gt; defines the expected properties, but SvelteKit has no built-in schema markup generation. Every &lt;code&gt;Product&lt;/code&gt; type, &lt;code&gt;AggregateRating&lt;/code&gt;, and &lt;code&gt;Offer&lt;/code&gt; must be manually implemented in the new codebase.&lt;/p&gt;
&lt;p&gt;Photography gallery sites also depend heavily on image search traffic. &lt;a href=&quot;https://developers.google.com/search/docs/beginner/get-started&quot;&gt;Google&apos;s SEO starter documentation&lt;/a&gt; emphasizes that content needs to be interpretable by search engines. Missing alt text on the current WordPress site is a problem, but at least the images are discoverable in server-rendered HTML. A SvelteKit migration that lazy-loads images via JavaScript without proper SSR fallbacks could make image discovery worse, not better.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Confirm SSR is enabled for every indexable route.&lt;/strong&gt; SvelteKit&apos;s &lt;code&gt;+page.server.js&lt;/code&gt; files handle server-side data loading. Every page that needs to appear in search results should use SSR, not client-side rendering. Test with &lt;code&gt;curl&lt;/code&gt; or Googlebot&apos;s rendered HTML in Search Console&apos;s URL Inspection tool.&lt;/p&gt;
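&lt;p&gt;A quick way to approximate the &lt;code&gt;curl&lt;/code&gt; check in Node 18+ (save as an &lt;code&gt;.mjs&lt;/code&gt; file; the URL and expected string are placeholders): fetch the raw HTML without executing JavaScript and confirm the content you care about is already in the response.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// ssr-check.mjs: confirm critical content exists in server-rendered HTML.
// Run with: node ssr-check.mjs
const pages = [
  { url: &apos;https://staging.example.com/product/sunset-print&apos;, expect: &apos;Sunset Print&apos; }
];

for (const page of pages) {
  const res = await fetch(page.url);
  const html = await res.text();
  const found = html.includes(page.expect);
  console.log(page.url, res.status, found ? &apos;content in raw HTML&apos; : &apos;content MISSING from raw HTML&apos;);
}
&lt;/code&gt;&lt;/pre&gt;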
&lt;p&gt;&lt;strong&gt;Map every URL before writing code.&lt;/strong&gt; Crawl the existing WordPress site with Screaming Frog or Sitebulb. Export the full URL list including trailing slashes, query parameters, and pagination patterns. WooCommerce product URLs, category pages, and image attachment pages all need explicit handling.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/redirect-chains-that-only-appear-after-domain-switches-go-live&quot;&gt;Set up redirect rules for any URL changes&lt;/a&gt;.&lt;/strong&gt; If WordPress uses &lt;code&gt;/product/photo-name/&lt;/code&gt; and SvelteKit uses &lt;code&gt;/product/photo-name&lt;/code&gt; (no trailing slash), that mismatch produces 404s or unplanned redirects on every indexed URL unless it is handled deliberately. SvelteKit&apos;s &lt;code&gt;hooks.server.js&lt;/code&gt; file can handle redirects, or configure them at the edge (Vercel, Cloudflare, etc.).&lt;/p&gt;
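&lt;p&gt;A minimal sketch of a trailing-slash redirect in &lt;code&gt;hooks.server.js&lt;/code&gt;, assuming the new site standardizes on no trailing slash; adjust the status code and add exclusions to match your own URL map:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// src/hooks.server.js: redirect legacy trailing-slash URLs permanently.
export async function handle({ event, resolve }) {
  const { pathname, search } = event.url;

  // /product/photo-name/ becomes /product/photo-name, query string preserved.
  if (pathname.endsWith(&apos;/&apos;)) {
    if (pathname !== &apos;/&apos;) {
      return new Response(null, {
        status: 301,
        headers: { location: pathname.slice(0, -1) + search }
      });
    }
  }

  return resolve(event);
}
&lt;/code&gt;&lt;/pre&gt;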
&lt;p&gt;&lt;strong&gt;Rebuild structured data manually.&lt;/strong&gt; Audit the current site&apos;s structured data output using Rich Results Test. Recreate Product, Offer, and any LocalBusiness markup in SvelteKit&apos;s &lt;code&gt;+page.svelte&lt;/code&gt; components using JSON-LD script blocks.&lt;/p&gt;
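&lt;p&gt;A sketch of the kind of helper a product route could call, with placeholder field names standing in for whatever your product data actually looks like; the returned string goes inside a &lt;code&gt;script type=&quot;application/ld+json&quot;&lt;/code&gt; tag in &lt;code&gt;+page.svelte&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// src/lib/productSchema.js: rebuild the Product markup WooCommerce used to emit.
// The fields read from `product` are placeholders for your own data model.
export function productJsonLd(product) {
  return JSON.stringify({
    &apos;@context&apos;: &apos;https://schema.org&apos;,
    &apos;@type&apos;: &apos;Product&apos;,
    name: product.title,
    image: product.imageUrls,
    description: product.summary,
    sku: product.sku,
    offers: {
      &apos;@type&apos;: &apos;Offer&apos;,
      price: product.price,
      priceCurrency: product.currency,
      availability: &apos;https://schema.org/InStock&apos;,
      url: product.canonicalUrl
    },
    // Only include aggregateRating when the product actually has reviews.
    aggregateRating: {
      &apos;@type&apos;: &apos;AggregateRating&apos;,
      ratingValue: product.ratingValue,
      reviewCount: product.reviewCount
    }
  });
}
&lt;/code&gt;&lt;/pre&gt;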
&lt;p&gt;&lt;strong&gt;Preserve the XML sitemap.&lt;/strong&gt; WordPress plugins like Yoast auto-generate sitemaps. In SvelteKit, you need a custom &lt;code&gt;/sitemap.xml&lt;/code&gt; endpoint. Build one that dynamically pulls all product and page URLs.&lt;/p&gt;
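&lt;p&gt;A minimal sketch of that endpoint; the &lt;code&gt;getAllProductUrls&lt;/code&gt; helper and the domain are placeholders for wherever your product data actually lives:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// src/routes/sitemap.xml/+server.js: serve a dynamically generated sitemap.
import { getAllProductUrls } from &apos;$lib/catalog&apos;; // hypothetical helper

export async function GET() {
  const paths = await getAllProductUrls(); // e.g. [&apos;/product/sunset-print&apos;, ...]
  const entries = paths
    .map(function (path) {
      return &apos;  &amp;lt;url&amp;gt;&amp;lt;loc&amp;gt;https://www.example.com&apos; + path + &apos;&amp;lt;/loc&amp;gt;&amp;lt;/url&amp;gt;&apos;;
    })
    .join(&apos;\n&apos;);

  const body = &apos;&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&amp;gt;\n&apos; +
    &apos;&amp;lt;urlset xmlns=&quot;http://www.sitemaps.org/schemas/sitemap/0.9&quot;&amp;gt;\n&apos; +
    entries + &apos;\n&amp;lt;/urlset&amp;gt;&apos;;

  return new Response(body, { headers: { &apos;Content-Type&apos;: &apos;application/xml&apos; } });
}
&lt;/code&gt;&lt;/pre&gt;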
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/migration-traffic-drops-need-pre-defined-thresholds-not-panic&quot;&gt;Run a staging crawl comparison&lt;/a&gt;.&lt;/strong&gt; Before switching DNS, crawl both the WordPress site and the SvelteKit staging site. Compare page count, status codes, canonical tags, and rendered HTML output. Any discrepancy is a potential traffic loss.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Image attachment pages.&lt;/strong&gt; WordPress creates individual URLs for every uploaded image (e.g., &lt;code&gt;/photo-name-attachment/&lt;/code&gt;). These pages often rank in image search. If the SvelteKit build doesn&apos;t account for them, those URLs will 404 after migration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;WooCommerce query parameter URLs.&lt;/strong&gt; Filtering and sorting in WooCommerce generates &lt;code&gt;?orderby=&lt;/code&gt; and &lt;code&gt;?filter=&lt;/code&gt; URLs that Googlebot may have indexed. Dropping these without redirects or proper canonical handling can fragment link equity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hydration delays affecting Googlebot.&lt;/strong&gt; SvelteKit hydrates server-rendered HTML on the client. If hydration replaces critical content (prices, product descriptions) with loading states before re-rendering, Googlebot&apos;s snapshot may capture the intermediate state.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/wordpress-to-sveltekit-migration-risks-crawlability-regression.webp" medium="image" type="image/webp"/></item><item><title>Migration traffic drops need pre-defined thresholds, not panic</title><link>https://technicalseonews.com/latest/migration-traffic-drops-need-pre-defined-thresholds-not-panic</link><guid isPermaLink="true">https://technicalseonews.com/latest/migration-traffic-drops-need-pre-defined-thresholds-not-panic</guid><description>Define traffic loss thresholds before migrating your site. Pre-set benchmarks for revenue and transactions prevent panic and unnecessary rollbacks during recovery.</description><pubDate>Thu, 30 Apr 2026 04:25:48 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Brendan Bennett, Principal SEO Consultant at Candour, published a &lt;a href=&quot;https://sitebulb.com/resources/guides/when-website-migrations-go-wrong-a-practical-guide-to-disaster-recovery&quot;&gt;disaster recovery framework&lt;/a&gt; for &lt;a href=&quot;/latest/six-weeks-of-307-redirects-split-two-identical-migrations&quot;&gt;site migrations&lt;/a&gt; as part of Sitebulb&apos;s three-part migration series. The framework centers on a simple argument: if you didn&apos;t define acceptable traffic loss thresholds before the migration, every post-launch dip will feel like a crisis.&lt;/p&gt;
&lt;p&gt;Bennett&apos;s approach covers how to confirm whether a drop is genuinely problematic, how to verify your analytics data isn&apos;t misleading you, and a &quot;parity-obsessive&quot; diagnostic method for isolating longer-term issues.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Most migration monitoring starts after launch, when someone notices a traffic graph heading south. Bennett argues the useful work happens before launch, when teams calculate projected traffic loss and estimated revenue impact. Those pre-migration numbers become the benchmark dataset for weekly comparison. Without them, post-migration monitoring is just staring at graphs and guessing.&lt;/p&gt;
&lt;p&gt;The distinction between traffic drops and business metric drops is worth calling out. If revenue, transactions, and leads hold steady despite lower traffic, the migration may have shed low-value sessions. Panic in that scenario wastes time. If revenue is actually declining, that&apos;s when exit plans and rollback decisions need pre-defined triggers.&lt;/p&gt;
&lt;p&gt;Bennett points to the WooCommerce &lt;a href=&quot;/latest/redirect-chains-that-only-appear-after-domain-switches-go-live&quot;&gt;domain migration&lt;/a&gt; as a case study. At some point, WooCommerce presumably hit a threshold where waiting for recovery wasn&apos;t viable and rolled back to the old domain entirely. He frames that not as failure but as a planned off-ramp that someone had the sense to define in advance.&lt;/p&gt;
&lt;p&gt;Google&apos;s own documentation says it takes around 180 days after submitting a change of address for the old domain to stop being treated as a primary entity. Bennett reports seeing migrations where recovery takes over a year, with long-tail impacts surfacing well after the main traffic line stabilizes. There is no universal recovery timeline.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Set thresholds before you migrate.&lt;/strong&gt; Define acceptable ranges for traffic loss, revenue impact, and lead volume. Agree on these numbers with stakeholders in advance. Document what triggers a rollback decision versus what triggers a &quot;wait and monitor&quot; response.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ask four questions before entering disaster recovery mode.&lt;/strong&gt; Bennett suggests these checks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Have you hit any danger thresholds for revenue or leads?&lt;/li&gt;
&lt;li&gt;Have you given Google enough time to process the migration?&lt;/li&gt;
&lt;li&gt;Are metrics moving in the right direction, even incrementally?&lt;/li&gt;
&lt;li&gt;Is your data actually sound?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Verify your analytics configuration hasn&apos;t changed.&lt;/strong&gt; Migrations are one of the most common moments for GA4 settings to shift. Cookie consent configurations, event tracking, and tag placement can all change during a relaunch. If your measurement changed alongside the migration, you may be comparing different data sets and misdiagnosing the problem. Check &lt;a href=&quot;https://search.google.com/search-console/about&quot;&gt;Google Search Console&lt;/a&gt; data alongside GA4 to cross-reference. GSC measures impressions and clicks independently of your site&apos;s analytics setup, so discrepancies between the two can reveal whether the problem is real traffic loss or broken measurement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Separate traffic metrics from business metrics.&lt;/strong&gt; Look at revenue, transactions, and lead volume before fixating on session counts. A traffic drop with stable conversions is a different situation from a traffic drop with declining revenue, and the response should differ accordingly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Plan rollback criteria in advance.&lt;/strong&gt; Define the conditions under which you would revert the migration. Having this documented before launch removes the emotional decision-making that happens when graphs are falling and clients are calling.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Analytics config drift during migration.&lt;/strong&gt; GA4 cookie consent settings, event definitions, and tag placements frequently change during a site relaunch. If pre-migration and post-migration data aren&apos;t measuring the same thing, your recovery analysis is built on bad comparisons. Audit your analytics setup as a first step before drawing conclusions from the data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Premature panic over normal volatility.&lt;/strong&gt; Some traffic loss and ranking instability after a migration is expected. Google needs time to recrawl, reindex, and consolidate signals. Reacting to a two-week dip by making further changes can compound the problem. Stick to your pre-defined thresholds and timelines before making additional interventions.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/migration-traffic-drops-need-pre-defined-thresholds-not-panic.webp" medium="image" type="image/webp"/></item><item><title>Screaming Frog Log File Analyser 7.0 verifies AI bot identity</title><link>https://technicalseonews.com/latest/screaming-frog-log-file-analyser-7-0-verifies-ai-bot-identity</link><guid isPermaLink="true">https://technicalseonews.com/latest/screaming-frog-log-file-analyser-7-0-verifies-ai-bot-identity</guid><description>Screaming Frog Log File Analyser 7.0 now verifies AI bot identity using IP ranges and reverse DNS, so you can separate genuine crawlers from spoofed ones.</description><pubDate>Thu, 30 Apr 2026 04:12:32 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.screamingfrog.co.uk/blog/log-file-analyser-7-0/&quot;&gt;Screaming Frog released Log File Analyser 7.0&lt;/a&gt; on April 29, 2026 with bot verification for AI crawlers as the headline feature. The update lets practitioners confirm whether an AI bot hitting their server is genuine or spoofed, using the same verification flow that already existed for search engine bots like Googlebot.&lt;/p&gt;
&lt;p&gt;The release also includes user agent grouping, customizable verification methods, unknown user agent discovery, project import/export, Google Sheets export, and new time series charts.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;/latest/openai-crawl-activity-tripled-after-gpt-5-led-by-search-bot&quot;&gt;AI bot traffic is growing&lt;/a&gt;, and so is the number of user agents claiming to be AI crawlers. Until now, verifying whether a request actually came from the bot it claimed to be was straightforward for search engine crawlers but not for AI bots in the Log File Analyser.&lt;/p&gt;
&lt;p&gt;Google publishes &lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot&quot;&gt;JSON files of IP ranges and reverse DNS patterns&lt;/a&gt; for verifying its own crawlers. Googlebot verification uses reverse DNS lookups against &lt;code&gt;googlebot.com&lt;/code&gt; or &lt;code&gt;geo.googlebot.com&lt;/code&gt; domains. &lt;a href=&quot;/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers&quot;&gt;AI bot operators like OpenAI, Anthropic&lt;/a&gt;, and others publish similar verification methods, but tracking all of them manually is tedious.&lt;/p&gt;
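&lt;p&gt;The verification flow itself is something you can reproduce outside the tool. A rough Node sketch for Googlebot, using the hostname suffix the documentation describes (other operators publish their own patterns or IP lists): reverse-resolve the requesting IP, check the hostname, then forward-resolve it and confirm it maps back to the same IP.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// verify-bot.mjs: confirm a hit claiming to be Googlebot really came from Google.
import { reverse, lookup } from &apos;node:dns/promises&apos;;

export async function isVerifiedGooglebot(ip) {
  try {
    const hostnames = await reverse(ip);
    const host = hostnames.find(function (h) {
      return h.endsWith(&apos;.googlebot.com&apos;);
    });
    if (!host) return false;

    // Forward-confirm so a spoofed PTR record cannot pass the check.
    const forward = await lookup(host);
    return forward.address === ip;
  } catch {
    return false;
  }
}
&lt;/code&gt;&lt;/pre&gt;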
&lt;p&gt;The new customizable verification feature is particularly useful. Practitioners can input IP ranges from a JSON URL, configure reverse DNS patterns, use ASN lookups, or define static IP ranges for any user agent. If an AI bot provider changes its verification method, you can update the configuration immediately without waiting for a new Screaming Frog release.&lt;/p&gt;
&lt;p&gt;The user agent grouping feature addresses a related pain point. The number of distinct bots crawling any given site has multiplied. Grouping them into categories like &quot;All Search Bots&quot; or &quot;All AI Bots&quot; makes log analysis faster when you&apos;re trying to understand &lt;a href=&quot;/latest/ai-bot-traffic-starves-googlebot-of-crawl-budget-on-large-sites&quot;&gt;crawl budget consumption across bot types&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The unknown user agent discovery setting also fills a gap. Enabling &quot;Include unknown User Agents&quot; during project setup surfaces bots that don&apos;t match any predefined user agent in your list. These could be scrapers, undocumented AI crawlers, or other automated traffic you might want to monitor or block.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Verify your AI bot traffic.&lt;/strong&gt; After uploading log files in version 7.0, run verification via &quot;Project &amp;gt; Verify Bots&quot; to separate genuine AI crawlers from spoofed ones. Filter by verification status to see which bots are real.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Set up custom verification for new bots.&lt;/strong&gt; When you encounter a new AI bot, check the bot operator&apos;s documentation for their published IP ranges or DNS patterns. Add these as a custom user agent with the appropriate verification method (JSON URL, reverse DNS, ASN, or static IP ranges).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enable unknown user agent discovery.&lt;/strong&gt; In the User Agents tab of a new project, turn on &quot;Include unknown User Agents&quot; to catch bots that aren&apos;t in your predefined list. Review these periodically to decide whether to add them to your monitoring list or block them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use grouping to track crawl share.&lt;/strong&gt; The Overview tab now shows top-level statistics by bot group. Check how much of your crawl traffic comes from AI bots versus search bots. The proportion matters for capacity planning and robots.txt decisions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Try the time series charts for diagnostics.&lt;/strong&gt; The new lower time series tab across various views shows response codes, bytes, and average response times over time. Use these to spot sudden spikes in bot activity or changes in server response behavior.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Verification is only as good as the published data.&lt;/strong&gt; If an AI bot operator doesn&apos;t publish IP ranges or DNS patterns, you can&apos;t verify their traffic. Some newer or smaller AI crawlers may not offer any verification method yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unknown user agents can be noisy.&lt;/strong&gt; Enabling the unknown user agent setting will surface all unrecognized traffic, including browsers with unusual user agent strings. Expect to do some manual filtering before the data is useful.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/screaming-frog-log-file-analyser-7-0-verifies-ai-bot-identity.webp" medium="image" type="image/webp"/></item><item><title>Declarative Shadow DOM cuts render-blocking JS</title><link>https://technicalseonews.com/latest/declarative-shadow-dom-cuts-render-blocking-js</link><guid isPermaLink="true">https://technicalseonews.com/latest/declarative-shadow-dom-cuts-render-blocking-js</guid><description>Declarative Shadow DOM removes render-blocking JavaScript by defining shadow roots in HTML instead, improving Web Components performance across all major browsers.</description><pubDate>Wed, 29 Apr 2026 10:18:36 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Declarative Shadow DOM (DSD) now lets developers define shadow roots directly in HTML, removing the JavaScript dependency that previously blocked server-side rendering of &lt;a href=&quot;/latest/scoped-custom-element-registries-can-silently-break-crawlability&quot;&gt;Web Components&lt;/a&gt;. A &lt;a href=&quot;https://www.debugbear.com/blog/declarative-shadow-dom&quot;&gt;detailed writeup from DebugBear&lt;/a&gt; explains how the feature works and why it matters for performance.&lt;/p&gt;
&lt;p&gt;With traditional (imperative) shadow DOM, shadow roots could only be attached via JavaScript. The browser had to download, parse, and execute a JS bundle before a component&apos;s structure and styles became visible. DSD changes that by moving shadow root attachment into HTML parsing itself, using the &lt;code&gt;shadowrootmode&lt;/code&gt; attribute on the &lt;code&gt;&amp;lt;template&amp;gt;&lt;/code&gt; element:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;user-card&amp;gt;
  &amp;lt;template shadowrootmode=&quot;open&quot;&amp;gt;
    &amp;lt;style&amp;gt;/* scoped styles */&amp;lt;/style&amp;gt;
    &amp;lt;img src=&quot;jane.jpg&quot; alt=&quot;Jane Smith&quot; /&amp;gt;
    &amp;lt;h2&amp;gt;Jane Smith&amp;lt;/h2&amp;gt;
    &amp;lt;p&amp;gt;Lead Engineer&amp;lt;/p&amp;gt;
  &amp;lt;/template&amp;gt;
&amp;lt;/user-card&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the browser encounters &lt;code&gt;&amp;lt;template shadowrootmode=&quot;open&quot;&amp;gt;&lt;/code&gt;, it moves the template&apos;s children into a shadow root and removes the &lt;code&gt;&amp;lt;template&amp;gt;&lt;/code&gt; element during parsing. No JavaScript runs. The component renders with full structure and scoped styles on first paint.&lt;/p&gt;
&lt;p&gt;The feature is now available across all major browsers. &lt;a href=&quot;https://developer.chrome.com/docs/css-ui/declarative-shadow-dom&quot;&gt;Chrome&apos;s documentation&lt;/a&gt; notes that Chrome has supported DSD since version 90 under the older &lt;code&gt;shadowroot&lt;/code&gt; attribute; the attribute was renamed to &lt;code&gt;shadowrootmode&lt;/code&gt; when the specification changed in 2023. Chrome 111+, Firefox 123, and Safari 16.4 all support the standardized attribute, and the feature became &lt;a href=&quot;https://html.spec.whatwg.org/multipage/scripting.html#the-template-element&quot;&gt;Baseline Newly Available&lt;/a&gt; as of August 5, 2024.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;/latest/google-drops-no-js-testing-advice-from-javascript-seo-docs&quot;&gt;Server-side rendering of Web Components&lt;/a&gt; was effectively impossible with imperative shadow DOM. Encapsulation was lost on the server because shadow roots couldn&apos;t be serialized to HTML. The client had to reconstruct them with JavaScript, which blocked the critical rendering path.&lt;/p&gt;
&lt;p&gt;DSD removes that bottleneck. Content inside shadow roots becomes visible as soon as the HTML streams in. For sites using Web Components heavily, this can meaningfully improve Largest Contentful Paint and reduce Total Blocking Time, since &lt;a href=&quot;/latest/next-js-streaming-metadata-fails-google-indexing&quot;&gt;render-blocking JS&lt;/a&gt; is no longer required for initial component display.&lt;/p&gt;
&lt;p&gt;The progressive enhancement model is clean. Server-rendered Web Components are fully styled and readable before any JavaScript loads. Client-side JS can then attach event listeners and add interactivity to the existing shadow root without rebuilding it.&lt;/p&gt;
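&lt;p&gt;In code, that hydration step looks roughly like the sketch below (the element name and listener are illustrative): the component reuses the declaratively created shadow root when one exists and only attaches a new one when it doesn&apos;t.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hydrate a server-rendered user-card without rebuilding its shadow root.
class UserCard extends HTMLElement {
  connectedCallback() {
    // Declarative shadow DOM already created the root; fall back to
    // imperative attachment only for client-only renders or old browsers.
    const root = this.shadowRoot || this.attachShadow({ mode: &apos;open&apos; });

    // Layer interactivity on top of the already-visible markup.
    const heading = root.querySelector(&apos;h2&apos;);
    if (heading) {
      heading.addEventListener(&apos;click&apos;, function () {
        console.log(&apos;card selected&apos;);
      });
    }
  }
}

customElements.define(&apos;user-card&apos;, UserCard);
&lt;/code&gt;&lt;/pre&gt;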
&lt;p&gt;For teams choosing between framework-based components (React, Vue) and native Web Components, DSD closes a major gap. Web Components already avoided third-party library overhead. Now they can also match framework SSR capabilities using only what the browser provides natively.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Check your Web Component rendering path.&lt;/strong&gt; If your components use imperative &lt;code&gt;attachShadow()&lt;/code&gt; and you&apos;re seeing render delays tied to JS execution, DSD is worth adopting. Replace JavaScript-attached shadow roots with &lt;code&gt;&amp;lt;template shadowrootmode=&quot;open&quot;&amp;gt;&lt;/code&gt; in your server-rendered HTML.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit your SSR pipeline.&lt;/strong&gt; Your server or static site generator needs to output the &lt;code&gt;&amp;lt;template shadowrootmode=&quot;open&quot;&amp;gt;&lt;/code&gt; markup inline. If you&apos;re using a framework like Lit, check whether it already supports DSD output during SSR.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose open vs. closed mode deliberately.&lt;/strong&gt; The &lt;code&gt;shadowrootmode&lt;/code&gt; attribute accepts both &lt;code&gt;open&lt;/code&gt; and &lt;code&gt;closed&lt;/code&gt;. For most use cases, &lt;code&gt;open&lt;/code&gt; is the practical default. Closed mode prevents external JS from accessing the shadow root via &lt;code&gt;element.shadowRoot&lt;/code&gt;, which adds encapsulation for third-party embeds but makes debugging harder.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consider polyfill needs.&lt;/strong&gt; Browser support is broad enough that most production sites can skip a polyfill. If you need to support older browsers, DebugBear points to a simple polyfill option.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Measure the impact.&lt;/strong&gt; Compare LCP and TBT before and after migrating components to DSD. The improvement will be most visible on pages where Web Component JS was previously in the critical rendering path.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The old &lt;code&gt;shadowroot&lt;/code&gt; attribute is deprecated.&lt;/strong&gt; The spec renamed it to &lt;code&gt;shadowrootmode&lt;/code&gt; in 2023. If you copied code from older tutorials, the attribute name may be wrong and the browser will not create a shadow root.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DSD templates are consumed during parsing.&lt;/strong&gt; The browser removes the &lt;code&gt;&amp;lt;template&amp;gt;&lt;/code&gt; element after attaching the shadow root. If your JS expects to find that template in the DOM later (e.g., to clone it), it won&apos;t be there. Imperative and declarative shadow DOM have different lifecycle expectations.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/declarative-shadow-dom-cuts-render-blocking-js.webp" medium="image" type="image/webp"/></item><item><title>APAC search fragments across Bing, Naver, AI, and super-apps</title><link>https://technicalseonews.com/latest/apac-search-fragments-across-bing-naver-ai-and-super-apps</link><guid isPermaLink="true">https://technicalseonews.com/latest/apac-search-fragments-across-bing-naver-ai-and-super-apps</guid><description>Bing holds 32% share in Japan and Naver rivals Google in South Korea, while telecom-bundled AI tools reach hundreds of millions across the region.</description><pubDate>Wed, 29 Apr 2026 04:30:33 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Search strategy in Asia-Pacific can no longer rely on Google alone. &lt;a href=&quot;https://www.searchenginejournal.com/apac-search-strategy-goes-beyond-google-baidu/571729/&quot;&gt;Motoko Hunt&apos;s analysis for Search Engine Journal&lt;/a&gt; lays out how discovery across APAC has fragmented across local search engines, AI answer systems, and super-app platforms.&lt;/p&gt;
&lt;p&gt;The market share numbers tell the story clearly. In Japan, Bing holds 31.63% of search share alongside Google&apos;s 59.58%. In South Korea, Google (46.81%) and Naver (43.96%) operate at near parity. Even in Vietnam, local engine CocCoc holds 5.34%, enough to matter in competitive categories.&lt;/p&gt;
&lt;p&gt;Beyond traditional search engines, telecom providers are accelerating AI adoption by bundling tools into existing plans. Bharti Airtel partnered with Perplexity to distribute its Pro offering to roughly 360 million users in India. Reliance Jio is distributing Google&apos;s Gemini AI access across more than 500 million users. SK Telecom partnered with Perplexity in South Korea. Users aren&apos;t seeking out these tools. The tools are pre-installed.&lt;/p&gt;
&lt;p&gt;Super-apps add another layer. KakaoTalk in South Korea and LINE in Japan function as discovery platforms, not just messaging apps. Hunt notes that Japanese TV commercials now direct users to LINE accounts rather than websites or app downloads.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Most global SEO teams still treat APAC as an extension of their Google strategy. Hunt&apos;s data shows that approach misses a significant share of discovery traffic in every major APAC market.&lt;/p&gt;
&lt;p&gt;The Bing share in Japan is the clearest action item for technical SEOs. At nearly 32%, ignoring Bing means ignoring roughly a third of Japanese search traffic. Bing handles several technical signals differently from Google. &lt;a href=&quot;https://blogs.bing.com/webmaster/April-2013/Announcing-Malware-Re-Evaluation-and-Geo-Targeting&quot;&gt;Bing Webmaster Tools offers geo-targeting configuration&lt;/a&gt; that lets you specify target audiences at the site or directory level. Hreflang, meanwhile, won&apos;t help you on Bing. &lt;a href=&quot;https://mangools.com/blog/hreflang/&quot;&gt;Mangools&apos; documentation on hreflang&lt;/a&gt; notes that Bing does not use hreflang tags and instead relies on its own geo-targeting settings and the content-language meta tag.&lt;/p&gt;
&lt;p&gt;The telco distribution model changes the competitive picture in a way that&apos;s hard to overstate. When 500 million users get Gemini bundled into their Jio plan, adoption doesn&apos;t follow the usual curve. It happens almost overnight. For search teams, &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;content that performs well in AI answer systems&lt;/a&gt; becomes a visibility requirement, not a nice-to-have.&lt;/p&gt;
&lt;p&gt;Super-app discovery means some users never touch a search engine at all. If your brand&apos;s decision point happens inside LINE or KakaoTalk, traditional SERP rankings are irrelevant to that segment.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your APAC traffic by engine.&lt;/strong&gt; Check analytics for Bing, Naver, CocCoc, and other local engines. If you&apos;re only tracking Google, you&apos;re flying blind in markets where Google holds 60% or less.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Set up Bing Webmaster Tools for Japanese and other APAC properties.&lt;/strong&gt; Configure geo-targeting at the directory or subdomain level. Don&apos;t rely on hreflang alone for Bing. Use the &lt;code&gt;content-language&lt;/code&gt; meta tag and Bing&apos;s own geo-targeting settings.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Structure content for AI answer systems.&lt;/strong&gt; Use &lt;a href=&quot;https://schema.org/&quot;&gt;schema.org&lt;/a&gt; markup (Product, Organization, Article) so AI-driven interfaces can parse your content. Clean, well-structured pages with clear headings and factual claims may be easier for AI systems to cite, though the ranking factors for AI citations are not yet well understood.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Evaluate super-app presence in your target markets.&lt;/strong&gt; If you&apos;re targeting Japan, check whether a LINE official account makes sense for your brand. For South Korea, assess KakaoTalk. The question Hunt poses is the right one: not &quot;how do we rank?&quot; but &quot;where do we need to exist?&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Review Naver-specific requirements for South Korea.&lt;/strong&gt; Naver has its own webmaster tools, content ranking logic, and blog/cafe ecosystem. A Google-only technical setup won&apos;t transfer.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Hreflang doesn&apos;t work on Bing.&lt;/strong&gt; Teams that rely solely on hreflang for language and region targeting will find it has no effect on Bing&apos;s results. Bing uses its own geo-targeting tool and the content-language meta tag instead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Telco-bundled AI tools bypass your traditional funnel.&lt;/strong&gt; When Perplexity or Gemini answers a user&apos;s question directly inside a telco&apos;s ecosystem, there may be no click to your site at all. Monitor whether your content is being cited in AI answers, not just whether you rank in traditional SERPs.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/apac-search-fragments-across-bing-naver-ai-and-super-apps.webp" medium="image" type="image/webp"/></item><item><title>Google warns sites before back button hijacking penalty</title><link>https://technicalseonews.com/latest/google-warns-sites-before-back-button-hijacking-penalty</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-warns-sites-before-back-button-hijacking-penalty</guid><description>Google is sending Search Console warnings to sites that hijack the back button before June 15, 2026 enforcement begins. Check for History API misuse in your code.</description><pubDate>Wed, 29 Apr 2026 04:21:55 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google has started sending email warnings through Search Console to sites that hijack the browser&apos;s back button. The notifications include sample URLs, links to the spam policy, and a reminder that enforcement begins June 15, 2026. &lt;a href=&quot;https://www.seroundtable.com/google-warning-back-button-hijacking-spam-penalty-41233.html&quot;&gt;SE Roundtable reported&lt;/a&gt; on the warnings on April 28.&lt;/p&gt;
&lt;p&gt;Glenn Gabe shared a screenshot of the email on X. He noted that Google is sending these to &quot;sites that are actively hijacking the back button&quot; and that the emails include sample URLs along with links to the blog post about the new spam policy.&lt;/p&gt;
&lt;p&gt;The email&apos;s subject line reads: &quot;Warning: Your site may be in violation of Google&apos;s Back Button Hijacking policy.&quot; It tells site owners that Google has detected pages exhibiting back button hijacking behavior, which violates a newly launched spam policy on malicious practices. No manual action has been taken yet, but the email urges sites to fix the issue before the June 15 enforcement date.&lt;/p&gt;
&lt;p&gt;One important detail from the email: changes made on or after April 17, 2026 are not reflected in the notification. Google says it will re-verify compliance before taking any manual action.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Back button hijacking is a pattern where a site interferes with browser navigation, preventing users from returning to the page they came from. Google&apos;s definition covers scenarios where users get sent to pages they never visited, are shown unsolicited ads, or are otherwise blocked from normal browsing.&lt;/p&gt;
&lt;p&gt;The technique typically abuses the browser&apos;s &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/History_API&quot;&gt;History API&lt;/a&gt;. Sites can call &lt;code&gt;history.pushState()&lt;/code&gt; to inject fake entries into the session history stack. When a user hits the back button, the browser traverses to one of these injected entries instead of returning to the previous page. Some implementations use &lt;code&gt;popstate&lt;/code&gt; event listeners to intercept back-button presses and redirect users elsewhere.&lt;/p&gt;
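&lt;p&gt;For recognition purposes only, the pattern typically looks something like this simplified reproduction (the destination URL is hypothetical): an extra entry is pushed on load, then a &lt;code&gt;popstate&lt;/code&gt; handler sends the user somewhere else instead of letting the browser go back.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Simplified reproduction of the hijack pattern, shown so you can spot it
// in your own bundles and third-party tags. Do not ship anything like this.
history.pushState({ trap: true }, &apos;&apos;, location.href); // fake entry added on load

window.addEventListener(&apos;popstate&apos;, function () {
  // The back button now lands on the injected entry, and this handler
  // redirects instead of returning the user to where they came from.
  location.href = &apos;https://ads.example.com/interstitial&apos;;
});
&lt;/code&gt;&lt;/pre&gt;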
&lt;p&gt;The &lt;a href=&quot;https://html.spec.whatwg.org/multipage/history.html&quot;&gt;WHATWG HTML spec&lt;/a&gt; defines how session history entries and traversal work. The spec was designed to let single-page applications manage navigation state. Back button hijacking exploits that flexibility for deceptive purposes.&lt;/p&gt;
&lt;p&gt;SE Roundtable&apos;s Barry Schwartz noted that &quot;a number of SEOs are posting screenshots of clients or former clients that received this notification.&quot; The volume of warnings suggests &lt;a href=&quot;/latest/google-s-web-bot-auth-adds-cryptographic-bot-identity&quot;&gt;Google is casting a wide net before enforcement begins&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sites that rely on aggressive interstitials, ad-driven redirects, or certain pop-up implementations should pay close attention. Some implementations may hijack the back button unintentionally, particularly those using &lt;code&gt;pushState&lt;/code&gt; for modal windows or overlay ad units.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;/latest/gsc-shows-pages-as-indexed-but-google-won-t-serve-them&quot;&gt;Check your Search Console email for the warning&lt;/a&gt;. If you received one, Google has already flagged specific URLs. Start with those.&lt;/p&gt;
&lt;p&gt;Search your codebase and third-party scripts for these patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;history.pushState()&lt;/code&gt; calls that fire on page load&lt;/strong&gt; without corresponding user-initiated navigation. These inject fake history entries that trap users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;popstate&lt;/code&gt; event listeners&lt;/strong&gt; that redirect users or load new content instead of allowing normal back-button behavior.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;beforeunload&lt;/code&gt; or &lt;code&gt;unload&lt;/code&gt; handlers&lt;/strong&gt; combined with history manipulation. Some implementations use these to intercept navigation attempts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Third-party ad scripts or analytics tags&lt;/strong&gt; that modify the history stack. Audit tag manager containers for scripts you didn&apos;t write.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test affected pages manually. Navigate to the page from a search result, then press the back button. If you don&apos;t return to the search results page immediately, something is hijacking the back button.&lt;/p&gt;
&lt;p&gt;If you use &lt;code&gt;pushState&lt;/code&gt; legitimately for single-page app routing, confirm that each pushed state corresponds to real content the user intentionally navigated to. The violation targets deceptive history manipulation, not standard SPA behavior.&lt;/p&gt;
&lt;p&gt;Fix issues before June 15, 2026. Google&apos;s email confirms it will re-verify before taking action, so you have time. But the April 17 cutoff for detection means recent fixes won&apos;t appear in the current warning.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Third-party scripts you forgot about.&lt;/strong&gt; Ad networks, affiliate tools, and pop-up plugins are common sources of back button hijacking. The code may not be in your repository at all. Audit every external script loaded on flagged URLs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SPAs with aggressive preloading.&lt;/strong&gt; Some single-page application frameworks push history entries for prefetched routes. If a user hasn&apos;t actually navigated to a route, that pushed state can look like hijacking to Google&apos;s detection system. Review your router configuration.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/google-warns-sites-before-back-button-hijacking-penalty.webp" medium="image" type="image/webp"/></item><item><title>OpenAI crawl activity tripled after GPT-5, led by search bot</title><link>https://technicalseonews.com/latest/openai-crawl-activity-tripled-after-gpt-5-led-by-search-bot</link><guid isPermaLink="true">https://technicalseonews.com/latest/openai-crawl-activity-tripled-after-gpt-5-led-by-search-bot</guid><description>OAI-SearchBot now generates more log events than GPTBot after a 3.5x post-GPT-5 surge, and each bot has its own robots.txt directive you need to manage.</description><pubDate>Wed, 29 Apr 2026 04:10:43 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;OpenAI&apos;s automated crawl activity roughly tripled after the August 2025 launch of GPT-5, according to &lt;a href=&quot;https://www.searchenginejournal.com/openai-crawl-activity-tripled-since-gpt-5-data-shows/573316/&quot;&gt;an analysis published by Botify&lt;/a&gt; and guest author Chris Long, co-founder of SEO consultancy Nectiv. Long analyzed approximately 7 billion OpenAI bot log events from Botify&apos;s enterprise client dataset, spanning November 2024 through March 2026.&lt;/p&gt;
&lt;p&gt;OAI-SearchBot, which retrieves content when ChatGPT performs web searches, recorded about 3.5x more events after August 2025. That works out to roughly 2.2 billion additional events. GPTBot, the training data crawler, saw about 2.9x more events over the same period, adding another 1.8 billion events.&lt;/p&gt;
&lt;p&gt;The third user agent, ChatGPT-User, moved in the opposite direction. Long reports a 28% drop in ChatGPT-User log events between December 2025 and March 2026. ChatGPT-User fires when a ChatGPT session fetches a page on behalf of a logged-in user. Long offers two possible explanations: fewer sessions may be triggering real-time fetches, or OpenAI may be relying more on stored or indexed resources.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;OAI-SearchBot now generates more log events than GPTBot in Botify&apos;s dataset. Before GPT-5, the two bots ran at roughly even volumes, with a ratio of about 0.95 search events per training event. After GPT-5, that ratio rose to about 1.14.&lt;/p&gt;
&lt;p&gt;The shift matters for robots.txt decisions. &lt;a href=&quot;https://platform.openai.com/docs/bots&quot;&gt;OpenAI&apos;s own documentation&lt;/a&gt; confirms that each bot&apos;s robots.txt directive is independent. Sites blocking only GPTBot are not blocking the bot OpenAI uses to surface pages in ChatGPT search answers. Conversely, sites blocking OAI-SearchBot may be excluding themselves from ChatGPT search results entirely, though they can still appear as navigational links.&lt;/p&gt;
&lt;p&gt;The post-GPT-5 increases varied by industry. Healthcare sites saw about 740% more OAI-SearchBot activity. Media and publishing saw 702%. Marketplaces, software, and retail ranged from 190–216%. Travel sites had the smallest rise at 30%.&lt;/p&gt;
&lt;p&gt;Long also found that the balance between search and training crawls differs by vertical. Media and publishing showed the largest gap favoring OAI-SearchBot (+256% over GPTBot). Healthcare and retail leaned toward GPTBot. Botify and Long suggest OpenAI routes different prompt types to different crawlers: news queries trigger live search, while health and product queries draw more on trained knowledge.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/latest/ai-bot-traffic-starves-googlebot-of-crawl-budget-on-large-sites&quot;&gt;Even after tripling, OpenAI&apos;s crawl volume&lt;/a&gt; is small compared to Google&apos;s. In Botify&apos;s most recent 30-day window, Googlebot registered 18.2 billion events versus 887 million from all OpenAI crawlers combined. That puts OpenAI at about 5% of Google&apos;s crawl volume, up from 1.38% a year earlier. Bingbot registered about 5.49 billion events, making OpenAI roughly 16% of Bing&apos;s volume.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Review your &lt;a href=&quot;/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers&quot;&gt;robots.txt for all three OpenAI user&lt;/a&gt; agents separately.&lt;/strong&gt; If you want to appear in ChatGPT search answers, make sure OAI-SearchBot is allowed. Blocking GPTBot alone does not block ChatGPT search crawling. OpenAI&apos;s documentation states that robots.txt changes take about 24 hours to take effect in their systems.&lt;/p&gt;
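&lt;p&gt;A minimal robots.txt sketch that treats the three user agents independently; the policy shown (allow search and user-initiated fetches, block training) is one example, not a recommendation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Allow ChatGPT search to fetch and cite pages
User-agent: OAI-SearchBot
Allow: /

# Allow fetches made on behalf of ChatGPT users
User-agent: ChatGPT-User
Allow: /

# Keep content out of training crawls
User-agent: GPTBot
Disallow: /
&lt;/code&gt;&lt;/pre&gt;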
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/managed-wordpress-hosts-silently-block-ai-crawlers&quot;&gt;Check your server logs for OAI-SearchBot&lt;/a&gt; and GPTBot activity.&lt;/strong&gt; Compare volumes before and after August 2025 to see whether the tripling pattern holds for your site. If you&apos;re seeing large increases, verify your server can handle the additional load.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Segment bot activity by section or content type.&lt;/strong&gt; The industry-level data suggests OpenAI&apos;s crawlers focus differently depending on content. Understanding which sections attract OAI-SearchBot versus GPTBot can inform your robots.txt strategy at the directory level.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Factor in the dataset&apos;s limitations.&lt;/strong&gt; Botify&apos;s data skews toward large enterprise sites. Smaller sites may see different patterns. Botify also sells log file analysis and AI bot management software, and the original post promotes a webinar and product demo.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Blocking the wrong bot.&lt;/strong&gt; Many sites added GPTBot blocks when it launched, assuming it covered all OpenAI crawling. OAI-SearchBot is a separate user agent with a separate robots.txt directive. If you blocked GPTBot but never addressed OAI-SearchBot, your content stays out of training crawls but can still be crawled and surfaced in ChatGPT search. The reverse also holds: blocking OAI-SearchBot while allowing GPTBot removes you from ChatGPT search results, yet your content can still be used for model training.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misreading ChatGPT-User drops as declining ChatGPT usage.&lt;/strong&gt; The 28% decline in ChatGPT-User events measures logged user-initiated fetches, not overall ChatGPT traffic or interest. OpenAI may simply be caching or pre-indexing content more aggressively, reducing the need for real-time fetches.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/openai-crawl-activity-tripled-after-gpt-5-led-by-search-bot.webp" medium="image" type="image/webp"/></item><item><title>Yoast SEO Abilities API exposes content scores to external tools</title><link>https://technicalseonews.com/latest/yoast-seo-abilities-api-exposes-content-scores-to-external-tools</link><guid isPermaLink="true">https://technicalseonews.com/latest/yoast-seo-abilities-api-exposes-content-scores-to-external-tools</guid><description>Yoast SEO&apos;s Abilities API exposes content analysis scores to external tools via REST, eliminating manual exports and custom code for dashboards and workflows.</description><pubDate>Tue, 28 Apr 2026 10:32:58 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Yoast SEO released its &lt;a href=&quot;https://yoast.com/yoast-seo-april-28-2026/&quot;&gt;Abilities API&lt;/a&gt; on April 28, 2026, giving external tools programmatic access to content analysis scores. The API is built on top of WordPress 6.9&apos;s new capabilities framework.&lt;/p&gt;
&lt;p&gt;Compatible tools can now pull three data points from recent posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SEO scores and focus keyphrases&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Readability scores&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inclusive language scores&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The API works with AI assistants, automated workflows, and custom dashboards. According to Yoast&apos;s announcement, connected tools can read these scores without custom integrations or manual exports. Yoast points users to its &lt;a href=&quot;https://developer.yoast.com/&quot;&gt;developer documentation&lt;/a&gt; for endpoint details, data schemas, and implementation specifics.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Until now, getting Yoast scores out of WordPress required either scraping the admin UI, querying post meta directly, or building a custom solution. The Abilities API creates a supported, documented path for that data to flow into external systems.&lt;/p&gt;
&lt;p&gt;For teams running content operations at scale, the practical use case is reporting. Pulling SEO and readability scores into a centralized dashboard no longer requires custom code against undocumented post meta fields. Yoast&apos;s announcement specifically highlights the ability to feed scores into dashboards and &lt;a href=&quot;/latest/claude-plugin-for-gsc-and-ads-won-t-replace-your-seo-stack&quot;&gt;reporting tools without manual CSV exports&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The AI workflow angle is newer territory. Yoast describes scenarios where &lt;a href=&quot;/latest/se-ranking-mcp-server-enables-agentic-seo-via-claude-code&quot;&gt;AI agents flag trends across recent posts&lt;/a&gt; or answer questions like &quot;How is my SEO health looking this week?&quot; The value here depends on what actions those agents can take. The API appears to be read-only for now, exposing scores but not allowing tools to write or update them.&lt;/p&gt;
&lt;p&gt;The competitive context matters too. &lt;a href=&quot;https://rankmath.com/kb/headless-cms-support/&quot;&gt;Rank Math&apos;s headless CMS support&lt;/a&gt; already exposes SEO meta tags through a REST API endpoint, but that focuses on front-end output (titles, descriptions, canonical URLs) rather than content analysis scores. A &lt;a href=&quot;https://support.rankmath.com/ticket/custom-integration-through-wp-rest-api/&quot;&gt;Rank Math support thread&lt;/a&gt; confirms that Rank Math does not currently offer a direct REST API endpoint for updating SEO attributes like focus keywords or meta descriptions. Yoast&apos;s Abilities API takes a different approach by exposing the editorial scoring layer rather than the rendered meta output.&lt;/p&gt;
&lt;p&gt;The dependency on WordPress 6.9 is worth flagging. Sites still running older WordPress versions won&apos;t have access to the underlying capabilities framework the API relies on.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Check your WordPress version.&lt;/strong&gt; The Abilities API requires WordPress 6.9. Confirm you&apos;re running it before attempting any integration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Review the developer docs.&lt;/strong&gt; Yoast&apos;s &lt;a href=&quot;https://developer.yoast.com/&quot;&gt;developer portal&lt;/a&gt; covers the REST API endpoints, data schema, and authentication requirements. If you&apos;re building integrations, start there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit your authentication setup.&lt;/strong&gt; Any external tool hitting the API will need proper credentials. WordPress supports &lt;a href=&quot;https://developer.wordpress.org/rest-api/reference/application-passwords/&quot;&gt;Application Passwords&lt;/a&gt; for REST API authentication, which is likely the mechanism here. Create dedicated application passwords for each integration rather than sharing credentials.&lt;/p&gt;
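&lt;p&gt;As a rough sketch of what that setup looks like in code, assuming Application Passwords over HTTPS: the domain and the route below are placeholders, not Yoast&apos;s documented endpoint, so check the developer docs for the real paths and response schema.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Minimal sketch: call the WordPress REST API with an Application Password.
// The route below is a placeholder, not Yoast&apos;s documented endpoint.
async function fetchYoastScores() {
  const user = &apos;reporting-bot&apos;;
  const appPassword = &apos;xxxx xxxx xxxx xxxx xxxx xxxx&apos;; // generated on the WP user profile screen
  const auth = btoa(user + &apos;:&apos; + appPassword);

  const response = await fetch(&apos;https://example.com/wp-json/yoast/v1/recent-scores&apos;, {
    headers: { Authorization: &apos;Basic &apos; + auth }
  });
  return response.json();
}

fetchYoastScores().then(console.log);&lt;/code&gt;&lt;/pre&gt;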
&lt;p&gt;&lt;strong&gt;Decide what&apos;s worth automating.&lt;/strong&gt; The strongest immediate use case is pulling scores into reporting dashboards. If your team already runs content audits manually by checking Yoast scores post-by-post, the API can replace that workflow. The AI assistant use cases Yoast describes are plausible but depend on your existing tooling.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t expect write access.&lt;/strong&gt; The announcement describes the API as exposing scores to external tools. There&apos;s no mention of endpoints for updating focus keyphrases, toggling analysis settings, or pushing content changes back into Yoast. Plan your workflows as read-only for now.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;WordPress 6.9 dependency.&lt;/strong&gt; The Abilities API is tied to WordPress 6.9&apos;s capabilities framework. If your staging environment runs a different version than production, test against the correct version before building integrations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Score availability is limited to recent posts.&lt;/strong&gt; Yoast&apos;s announcement specifies that tools can pull scores &quot;from your most recent posts.&quot; The exact scope of &quot;recent&quot; is unclear. If you need historical scores across your full content library, verify the endpoint&apos;s coverage in the developer docs before committing to a reporting workflow built on top of it.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/yoast-seo-abilities-api-exposes-content-scores-to-external-tools.webp" medium="image" type="image/webp"/></item><item><title>Bing Webmaster Tools previews Citation Share for AI queries</title><link>https://technicalseonews.com/latest/bing-webmaster-tools-previews-citation-share-for-ai-queries</link><guid isPermaLink="true">https://technicalseonews.com/latest/bing-webmaster-tools-previews-citation-share-for-ai-queries</guid><description>Microsoft previewed Citation Share and three other AI reporting features for Bing Webmaster Tools, giving sites competitive context on Copilot citations.</description><pubDate>Tue, 28 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Microsoft previewed four new AI reporting features for Bing Webmaster Tools at SEO Week in New York City. Krishna Madhavan, Principal Product Manager at Microsoft AI and Bing, showed the additions during a presentation, &lt;a href=&quot;https://www.searchenginejournal.com/bing-previews-ai-citation-share-for-webmaster-tools/573169/&quot;&gt;according to Search Engine Journal&apos;s coverage&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The four features target the existing AI Performance dashboard:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Citation Share&lt;/strong&gt; would show the percentage of citations a site captures within a specific grounding query, sitting alongside the raw citation counts already in the dashboard.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grounding Query Intent&lt;/strong&gt; would classify queries into 15 predefined intent labels. Screenshots shared by attendees on X show labels including Learning, Informational Search, Navigational, Research, Comparison, Planning, Conversational, and Content Filtered.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grounding Query Topic&lt;/strong&gt; would group queries under topic labels, adding a second classification layer alongside intent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GEO-focused recommendations&lt;/strong&gt; would surface guidance tied to AI visibility. The slide showed recommendation areas covering content structure, crawlability, indexing and canonicalization signals, structured data adoption, and structured data quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Microsoft has not published an official blog post about these features. The details come from attendee screenshots of the presentation.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;AI Performance dashboard launched in public preview&lt;/a&gt; in February 2026 and gave sites their first look at how often Copilot and Bing AI summaries cite their content. Microsoft expanded it in March with a feature mapping grounding queries to specific cited pages. &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;Citation Share would add competitive context&lt;/a&gt; to those raw counts.&lt;/p&gt;
&lt;p&gt;Knowing you received 12 citations for a query is useful. Knowing those 12 citations represent 80% of all citations for that query is more useful. The share metric tells you whether you dominate a grounding query or split visibility with competitors.&lt;/p&gt;
&lt;p&gt;The intent and topic classifications address a real data problem. Grounding queries vary widely in phrasing, making trend analysis difficult. Grouping by intent and topic would let sites gauge visibility across categories rather than chasing individual query strings.&lt;/p&gt;
&lt;p&gt;The GEO recommendations are the least defined of the four. The visible labels suggest the focus areas are familiar SEO fundamentals: crawlability, indexing, canonicalization, and structured data. Microsoft hasn&apos;t explained how recommendations would be generated or triggered.&lt;/p&gt;
&lt;p&gt;The timing is notable. &lt;a href=&quot;https://searchengineland.com/google-ai-mode-traffic-data-search-console-457076&quot;&gt;Google has also started surfacing AI Mode traffic data in Search Console&lt;/a&gt;, though Google&apos;s approach uses traditional impression and click metrics rather than citation-specific reporting. Bing&apos;s Citation Share concept has no direct equivalent in Google&apos;s tooling yet.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;No action is needed right now. These are previews, not shipped features. Microsoft has not announced release dates for any of the four additions.&lt;/p&gt;
&lt;p&gt;If you haven&apos;t already, set up and verify your site in Bing Webmaster Tools. The AI Performance dashboard is already live and shows grounding query data. Familiarize yourself with the existing reports so you have baseline data when the new features roll out.&lt;/p&gt;
&lt;p&gt;Watch for official announcements on the Bing Webmaster blog or Microsoft Advertising blog confirming scope and timing. Until then, treat the screenshots as directional, not final.&lt;/p&gt;
&lt;p&gt;For sites already tracking AI citation performance, note the intent taxonomy. The 15 predefined labels suggest Microsoft is building a standardized classification system. Understanding which intent categories your content serves in AI results could shape content strategy once the labels ship.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Preview vs. reality.&lt;/strong&gt; The features were shown in a conference presentation, not an official product announcement. Feature scope, naming, and availability could change before release. Do not build reporting workflows around details that may shift.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GEO recommendations may be generic.&lt;/strong&gt; The visible recommendation areas (crawlability, structured data, canonicalization) overlap heavily with existing SEO best practices. Wait to see whether the actual recommendations are site-specific or boilerplate before adjusting priorities based on them.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/bing-webmaster-tools-previews-citation-share-for-ai-queries.webp" medium="image" type="image/webp"/></item><item><title>SE Ranking MCP server enables agentic SEO via Claude Code</title><link>https://technicalseonews.com/latest/se-ranking-mcp-server-enables-agentic-seo-via-claude-code</link><guid isPermaLink="true">https://technicalseonews.com/latest/se-ranking-mcp-server-enables-agentic-seo-via-claude-code</guid><description>SE Ranking&apos;s MCP server lets Claude Code run multi-step SEO workflows autonomously, pulling live keyword and competitive data without manual step supervision.</description><pubDate>Mon, 27 Apr 2026 14:52:09 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;SE Ranking has released an &lt;a href=&quot;https://seranking.com/blog/claude-code-for-seo/&quot;&gt;MCP server that connects Claude Code to its SEO data platform&lt;/a&gt;, giving practitioners access to keyword research, backlink analysis, competitive research, and AI search visibility data through an agentic terminal workflow. The integration uses the &lt;a href=&quot;https://modelcontextprotocol.io/&quot;&gt;Model Context Protocol&lt;/a&gt;, an open-source standard for connecting AI applications to external data sources and tools.&lt;/p&gt;
&lt;p&gt;The key distinction SE Ranking draws is between Claude Desktop (a chat interface where you prompt one task at a time) and Claude Code (a terminal-based agent that plans and executes multi-step workflows autonomously). With the MCP connection, Claude Code can pull live SE Ranking data, run analysis, and write results to files without the user managing each step manually.&lt;/p&gt;
&lt;p&gt;SE Ranking says the MCP server is read-only. It queries account data but cannot modify projects, campaigns, or settings. Claude Code also runs in a sandboxed environment that requires explicit user approval before writing files or running commands.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The gap between &quot;&lt;a href=&quot;/latest/claude-plugin-for-gsc-and-ads-won-t-replace-your-seo-stack&quot;&gt;using AI for SEO&lt;/a&gt;&quot; and &quot;automating SEO workflows with AI&quot; is mostly an execution problem. Chat-based AI tools hit a ceiling when a task involves more than a few steps. Context windows overflow, and practitioners end up copy-pasting intermediate results back into the conversation.&lt;/p&gt;
&lt;p&gt;Claude Code&apos;s agentic model sidesteps that by saving intermediate results to files and managing its own memory. Pairing it with live SEO data through MCP means a practitioner can set an objective like &quot;find keyword gaps, propose article ideas, save everything to files&quot; and review the final output rather than supervising each step.&lt;/p&gt;
&lt;p&gt;SE Ranking&apos;s MCP is particularly interesting because it combines traditional SEO metrics with &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;AI search data&lt;/a&gt;. The server surfaces brand mention rates across ChatGPT, Gemini, Perplexity, AI Overviews, and AI Mode. Practitioners tracking visibility across both classic and AI search can query both datasets in a single workflow.&lt;/p&gt;
&lt;p&gt;The MCP standard itself is gaining traction beyond SE Ranking. The protocol functions as a standardized bridge between AI applications and external systems. SE Ranking&apos;s adoption signals that SEO tool vendors are beginning to build for &lt;a href=&quot;/latest/semrush-launches-ai-agent-readiness-audits-for-technical-seo&quot;&gt;agentic use cases&lt;/a&gt; rather than just chat-based ones.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Try the setup if you have both accounts.&lt;/strong&gt; SE Ranking says configuration takes about 10 minutes. You need an SE Ranking account and Claude Code access. Prompts are plain English, not code.&lt;/p&gt;
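&lt;p&gt;For orientation, MCP servers are typically registered through a small JSON entry the client reads at startup. The sketch below is generic: the command, package name, and environment variable are placeholders rather than SE Ranking&apos;s actual values, so follow their setup guide for the real configuration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;mcpServers&quot;: {
    &quot;seranking&quot;: {
      &quot;command&quot;: &quot;npx&quot;,
      &quot;args&quot;: [&quot;-y&quot;, &quot;placeholder-mcp-server-package&quot;],
      &quot;env&quot;: { &quot;API_KEY&quot;: &quot;your-api-key&quot; }
    }
  }
}&lt;/code&gt;&lt;/pre&gt;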
&lt;p&gt;&lt;strong&gt;Start with a bounded workflow.&lt;/strong&gt; SE Ranking&apos;s blog walks through three specific workflows. Pick one that maps to work you already do manually, like competitive keyword gap analysis. Run it through Claude Code and compare the output quality and time savings against your current process.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit what data you&apos;re feeding the agent.&lt;/strong&gt; The MCP pulls from your SE Ranking account data. Make sure your projects and tracked keywords are current before running analysis, or the agent will work from stale inputs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check the sandbox approvals.&lt;/strong&gt; Claude Code asks permission before writing files or running commands. Review each approval request during your first few runs to understand what the agent is doing at each step. You can deny any action that looks wrong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consider the Claude desktop app as an alternative.&lt;/strong&gt; Anthropic&apos;s Claude desktop app supports file reading and project-scoped context without the terminal. If the command line feels like a barrier, the desktop app offers a friendlier entry point for exploring MCP-connected workflows.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Read-only does not mean risk-free.&lt;/strong&gt; The MCP server cannot modify your SE Ranking account, but Claude Code can still write files to your local system and run commands. The sandbox approval step is your safety net. Do not auto-approve actions during early experimentation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context window limits still apply to complex prompts.&lt;/strong&gt; Claude Code manages memory better than chat by writing to files, but extremely broad objectives can still produce incomplete results. Scope your initial prompts narrowly and expand once you understand how the agent breaks down tasks.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/se-ranking-mcp-server-enables-agentic-seo-via-claude-code.webp" medium="image" type="image/webp"/></item><item><title>Bing Copilot test shrinks citation links to superscripts</title><link>https://technicalseonews.com/latest/bing-copilot-test-shrinks-citation-links-to-superscripts</link><guid isPermaLink="true">https://technicalseonews.com/latest/bing-copilot-test-shrinks-citation-links-to-superscripts</guid><description>Bing is testing smaller superscript citations in Copilot Search instead of full-line links, potentially reducing click-through rates from AI answers to source sites.</description><pubDate>Mon, 27 Apr 2026 14:46:14 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Microsoft Bing is testing a new citation design in Copilot Search results that makes links far less prominent. Instead of the entire line of text being clickable to the source, only a small superscript citation mark at the end of the line links out.&lt;/p&gt;
&lt;p&gt;The test was &lt;a href=&quot;https://www.seroundtable.com/bing-less-clickable-links-41208.html&quot;&gt;spotted by Sachin Patel on X&lt;/a&gt; and covered by Search Engine Roundtable&apos;s Barry Schwartz, who noted he could not replicate the behavior. Patel wrote: &quot;Bing is testing a new design for links in their AI overview. Previously, the link covered the entire line, but now it appears differently.&quot;&lt;/p&gt;
&lt;p&gt;The change has not rolled out broadly. It appears to be a limited A/B test.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Citation link size directly affects click-through rates. A full line of clickable text is a much larger target than a tiny superscript number. If Bing ships this design, sites that receive referral traffic from Copilot Search could see clicks drop even when they are still cited as a source.&lt;/p&gt;
&lt;p&gt;The pattern mirrors a broader trend across AI search interfaces. Google&apos;s AI Overviews already use small superscript citations rather than full-line links. If Bing follows suit, the design convention across both major search engines would push users toward consuming the AI-generated answer without clicking through.&lt;/p&gt;
&lt;p&gt;For publishers, the implication is clear: &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;being cited in an AI answer&lt;/a&gt; becomes less valuable if the citation is visually de-emphasized. Visibility in the answer itself still matters for brand awareness, but the direct traffic payoff shrinks.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;No immediate action is required since the test is not widely available. But there are a few things worth preparing for.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check your Bing referral traffic baseline now.&lt;/strong&gt; Use &lt;a href=&quot;https://www.bing.com/webmasters&quot;&gt;Bing Webmaster Tools&lt;/a&gt; and your analytics platform to establish current Copilot-related click volumes. If the design rolls out, you will need a clean before-and-after comparison.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Segment Bing traffic by query type.&lt;/strong&gt; Informational queries answered directly in Copilot are most at risk for click loss. Transactional and navigational queries tend to retain clicks regardless of citation design.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Track Bing Copilot citations separately.&lt;/strong&gt; If you are already monitoring AI Overview citations from Google, apply the same tracking to Bing. Knowing you are cited but losing clicks tells a different story than losing citations entirely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Watch for a broader rollout announcement.&lt;/strong&gt; Barry Schwartz noted he hopes the interface does not roll out. If it does, expect coverage on Search Engine Roundtable and similar outlets.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Misattributing traffic drops.&lt;/strong&gt; If this test reaches your users before a wider rollout is announced, you might see Bing referral dips that look like a ranking loss. Check whether your Copilot citations are stable before assuming a visibility problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ignoring Bing Copilot entirely.&lt;/strong&gt; Bing&apos;s search market share is small, but Copilot is integrated into Edge, Windows, and Microsoft 365. The surface area for Copilot answers is larger than Bing.com&apos;s market share suggests.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/bing-copilot-test-shrinks-citation-links-to-superscripts.webp" medium="image" type="image/webp"/></item><item><title>Semrush playbook targets SaaS citation failures in AI search</title><link>https://technicalseonews.com/latest/semrush-playbook-targets-saas-citation-failures-in-ai-search</link><guid isPermaLink="true">https://technicalseonews.com/latest/semrush-playbook-targets-saas-citation-failures-in-ai-search</guid><description>SaaS sites lose AI citations when product pages use client-side schema and inconsistent naming, and Semrush published an eight-step playbook to fix extraction.</description><pubDate>Mon, 27 Apr 2026 14:39:24 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Semrush published an &lt;a href=&quot;https://www.semrush.com/blog/saas-ai-search-optimization/&quot;&gt;eight-step playbook for SaaS AI search visibility&lt;/a&gt; on April 27, targeting how software companies can get cited accurately in ChatGPT, Perplexity, Google AI Overviews, and similar answer engines. The guide covers &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;citation auditing, product page structure, schema markup, comparison content, and ongoing monitoring&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The core argument: SaaS buyers now ask multi-part questions about pricing, integrations, compliance, and use cases in a single AI prompt. AI systems pull from multiple sources and generate shortlists before the buyer ever clicks through. If your product pages aren&apos;t structured for extraction, you get skipped or misrepresented.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Most SaaS SEO guidance still focuses on keyword rankings and organic click-through rates. The Semrush playbook shifts the target to &lt;a href=&quot;/latest/top-ranking-sites-still-get-skipped-in-ai-search-citations&quot;&gt;citation accuracy and share of voice inside AI-generated answers&lt;/a&gt;. That distinction matters because a &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;ranking you hold in traditional search doesn&apos;t guarantee you appear in the AI summary for the same query&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The playbook identifies eight signals that affect whether a SaaS brand gets pulled into AI answers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Consistent product and feature naming across all pages&lt;/li&gt;
&lt;li&gt;Clean, scoped URL structures that crawlers can follow easily&lt;/li&gt;
&lt;li&gt;FAQ schema on help and feature pages&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://schema.org/SoftwareApplication&quot;&gt;SoftwareApplication schema&lt;/a&gt; with current pricing on product pages&lt;/li&gt;
&lt;li&gt;Glossary and comparison pages using HTML tables rather than images&lt;/li&gt;
&lt;li&gt;Conversation-led page structure that answers multi-part prompts&lt;/li&gt;
&lt;li&gt;Off-site expert quotes anchored to data&lt;/li&gt;
&lt;li&gt;Monthly citation monitoring tied to an ROI model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The SoftwareApplication schema point is worth attention. &lt;a href=&quot;https://developers.google.com/search/docs/appearance/structured-data/software-app&quot;&gt;Google&apos;s structured data documentation&lt;/a&gt; supports this type for rich results, and the &lt;a href=&quot;https://schema.org/Offer&quot;&gt;schema.org spec&lt;/a&gt; includes properties for pricing tiers via nested Offer markup. SaaS companies that already run Product schema on their pricing pages may need to evaluate whether SoftwareApplication is a better fit for how AI systems parse software-specific attributes like &lt;code&gt;applicationCategory&lt;/code&gt;, &lt;code&gt;featureList&lt;/code&gt;, and &lt;code&gt;operatingSystem&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Semrush also flags that mature SaaS categories with abundant third-party coverage (review sites, comparison guides, analyst content) tend to show up more reliably in AI summaries. Brands in emerging or niche segments face a harder path because there&apos;s less corroborating content for AI systems to cross-reference.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Run the citation audit first.&lt;/strong&gt; Test 8–12 realistic prompts across ChatGPT, Perplexity, and Google AI Overviews. Focus on category-level queries, not branded ones. Log whether your brand appears, where it ranks in the answer, whether details are accurate or outdated, and whether clickable source links are included. Semrush suggests timeboxing the manual audit at 30–45 minutes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check your product page structure.&lt;/strong&gt; AI systems extract information more cleanly from pages with consistent naming, scoped URLs, and up-to-date specs. If your pricing page uses images instead of HTML tables, AI crawlers can&apos;t parse the tiers. The same applies to feature comparison matrices embedded as screenshots.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Add SoftwareApplication schema to product pages.&lt;/strong&gt; Include current pricing using nested Offer properties. If you&apos;re already running Product or WebApplication schema, compare the property coverage against what SoftwareApplication supports. The &lt;code&gt;featureList&lt;/code&gt;, &lt;code&gt;applicationCategory&lt;/code&gt;, and &lt;code&gt;operatingSystem&lt;/code&gt; fields give AI systems structured attributes to pull from.&lt;/p&gt;
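&lt;p&gt;A trimmed-down illustration of that markup, with placeholder values rather than anything taken from the playbook:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;script type=&quot;application/ld+json&quot;&amp;gt;
{
  &quot;@context&quot;: &quot;https://schema.org&quot;,
  &quot;@type&quot;: &quot;SoftwareApplication&quot;,
  &quot;name&quot;: &quot;ExampleApp&quot;,
  &quot;applicationCategory&quot;: &quot;BusinessApplication&quot;,
  &quot;operatingSystem&quot;: &quot;Web&quot;,
  &quot;featureList&quot;: &quot;Task tracking, time reports, integrations&quot;,
  &quot;offers&quot;: {
    &quot;@type&quot;: &quot;Offer&quot;,
    &quot;price&quot;: &quot;29.00&quot;,
    &quot;priceCurrency&quot;: &quot;USD&quot;
  }
}
&amp;lt;/script&amp;gt;&lt;/code&gt;&lt;/pre&gt;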
&lt;p&gt;&lt;strong&gt;Build comparison and glossary pages with extractable markup.&lt;/strong&gt; HTML tables with clear headers beat prose-heavy paragraphs for AI extraction. If you have &quot;vs.&quot; comparison pages, structure them so each product&apos;s attributes sit in labeled table cells.&lt;/p&gt;
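&lt;p&gt;For example, a small comparison table with explicit headers gives extraction systems labeled cells to pull from (the products and values here are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;table&amp;gt;
  &amp;lt;thead&amp;gt;
    &amp;lt;tr&amp;gt;&amp;lt;th scope=&quot;col&quot;&amp;gt;Feature&amp;lt;/th&amp;gt;&amp;lt;th scope=&quot;col&quot;&amp;gt;Product A&amp;lt;/th&amp;gt;&amp;lt;th scope=&quot;col&quot;&amp;gt;Product B&amp;lt;/th&amp;gt;&amp;lt;/tr&amp;gt;
  &amp;lt;/thead&amp;gt;
  &amp;lt;tbody&amp;gt;
    &amp;lt;tr&amp;gt;&amp;lt;th scope=&quot;row&quot;&amp;gt;Starting price&amp;lt;/th&amp;gt;&amp;lt;td&amp;gt;$29/month&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;$49/month&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;
    &amp;lt;tr&amp;gt;&amp;lt;th scope=&quot;row&quot;&amp;gt;SSO support&amp;lt;/th&amp;gt;&amp;lt;td&amp;gt;All plans&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Enterprise only&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;
  &amp;lt;/tbody&amp;gt;
&amp;lt;/table&amp;gt;&lt;/code&gt;&lt;/pre&gt;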
&lt;p&gt;&lt;strong&gt;Set up monthly citation monitoring.&lt;/strong&gt; The playbook recommends tracking average citations per week, accuracy of brand mentions, and share of voice against competitors. Semrush points to its own AI Visibility Toolkit (drawing on 261M+ prompts) for benchmarking, but the manual audit method works as a starting point regardless of tooling.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Image-based pricing tables block AI extraction.&lt;/strong&gt; If your pricing page renders tiers as designed graphics or screenshots rather than HTML, AI crawlers can&apos;t read the values. Check whether your pricing data exists in the DOM as text nodes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Branded query audits give false confidence.&lt;/strong&gt; Semrush specifically warns against relying on branded prompts when assessing AI visibility. Category-level prompts like &quot;best project management tools for startups&quot; reveal whether you&apos;re in the consideration set. Branded queries only confirm AI knows you exist.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/semrush-playbook-targets-saas-citation-failures-in-ai-search.webp" medium="image" type="image/webp"/></item><item><title>Top-ranking sites still get skipped in AI search citations</title><link>https://technicalseonews.com/latest/top-ranking-sites-still-get-skipped-in-ai-search-citations</link><guid isPermaLink="true">https://technicalseonews.com/latest/top-ranking-sites-still-get-skipped-in-ai-search-citations</guid><description>Pages ranking first in traditional Google results often get skipped in AI Overview and Perplexity citations, and no unified metric tracks AI visibility yet.</description><pubDate>Sun, 26 Apr 2026 13:32:12 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1sw4miq/how_can_a_website_get_included_in_the/&quot;&gt;r/TechSEO discussion thread&lt;/a&gt; posted on April 26 asked a question many practitioners are wrestling with: how does a website get cited by AI-generated search features? The thread was removed by moderators for not being technical enough, but the question it raised reflects a real gap between traditional SEO performance and AI citation visibility.&lt;/p&gt;
&lt;p&gt;The original poster asked whether citations in AI answers depend on SEO signals, authority, structured data, backlinks, content freshness, or something else entirely. One commenter recommended using tools like ModelMention.io to track which sources AI systems actually pull from. That commenter noted AI tools tend to favor &quot;sites with human conversation&quot; such as Reddit, Trustpilot, and YouTube.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://developers.google.com/search/docs/appearance/ai-overviews&quot;&gt;Google&apos;s AI Overviews documentation&lt;/a&gt; states clearly that &quot;there are no additional requirements to appear in AI Overviews or AI Mode, nor other special optimizations necessary.&quot; The guidance says standard SEO best practices still apply.&lt;/p&gt;
&lt;p&gt;That message conflicts with what practitioners are seeing. Sites that rank well in traditional organic results do not automatically get cited in &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;AI Overviews or third-party AI tools like Perplexity&lt;/a&gt;. The frustration in the Reddit thread reflects a broader pattern across SEO communities: &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;ranking #1 for a query no longer guarantees&lt;/a&gt; your site appears in the AI-generated answer for that same query.&lt;/p&gt;
&lt;p&gt;The citation mechanics differ across platforms. &lt;a href=&quot;https://www.perplexity.ai/help-center/en/articles/10352895-how-does-perplexity-work&quot;&gt;Perplexity&apos;s help documentation&lt;/a&gt; describes a process where it searches the web in real time, gathers information from &quot;authoritative sources like articles, websites, and journals,&quot; and then compiles answers with numbered citations. Google&apos;s AI Overviews pull from its existing index but apply a separate selection layer to decide which sources get surfaced.&lt;/p&gt;
&lt;p&gt;Neither system publishes the specific ranking factors that determine citation inclusion. Practitioners are left reverse-engineering patterns from observed results.&lt;/p&gt;
&lt;p&gt;The commenter&apos;s observation about conversational sources is worth attention. AI systems trained on or retrieving from user-generated content may weight discussion forums and review platforms more heavily for certain query types. Brand mentions in Reddit threads and YouTube comments could carry more citation weight than a well-optimized product page.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your current AI visibility.&lt;/strong&gt; Check whether your site appears in AI Overviews for your core queries by searching in Google with AI Overviews enabled. Do the same in Perplexity and ChatGPT search. Note which competitors get cited and examine what their cited pages have in common.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t chase a separate &quot;AI SEO&quot; workflow.&lt;/strong&gt; Google&apos;s documentation says standard best practices apply. Focus on clear, direct answers to specific questions within your content. AI citation systems tend to pull from content that directly addresses a query rather than content that ranks through link authority alone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Look at where your brand gets discussed.&lt;/strong&gt; The Reddit commenter&apos;s point about conversational sources suggests brand presence on forums, review sites, and video platforms may influence AI citations. Monitor mentions on Reddit, Quora, and industry-specific forums. Genuine participation in these spaces can increase the chances AI systems encounter your brand when assembling answers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Structure content for extraction.&lt;/strong&gt; Use clear headings, concise definitions, and direct factual statements. AI systems parsing your page need to identify discrete claims to cite. Walls of text with buried answers are harder to extract from.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be cautious about new tools promising AI visibility tracking.&lt;/strong&gt; As &lt;a href=&quot;https://ppc.land/google-retracts-ai-overview-filter-test-claims-as-fake-announcement-spreads/&quot;&gt;PPC Land reported&lt;/a&gt;, &lt;a href=&quot;/latest/fabricated-google-core-update-ranked-in-search-and-ai-overviews&quot;&gt;Google&apos;s John Mueller debunked fabricated claims&lt;/a&gt; about an AI Overview filter in Search Console in September 2025. No official GSC reporting for AI Overview performance exists yet. Any tool claiming direct measurement of AI citation performance is working from external observation, not API data.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Conflating ranking with citation.&lt;/strong&gt; A page ranking #1 organically may not appear in AI answers for the same query. AI features apply a different selection layer. Track both channels separately.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-investing in schema markup as a silver bullet.&lt;/strong&gt; While &lt;a href=&quot;https://schema.org/Person&quot;&gt;structured data&lt;/a&gt; helps search engines understand entities, no source confirms that schema markup directly increases AI citation rates. It is one signal among many, not a guaranteed path to AI visibility.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/top-ranking-sites-still-get-skipped-in-ai-search-citations.webp" medium="image" type="image/webp"/></item><item><title>Fabricated Google core update ranked in Search and AI Overviews</title><link>https://technicalseonews.com/latest/fabricated-google-core-update-ranked-in-search-and-ai-overviews</link><guid isPermaLink="true">https://technicalseonews.com/latest/fabricated-google-core-update-ranked-in-search-and-ai-overviews</guid><description>A fabricated Google core update ranked on page one and appeared in AI Overviews. The test shows misinformation spreads through search without fact-checking layers.</description><pubDate>Sat, 25 Apr 2026 10:53:04 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;SEO practitioner Jon Goodey published a fabricated &quot;March 2026 Google Core Update&quot; in a LinkedIn newsletter to test how easily misinformation spreads through Google Search. &lt;a href=&quot;https://www.searchenginejournal.com/seo-test-shows-its-trivial-to-rank-misinformation-on-google/569980/&quot;&gt;Search Engine Journal reported&lt;/a&gt; that the fake update ranked on page one for &quot;Google March update 2026&quot; and was picked up by AI Overviews as fact.&lt;/p&gt;
&lt;p&gt;Goodey explained in a subsequent LinkedIn post that his AI-assisted newsletter workflow caught a hallucination about a nonexistent core update. Instead of correcting it, he published it deliberately to see whether anyone would challenge the claim.&lt;/p&gt;
&lt;p&gt;Nobody did. Multiple independent SEO sites published detailed articles treating the update as confirmed. According to SEJ&apos;s coverage, these weren&apos;t thin posts. They included invented technical details like &quot;Gemini 4.0 Semantic Filters,&quot; an &quot;Information Gain&quot; metric, and recovery strategies for an update that never happened. A technology site called TechBytes published a piece headlined &quot;Google March 2026 Core Update: Cracking Down on &apos;Agentic Slop&apos;&quot; with fabricated specifics about a &quot;Zero Information Gain&quot; classification system and a &quot;Discover 2.0 Engine.&quot;&lt;/p&gt;
&lt;p&gt;Major search marketing publications, including SEJ, did not report the fake update as genuine.&lt;/p&gt;
&lt;p&gt;A real March 2026 core update did eventually roll out on March 27, according to the &lt;a href=&quot;https://status.search.google.com/summary&quot;&gt;Google Search Status Dashboard&lt;/a&gt;. Goodey&apos;s experiment predated it.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Google&apos;s search results have no meaningful fact-checking layer for most queries. The experiment shows that a single LinkedIn article containing AI-generated misinformation can reach page one and feed directly into AI Overviews, where it gets presented without qualification.&lt;/p&gt;
&lt;p&gt;For SEO practitioners, the implications cut two ways. First, the SEO information you find through Google Search may itself be fabricated. SEJ&apos;s Roger Montti compared searching for SEO information on Google to &quot;playing a slot machine.&quot; Second, if you publish AI-assisted content without human review, you risk becoming part of the misinformation chain.&lt;/p&gt;
&lt;p&gt;The AI Overviews angle is particularly concerning. Classic search results at least show source URLs that a reader can evaluate. &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;AI Overviews synthesize claims into authoritative-sounding summaries&lt;/a&gt;. When the underlying source is fabricated, the AI Overview launders the misinformation into something that looks like consensus.&lt;/p&gt;
&lt;p&gt;Google has explicitly declined to integrate fact-checking into its search results. SEJ&apos;s article references an Axios report in which Google&apos;s global affairs president Kent Walker told the European Commission that fact-checking integration &quot;simply isn&apos;t appropriate or effective for our services.&quot;&lt;/p&gt;
&lt;p&gt;Google does support &lt;a href=&quot;https://developers.google.com/search/docs/appearance/structured-data/factcheck&quot;&gt;ClaimReview structured data&lt;/a&gt; that lets fact-checkers annotate claims. However, Google is phasing out support for ClaimReview markup in Search results, according to its own documentation. The markup remains supported only in the Fact Check Explorer tool. The direction of travel is away from structured fact-checking in SERPs, not toward it.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Verify update claims against primary sources.&lt;/strong&gt; The &lt;a href=&quot;https://status.search.google.com/summary&quot;&gt;Google Search Status Dashboard&lt;/a&gt; lists every confirmed algorithm update with dates and durations. If an update isn&apos;t listed there, treat it as unconfirmed. Google&apos;s official Search Central blog and social accounts are the only other primary sources.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit your AI content workflows.&lt;/strong&gt; If you use AI to draft content, build a verification step that checks factual claims against primary sources before publishing. Goodey&apos;s experiment worked precisely because other publishers skipped this step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be skeptical of &lt;a href=&quot;/latest/top-ranking-sites-still-get-skipped-in-ai-search-citations&quot;&gt;AI Overview answers for SEO queries&lt;/a&gt;.&lt;/strong&gt; AI Overviews can surface and synthesize misinformation from a single low-authority source. &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;Cross-reference any AI Overview claim about algorithm updates&lt;/a&gt;, ranking factors, or technical SEO against official Google documentation or established industry publications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t rush to publish update coverage.&lt;/strong&gt; The sites that got burned were chasing traffic from a trending query. Waiting even a few hours to verify against the Search Status Dashboard would have prevented publishing fabricated content.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;AI hallucinations about Google updates are plausible by default.&lt;/strong&gt; Google releases updates frequently enough that a claim about a new one rarely triggers skepticism. Your AI writing tools have been trained on years of update coverage and can generate convincing fake details, complete with invented feature names and timelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LinkedIn content ranks surprisingly well for informational queries.&lt;/strong&gt; Goodey&apos;s article wasn&apos;t on a high-authority SEO domain. It was a LinkedIn newsletter post. LinkedIn&apos;s domain authority can push even low-effort content onto page one for queries without strong competition, which makes it an effective vector for misinformation.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/fabricated-google-core-update-ranked-in-search-and-ai-overviews.webp" medium="image" type="image/webp"/></item><item><title>Scoped custom element registries can silently break crawlability</title><link>https://technicalseonews.com/latest/scoped-custom-element-registries-can-silently-break-crawlability</link><guid isPermaLink="true">https://technicalseonews.com/latest/scoped-custom-element-registries-can-silently-break-crawlability</guid><description>Scoped custom element registries in Chrome and Edge 146 can break page crawlability if JavaScript initialization fails before search bots render shadow DOM content.</description><pubDate>Sat, 25 Apr 2026 10:45:32 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Chrome and Edge 146 now ship &lt;a href=&quot;https://developer.chrome.com/blog/scoped-registries?hl=en&quot;&gt;scoped custom element registries&lt;/a&gt; by default. The feature, developed by the Microsoft Edge team, lets developers create independent &lt;code&gt;CustomElementRegistry&lt;/code&gt; instances instead of relying on the single global &lt;code&gt;window.customElements&lt;/code&gt; registry. Each scoped registry maintains its own set of custom element definitions, isolated from the global registry and from other scoped registries.&lt;/p&gt;
&lt;p&gt;The feature solves a real problem. Large applications that compose UIs from multiple teams or micro-frontend libraries frequently hit naming collisions. If two libraries both try to define &lt;code&gt;&amp;lt;my-button&amp;gt;&lt;/code&gt;, the second registration throws an error. Scoped registries eliminate that by letting each shadow root, document, or individual element use its own registry.&lt;/p&gt;
&lt;p&gt;Registries can be scoped three ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shadow root scoping:&lt;/strong&gt; Pass a &lt;code&gt;customElementRegistry&lt;/code&gt; option when calling &lt;code&gt;attachShadow()&lt;/code&gt;. All custom elements inside that shadow root resolve against the scoped registry.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Declarative shadow DOM:&lt;/strong&gt; Add the &lt;code&gt;shadowrootcustomelementregistry&lt;/code&gt; attribute to a &lt;code&gt;&amp;lt;template&amp;gt;&lt;/code&gt; element. The browser reserves space for the registry, and JavaScript defines its elements later via the scoped registry&apos;s &lt;code&gt;define()&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disconnected documents:&lt;/strong&gt; Scope a registry to an off-screen document created by &lt;code&gt;document.implementation.createHTMLDocument()&lt;/code&gt;, useful for template cloning and off-screen manipulation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_custom_elements&quot;&gt;MDN&apos;s documentation on custom elements&lt;/a&gt; confirms the feature and notes that scoped registries can &quot;limit definitions to a particular DOM subtree.&quot;&lt;/p&gt;
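&lt;p&gt;Pulling the shadow-root path together, a minimal sketch looks like the following (the element and class names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Create an isolated registry and define an element only inside it.
const registry = new CustomElementRegistry();
registry.define(&apos;team-card&apos;, class extends HTMLElement {
  connectedCallback() {
    this.textContent = &apos;Rendered by the scoped definition&apos;;
  }
});

// Attach a shadow root that resolves custom elements against that registry.
class ProductShell extends HTMLElement {
  connectedCallback() {
    const shadow = this.attachShadow({ mode: &apos;open&apos;, customElementRegistry: registry });
    shadow.innerHTML = &apos;&amp;lt;team-card&amp;gt;&amp;lt;/team-card&amp;gt;&apos;; // resolved via the scoped registry, not window.customElements
  }
}
customElements.define(&apos;product-shell&apos;, ProductShell);&lt;/code&gt;&lt;/pre&gt;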
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Googlebot renders pages with a Chromium-based renderer, but rendering is queued and resource-constrained rather than immediate the way it is in a user&apos;s browser. Content inside shadow DOM has always been harder for crawlers to access. Scoped registries add another layer of indirection.&lt;/p&gt;
&lt;p&gt;When a custom element&apos;s definition lives in a scoped registry attached to a shadow root, the element only renders after JavaScript creates the registry, defines the element class, and attaches it. If any step in that chain fails or runs after the crawler&apos;s rendering window closes, the element stays undefined. The browser shows nothing, or shows fallback content.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;/latest/declarative-shadow-dom-cuts-render-blocking-js&quot;&gt;declarative shadow DOM&lt;/a&gt; path is particularly tricky. The &lt;code&gt;shadowrootcustomelementregistry&lt;/code&gt; attribute tells the browser that the shadow root uses a scoped registry rather than the global one. Until JavaScript defines elements in the scoped registry and attaches it to that shadow root, the custom elements inside it remain unresolved. &lt;a href=&quot;https://web.dev/articles/declarative-shadow-dom&quot;&gt;Web.dev&apos;s declarative shadow DOM article&lt;/a&gt; explains that declarative shadow DOM itself is now baseline across browsers, but the scoped registry extension requires explicit JavaScript initialization.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/latest/wordpress-to-sveltekit-migration-risks-crawlability-regression&quot;&gt;Micro-frontend architectures&lt;/a&gt; are the primary adopters. Sites built with frameworks like single-spa, Module Federation, or custom shell apps often compose pages from independently deployed bundles. Each bundle can now register its own element names without coordination. The tradeoff is that critical content rendered by these scoped elements depends entirely on JavaScript execution order.&lt;/p&gt;
&lt;p&gt;For sites where scoped elements render above-the-fold content, product information, or navigational links, the crawlability risk is real. Googlebot may see an empty &lt;code&gt;&amp;lt;my-card&amp;gt;&amp;lt;/my-card&amp;gt;&lt;/code&gt; instead of the rendered output.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Check whether your site uses scoped custom element registries by searching your codebase for &lt;code&gt;new CustomElementRegistry()&lt;/code&gt; and &lt;code&gt;shadowrootcustomelementregistry&lt;/code&gt;. If neither appears, no action is needed.&lt;/p&gt;
&lt;p&gt;If your site does use scoped registries, test how Googlebot sees the rendered output. Use Google&apos;s Rich Results Test or the URL Inspection tool&apos;s &quot;View Tested Page&quot; to confirm that content inside scoped shadow roots actually renders. Compare the rendered HTML against what you expect.&lt;/p&gt;
&lt;p&gt;For content that must be crawlable, avoid placing it exclusively inside scoped-registry shadow roots. Server-side render the critical text content into the light DOM or into declarative shadow DOM that does not depend on scoped registry initialization. The &lt;code&gt;&amp;lt;slot&amp;gt;&lt;/code&gt; element can project light DOM children into shadow roots, keeping the text accessible to crawlers even if the component&apos;s JavaScript hasn&apos;t executed.&lt;/p&gt;
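&lt;p&gt;A minimal sketch of that pattern, with illustrative element names: the crawlable text lives in the light DOM and is projected through a slot, so it stays in the server-rendered HTML even if the scoped registry never initializes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;product-summary&amp;gt;
  &amp;lt;template shadowrootmode=&quot;open&quot;&amp;gt;
    &amp;lt;div class=&quot;card&quot;&amp;gt;
      &amp;lt;slot&amp;gt;&amp;lt;/slot&amp;gt; &amp;lt;!-- projects the light DOM content below --&amp;gt;
    &amp;lt;/div&amp;gt;
  &amp;lt;/template&amp;gt;
  &amp;lt;!-- Light DOM content: present in the HTML without any JavaScript --&amp;gt;
  &amp;lt;h2&amp;gt;Acme Widget Pro&amp;lt;/h2&amp;gt;
  &amp;lt;p&amp;gt;Cloud-based widget management from $29 per month.&amp;lt;/p&amp;gt;
&amp;lt;/product-summary&amp;gt;&lt;/code&gt;&lt;/pre&gt;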
&lt;p&gt;If you&apos;re using declarative shadow DOM with the &lt;code&gt;shadowrootcustomelementregistry&lt;/code&gt; attribute, make sure your SSR pipeline defines the scoped registry&apos;s elements synchronously during hydration. Any delay or conditional loading risks leaving the elements undefined during Googlebot&apos;s render pass.&lt;/p&gt;
&lt;p&gt;Monitor Google Search Console&apos;s coverage reports for pages that use micro-frontend composition. Watch for increases in &quot;Excluded&quot; pages or drops in indexed page counts that correlate with deploying scoped registries.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Fallback content is not automatic.&lt;/strong&gt; Unlike some web component patterns where light DOM children serve as fallback, scoped registry elements that fail to initialize render as empty unknown elements. There is no built-in fallback mechanism.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mixed registry conflicts during hydration.&lt;/strong&gt; If your SSR output uses declarative shadow DOM with the &lt;code&gt;shadowrootcustomelementregistry&lt;/code&gt; attribute but your client-side JavaScript defines the same element in the global registry instead, the shadow root will not pick up the global definition. The element stays undefined inside the shadow root even though it works everywhere else on the page.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/scoped-custom-element-registries-can-silently-break-crawlability.webp" medium="image" type="image/webp"/></item><item><title>ChatGPT uses SerpAPI to pull Google results, not its own crawler</title><link>https://technicalseonews.com/latest/chatgpt-uses-serpapi-to-pull-google-results-not-its-own-crawler</link><guid isPermaLink="true">https://technicalseonews.com/latest/chatgpt-uses-serpapi-to-pull-google-results-not-its-own-crawler</guid><description>ChatGPT pulls results from SerpAPI, not its own index, so your Google rankings directly determine whether AI platforms surface your content.</description><pubDate>Sat, 25 Apr 2026 10:37:22 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;ChatGPT uses a third-party service called SerpAPI to pull scraped Google search results rather than relying solely on its own crawler or index. &lt;a href=&quot;https://www.botify.com/blog/how-ai-platforms-source-content&quot;&gt;Botify&apos;s research into how AI platforms source content&lt;/a&gt; confirms that OpenAI isn&apos;t alone in this practice. Meta and Perplexity are reportedly SerpAPI customers as well.&lt;/p&gt;
&lt;p&gt;The data from SerpAPI helps ChatGPT deliver real-time answers to queries that fall outside its training data or OpenAI&apos;s internal index. Botify&apos;s analysis also notes that these companies are building proprietary indexes to reduce dependence on Google and Bing, but the current reality is simpler: most AI platforms still lean on traditional search results to source the content they recommend.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The SerpAPI dependency means your Google rankings directly influence whether AI platforms surface your content. If a page ranks well in Google, it has a better chance of being pulled into ChatGPT&apos;s answers. If it doesn&apos;t rank, ChatGPT may never see it.&lt;/p&gt;
&lt;p&gt;Botify frames this around the compressed customer journey inside AI interfaces. Consumers ask conversational questions and get instant recommendations without opening a browser. A single interaction can move someone from discovery to decision. The content that feeds those answers comes from Google&apos;s index, accessed through services like SerpAPI.&lt;/p&gt;
&lt;p&gt;Pages that are invisible to crawlers won&apos;t appear in any index, traditional or AI. Crawlability and indexability are now prerequisites for visibility across both channels. A page blocked by robots.txt or stuck behind client-side rendering that Googlebot can&apos;t parse is invisible to ChatGPT too, because ChatGPT is reading Google&apos;s results.&lt;/p&gt;
&lt;p&gt;The proprietary indexes being built by OpenAI, Meta, and Perplexity add a second layer. These systems aim to crawl and categorize web content directly, separate from Google. Botify describes this as an effort to &quot;expand each platform&apos;s reach and establish autonomy outside traditional search.&quot; For now, though, Google&apos;s index remains the primary source.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Treat Google indexing as your AI search foundation.&lt;/strong&gt; Run a crawl audit in Google Search Console to identify pages that aren&apos;t indexed. &lt;a href=&quot;/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers&quot;&gt;Fix crawl errors, broken canonicals&lt;/a&gt;, and noindex tags on pages you want AI platforms to find.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check your robots.txt for AI crawler blocks.&lt;/strong&gt; If you&apos;ve blocked &lt;a href=&quot;/latest/openai-crawl-activity-tripled-after-gpt-5-led-by-search-bot&quot;&gt;OAI-SearchBot (OpenAI&apos;s crawler)&lt;/a&gt; or PerplexityBot, those platforms can&apos;t build you into their proprietary indexes. Decide whether the tradeoff is worth it. Blocking the crawler doesn&apos;t stop SerpAPI from pulling your Google listing, but it does prevent inclusion in their growing independent indexes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ensure your content answers conversational queries directly.&lt;/strong&gt; AI platforms compress the search journey by matching user intent to a single answer. Pages structured around clear questions and concise answers are more likely to be selected. Use &lt;a href=&quot;https://schema.org/&quot;&gt;structured data from Schema.org&lt;/a&gt; (FAQ, HowTo, Product markup) to make your content machine-readable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengthen foundational brand content.&lt;/strong&gt; Botify notes that model training data captures evergreen brand information like company history, leadership, and flagship products. Pages covering these topics with consistent, factual details help LLMs build accurate representations of your brand. Training data cutoffs lag behind the live web, so focus on durable content that stays accurate over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monitor which AI crawlers hit your site.&lt;/strong&gt; Check your server logs for OAI-SearchBot, ClaudeBot, PerplexityBot, and GPTBot. Log analysis tells you which platforms are building their own indexes of your content and how frequently they crawl.&lt;/p&gt;
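&lt;p&gt;A small Node sketch of that kind of check, assuming a standard access log with the user agent string somewhere in each line (adjust the path and the bot list to your setup):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Count hits per AI crawler user agent in an access log.
const fs = require(&apos;fs&apos;);

const bots = [&apos;GPTBot&apos;, &apos;OAI-SearchBot&apos;, &apos;ClaudeBot&apos;, &apos;PerplexityBot&apos;];
const lines = fs.readFileSync(&apos;/var/log/nginx/access.log&apos;, &apos;utf8&apos;).split(&apos;\n&apos;);

const counts = {};
for (const bot of bots) counts[bot] = 0;

for (const line of lines) {
  for (const bot of bots) {
    if (line.includes(bot)) counts[bot] += 1;
  }
}

console.log(counts);&lt;/code&gt;&lt;/pre&gt;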
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;SerpAPI access doesn&apos;t equal direct crawling.&lt;/strong&gt; Blocking GPTBot or OAI-SearchBot in robots.txt won&apos;t stop your content from appearing in ChatGPT answers if your pages rank in Google. ChatGPT reads Google&apos;s results through SerpAPI regardless of whether its own crawler can access your site.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Proprietary indexes are a moving target.&lt;/strong&gt; OpenAI, Meta, and Perplexity are all building their own web indexes. The balance between SerpAPI-sourced results and proprietary index results will shift over time. A page that gets traffic from ChatGPT today because it ranks in Google may need direct crawler access tomorrow to maintain visibility.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/chatgpt-uses-serpapi-to-pull-google-results-not-its-own-crawler.webp" medium="image" type="image/webp"/></item><item><title>AI search scores passages, not pages, killing pillar content</title><link>https://technicalseonews.com/latest/ai-search-scores-passages-not-pages-killing-pillar-content</link><guid isPermaLink="true">https://technicalseonews.com/latest/ai-search-scores-passages-not-pages-killing-pillar-content</guid><description>AI search engines score individual passages with cross-encoder models, not whole pages, making traditional pillar content strategies less effective for citations.</description><pubDate>Sat, 25 Apr 2026 10:32:12 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;AI search engines score individual passages rather than full pages when deciding what to cite, according to a &lt;a href=&quot;https://sitebulb.com/resources/guides/why-your-seo-strategy-still-matters-for-ai-search-and-what-needs-to-change&quot;&gt;Sitebulb webinar&lt;/a&gt; featuring Dan Petrovic (founder of DEJAN agency) and Jes Scholz (growth marketing consultant). The session broke down the multi-step pipeline that runs between a user&apos;s prompt and the citations that appear in an AI-generated response.&lt;/p&gt;
&lt;p&gt;Petrovic described a four-stage process:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query reformulation.&lt;/strong&gt; The system takes the user&apos;s prompt and generates multiple synthetic search queries. These aren&apos;t what the user typed. They&apos;re what the system decides it needs to answer the question. A single prompt might produce two to six separate queries, each returning its own result set.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shortlisting.&lt;/strong&gt; From several hundred results across those queries, the system filters down to a handful through re-ranking.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Passage-level relevance scoring.&lt;/strong&gt; For each shortlisted page, the system pulls the cached version and scores individual passages against the query using a cross-encoder model. Petrovic described cross-encoders as models that embed both the query and a target text chunk together, then score the pair for relevance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grounding snippet creation.&lt;/strong&gt; The most relevant passages get extracted into &quot;grounding snippets&quot; that feed into the language model&apos;s context window alongside the reformulated queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The language model then synthesizes its response from those snippets and attaches citations back to the grounding sources. Petrovic noted that the model also carries biases from pre-training and fine-tuning, meaning the grounding snippets compete with whatever the model already &quot;knows.&quot;&lt;/p&gt;
&lt;p&gt;If a user continues the conversation, the entire pipeline repeats for each follow-up.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The passage-level scoring step changes what &quot;ranking&quot; means in AI search. A &lt;a href=&quot;/latest/top-ranking-sites-still-get-skipped-in-ai-search-citations&quot;&gt;page can rank well in traditional search&lt;/a&gt; and still get zero citations if no single passage on that page scores highly enough against the reformulated queries.&lt;/p&gt;
&lt;p&gt;Long-form pillar content is particularly exposed. A 3,000-word guide covering ten subtopics might rank for dozens of keywords in classic search. But in the AI pipeline, each passage competes independently. If no individual section is the best answer for the specific sub-query the system generated, the page gets passed over entirely.&lt;/p&gt;
&lt;p&gt;Cross-encoder scoring works differently from the bi-encoder models used in standard dense retrieval (the kind documented in toolkits like &lt;a href=&quot;https://github.com/castorini/pyserini&quot;&gt;Pyserini&lt;/a&gt;). Bi-encoders embed queries and documents separately, then compare them. Cross-encoders embed both together, producing more accurate but computationally expensive relevance scores. The practical effect is that passage quality matters more than page-level authority signals.&lt;/p&gt;
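&lt;p&gt;A rough sketch of the difference using the open-source sentence-transformers library; the model checkpoints and example passages are illustrative, not what any AI search engine actually runs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Illustrative bi-encoder vs. cross-encoder passage scoring.
# Model names are public sentence-transformers checkpoints chosen as examples.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = &quot;best CRM pricing for small businesses&quot;
passages = [
    &quot;Our CRM starts at 12 dollars per user per month with a 14-day trial.&quot;,
    &quot;Customer relationship management began as a sales discipline in the 1990s.&quot;,
]

# Bi-encoder: embed query and passages separately, then compare with cosine similarity.
bi_encoder = SentenceTransformer(&quot;all-MiniLM-L6-v2&quot;)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
passage_embs = bi_encoder.encode(passages, convert_to_tensor=True)
print(util.cos_sim(query_emb, passage_embs))

# Cross-encoder: score each (query, passage) pair jointly; slower but more precise.
cross_encoder = CrossEncoder(&quot;cross-encoder/ms-marco-MiniLM-L-6-v2&quot;)
print(cross_encoder.predict([(query, p) for p in passages]))
&lt;/code&gt;&lt;/pre&gt;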
&lt;p&gt;The query reformulation step adds another wrinkle. You don&apos;t control which synthetic queries the system generates from a user&apos;s prompt. Your content needs to match sub-queries you can&apos;t predict or see in Search Console.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your content at the passage level.&lt;/strong&gt; Read each section of your key pages as if it were a standalone answer. Does it make a clear, complete claim that directly responds to a question? Passages that meander or require surrounding context to make sense will score poorly against a focused sub-query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tighten long-form content.&lt;/strong&gt; If you have pillar pages covering many subtopics, check whether each section could stand alone as a strong answer. Sections that serve only as transitions or brief overviews are unlikely to score well in passage-level evaluation. Consider whether breaking them into focused pages would produce better individual passages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Front-load key claims within sections.&lt;/strong&gt; Cross-encoder models score chunks of text. If your main point appears in paragraph four of a section after three paragraphs of setup, the chunk containing the setup may be what gets evaluated. Put the answer first, then the explanation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Write for sub-queries you can&apos;t see.&lt;/strong&gt; Think about how a system might decompose a broad question into specific sub-queries. If someone asks &quot;best CRM for small businesses,&quot; the system might generate sub-queries about pricing, integrations, ease of use, and support. Each of those needs a passage-level answer somewhere in your content.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t abandon traditional SEO.&lt;/strong&gt; Both Scholz and Petrovic emphasized that the AI search pipeline still starts with traditional search results. &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;Pages that don&apos;t rank in the initial retrieval step&lt;/a&gt; never make it to the shortlist. Technical fundamentals like crawlability, indexation, and relevance signals remain the entry ticket.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Grounding snippet ≠ featured snippet.&lt;/strong&gt; Featured snippets are selected from the top-ranking result for a specific query. &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;Grounding snippets are extracted from multiple pages&lt;/a&gt; across multiple reformulated queries. Writing content to win featured snippets is a different task than writing passages that score well under cross-encoder evaluation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Model bias can override your passages.&lt;/strong&gt; Even if your passage makes it into the grounding context, the language model&apos;s pre-training biases may steer the synthesized answer toward a different framing. Petrovic noted that you&apos;re &quot;hoping for the best&quot; that the grounding pipeline sends the right signals. Brand mentions across the model&apos;s training data still matter.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/ai-search-scores-passages-not-pages-killing-pillars.webp" medium="image" type="image/webp"/></item><item><title>Botify launches AI Visibility, replaces rank with mentions</title><link>https://technicalseonews.com/latest/botify-launches-ai-visibility-replaces-rank-with-mentions</link><guid isPermaLink="true">https://technicalseonews.com/latest/botify-launches-ai-visibility-replaces-rank-with-mentions</guid><description>Botify&apos;s AI Visibility tool tracks mention frequency across ChatGPT, Perplexity, and AI Mode, replacing position-based metrics that don&apos;t apply to AI search.</description><pubDate>Sat, 25 Apr 2026 10:25:40 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Botify has moved its &lt;a href=&quot;https://www.botify.com/blog/ai-visibility&quot;&gt;AI Visibility product out of beta&lt;/a&gt; and into general availability for all customers. The tool, first introduced in beta last October, tracks how brands appear in AI-generated responses across platforms like ChatGPT, Google AI Mode, and Perplexity, with Gemini support coming soon.&lt;/p&gt;
&lt;p&gt;In a &lt;a href=&quot;https://www.botify.com/blog/measure-ai-visibility&quot;&gt;companion blog post&lt;/a&gt;, Botify&apos;s Content Marketing Lead Felicia Crawford laid out the measurement framework behind the product. The core argument: traditional rank tracking doesn&apos;t translate to AI search, and practitioners need new metrics built around mention frequency rather than position.&lt;/p&gt;
&lt;p&gt;Botify CEO Adrien Menard framed the launch as a response to customer demand, noting the beta saw &quot;the fastest and widest customer adoption with a new offering yet.&quot;&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The product reflects a growing industry consensus that &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;AI search measurement requires different primitives than traditional SEO&lt;/a&gt;. Botify&apos;s framework replaces position-based tracking with four core metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Visibility score:&lt;/strong&gt; The percentage of time your brand, products, or URLs are mentioned in AI-generated responses over a given period. Frequency replaces position as the primary signal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visibility by platform:&lt;/strong&gt; The same mention percentage broken down by AI platform, showing where a brand has strong coverage and where gaps exist.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Share of voice:&lt;/strong&gt; How often your brand is mentioned compared to competitors across AI responses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Citations:&lt;/strong&gt; The percentage of AI-generated answers that reference and link to your URLs. Botify frames these as both credibility signals and traffic drivers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Crawford&apos;s post also argues against keyword-only strategies for AI search. Because AI responses are personalized based on conversation history, user intent, and context, a specific input won&apos;t produce a predictable output. Botify recommends tracking &quot;intent visibility&quot; instead, which segments brand mentions by user intent categories like transactional, informational, and navigational.&lt;/p&gt;
&lt;p&gt;The shift from &quot;what rank am I&quot; to &quot;how often am I mentioned&quot; has practical consequences. Practitioners accustomed to monitoring position changes will need to adjust their reporting workflows. A brand disappearing from one prompt on one platform isn&apos;t a red alert in Botify&apos;s framework. It&apos;s a single data point that only matters when aggregated over time.&lt;/p&gt;
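&lt;p&gt;As a back-of-the-envelope illustration of that aggregation logic (not Botify&apos;s implementation), visibility score and share of voice reduce to counting mentions over a sample of logged AI responses:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy mention-frequency metrics over sampled AI responses.
# The response texts and brand names are invented for illustration.
responses = [
    &quot;For small teams, AcmeCRM and RivalCRM are both solid picks.&quot;,
    &quot;RivalCRM has the broadest integration library.&quot;,
    &quot;AcmeCRM offers the most transparent pricing of the group.&quot;,
]
brands = [&quot;AcmeCRM&quot;, &quot;RivalCRM&quot;]

mentions = {b: sum(b.lower() in r.lower() for r in responses) for b in brands}

# Visibility score: share of responses that mention the brand at all.
visibility = {b: mentions[b] / len(responses) for b in brands}

# Share of voice: one brand&apos;s mentions relative to all tracked brand mentions.
total_mentions = sum(mentions.values())
share_of_voice = {b: mentions[b] / total_mentions for b in brands}

print(visibility)      # e.g. 0.67 for a brand mentioned in 2 of 3 responses
print(share_of_voice)  # each brand&apos;s slice of all tracked mentions
&lt;/code&gt;&lt;/pre&gt;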
&lt;p&gt;For enterprise teams already using Botify&apos;s crawl and indexability tools, the addition of AI visibility data creates a more complete picture. Teams can connect technical health issues (crawlability, rendering) with downstream visibility in AI responses.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Start by auditing what you&apos;re currently measuring for AI search. If your reporting relies on prompt-level rank tracking or single-point-in-time snapshots, those metrics will be noisy and hard to act on.&lt;/p&gt;
&lt;p&gt;If you&apos;re a Botify customer, the AI Visibility module is now available without waiting for beta access. Explore the platform-level breakdowns to identify which AI surfaces mention your brand most and least frequently.&lt;/p&gt;
&lt;p&gt;For non-Botify users, the measurement framework still applies. Track &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;mention frequency across AI platforms&lt;/a&gt; rather than chasing positions. Several tools in the market now offer some version of AI mention tracking, but Botify&apos;s approach of bundling it with technical SEO data is distinct.&lt;/p&gt;
&lt;p&gt;Shift competitive analysis toward &lt;a href=&quot;/latest/top-ranking-sites-still-get-skipped-in-ai-search-citations&quot;&gt;share of voice in AI responses&lt;/a&gt;. Knowing that a competitor appears in 40% of relevant AI answers while you appear in 15% is more actionable than knowing you &quot;rank&quot; third in one ChatGPT prompt.&lt;/p&gt;
&lt;p&gt;Review your content strategy through the lens of intent coverage. Botify&apos;s framework segments visibility by intent type. Even without their tool, you can manually test whether your brand appears in AI responses for transactional, informational, and navigational queries in your space.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Metric volatility in early tracking.&lt;/strong&gt; AI responses fluctuate based on model updates, user context, and conversation history. Short observation windows will produce unreliable baselines. Give any new AI visibility metric at least several weeks before drawing conclusions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Platform coverage gaps.&lt;/strong&gt; Not all AI visibility tools track the same platforms. Confirm which AI surfaces (ChatGPT, Gemini, Perplexity, Google AI Mode, Copilot) your tracking covers before assuming full visibility data.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/botify-launches-ai-visibility-replaces-rank-with-mentions.webp" medium="image" type="image/webp"/></item><item><title>AI Overview citations now diverge sharply from top 10 rankings</title><link>https://technicalseonews.com/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings</link><guid isPermaLink="true">https://technicalseonews.com/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings</guid><description>AI Overview citations now diverge sharply from top 10 rankings, with only 38% overlap versus 76% seven months ago. Query fan-out mechanism explains the shift.</description><pubDate>Sat, 25 Apr 2026 10:13:37 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Only 38% of pages cited in AI Overviews also rank in the top 10 organic results for the same query, down from 76% seven months ago. The finding comes from an &lt;a href=&quot;https://www.searchenginejournal.com/seo-pulse-aio-citations-diverge-from-rankings-bing-rewrites-rules/568881/&quot;&gt;updated Ahrefs analysis&lt;/a&gt; of 863,000 keywords and 4 million AI Overview URLs, reported by Search Engine Journal on March 6.&lt;/p&gt;
&lt;p&gt;The remaining citations split almost evenly between positions 11–100 (31.2%) and pages ranked beyond position 100 (31.0%). Roughly two out of three AI Overview citations now come from pages outside the top 10.&lt;/p&gt;
&lt;p&gt;A separate BrightEdge analysis from February placed the top 10 overlap even lower, at about 17%. The two studies use different methodologies and datasets, but the direction is consistent.&lt;/p&gt;
&lt;p&gt;Ahrefs attributes part of the shift to Google&apos;s query fan-out process. A single search gets decomposed into multiple sub-queries, and citations are drawn from pages that appear most often across those sub-query results. Google also upgraded AI Overviews to Gemini 3 globally in January, which Ahrefs flags as relevant timing context.&lt;/p&gt;
&lt;p&gt;YouTube is the most cited domain in AI Overviews overall and has grown 34% over the past six months, according to Ahrefs&apos; Brand Radar data.&lt;/p&gt;
&lt;p&gt;BrightEdge&apos;s research also showed AI Overviews grew 58% year-over-year and now appear on approximately 48% of all tracked queries. The growth varies sharply by vertical:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Education:&lt;/strong&gt; from 18% to 83%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;B2B technology:&lt;/strong&gt; from 36% to 82%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Restaurants:&lt;/strong&gt; from 10% to 78%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Healthcare:&lt;/strong&gt; from 72% to 88%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AI Overviews now consume more than 1,200 pixels on average, pushing the first organic result below the fold on standard desktop screens, as Roger Montti noted in his coverage.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Seven months ago, ranking in the top 10 and earning AI Overview citations were largely the same goal. That relationship has broken. Pages ranking well in classic search no longer have a reliable path into AI Overviews, and pages outside the top 10 are getting cited at nearly the same rate.&lt;/p&gt;
&lt;p&gt;The query fan-out mechanism is the key factor. Google splits a complex query into sub-queries and pulls citations from whichever pages rank well for those narrower questions. A page that ranks 45th for a broad query might rank 3rd for one of the sub-queries, and that&apos;s enough to earn a citation.&lt;/p&gt;
&lt;p&gt;The practical split matters, too. Classic search results still appear without any AI Overview on 52% of tracked queries. For verticals where AI Overviews remain uncommon, organic ranking is still the entire game. For education, B2B tech, healthcare, and restaurants, the calculus has shifted.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://developers.google.com/search/docs/appearance/ai-features&quot;&gt;Google&apos;s own documentation on AI features&lt;/a&gt; states there are &quot;no additional requirements to appear in AI Overviews or AI Mode, nor other special optimizations necessary.&quot; The guidance directs site owners to standard SEO best practices. Google also claims AI Overviews are driving visits to &quot;a greater diversity of websites.&quot;&lt;/p&gt;
&lt;p&gt;The Ahrefs and BrightEdge data suggest the &quot;greater diversity&quot; claim is directionally true: pages outside the top 10 are getting cited. Whether those citations drive meaningful traffic is a separate question neither study answers.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Check your AI Overview citation rate against your organic rankings.&lt;/strong&gt; Tools like Ahrefs&apos; Brand Radar and BrightEdge can show where your pages appear in AI Overviews versus where they rank organically. If you&apos;re only tracking traditional rank positions, you&apos;re missing the picture for nearly half of all queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Identify your sub-query surface area.&lt;/strong&gt; The query fan-out process means your page doesn&apos;t need to rank for the broad head term. It needs to be the best answer for one of the sub-queries Google generates. Review which specific, narrow questions your content answers well, and make sure those answers are clear and self-contained within the page.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Segment your keyword set by AI Overview presence.&lt;/strong&gt; The 52% of queries still without AI Overviews warrant a different strategy than the 48% with them. If your vertical falls in the high-AIO group (education, B2B tech, healthcare, restaurants), &lt;a href=&quot;/latest/semrush-playbook-targets-saas-citation-failures-in-ai-search&quot;&gt;citation tracking should be part of your reporting&lt;/a&gt;. If your vertical still sees mostly classic results, traditional rank tracking remains the primary metric.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t chase special AIO formatting.&lt;/strong&gt; Google&apos;s documentation is explicit: no special requirements exist for appearing in AI Overviews. Standard SEO fundamentals apply. Focus on answering specific questions well rather than adding schema or markup specifically for AI features.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Misreading the overlap drop as purely a methodology change.&lt;/strong&gt; Ahrefs acknowledges that improved parsing contributed to the gap between their July 2025 and current figures. The two datasets aren&apos;t perfectly comparable. But the BrightEdge study, using entirely different methods, arrived at an even lower overlap figure of 17%. The directional trend is real, even if the exact magnitude is debatable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assuming AI Overview citations equal traffic.&lt;/strong&gt; Neither the Ahrefs nor BrightEdge study measured click-through rates from AI Overviews. Being cited is not the same as receiving visits. With AI Overviews consuming 1,200+ pixels of screen space, the click dynamics differ substantially from traditional blue links.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/ai-overview-citations-now-diverge-sharply-from-top-10-rankings.webp" medium="image" type="image/webp"/></item><item><title>AI bot traffic starves Googlebot of crawl budget on large sites</title><link>https://technicalseonews.com/latest/ai-bot-traffic-starves-googlebot-of-crawl-budget-on-large-sites</link><guid isPermaLink="true">https://technicalseonews.com/latest/ai-bot-traffic-starves-googlebot-of-crawl-budget-on-large-sites</guid><description>AI crawler traffic is consuming server bandwidth and crawl budget on large sites, potentially throttling Googlebot discovery and indexing of important pages.</description><pubDate>Sat, 25 Apr 2026 10:11:16 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;AI bot traffic grew sharply across 2025, and &lt;a href=&quot;https://www.botify.com/blog/infrastructure-impact-ai-bot-traffic&quot;&gt;Botify reports&lt;/a&gt; that the surge is creating real infrastructure problems for large websites. &lt;a href=&quot;/latest/openai-crawl-activity-tripled-after-gpt-5-led-by-search-bot&quot;&gt;Crawlers from companies like OpenAI and Anthropic&lt;/a&gt; are consuming bandwidth, destabilizing servers, and crowding out the search engine bots that actually drive organic visibility.&lt;/p&gt;
&lt;p&gt;The problem is straightforward. Every bot request consumes server resources regardless of intent. When AI crawlers pile on top of existing search engine crawlers, monitoring tools, and malicious scrapers, total request volume can exceed what a site&apos;s infrastructure was built to handle.&lt;/p&gt;
&lt;p&gt;Botify flags five specific consequences: bandwidth consumption, uptime and stability risk, security exposure, analytics pollution, and unpredictable cost increases.&lt;/p&gt;
&lt;p&gt;The Wikimedia Foundation offers a concrete example. The organization reported bandwidth surges of over 50% last year as AI crawlers scraped content for LLM training. When baseline bandwidth gets eaten by AI bots, less remains for human visitors and for Googlebot.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Crawl budget is finite. If AI bots are hammering your servers, Googlebot may get throttled or deprioritized before it reaches important pages. For large sites with millions of URLs, this can directly reduce the number of pages Google discovers and indexes.&lt;/p&gt;
&lt;p&gt;The stability risks compound the problem. Sudden bot traffic bursts force servers to scale up and down unpredictably, which degrades caching and increases error rates during deployments. Diagnosing issues becomes harder because traffic patterns no longer reflect normal user behavior.&lt;/p&gt;
&lt;p&gt;Analytics reliability also takes a hit. Botify notes that ChatGPT prompts have appeared in Google Search Console data as search queries, muddying keyword analysis. While OpenAI has claimed this specific issue is resolved, it illustrates how AI platform activity can quietly distort the data SEOs depend on.&lt;/p&gt;
&lt;p&gt;The cost dimension is harder to pin down. Botify acknowledges that exact figures depend on each site&apos;s architecture, but the pattern is consistent: unplanned bot traffic drives unplanned infrastructure costs.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your bot traffic first.&lt;/strong&gt; Check server logs to understand which AI crawlers are hitting your site, how frequently, and which URLs they target. You cannot manage what you have not measured. Look for user-agent strings from GPTBot, ClaudeBot, and other known AI crawlers.&lt;/p&gt;
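&lt;p&gt;A minimal log-scanning sketch along those lines; the log path, format, and bot list are assumptions to adjust for your own setup:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Count requests per known AI crawler in a web server access log.
# The log path is an assumption; point it at your own access logs.
from collections import Counter

AI_BOTS = [&quot;GPTBot&quot;, &quot;OAI-SearchBot&quot;, &quot;ClaudeBot&quot;, &quot;PerplexityBot&quot;]
counts = Counter()

with open(&quot;/var/log/nginx/access.log&quot;, encoding=&quot;utf-8&quot;, errors=&quot;ignore&quot;) as log:
    for line in log:
        for bot in AI_BOTS:
            if bot.lower() in line.lower():
                counts[bot] += 1
                break

for bot, hits in counts.most_common():
    print(bot, hits)
&lt;/code&gt;&lt;/pre&gt;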
&lt;p&gt;&lt;strong&gt;Block unwanted AI crawlers in robots.txt.&lt;/strong&gt; If specific AI bots provide no value to your business, disallow them. A simple addition handles this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want AI crawlers to access some content but not all, use targeted disallow rules for high-volume sections. &lt;a href=&quot;https://moz.com/learn/seo/robotstxt&quot;&gt;Moz&apos;s robots.txt guide&lt;/a&gt; covers the syntax in detail.&lt;/p&gt;
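&lt;p&gt;As a sketch, a partial-access policy might look like the following; the paths are placeholders for whatever high-volume sections you want to shield:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /search/
Disallow: /archive/
&lt;/code&gt;&lt;/pre&gt;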
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers&quot;&gt;Use the &lt;code&gt;noai&lt;/code&gt; and &lt;code&gt;noimageai&lt;/code&gt; robots meta directives&lt;/a&gt; where supported.&lt;/strong&gt; &lt;a href=&quot;https://developers.google.com/search/reference/robots_meta_tag&quot;&gt;Google&apos;s robots meta tag documentation&lt;/a&gt; explains how page-level meta tags work. Some AI crawlers honor these directives, though compliance varies.&lt;/p&gt;
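&lt;p&gt;Where a crawler does honor them, the directives are expressed as an ordinary robots meta tag; treat this as a best-effort signal rather than a block, since support varies by crawler:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;meta name=&quot;robots&quot; content=&quot;noai, noimageai&quot;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;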
&lt;p&gt;&lt;strong&gt;Implement HTTP 429 responses for aggressive crawlers.&lt;/strong&gt; &lt;a href=&quot;https://www.rfc-editor.org/rfc/rfc6585.html&quot;&gt;RFC 6585&lt;/a&gt; defines the 429 &quot;Too Many Requests&quot; status code, which tells a client it has sent too many requests. Configure your server or CDN to return 429 responses when bot traffic exceeds acceptable thresholds. Cloudflare and similar CDN providers offer &lt;a href=&quot;https://www.cloudflare.com/en-gb/learning/bots/what-is-a-bot/&quot;&gt;bot management features&lt;/a&gt; that can help automate this.&lt;/p&gt;
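&lt;p&gt;One way to do this at the web server layer is an nginx rate limit keyed to AI crawler user agents. A sketch, assuming nginx fronts your origin; the bot list and thresholds are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rate-limit requests whose user agent matches known AI crawlers (http context).
map $http_user_agent $ai_bot {
    default                             &quot;&quot;;
    ~*(GPTBot|ClaudeBot|PerplexityBot)  $binary_remote_addr;
}

limit_req_zone $ai_bot zone=aibots:10m rate=1r/s;

server {
    location / {
        limit_req zone=aibots burst=10 nodelay;
        limit_req_status 429;  # Too Many Requests, per RFC 6585
    }
}
&lt;/code&gt;&lt;/pre&gt;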
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/google-s-web-bot-auth-adds-cryptographic-bot-identity&quot;&gt;Separate bot traffic in your analytics&lt;/a&gt;.&lt;/strong&gt; Filter known bot user agents from your web analytics and GSC analysis. Treating bot-inflated numbers as real user signals will lead to bad decisions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prioritize Googlebot access.&lt;/strong&gt; If you must throttle overall bot traffic, make sure Googlebot and Bingbot are allowlisted. These are the crawlers that feed your organic search visibility. Everything else is secondary unless you have a specific reason to allow it.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Not all AI crawlers identify themselves.&lt;/strong&gt; Some bots use generic or misleading user-agent strings. Robots.txt rules only work against crawlers that declare who they are and choose to obey the file. Server-side rate limiting based on behavior patterns catches what robots.txt cannot.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blocking too aggressively can backfire.&lt;/strong&gt; If your brand benefits from appearing in AI-generated answers, blanket-blocking all AI crawlers removes you from those results. Decide which AI platforms matter to your traffic before setting disallow rules.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/ai-bot-traffic-competes-with-googlebot-for-crawl-budget.webp" medium="image" type="image/webp"/></item><item><title>GSC shows pages as indexed but Google won&apos;t serve them</title><link>https://technicalseonews.com/latest/gsc-shows-pages-as-indexed-but-google-won-t-serve-them</link><guid isPermaLink="true">https://technicalseonews.com/latest/gsc-shows-pages-as-indexed-but-google-won-t-serve-them</guid><description>GSC shows pages as indexed but Google refuses to serve them in results or site: queries. The gap between indexing and serving creates blind spots for practitioners.</description><pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A practitioner in &lt;a href=&quot;https://www.reddit.com/r/bigseo/comments/1svb7nn/gsc_indexed_but_not_showing_on_search_results_or/&quot;&gt;r/bigseo reported&lt;/a&gt; that Google Search Console shows URLs as indexed, yet those pages don&apos;t appear in search results or return anything via &lt;code&gt;site:&lt;/code&gt; queries. The post, submitted on April 25, 2026, asks why GSC&apos;s index coverage report would confirm indexing for pages that Google effectively refuses to serve.&lt;/p&gt;
&lt;p&gt;The scenario is a familiar one for practitioners: GSC&apos;s URL Inspection tool returns &quot;URL is on Google,&quot; but the page is invisible in both organic results and &lt;code&gt;site:&lt;/code&gt; operator checks.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;GSC&apos;s index coverage data and Google&apos;s actual serving behavior are not the same thing. &lt;a href=&quot;https://developers.google.com/search/docs/beginner/how-search-works&quot;&gt;Google&apos;s own documentation&lt;/a&gt; states that Search works in &lt;a href=&quot;/latest/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves&quot;&gt;three stages: crawling, indexing, and serving&lt;/a&gt;. A page can pass through the first two stages without making it to the third. Google explicitly notes it &quot;doesn&apos;t guarantee that it will crawl, index, or serve your page.&quot;&lt;/p&gt;
&lt;p&gt;The distinction between &quot;indexed&quot; and &quot;served&quot; catches practitioners off guard because GSC doesn&apos;t surface a separate &quot;not serving&quot; status. The URL Inspection tool reports whether a page is in the index. It does not report whether Google will actually return that page for any query. A page can sit in the index but be suppressed from results due to quality filters, duplicate content consolidation, or manual actions.&lt;/p&gt;
&lt;p&gt;For sites that rely on GSC as their primary source of truth for indexation health, the gap creates a blind spot. Pages that appear healthy in reports may be generating zero impressions for months without triggering any alert.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;site:&lt;/code&gt; operator adds another layer of confusion. Google has repeatedly said that &lt;code&gt;site:&lt;/code&gt; results are not a reliable indicator of what is or isn&apos;t indexed. However, when both &lt;code&gt;site:&lt;/code&gt; and organic results return nothing for a page that GSC marks as indexed, the practical conclusion is that Google has chosen not to serve it.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Check the Performance report first.&lt;/strong&gt; Filter by the specific URL in GSC&apos;s Performance tab. If a page shows zero impressions over 90 days despite being &quot;indexed,&quot; Google is likely suppressing it from serving. The Performance report is a more reliable signal of serving status than the Index Coverage report.&lt;/p&gt;
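&lt;p&gt;You can pull the same data programmatically through the Search Console API. A sketch with the Python client, assuming default credentials are already configured; the property and page URLs are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sum roughly 90 days of impressions for a single URL via the Search Console API.
# siteUrl and the page expression are placeholders; auth setup is omitted.
from googleapiclient.discovery import build

service = build(&quot;searchconsole&quot;, &quot;v1&quot;)

response = service.searchanalytics().query(
    siteUrl=&quot;https://www.example.com/&quot;,
    body={
        &quot;startDate&quot;: &quot;2026-01-25&quot;,
        &quot;endDate&quot;: &quot;2026-04-25&quot;,
        &quot;dimensions&quot;: [&quot;page&quot;],
        &quot;dimensionFilterGroups&quot;: [{
            &quot;filters&quot;: [{
                &quot;dimension&quot;: &quot;page&quot;,
                &quot;operator&quot;: &quot;equals&quot;,
                &quot;expression&quot;: &quot;https://www.example.com/some-page/&quot;,
            }]
        }],
    },
).execute()

impressions = sum(row[&quot;impressions&quot;] for row in response.get(&quot;rows&quot;, []))
print(&quot;Impressions over the window:&quot;, impressions)
&lt;/code&gt;&lt;/pre&gt;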
&lt;p&gt;&lt;strong&gt;Look for quality signals.&lt;/strong&gt; Pages that are indexed but not served often have thin content, are near-duplicates of other pages on the site, or target queries where Google doesn&apos;t consider the page competitive. Review the page against other URLs on your site that cover similar topics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/mueller-lists-nine-reasons-google-overrides-your-rel-canonical&quot;&gt;Check for canonical conflicts&lt;/a&gt;.&lt;/strong&gt; Run the URL Inspection tool and compare the &quot;Google-selected canonical&quot; with the &quot;User-declared canonical.&quot; If Google has selected a different canonical than the one you declared, it may be consolidating the page&apos;s signals into another URL and choosing not to serve the inspected one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Review manual actions and security issues.&lt;/strong&gt; The Manual Actions report in GSC can confirm whether Google has taken action against specific pages or the site overall. Security Issues can also suppress serving.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test with a verbatim title search.&lt;/strong&gt; Search for the exact title of the page in quotes. If the page doesn&apos;t appear for its own title, the suppression is strong. If it does appear for the exact title but nothing else, the page likely lacks sufficient quality or relevance signals for broader queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t rely on &lt;code&gt;site:&lt;/code&gt; as a diagnostic.&lt;/strong&gt; The &lt;code&gt;site:&lt;/code&gt; operator returns a sampled, approximate set of results. It is not a definitive list of indexed or served pages.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&quot;Indexed&quot; does not mean &quot;serving.&quot;&lt;/strong&gt; GSC&apos;s coverage report confirms that Google has processed and stored the page. It does not confirm that Google will return it for any query. Treat zero-impression indexed pages as functionally unindexed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Canonical swaps happening silently.&lt;/strong&gt; Google can change its selected canonical at any time without notification. A page that was serving last month may stop serving this month because Google decided a different URL is the canonical. The URL Inspection tool is the only place to check this on a per-URL basis.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/gsc-shows-pages-as-indexed-but-google-won-t-serve-them.webp" medium="image" type="image/webp"/></item><item><title>GSC coverage report contradicts URL Inspection on index status</title><link>https://technicalseonews.com/latest/gsc-coverage-report-contradicts-url-inspection-on-index-status</link><guid isPermaLink="true">https://technicalseonews.com/latest/gsc-coverage-report-contradicts-url-inspection-on-index-status</guid><description>GSC&apos;s coverage report shows pages as undiscovered while URL Inspection marks them indexed, creating confusion for SEOs troubleshooting indexation problems.</description><pubDate>Thu, 23 Apr 2026 17:57:37 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A practitioner in &lt;a href=&quot;https://www.reddit.com/r/bigseo/comments/1stnfzv/pages_showing_discovered_currently_not_indexed/&quot;&gt;r/bigseo reported&lt;/a&gt; that pages listed as &quot;Discovered - currently not indexed&quot; in Google Search Console&apos;s coverage report simultaneously show &quot;URL is on Google&quot; when checked through URL Inspection. The post, submitted on April 23, 2026, highlights a contradiction between two GSC tools that should agree on whether a URL is indexed.&lt;/p&gt;
&lt;p&gt;The coverage report (also called the Pages report or Index Coverage report) aggregates index status across an entire property. URL Inspection performs a live or cached lookup on individual URLs. When these two tools disagree, practitioners lose confidence in both.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The &quot;Discovered - currently not indexed&quot; status means Google found a URL but hasn&apos;t crawled it yet. Practitioners routinely use this status to diagnose crawl budget issues, thin content problems, or quality signals that prevent indexing. If the coverage report incorrectly labels indexed pages with this status, SEOs may waste time troubleshooting pages that are actually performing fine.&lt;/p&gt;
&lt;p&gt;The reverse scenario is equally problematic. If URL Inspection shows a cached version that&apos;s stale, the coverage report might be the accurate one. Practitioners have no reliable way to determine which tool reflects the current state of the index.&lt;/p&gt;
&lt;p&gt;According to &lt;a href=&quot;https://developers.google.com/search/docs/fundamentals/how-search-works&quot;&gt;Google&apos;s documentation on how Search works&lt;/a&gt;, &lt;a href=&quot;/latest/mueller-lists-nine-reasons-google-overrides-your-rel-canonical&quot;&gt;Google does not guarantee it will crawl&lt;/a&gt;, index, or serve any page. The indexing pipeline has multiple stages, and a URL&apos;s status can change between them. Delays between the coverage report&apos;s batch processing and URL Inspection&apos;s on-demand checks are one likely explanation for the mismatch.&lt;/p&gt;
&lt;p&gt;Coverage report data refreshes on a different schedule than URL Inspection lookups. The coverage report reflects a snapshot that can lag by days or even weeks. URL Inspection can perform a live test that reflects a more current state. When a page transitions from &quot;discovered&quot; to &quot;indexed&quot; between report refreshes, the discrepancy appears.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Use URL Inspection&apos;s &quot;live test&quot; feature for the most current view of individual URLs. The coverage report is useful for spotting patterns at scale, but its data is not real-time. Note that the live test shows how Googlebot would fetch and render a page right now, which is not the same as confirming the page is in the live index.&lt;/p&gt;
&lt;p&gt;Run a &lt;code&gt;site:&lt;/code&gt; search for the affected URLs directly in Google as a secondary check. If the page appears, it is likely indexed. However, &lt;code&gt;site:&lt;/code&gt; results are not exhaustive and may omit indexed pages, so a missing result does not confirm the page is unindexed.&lt;/p&gt;
&lt;p&gt;Check the timestamps in both tools. URL Inspection shows when Google last crawled the page. If that date is more recent than the coverage report&apos;s data range, the coverage report may be stale for that URL.&lt;/p&gt;
&lt;p&gt;For pages genuinely stuck in &quot;Discovered - currently not indexed,&quot; review these factors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Internal linking&lt;/strong&gt;: Pages with few or no internal links are less likely to be crawled and indexed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crawl budget&lt;/strong&gt;: Large sites with many low-value pages may see Google deprioritize certain URLs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Content quality&lt;/strong&gt;: Thin or duplicate content can cause Google to discover a page but decline to index it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have a batch of pages showing the discrepancy, export the affected URLs from the coverage report and run them through the URL Inspection API. Compare the results to identify whether the mismatch is widespread or limited to a few URLs that recently transitioned status.&lt;/p&gt;
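&lt;p&gt;A sketch of that comparison loop with the Python client; the property URL and the exported list are placeholders, and per-day API quotas apply, so batch large exports accordingly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check index status for exported URLs via the URL Inspection API.
# siteUrl and the urls list are placeholders; default credentials are assumed.
from googleapiclient.discovery import build

service = build(&quot;searchconsole&quot;, &quot;v1&quot;)
site = &quot;https://www.example.com/&quot;
urls = [
    &quot;https://www.example.com/page-a/&quot;,
    &quot;https://www.example.com/page-b/&quot;,
]

for url in urls:
    result = service.urlInspection().index().inspect(
        body={&quot;inspectionUrl&quot;: url, &quot;siteUrl&quot;: site}
    ).execute()
    status = result[&quot;inspectionResult&quot;][&quot;indexStatusResult&quot;]
    # coverageState is a human-readable string such as
    # &quot;Submitted and indexed&quot; or &quot;Discovered - currently not indexed&quot;.
    print(url, status.get(&quot;coverageState&quot;))
&lt;/code&gt;&lt;/pre&gt;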
&lt;p&gt;Do not request re-indexing for pages that URL Inspection already confirms are indexed. Submitting unnecessary indexing requests does not help and may slow processing of URLs that actually need attention.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Stale coverage report data mistaken for an indexing problem.&lt;/strong&gt; The coverage report can lag behind real index changes by days. Always cross-reference with URL Inspection or a live &lt;code&gt;site:&lt;/code&gt; search before diagnosing an issue.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conflating &quot;Discovered - currently not indexed&quot; with &quot;Crawled - currently not indexed.&quot;&lt;/strong&gt; These are different statuses. &quot;Discovered&quot; means Google found the URL but hasn&apos;t crawled it yet. &quot;Crawled&quot; means Google fetched the page but chose not to index it. The diagnosis and fix differ for each.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/gsc-coverage-report-contradicts-url-inspection-on-index-status.webp" medium="image" type="image/webp"/></item><item><title>Mueller doubts freshness-based sitemap splits speed crawling</title><link>https://technicalseonews.com/latest/mueller-doubts-freshness-based-sitemap-splits-speed-crawling</link><guid isPermaLink="true">https://technicalseonews.com/latest/mueller-doubts-freshness-based-sitemap-splits-speed-crawling</guid><description>Mueller doubts freshness-based sitemap splits influence crawl frequency, questioning a widely used enterprise SEO tactic with no confirmed crawl benefit.</description><pubDate>Sat, 18 Apr 2026 12:01:10 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google&apos;s John Mueller listed five reasons why SEOs split XML sitemaps into multiple files, but cast doubt on one of the most commonly cited strategies: separating evergreen content from fresh content to influence crawl frequency. Mueller shared the list &lt;a href=&quot;https://www.searchenginejournal.com/google-answers-why-some-seos-split-their-sitemap-into-multiple-files/571097/&quot;&gt;in a discussion covered by Search Engine Journal&lt;/a&gt; on April 3, 2026.&lt;/p&gt;
&lt;p&gt;An SEO had asked why anyone would choose to manage multiple sitemap files instead of keeping everything in one. Mueller responded with five reasons he&apos;s seen:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tracking URL groups:&lt;/strong&gt; Splitting by page type (e.g., product detail pages vs. category pages) to monitor indexing separately, similar to what the Page Indexing report in Search Console offers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freshness-based splitting:&lt;/strong&gt; Placing evergreen content in a separate file so search engines might check the &quot;old&quot; sitemap less often. Mueller added a caveat: &quot;I don&apos;t know if this actually happens though.&quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proactive splitting:&lt;/strong&gt; Breaking up a sitemap before it hits the 50,000-URL limit, so there&apos;s no scramble to restructure later.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hreflang sitemaps:&lt;/strong&gt; Multilingual markup can consume so much space that even 50,000 URLs push files past the size limit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated tooling:&lt;/strong&gt; Some CMS or deployment systems generate multiple files without anyone choosing that structure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The freshness-based split is a tactic that circulates widely in enterprise SEO circles. The idea is straightforward: if Google sees a sitemap file that rarely changes, it might deprioritize re-crawling those URLs and spend more crawl budget on a separate file containing frequently updated pages. Mueller&apos;s explicit uncertainty about whether search engines actually behave this way is a meaningful signal. It suggests Google has not committed to treating sitemap files as crawl-priority signals.&lt;/p&gt;
&lt;p&gt;The proactive splitting and hreflang reasons are more grounded. The &lt;a href=&quot;https://www.sitemaps.org/protocol.html&quot;&gt;sitemaps protocol specification&lt;/a&gt; caps each file at 50,000 URLs and 50MB uncompressed. Hreflang annotations add &lt;code&gt;&amp;lt;xhtml:link&amp;gt;&lt;/code&gt; child elements inside each &lt;code&gt;&amp;lt;url&amp;gt;&lt;/code&gt; entry, which can push the file past the 50MB uncompressed size limit even when the &lt;code&gt;&amp;lt;url&amp;gt;&lt;/code&gt; count stays under 50,000.&lt;/p&gt;
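&lt;p&gt;For illustration, a single &lt;code&gt;&amp;lt;url&amp;gt;&lt;/code&gt; entry with three language alternates looks like the following (URLs are placeholders). It still counts as one URL against the 50,000 cap, but multiply the alternate lines by every page and every language and the file size grows quickly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;url&amp;gt;
  &amp;lt;loc&amp;gt;https://example.com/en/pricing/&amp;lt;/loc&amp;gt;
  &amp;lt;xhtml:link rel=&quot;alternate&quot; hreflang=&quot;en&quot; href=&quot;https://example.com/en/pricing/&quot;/&amp;gt;
  &amp;lt;xhtml:link rel=&quot;alternate&quot; hreflang=&quot;de&quot; href=&quot;https://example.com/de/preise/&quot;/&amp;gt;
  &amp;lt;xhtml:link rel=&quot;alternate&quot; hreflang=&quot;fr&quot; href=&quot;https://example.com/fr/tarifs/&quot;/&amp;gt;
&amp;lt;/url&amp;gt;
&lt;/code&gt;&lt;/pre&gt;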
&lt;p&gt;Search Engine Journal&apos;s coverage also notes a claim from enterprise-level SEOs that keeping sitemaps well under 50,000 lines improves indexing. Mueller did not confirm or deny that claim.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;For most sites under 50,000 URLs without hreflang in sitemaps, a single sitemap file works fine. The overhead of managing multiple files adds complexity without a confirmed crawl benefit.&lt;/p&gt;
&lt;p&gt;If you&apos;re running a large site, split sitemaps by page type rather than by freshness. Grouping product pages, category pages, and editorial content into separate files makes the Search Console Page Indexing report more useful for diagnosing coverage issues by section.&lt;/p&gt;
&lt;p&gt;Sites using hreflang sitemaps should check file sizes. The 50,000-URL cap counts &lt;code&gt;&amp;lt;url&amp;gt;&lt;/code&gt; entries only, not the &lt;code&gt;&amp;lt;xhtml:link&amp;gt;&lt;/code&gt; child elements within them. But hreflang annotations add significant XML bulk. If your file approaches 50MB uncompressed, split by language group or site section.&lt;/p&gt;
&lt;p&gt;Don&apos;t rely on freshness-based splits as a crawl budget strategy. &lt;a href=&quot;/latest/mueller-lists-nine-reasons-google-overrides-your-rel-canonical&quot;&gt;Mueller&apos;s own doubt about the practice&lt;/a&gt; means there&apos;s no confirmed mechanism backing it. If you&apos;ve already structured sitemaps this way, there&apos;s no reason to undo it, but don&apos;t expect it to influence crawl frequency.&lt;/p&gt;
&lt;p&gt;Check whether your CMS auto-generates multiple sitemap files. WordPress with Yoast, for example, creates separate sitemaps for posts, pages, and taxonomies by default. If that structure maps to how you&apos;d want to monitor indexing, leave it. If it creates noise, consider whether a custom sitemap setup would be cleaner.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Assuming sitemap structure controls crawl priority.&lt;/strong&gt; &lt;a href=&quot;/latest/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves&quot;&gt;Sitemaps are a discovery mechanism, not a directive&lt;/a&gt;. Google decides crawl frequency based on its own signals, not on how you organize your sitemap files. Mueller&apos;s answer reinforces that even the freshness-split theory lacks confirmation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hreflang file sizes being deceptive.&lt;/strong&gt; A sitemap with 5,000 pages might look small, but if each page lists 15 language alternates via &lt;code&gt;&amp;lt;xhtml:link&amp;gt;&lt;/code&gt; elements, the XML file size grows substantially. The 50,000-URL cap counts &lt;code&gt;&amp;lt;url&amp;gt;&lt;/code&gt; entries, not child elements, so hreflang won&apos;t push you past the URL limit at 5,000 pages. But the added XML markup can push the file past the 50MB uncompressed size limit. Check file size, not just URL count.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/mueller-doubts-freshness-based-sitemap-splits-speed-crawling.webp" medium="image" type="image/webp"/></item><item><title>Claude plugin for GSC and Ads won&apos;t replace your SEO stack</title><link>https://technicalseonews.com/latest/claude-plugin-for-gsc-and-ads-won-t-replace-your-seo-stack</link><guid isPermaLink="true">https://technicalseonews.com/latest/claude-plugin-for-gsc-and-ads-won-t-replace-your-seo-stack</guid><description>A removed r/TechSEO post claimed a Claude plugin replaced a full SEO stack, but the project required a paid third-party service and lacked competitive data.</description><pubDate>Sat, 18 Apr 2026 11:48:56 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A user on &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1siajst/i_replaced_my_500usdmo_seo_google_ads_stack_with/&quot;&gt;r/TechSEO posted an AMA&lt;/a&gt; claiming they replaced a $500/month SEO and Google Ads tool stack with a single &lt;a href=&quot;/latest/se-ranking-mcp-server-enables-agentic-seo-via-claude-code&quot;&gt;Claude Code plugin&lt;/a&gt;. The post, titled &quot;I replaced my 500USD/mo SEO + Google Ads stack with a Claude Code plugin. Open-sourcing it,&quot; was removed by moderators. The project appeared to require an external service called &quot;adsagent.org&quot; to function.&lt;/p&gt;
&lt;p&gt;Community response was largely negative. Several commenters identified the post as a thinly veiled advertisement for the adsagent.org service, which linked back to the GitHub project. One commenter, u/jasongill, called it out directly: &quot;This is clearly an ad for &apos;adsagent.org&apos; which is required to use it.&quot; Another user, u/WebLinkr, labeled it &quot;AI content phishing.&quot;&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The post reflects a growing pattern of open-source tool announcements in SEO communities that function as lead generation for paid services. The &quot;open-sourcing it&quot; framing suggests free tooling, but requiring an external paid dependency undermines that premise.&lt;/p&gt;
&lt;p&gt;The underlying idea of using LLM-based plugins to interact with the Google Ads API and &lt;a href=&quot;/latest/gsc-coverage-report-contradicts-url-inspection-on-index-status&quot;&gt;Google Search Console API&lt;/a&gt; is technically plausible. Google provides programmatic access to both services. The &lt;a href=&quot;https://developers.google.com/google-ads/api/docs/start&quot;&gt;Google Ads API&lt;/a&gt; supports automated account management, custom reporting, and campaign management at the keyword level. Connecting an LLM to these APIs could help with natural-language querying of performance data or generating reports.&lt;/p&gt;
&lt;p&gt;The gap between &quot;technically possible&quot; and &quot;production-ready replacement for your SEO stack&quot; is enormous. Tools like Ahrefs, Semrush, or SE Ranking don&apos;t just pull API data. They maintain their own crawl indexes, backlink databases, keyword difficulty scores, and historical trend data. An LLM plugin that queries GSC and Google Ads APIs cannot replicate those datasets.&lt;/p&gt;
&lt;p&gt;One commenter, u/Empty-Employment8050, made a practical observation: &quot;you could just take this framework and build it yourself in like an afternoon.&quot; The comment highlights that the API integration itself is straightforward. The value in paid SEO tools comes from proprietary data and long-running infrastructure, not from the API connection layer.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;No action is needed on this specific tool, which was removed by moderators. The broader trend of &lt;a href=&quot;/latest/google-tells-developers-to-build-websites-for-ai-agents&quot;&gt;LLM plugins for SEO APIs&lt;/a&gt; is worth understanding, though.&lt;/p&gt;
&lt;p&gt;If you&apos;re evaluating LLM-based SEO tools that claim to replace established platforms, check these things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What data sources does it actually access?&lt;/strong&gt; A plugin that only queries GSC and Google Ads APIs gives you data you already have in those platforms. The value of third-party SEO tools is in data Google doesn&apos;t provide, like backlink profiles, competitor keyword gaps, and crawl analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does &quot;open source&quot; require a paid service?&lt;/strong&gt; Check the dependencies. If the plugin routes requests through a third-party service, you&apos;re not self-hosting anything meaningful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What authentication model does it use?&lt;/strong&gt; Any tool connecting to Google APIs needs proper authentication. Google&apos;s &lt;a href=&quot;https://cloud.google.com/docs/authentication/application-default-credentials&quot;&gt;Application Default Credentials&lt;/a&gt; documentation outlines how credential discovery works. Passing your Google credentials through an unknown third-party service is a security risk for accounts with ad spend or sensitive search data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;LLM plugins that wrap Google APIs can be useful for ad-hoc querying and reporting automation. Treating them as replacements for a full SEO tool stack overstates what API access alone provides.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Credential exposure through third-party proxies.&lt;/strong&gt; If an LLM plugin requires routing your Google API authentication through an external service like adsagent.org, your GSC and Google Ads credentials may be exposed to that service. Review the authentication flow before connecting any tool to accounts with ad spend.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conflating API access with competitive intelligence.&lt;/strong&gt; GSC data covers your own site&apos;s performance. Google Ads API covers your own campaigns. Neither provides competitor backlink data, SERP feature tracking, or third-party crawl data. A plugin that queries these APIs cannot replace tools that maintain independent indexes.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/claude-plugin-for-gsc-and-ads-won-t-replace-your-seo-stack.webp" medium="image" type="image/webp"/></item><item><title>Blocking CSS and JS in robots.txt breaks indexing, not saves</title><link>https://technicalseonews.com/latest/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves</link><guid isPermaLink="true">https://technicalseonews.com/latest/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves</guid><description>Blocking CSS and JS in robots.txt breaks Googlebot&apos;s page rendering and indexing, not crawl budget. Improve cache headers on static assets instead.</description><pubDate>Sat, 18 Apr 2026 11:44:55 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A practitioner &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1s99a97/should_i_block_css_and_js_in_robottxt/&quot;&gt;posted in r/TechSEO&lt;/a&gt; asking whether blocking CSS and JS files in robots.txt would help with crawl budget. The poster reported that 79% of their crawl budget was going to page resource loads, mostly CSS and JS files. The community response was unanimous: do not block these resources.&lt;/p&gt;
&lt;p&gt;The top-voted reply, from user mjmilian with 18 points, was blunt: &quot;No, because Google, and other bots, won&apos;t be able to render your pages correctly.&quot; Another commenter flagged by the subreddit as someone who &quot;knows how the renderer works&quot; endorsed the warning.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;/latest/scoped-custom-element-registries-can-silently-break-crawlability&quot;&gt;Googlebot renders pages. It executes JavaScript and applies CSS&lt;/a&gt; to understand page content and layout. When you block those resources in robots.txt, Googlebot can&apos;t complete the rendering step. The result is that Google sees a broken or incomplete version of your page.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://developers.google.com/search/docs/advanced/robots/robots_txt&quot;&gt;Google&apos;s robots.txt documentation&lt;/a&gt; confirms that crawlers download and parse the robots.txt file before crawling any part of a site. Disallow rules for CSS and JS paths stop Googlebot from fetching those files entirely, not just for the initial crawl but for every rendering attempt.&lt;/p&gt;
&lt;p&gt;The original poster&apos;s concern about crawl budget is understandable. Seeing 79% of crawl activity spent on resources feels wasteful. But those resource fetches are a normal part of how Googlebot processes pages. Blocking them doesn&apos;t free up crawl budget for content pages. It degrades Googlebot&apos;s ability to understand any page that depends on those resources.&lt;/p&gt;
&lt;p&gt;One commenter, mathayles, made a useful distinction about AI crawlers. &lt;a href=&quot;/latest/openai-crawl-activity-tripled-after-gpt-5-led-by-search-bot&quot;&gt;Bots like ClaudeBot only ingest raw HTML&lt;/a&gt; and don&apos;t execute JavaScript at all. Blocking JS has no effect on those crawlers because they never attempt to run it. The rendering concern is specific to Googlebot and a few others like Apple&apos;s crawler.&lt;/p&gt;
&lt;p&gt;A better approach than blocking resources is to examine why Googlebot needs to fetch them so frequently. High resource fetch rates can indicate issues like uncacheable assets, excessive file counts, or poor use of cache headers.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t block CSS or JS in robots.txt.&lt;/strong&gt; If you&apos;re seeing high resource crawl rates, the fix is not to hide resources from Googlebot.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check your resource caching.&lt;/strong&gt; If Googlebot is re-fetching the same CSS and JS files repeatedly, your cache headers may be too short or missing. Set appropriate &lt;code&gt;Cache-Control&lt;/code&gt; headers on static assets so Googlebot doesn&apos;t need to re-download them on every visit.&lt;/p&gt;
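&lt;p&gt;As a sketch, long-lived, versioned static assets can carry a header like the one below; the one-year value is an example, so pick a lifetime that matches how you fingerprint and deploy your files:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Suitable for fingerprinted assets such as app.3f2a1c.js that never change in place
Cache-Control: public, max-age=31536000, immutable
&lt;/code&gt;&lt;/pre&gt;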
&lt;p&gt;&lt;strong&gt;Audit your resource count.&lt;/strong&gt; A page that loads dozens of CSS and JS files forces Googlebot to make many requests per page render. Consolidating or reducing the number of resource files can lower the crawl overhead per page.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Review the crawl stats report in Google Search Console.&lt;/strong&gt; The report breaks down crawl requests by file type. Look at whether resource fetches are spread across many unique URLs or concentrated on a few files being re-fetched. The pattern tells you whether the problem is too many files or poor caching.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consider if your content is JS-dependent.&lt;/strong&gt; If critical content or links are rendered via JavaScript, blocking JS doesn&apos;t just hurt rendering. It makes that content invisible to Google entirely. Server-side rendering or static HTML for key content reduces your dependency on Googlebot&apos;s renderer.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Legacy robots.txt rules you forgot about.&lt;/strong&gt; Some older CMS setups or security-focused configurations ship with blanket disallow rules for &lt;code&gt;/wp-includes/&lt;/code&gt;, &lt;code&gt;/assets/&lt;/code&gt;, or &lt;code&gt;/static/&lt;/code&gt; directories. These can silently block CSS and JS without anyone realizing. Audit your robots.txt for rules that match resource paths.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Confusing AI crawler behavior with Googlebot behavior.&lt;/strong&gt; As one commenter noted, AI crawlers like ClaudeBot don&apos;t render pages at all. Rules that affect Googlebot&apos;s rendering have zero impact on those bots. Don&apos;t assume that what works for one crawler type applies to another.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves.webp" medium="image" type="image/webp"/></item><item><title>Google drops no-JS testing advice from JavaScript SEO docs</title><link>https://technicalseonews.com/latest/google-drops-no-js-testing-advice-from-javascript-seo-docs</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-drops-no-js-testing-advice-from-javascript-seo-docs</guid><description>Google dropped no-JS testing guidance from its JavaScript SEO docs, signaling rendering no longer needs defensive checks. Still verify output with URL Inspection.</description><pubDate>Sat, 18 Apr 2026 08:31:12 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google updated its &lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/javascript/javascript-seo-basics&quot;&gt;JavaScript SEO basics page&lt;/a&gt; on March 4, removing a section that advised developers to design pages for users who &quot;may not be using a JavaScript-capable browser.&quot; The removed section, titled &quot;Design for accessibility,&quot; had recommended testing sites with JavaScript turned off or viewing them in text-only browsers like Lynx.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.searchenginejournal.com/google-removes-javascript-seo-warning-says-its-outdated/568829/&quot;&gt;Search Engine Journal reported&lt;/a&gt; that Google explained the change in its documentation changelog. Google&apos;s stated reason: &quot;The information was out of date and not as helpful as it used to be. Google Search has been rendering JavaScript for multiple years now, so using JavaScript to load content is not &apos;making it harder for Google Search&apos;.&quot;&lt;/p&gt;
&lt;p&gt;Google also noted that most assistive technologies now work with JavaScript, which was another reason the old guidance no longer applied. As &lt;a href=&quot;https://www.seroundtable.com/google-javascript-doc-change-41034.html&quot;&gt;SE Roundtable covered&lt;/a&gt;, the update is part of a broader cleanup of the JS SEO documentation. It marks the fifth change to that page since December, and each revision has replaced broad cautions with more specific technical advice.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The original JavaScript SEO basics page was written when JS rendering was a known pain point for crawling and indexing. Google included several warnings about ensuring Googlebot could process JavaScript content. That era is over, at least for Google&apos;s own crawler.&lt;/p&gt;
&lt;p&gt;Removing the &quot;test with JS disabled&quot; advice signals a real shift in how Google wants practitioners to think about JavaScript. The old mental model was defensive: assume the crawler can&apos;t handle JS. The new model assumes rendering works and focuses on specific failure modes instead.&lt;/p&gt;
&lt;p&gt;That said, the removal does not mean JavaScript causes zero indexing issues. Google&apos;s documentation still notes that Googlebot runs a version of Chromium and that there are things to watch for. The rendering pipeline still involves a crawl, render, and index phase with potential delays between each step.&lt;/p&gt;
&lt;p&gt;The practical gap is with other crawlers. Google&apos;s rendering improvements don&apos;t extend to Bing, social media scrapers, or SEO tools that may not execute JavaScript at all. Sites relying heavily on client-side rendering still face discoverability risks outside of Google Search.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t stop checking rendered output.&lt;/strong&gt; Use the URL Inspection tool in Google Search Console to verify what Googlebot sees after rendering. The fact that Google renders JS well doesn&apos;t mean your specific implementation is rendering correctly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/declarative-shadow-dom-cuts-render-blocking-js&quot;&gt;Keep server-side rendering (SSR) or static rendering&lt;/a&gt; on the table for critical content.&lt;/strong&gt; Google&apos;s &lt;a href=&quot;https://web.dev/articles/rendering-on-the-web&quot;&gt;rendering-on-the-web guide&lt;/a&gt; still recommends SSR or static rendering over full client-side rendering for performance reasons. SEO aside, SSR gives you faster First Contentful Paint and avoids render-dependent indexing delays.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit for non-Google crawlers.&lt;/strong&gt; If your site serves audiences that depend on Bing, Yandex, or social sharing previews, test how those bots see your pages. Tools like &lt;code&gt;curl&lt;/code&gt; or &lt;code&gt;wget&lt;/code&gt; show what a non-rendering bot receives. If &lt;a href=&quot;/latest/scoped-custom-element-registries-can-silently-break-crawlability&quot;&gt;critical content is missing from the initial HTML response&lt;/a&gt;, those crawlers will miss it.&lt;/p&gt;
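&lt;p&gt;If you want to script that check, fetch the initial HTML and test whether the phrases that matter actually appear in it. A minimal sketch with the requests library; the URL and phrases are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check whether key content is present in the initial HTML response,
# which is all a non-rendering crawler will ever see (pip install requests).
import requests

URL = &apos;https://www.example.com/product/widget&apos;   # placeholder
MUST_APPEAR = [&apos;Widget 3000&apos;, &apos;Add to cart&apos;]     # phrases that should be server-rendered

html = requests.get(URL, timeout=10).text
for phrase in MUST_APPEAR:
    status = &apos;OK&apos; if phrase in html else &apos;MISSING from initial HTML&apos;
    print(f&apos;{phrase}: {status}&apos;)
&lt;/code&gt;&lt;/pre&gt;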
&lt;p&gt;&lt;strong&gt;Review your JS SEO checklist if it still includes &quot;disable JavaScript and check.&quot;&lt;/strong&gt; That workflow was useful when Google&apos;s own docs recommended it. It&apos;s no longer part of Google&apos;s guidance, but the underlying principle of verifying crawlable content hasn&apos;t changed. Replace the JS-off test with URL Inspection and rendered DOM comparisons.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Assuming all bots render like Google.&lt;/strong&gt; Google&apos;s changelog specifically says its own renderer handles JavaScript well. Other search engines and crawlers may not. If you strip SSR fallbacks based on this update, non-Google traffic could suffer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conflating rendering capability with rendering speed.&lt;/strong&gt; Google can render JavaScript, but the render queue still introduces delays. Pages that depend entirely on client-side JS may face a gap between crawling and indexing. Time-sensitive content like news or product launches should not rely on client-side rendering alone.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/google-drops-no-js-testing-advice-from-javascript-seo-docs.webp" medium="image" type="image/webp"/></item><item><title>Yotpo injects duplicate FAQPage schema on Shopify pages</title><link>https://technicalseonews.com/latest/yotpo-injects-duplicate-faqpage-schema-on-shopify-pages</link><guid isPermaLink="true">https://technicalseonews.com/latest/yotpo-injects-duplicate-faqpage-schema-on-shopify-pages</guid><description>Yotpo&apos;s Shopify app injects duplicate FAQPage schema without consent, causing validation errors and killing rich results eligibility. Here&apos;s how to remove it.</description><pubDate>Fri, 17 Apr 2026 21:18:48 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Yotpo&apos;s Shopify app is injecting &lt;a href=&quot;https://schema.org/FAQPage&quot;&gt;FAQPage structured data&lt;/a&gt; into product pages without store owners&apos; knowledge, causing duplicate schema and validation errors. A &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1so11pz/fix_for_duplicate_faqs_error_due_to_yotpo_in/&quot;&gt;practitioner reported in r/TechSEO&lt;/a&gt; that their product pages were throwing &quot;Main Entity Missing&quot; errors, with the error field pointing to dynamically generated IDs from yotpo.com.&lt;/p&gt;
&lt;p&gt;The user tried disabling the Yotpo addon and changing its settings, but neither resolved the issue. The FAQPage JSON-LD blocks persisted in the page source.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;/latest/cross-format-structured-data-conflicts-pass-validation-undetected&quot;&gt;Duplicate or malformed FAQPage schema&lt;/a&gt; creates real problems in Google&apos;s Rich Results validation. When two FAQPage blocks exist on a single product page, Google may flag both as invalid or simply ignore them. Either outcome means lost eligibility for &lt;a href=&quot;/latest/google-officially-deprecates-faq-rich-results-as-of-may-2026&quot;&gt;FAQ rich results&lt;/a&gt;, which are now limited to well-known government and health sites.&lt;/p&gt;
&lt;p&gt;The deeper issue is that third-party Shopify apps can inject structured data into your pages without explicit consent. Store owners who have carefully implemented their own FAQ schema may not realize Yotpo is adding a second, conflicting block. Duplicate or conflicting structured data blocks on a single page can cause validation errors and make it harder for Google to determine which markup to use for rich results.&lt;/p&gt;
&lt;p&gt;Shopify merchants running Yotpo are the most directly affected. Any store with its own FAQ schema on product pages is at risk of duplicate conflicts. Even stores without custom FAQ markup could see validation warnings from Yotpo&apos;s injected blocks if the required &lt;code&gt;mainEntity&lt;/code&gt; property is missing.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Check your product pages first. Open a product page in Chrome DevTools, inspect the rendered DOM, and search for &lt;code&gt;FAQPage&lt;/code&gt; in the JSON-LD blocks. Viewing raw page source will not show JavaScript-injected schema. If you see a block with a URL referencing &lt;code&gt;yotpo.com&lt;/code&gt;, the app is injecting schema you didn&apos;t add.&lt;/p&gt;
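&lt;p&gt;If you would rather script the check than click through DevTools on every template, a headless browser can inspect the rendered DOM for you. A minimal sketch assuming Playwright for Python; the store URL is a placeholder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# List rendered JSON-LD blocks that contain FAQPage markup
# (pip install playwright, then: playwright install chromium).
from playwright.sync_api import sync_playwright

URL = &apos;https://your-store.example.com/products/sample&apos;  # placeholder

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL, wait_until=&apos;networkidle&apos;)
    for block in page.query_selector_all(&apos;script[type=&quot;application/ld+json&quot;]&apos;):
        text = block.inner_text()
        if &apos;FAQPage&apos; in text:
            source = &apos;yotpo&apos; if &apos;yotpo.com&apos; in text else &apos;other&apos;
            print(f&apos;FAQPage block found ({source}), {len(text)} chars&apos;)
    browser.close()
&lt;/code&gt;&lt;/pre&gt;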
&lt;p&gt;Run those pages through Google&apos;s Rich Results Test to confirm whether validation errors are present.&lt;/p&gt;
&lt;p&gt;The r/TechSEO poster shared a JavaScript fix that strips the Yotpo-injected FAQPage blocks from the DOM. Add the following to your &lt;code&gt;theme.liquid&lt;/code&gt; file before the closing &lt;code&gt;&amp;lt;/body&amp;gt;&lt;/code&gt; tag:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;script&amp;gt;
(function() {
  // Remove any injected JSON-LD block that references both Yotpo and FAQPage.
  function removeYotpoFAQ() {
    document.querySelectorAll(&apos;script[type=&quot;application/ld+json&quot;]&apos;).forEach(function(el) {
      if (el.innerHTML.includes(&apos;yotpo.com/go&apos;) &amp;amp;&amp;amp; el.innerHTML.includes(&apos;FAQPage&apos;)) {
        el.remove();
      }
    });
  }
  // Run once the DOM is parsed, then again after delays in case Yotpo injects late.
  document.addEventListener(&quot;DOMContentLoaded&quot;, removeYotpoFAQ);
  setTimeout(removeYotpoFAQ, 1000);
  setTimeout(removeYotpoFAQ, 2000);
})();
&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The script runs at DOMContentLoaded and again at 1-second and 2-second delays. The repeated execution accounts for Yotpo potentially injecting its schema after initial page load.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This is a workaround for browsers, not a reliable fix for search crawlers.&lt;/strong&gt; Googlebot&apos;s rendering pipeline separates crawling and rendering, and there is no guarantee the setTimeout callbacks will execute before Googlebot captures the DOM. Contact Yotpo support to disable the FAQPage injection at the source for a permanent fix.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Googlebot may still see the duplicates.&lt;/strong&gt; Google renders JavaScript, but timing matters. If Googlebot captures the DOM before the &lt;code&gt;setTimeout&lt;/code&gt; callbacks fire, the duplicate FAQPage blocks will still be indexed. Server-side removal or disabling the injection through Yotpo&apos;s settings would be more reliable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Other Yotpo schema injections.&lt;/strong&gt; If Yotpo is injecting FAQPage markup without explicit configuration, check whether it&apos;s also adding Review or Product schema that conflicts with your own structured data. Search your page source for all JSON-LD blocks containing &lt;code&gt;yotpo.com&lt;/code&gt; references.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/yotpo-injects-duplicate-faqpage-schema-on-shopify-pages.webp" medium="image" type="image/webp"/></item><item><title>Wildcard DNS lets Googlebot index phantom subdomains as real pages</title><link>https://technicalseonews.com/latest/wildcard-dns-lets-googlebot-index-phantom-subdomains-as-real-pages</link><guid isPermaLink="true">https://technicalseonews.com/latest/wildcard-dns-lets-googlebot-index-phantom-subdomains-as-real-pages</guid><description>Wildcard DNS can cause Googlebot to index phantom subdomains as real pages, wasting crawl budget and creating duplicate content signals that may hurt rankings.</description><pubDate>Fri, 17 Apr 2026 21:14:48 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A &lt;a href=&quot;https://www.reddit.com/r/bigseo/comments/1sg6apc/will_google_replace_not_remove_listings_that_get/&quot;&gt;discussion in r/bigseo&lt;/a&gt; raised questions about how Google handles 301 redirects across a network of dating websites using wildcard DNS. The practitioner described a setup where wildcard DNS records resolve all subdomains to the same server, causing Googlebot to discover and index subdomains that were never intentionally created.&lt;/p&gt;
&lt;p&gt;With wildcard DNS, any subdomain someone types or links to resolves to the server and returns a valid HTTP response. If the server doesn&apos;t distinguish between real and phantom subdomains, Googlebot treats them all as live pages. The result is indexed URLs that nobody built, pointing to duplicate or near-duplicate content.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Wildcard DNS is common in multi-brand or multi-location setups where operators spin up subdomains programmatically. Dating networks, regional directories, and white-label SaaS platforms often use this pattern. The problem surfaces when third parties link to arbitrary subdomains, or when Googlebot constructs URLs from patterns it finds elsewhere.&lt;/p&gt;
&lt;p&gt;Once phantom subdomains get indexed, they compete with real pages for crawl budget. They also create duplicate content signals across what Google sees as separate hosts. Google&apos;s canonicalization process will try to pick a preferred version, but it may not pick the one you want.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls&quot;&gt;Google&apos;s documentation on consolidating duplicate URLs&lt;/a&gt; lists three methods ranked by signal strength:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Redirects (301/302):&lt;/strong&gt; The strongest canonicalization signal. The redirect target is treated as the canonical URL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;rel=&quot;canonical&quot; annotations:&lt;/strong&gt; A strong signal pointing Google to the preferred URL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sitemap inclusion:&lt;/strong&gt; A weaker signal that hints which URLs should be canonical.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Google recommends combining these methods. But with wildcard DNS, the phantom subdomains may not have any canonical signals at all, leaving Google to pick on its own. Per the &lt;a href=&quot;https://www.rfc-editor.org/rfc/rfc7231#section-6.4.2&quot;&gt;HTTP/1.1 specification (RFC 7231)&lt;/a&gt;, a 301 status code indicates a permanent move. Google treats 301s as a strong signal that the target URL should be treated as canonical.&lt;/p&gt;
&lt;p&gt;The practical risk is twofold. &lt;a href=&quot;/latest/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves&quot;&gt;Crawl budget gets spent on URLs&lt;/a&gt; that add no value. And if Google picks a phantom subdomain as canonical over a real page, the real page can lose rankings.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your DNS configuration.&lt;/strong&gt; Check whether your domain uses wildcard DNS records (an asterisk record like &lt;code&gt;*.example.com&lt;/code&gt;). If you don&apos;t need wildcard resolution, remove it. Explicit subdomain records are safer.&lt;/p&gt;
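&lt;p&gt;You can confirm whether a wildcard record exists without opening the DNS console by resolving a subdomain that was never created. A minimal sketch using only the standard library; the domain is a placeholder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# If a random, never-created subdomain resolves, a wildcard DNS record is in play.
import socket
import uuid

DOMAIN = &apos;example.com&apos;  # replace with your domain
phantom = f&apos;{uuid.uuid4().hex[:12]}.{DOMAIN}&apos;

try:
    ip = socket.gethostbyname(phantom)
    print(f&apos;{phantom} resolves to {ip}: wildcard DNS is active&apos;)
except socket.gaierror:
    print(f&apos;{phantom} does not resolve: no wildcard record&apos;)
&lt;/code&gt;&lt;/pre&gt;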
&lt;p&gt;&lt;strong&gt;Return 404 or 410 for unknown subdomains.&lt;/strong&gt; If you need wildcard DNS for legitimate reasons, configure your server to check incoming Host headers against a list of valid subdomains. Return a 404 for anything not on the list.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Set up 301 redirects for known phantoms.&lt;/strong&gt; If phantom subdomains are already indexed, redirect them to the correct canonical URL. A 301 is the strongest signal you can send.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Apache example: redirect unknown subdomains to main domain
RewriteEngine On
RewriteCond %{HTTP_HOST} !^(www|app|blog)\.example\.com$ [NC]
RewriteCond %{HTTP_HOST} \.example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Add rel=&quot;canonical&quot; as a fallback.&lt;/strong&gt; On any subdomain that must remain live, add a &lt;code&gt;&amp;lt;link rel=&quot;canonical&quot;&amp;gt;&lt;/code&gt; tag pointing to the preferred URL. Combine with sitemap inclusion for the strongest signal set.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/gsc-shows-pages-as-indexed-but-google-won-t-serve-them&quot;&gt;Check Google Search Console&lt;/a&gt;.&lt;/strong&gt; Use a Domain property (verified via DNS) to see data across all subdomains. Look at the &quot;Pages&quot; report filtered by &quot;Alternate page with proper canonical tag&quot; and &quot;Duplicate without user-selected canonical.&quot; These statuses can reveal phantom subdomains Google has already found. Standard URL-prefix properties are scoped to a single subdomain and won&apos;t surface phantom subdomains.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Wildcard SSL certificates masking the problem.&lt;/strong&gt; If you have a wildcard TLS certificate (&lt;code&gt;*.example.com&lt;/code&gt;), phantom subdomains will serve over HTTPS without errors. Googlebot won&apos;t hit any certificate warnings, so nothing flags these URLs as suspicious during crawling.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Redirect loops between phantom subdomains.&lt;/strong&gt; If your redirect logic is based on pattern matching rather than an explicit allowlist, two phantom subdomains can redirect to each other. Test redirects with &lt;code&gt;curl -I&lt;/code&gt; before deploying broadly.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/wildcard-dns-lets-googlebot-index-phantom-subdomains-as-real-pages.webp" medium="image" type="image/webp"/></item><item><title>Cross-format structured data conflicts pass validation undetected</title><link>https://technicalseonews.com/latest/cross-format-structured-data-conflicts-pass-validation-undetected</link><guid isPermaLink="true">https://technicalseonews.com/latest/cross-format-structured-data-conflicts-pass-validation-undetected</guid><description>Pages mixing JSON-LD, Microdata, and RDFa formats can pass validation despite conflicting data signals. Google picks one format without telling you which.</description><pubDate>Fri, 17 Apr 2026 21:07:17 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1sn6zkj/structured_data_test_validate_jsonld_microdata/&quot;&gt;post in r/TechSEO&lt;/a&gt; shared a tool from 8gwifi.org that validates JSON-LD, Microdata, RDFa, Open Graph, and Twitter Cards on a single page. The post drew attention to a practical gap: pages that use multiple structured data formats simultaneously can produce conflicting signals, and the standard validation tools don&apos;t flag those conflicts.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://developers.google.com/search/docs/appearance/structured-data&quot;&gt;Google&apos;s own documentation&lt;/a&gt; recommends two tools for structured data testing. The Rich Results Test checks which Google rich results a page can generate. The Schema Markup Validator, which replaced the old Structured Data Testing Tool, validates schema.org markup without Google-specific warnings. Neither tool cross-checks one format against another.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://schema.org/&quot;&gt;Schema.org confirms&lt;/a&gt; that its vocabulary supports three encodings: RDFa, Microdata, and JSON-LD. Over 45 million web domains use schema.org markup. The standard doesn&apos;t prescribe using only one format per page, which means mixing formats is technically valid even when the resulting data conflicts.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Many sites end up with multiple structured data formats by accident. A CMS might inject Microdata into product templates while a plugin adds JSON-LD for the same entity. A theme could embed RDFa attributes on article pages alongside a separately managed JSON-LD block. Open Graph and Twitter Card meta tags add yet another layer of metadata that can contradict the &lt;a href=&quot;/latest/merchant-center-feeds-now-power-organic-and-ai-surfaces&quot;&gt;schema.org markup&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The problem is that each format validates independently. You can run your page through the Rich Results Test and the Schema Markup Validator, get clean results on both, and still have conflicting information across formats. A JSON-LD Product block might declare a price of $49.99 while Microdata in the HTML shows $59.99. Both pass validation. Google has to pick one, and you don&apos;t control which.&lt;/p&gt;
&lt;p&gt;Cross-format conflicts are especially common on sites that have gone through CMS migrations, added third-party review widgets, or layered SEO plugins over existing theme markup. The larger the site, the more likely these silent conflicts exist.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;Check whether your pages output more than one structured data format. View source on key templates and search for &lt;code&gt;application/ld+json&lt;/code&gt; script blocks (JSON-LD), &lt;code&gt;itemscope&lt;/code&gt;/&lt;code&gt;itemprop&lt;/code&gt; attributes (Microdata), and &lt;code&gt;typeof&lt;/code&gt;/&lt;code&gt;property&lt;/code&gt; attributes (RDFa). If you find more than one format describing the same entity, you have a potential conflict.&lt;/p&gt;
&lt;p&gt;Pick one format and stick with it. Google supports all three schema.org encodings, but JSON-LD is the format &lt;a href=&quot;/latest/google-officially-deprecates-faq-rich-results-as-of-may-2026&quot;&gt;Google recommends in its documentation&lt;/a&gt; and is the easiest to manage because it lives in a standalone script block rather than being woven into HTML attributes.&lt;/p&gt;
&lt;p&gt;Audit Open Graph and Twitter Card tags against your schema.org data. Product names, descriptions, images, and prices should match across all metadata formats. Mismatches between &lt;code&gt;og:title&lt;/code&gt; and your JSON-LD &lt;code&gt;name&lt;/code&gt; property won&apos;t break anything, but they create ambiguity about what the page actually represents.&lt;/p&gt;
&lt;p&gt;For large sites, automate the check. Crawl your templates with a tool that extracts all structured data formats per URL. Compare the entities and properties across formats programmatically. Flag any URL where the same property (price, name, rating) appears in multiple formats with different values.&lt;/p&gt;
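&lt;p&gt;One way to script that, as a rough sketch: the extruct library (assuming it fits your stack) pulls JSON-LD, Microdata, and RDFa from a page in a single pass, which makes it easy to see when more than one format describes the same entity type. The URL is a placeholder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Report which structured data formats a URL emits and which types each declares
# (pip install extruct requests).
import extruct
import requests

URL = &apos;https://www.example.com/product/widget&apos;  # placeholder

html = requests.get(URL, timeout=10).text
data = extruct.extract(html, base_url=URL, syntaxes=[&apos;json-ld&apos;, &apos;microdata&apos;, &apos;rdfa&apos;])

for syntax, items in data.items():
    types = []
    for item in items:
        declared = item.get(&apos;@type&apos;) or item.get(&apos;type&apos;)
        if declared:
            types.append(declared if isinstance(declared, str) else str(declared))
    if items:
        print(f&apos;{syntax}: {len(items)} item(s), types: {types}&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From there, flag any URL where more than one syntax returns the same type, then diff the key properties (price, name, rating) across those blocks.&lt;/p&gt;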
&lt;p&gt;Remove redundant Microdata or RDFa if you already have complete JSON-LD coverage. Leftover markup from old themes or plugins is the most common source of these conflicts.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Plugin stacking.&lt;/strong&gt; WordPress sites with multiple SEO or schema plugins can each inject their own JSON-LD blocks. Two JSON-LD blocks for the same entity type on one page is just as problematic as cross-format conflicts. Check for duplicate &lt;code&gt;@type&lt;/code&gt; declarations within the same format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Review widget markup.&lt;/strong&gt; &lt;a href=&quot;/latest/yotpo-injects-duplicate-faqpage-schema-on-shopify-pages&quot;&gt;Third-party review platforms often embed their own Microdata or JSON-LD&lt;/a&gt; for aggregate ratings. If your site also generates review schema through a plugin or CMS feature, you can end up with two competing &lt;code&gt;AggregateRating&lt;/code&gt; objects with different values.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/cross-format-structured-data-conflicts-pass-validation-undetected.webp" medium="image" type="image/webp"/></item><item><title>Indexing API bypasses &apos;Discovered - currently not indexed&apos; queue</title><link>https://technicalseonews.com/latest/indexing-api-bypasses-discovered-currently-not-indexed-queue</link><guid isPermaLink="true">https://technicalseonews.com/latest/indexing-api-bypasses-discovered-currently-not-indexed-queue</guid><description>Indexing API achieved 94% indexation in 48 hours versus 8.4% via sitemap, but bypassing documented restrictions for JobPosting-only pages risks future enforcement.</description><pubDate>Fri, 17 Apr 2026 19:55:40 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A practitioner on &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1so09pd/bypassing_the_discovered_currently_not_indexed/&quot;&gt;r/TechSEO&lt;/a&gt; shared results from a 5,000-page split test comparing the Google Indexing API against standard sitemap submission for bypassing the &quot;Discovered - currently not indexed&quot; queue.&lt;/p&gt;
&lt;p&gt;The user, alexcobasb, split a new programmatic cluster into two equal groups of 2,500 URLs. The control group was submitted via a &lt;a href=&quot;/latest/mueller-doubts-freshness-based-sitemap-splits-speed-crawling&quot;&gt;standard XML sitemap&lt;/a&gt;. The test group was pushed through the Indexing API V3 using a GCP service account. Over seven days, the control group reached 8.4% indexation with slow crawling. The test group hit 94% indexation, with most URLs crawled and indexed within 48 hours.&lt;/p&gt;
&lt;p&gt;The test specifically targeted standard content pages, not job postings or livestream videos.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;That distinction matters because &lt;a href=&quot;https://developers.google.com/search/apis/indexing-api/v3&quot;&gt;Google&apos;s Indexing API documentation&lt;/a&gt; explicitly states the API &quot;can only be used to crawl pages with JobPosting or BroadcastEvent embedded in a VideoObject.&quot; The API was designed for short-lived content types like job listings and live video events.&lt;/p&gt;
&lt;p&gt;The practitioner&apos;s results suggest Google is currently processing Indexing API pings for non-qualifying page types and crawling them anyway. The 94% vs. 8.4% gap is dramatic. For sites stuck in the &quot;&lt;a href=&quot;/latest/noindex-vs-robots-txt-disallow-for-millions-of-stub-pages&quot;&gt;Discovered - currently not indexed&lt;/a&gt;&quot; limbo, especially large programmatic clusters or post-migration pages, the difference is significant.&lt;/p&gt;
&lt;p&gt;However, this is an unsanctioned use of the API. Google could enforce the documented restrictions at any time, rejecting pings for pages that lack JobPosting or BroadcastEvent markup. Practitioners who build workflows around this behavior should treat it as a temporary advantage, not a reliable long-term strategy.&lt;/p&gt;
&lt;p&gt;Sites running large-scale programmatic SEO or recovering from migrations are the most likely beneficiaries. Small sites with a few dozen pages stuck in the queue have less to gain since manual URL inspection in Search Console often handles those cases.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;If you want to test this approach, here is the setup the practitioner described:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a GCP project and enable the API.&lt;/strong&gt; Go to Google Cloud Console, create a new project, search for &quot;Web Search Indexing API,&quot; and enable it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a service account.&lt;/strong&gt; Under IAM &amp;amp; Admin &amp;gt; Service Accounts, create a new account. Copy the generated email address (formatted as &lt;code&gt;name@project.iam.gserviceaccount.com&lt;/code&gt;). Generate a JSON key file via Manage Keys &amp;gt; Add Key &amp;gt; Create New Key (JSON). &lt;a href=&quot;https://cloud.google.com/docs/authentication/getting-started&quot;&gt;Google&apos;s authentication docs&lt;/a&gt; cover service account setup in more detail.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add the service account to Search Console.&lt;/strong&gt; In GSC, go to Settings &amp;gt; Users and permissions. Add the service account email as a user. The practitioner stressed that the permission level must be set to &lt;strong&gt;Owner&lt;/strong&gt;, not &quot;Full.&quot;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Batch-submit URLs.&lt;/strong&gt; The Indexing API supports up to 100 URLs per batch request. For a 5,000-page cluster, that means 50 batch calls.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Before scaling up, start with a small test group of 50–100 URLs. Compare indexation rates against a control group submitted only via sitemap. Monitor GSC&apos;s &lt;a href=&quot;/latest/gsc-coverage-report-contradicts-url-inspection-on-index-status&quot;&gt;crawl stats and index coverage reports&lt;/a&gt; to verify the API pings are triggering crawls.&lt;/p&gt;
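&lt;p&gt;For a test group that size, a publish call per URL is enough; batching only matters once you scale. A minimal sketch assuming the google-api-python-client and google-auth packages and a service account key file, whose path is a placeholder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Submit URLs to the Indexing API with a service account
# (pip install google-api-python-client google-auth).
from google.oauth2 import service_account
from googleapiclient.discovery import build

KEY_FILE = &apos;service-account.json&apos;  # placeholder path to the JSON key
SCOPES = [&apos;https://www.googleapis.com/auth/indexing&apos;]
urls = [
    &apos;https://www.example.com/page-1&apos;,
    &apos;https://www.example.com/page-2&apos;,
]

creds = service_account.Credentials.from_service_account_file(KEY_FILE, scopes=SCOPES)
service = build(&apos;indexing&apos;, &apos;v3&apos;, credentials=creds)

for url in urls:
    body = {&apos;url&apos;: url, &apos;type&apos;: &apos;URL_UPDATED&apos;}
    print(service.urlNotifications().publish(body=body).execute())
&lt;/code&gt;&lt;/pre&gt;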
&lt;p&gt;Keep JobPosting or BroadcastEvent markup considerations in mind. The practitioner&apos;s test succeeded without qualifying markup, but Google&apos;s documentation says it is required. A future enforcement change could invalidate this approach overnight.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Silent failures with wrong permissions.&lt;/strong&gt; The practitioner flagged that setting the service account to &quot;Full&quot; permission in GSC instead of &quot;Owner&quot; causes the API to silently reject requests. You will not get an error. The URLs simply will not be crawled.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;API quota limits.&lt;/strong&gt; Google enforces daily quota limits on Indexing API calls. The default quota may not cover large-scale submissions. Check your GCP project&apos;s quota dashboard before batching thousands of URLs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Policy risk.&lt;/strong&gt; Google&apos;s documentation restricts the API to JobPosting and BroadcastEvent pages. Using it for other content types works today based on one practitioner&apos;s test, but Google could start enforcing the restriction or flag accounts that abuse it.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/indexing-api-bypasses-discovered-currently-not-indexed-queue.webp" medium="image" type="image/webp"/></item><item><title>Pages ranking in Google can be invisible to AI search</title><link>https://technicalseonews.com/latest/pages-ranking-in-google-can-be-invisible-to-ai-search</link><guid isPermaLink="true">https://technicalseonews.com/latest/pages-ranking-in-google-can-be-invisible-to-ai-search</guid><description>Pages ranking in Google can be invisible to AI search due to differences in JavaScript rendering and crawl access. IPullRank&apos;s audit framework identifies the gap.</description><pubDate>Fri, 17 Apr 2026 17:00:03 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;iPullRank published an &lt;a href=&quot;https://ipullrank.com/ai-search-audit&quot;&gt;AI search audit framework&lt;/a&gt; arguing that &lt;a href=&quot;/latest/top-ranking-sites-still-get-skipped-in-ai-search-citations&quot;&gt;pages performing well in traditional Google Search&lt;/a&gt; can be completely invisible to AI-powered search engines. The framework addresses a growing gap between conventional SEO visibility and discoverability across AI search surfaces like ChatGPT, Perplexity, and Google&apos;s AI Overviews.&lt;/p&gt;
&lt;p&gt;The core premise is straightforward: the &lt;a href=&quot;/latest/ai-overview-citations-now-diverge-sharply-from-top-10-rankings&quot;&gt;signals that help a page rank in Google&apos;s organic results&lt;/a&gt; are not the same signals that AI search systems use to retrieve, parse, and cite content. A page can sit comfortably on page one of Google while being effectively absent from AI-generated answers.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Traditional Google Search follows a well-documented &lt;a href=&quot;https://developers.google.com/search/docs/beginner/how-search-works&quot;&gt;crawl-index-serve pipeline&lt;/a&gt;. Googlebot crawls pages, renders JavaScript, indexes the content, and ranks it against queries. AI search engines operate differently. Many rely on their own crawlers or third-party data pipelines that may not render JavaScript at all, may respect different directives in robots.txt, or may extract content in ways that miss key information.&lt;/p&gt;
&lt;p&gt;The practical gap shows up in three areas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rendering&lt;/strong&gt;: Pages built with client-side JavaScript frameworks may render fine for Googlebot, which has a dedicated &lt;a href=&quot;https://web.dev/rendering-on-the-web/&quot;&gt;rendering pipeline&lt;/a&gt;. AI search crawlers often do not execute JavaScript. Content locked behind CSR (client-side rendering) may never reach these systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crawl access&lt;/strong&gt;: AI crawlers use different user agents than Googlebot. A &lt;a href=&quot;/latest/managed-wordpress-hosts-silently-block-ai-crawlers&quot;&gt;robots.txt file that allows Googlebot but blocks or doesn&apos;t account for crawlers&lt;/a&gt; like GPTBot, PerplexityBot, or Anthropic&apos;s ClaudeBot will prevent AI search engines from accessing the content.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Structured data&lt;/strong&gt;: &lt;a href=&quot;https://schema.org/&quot;&gt;Schema.org markup&lt;/a&gt; helps Google generate rich results, but AI search engines may weight structured data differently or use it as a primary extraction method rather than a supplementary signal. Pages without clear structured data may be harder for AI systems to parse and cite accurately.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Sites that invested heavily in JavaScript-rendered content or that haven&apos;t updated their robots.txt since AI crawlers emerged are most at risk. E-commerce product pages, SaaS documentation, and content-heavy publishers with complex front-end architectures should pay particular attention.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your robots.txt for AI crawlers.&lt;/strong&gt; Check whether your robots.txt explicitly allows or blocks known AI user agents. The main ones to account for include OAI-SearchBot (ChatGPT Search), GPTBot (OpenAI training), Google-Extended (Gemini training data), PerplexityBot, Amazonbot, ClaudeBot, and Bytespider. If you want AI search visibility, the retrieval crawlers (OAI-SearchBot, PerplexityBot, ClaudeBot) need access to your content. Training crawlers like GPTBot and Google-Extended affect whether your content shapes model knowledge but don&apos;t directly control whether you appear in AI search results.&lt;/p&gt;
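&lt;p&gt;Python&apos;s built-in robots.txt parser can run that audit against a sample URL for each agent instead of reading the file by hand. A minimal sketch; the domain and test URL are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check which AI crawlers robots.txt allows to fetch a given URL.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = &apos;https://www.example.com/robots.txt&apos;  # placeholder
TEST_URL = &apos;https://www.example.com/blog/sample-post/&apos;

AGENTS = [&apos;OAI-SearchBot&apos;, &apos;GPTBot&apos;, &apos;Google-Extended&apos;, &apos;PerplexityBot&apos;,
          &apos;Amazonbot&apos;, &apos;ClaudeBot&apos;, &apos;Bytespider&apos;, &apos;Googlebot&apos;]

rp = RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()

for agent in AGENTS:
    verdict = &apos;allowed&apos; if rp.can_fetch(agent, TEST_URL) else &apos;blocked&apos;
    print(f&apos;{agent}: {verdict}&apos;)
&lt;/code&gt;&lt;/pre&gt;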
&lt;p&gt;&lt;strong&gt;Test rendering without JavaScript.&lt;/strong&gt; Disable JavaScript in your browser and check whether your key content is visible. If critical text, product details, or article bodies disappear, AI crawlers that don&apos;t execute JS will see an empty or partial page. Server-side rendering or static rendering solves this problem at the architecture level.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Review your structured data coverage.&lt;/strong&gt; Ensure pages have accurate, complete Schema.org markup that describes the content type, author, dates, and key entities. AI systems that extract structured data as a primary signal will miss pages that rely solely on unstructured HTML.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check AI search surfaces directly.&lt;/strong&gt; Query your target keywords in ChatGPT Search, Perplexity, and Google AI Overviews. Note which competitors appear and whether your content is cited. There is no equivalent of Google Search Console for most AI search engines yet, so manual checks remain necessary.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prioritize by page value.&lt;/strong&gt; Start with your highest-traffic and highest-converting pages. If those pages are invisible to AI search, the revenue impact grows as user behavior shifts toward AI-assisted search.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Blocking AI crawlers unintentionally.&lt;/strong&gt; Some CDNs and bot management tools classify AI crawlers as scrapers and block them at the edge, before they ever reach your server. Check your CDN and WAF logs to confirm AI user agents are getting 200 responses, not 403s.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Assuming Google visibility equals AI visibility.&lt;/strong&gt; A page ranking #1 in Google may not appear in any AI search result. These are separate systems with separate retrieval methods. Treat AI search audits as a distinct workstream from traditional SEO audits.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/pages-ranking-in-google-can-be-invisible-to-ai-search.webp" medium="image" type="image/webp"/></item><item><title>Merchant Center feeds now power organic and AI surfaces</title><link>https://technicalseonews.com/latest/merchant-center-feeds-now-power-organic-and-ai-surfaces</link><guid isPermaLink="true">https://technicalseonews.com/latest/merchant-center-feeds-now-power-organic-and-ai-surfaces</guid><description>Google is using Merchant Center feeds to power organic search, AI Overviews, and YouTube. Feed quality now directly impacts SEO visibility beyond Shopping ads.</description><pubDate>Fri, 17 Apr 2026 16:50:20 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google is reframing Merchant Center product feeds as the foundation for product discovery across its entire platform, not just Shopping ads. &lt;a href=&quot;https://www.searchenginejournal.com/googles-product-feed-strategy-points-to-the-future-of-retail-discovery/572291/&quot;&gt;Search Engine Journal reported&lt;/a&gt; on the shift, citing a recent Google Ads Decoded podcast episode where Nadja Bissinger, General Product Manager of Retail on YouTube, described Merchant Center feeds as the &quot;backbone that powers organic and ads experiences.&quot;&lt;/p&gt;
&lt;p&gt;Bissinger urged merchants to submit the most detailed product data possible to increase discoverability. The podcast discussed product data in connection with free listings, AI-powered search experiences, YouTube formats, Google Lens, virtual try-on, and newer e-commerce surfaces still in development.&lt;/p&gt;
&lt;p&gt;Google shared supporting numbers. People shop across Google more than 1 billion times per day, according to a 2025 retail insights piece cited in the SEJ article. Google Lens now processes more than 20 billion visual searches per month, and 1 in 4 of those searches carry commercial intent.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.google.com/retail/&quot;&gt;Google&apos;s own Retail landing page&lt;/a&gt; reflects the same positioning. Merchant Center is presented as a single entry point for surfacing products across Search, Maps, YouTube, Google Shopping, and Google Images.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;For years, product feed work lived inside paid media teams. If you ran Shopping ads, your feed got attention. If you didn&apos;t, it was an afterthought. Google is now signaling that feed data influences where and how products appear in organic results, AI Overviews, YouTube, Lens, and Maps.&lt;/p&gt;
&lt;p&gt;The practical implication is that feed quality now affects SEO outcomes. Structured product data from Merchant Center can determine whether a product shows up in free listings, visual search results, or AI-generated shopping experiences. Retailers who treat feed management as a PPC-only task are leaving organic visibility on the table.&lt;/p&gt;
&lt;p&gt;The shift also matters for how teams are organized. SEJ&apos;s coverage notes that larger organizations may need closer coordination between paid media, SEO, e-commerce, merchandising, and product teams. Feed attributes like images, ratings, promotions, availability, and shipping details all feed into how Google matches products to user intent across surfaces.&lt;/p&gt;
&lt;p&gt;Google has financial motivation here too. More structured product data means more surfaces where Google can insert commerce experiences, paid or free. The company&apos;s recent earnings reports show continued growth in Search and YouTube ad revenue, and richer product data feeds directly into that growth.&lt;/p&gt;
&lt;p&gt;The connection to &lt;a href=&quot;https://schema.org/Product&quot;&gt;schema.org Product markup&lt;/a&gt; is worth noting for SEOs. Properties like &lt;code&gt;aggregateRating&lt;/code&gt;, &lt;code&gt;offers&lt;/code&gt;, &lt;code&gt;gtin13&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt;, and &lt;code&gt;additionalProperty&lt;/code&gt; mirror many of the attributes Google pulls from Merchant Center feeds. Sites already using detailed Product structured data on their pages are partially aligned with what Google wants from feeds, but the Merchant Center feed adds availability, shipping, and promotional data that on-page markup typically doesn&apos;t cover.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Audit your Merchant Center feed for completeness.&lt;/strong&gt; Google is pulling feed data into more surfaces, so gaps in attributes like product images, ratings, availability, shipping details, and promotional pricing reduce your chances of appearing. Fill in every relevant attribute, not just the required ones.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Coordinate feed work across teams.&lt;/strong&gt; If your SEO team doesn&apos;t have visibility into what&apos;s in the Merchant Center feed, fix that. Feed attributes now affect organic surfaces and &lt;a href=&quot;/latest/pages-ranking-in-google-can-be-invisible-to-ai-search&quot;&gt;AI experiences, not just paid Shopping placements&lt;/a&gt;. SEO practitioners should review feed data the same way they review on-page structured data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/latest/schema-markup-does-not-influence-llm-parsing&quot;&gt;Align your on-page Product schema with your feed&lt;/a&gt;.&lt;/strong&gt; Inconsistencies between your schema.org Product markup and your Merchant Center feed data can create mixed signals. Prices, availability, GTINs, and product titles should match across both sources.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Submit feeds even if you don&apos;t run Shopping ads.&lt;/strong&gt; Google&apos;s free listings program uses Merchant Center data to surface products organically. If you sell products and haven&apos;t set up a Merchant Center account, you&apos;re missing a discovery channel that costs nothing to enter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prioritize image quality.&lt;/strong&gt; With 20 billion monthly Lens searches and 25% commercial intent, visual search is a growing product discovery path. Feed images should be high-resolution, show the product clearly, and follow Google&apos;s image requirements.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Feed-only visibility without landing page support.&lt;/strong&gt; Google may surface your product through a feed-driven experience where the user never sees your product page. Make sure your feed data alone tells a complete story, with accurate pricing, descriptions, and imagery, because that data may be the only thing a shopper sees before clicking.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stale promotional data.&lt;/strong&gt; If your feed includes promotions or sale pricing that has expired, Google may suppress listings or show inaccurate information across multiple surfaces. Automate feed updates or set calendar reminders to remove expired promotions promptly.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/merchant-center-feeds-now-power-organic-and-ai-surfaces.webp" medium="image" type="image/webp"/></item><item><title>Cloudflare now enforces canonical tags as 301s for AI crawlers</title><link>https://technicalseonews.com/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers</link><guid isPermaLink="true">https://technicalseonews.com/latest/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers</guid><description>Cloudflare converts canonical tags into 301 redirects for AI crawlers, forcing them to follow preferred URLs instead of treating canonicals as optional hints.</description><pubDate>Fri, 17 Apr 2026 16:46:39 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Cloudflare launched &lt;a href=&quot;https://blog.cloudflare.com/ai-redirects/&quot;&gt;Redirects for AI Training&lt;/a&gt;, a feature that converts &lt;code&gt;&amp;lt;link rel=&quot;canonical&quot;&amp;gt;&lt;/code&gt; tags into HTTP 301 redirects for verified AI training crawlers. The feature is available on all paid Cloudflare plans with a single toggle.&lt;/p&gt;
&lt;p&gt;When a request arrives from a verified AI training crawler (GPTBot, ClaudeBot, Bytespider, and others in Cloudflare&apos;s AI Crawler category), Cloudflare reads the response HTML. If it finds a non-self-referencing canonical tag, it issues a &lt;code&gt;301 Moved Permanently&lt;/code&gt; to the canonical URL instead of serving the page content.&lt;/p&gt;
&lt;p&gt;Human visitors, search engine crawlers, AI assistants, and AI search agents are unaffected. The feature only targets bots in Cloudflare&apos;s verified AI Crawler category, which is distinct from AI Assistant and AI Search categories.&lt;/p&gt;
&lt;p&gt;Cloudflare reported the motivation from its own data: &lt;a href=&quot;/latest/openai-crawl-activity-tripled-after-gpt-5-led-by-search-bot&quot;&gt;AI crawlers visited developers.cloudflare.com 4.8 million times&lt;/a&gt; in 30 days, consuming deprecated documentation at the same rate as current content. Deprecation banners, &lt;code&gt;noindex&lt;/code&gt; meta tags, and canonical tags made &quot;no measurable difference&quot; to crawl behavior.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;Canonical tags are advisory. &lt;a href=&quot;https://www.rfc-editor.org/rfc/rfc6596.html&quot;&gt;RFC 6596&lt;/a&gt; defines the canonical link relation as a way to designate a preferred URL, but nothing forces a crawler to follow it. Search engines treat canonicals as hints and sometimes ignore them. AI training crawlers, according to Cloudflare&apos;s data, ignore them almost entirely.&lt;/p&gt;
&lt;p&gt;The downstream problem is compounding. AI agents draw on trained models, so when crawlers ingest deprecated docs, agents inherit outdated information. Blocking crawlers entirely produces a void with no signal about where current content lives. Canonical-as-redirect gives crawlers a machine-readable instruction they can&apos;t misinterpret.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/latest/managed-wordpress-hosts-silently-block-ai-crawlers&quot;&gt;Cloudflare says canonical tags already exist on 65–69% of web pages&lt;/a&gt;, generated automatically by platforms like WordPress and Contentful. For sites on Cloudflare&apos;s paid plans, the feature requires no code changes. Existing canonical markup becomes enforceable infrastructure.&lt;/p&gt;
&lt;p&gt;The feature also addresses a scaling problem. Single redirect rules can handle a handful of deprecated paths, but every new deprecation requires a rule update. Canonical-based redirects scale automatically because the source of truth is already in the HTML.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;If you&apos;re on a paid Cloudflare plan&lt;/strong&gt;, enable Redirects for AI Training in the dashboard. Before you do, audit your canonical tags. The feature turns every non-self-referencing canonical into a hard 301 for AI crawlers, so incorrect canonicals will redirect crawlers to the wrong page.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit cross-origin canonicals separately.&lt;/strong&gt; Cloudflare excludes cross-origin canonicals by design (tags pointing to a different domain). If you rely on cross-domain canonicals for content syndication, this feature won&apos;t affect those.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Check for redirect loops.&lt;/strong&gt; Self-referencing canonicals are excluded, but chains are possible. If Page A canonicalizes to Page B, and Page B canonicalizes to Page C, an AI crawler hitting Page A will get a 301 to Page B, then another 301 to Page C. Review your canonical chains the same way you&apos;d review redirect chains.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you&apos;re not on Cloudflare&lt;/strong&gt;, the feature doesn&apos;t help you directly, but the concept is worth understanding. You can approximate the behavior with server-side logic that checks user-agent strings against known AI crawler lists and returns 301s based on canonical tag values. The maintenance burden is higher without Cloudflare&apos;s verified bot detection.&lt;/p&gt;
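&lt;p&gt;As a rough illustration of that approach, and not Cloudflare&apos;s implementation, here is a Flask after-request hook that issues the 301 when a request from a listed AI training user agent hits a page with a non-self-referencing canonical. The bot list and the deliberately naive regex are assumptions to adapt, and user-agent matching is spoofable in a way Cloudflare&apos;s verified-bot detection is not:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Approximate canonical-as-301 for AI training crawlers (pip install flask).
import re
from flask import Flask, request, redirect

app = Flask(__name__)

AI_TRAINING_BOTS = (&apos;GPTBot&apos;, &apos;ClaudeBot&apos;, &apos;Bytespider&apos;)  # assumed list, keep it current
# Naive extractor: assumes double-quoted attributes with rel before href.
CANONICAL_RE = re.compile(r&apos;&amp;lt;link[^&amp;gt;]+rel=&quot;canonical&quot;[^&amp;gt;]+href=&quot;([^&quot;]+)&quot;&apos;, re.IGNORECASE)

@app.after_request
def redirect_ai_training_bots(response):
    ua = request.headers.get(&apos;User-Agent&apos;, &apos;&apos;)
    if not any(bot in ua for bot in AI_TRAINING_BOTS):
        return response
    if &apos;text/html&apos; not in (response.content_type or &apos;&apos;):
        return response
    match = CANONICAL_RE.search(response.get_data(as_text=True))
    if match and match.group(1) != request.url:
        return redirect(match.group(1), code=301)
    return response
&lt;/code&gt;&lt;/pre&gt;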
&lt;p&gt;&lt;strong&gt;Check Cloudflare Radar.&lt;/strong&gt; Cloudflare added response status code analysis to &lt;a href=&quot;https://radar.cloudflare.com/ai&quot;&gt;Radar&apos;s AI Insights page&lt;/a&gt;, showing how the web responds to AI crawlers across all Cloudflare traffic. The breakdown covers 2xx, 3xx, 4xx, and 5xx response classes.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Stale or wrong canonicals becoming hard redirects.&lt;/strong&gt; If a canonical tag points to a URL that no longer exists or was set incorrectly during a migration, AI crawlers will now get a 301 to a broken or irrelevant page. On a normal site, a bad canonical is a quiet problem. With this feature enabled, it becomes an active misdirection. Crawl your site and validate canonical targets before turning the feature on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No retroactive fix for already-ingested content.&lt;/strong&gt; Cloudflare is explicit that this feature does not correct training data that AI models have already consumed. It only affects future crawls. If deprecated content is already baked into a model, the 301 won&apos;t undo that. The benefit accrues over time as models retrain on fresher crawls.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/cloudflare-now-enforces-canonical-tags-as-301s-for-ai-crawlers.webp" medium="image" type="image/webp"/></item><item><title>Mueller lists nine reasons Google overrides your rel=canonical</title><link>https://technicalseonews.com/latest/mueller-lists-nine-reasons-google-overrides-your-rel-canonical</link><guid isPermaLink="true">https://technicalseonews.com/latest/mueller-lists-nine-reasons-google-overrides-your-rel-canonical</guid><description>John Mueller listed nine scenarios where Google picks a different canonical than your tag, from JS rendering failures to URL parameter pattern inference.</description><pubDate>Tue, 14 Apr 2026 11:16:37 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google&apos;s John Mueller shared nine distinct reasons why Google chooses one URL as canonical over another, even when site owners have set rel=canonical. The explanation came in &lt;a href=&quot;https://www.searchenginejournal.com/how-google-picks-canonical-urls/571914/&quot;&gt;a Reddit thread covered by Search Engine Journal&lt;/a&gt;, where a user asked Mueller to explain why Google sometimes picks the &quot;wrong&quot; canonical when two pages cover different topics.&lt;/p&gt;
&lt;p&gt;Mueller prefaced the list by noting that no tool exists to tell you why something was considered duplicate. He said practitioners &quot;often get a feel for it&quot; over the years, but acknowledged the reasons aren&apos;t always obvious.&lt;/p&gt;
&lt;p&gt;The nine scenarios he listed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Exact duplicate content.&lt;/strong&gt; The pages are fully identical, leaving no signal to distinguish one URL from another.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Substantial duplication in main content.&lt;/strong&gt; A large portion of the primary content overlaps, such as the same article appearing in multiple places.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Too little unique content relative to template content.&lt;/strong&gt; The page&apos;s unique content is minimal, so repeated elements like navigation and menus dominate. The pages end up looking effectively the same.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;URL parameter patterns inferred as duplicates.&lt;/strong&gt; When parameterized URLs like &lt;code&gt;/page?tmp=1234&lt;/code&gt; and &lt;code&gt;/page?tmp=3458&lt;/code&gt; return the same content, Google may generalize the pattern. Mueller noted this gets tricky with multiple parameters, asking rhetorically whether &lt;code&gt;/page?tmp=1234&amp;amp;city=detroit&lt;/code&gt; would also be treated the same.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mobile version used for comparison.&lt;/strong&gt; Google evaluates the mobile version, not the desktop version. People who manually check on desktop may see different content than what Google is comparing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/latest/blocking-css-and-js-in-robots-txt-breaks-indexing-not-saves&quot;&gt;Googlebot-visible version used for evaluation&lt;/a&gt;.&lt;/strong&gt; Canonical decisions are based on what Googlebot actually receives, not what users see in a browser.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Serving Googlebot alternate or non-content pages.&lt;/strong&gt; Bot challenges, pseudo-error pages, or other generic responses shown to Googlebot may match previously seen content and trigger duplicate treatment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;/latest/google-drops-no-js-testing-advice-from-javascript-seo-docs&quot;&gt;Failure to render JavaScript content&lt;/a&gt;.&lt;/strong&gt; When Google can&apos;t render the page, it falls back to the base HTML shell. If that shell is identical across pages, duplication gets triggered.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ambiguity or misclassification in the system.&lt;/strong&gt; A URL may be treated as duplicate because it appears &quot;misplaced&quot; or because of limitations in how Google&apos;s system interprets similarity.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The rel=canonical attribute is one of the strongest signals site owners can send to Google about their preferred URL. But as &lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls&quot;&gt;Google&apos;s own documentation&lt;/a&gt; makes clear, it&apos;s still a hint, not a directive. Google&apos;s docs list redirects as a strong signal, rel=canonical as a strong signal, and sitemap inclusion as a weak signal, but none of them are required or guaranteed to work.&lt;/p&gt;
&lt;p&gt;Mueller&apos;s list explains the gap between what SEOs expect and what Google does. Several of the nine reasons point to problems that are invisible from a desktop browser. Mobile rendering differences, Googlebot-specific responses, and JavaScript rendering failures all produce a version of the page that the site owner never sees during manual review.&lt;/p&gt;
&lt;p&gt;The URL parameter inference scenario is particularly relevant for e-commerce and large sites with faceted navigation. Google may correctly identify that most parameter variations are duplicates, then incorrectly apply that pattern to a parameter combination that actually produces unique content.&lt;/p&gt;
&lt;p&gt;The &quot;too little unique content relative to template&quot; scenario catches thin pages on sites with heavy global navigation. A short blog post surrounded by a large shared header, footer, and sidebar may look nearly identical to another short post when Google compares them.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;/latest/gsc-coverage-report-contradicts-url-inspection-on-index-status&quot;&gt;Check your canonical overrides in Google Search Console&lt;/a&gt; under the &quot;Pages&quot; report. Filter for &quot;Duplicate, Google chose different canonical than user.&quot; That report shows you where Google is actively disagreeing with your rel=canonical.&lt;/p&gt;
&lt;p&gt;For each flagged URL, work through Mueller&apos;s list as a diagnostic checklist:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Fetch as Googlebot.&lt;/strong&gt; Use the URL Inspection tool to see what Google actually receives. Compare that to what you see in a browser. Look for bot-detection pages, interstitials, or empty content areas.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Check the mobile version.&lt;/strong&gt; Google uses mobile-first indexing. If your mobile page strips content that exists on desktop, Google may see two pages as more similar than they actually are.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test JavaScript rendering.&lt;/strong&gt; View source on the raw HTML Google receives before rendering. If your unique content loads via client-side JavaScript and rendering fails, Google sees only the template shell. The URL Inspection tool&apos;s &quot;View Rendered Page&quot; screenshot can confirm whether content appeared.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit parameter URLs.&lt;/strong&gt; If you use query parameters, check whether Google is collapsing parameter variations that should remain distinct. Look at indexed URLs in Search Console to see which parameter combinations Google has kept.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure content-to-template ratio.&lt;/strong&gt; Pages with very little unique text surrounded by large shared templates are at risk. Adding more substantive unique content is the direct fix.&lt;/li&gt;
&lt;/ol&gt;
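&lt;p&gt;For steps 1 and 3, a scripted spot-check can surface server-side differences (bot detection, interstitials, stripped content) before you even open the URL Inspection tool. A minimal sketch, assuming Node 18+ run as an ES module; the URL and marker string are placeholders, and this only compares the HTML your server returns, not Google&apos;s rendered version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// compare-responses.ts: fetch a page as a desktop browser and as Googlebot
const url = &apos;https://www.example.com/some-page&apos; // placeholder URL
const marker = &apos;unique on-page phrase&apos;          // text that should exist only on this page

const agents = {
  desktop: &apos;Mozilla/5.0 (Windows NT 10.0; Win64; x64)&apos;,
  googlebot: &apos;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&apos;,
}

for (const [name, ua] of Object.entries(agents)) {
  const res = await fetch(url, { headers: { &apos;User-Agent&apos;: ua } })
  const html = await res.text()
  // Large gaps in status, size, or marker presence point to bot-specific responses
  console.log(name, res.status, html.length, &apos;marker present:&apos;, html.includes(marker))
}
&lt;/code&gt;&lt;/pre&gt;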
&lt;p&gt;Stacking multiple canonicalization signals helps. Google&apos;s documentation notes that combining methods like redirects, rel=canonical, and sitemap inclusion increases the chance your preferred canonical is respected.&lt;/p&gt;
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Parameter inference spreading too far.&lt;/strong&gt; Google may correctly learn that one type of parameter creates duplicates, then apply that pattern to a different parameter on the same domain. Sites with mixed parameter types (some cosmetic, some content-changing) are most exposed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bot-detection tools creating accidental duplicates.&lt;/strong&gt; If your security or bot-management layer serves a challenge page or generic response to Googlebot, every affected URL looks identical. The result is mass duplicate classification that has nothing to do with your actual content.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/mueller-lists-nine-reasons-google-overrides-your-rel-canonical.webp" medium="image" type="image/webp"/></item><item><title>Redirect chains that only appear after domain switches go live</title><link>https://technicalseonews.com/latest/redirect-chains-that-only-appear-after-domain-switches-go-live</link><guid isPermaLink="true">https://technicalseonews.com/latest/redirect-chains-that-only-appear-after-domain-switches-go-live</guid><description>Domain-switch catch-all rules create hidden redirect chains when old URLs already redirect. Order 1:1 rules before global rules to prevent multi-hop chains.</description><pubDate>Tue, 14 Apr 2026 03:08:35 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;A &lt;a href=&quot;https://www.reddit.com/r/TechSEO/comments/1skz0yd/website_migrations_how_to_test_redirects_on/&quot;&gt;practitioner post in r/TechSEO&lt;/a&gt; details a migration case study focused on redirect chains that only surface after a domain switch goes live. The core problem: global rules that swap the old domain for the new one can silently create chains when old URLs already carry their own redirect history or return non-200 status codes.&lt;/p&gt;
&lt;p&gt;The post walks through a full redirect-testing methodology for staging environments, covering URL mapping, chain prevention, and a specific trap with domain-switch catch-all rules.&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;/latest/six-weeks-of-307-redirects-split-two-identical-migrations&quot;&gt;Redirect chains during migrations&lt;/a&gt; are nothing new. What makes the domain-switch scenario tricky is that the chains don&apos;t exist on either domain independently. They only appear when global domain-replacement rules interact with existing per-URL redirects.&lt;/p&gt;
&lt;p&gt;Consider this sequence: &lt;code&gt;old.com/page-a&lt;/code&gt; already 301s to &lt;code&gt;old.com/page-b&lt;/code&gt;. You add a catch-all rule that rewrites &lt;code&gt;old.com/*&lt;/code&gt; to &lt;code&gt;new.com/*&lt;/code&gt;. Now a request to &lt;code&gt;old.com/page-a&lt;/code&gt; hits the catch-all first, redirects to &lt;code&gt;new.com/page-a&lt;/code&gt;, which then redirects again to &lt;code&gt;new.com/page-b&lt;/code&gt; (or wherever your 1:1 mapping sends it). That&apos;s a two-hop chain at minimum. If &lt;code&gt;page-b&lt;/code&gt; also had a redirect, you get three hops.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.rfc-editor.org/rfc/rfc7231#section-6.4&quot;&gt;RFC 7231 Section 6.4&lt;/a&gt; defines redirect status codes but sets no hard limit on chain length. In practice, browsers cap redirect follows, and Googlebot follows a limited number of hops before abandoning the chain. Each extra hop also adds latency and dilutes link equity transfer during the critical post-migration window.&lt;/p&gt;
&lt;p&gt;The timing matters too. Old domains accumulate URLs that return 4xx and 5xx errors. A blanket domain-replacement rule turns those dead URLs into live redirects pointing at new-domain paths that may not exist, generating soft 404s or unexpected chains on the new domain.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Map high-value URLs first.&lt;/strong&gt; The post recommends pulling URLs with clicks and impressions from GSC, organic landing pages from GA4, ranked URLs from your rank tracking tool, and backlinked URLs from GSC plus external link databases. Keep these in a separate list so you can verify each one individually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use absolute URLs as redirect targets.&lt;/strong&gt; Every target should include protocol, full domain, and path. Confirm each target returns a 200.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Order your rules carefully.&lt;/strong&gt; The post recommends applying 1:1 redirect rules before global rules (http to https, non-www to www). When 1:1 rules fire first, the request reaches its final destination in one hop. Global rules then only handle URLs not covered by specific mappings.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update legacy redirects before launch.&lt;/strong&gt; If any existing redirect target on the old domain is itself a URL that needs redirecting, replace it. Point every link in the chain directly to the final destination. A &lt;a href=&quot;/latest/migration-traffic-drops-need-pre-defined-thresholds-not-panic&quot;&gt;migration is the right time to flatten chains&lt;/a&gt; that have accumulated over years.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don&apos;t use a blanket domain-replacement rule as a catch-all.&lt;/strong&gt; The post flags this as the biggest trap. Old domains carry dead URLs that return 4xx and 5xx status codes. A global find-and-replace rule converts those into redirects to new-domain paths that may not exist. Instead, let non-mapped old URLs return their original status codes, or redirect them to a relevant category page on the new domain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test on staging before the DNS switch.&lt;/strong&gt; Crawl your staging environment with your redirect rules applied. Check for chains, loops, and targets that don&apos;t return 200s. Tools like Screaming Frog can follow redirect chains and flag anything over one hop.&lt;/p&gt;
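&lt;p&gt;If you want a scripted spot-check alongside the crawler, here is a minimal sketch that follows each hop manually, assuming Node 18+ run as an ES module and a placeholder start URL:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// trace-redirects.ts: print every hop so chains longer than one stand out
async function traceRedirects(startUrl: string, maxHops = 10): Promise&amp;lt;string[]&amp;gt; {
  const hops: string[] = []
  let url = startUrl

  for (let i = 0; i &amp;lt; maxHops; i++) {
    const res = await fetch(url, { redirect: &apos;manual&apos; }) // do not auto-follow
    hops.push(`${res.status} ${url}`)
    const location = res.headers.get(&apos;location&apos;)
    if (!location) break
    url = new URL(location, url).toString() // resolve relative Location headers
  }
  return hops
}

const hops = await traceRedirects(&apos;https://old.example.com/page-a&apos;)
console.log(hops.join(&apos;\n&apos;))
// Two entries = one hop (ideal). Three or more = a chain to flatten before launch.
&lt;/code&gt;&lt;/pre&gt;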
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Rule ordering varies by server.&lt;/strong&gt; Apache processes rules top-down within &lt;code&gt;.htaccess&lt;/code&gt;, but Nginx evaluates &lt;code&gt;location&lt;/code&gt; blocks differently. Cloudflare Page Rules and redirect rules have their own priority logic. The &quot;1:1 before global&quot; advice only works if your server actually respects that order. Test with &lt;code&gt;curl -v&lt;/code&gt; against a handful of URLs to confirm.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Trailing slash and protocol variants.&lt;/strong&gt; Each variant of a URL (http vs https, www vs non-www, trailing slash vs none) needs to resolve in a single hop. If your global normalization rules fire before your 1:1 rules, you can get an extra hop. Map all variants in your crawl test.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/redirect-chains-that-only-appear-after-domain-switches-go-live.webp" medium="image" type="image/webp"/></item><item><title>Google dropped mobile breadcrumbs from SERPs, not their SEO value</title><link>https://technicalseonews.com/latest/google-dropped-mobile-breadcrumbs-from-serps-not-their-seo-value</link><guid isPermaLink="true">https://technicalseonews.com/latest/google-dropped-mobile-breadcrumbs-from-serps-not-their-seo-value</guid><description>Google removed mobile breadcrumb display in January 2025, but BreadcrumbList schema still helps with crawlability and site hierarchy signals.</description><pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Google stopped displaying breadcrumbs in mobile search results in January 2025. The change applies across all languages and regions where Google Search is available. Desktop search results still show breadcrumbs.&lt;/p&gt;
&lt;p&gt;A &lt;a href=&quot;https://sitebulb.com/resources/guides/breadcrumbs-in-seo-what-googles-mobile-change-actually-means&quot;&gt;guide from Sitebulb&lt;/a&gt; explains that this visual removal does not affect the SEO value of breadcrumb markup. The &lt;a href=&quot;https://schema.org/BreadcrumbList&quot;&gt;BreadcrumbList schema type&lt;/a&gt; still functions as documented, and Google&apos;s structured data documentation has not deprecated breadcrumb support.&lt;/p&gt;
&lt;p&gt;Google&apos;s original announcement stated: &quot;Starting today, we&apos;re rolling out a change to no longer show breadcrumbs on mobile search results in all languages and regions where Google Search is available (they continue to appear on desktop search results).&quot;&lt;/p&gt;
&lt;h2&gt;Why it matters&lt;/h2&gt;
&lt;p&gt;The mobile SERP change prompted some SEOs to question whether breadcrumb implementation was still worthwhile. Sitebulb&apos;s guide argues that if you were only implementing breadcrumbs for the mobile SERP display, you were missing the bigger picture.&lt;/p&gt;
&lt;p&gt;Breadcrumbs serve three purposes that have nothing to do with how they appear in search results:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Internal linking structure.&lt;/strong&gt; Each breadcrumb level creates a clickable link back through your site hierarchy. These links pass authority up to category and section pages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crawlability signals.&lt;/strong&gt; Breadcrumb trails give crawlers a clear map of how pages relate to each other within your site&apos;s hierarchy. The trail reinforces parent-child relationships between URLs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User navigation.&lt;/strong&gt; Breadcrumbs let visitors move up your site structure without relying on the back button or returning to the homepage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The BreadcrumbList schema specification on Schema.org defines breadcrumbs as &quot;a chain of linked Web pages, typically described using at least their URL and their name, and typically ending with the current page.&quot; The &lt;code&gt;position&lt;/code&gt; property reconstructs the order of items, with integers starting at 1 for the first (topmost) item.&lt;/p&gt;
&lt;h2&gt;What to do&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep your breadcrumb markup in place.&lt;/strong&gt; Desktop SERPs still display breadcrumbs, and the structured data continues to help Google understand your site hierarchy. Removing it would mean losing both the desktop display and the crawlability benefits.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Audit your breadcrumb hierarchy for accuracy.&lt;/strong&gt; Sitebulb&apos;s guide highlights a common mistake: breadcrumb trails that don&apos;t match the actual URL structure. If your breadcrumb shows &lt;code&gt;Home &amp;gt; Start an LLC &amp;gt; California LLC&lt;/code&gt; but your URL is &lt;code&gt;/random-state-123&lt;/code&gt;, you&apos;re sending conflicting signals about your site structure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pick the right breadcrumb type for your site.&lt;/strong&gt; Hierarchy-based breadcrumbs work well for content sites with clear parent-child page relationships. Ecommerce sites with filtering may need attribute-based breadcrumbs that reflect how users navigate through product filters. Using the wrong type creates confusion rather than clarity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Validate your BreadcrumbList markup.&lt;/strong&gt; Confirm that the &lt;code&gt;position&lt;/code&gt; values are consecutive integers starting at 1, that every &lt;code&gt;itemListElement&lt;/code&gt; includes a name, and that every item except the final one includes a URL. Malformed breadcrumb data won&apos;t help Google or your users.&lt;/p&gt;
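&lt;p&gt;A minimal validation sketch in TypeScript that mirrors those checks, assuming you build the &lt;code&gt;itemListElement&lt;/code&gt; array yourself before serializing it to JSON-LD:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// validate-breadcrumbs.ts: sanity-check a BreadcrumbList before it ships
type ListItem = { &apos;@type&apos;: &apos;ListItem&apos;; position: number; name?: string; item?: string }

function validateBreadcrumbList(items: ListItem[]): string[] {
  const problems: string[] = []
  items.forEach((entry, i) =&amp;gt; {
    if (!Number.isInteger(entry.position) || entry.position !== i + 1) {
      problems.push(`item ${i + 1}: position should be the integer ${i + 1}`)
    }
    if (!entry.name) problems.push(`item ${i + 1}: missing name`)
    if (!entry.item &amp;amp;&amp;amp; i &amp;lt; items.length - 1) {
      problems.push(`item ${i + 1}: missing URL (only the final item may omit it)`)
    }
  })
  return problems
}

// An empty array back means the trail passes these basic checks
console.log(validateBreadcrumbList([
  { &apos;@type&apos;: &apos;ListItem&apos;, position: 1, name: &apos;Home&apos;, item: &apos;https://www.example.com/&apos; },
  { &apos;@type&apos;: &apos;ListItem&apos;, position: 2, name: &apos;Guides&apos; },
]))
&lt;/code&gt;&lt;/pre&gt;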
&lt;h2&gt;Watch out for&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Breadcrumbs that create orphan paths.&lt;/strong&gt; If a breadcrumb trail links to a category page that doesn&apos;t exist or returns a 404, you&apos;ve broken the navigation chain. Every page in the breadcrumb trail needs to be a real, crawlable URL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mismatched hierarchy and URL structure.&lt;/strong&gt; When your breadcrumb path suggests one site structure but your URLs suggest another, crawlers get mixed signals. The breadcrumb hierarchy should reflect how your site is actually organized, not an aspirational information architecture you haven&apos;t built yet.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/news/2026/04/google-dropped-mobile-breadcrumbs-from-serps-not-their-seo-value.webp" medium="image" type="image/webp"/></item><item><title>Battling Next.js SEO Issues on a Government Jobs Aggregator</title><link>https://technicalseonews.com/case-studies/battling-nextjs-seo-issues</link><guid isPermaLink="true">https://technicalseonews.com/case-studies/battling-nextjs-seo-issues</guid><description>SEO pitfalls of running a 20,000-page government jobs site on Next.js and Vercel, and how to set the stack up from day one.</description><pubDate>Sat, 28 Mar 2026 13:19:49 GMT</pubDate><content:encoded>
&lt;p&gt;Next.js is the default choice for React-based web applications, and Vercel makes deployment effortless. But &quot;effortless&quot; hides a minefield of SEO pitfalls, ones that only surface at scale, across different page types, and under the unforgiving lens of Googlebot&apos;s rendering pipeline. (If you are weighing framework options, see our &lt;a href=&quot;/analysis/astro-vs-nextjs-for-seos&quot;&gt;Astro vs Next.js comparison for SEOs&lt;/a&gt; for a side-by-side breakdown.)&lt;/p&gt;
&lt;p&gt;This case study follows a government jobs aggregator we will call &lt;strong&gt;GovJobsHub&lt;/strong&gt;, a Next.js App Router site on Vercel with roughly 20,000 pages. The site aggregates federal, state, and local government job listings into programmatically generated pages organized by location, category, and agency. It is the kind of site where technical SEO determines whether tens of thousands of pages get indexed or disappear into a crawl budget black hole.&lt;/p&gt;
&lt;p&gt;We will walk through every pitfall we encountered, explain why each one happens in the Next.js + Vercel stack, and show how to configure the stack properly from the start.&lt;/p&gt;
&lt;h2&gt;Understanding the Page Types&lt;/h2&gt;
&lt;p&gt;Before diagnosing problems, you need to understand that different page types on a Next.js site can have completely different rendering behaviors, crawl characteristics, and SEO requirements. This was our most important lesson: &lt;strong&gt;never assume one working page type means they all work.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GovJobsHub has six distinct page types:&lt;/p&gt;
&lt;h3&gt;Job Detail Pages (&lt;code&gt;/jobs/[id]&lt;/code&gt;)&lt;/h3&gt;
&lt;p&gt;Individual job listings. Each has a title, description, salary range, location, agency, and application deadline. These are the most valuable pages for Google Jobs rich results. &lt;strong&gt;Count: ~15,000 pages&lt;/strong&gt;, constantly churning as listings expire and new ones appear.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SEO requirements:&lt;/strong&gt; JobPosting structured data, proper 410 status on expiry, fresh content signals, unique meta descriptions.&lt;/p&gt;
&lt;h3&gt;Location Pages (&lt;code&gt;/jobs/[state]&lt;/code&gt;, &lt;code&gt;/jobs/[state]/[city]&lt;/code&gt;)&lt;/h3&gt;
&lt;p&gt;Aggregation pages listing jobs by geography. State pages show all jobs in that state. City pages narrow further. &lt;strong&gt;Count: ~2,500 pages&lt;/strong&gt; (50 states + major cities).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SEO requirements:&lt;/strong&gt; Unique content beyond the job list itself, BreadcrumbList schema, proper pagination, no thin-content signals.&lt;/p&gt;
&lt;h3&gt;Category Pages (&lt;code&gt;/jobs/category/[slug]&lt;/code&gt;)&lt;/h3&gt;
&lt;p&gt;Jobs grouped by field: IT, healthcare, law enforcement, administration. &lt;strong&gt;Count: ~200 pages.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SEO requirements:&lt;/strong&gt; Similar to location pages. Category descriptions must not be boilerplate.&lt;/p&gt;
&lt;h3&gt;Agency Pages (&lt;code&gt;/jobs/federal/[agency]&lt;/code&gt;)&lt;/h3&gt;
&lt;p&gt;Federal jobs grouped by agency: VA, DOD, USPS, etc. &lt;strong&gt;Count: ~150 pages.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SEO requirements:&lt;/strong&gt; Agency-specific context, not just a filtered job list.&lt;/p&gt;
&lt;h3&gt;Hub Pages (&lt;code&gt;/jobs&lt;/code&gt;, &lt;code&gt;/jobs/federal&lt;/code&gt;, &lt;code&gt;/jobs/remote&lt;/code&gt;)&lt;/h3&gt;
&lt;p&gt;Top-level entry points that aggregate across all jobs or major segments. &lt;strong&gt;Count: ~10 pages.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SEO requirements:&lt;/strong&gt; Strong internal linking, SSR mandatory, canonical management for pagination.&lt;/p&gt;
&lt;h3&gt;Static Pages (homepage, &lt;code&gt;/about&lt;/code&gt;, &lt;code&gt;/faq&lt;/code&gt;, &lt;code&gt;/checker&lt;/code&gt;)&lt;/h3&gt;
&lt;p&gt;Marketing and utility pages. &lt;strong&gt;Count: ~10 pages.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SEO requirements:&lt;/strong&gt; Standard on-page SEO, FAQPage schema where applicable.&lt;/p&gt;
&lt;p&gt;Each of these page types had different problems. That is the nature of Next.js at scale: the framework&apos;s flexibility means each route can end up with a different rendering strategy, and silent inconsistencies compound.&lt;/p&gt;
&lt;h2&gt;Pitfall 1: The Rendering Gap Across Page Types&lt;/h2&gt;
&lt;p&gt;This was the most damaging issue and the hardest to detect.&lt;/p&gt;
&lt;p&gt;GovJobsHub&apos;s &lt;strong&gt;job detail pages&lt;/strong&gt; were fully server-rendered. You could curl the URL and see complete job descriptions, salary data, and structured markup in the raw HTML. These pages looked great to Googlebot.&lt;/p&gt;
&lt;p&gt;But the &lt;strong&gt;main &lt;code&gt;/jobs&lt;/code&gt; hub page&lt;/strong&gt;, the highest-traffic listing page and the root of the site&apos;s internal link graph, told a different story. The raw HTML contained React Server Component flight data: serialized &lt;code&gt;self.__next_f.push()&lt;/code&gt; arrays instead of actual job cards. The content only appeared after JavaScript execution.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;location pages&lt;/strong&gt; were a mix. State-level pages (&lt;code&gt;/jobs/california&lt;/code&gt;) rendered server-side. But city-level pages (&lt;code&gt;/jobs/california/los-angeles&lt;/code&gt;) used a hybrid approach where the job list was delivered as serialized JSON in RSC payloads rather than rendered HTML.&lt;/p&gt;
&lt;h3&gt;Why This Happens&lt;/h3&gt;
&lt;p&gt;In the Next.js App Router, any component marked with &lt;code&gt;&apos;use client&apos;&lt;/code&gt; renders on the client. If your job listing grid is a Client Component, maybe because it has sorting, filtering, or pagination interactions, the actual job data is not in the initial HTML. The server sends a placeholder and the RSC payload, and the client hydrates it.&lt;/p&gt;
&lt;p&gt;The insidious part: this works perfectly in the browser. You never notice unless you view source or disable JavaScript. As Sam Torres noted in her JavaScript SEO AMA, rendering queues, not crawl budget, are the real bottleneck for JavaScript-heavy sites.&lt;/p&gt;
&lt;h3&gt;How to Detect It&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Fetch raw HTML and check for actual content vs RSC payloads
curl -s https://yoursite.com/jobs | grep -c &quot;self.__next_f.push&quot;
curl -s https://yoursite.com/jobs | grep -c &quot;&amp;lt;article&quot;

# If you see many __next_f.push calls and zero &amp;lt;article&amp;gt; tags,
# the page depends on client-side rendering
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run this check against every page type. Do not assume, verify. For a more structured approach, Sitebulb&apos;s rendering comparison (response vs. render) covers this analysis in depth.&lt;/p&gt;
&lt;h3&gt;How to Fix It&lt;/h3&gt;
&lt;p&gt;Move SEO-critical content out of Client Components. In the App Router, the default is Server Components: content renders on the server unless you explicitly opt out with &lt;code&gt;&apos;use client&apos;&lt;/code&gt;. The fix is architectural:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// BAD: Job list in a Client Component
&apos;use client&apos;
export function JobList({ jobs }) {
  // This content is NOT in initial HTML
  return jobs.map(job =&amp;gt; &amp;lt;JobCard key={job.id} job={job} /&amp;gt;)
}

// GOOD: Server Component with client interactivity separated
// JobList is a Server Component (default)
export function JobList({ jobs }) {
  // This content IS in initial HTML
  return (
    &amp;lt;div&amp;gt;
      {jobs.map(job =&amp;gt; &amp;lt;JobCard key={job.id} job={job} /&amp;gt;)}
      {/* Only the interactive filter is a Client Component */}
      &amp;lt;JobFilter /&amp;gt;
    &amp;lt;/div&amp;gt;
  )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The principle: &lt;strong&gt;render the content on the server, hydrate only the interactivity on the client.&lt;/strong&gt; Every page type that you want indexed should have its primary content in Server Components.&lt;/p&gt;
&lt;h2&gt;Pitfall 2: _rsc Parameter Pollution&lt;/h2&gt;
&lt;p&gt;When a user navigates between pages on a Next.js App Router site, the framework fetches an optimized RSC payload by appending &lt;code&gt;?_rsc=XXXXX&lt;/code&gt; to the URL. This is an internal mechanism; it is not meant to be seen by search engines.&lt;/p&gt;
&lt;p&gt;But Googlebot sees everything. It discovers these &lt;code&gt;_rsc&lt;/code&gt; URLs during JavaScript rendering, follows them, and attempts to index them. The result: thousands of &quot;Duplicate, Google chose different canonical&quot; entries in Search Console.&lt;/p&gt;
&lt;p&gt;GovJobsHub had over 1,300 of these entries within three months of launch.&lt;/p&gt;
&lt;h3&gt;Why It Is Hard to Fix&lt;/h3&gt;
&lt;p&gt;This is a framework-level issue with no clean solution:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;robots.txt &lt;code&gt;Disallow: /*?_rsc=&lt;/code&gt;&lt;/strong&gt;: Google still discovers and reports the URLs. They show as &quot;Indexed, though blocked by robots.txt&quot; instead of disappearing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Middleware redirect&lt;/strong&gt;: Next.js strips &lt;code&gt;_rsc&lt;/code&gt; from the &lt;code&gt;NextRequest&lt;/code&gt; object before middleware processes it. You literally cannot see the parameter in middleware code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;next.config.js&lt;/code&gt; redirects&lt;/strong&gt;: Using &lt;code&gt;has&lt;/code&gt; conditions to redirect &lt;code&gt;_rsc&lt;/code&gt; URLs reduced errors from 1,300 to about 400, but did not eliminate the problem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disabling prefetch&lt;/strong&gt;: Setting &lt;code&gt;prefetch={false}&lt;/code&gt; on all &lt;code&gt;&amp;lt;Link&amp;gt;&lt;/code&gt; components prevents &lt;code&gt;_rsc&lt;/code&gt; requests entirely but sacrifices the performance benefits of prefetching.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Pragmatic Approach&lt;/h3&gt;
&lt;p&gt;There is no silver bullet. The combination that worked best for GovJobsHub:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;robots.txt Disallow&lt;/strong&gt;: blocks most crawling of these URLs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Canonical tags on every page&lt;/strong&gt;: pointing to the clean URL without parameters&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Selective prefetch disabling&lt;/strong&gt;: turn off prefetch on pages with dozens of internal links (like listing pages) where the &lt;code&gt;_rsc&lt;/code&gt; generation is heaviest&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accept the noise&lt;/strong&gt;: some &lt;code&gt;_rsc&lt;/code&gt; entries in Search Console are cosmetic. Focus on whether your clean URLs are indexed correctly, not on eliminating every duplicate report&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# robots.txt
User-agent: *
Disallow: /*?_rsc=
Disallow: /*&amp;amp;_rsc=
&lt;/code&gt;&lt;/pre&gt;
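&lt;p&gt;Items 2 and 3 translate into code roughly as follows; a minimal sketch assuming the App Router metadata API and placeholder URLs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/jobs/page.tsx: self-referencing canonical on the clean URL
export const metadata = {
  alternates: {
    canonical: &apos;https://www.yoursite.com/jobs&apos;,
  },
}

// components/JobCardLink.tsx: disable prefetch where link density is highest,
// since each prefetched navigation is another ?_rsc= request to generate
import Link from &apos;next/link&apos;

export function JobCardLink({ id, title }: { id: string; title: string }) {
  return (
    &amp;lt;Link href={`/jobs/${id}`} prefetch={false}&amp;gt;
      {title}
    &amp;lt;/Link&amp;gt;
  )
}
&lt;/code&gt;&lt;/pre&gt;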
&lt;h3&gt;The Bigger Question&lt;/h3&gt;
&lt;p&gt;This issue is tracked in multiple GitHub discussions with hundreds of participants and no official resolution from the Next.js team. Making matters worse, &lt;a href=&quot;/latest/google-drops-no-js-testing-advice-from-javascript-seo-docs&quot;&gt;Google recently dropped its no-JavaScript testing advice from the JavaScript SEO docs&lt;/a&gt;, leaving practitioners without an official render validation framework. If you are building a site where clean index coverage matters (and at 20,000 pages, it absolutely does), you need to account for this as a known, ongoing maintenance burden.&lt;/p&gt;
&lt;h2&gt;Setting It Up Right: Rendering Strategy Per Page Type&lt;/h2&gt;
&lt;p&gt;The core mistake is treating all pages the same. Each page type on a programmatic site needs its own rendering strategy based on content volume, update frequency, and SEO value.&lt;/p&gt;
&lt;p&gt;Here is what worked for GovJobsHub after the fixes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page Type&lt;/th&gt;
&lt;th&gt;Rendering&lt;/th&gt;
&lt;th&gt;Revalidation&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Job detail (&lt;code&gt;/jobs/[id]&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;ISR&lt;/td&gt;
&lt;td&gt;24 hours&lt;/td&gt;
&lt;td&gt;High volume, moderate churn. Cannot SSG 15K pages at build time.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State pages (&lt;code&gt;/jobs/[state]&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;SSG&lt;/td&gt;
&lt;td&gt;Build time&lt;/td&gt;
&lt;td&gt;50 pages, stable URLs, high SEO value. Pre-build all of them.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;City pages (&lt;code&gt;/jobs/[state]/[city]&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;ISR&lt;/td&gt;
&lt;td&gt;48 hours&lt;/td&gt;
&lt;td&gt;~2,500 pages, moderate churn. Too many for full SSG.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Category pages (&lt;code&gt;/jobs/category/[slug]&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;SSG&lt;/td&gt;
&lt;td&gt;Build time&lt;/td&gt;
&lt;td&gt;~200 pages, stable. Pre-build all.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agency pages (&lt;code&gt;/jobs/federal/[agency]&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;SSG&lt;/td&gt;
&lt;td&gt;Build time&lt;/td&gt;
&lt;td&gt;~150 pages, stable. Pre-build all.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hub pages (&lt;code&gt;/jobs&lt;/code&gt;, &lt;code&gt;/jobs/federal&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;ISR&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;td&gt;High traffic, content changes with each new listing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Static pages&lt;/td&gt;
&lt;td&gt;SSG&lt;/td&gt;
&lt;td&gt;Build time&lt;/td&gt;
&lt;td&gt;Rarely changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;The Decision Framework&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use SSG (&lt;code&gt;generateStaticParams&lt;/code&gt;) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Page count is under 500&lt;/li&gt;
&lt;li&gt;URLs are stable and predictable&lt;/li&gt;
&lt;li&gt;Content changes infrequently (weekly or less)&lt;/li&gt;
&lt;li&gt;Pages are high SEO value (location hubs, category landing pages)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use ISR when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Page count is in the thousands&lt;/li&gt;
&lt;li&gt;Content updates daily but not in real-time&lt;/li&gt;
&lt;li&gt;You need fresh content without full rebuilds&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;revalidate&lt;/code&gt; shorter than your content&apos;s lifespan&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Never use client-side rendering for:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Any page you want indexed&lt;/li&gt;
&lt;li&gt;Any page with structured data&lt;/li&gt;
&lt;li&gt;Any page that is a target for internal linking&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;// Example: generateStaticParams for state pages
// This pre-builds all 50 state pages at build time
export async function generateStaticParams() {
  return US_STATES.map(state =&amp;gt; ({
    state: state.slug,
  }))
}

// Example: ISR for job detail pages
// Revalidates every 24 hours
export const revalidate = 86400
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Pitfall 3: Robots.txt and Meta Robots Contradictions&lt;/h2&gt;
&lt;p&gt;GovJobsHub had a resume checker tool at &lt;code&gt;/checker&lt;/code&gt;. The robots.txt blocked it with &lt;code&gt;Disallow: /checker/&lt;/code&gt;. But the page&apos;s HTML included &lt;code&gt;&amp;lt;meta name=&quot;robots&quot; content=&quot;index, follow&quot;&amp;gt;&lt;/code&gt;. These directives conflict: robots.txt prevents crawling, but the meta tag (which Googlebot never sees because it cannot crawl the page) says to index it.&lt;/p&gt;
&lt;p&gt;This is not just a GovJobsHub problem. It is a pattern on Next.js sites where robots.txt is managed in one file and meta robots are set in page-level metadata, two different systems with no built-in consistency check.&lt;/p&gt;
&lt;h3&gt;Other robots.txt Mistakes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Blocking static assets:&lt;/strong&gt; Several Next.js sites block &lt;code&gt;/_next/static/&lt;/code&gt; in robots.txt, thinking they are hiding implementation details. This prevents Googlebot from loading CSS and JavaScript needed to render pages. Only block &lt;code&gt;/_next/data/&lt;/code&gt; if you want to prevent JSON endpoint crawling.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Missing _rsc blocking:&lt;/strong&gt; As covered above, &lt;code&gt;_rsc&lt;/code&gt; parameters should be disallowed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overly broad API blocking:&lt;/strong&gt; &lt;code&gt;Disallow: /api/&lt;/code&gt; blocks all API routes, but some sites serve structured data or public content through API routes that should be crawlable.&lt;/p&gt;
&lt;h3&gt;Proper robots.txt for Next.js on Vercel&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;User-agent: *
Allow: /_next/static/
Allow: /_next/image/
Disallow: /_next/data/
Disallow: /api/
Disallow: /*?_rsc=
Disallow: /*&amp;amp;_rsc=

Sitemap: https://www.yoursite.com/sitemap.xml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pair this with consistent meta robots in your layout:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/layout.tsx — default for all pages
export const metadata = {
  robots: {
    index: true,
    follow: true,
  },
}

// app/admin/layout.tsx — override for non-public sections
export const metadata = {
  robots: {
    index: false,
    follow: false,
  },
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Pitfall 4: Soft 404s and Wrong Status Codes&lt;/h2&gt;
&lt;p&gt;The SALT.agency study of 50 Next.js sites found that &lt;strong&gt;41 out of 50 failed to return proper 404 status codes&lt;/strong&gt; for non-existent URLs. GovJobsHub was among them initially.&lt;/p&gt;
&lt;p&gt;The problem manifests differently per page type:&lt;/p&gt;
&lt;h3&gt;Job Detail Pages&lt;/h3&gt;
&lt;p&gt;When a job listing expires, what should happen? The page should return &lt;strong&gt;410 Gone&lt;/strong&gt;, telling Google the content existed but has been permanently removed. Instead, GovJobsHub was returning 200 with a &quot;This job is no longer available&quot; message. Google kept these pages in the index with stale JobPosting structured data, wasting crawl budget and showing expired listings in search results. This matters even more than you might expect: Google may skip JavaScript rendering entirely for non-200 pages, so getting the status code right determines whether your error handling is even rendered.&lt;/p&gt;
&lt;h3&gt;Dynamic Route Catchalls&lt;/h3&gt;
&lt;p&gt;Requesting &lt;code&gt;/jobs/not-a-real-state&lt;/code&gt; returned a 200 status code with a generic &quot;No jobs found&quot; page instead of a 404. At scale, this means any URL under &lt;code&gt;/jobs/&lt;/code&gt; appears valid to crawlers, encouraging them to waste budget on non-existent paths.&lt;/p&gt;
&lt;h3&gt;The Fix&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;// app/jobs/[id]/page.tsx
import { notFound } from &apos;next/navigation&apos;

export default async function JobPage({ params }) {
  const job = await getJob(params.id)

  if (!job) {
    notFound() // Returns 404 status code
  }

  if (job.expired) {
    // For expired content, return 410 Gone
    return new Response(null, { status: 410 })
  }

  return &amp;lt;JobDetail job={job} /&amp;gt;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Verify per page type.&lt;/strong&gt; Curl non-existent URLs under each route pattern and check the status code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -o /dev/null -s -w &quot;%{http_code}&quot; https://yoursite.com/jobs/fake-id-12345
curl -o /dev/null -s -w &quot;%{http_code}&quot; https://yoursite.com/jobs/not-a-state
curl -o /dev/null -s -w &quot;%{http_code}&quot; https://yoursite.com/jobs/category/fake
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every one of those should return 404, not 200.&lt;/p&gt;
&lt;h2&gt;Pitfall 5: Missing and Broken Structured Data&lt;/h2&gt;
&lt;p&gt;GovJobsHub&apos;s structured data situation was a mixed bag. Job detail pages had solid JobPosting schema. Everything else was bare.&lt;/p&gt;
&lt;h3&gt;What Was Missing&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page Type&lt;/th&gt;
&lt;th&gt;Had&lt;/th&gt;
&lt;th&gt;Needed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Job detail&lt;/td&gt;
&lt;td&gt;JobPosting&lt;/td&gt;
&lt;td&gt;Already good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Location pages&lt;/td&gt;
&lt;td&gt;Organization only&lt;/td&gt;
&lt;td&gt;JobPosting aggregate, BreadcrumbList&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Category pages&lt;/td&gt;
&lt;td&gt;Organization only&lt;/td&gt;
&lt;td&gt;BreadcrumbList&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hub pages&lt;/td&gt;
&lt;td&gt;Organization, WebSite&lt;/td&gt;
&lt;td&gt;BreadcrumbList&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAQ page&lt;/td&gt;
&lt;td&gt;Nothing&lt;/td&gt;
&lt;td&gt;FAQPage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All pages&lt;/td&gt;
&lt;td&gt;Nothing&lt;/td&gt;
&lt;td&gt;BreadcrumbList&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Next.js-Specific JSON-LD Gotcha&lt;/h3&gt;
&lt;p&gt;In Next.js, you cannot put JSON-LD in the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; the way you might in a traditional HTML site. The JSON-LD &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag must be rendered within a Server Component in the page body:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/jobs/[id]/page.tsx
export default async function JobPage({ params }) {
  const job = await getJob(params.id)

  const jsonLd = {
    &apos;@context&apos;: &apos;https://schema.org&apos;,
    &apos;@type&apos;: &apos;JobPosting&apos;,
    title: job.title,
    description: job.description,
    datePosted: job.postedDate,
    validThrough: job.expiryDate,
    hiringOrganization: {
      &apos;@type&apos;: &apos;Organization&apos;,
      name: job.agency,
    },
    jobLocation: {
      &apos;@type&apos;: &apos;Place&apos;,
      address: {
        &apos;@type&apos;: &apos;PostalAddress&apos;,
        addressLocality: job.city,
        addressRegion: job.state,
        addressCountry: &apos;US&apos;,
      },
    },
  }

  return (
    &amp;lt;&amp;gt;
      &amp;lt;script
        type=&quot;application/ld+json&quot;
        dangerouslySetInnerHTML={{
          // Sanitize to prevent XSS — replace &amp;lt; with unicode escape
          __html: JSON.stringify(jsonLd).replace(/&amp;lt;/g, &apos;\\u003c&apos;),
        }}
      /&amp;gt;
      &amp;lt;JobDetail job={job} /&amp;gt;
    &amp;lt;/&amp;gt;
  )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; The &lt;code&gt;JSON.stringify&lt;/code&gt; XSS sanitization (replacing &lt;code&gt;&amp;lt;&lt;/code&gt; with &lt;code&gt;\u003c&lt;/code&gt;) is not optional. Without it, malicious job descriptions could inject scripts via structured data.&lt;/p&gt;
&lt;h3&gt;BreadcrumbList for Hierarchical Pages&lt;/h3&gt;
&lt;p&gt;Every page with a position in the site hierarchy should have BreadcrumbList schema. For a site with &lt;code&gt;/jobs/california/los-angeles&lt;/code&gt;, that means:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;@context&quot;: &quot;https://schema.org&quot;,
  &quot;@type&quot;: &quot;BreadcrumbList&quot;,
  &quot;itemListElement&quot;: [
    { &quot;@type&quot;: &quot;ListItem&quot;, &quot;position&quot;: 1, &quot;name&quot;: &quot;Jobs&quot;, &quot;item&quot;: &quot;https://www.govJobshub.com/jobs&quot; },
    { &quot;@type&quot;: &quot;ListItem&quot;, &quot;position&quot;: 2, &quot;name&quot;: &quot;California&quot;, &quot;item&quot;: &quot;https://www.govJobshub.com/jobs/california&quot; },
    { &quot;@type&quot;: &quot;ListItem&quot;, &quot;position&quot;: 3, &quot;name&quot;: &quot;Los Angeles&quot; }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Implement this as a reusable Server Component that takes the breadcrumb trail as a prop. Use it on every page type except the homepage.&lt;/p&gt;
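&lt;p&gt;A sketch of what that component can look like, assuming a simple trail prop of names and URLs and reusing the same XSS escaping shown above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// components/BreadcrumbJsonLd.tsx: Server Component (no &apos;use client&apos;)
type Crumb = { name: string; url?: string }

export function BreadcrumbJsonLd({ trail }: { trail: Crumb[] }) {
  const jsonLd = {
    &apos;@context&apos;: &apos;https://schema.org&apos;,
    &apos;@type&apos;: &apos;BreadcrumbList&apos;,
    itemListElement: trail.map((crumb, i) =&amp;gt; ({
      &apos;@type&apos;: &apos;ListItem&apos;,
      position: i + 1,
      name: crumb.name,
      ...(crumb.url ? { item: crumb.url } : {}), // the final item may omit the URL
    })),
  }

  return (
    &amp;lt;script
      type=&quot;application/ld+json&quot;
      dangerouslySetInnerHTML={{
        __html: JSON.stringify(jsonLd).replace(/&amp;lt;/g, &apos;\\u003c&apos;),
      }}
    /&amp;gt;
  )
}
&lt;/code&gt;&lt;/pre&gt;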
&lt;h2&gt;Pitfall 6: Boilerplate Content at Scale&lt;/h2&gt;
&lt;p&gt;Every state page on GovJobsHub had an &quot;About Government Jobs in [State]&quot; section that read almost identically:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;Government jobs in [State] offer competitive salaries, excellent benefits, and job security. Browse our latest listings from federal, state, and local agencies.&quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That sentence appeared, with minor variations, on 50 state pages, hundreds of city pages, and dozens of category pages. At scale, this is a &lt;strong&gt;thin content signal.&lt;/strong&gt; Google&apos;s quality systems look for pages that add substantive, unique value. When the only difference between &lt;code&gt;/jobs/california&lt;/code&gt; and &lt;code&gt;/jobs/texas&lt;/code&gt; is the state name in a template sentence, both pages risk being classified as low-quality.&lt;/p&gt;
&lt;h3&gt;The Fix: Data-Driven Unique Content&lt;/h3&gt;
&lt;p&gt;Replace boilerplate with programmatically generated content that is genuinely unique per page:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Generate unique location context
function getLocationContent(state: string, stats: StateStats) {
  return {
    intro: `${state} has ${stats.activeListings.toLocaleString()} open government positions across ${stats.agencyCount} agencies. The average salary is $${stats.avgSalary.toLocaleString()}.`,
    topAgencies: `The largest employers are ${stats.topAgencies.slice(0, 3).join(&apos;, &apos;)}.`,
    trends: stats.monthOverMonth &amp;gt; 0
      ? `Listings are up ${stats.monthOverMonth}% compared to last month.`
      : `Listings are down ${Math.abs(stats.monthOverMonth)}% compared to last month.`,
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even two or three sentences of unique, data-driven content per page significantly differentiate them. The key is pulling from real data (job counts, salary ranges, top employers, trending categories), not just swapping a place name into a template.&lt;/p&gt;
&lt;h3&gt;Category and Agency Pages&lt;/h3&gt;
&lt;p&gt;The same principle applies. A category page for &quot;IT &amp;amp; Technology&quot; should reference the specific agencies hiring for tech roles, the salary range for that category, and any notable trends. An agency page for the VA should mention its hiring volume, locations, and most common position types.&lt;/p&gt;
&lt;p&gt;This content does not need to be hand-written. It needs to be &lt;strong&gt;data-driven and genuinely different&lt;/strong&gt; per page.&lt;/p&gt;
&lt;h2&gt;Pitfall 7: Page Churn and the Expiring Content Problem&lt;/h2&gt;
&lt;p&gt;A job board is not a blog. Content does not accumulate, it churns. GovJobsHub has roughly 15,000 job detail pages at any given time, but individual listings have a lifespan of 30 to 90 days. That means &lt;strong&gt;2,000 to 5,000 pages expire every month&lt;/strong&gt; and roughly the same number of new pages appear.&lt;/p&gt;
&lt;p&gt;This creates a cascade of SEO problems that static content sites never face.&lt;/p&gt;
&lt;h3&gt;The Index Bloat Cycle&lt;/h3&gt;
&lt;p&gt;Here is what happens without intervention:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A job listing is posted. ISR generates the page. Googlebot crawls it. It enters the index with JobPosting rich results.&lt;/li&gt;
&lt;li&gt;60 days later, the listing expires. The source data is removed.&lt;/li&gt;
&lt;li&gt;But the ISR cache still serves the old page. Googlebot crawls the cached version and sees active content.&lt;/li&gt;
&lt;li&gt;Eventually ISR revalidates and the page updates, but to what? If the code renders a &quot;This job is no longer available&quot; message with a 200 status code, Google keeps the URL in the index as a soft 404.&lt;/li&gt;
&lt;li&gt;Meanwhile, the expired listing still has JobPosting structured data in Google&apos;s cache, showing in search results with stale salary, location, and apply links.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At scale, this means hundreds of expired listings sitting in Google&apos;s index at any given time, damaging user trust and wasting crawl budget.&lt;/p&gt;
&lt;h3&gt;The Fix: A Page Lifecycle Strategy&lt;/h3&gt;
&lt;p&gt;Every page type with expiring content needs a defined lifecycle:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Active listing&lt;/strong&gt; (200 OK):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Full content, JobPosting schema, in sitemap&lt;/li&gt;
&lt;li&gt;ISR revalidation every 24 hours&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Expired listing&lt;/strong&gt; (410 Gone):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Return 410 status immediately, not a soft 404, not a redirect&lt;/li&gt;
&lt;li&gt;Strip JobPosting schema&lt;/li&gt;
&lt;li&gt;Remove from sitemap on next generation&lt;/li&gt;
&lt;li&gt;Trigger on-demand ISR revalidation so the 410 is served immediately, not after the next cache interval&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;// app/jobs/[id]/page.tsx
import { notFound } from &apos;next/navigation&apos;

export default async function JobPage({ params }) {
  const job = await getJob(params.id)

  if (!job) {
    notFound() // 404 for never-existed
  }

  if (job.status === &apos;expired&apos;) {
    // Return 410 Gone — this listing existed but is permanently removed
    return new Response(&apos;This job listing has been removed.&apos;, {
      status: 410,
      headers: { &apos;Content-Type&apos;: &apos;text/html&apos; },
    })
  }

  return &amp;lt;JobDetail job={job} /&amp;gt;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The Google Indexing API&lt;/h3&gt;
&lt;p&gt;For job boards specifically, Google offers the &lt;a href=&quot;https://developers.google.com/search/apis/indexing-api/v3/quickstart&quot;&gt;Indexing API&lt;/a&gt;, which supports &lt;code&gt;URL_DELETED&lt;/code&gt; notifications. This is dramatically faster than waiting for Googlebot to recrawl: deletions are processed within minutes, not days.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Notify Google when a listing expires
import { google } from &apos;googleapis&apos;

async function notifyGoogleOfRemoval(url: string) {
  const auth = new google.auth.GoogleAuth({
    scopes: [&apos;https://www.googleapis.com/auth/indexing&apos;],
  })
  const client = await auth.getClient()

  await client.request({
    url: &apos;https://indexing.googleapis.com/v3/urlNotifications:publish&apos;,
    method: &apos;POST&apos;,
    data: {
      url,
      type: &apos;URL_DELETED&apos;,
    },
  })
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;GovJobsHub was not using this API at all initially. After implementing it, stale listings were deindexed within hours instead of lingering for weeks.&lt;/p&gt;
&lt;h3&gt;Sitemap Freshness&lt;/h3&gt;
&lt;p&gt;Your sitemap must reflect page removals quickly. If you generate sitemaps at build time, expired listings stay in the sitemap until the next deploy. For a job board, sitemaps should be generated dynamically or regenerated on a schedule shorter than your content&apos;s average lifespan.&lt;/p&gt;
&lt;p&gt;At minimum, run sitemap regeneration daily. Include only active listings. Set &lt;code&gt;lastmod&lt;/code&gt; to the listing&apos;s actual post date, not the sitemap generation time.&lt;/p&gt;
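&lt;p&gt;In the App Router, the sitemap file convention can serve this directly. A minimal sketch, assuming a hypothetical &lt;code&gt;getActiveJobs()&lt;/code&gt; helper that returns only non-expired listings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/sitemap.ts: built from live data, so expired listings drop out quickly
import type { MetadataRoute } from &apos;next&apos;
import { getActiveJobs } from &apos;@/lib/jobs&apos; // hypothetical data helper

export const revalidate = 3600 // regenerate at most hourly, well under listing lifespan

export default async function sitemap(): Promise&amp;lt;MetadataRoute.Sitemap&amp;gt; {
  const jobs = await getActiveJobs()

  return jobs.map(job =&amp;gt; ({
    url: `https://www.govJobshub.com/jobs/${job.id}`,
    lastModified: job.postedDate, // the listing&apos;s actual post date, not generation time
  }))
}
&lt;/code&gt;&lt;/pre&gt;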
&lt;h2&gt;Pitfall 8: Filter Pills, the Hidden Client Rendering Trap&lt;/h2&gt;
&lt;p&gt;GovJobsHub has filter pills on every listing page. Users click pills to filter by job category (IT, Healthcare, Law Enforcement), location type (Remote, On-site, Hybrid), salary range, and agency. These pills are a standard UI pattern, small, rounded chips that toggle on and off.&lt;/p&gt;
&lt;p&gt;They are also an SEO disaster in a typical Next.js implementation.&lt;/p&gt;
&lt;h3&gt;The Rendering Problem&lt;/h3&gt;
&lt;p&gt;Filter pills are interactive. Users click them. They toggle state. They update the job list below. In Next.js, this means they are almost always implemented as Client Components:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Typical implementation — entirely client-rendered
&apos;use client&apos;

export function FilterPills({ categories, activeFilters, onToggle }) {
  return (
    &amp;lt;div className=&quot;flex gap-2 flex-wrap&quot;&amp;gt;
      {categories.map(cat =&amp;gt; (
        &amp;lt;button
          key={cat.slug}
          onClick={() =&amp;gt; onToggle(cat.slug)}
          className={activeFilters.includes(cat.slug) ? &apos;active&apos; : &apos;&apos;}
        &amp;gt;
          {cat.name} ({cat.count})
        &amp;lt;/button&amp;gt;
      ))}
    &amp;lt;/div&amp;gt;
  )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The problem: &lt;strong&gt;none of this renders in the initial HTML.&lt;/strong&gt; Googlebot sees an empty &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; where the pills should be. The category names, the job counts, the entire navigational structure of the filter UI: all of it is invisible to crawlers.&lt;/p&gt;
&lt;p&gt;This matters because those pill labels are often keyword-rich terms that help search engines understand the page&apos;s topic. &quot;IT &amp;amp; Technology (342 jobs)&quot; is a strong relevance signal for a listing page. When it is client-rendered, that signal disappears.&lt;/p&gt;
&lt;h3&gt;The URL Parameter Problem&lt;/h3&gt;
&lt;p&gt;Clicking pills typically updates the URL: &lt;code&gt;/jobs?category=IT&amp;amp;location=remote&lt;/code&gt;. Each combination is a unique URL that Googlebot can discover and attempt to index. With 20 categories, 3 location types, and 5 salary ranges, that is potentially hundreds of filtered URL variations per listing page, most of which contain duplicate or near-duplicate content.&lt;/p&gt;
&lt;p&gt;GovJobsHub had over 800 filtered URL variations discovered in Search Console, each generating a &quot;Duplicate, Google chose different canonical&quot; warning.&lt;/p&gt;
&lt;h3&gt;The Fix: Server-Rendered Pills with Client Interactivity&lt;/h3&gt;
&lt;p&gt;Separate the rendering from the interaction:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Server Component — renders pill labels and counts in initial HTML
export function FilterPills({ categories, activeFilters }) {
  return (
    &amp;lt;div className=&quot;flex gap-2 flex-wrap&quot;&amp;gt;
      {categories.map(cat =&amp;gt; (
        &amp;lt;PillButton
          key={cat.slug}
          slug={cat.slug}
          label={`${cat.name} (${cat.count})`}
          isActive={activeFilters.includes(cat.slug)}
        /&amp;gt;
      ))}
    &amp;lt;/div&amp;gt;
  )
}

// Client Component — only handles the click interaction
&apos;use client&apos;

import { useRouter } from &apos;next/navigation&apos;

export function PillButton({ slug, label, isActive }) {
  const router = useRouter()

  return (
    &amp;lt;button
      onClick={() =&amp;gt; {
        // Update URL params and re-fetch
        const params = new URLSearchParams(window.location.search)
        if (isActive) {
          params.delete(&apos;category&apos;, slug)
        } else {
          params.append(&apos;category&apos;, slug)
        }
        router.push(`?${params.toString()}`, { scroll: false })
      }}
      className={isActive ? &apos;active&apos; : &apos;&apos;}
    &amp;gt;
      {label}
    &amp;lt;/button&amp;gt;
  )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the pill labels and counts render in the initial HTML (visible to Googlebot), while the click behavior hydrates on the client.&lt;/p&gt;
&lt;h3&gt;Managing Filtered URLs&lt;/h3&gt;
&lt;p&gt;Even with server-rendered pills, the URL parameter problem remains. The fix is canonical management:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// All filtered views canonical to the unfiltered URL
export async function generateMetadata({ searchParams }) {
  const hasFilters = Object.keys(searchParams).some(
    key =&amp;gt; [&apos;category&apos;, &apos;location&apos;, &apos;salary&apos;].includes(key)
  )

  return {
    alternates: {
      canonical: &apos;https://www.yoursite.com/jobs&apos;, // Always clean URL
    },
    // Noindex filtered views to prevent duplicate content
    ...(hasFilters &amp;amp;&amp;amp; {
      robots: { index: false, follow: true },
    }),
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;follow: true&lt;/code&gt; is important: even though the filtered page is noindexed, you want Googlebot to follow the links on it to discover individual job detail pages.&lt;/p&gt;
&lt;h2&gt;Pitfall 9: Core Web Vitals Across Page Types&lt;/h2&gt;
&lt;p&gt;The SALT.agency study of 50 Next.js sites found that only &lt;strong&gt;3 out of 50 passed LCP&lt;/strong&gt; and only &lt;strong&gt;1 out of 50 passed all three Core Web Vitals thresholds&lt;/strong&gt;. GovJobsHub was not an outlier: it failed LCP on listing pages and had INP issues on interactive pages.&lt;/p&gt;
&lt;h3&gt;LCP: The Image Problem&lt;/h3&gt;
&lt;p&gt;Listing pages have hero images and dozens of job cards with agency logos. The default behavior of &lt;code&gt;next/image&lt;/code&gt; is to lazy-load everything. But the hero image is above the fold; it should not be lazy-loaded.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// BAD: Hero image lazy-loads by default
&amp;lt;Image src={heroImage} alt=&quot;...&quot; width={1200} height={600} /&amp;gt;

// GOOD: Hero image preloaded with priority
&amp;lt;Image src={heroImage} alt=&quot;...&quot; width={1200} height={600} priority /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This single prop (&lt;code&gt;priority&lt;/code&gt;) was the difference between a 3.8s and a 2.1s LCP on GovJobsHub&apos;s listing pages.&lt;/p&gt;
&lt;h3&gt;INP: The Hydration Problem&lt;/h3&gt;
&lt;p&gt;Interactive pages, those with search filters, sorting, and pagination, had poor Interaction to Next Paint (INP) scores. The cause: heavy hydration. When the client-side JavaScript boots up and hydrates Server Component output, the main thread is blocked. Any user interaction during hydration (clicking a filter, typing in search) queues behind the hydration work.&lt;/p&gt;
&lt;p&gt;Mitigations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduce Client Component scope&lt;/strong&gt;: hydrate only the interactive parts, not the entire page&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use &lt;code&gt;React.lazy&lt;/code&gt; and dynamic imports&lt;/strong&gt;: defer hydration of below-the-fold interactive components&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid CSS-in-JS&lt;/strong&gt;: libraries like Styled Components inject styles at runtime, causing layout recalculations that block the main thread&lt;/li&gt;
&lt;/ul&gt;
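&lt;p&gt;For the second mitigation, a minimal sketch assuming a hypothetical &lt;code&gt;SalaryCalculator&lt;/code&gt; widget that carries no SEO-critical content:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// components/JobPageExtras.tsx: Client Component wrapper for below-the-fold widgets
&apos;use client&apos;

import dynamic from &apos;next/dynamic&apos;

// ssr: false keeps the widget out of the server render, so its JavaScript
// loads and hydrates only on the client, after the critical content
const SalaryCalculator = dynamic(() =&amp;gt; import(&apos;./SalaryCalculator&apos;), {
  ssr: false,
  loading: () =&amp;gt; &amp;lt;div style={{ minHeight: 320 }} /&amp;gt;, // reserve space to avoid CLS
})

export function JobPageExtras() {
  return &amp;lt;SalaryCalculator /&amp;gt;
}
&lt;/code&gt;&lt;/pre&gt;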
&lt;h3&gt;CLS: The Dynamic Content Problem&lt;/h3&gt;
&lt;p&gt;Location pages that load job counts and statistics asynchronously caused Cumulative Layout Shift. The page renders, then numbers pop in and push content down.&lt;/p&gt;
&lt;p&gt;Fix: &lt;strong&gt;Reserve space for dynamic content&lt;/strong&gt; using CSS &lt;code&gt;min-height&lt;/code&gt; or skeleton placeholders that match the final content dimensions. Better yet, fetch the data on the server so it is in the initial render.&lt;/p&gt;
&lt;h3&gt;Test Per Page Type&lt;/h3&gt;
&lt;p&gt;CWV scores vary dramatically across page types. The homepage might score 95 on Lighthouse while listing pages score 45. Test every template independently:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Test each page type with Lighthouse CLI
lighthouse https://yoursite.com/ --output=json
lighthouse https://yoursite.com/jobs --output=json
lighthouse https://yoursite.com/jobs/california --output=json
lighthouse https://yoursite.com/jobs/12345 --output=json
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Pitfall 10: Vercel-Specific Issues&lt;/h2&gt;
&lt;p&gt;Vercel makes Next.js deployment simple, but its platform constraints create SEO-specific challenges that are not obvious until you hit them.&lt;/p&gt;
&lt;h3&gt;ISR Cache Staleness&lt;/h3&gt;
&lt;p&gt;Vercel&apos;s ISR implementation caches pages on its Edge Network. When &lt;code&gt;revalidate&lt;/code&gt; is set to 86400 (24 hours), the page can serve stale content for up to 24 hours after the source data changes. For a job board, this means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expired job listings still appear in search results with active JobPosting schema&lt;/li&gt;
&lt;li&gt;Google crawls the cached page and sees content that no longer exists&lt;/li&gt;
&lt;li&gt;When the cache finally revalidates, the page updates, but Google may not re-crawl for days&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use on-demand revalidation. When a job listing is removed from the database, call Vercel&apos;s revalidation API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// API route: /api/revalidate
import { revalidatePath } from &apos;next/cache&apos;

export async function POST(request: Request) {
  const { path, secret } = await request.json()

  if (secret !== process.env.REVALIDATION_SECRET) {
    return new Response(&apos;Unauthorized&apos;, { status: 401 })
  }

  await revalidatePath(path)
  return Response.json({ revalidated: true })
}
&lt;/code&gt;&lt;/pre&gt;
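&lt;p&gt;The expiry job then calls this endpoint the moment a listing is removed, so the updated page is served on the next crawl instead of after the cache window. A short usage sketch, reusing the path and secret fields from the route above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Called from the listing-expiry process as soon as the source record changes
async function revalidateExpiredListing(expiredJobId: string) {
  await fetch(&apos;https://www.govJobshub.com/api/revalidate&apos;, {
    method: &apos;POST&apos;,
    headers: { &apos;Content-Type&apos;: &apos;application/json&apos; },
    body: JSON.stringify({
      path: `/jobs/${expiredJobId}`,
      secret: process.env.REVALIDATION_SECRET,
    }),
  })
}
&lt;/code&gt;&lt;/pre&gt;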
&lt;h3&gt;Serverless Function Timeouts&lt;/h3&gt;
&lt;p&gt;Vercel&apos;s default function timeout is 10 seconds on the Hobby plan, 60 seconds on Pro. Pages that query large datasets, like a hub page aggregating 20,000 job listings for sorting and filtering, can time out during SSR.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Pre-compute aggregations. Do not query the full dataset on every request. Build summary data at deploy time or via a scheduled job, and have the SSR page read from the pre-computed summary.&lt;/p&gt;
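&lt;p&gt;One way to structure this, as a sketch: a scheduled job writes a small summary file (or table), and the page reads only that summary at request time. The &lt;code&gt;job-summary.json&lt;/code&gt; path and its shape below are hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/jobs/page.tsx — sketch: read a pre-computed summary instead of scanning 20,000 rows per request
// The summary file is rebuilt by a scheduled job or deploy step; names here are hypothetical
import { promises as fs } from &apos;fs&apos;

type JobSummary = {
  totalJobs: number
  byState: Record&amp;lt;string, number&amp;gt;
}

async function getJobSummary(): Promise&amp;lt;JobSummary&amp;gt; {
  const raw = await fs.readFile(process.cwd() + &apos;/data/job-summary.json&apos;, &apos;utf8&apos;)
  return JSON.parse(raw)
}

export default async function JobsHubPage() {
  const summary = await getJobSummary() // fast path: one small file, no large query
  return &amp;lt;p&amp;gt;{summary.totalJobs.toLocaleString()} open government jobs&amp;lt;/p&amp;gt;
}
&lt;/code&gt;&lt;/pre&gt;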
&lt;h3&gt;Middleware Limitations for SEO&lt;/h3&gt;
&lt;p&gt;Vercel middleware runs at the Edge, which means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No access to Node.js APIs (no &lt;code&gt;fs&lt;/code&gt;, no database drivers)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_rsc&lt;/code&gt; parameters are stripped from the request before middleware sees them&lt;/li&gt;
&lt;li&gt;Response body cannot be modified (only headers and redirects)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you need server-side SEO logic, like conditionally setting &lt;code&gt;X-Robots-Tag&lt;/code&gt; headers based on content state, you need to do it in the route handler or page component, not middleware.&lt;/p&gt;
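&lt;p&gt;A sketch of handling it in the page instead, using the Metadata API&apos;s &lt;code&gt;robots&lt;/code&gt; field (the &lt;code&gt;getJob&lt;/code&gt; lookup is hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/jobs/[id]/page.tsx — sketch: robots directives derived from content state
import type { Metadata } from &apos;next&apos;

// Hypothetical data helper
declare function getJob(id: string): Promise&amp;lt;{ title: string; closesAt: Date } | null&amp;gt;

export async function generateMetadata(
  { params }: { params: { id: string } }
): Promise&amp;lt;Metadata&amp;gt; {
  const job = await getJob(params.id)
  const expired = !job || job.closesAt &amp;lt; new Date()
  return {
    title: job ? job.title : &apos;Job not found&apos;,
    // Expired listings are kept out of the index without touching middleware
    robots: expired ? { index: false, follow: true } : { index: true, follow: true },
  }
}
&lt;/code&gt;&lt;/pre&gt;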
&lt;h3&gt;www vs non-www&lt;/h3&gt;
&lt;p&gt;Vercel does not automatically redirect between www and non-www. Both versions serve content, creating duplicate pages. Configure this in &lt;code&gt;vercel.json&lt;/code&gt; or via middleware:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// middleware.ts — redirect non-www to www
import { NextResponse } from &apos;next/server&apos;
import type { NextRequest } from &apos;next/server&apos;

export function middleware(request: NextRequest) {
  const hostname = request.headers.get(&apos;host&apos;) || &apos;&apos;
  if (hostname === &apos;govjobshub.com&apos;) {
    return NextResponse.redirect(
      new URL(request.url.replace(&apos;govjobshub.com&apos;, &apos;www.govjobshub.com&apos;)),
      301
    )
  }
  return NextResponse.next()
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;internal-linking&quot; label=&quot;Internal linking strategy for programmatic Next.js sites at scale&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Pitfall 11: Internal Linking at Scale&lt;/h2&gt;
&lt;p&gt;With 20,000 pages, internal linking is not something you do manually. It is an architectural decision that determines which pages get crawled, how link equity flows, and which pages rank.&lt;/p&gt;
&lt;h3&gt;The Hub-and-Spoke Problem&lt;/h3&gt;
&lt;p&gt;GovJobsHub&apos;s initial linking structure was flat: the main &lt;code&gt;/jobs&lt;/code&gt; page linked to paginated results, and each job card linked to a detail page. Location pages and category pages existed but were poorly connected to the job detail pages and to each other.&lt;/p&gt;
&lt;p&gt;This created a &lt;strong&gt;shallow hub-and-spoke&lt;/strong&gt; pattern where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Job detail pages were 3+ clicks from the homepage&lt;/li&gt;
&lt;li&gt;Location pages did not link to related category pages&lt;/li&gt;
&lt;li&gt;Category pages did not link to related location pages&lt;/li&gt;
&lt;li&gt;No cross-linking between related geographic areas&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Fix: Programmatic Cross-Linking&lt;/h3&gt;
&lt;p&gt;Build internal links into your templates:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// On a state page (/jobs/california), link to:
// 1. Child city pages
// 2. Related category pages for that state
// 3. Neighboring state pages
// 4. Parent hub page
import Link from &apos;next/link&apos;

function StatePageLinks({ state, topCities, topCategories }) {
  return (
    &amp;lt;&amp;gt;
      &amp;lt;nav aria-label=&quot;Cities in this state&quot;&amp;gt;
        &amp;lt;h2&amp;gt;Top Cities in {state.name}&amp;lt;/h2&amp;gt;
        &amp;lt;ul&amp;gt;
          {topCities.map(city =&amp;gt; (
            &amp;lt;li key={city.slug}&amp;gt;
              &amp;lt;Link href={`/jobs/${state.slug}/${city.slug}`}&amp;gt;
                {city.name} ({city.jobCount} jobs)
              &amp;lt;/Link&amp;gt;
            &amp;lt;/li&amp;gt;
          ))}
        &amp;lt;/ul&amp;gt;
      &amp;lt;/nav&amp;gt;

      &amp;lt;nav aria-label=&quot;Job categories in this state&quot;&amp;gt;
        &amp;lt;h2&amp;gt;Popular Categories in {state.name}&amp;lt;/h2&amp;gt;
        &amp;lt;ul&amp;gt;
          {topCategories.map(cat =&amp;gt; (
            &amp;lt;li key={cat.slug}&amp;gt;
              &amp;lt;Link href={`/jobs/category/${cat.slug}`}&amp;gt;
                {cat.name}
              &amp;lt;/Link&amp;gt;
            &amp;lt;/li&amp;gt;
          ))}
        &amp;lt;/ul&amp;gt;
      &amp;lt;/nav&amp;gt;
    &amp;lt;/&amp;gt;
  )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Crawl Depth Matters&lt;/h3&gt;
&lt;p&gt;The goal: every page should be reachable within 3 clicks from the homepage. For a 20,000-page site, that requires deliberate hub-and-spoke architecture with cross-links between spokes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Homepage&lt;/strong&gt; links to hub pages (jobs, federal, remote) and featured locations/categories. &lt;strong&gt;Hub pages&lt;/strong&gt; link to location and category index pages. &lt;strong&gt;Location and category pages&lt;/strong&gt; link to individual job listings and cross-link to each other. &lt;strong&gt;Job detail pages&lt;/strong&gt; link back to their location and category parents.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;sitemap-strategy&quot; label=&quot;Sitemap configuration for large Next.js sites with multiple page types&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Setting It Up Right: Sitemaps for 20K Pages&lt;/h2&gt;
&lt;p&gt;Next.js&apos;s built-in &lt;code&gt;sitemap.ts&lt;/code&gt; works for small sites. At 20,000 pages, you need a sitemap index that splits pages by type.&lt;/p&gt;
&lt;h3&gt;The Problem&lt;/h3&gt;
&lt;p&gt;A single sitemap file with 20,000 URLs is technically valid (the limit is 50,000), but it is harder to debug and monitor. When Google reports indexing issues, a monolithic sitemap gives you no granularity about which page types are affected.&lt;/p&gt;
&lt;h3&gt;The Fix: Sitemap Index by Page Type&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;// app/sitemap.ts — generates sitemap index
import { MetadataRoute } from &apos;next&apos;

export default function sitemap(): MetadataRoute.Sitemap {
  return [
    // Return sitemap index entries
    // Each points to a type-specific sitemap
  ]
}

// app/sitemaps/jobs/sitemap.ts
// app/sitemaps/locations/sitemap.ts
// app/sitemaps/categories/sitemap.ts
&lt;/code&gt;&lt;/pre&gt;
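&lt;p&gt;As a sketch of one of the type-specific files (the &lt;code&gt;getAllJobs&lt;/code&gt; helper is hypothetical; note that &lt;code&gt;lastModified&lt;/code&gt; comes from the content timestamp, per the point below):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/sitemaps/jobs/sitemap.ts — sketch of a type-specific sitemap
import type { MetadataRoute } from &apos;next&apos;
import { getAllJobs } from &apos;@/lib/jobs&apos; // hypothetical data helper

export default async function sitemap(): Promise&amp;lt;MetadataRoute.Sitemap&amp;gt; {
  const jobs = await getAllJobs()
  return jobs.map((job) =&amp;gt; ({
    url: `https://www.yoursite.com/jobs/${job.slug}`,
    lastModified: job.updatedAt, // content timestamp, not build time
    changeFrequency: &apos;weekly&apos; as const,
    priority: 0.6,
  }))
}
&lt;/code&gt;&lt;/pre&gt;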
&lt;p&gt;Or use the &lt;code&gt;next-sitemap&lt;/code&gt; package, which handles sitemap index generation, splitting, and per-route configuration automatically.&lt;/p&gt;
&lt;h3&gt;Priority and Changefreq per Page Type&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page Type&lt;/th&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Changefreq&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Homepage&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;daily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hub pages&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;daily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State pages&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;daily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Category pages&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;City pages&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;daily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agency pages&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job detail pages&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Static pages&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;monthly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Note: Google has stated it largely ignores &lt;code&gt;priority&lt;/code&gt; and &lt;code&gt;changefreq&lt;/code&gt;, but other search engines (Bing, Yandex) still use them, and they help with debugging.&lt;/p&gt;
&lt;h3&gt;lastmod Must Be Accurate&lt;/h3&gt;
&lt;p&gt;Do not set &lt;code&gt;lastmod&lt;/code&gt; to the current build time for every page. Use the actual content modification date. For ISR pages, this means tracking when the underlying data last changed, not when the cache was last generated.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;canonical-management&quot; label=&quot;Canonical URL management across pagination, parameters, and www/non-www on Next.js&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Setting It Up Right: Canonical URLs&lt;/h2&gt;
&lt;p&gt;Canonical mismanagement is the silent killer of large Next.js sites. GovJobsHub had three distinct canonical problems.&lt;/p&gt;
&lt;h3&gt;Problem 1: Pagination Canonicals&lt;/h3&gt;
&lt;p&gt;Paginated pages (&lt;code&gt;/jobs?page=2&lt;/code&gt;, &lt;code&gt;/jobs?page=3&lt;/code&gt;) should each have a &lt;strong&gt;self-referencing canonical.&lt;/strong&gt; The second page of results is not a duplicate of the first; it is a distinct page with different content. But some Next.js SEO guides incorrectly suggest pointing all paginated pages to page 1.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Correct: self-referencing canonical on paginated pages
export async function generateMetadata({ searchParams }) {
  const page = searchParams.page || &apos;1&apos;
  const canonicalUrl = page === &apos;1&apos;
    ? &apos;https://www.yoursite.com/jobs&apos;
    : `https://www.yoursite.com/jobs?page=${page}`

  return {
    alternates: {
      canonical: canonicalUrl,
    },
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 2: Parameter Pollution&lt;/h3&gt;
&lt;p&gt;Beyond &lt;code&gt;_rsc&lt;/code&gt;, other query parameters can create duplicates: &lt;code&gt;?sort=salary&lt;/code&gt;, &lt;code&gt;?filter=remote&lt;/code&gt;, &lt;code&gt;?q=engineer&lt;/code&gt;. Each parameter combination is a unique URL to Google.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; Pages with sort/filter parameters should point their canonical to the unfiltered version. Search result pages should be noindexed.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Canonical always points to clean URL
export async function generateMetadata({ searchParams }) {
  return {
    alternates: {
      canonical: &apos;https://www.yoursite.com/jobs&apos;,
    },
    // Noindex search results
    ...(searchParams.q &amp;amp;&amp;amp; {
      robots: { index: false, follow: true },
    }),
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 3: Trailing Slashes&lt;/h3&gt;
&lt;p&gt;Next.js uses 308 redirects (not 301) for trailing slash normalization. Pick one format and enforce it in &lt;code&gt;next.config.js&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// next.config.js
module.exports = {
  trailingSlash: false, // /jobs, not /jobs/
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;proper-setup-checklist&quot; label=&quot;Complete checklist for setting up Next.js and Vercel for SEO from day one&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;The Setup Checklist&lt;/h2&gt;
&lt;p&gt;If you are starting a Next.js + Vercel project today and SEO matters, configure these before writing a single page component.&lt;/p&gt;
&lt;h3&gt;1. Rendering Defaults&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;All page content in Server Components by default&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&apos;use client&apos;&lt;/code&gt; only for interactive UI elements (filters, modals, forms)&lt;/li&gt;
&lt;li&gt;Audit with &lt;code&gt;curl&lt;/code&gt; or &lt;code&gt;view-source:&lt;/code&gt; before launch; if content is not in the raw HTML, it is not server-rendered&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Metadata Configuration&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use the Metadata API (&lt;code&gt;metadata&lt;/code&gt; object or &lt;code&gt;generateMetadata&lt;/code&gt; function), not &lt;code&gt;next/head&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Set defaults in &lt;code&gt;app/layout.tsx&lt;/code&gt;, override per route (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Every page gets: title, description, canonical, robots, Open Graph&lt;/li&gt;
&lt;/ul&gt;
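&lt;p&gt;A sketch of the default-plus-override pattern (the domain and copy are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/layout.tsx — site-wide defaults
import type { Metadata } from &apos;next&apos;

export const metadata: Metadata = {
  metadataBase: new URL(&apos;https://www.yoursite.com&apos;),
  title: { default: &apos;GovJobsHub&apos;, template: &apos;%s | GovJobsHub&apos; },
  description: &apos;Government job listings by state, city, and agency.&apos;,
  robots: { index: true, follow: true },
  openGraph: { type: &apos;website&apos;, siteName: &apos;GovJobsHub&apos; },
}

// app/jobs/[state]/page.tsx — per-route override (state slug handling simplified)
export async function generateMetadata(
  { params }: { params: { state: string } }
): Promise&amp;lt;Metadata&amp;gt; {
  return {
    title: `Government Jobs in ${params.state}`,
    description: `Open government positions in ${params.state}, updated daily.`,
    alternates: { canonical: `https://www.yoursite.com/jobs/${params.state}` },
  }
}
&lt;/code&gt;&lt;/pre&gt;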
&lt;h3&gt;3. robots.txt&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Allow &lt;code&gt;/_next/static/&lt;/code&gt; and &lt;code&gt;/_next/image/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Disallow &lt;code&gt;/_next/data/&lt;/code&gt;, &lt;code&gt;/api/&lt;/code&gt;, &lt;code&gt;/*?_rsc=&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Include sitemap reference (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
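&lt;p&gt;A sketch using the App Router convention, mirroring the rules above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/robots.ts — generates /robots.txt
import type { MetadataRoute } from &apos;next&apos;

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: &apos;*&apos;,
        allow: [&apos;/_next/static/&apos;, &apos;/_next/image/&apos;],
        disallow: [&apos;/_next/data/&apos;, &apos;/api/&apos;, &apos;/*?_rsc=&apos;],
      },
    ],
    sitemap: &apos;https://www.yoursite.com/sitemap.xml&apos;,
  }
}
&lt;/code&gt;&lt;/pre&gt;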
&lt;h3&gt;4. Sitemap&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Split by page type for sites over 1,000 pages&lt;/li&gt;
&lt;li&gt;Accurate &lt;code&gt;lastmod&lt;/code&gt; from content timestamps, not build time&lt;/li&gt;
&lt;li&gt;Submit to Search Console immediately&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Structured Data&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;JSON-LD in Server Components, not Client Components (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Sanitize with &lt;code&gt;.replace(/&amp;lt;/g, &apos;\\u003c&apos;)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Match schema type to page type (JobPosting, BreadcrumbList, FAQPage, etc.)&lt;/li&gt;
&lt;/ul&gt;
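&lt;p&gt;A sketch of the pattern in a Server Component (the job fields shown are illustrative, not a complete JobPosting schema):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Server Component — inline JSON-LD with the sanitization step from the list above
type Job = { title: string; description: string; datePosted: string }

export function JobPostingSchema({ job }: { job: Job }) {
  const jsonLd = {
    &apos;@context&apos;: &apos;https://schema.org&apos;,
    &apos;@type&apos;: &apos;JobPosting&apos;,
    title: job.title,
    description: job.description,
    datePosted: job.datePosted,
  }
  return (
    &amp;lt;script
      type=&quot;application/ld+json&quot;
      dangerouslySetInnerHTML={{ __html: JSON.stringify(jsonLd).replace(/&amp;lt;/g, &apos;\\u003c&apos;) }}
    /&amp;gt;
  )
}
&lt;/code&gt;&lt;/pre&gt;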
&lt;h3&gt;6. Status Codes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;notFound()&lt;/code&gt; for missing content (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Return 410 for expired content&lt;/li&gt;
&lt;li&gt;Verify every dynamic route pattern returns 404 for invalid params&lt;/li&gt;
&lt;/ul&gt;
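&lt;p&gt;A sketch of the &lt;code&gt;notFound()&lt;/code&gt; pattern on a dynamic route (the &lt;code&gt;getJob&lt;/code&gt; lookup is hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// app/jobs/[id]/page.tsx — invalid or missing listings return a real 404, not a soft 404
import { notFound } from &apos;next/navigation&apos;

// Hypothetical data helper
declare function getJob(id: string): Promise&amp;lt;{ title: string } | null&amp;gt;

export default async function JobPage({ params }: { params: { id: string } }) {
  const job = await getJob(params.id)
  if (!job) {
    notFound() // renders the nearest not-found.tsx and sends an HTTP 404
  }
  return &amp;lt;h1&amp;gt;{job.title}&amp;lt;/h1&amp;gt;
}
&lt;/code&gt;&lt;/pre&gt;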
&lt;h3&gt;7. Vercel Configuration&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;www/non-www redirect in middleware or &lt;code&gt;vercel.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;On-demand ISR revalidation for content removals&lt;/li&gt;
&lt;li&gt;Function timeout appropriate for your data volume&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;8. Internal Linking&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Programmatic cross-links between page types&lt;/li&gt;
&lt;li&gt;Maximum 3 clicks from homepage to any page&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;&amp;lt;Link&amp;gt;&lt;/code&gt;, never &lt;code&gt;router.push()&lt;/code&gt; for navigation between indexable pages&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;9. Core Web Vitals&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;priority&lt;/code&gt; on above-the-fold &lt;code&gt;next/image&lt;/code&gt; components&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;next/font&lt;/code&gt; for web fonts&lt;/li&gt;
&lt;li&gt;Avoid CSS-in-JS libraries&lt;/li&gt;
&lt;li&gt;Test every page template, not just the homepage&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;10. Monitoring&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Google Search Console coverage report, check weekly for the first 3 months&lt;/li&gt;
&lt;li&gt;Log file analysis if possible, see which pages Googlebot actually crawls&lt;/li&gt;
&lt;li&gt;Automated crawl audits with tools like &lt;a href=&quot;/guides/screaming-frog-mcp&quot;&gt;Screaming Frog (or its MCP server for AI-assisted audits)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;CrUX data for real-user CWV per page type&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;results-and-takeaways&quot; label=&quot;Results of fixing Next.js SEO issues and lessons for any Next.js site at scale&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Results and Takeaways&lt;/h2&gt;
&lt;p&gt;After implementing the fixes described above, GovJobsHub saw measurable improvements over 8 weeks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Indexed pages&lt;/strong&gt; increased from ~4,000 to ~14,000 (of 20,000 total)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;_rsc duplicate entries&lt;/strong&gt; in Search Console dropped from 1,300 to under 200&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Soft 404 errors&lt;/strong&gt; eliminated entirely&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Average LCP&lt;/strong&gt; on listing pages improved from 3.8s to 2.1s&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JobPosting rich results&lt;/strong&gt; started appearing for individual job listings within 3 weeks of schema implementation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What We Learned&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Audit every page type independently.&lt;/strong&gt; The rendering strategy, status codes, structured data, and CWV scores can be completely different across page types on the same Next.js site. A passing grade on the homepage means nothing for your listing pages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Next.js defaults are not SEO defaults.&lt;/strong&gt; The framework does not block you from doing SEO well, but it does not do it for you. Every SEO requirement (rendering strategy, canonical management, structured data, status codes) needs explicit configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. _rsc is a fact of life.&lt;/strong&gt; There is no clean fix. Budget time for managing the noise in Search Console and implement the mitigation stack (robots.txt + canonicals + selective prefetch disabling).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Vercel adds a caching layer you must account for.&lt;/strong&gt; ISR staleness, function timeouts, and middleware limitations are platform constraints that affect SEO. On-demand revalidation is not optional for sites with expiring content.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Programmatic content needs programmatic quality.&lt;/strong&gt; Templated pages at scale require data-driven unique content, not just name-swapped boilerplate. Every page type needs enough unique content to justify its existence in the index.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;6. Internal linking is architecture, not an afterthought.&lt;/strong&gt; At 20,000 pages, you cannot add links manually. Build cross-linking into your templates and keep crawl depth under 3 clicks for every page.&lt;/p&gt;
&lt;p&gt;The Next.js + Vercel stack is powerful, and it can absolutely support large-scale SEO. But it requires deliberate configuration at every layer: rendering, caching, metadata, structured data, and crawl management. The pitfalls are real, well documented, and largely avoidable if you know where to look.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/case-studies/battling-nextjs-seo-issues.webp" medium="image" type="image/webp"/></item><item><title>Astro vs Next.js for SEOs: Which Framework Won&apos;t Sabotage Your Rankings</title><link>https://technicalseonews.com/analysis/astro-vs-nextjs-for-seos</link><guid isPermaLink="true">https://technicalseonews.com/analysis/astro-vs-nextjs-for-seos</guid><description>Astro vs Next.js from an SEO perspective. Real rendering bugs, caching pitfalls, and performance data from sites built on both.</description><pubDate>Sun, 08 Mar 2026 12:54:09 GMT</pubDate><content:encoded>&lt;p&gt;import ToolSection from &apos;../../components/ToolSection.astro&apos;;&lt;/p&gt;
&lt;p&gt;Next.js powers some of the best sites on the web. It also powers some of the most frustrating SEO debugging sessions I&apos;ve had. After building and maintaining production sites on both Next.js and Astro, I have opinions about which framework makes SEO easier and which one makes it harder. This is that comparison, from an SEO practitioner&apos;s perspective.&lt;/p&gt;
&lt;p&gt;The short version: if your site is mostly content, Astro removes entire categories of SEO problems that Next.js creates. If your site is a web application that happens to need SEO, Next.js is still the right tool. The distinction matters.&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;rendering-bugs&quot; label=&quot;Next.js rendering bugs that affect SEO&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;The Next.js Rendering Problem&lt;/h2&gt;
&lt;p&gt;Next.js has five rendering modes: SSG (Static Site Generation), SSR (Server-Side Rendering), ISR (Incremental Static Regeneration), RSC (React Server Components), and client-side hydration. On top of that sit multiple caching layers: the framework&apos;s own Data Cache and Full Route Cache among them, plus Vercel&apos;s CDN edge cache (all six layers are broken down later in this piece).&lt;/p&gt;
&lt;p&gt;Each of these solves a real problem. Together, they create a system where content delivery depends on which cache layer responds, whether ISR has revalidated, whether hydration has completed, and whether the client-side router has taken over navigation. When any of these layers misbehaves, the result is often a page that returns HTTP 200 but serves wrong or missing content. Search engines index these pages. You don&apos;t find out until rankings drop.&lt;/p&gt;
&lt;p&gt;Here are bugs I&apos;ve encountered on production Next.js sites.&lt;/p&gt;
&lt;h3&gt;Empty 200 Pages from ISR&lt;/h3&gt;
&lt;p&gt;ISR revalidates pages in the background. During revalidation, the old cached version is served. If revalidation fails silently (API timeout, build error), the stale version persists indefinitely. Worse, I&apos;ve seen cases where ISR returns an empty shell with a 200 status. Google indexes the empty page, the original content disappears from search results, and the only symptom is a traffic drop days later.&lt;/p&gt;
&lt;p&gt;The fix involves monitoring revalidation failures, but Next.js doesn&apos;t expose these failures by default. You need custom error tracking on revalidation callbacks, which most teams don&apos;t set up.&lt;/p&gt;
&lt;h3&gt;_rsc Parameter Pollution&lt;/h3&gt;
&lt;p&gt;Next.js appends &lt;code&gt;?_rsc={hash}&lt;/code&gt; to URLs for React Server Component payloads. These URLs return &lt;code&gt;text/x-component&lt;/code&gt; content type, not HTML. On one production site, Screaming Frog found roughly 45,000 URLs with &lt;code&gt;_rsc&lt;/code&gt; parameters against only 1,500 actual pages. That&apos;s a 30:1 ratio of junk URLs to real content.&lt;/p&gt;
&lt;p&gt;These URLs waste crawl budget and can confuse search engines if they&apos;re not blocked. The fix is adding &lt;code&gt;Disallow: /*?_rsc=&lt;/code&gt; to robots.txt, but you have to know the problem exists first.&lt;/p&gt;
&lt;h3&gt;Client Components Kill Link Crawlability&lt;/h3&gt;
&lt;p&gt;Google treats links inside &lt;code&gt;&apos;use client&apos;&lt;/code&gt; components as plain text. If your navigation, footer links, or internal linking components are client components, Googlebot may not follow those links during the initial crawl. We discovered this on a production site via Search Console URL Inspection, where pages that should have been crawled through navigation links were showing as &quot;Discovered - currently not indexed.&quot;&lt;/p&gt;
&lt;p&gt;The rule is simple: every component that renders &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; tags must be a server component. But Next.js makes it easy to accidentally promote components to client components through import chains. One &lt;code&gt;&apos;use client&apos;&lt;/code&gt; at the top of a shared utility file can cascade through your entire component tree.&lt;/p&gt;
&lt;h3&gt;loading.tsx Creates JS-Dependent Content&lt;/h3&gt;
&lt;p&gt;Adding a &lt;code&gt;loading.tsx&lt;/code&gt; file to your route creates an implicit Suspense boundary. This means all content in that route streams behind &lt;code&gt;&amp;lt;div hidden&amp;gt;&lt;/code&gt;, invisible until JavaScript executes. Googlebot executes JavaScript, but with delays. Other crawlers (Bing, Yandex) may not render JavaScript at all.&lt;/p&gt;
&lt;p&gt;This pattern also causes duplicate &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; tags (the streaming response sends the title twice) and missing canonicals during the streaming phase. If you&apos;re using &lt;code&gt;loading.tsx&lt;/code&gt; on SEO-critical pages, you&apos;re relying on every crawler to fully render JavaScript before indexing.&lt;/p&gt;
&lt;h3&gt;Soft Navigation Loses Meta Tags&lt;/h3&gt;
&lt;p&gt;Next.js App Router uses client-side navigation (soft navigation) between pages by default. When a user clicks a link, the client-side router fetches the new page&apos;s content and updates the DOM without a full page load. This is fast for users but can cause issues with meta tags.&lt;/p&gt;
&lt;p&gt;If your Open Graph tags, canonical URLs, or structured data are set via &lt;code&gt;generateMetadata()&lt;/code&gt;, they update correctly on soft navigation. But any meta tags set outside of &lt;code&gt;generateMetadata()&lt;/code&gt; (via custom &lt;code&gt;&amp;lt;Head&amp;gt;&lt;/code&gt; components or third-party scripts) may not update. The page looks correct to users, but sharing the URL on social media or having it crawled mid-session can produce wrong metadata.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;astro-model&quot; label=&quot;How Astro eliminates rendering complexity&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;The Astro Alternative&lt;/h2&gt;
&lt;p&gt;Astro has one default rendering mode: static HTML at build time. No JavaScript ships unless you explicitly request it with a &lt;code&gt;client:*&lt;/code&gt; directive on a specific component.&lt;/p&gt;
&lt;p&gt;The HTML that exists at build time is the HTML that gets served. There is no ISR, no hydration race, no cache invalidation timing, no revalidation webhooks to secure. If the MDX file has content, the built HTML has content. If the build succeeds, every page is correct.&lt;/p&gt;
&lt;h3&gt;What Fails in Astro&lt;/h3&gt;
&lt;p&gt;Build failures. That&apos;s essentially it. If your content has a syntax error or your schema validation fails, the build stops and deployment is blocked. This is loud and obvious, the opposite of Next.js&apos;s silent 200-with-wrong-content failures.&lt;/p&gt;
&lt;p&gt;The other failure mode is content staleness: if you commit new content but don&apos;t trigger a rebuild, the site serves the previous version. This is solved by a deploy hook in your CI pipeline. For sites using GitHub Actions, it&apos;s a one-line webhook call after content commits.&lt;/p&gt;
&lt;h3&gt;Zero JavaScript by Default&lt;/h3&gt;
&lt;p&gt;A typical Astro content page sends HTML and CSS. Nothing else. No React runtime (45KB+ gzipped), no hydration logic, no client-side router. The browser parses HTML and renders. That&apos;s it.&lt;/p&gt;
&lt;p&gt;When you do need interactivity, Astro&apos;s &lt;a href=&quot;https://docs.astro.build/en/concepts/islands/&quot;&gt;islands architecture&lt;/a&gt; lets you add React, Vue, Svelte, or any other framework to specific components. A search modal can be a React component with &lt;code&gt;client:visible&lt;/code&gt; (loaded only when scrolled into view). The other 95% of the page remains static HTML.&lt;/p&gt;
&lt;p&gt;This means you get React where you need it and zero JavaScript where you don&apos;t. The framework tax only applies to interactive widgets, not to your entire content library.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;performance&quot; label=&quot;Core Web Vitals comparison between Astro and Next.js&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Core Web Vitals: Head-to-Head&lt;/h2&gt;
&lt;p&gt;Performance isn&apos;t just about speed. Core Web Vitals directly affect rankings. Here&apos;s how the two frameworks compare on the metrics Google measures.&lt;/p&gt;
&lt;h3&gt;Largest Contentful Paint (LCP)&lt;/h3&gt;
&lt;p&gt;Astro wins by default. With no JavaScript blocking the render, the browser can paint content immediately after receiving HTML. There&apos;s no hydration phase where the framework takes over the DOM.&lt;/p&gt;
&lt;p&gt;Next.js can achieve good LCP with static export and proper image optimization, but hydration adds a processing step. If your LCP element is inside a component that hydrates, the paint is delayed until React reconciles the server-rendered HTML with the client-side virtual DOM.&lt;/p&gt;
&lt;h3&gt;Interaction to Next Paint (INP)&lt;/h3&gt;
&lt;p&gt;Astro pages with no JavaScript have an INP of zero. There&apos;s nothing to process on interaction except native browser behavior.&lt;/p&gt;
&lt;p&gt;Next.js ships event handlers, state management, and the React reconciliation cycle. Even on static content pages, the hydrated React runtime processes every click, scroll, and keyboard event through React&apos;s synthetic event system.&lt;/p&gt;
&lt;h3&gt;Cumulative Layout Shift (CLS)&lt;/h3&gt;
&lt;p&gt;Both frameworks can achieve zero CLS with proper image sizing and font loading. But Next.js has an additional CLS risk: hydration. When React hydrates a server-rendered page, it can cause layout shifts if the client-side render differs from the server render. This is rare with careful coding but happens in practice, especially with conditional rendering based on browser APIs like &lt;code&gt;window.innerWidth&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Astro doesn&apos;t hydrate content components, so this category of CLS doesn&apos;t exist.&lt;/p&gt;
&lt;h3&gt;Real Numbers&lt;/h3&gt;
&lt;p&gt;This site (technicalseonews.com) runs on Astro. Lighthouse scores: Performance 100, LCP 1.5s, Total Blocking Time 0ms, CLS 0. These numbers require zero optimization effort. The framework defaults produce them.&lt;/p&gt;
&lt;p&gt;Achieving similar scores on Next.js is possible but requires work: optimizing the React bundle, code-splitting aggressively, lazy-loading non-critical components, and avoiding hydration on content pages. You&apos;re spending engineering time to reach numbers that Astro gives you for free.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;caching&quot; label=&quot;The caching complexity problem in Next.js&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Next.js Caching: Six Layers of Confusion&lt;/h2&gt;
&lt;p&gt;Next.js has a well-documented caching problem. The framework itself has four cache layers; Vercel&apos;s Edge Network and the browser add two more. Understanding which layer is serving (or blocking) your content requires expertise that most teams don&apos;t have.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 1: Request Memoization.&lt;/strong&gt; Deduplicates identical fetch calls within a single render. Mostly harmless, rarely causes SEO issues.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 2: Data Cache.&lt;/strong&gt; Caches the results of &lt;code&gt;fetch()&lt;/code&gt; calls across requests. Persists across deployments unless explicitly invalidated. This is where stale data lives longest.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 3: Full Route Cache.&lt;/strong&gt; Caches the rendered HTML and RSC payload at build time for static routes. ISR pages get re-cached after revalidation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 4: Router Cache.&lt;/strong&gt; Client-side cache that stores previously visited pages for 30 seconds (dynamic) or 5 minutes (static). Users see stale content during this window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 5: Vercel Edge Cache.&lt;/strong&gt; CDN-level caching with edge nodes worldwide. Invalidation propagates asynchronously, so different edge nodes can serve different versions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 6: Browser Cache.&lt;/strong&gt; Standard HTTP caching based on headers. Interacts with all other layers.&lt;/p&gt;
&lt;p&gt;When a page serves stale content, the debugging question is: which of these six layers is responsible? The answer is often &quot;multiple layers, in combination.&quot; This is not a theoretical problem. Teams building content sites on Next.js spend real engineering hours on cache debugging that produces no user-visible value.&lt;/p&gt;
&lt;p&gt;Astro&apos;s caching model: the CDN serves static files. Set your cache headers. Done.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;migration-factors&quot; label=&quot;When to migrate from Next.js to Astro&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Should You Migrate?&lt;/h2&gt;
&lt;p&gt;Not every Next.js site should move to Astro. The decision depends on what your site does and what problems you&apos;re experiencing.&lt;/p&gt;
&lt;h3&gt;Migrate If&lt;/h3&gt;
&lt;p&gt;Your site is primarily content (blog, docs, news, marketing pages) with limited interactivity. You&apos;ve experienced ISR staleness, hydration bugs, or cache confusion. Your team spends time debugging rendering issues instead of building features. Your Core Web Vitals need work and you&apos;d rather fix the root cause than add optimization patches.&lt;/p&gt;
&lt;p&gt;The migration cost is proportional to your interactive surface area. If you have 3 interactive components across 500 content pages, the migration is straightforward: rebuild the content templates in Astro, port the 3 components as islands. Most of your content (MDX or Markdown) transfers with minimal changes.&lt;/p&gt;
&lt;h3&gt;Stay on Next.js If&lt;/h3&gt;
&lt;p&gt;Your site is a web application with complex state management, real-time data, user authentication, and dynamic UI that changes per request. Dashboard-style products, e-commerce with personalized pricing, SaaS platforms with user-specific views. These need Next.js&apos;s server-side rendering and API routes.&lt;/p&gt;
&lt;p&gt;Also stay if your team has deep Next.js expertise and hasn&apos;t experienced the bugs described above. The problems I&apos;ve outlined are real but not universal. Teams that understand Next.js&apos;s rendering modes and caching layers can avoid them. The question is whether the ongoing vigilance is worth the cost.&lt;/p&gt;
&lt;h3&gt;The Hybrid Option&lt;/h3&gt;
&lt;p&gt;Some teams keep Next.js for their web application and use Astro for their marketing site, docs, or blog. This gives each site the framework that fits its content model. The marketing site gets Astro&apos;s simplicity and performance. The application gets Next.js&apos;s dynamic capabilities. Two deploys, but each one is simpler than a combined deployment would be.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;astro-tradeoffs&quot; label=&quot;Astro&apos;s real limitations for SEO&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;What Astro Gets Wrong (Or At Least Harder)&lt;/h2&gt;
&lt;p&gt;Astro isn&apos;t perfect. Here are the genuine trade-offs.&lt;/p&gt;
&lt;h3&gt;No Incremental Updates&lt;/h3&gt;
&lt;p&gt;Every content change requires a full rebuild. For a 500-page site, this takes 10-30 seconds. For a 50,000-page site, build times grow and become a bottleneck. If you need content to appear within seconds of publishing (breaking news, stock prices, live scores), Astro&apos;s rebuild cycle adds latency that ISR can avoid.&lt;/p&gt;
&lt;p&gt;For most content sites, a 30-second deploy pipeline is fast enough. But if you need sub-second content freshness, Astro isn&apos;t the right choice.&lt;/p&gt;
&lt;h3&gt;Smaller Community and Fewer Packages&lt;/h3&gt;
&lt;p&gt;Next.js has more tutorials, Stack Overflow answers, and third-party packages. If you hit an obscure bug, there&apos;s more community help available. Astro&apos;s community is growing quickly (especially after the Cloudflare acquisition in January 2026) but it&apos;s still smaller in absolute terms.&lt;/p&gt;
&lt;p&gt;For a content site, this rarely matters. The core features you need (MDX, Tailwind, search, RSS, sitemap, syntax highlighting) all have first-class Astro support.&lt;/p&gt;
&lt;h3&gt;No API Routes in Static Mode&lt;/h3&gt;
&lt;p&gt;Astro can serve API endpoints, but only in SSR mode. In static mode (the default and recommended mode for content sites), you can&apos;t have server-side API routes. Contact forms, webhooks, and other server-side logic need external services (Supabase, Cloudflare Workers, etc.) or a separate API.&lt;/p&gt;
&lt;p&gt;Next.js API routes are convenient for keeping everything in one framework. Astro&apos;s approach separates concerns more strictly, which is architecturally cleaner but requires more infrastructure decisions.&lt;/p&gt;
&lt;h3&gt;Font Loading&lt;/h3&gt;
&lt;p&gt;There&apos;s no equivalent to &lt;code&gt;next/font&lt;/code&gt;&apos;s automatic font optimization. You self-host fonts with preload links and &lt;code&gt;font-display: swap&lt;/code&gt;, which is what &lt;code&gt;next/font&lt;/code&gt; does under the hood, but you manage it yourself. It&apos;s a few lines of HTML, not a real burden, but it&apos;s less automatic.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;tech-comparison&quot; label=&quot;Side-by-side technical comparison table&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Side-by-Side Comparison&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Next.js&lt;/th&gt;
&lt;th&gt;Astro&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Default rendering&lt;/td&gt;
&lt;td&gt;SSR + client hydration&lt;/td&gt;
&lt;td&gt;Static HTML, zero JS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JavaScript shipped&lt;/td&gt;
&lt;td&gt;React runtime + hydration (~45KB+ gzipped)&lt;/td&gt;
&lt;td&gt;Zero by default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rendering modes&lt;/td&gt;
&lt;td&gt;5 (SSG, SSR, ISR, RSC, client)&lt;/td&gt;
&lt;td&gt;1 (static) + optional SSR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache layers&lt;/td&gt;
&lt;td&gt;6 (request, data, route, router, edge, browser)&lt;/td&gt;
&lt;td&gt;1 (CDN)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactive components&lt;/td&gt;
&lt;td&gt;React (always loaded)&lt;/td&gt;
&lt;td&gt;Islands (React/Vue/Svelte, loaded on demand)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content updates&lt;/td&gt;
&lt;td&gt;ISR (background revalidation)&lt;/td&gt;
&lt;td&gt;Full rebuild + deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build time (500 pages)&lt;/td&gt;
&lt;td&gt;15-45 seconds&lt;/td&gt;
&lt;td&gt;5-15 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lighthouse Performance (content page)&lt;/td&gt;
&lt;td&gt;85-100 (depends on config)&lt;/td&gt;
&lt;td&gt;95-100 (default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SEO debugging complexity&lt;/td&gt;
&lt;td&gt;High (which cache layer? which rendering mode?)&lt;/td&gt;
&lt;td&gt;Low (check the HTML file)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MDX support&lt;/td&gt;
&lt;td&gt;Via @next/mdx or contentlayer&lt;/td&gt;
&lt;td&gt;Built-in Content Collections + Zod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sitemap&lt;/td&gt;
&lt;td&gt;Custom generation or next-sitemap&lt;/td&gt;
&lt;td&gt;@astrojs/sitemap (official)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RSS&lt;/td&gt;
&lt;td&gt;Custom generation&lt;/td&gt;
&lt;td&gt;@astrojs/rss (official)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured data&lt;/td&gt;
&lt;td&gt;Manual JSON-LD&lt;/td&gt;
&lt;td&gt;Manual JSON-LD (same)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;robots.txt&lt;/td&gt;
&lt;td&gt;Custom API route or static file&lt;/td&gt;
&lt;td&gt;Static file or dynamic endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;conclusion&quot; label=&quot;Recommendations and conclusion&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Recommendations&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;If you&apos;re starting a new content site:&lt;/strong&gt; Use Astro. The rendering simplicity, zero-JS default, and built-in content tooling (Collections, MDX, Zod schemas) are purpose-built for content. You&apos;ll spend your time writing content instead of debugging caching.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you&apos;re maintaining a Next.js content site with SEO issues:&lt;/strong&gt; Audit your rendering modes first. Check for &lt;code&gt;_rsc&lt;/code&gt; parameter pollution, test pages with JavaScript disabled, and verify that your ISR revalidation is working. Many Next.js SEO issues can be fixed without migrating; they just require understanding the rendering pipeline.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you&apos;re maintaining a Next.js content site without SEO issues:&lt;/strong&gt; Keep it. Migration has a cost, and if your current setup works, the rendering reliability gains from Astro may not justify the switch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you&apos;re building a web application that needs SEO:&lt;/strong&gt; Use Next.js (or Remix, or SvelteKit). Astro is not a web application framework. Its strength is content, not complex interactive UIs.&lt;/p&gt;
&lt;p&gt;The framework choice matters less than understanding the framework you choose. Both Next.js and Astro can produce excellent SEO outcomes. The difference is how much effort and expertise each one demands to get there. For content sites, Astro demands less.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/analysis/astro-vs-nextjs-for-seos.webp" medium="image" type="image/webp"/></item><item><title>Screaming Frog MCP, AI-Powered SEO Audits</title><link>https://technicalseonews.com/guides/screaming-frog-mcp</link><guid isPermaLink="true">https://technicalseonews.com/guides/screaming-frog-mcp</guid><description>Set up the Screaming Frog MCP server and run SEO audits from your AI assistant. Broken links, redirects, orphan pages, and recurring audit automation.</description><pubDate>Thu, 05 Mar 2026 06:56:53 GMT</pubDate><content:encoded>&lt;p&gt;import ToolSection from &apos;../../components/ToolSection.astro&apos;;&lt;/p&gt;
&lt;p&gt;The Screaming Frog MCP server lets you run technical SEO audits through any MCP-compatible AI assistant. Instead of exporting CSVs and filtering spreadsheets, you ask conversational questions about your crawl data and get prioritized, actionable output. It wraps Screaming Frog&apos;s headless CLI, exposes 8 tools for crawling, exporting, and querying, and works with Claude Desktop, Claude Code, Cursor, Windsurf, Cline, and any other MCP client. This guide covers setup, real audit workflows, multi-site automation, and how to extend the server for your own tools.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/bzsasson/screaming-frog-mcp&quot;&gt;source code is on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;overview&quot; label=&quot;What the Screaming Frog MCP does and doesn&apos;t do&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;What the Screaming Frog MCP Actually Does (and Doesn&apos;t)&lt;/h2&gt;
&lt;p&gt;I built the Screaming Frog MCP server because crawling is just the first step. The real work happens after: exporting data, filtering results, cross-referencing against other sources, and extracting actionable patterns from thousands of URLs. The MCP connects this analysis phase directly to your AI assistant without the export-CSV-open-spreadsheet loop.&lt;/p&gt;
&lt;p&gt;It wraps Screaming Frog&apos;s headless CLI and exposes crawl data through 8 tools. You ask your AI assistant conversational questions about your crawl data. &quot;Show me pages with missing H1 tags.&quot; &quot;Find redirect chains longer than 3 hops.&quot; &quot;Which pages have the highest crawl depth?&quot; The assistant runs the queries, processes results, and hands you actionable output in one session.&lt;/p&gt;
&lt;p&gt;This is not a replacement for the GUI. The GUI is where you configure crawls, set spider options, and tweak advanced settings. The MCP is where you do analysis and automation. Configure in the GUI. Analyze with your AI assistant.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;GUI&lt;/th&gt;
&lt;th&gt;MCP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One-off crawl of a single site&lt;/td&gt;
&lt;td&gt;Faster&lt;/td&gt;
&lt;td&gt;Slower (setup overhead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual inspection of results&lt;/td&gt;
&lt;td&gt;Better (interactive filtering)&lt;/td&gt;
&lt;td&gt;Slower (text-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JavaScript rendering analysis&lt;/td&gt;
&lt;td&gt;Full support&lt;/td&gt;
&lt;td&gt;Limited (default settings only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch crawls across multiple sites&lt;/td&gt;
&lt;td&gt;Manual loop required&lt;/td&gt;
&lt;td&gt;Native (parallel, up to 2 concurrent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simple single-tab queries (missing titles, H1s)&lt;/td&gt;
&lt;td&gt;Faster (few clicks)&lt;/td&gt;
&lt;td&gt;Comparable (no major time savings)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-referencing crawl data with external sources&lt;/td&gt;
&lt;td&gt;Manual import/export&lt;/td&gt;
&lt;td&gt;Native (orchestrated queries)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recurring audits across a portfolio&lt;/td&gt;
&lt;td&gt;Manual each time&lt;/td&gt;
&lt;td&gt;Automated (schedule crawls, export, analyze)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Natural language follow-up questions&lt;/td&gt;
&lt;td&gt;Not applicable&lt;/td&gt;
&lt;td&gt;Native (converse with results)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The MCP shines for batch operations, cross-tool integration, and automation. For a single site where you&apos;ll interact with the GUI once per month, stick with the GUI. For recurring audits, multi-site comparisons, or questions that require threading together data from multiple crawls, the MCP saves the repetitive work.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;setup&quot; label=&quot;Five-minute setup instructions&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Five-Minute Setup&lt;/h2&gt;
&lt;p&gt;Install the MCP using pip or uvx, depending on your preference.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install screaming-frog-mcp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or, if you prefer not to install it permanently:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx screaming-frog-mcp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, verify that Screaming Frog&apos;s CLI is accessible. You&apos;ll need to tell your MCP client where to find it, because the path varies by operating system.&lt;/p&gt;
&lt;p&gt;On macOS, the CLI lives at &lt;code&gt;/Applications/Screaming Frog SEO Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher&lt;/code&gt; by default. Most clients find it automatically.&lt;/p&gt;
&lt;p&gt;On Linux or Windows, set the &lt;code&gt;SF_CLI_PATH&lt;/code&gt; environment variable to the full path of your Screaming Frog CLI executable.&lt;/p&gt;
&lt;p&gt;Linux example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export SF_CLI_PATH=/usr/bin/screamingfrogseospider
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Windows example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set SF_CLI_PATH=C:\Program Files (x86)\Screaming Frog SEO Spider\ScreamingFrogSEOSpiderCli.exe
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Claude Desktop setup.&lt;/strong&gt;
Open &lt;code&gt;~/Library/Application Support/Claude/claude_desktop_config.json&lt;/code&gt; (macOS) or the equivalent on your OS, and add the following.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;mcpServers&quot;: {
    &quot;screaming-frog&quot;: {
      &quot;command&quot;: &quot;uvx&quot;,
      &quot;args&quot;: [&quot;screaming-frog-mcp&quot;],
      &quot;env&quot;: {
        &quot;SF_CLI_PATH&quot;: &quot;/path/to/ScreamingFrogSEOSpiderLauncher&quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart Claude Desktop. The MCP is now available.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Code setup.&lt;/strong&gt;
Run this command.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;claude mcp add screaming-frog https://github.com/bzsasson/screaming-frog-mcp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Cursor, Windsurf, or other MCP-compatible clients.&lt;/strong&gt;
Check your client&apos;s MCP configuration documentation. The setup is similar: point the client to the MCP server and set the &lt;code&gt;SF_CLI_PATH&lt;/code&gt; environment variable.&lt;/p&gt;
&lt;p&gt;Test it by asking your AI assistant to list your saved crawls. You&apos;ll see the database IDs and sizes of every crawl you&apos;ve stored in Screaming Frog. If you see output, you&apos;re connected.&lt;/p&gt;
&lt;p&gt;One gotcha on Linux and Windows. &lt;code&gt;SF_CLI_PATH&lt;/code&gt; must point to the exact path where Screaming Frog installs the CLI executable. Install paths vary by OS and installer version. Verify the path on your machine before configuring the env var. If &lt;code&gt;sf_check&lt;/code&gt; fails, you&apos;ve likely got the path wrong.&lt;/p&gt;
&lt;p&gt;The GitHub repo has the latest installation instructions and troubleshooting for platform-specific issues.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;database-lock&quot; label=&quot;The database lock and why you close the GUI first&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;The Database Lock (Why You Close the GUI First)&lt;/h2&gt;
&lt;p&gt;Screaming Frog stores crawl data in SQLite, in a directory called ProjectInstanceData. When you open the Screaming Frog GUI, it acquires an exclusive write lock on that database. No other process can read from it while the lock is held. This is standard SQLite exclusive locking behavior and protects data consistency.&lt;/p&gt;
&lt;p&gt;The MCP uses Screaming Frog&apos;s headless CLI, which tries to acquire a read lock. It cannot do this while the GUI has the exclusive lock. You&apos;ll see an error message: &quot;The database is locked. Please quit the SF GUI first, then retry.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Close the Screaming Frog GUI before running any MCP commands. Wait a moment for the lock to release, then run your analysis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Exception:&lt;/strong&gt; The &lt;code&gt;list_crawls&lt;/code&gt; tool is read-only and works even while the GUI is open. You can run it to check which crawls exist without closing the GUI. Every other tool (&lt;code&gt;crawl_site&lt;/code&gt;, &lt;code&gt;export_crawl&lt;/code&gt;, &lt;code&gt;delete_crawl&lt;/code&gt;) requires the GUI to be closed.&lt;/p&gt;
&lt;p&gt;The lock is not a bug. It&apos;s a safety feature that prevents simultaneous writes. If you see a database lock error, you know the fix: quit the GUI and retry.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;on-page-seo&quot; label=&quot;How to use Screaming Frog to improve on-page SEO&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;How to Use Screaming Frog to Improve On-Page SEO&lt;/h2&gt;
&lt;p&gt;Export the Page Titles, Meta Description, H1, and H2 tabs from a crawl, then ask your AI assistant to flag pages with missing, duplicate, or truncated elements. The MCP lets you cross-reference on-page issues with crawl depth and inlink count so you fix the highest-impact pages first.&lt;/p&gt;
&lt;p&gt;The standard on-page audit covers five signals: title tags (missing, duplicate, over 60 characters), meta descriptions (missing, duplicate, over 160 characters), H1 tags (missing, multiple, or duplicate), H2 structure (missing on long pages), and image alt text (missing on above-the-fold images).&lt;/p&gt;
&lt;p&gt;Ask your AI assistant:&lt;/p&gt;
&lt;p&gt;&quot;Export Page Titles:All, Meta Description:All, H1:All, H2:All, and Images:All from my crawl. Show me a summary of on-page issues sorted by page count.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant exports, reads, and aggregates. The output might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;On-page issues summary:

Missing meta descriptions:     342 pages
Duplicate title tags:          89 pages (31 unique duplicates)
Missing H1 tags:               23 pages
Multiple H1 tags:              156 pages
Title tags over 60 chars:      412 pages
Missing image alt text:        1,204 images across 389 pages
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now drill in:&lt;/p&gt;
&lt;p&gt;&quot;Which of the 342 pages missing meta descriptions have the most inlinks? Show me the top 20.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant cross-references with the Internal:All export. The pages with the most inlinks are your highest-priority fixes; they&apos;re the most internally linked (and likely most visited) pages without descriptions.&lt;/p&gt;
&lt;p&gt;For duplicate titles, ask:&lt;/p&gt;
&lt;p&gt;&quot;Group the duplicate titles and show me how many pages share each one. Are any of these on different URL patterns like /products/ vs /category/?&quot;&lt;/p&gt;
&lt;p&gt;This reveals whether duplicates are a template issue (all product pages sharing a generic title) or a content issue (two distinct pages with the same title). Template issues are a one-line fix in your CMS. Content issues need individual attention.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;broken-pages&quot; label=&quot;Finding broken pages that actually matter&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;How Do You Find Broken Pages That Actually Matter?&lt;/h2&gt;
&lt;p&gt;Export Response Codes and Internal Links data, then cross-reference to find broken pages that still have internal links pointing to them. These are the pages costing you crawl equity, and fixing them has the most immediate impact.&lt;/p&gt;
&lt;p&gt;Most audits start with &quot;find all 404s.&quot; That&apos;s too basic. The real question is which broken pages have internal links pointing to them, and how many.&lt;/p&gt;
&lt;p&gt;Start by listing your saved crawls. Ask your AI assistant:&lt;/p&gt;
&lt;p&gt;&quot;Show me my saved crawls with their database IDs.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant runs the &lt;code&gt;list_crawls&lt;/code&gt; command and returns something like this.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Site Crawl 1, Database ID 1234 (8,234 URLs)
Site Crawl 2, Database ID 5678 (3,156 URLs)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pick the crawl you want to analyze. Ask:&lt;/p&gt;
&lt;p&gt;&quot;Export the Response Codes and Internal Links data from crawl 1234.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant exports both data sets. This takes 20-40 seconds depending on crawl size.&lt;/p&gt;
&lt;p&gt;Next, ask:&lt;/p&gt;
&lt;p&gt;&quot;Show me pages that returned 4xx or 5xx errors and have at least one internal link pointing to them. Sort by the number of inlinks (highest first).&quot;&lt;/p&gt;
&lt;p&gt;Your assistant reads the exported data, cross-references the Response Codes tab with the Internal Links tab, filters for broken pages with inlinks, and sorts. This takes another 10-20 seconds.&lt;/p&gt;
&lt;p&gt;The output looks something like this.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Pages with errors AND internal links (sorted by inlink count)

https://example.com/products/widget-1  | 404 | 127 inlinks
https://example.com/blog/old-post-2019 | 410 | 89 inlinks
https://example.com/category/tools      | 503 | 45 inlinks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the pages costing you the most crawl equity. The first page has 127 internal links pointing to a 404. The impact is severe.&lt;/p&gt;
&lt;p&gt;Now ask:&lt;/p&gt;
&lt;p&gt;&quot;Which of these are linked from the homepage, site navigation, or footer?&quot;&lt;/p&gt;
&lt;p&gt;Your assistant filters by link source and identifies the highest-impact broken links. Then ask:&lt;/p&gt;
&lt;p&gt;&quot;What&apos;s the pattern? Are all 404s on a specific section like /products or /blog?&quot;&lt;/p&gt;
&lt;p&gt;This kind of follow-up analysis is where the MCP shows its value. You&apos;re not re-exporting, not re-opening spreadsheets. You&apos;re asking questions and getting answers in real time.&lt;/p&gt;
&lt;p&gt;Once you have your prioritized list, hand it to your development team with the link sources. They know exactly what to fix and in what order. For URLs that should still resolve (pages that moved, migration targets), redirect them to the new destination. For pages that should genuinely return 404, remove the internal links pointing to them and drop them from the sitemap.&lt;/p&gt;
&lt;p&gt;One gotcha here. The default export tabs don&apos;t include all response code granularity. If you need to distinguish between 404s and 410s (Gone), or between different 5xx error types, the default &quot;Response Codes:All&quot; tab gives you the full breakdown. But if you need response times or other performance metrics, you&apos;ll need to specify additional tabs. Ask your assistant: &quot;Include Response Codes:All and Performance:All in the export.&quot;&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;redirects&quot; label=&quot;Finding and fixing 301 redirect chains&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Can 301 Redirect Chains Hurt Your SEO?&lt;/h2&gt;
&lt;p&gt;Yes. Each hop in a redirect chain adds latency, wastes crawl budget, and dilutes link equity. Google follows up to 10 redirects but recommends keeping chains to one hop. Two hops maximum is the safe target.&lt;/p&gt;
&lt;p&gt;Long chains signal poor site architecture and create UX friction.&lt;/p&gt;
&lt;p&gt;You&apos;ve crawled a site you recently migrated. You expect some redirect chains from old URLs to new ones, but you want to find the longest chains and any loops. For more on handling redirects during major site changes, see our guide on site migrations (coming soon).&lt;/p&gt;
&lt;p&gt;Ask your AI assistant:&lt;/p&gt;
&lt;p&gt;&quot;Export the redirect chain data from my crawl. I need to find chains longer than 2 hops and identify any circular redirects.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant runs an export using the bulk export option (not the standard tab exports). The export parameter is important: &lt;code&gt;bulk_export=&apos;All Redirect Chains&apos;&lt;/code&gt;, not &lt;code&gt;export_tabs=&apos;Redirect Chains:All&apos;&lt;/code&gt;. This is a common gotcha. Bulk exports are a different category from standard tab exports, and they use different parameters.&lt;/p&gt;
&lt;p&gt;The export returns a CSV with every redirect chain in the crawl. Your assistant parses it and shows you something like this.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Chain 1: /old-product.html -&amp;gt; /products/old-product.html -&amp;gt; /products/new-product.html -&amp;gt; /products/new-product-v2.html (3 hops)
Chain 2: /about.html -&amp;gt; /about-us.html -&amp;gt; /about.html (loop detected)
Chain 3: /blog/2019 -&amp;gt; /blog/2020 -&amp;gt; /blog/current (2 hops)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Your assistant identifies chain 1 (over your 2-hop threshold) and chain 2 (a loop, a critical issue).&lt;/p&gt;
&lt;p&gt;Ask:&lt;/p&gt;
&lt;p&gt;&quot;Generate a redirect remediation map. For chains over 2 hops, show the old URL and the final destination URL (skipping the intermediate hops).&quot;&lt;/p&gt;
&lt;p&gt;Your assistant outputs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/old-product.html -&amp;gt; /products/new-product-v2.html
/about.html -&amp;gt; /about.html (loop, needs manual intervention)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You now have a direct redirect: one hop instead of three. Your developers can update the redirect rules in your web server config, and crawl budget improves immediately.&lt;/p&gt;
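&lt;p&gt;The remediation logic itself is simple once each chain is parsed into an ordered list of URLs. Here&apos;s a minimal sketch using the example chains above; in practice your assistant (or your script) builds these lists from the bulk export CSV, whose exact layout depends on your Screaming Frog version.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Each chain is the ordered list of URLs from first request to final destination
chains = [
    [&quot;/old-product.html&quot;, &quot;/products/old-product.html&quot;, &quot;/products/new-product.html&quot;, &quot;/products/new-product-v2.html&quot;],
    [&quot;/about.html&quot;, &quot;/about-us.html&quot;, &quot;/about.html&quot;],
    [&quot;/blog/2019&quot;, &quot;/blog/2020&quot;, &quot;/blog/current&quot;],
]

MAX_HOPS = 2  # the safe target discussed above

for chain in chains:
    hops = len(chain) - 1  # one hop per redirect
    if len(set(chain)) &amp;lt; len(chain):
        print(f&quot;LOOP  {chain[0]} (chain revisits a URL, fix manually)&quot;)
    elif hops &amp;gt; MAX_HOPS:
        # Remediation: point the first URL straight at the final destination
        print(f&quot;FIX   {chain[0]} -&amp;gt; {chain[-1]} ({hops} hops, collapse to 1)&quot;)
    else:
        print(f&quot;OK    {chain[0]} ({hops} hops)&quot;)
&lt;/code&gt;&lt;/pre&gt;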
&lt;p&gt;&lt;strong&gt;What is the difference between bulk_export and export_tabs?&lt;/strong&gt; &lt;code&gt;bulk_export&lt;/code&gt; handles cross-tab aggregate data (redirect chains, all inlinks, all outlinks). &lt;code&gt;export_tabs&lt;/code&gt; handles individual tab views (Response Codes:All, Page Titles:All). Using the wrong one returns empty or incomplete data. For redirect chains specifically, &lt;code&gt;bulk_export=&apos;All Redirect Chains&apos;&lt;/code&gt; is correct. If you accidentally use &lt;code&gt;export_tabs=&apos;Redirect Chains:All&apos;&lt;/code&gt;, you get nothing useful. When hunting for redirect loops, also include the Response Codes bulk export so your assistant can cross-reference each chain&apos;s final destination with its HTTP status and separate true dead ends (4xx, 5xx) from intentional redirects.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;orphan-pages&quot; label=&quot;Finding orphan pages with Screaming Frog&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;How Do You Find Orphan Pages in Screaming Frog?&lt;/h2&gt;
&lt;p&gt;Crawl the site normally, then compare the discovered URLs against your full sitemap. Pages in the sitemap that the crawler never reached through internal links are your orphans. The MCP automates this comparison.&lt;/p&gt;
&lt;p&gt;Orphan pages are URLs that exist on your site but have zero internal links pointing to them. Search engines discover pages by following links. If nothing links to a page, crawlers may never find it, or they&apos;ll deprioritize it in their crawl queue. These pages often rank poorly even if the content is strong.&lt;/p&gt;
&lt;p&gt;The tricky part: Screaming Frog&apos;s standard crawl only finds pages it can reach by following links from the start URL. By definition, it can&apos;t crawl orphan pages this way. To find them, you need to feed Screaming Frog a list of all known URLs (from your sitemap or server logs) and then compare that list against the pages the crawler actually discovered through links.&lt;/p&gt;
&lt;p&gt;Here&apos;s the workflow with the MCP.&lt;/p&gt;
&lt;p&gt;First, crawl the site normally. This discovers all pages reachable through internal links.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Crawl https://example.com and label it &apos;example-link-crawl-march-2026&apos;.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While that runs (or after), prepare a list of all URLs from your sitemap. You can ask your assistant to fetch and parse the sitemap, or upload a URL list file.&lt;/p&gt;
&lt;p&gt;Once the crawl finishes, export the Internal:All data.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Export Internal:All from the crawl I just ran.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now ask:&lt;/p&gt;
&lt;p&gt;&quot;Compare the sitemap URLs against the crawled URLs. Which sitemap URLs were never discovered through internal links?&quot;&lt;/p&gt;
&lt;p&gt;Your assistant cross-references the two lists and returns the orphans.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Orphan pages (in sitemap but not discovered via links):

https://example.com/landing/summer-sale          (in sitemap, 0 inlinks)
https://example.com/resources/old-whitepaper     (in sitemap, 0 inlinks)
https://example.com/products/discontinued-widget (in sitemap, 0 inlinks)
&lt;/code&gt;&lt;/pre&gt;
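&lt;p&gt;If you want to run the comparison yourself, here&apos;s a minimal sketch. It assumes a single sitemap.xml (not a sitemap index) and an Internal:All export with an Address column; adjust both to match your site and your export headers.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv
import urllib.request
import xml.etree.ElementTree as ET

# Fetch and parse the sitemap (a sitemap index would need an extra loop)
NS = {&quot;sm&quot;: &quot;http://www.sitemaps.org/schemas/sitemap/0.9&quot;}
with urllib.request.urlopen(&quot;https://example.com/sitemap.xml&quot;) as resp:
    sitemap_urls = {loc.text.strip() for loc in ET.parse(resp).findall(&quot;.//sm:loc&quot;, NS)}

# URLs the crawler actually reached by following internal links
with open(&quot;internal_all.csv&quot;, newline=&quot;&quot;) as f:
    crawled_urls = {row[&quot;Address&quot;] for row in csv.DictReader(f)}

for url in sorted(sitemap_urls - crawled_urls):
    print(url, &quot;(in sitemap, not discovered via links)&quot;)
&lt;/code&gt;&lt;/pre&gt;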
&lt;p&gt;For each orphan, decide: should this page exist? If yes, add internal links from relevant pages. If it&apos;s outdated or irrelevant, remove it from the sitemap and consider adding a redirect or noindex.&lt;/p&gt;
&lt;p&gt;Ask your assistant for a recommendation:&lt;/p&gt;
&lt;p&gt;&quot;For each orphan page, suggest which existing pages on the site would be the best candidates to link from, based on URL structure and topic similarity.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant analyzes the URL patterns and suggests linking opportunities. The &lt;code&gt;/landing/summer-sale&lt;/code&gt; page probably belongs in a promotions section. The whitepaper should be linked from the resources hub. The discontinued product should redirect to its replacement or category page.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;duplicate-content&quot; label=&quot;Finding duplicate content with Screaming Frog&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;How Do You Find Duplicate Content with Screaming Frog?&lt;/h2&gt;
&lt;p&gt;Export the Content:All and Duplicate:All tabs, then group duplicate pairs by canonical status. Pages with no canonical tag set are the highest priority, because search engines are guessing which version to index.&lt;/p&gt;
&lt;p&gt;Duplicate content causes crawl waste and can dilute ranking signals. Two URLs serving identical or near-identical content compete against each other instead of consolidating authority on a single page.&lt;/p&gt;
&lt;p&gt;Screaming Frog identifies exact duplicates (same content hash) and near-duplicates (high similarity percentage). The MCP can export and analyze both.&lt;/p&gt;
&lt;p&gt;Ask your AI assistant:&lt;/p&gt;
&lt;p&gt;&quot;Export the Content:All data and the Duplicate:All data from my crawl.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant exports both tabs. The Duplicate tab shows URL pairs with their similarity scores.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Show me pages with duplicate content. Group them by the canonical URL (if set) and flag any pairs where neither page has a canonical tag.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output highlights three categories:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Duplicates with correct canonicals&lt;/strong&gt; (low priority, already handled)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Duplicates with conflicting canonicals&lt;/strong&gt; (each page points to itself, medium priority)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Duplicates with no canonical at all&lt;/strong&gt; (high priority, search engines are guessing)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For category 3, your assistant can suggest which URL should be the canonical based on factors like URL length, crawl depth, and inlink count. The shorter URL with more inlinks is usually the right canonical target.&lt;/p&gt;
&lt;p&gt;A common pattern on e-commerce sites: parameterized URLs like &lt;code&gt;/products/widget?color=blue&lt;/code&gt; and &lt;code&gt;/products/widget?color=red&lt;/code&gt; serve nearly identical content. Ask your assistant:&lt;/p&gt;
&lt;p&gt;&quot;Are any of the duplicate pairs caused by URL parameters? Show me the parameter patterns.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant groups the duplicates by their base URL (stripping parameters) and shows which parameters are causing the duplication. This gives you a clear list of parameters to handle via canonical tags, URL rewriting, or robots.txt rules for parameters that only waste crawl budget. (Google Search Console&apos;s URL Parameters tool was retired in 2022, so it&apos;s no longer an option.)&lt;/p&gt;
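&lt;p&gt;The grouping logic is easy to sketch. The URL list below is illustrative; in practice it comes from the duplicate export.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit

# Illustrative duplicate URLs; in practice these come from the export
duplicate_urls = [
    &quot;https://example.com/products/widget?color=blue&quot;,
    &quot;https://example.com/products/widget?color=red&quot;,
    &quot;https://example.com/products/widget?color=blue&amp;amp;utm_source=mail&quot;,
]

params_by_base = defaultdict(set)
for url in duplicate_urls:
    parts = urlsplit(url)
    base = f&quot;{parts.scheme}://{parts.netloc}{parts.path}&quot;
    params_by_base[base].update(name for name, _ in parse_qsl(parts.query))

for base, params in params_by_base.items():
    print(base, &quot;duplicated by parameters:&quot;, &quot;, &quot;.join(sorted(params)))
&lt;/code&gt;&lt;/pre&gt;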
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;js-crawling&quot; label=&quot;JavaScript crawling with Screaming Frog MCP&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;How to Use Screaming Frog for JavaScript Crawling&lt;/h2&gt;
&lt;p&gt;Save a &lt;code&gt;.seospiderconfig&lt;/code&gt; file with JavaScript rendering enabled in the GUI, then pass it to the MCP&apos;s &lt;code&gt;crawl_site&lt;/code&gt; tool. After the crawl, export the JavaScript:All tab and compare rendered content against raw HTML to find pages where critical elements only appear after JS execution.&lt;/p&gt;
&lt;p&gt;JavaScript-rendered content is a common source of indexing issues. If your site uses React, Vue, Angular, or any client-side framework, the content that Screaming Frog sees by default (raw HTML) may differ from what search engines see after rendering.&lt;/p&gt;
&lt;p&gt;Screaming Frog supports JavaScript rendering through its built-in Chromium engine. The MCP can trigger crawls with JavaScript rendering enabled, but only if you&apos;ve saved a &lt;code&gt;.seospiderconfig&lt;/code&gt; file with those settings.&lt;/p&gt;
&lt;p&gt;Here&apos;s the workflow.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open the Screaming Frog GUI&lt;/li&gt;
&lt;li&gt;Go to Configuration &amp;gt; Spider &amp;gt; Rendering and set it to &quot;JavaScript&quot;&lt;/li&gt;
&lt;li&gt;Adjust the rendering timeout (5 seconds is a good starting point for most sites)&lt;/li&gt;
&lt;li&gt;Save the configuration: File &amp;gt; Save Configuration &amp;gt; &lt;code&gt;js-rendering.seospiderconfig&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Close the GUI&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now use the MCP with that config.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Crawl https://example.com using my js-rendering.seospiderconfig file.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Your assistant passes the config file to the &lt;code&gt;crawl_site&lt;/code&gt; tool.&lt;/p&gt;
&lt;p&gt;After the crawl finishes, compare the rendered content against the raw HTML.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Export the JavaScript:All tab from this crawl. Show me pages where the rendered title or H1 differs from the raw HTML version.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pages where the title or H1 only appears after JavaScript execution are vulnerable to indexing issues. Googlebot renders JavaScript, but with delays. Other search engines (Bing, Yandex) may not render at all.&lt;/p&gt;
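&lt;p&gt;Flagging these pages yourself is a short script once the JavaScript:All export is on disk. The column names below are placeholders; check the CSV header of your export and adjust, since the exact labels vary by Screaming Frog version.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv

# Placeholder column names; match them to your export&apos;s actual header row
RAW_TITLE = &quot;Title 1 (Original HTML)&quot;
RENDERED_TITLE = &quot;Title 1 (Rendered HTML)&quot;

with open(&quot;javascript_all.csv&quot;, newline=&quot;&quot;) as f:
    for row in csv.DictReader(f):
        raw = (row.get(RAW_TITLE) or &quot;&quot;).strip()
        rendered = (row.get(RENDERED_TITLE) or &quot;&quot;).strip()
        if raw != rendered:
            print(row[&quot;Address&quot;])
            print(&quot;  raw HTML title: &quot;, raw or &quot;(empty)&quot;)
            print(&quot;  rendered title: &quot;, rendered or &quot;(empty)&quot;)
&lt;/code&gt;&lt;/pre&gt;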
&lt;p&gt;For each flagged page, the fix depends on your stack. Server-side rendering (SSR) is the safest option. If SSR isn&apos;t feasible, ensure critical content (titles, headings, main body text) is present in the initial HTML response, not injected by JavaScript after load.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;full-audit&quot; label=&quot;How to perform a sitewide SEO audit&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;How Do I Perform a Sitewide SEO Audit?&lt;/h2&gt;
&lt;p&gt;Crawl the site, export all key tabs in one batch (Response Codes, Page Titles, Meta Descriptions, H1s, Images, Canonicals, Directives), then ask your AI assistant to summarize and prioritize issues by impact. The full workflow takes five steps: crawl, export, summarize, prioritize, drill in.&lt;/p&gt;
&lt;p&gt;Here&apos;s the complete workflow using the MCP.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Crawl.&lt;/strong&gt; Ask your assistant to crawl the site. For sites under 10,000 pages, a standard crawl takes 5-15 minutes. For larger sites, use a saved config with appropriate limits.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Crawl https://example.com with a max of 50,000 URLs. Label it &apos;full-audit-march-2026&apos;.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Export everything.&lt;/strong&gt; Once the crawl finishes, export all the data you&apos;ll need in one go.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Export Internal:All, Response Codes:All, Page Titles:All, Meta Description:All, H1:All, H2:All, Images:All, Canonicals:All, and Directives:All from this crawl.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Get the summary.&lt;/strong&gt; Ask for the big picture first.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Give me a summary: total pages crawled, broken pages (4xx and 5xx), pages with missing titles, pages with missing meta descriptions, pages with missing H1s, pages with duplicate titles, and pages with duplicate meta descriptions.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 4: Prioritize.&lt;/strong&gt; Ask your assistant to rank the issues by impact.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Rank these issues by severity. For broken pages, weight by number of inlinks. For missing titles, weight by crawl depth (shallow pages are higher priority). Give me the top 20 items to fix first.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 5: Drill into specifics.&lt;/strong&gt; Pick the highest-priority issue and drill in.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;Show me the 10 highest-traffic pages with missing meta descriptions.&quot;
&quot;Which of our product pages have duplicate titles?&quot;
&quot;Are there any pages blocked by robots.txt that have inlinks?&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the core loop: crawl, export, summarize, prioritize, drill in. Each follow-up question costs you seconds instead of the minutes it takes to re-export and re-filter in a spreadsheet.&lt;/p&gt;
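&lt;p&gt;If you ever need to reproduce the Step 3 summary outside the assistant, the counting is straightforward. The file names and column headers below (Status Code, Title 1, Meta Description 1, H1-1) are assumptions to verify against your actual export.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv

def rows(path):
    with open(path, newline=&quot;&quot;) as f:
        return list(csv.DictReader(f))

pages = rows(&quot;internal_all.csv&quot;)          # assumed file names
codes = rows(&quot;response_codes_all.csv&quot;)

summary = {
    &quot;total pages crawled&quot;: len(pages),
    &quot;broken pages (4xx/5xx)&quot;: sum(1 for r in codes if r[&quot;Status Code&quot;].startswith((&quot;4&quot;, &quot;5&quot;))),
    &quot;missing titles&quot;: sum(1 for r in pages if not r.get(&quot;Title 1&quot;, &quot;&quot;).strip()),
    &quot;missing meta descriptions&quot;: sum(1 for r in pages if not r.get(&quot;Meta Description 1&quot;, &quot;&quot;).strip()),
    &quot;missing H1s&quot;: sum(1 for r in pages if not r.get(&quot;H1-1&quot;, &quot;&quot;).strip()),
}

for metric, count in summary.items():
    print(f&quot;{metric}: {count}&quot;)
&lt;/code&gt;&lt;/pre&gt;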
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;audit-frequency&quot; label=&quot;How often to perform SEO audits&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;How Often Should You Perform an SEO Audit?&lt;/h2&gt;
&lt;p&gt;After every major deployment, weekly for high-churn sites (e-commerce, job boards, news), monthly to quarterly for stable sites, and daily for the first week after any migration. The MCP makes weekly and post-deploy audits practical by automating the crawl-export-analyze loop.&lt;/p&gt;
&lt;p&gt;The right cadence depends on how fast your site changes:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;After every major deployment.&lt;/strong&gt; Any code release that touches URL structure, redirects, meta tags, robots.txt, or sitemap generation should trigger an audit. The MCP makes this practical because you can automate it (see the next section).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Weekly for high-churn sites.&lt;/strong&gt; E-commerce, classifieds, job boards, news sites. Focus on response codes and redirect health rather than full audits.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monthly to quarterly for stable sites.&lt;/strong&gt; Blogs, SaaS marketing sites, corporate sites. Monthly if you publish regularly, quarterly for brochure sites. Full audit covering all signal types.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;After every migration or redesign.&lt;/strong&gt; This is the most critical audit. Run it the day of migration and repeat daily for the first week. Redirect chains and broken pages surface quickly when you&apos;re checking every day.&lt;/p&gt;
&lt;p&gt;The MCP&apos;s automation capabilities make weekly audits practical even for large portfolios. Set up crawls to run on a schedule and have your assistant analyze the results when they&apos;re ready.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;automation&quot; label=&quot;Automating recurring audits across multiple sites&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Automating Recurring Audits Across Multiple Sites&lt;/h2&gt;
&lt;p&gt;The GUI is fine for a single site. The MCP shows its value when you&apos;re auditing the same portfolio of sites every month or after every deployment.&lt;/p&gt;
&lt;p&gt;Imagine you manage five client sites. Each is crawled weekly. You have two hours to identify the top issues per site and compile a summary. Manually crawling and exporting each site takes 15-30 minutes per site. You&apos;re looking at 75-150 minutes of work just to get the data.&lt;/p&gt;
&lt;p&gt;With the MCP, ask your AI assistant:&lt;/p&gt;
&lt;p&gt;&quot;List all my saved crawls and group them by client.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant returns something like this.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Client A: crawl_id_001 (8,234 URLs)
Client B: crawl_id_002 (3,156 URLs)
Client C: crawl_id_003 (12,401 URLs)
Client D: crawl_id_004 (2,890 URLs)
Client E: crawl_id_005 (5,678 URLs)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now ask:&lt;/p&gt;
&lt;p&gt;&quot;For each crawl, find the top 3 issues: missing meta descriptions, pages with 404 responses, and pages with no internal links. Rank by frequency and show me a summary.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant runs a bulk analysis. For each database ID, it exports Meta Description:All and Response Codes:All, plus the internal link data for the orphan check, reads the CSVs, and compiles the results. Within minutes, you have a table.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Client A: 324 missing, 18 404s, 6 orphaned
Client B: 145 missing, 3 404s, 0 orphaned
Client C: 8,934 missing, 234 404s, 45 orphaned
Client D: 98 missing, 0 404s, 2 orphaned
Client E: 612 missing, 89 404s, 12 orphaned
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Client C&apos;s 404 count is alarming. Ask:&lt;/p&gt;
&lt;p&gt;&quot;For Client C, show me the 404 pages. Which ones are orphaned (no internal links)?&quot;&lt;/p&gt;
&lt;p&gt;Your assistant cross-references the 404 list with the internal link data. Result: 89 of the 234 404s are truly orphaned. The other 145 still have internal links and should redirect or be restored.&lt;/p&gt;
&lt;p&gt;For recurring audits at scale, create a Python script that runs on a schedule.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
from datetime import datetime

# Sites to audit
sites = {
    &quot;client-a&quot;: &quot;https://clienta.com&quot;,
    &quot;client-b&quot;: &quot;https://clientb.com&quot;,
    &quot;client-c&quot;: &quot;https://clientc.com&quot;,
}

# Trigger crawls
for site_name, url in sites.items():
    print(f&quot;Starting crawl for {site_name}...&quot;)
    # Your AI assistant or script calls crawl_site(url=url, label=f&quot;{site_name}-{datetime.now().isoformat()}&quot;)

# Poll status until complete
# Your script polls crawl_status() for each crawl

# Export and analyze
results = {}
for site_name in sites.keys():
    # Your script calls export_crawl and read_crawl_data
    # Aggregates issues per site
    pass

# Generate report
print(json.dumps(results, indent=2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The workflow is trigger crawls, poll status, export results, analyze, report. For 5 sites running in parallel (up to 2 concurrent per the MCP limits), the whole process takes 30-50 minutes instead of 2+ hours.&lt;/p&gt;
&lt;p&gt;For 50 sites, the automation saves days of manual work. This is the genuine value of the MCP: eliminating the repetitive loop of clicking-crawling-exporting-analyzing across a portfolio.&lt;/p&gt;
&lt;p&gt;One gotcha for batch crawls. You can run a maximum of 2 concurrent crawls. If you have 10 sites to audit, queue them. Also, headless crawls via the MCP use default Screaming Frog settings unless you pass a custom &lt;code&gt;.seospiderconfig&lt;/code&gt; file. For consistent results across sites, create a config in the GUI with your preferred settings (authentication, JavaScript rendering timeout, URL filters), save it, and reference it in your &lt;code&gt;crawl_site&lt;/code&gt; calls.&lt;/p&gt;
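&lt;p&gt;The batching itself is trivial to script. Here&apos;s a minimal sketch of grouping a site list into pairs to respect the two-crawl limit, with the actual crawl and polling calls left to your MCP client.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def batches(items, size=2):
    &quot;&quot;&quot;Yield items in groups of `size` (the MCP allows 2 concurrent crawls).&quot;&quot;&quot;
    for i in range(0, len(items), size):
        yield items[i:i + size]

sites = [&quot;https://clienta.com&quot;, &quot;https://clientb.com&quot;, &quot;https://clientc.com&quot;,
         &quot;https://clientd.com&quot;, &quot;https://cliente.com&quot;]

for pair in batches(sites, 2):
    # Start both crawls in the pair, then poll crawl_status until they finish
    print(&quot;crawling:&quot;, &quot;, &quot;.join(pair))
&lt;/code&gt;&lt;/pre&gt;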
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;cross-reference&quot; label=&quot;Cross-referencing crawl data with backlink and SERP data&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;How Do You Cross-Reference Crawl Data with Backlink and SERP Data?&lt;/h2&gt;
&lt;p&gt;Connect the Screaming Frog MCP alongside the Ahrefs MCP (or DataForSEO MCP) in the same AI session. Your assistant can match crawl response codes against backlink data to find dead pages with live external links, or combine crawl depth with organic traffic to find high-ranking pages buried deep in your site architecture.&lt;/p&gt;
&lt;p&gt;The real power of MCP comes from combining tools in one session. Your AI assistant can pull crawl data from Screaming Frog, cross-reference it with backlink data from Ahrefs (via the Ahrefs MCP), and check SERP positions via DataForSEO, all in one conversation.&lt;/p&gt;
&lt;p&gt;&quot;Show me pages with broken backlinks that also have 404 status in my crawl&quot; is a query that would take 30 minutes with manual exports. With MCP, it&apos;s one question.&lt;/p&gt;
&lt;p&gt;The workflow: export your crawl&apos;s Response Codes data, then query Ahrefs for backlinks pointing to those URLs. Your assistant matches the two datasets and identifies external links pointing to dead pages. These are link reclamation opportunities: reach out to the linking sites and ask them to update the URL, or redirect the broken page to the most relevant live page.&lt;/p&gt;
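&lt;p&gt;The matching step is a simple set intersection once both exports are on disk. The backlink CSV&apos;s column names below (Target URL, Referring Page) are assumptions about your backlink tool&apos;s export format, not a documented schema.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv

def rows(path):
    with open(path, newline=&quot;&quot;) as f:
        return list(csv.DictReader(f))

# Dead pages from the crawl export
dead = {r[&quot;Address&quot;] for r in rows(&quot;response_codes_all.csv&quot;)
        if r[&quot;Status Code&quot;].startswith((&quot;4&quot;, &quot;5&quot;))}

# External links pointing at those dead pages = link reclamation targets
for link in rows(&quot;backlinks.csv&quot;):
    if link[&quot;Target URL&quot;] in dead:
        print(link[&quot;Referring Page&quot;], &quot;links to dead page&quot;, link[&quot;Target URL&quot;])
&lt;/code&gt;&lt;/pre&gt;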
&lt;p&gt;Similarly, you can combine crawl depth data with organic traffic from Ahrefs to find pages that rank well but sit deep in your site architecture. Those pages deserve better internal linking.&lt;/p&gt;
&lt;p&gt;As more SEO tools ship MCP servers, the combinations multiply. Sitebulb, Lumar, and ContentKing don&apos;t have MCP servers yet. If you build one for a tool you use, share it on the &lt;a href=&quot;https://github.com/modelcontextprotocol/servers&quot;&gt;MCP registry&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;export-options&quot; label=&quot;Export options, custom configs, and power user tips&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Power User: Export Options and Custom Configs&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What export options does the Screaming Frog MCP support?&lt;/strong&gt; Over 30 tabs and bulk exports. The default export covers internal structure, response codes, meta tags, headings, images, and directives. Beyond that, you can export External Links, JavaScript execution status, Structured Data validation, sitemaps, AMP metadata, security headers, hreflang configuration, and pagination signals.&lt;/p&gt;
&lt;p&gt;Say you&apos;re auditing structured data. Ask your AI assistant:&lt;/p&gt;
&lt;p&gt;&quot;Export all structured data validation errors from my crawl.&quot;&lt;/p&gt;
&lt;p&gt;Your assistant exports using &lt;code&gt;bulk_export=&apos;All Structured Data,Validation Errors&apos;&lt;/code&gt; and returns a CSV like this.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;URL                                | Schema Type  | Error
/product/widget-a.html             | Product      | Missing &quot;offers&quot; property
/article/blog-post.html            | NewsArticle  | Invalid date format
/event/conference-2026.html        | Event        | Missing &quot;location&quot; property
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, fifteen percent of product pages are missing the offers property, every article has a date format issue, and event pages lack location data. Each is a fixable issue, and grouping by error type gives you a fix list already ranked by frequency.&lt;/p&gt;
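&lt;p&gt;To rank the errors by frequency yourself, a few lines of counting are enough. The file name is a placeholder; the column names mirror the example output above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv
from collections import Counter

with open(&quot;structured_data_validation_errors.csv&quot;, newline=&quot;&quot;) as f:
    issues = Counter((r[&quot;Schema Type&quot;], r[&quot;Error&quot;]) for r in csv.DictReader(f))

for (schema_type, error), count in issues.most_common():
    print(f&quot;{count:4d}  {schema_type}: {error}&quot;)
&lt;/code&gt;&lt;/pre&gt;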
&lt;p&gt;&lt;strong&gt;How do you use custom crawl configurations with the MCP?&lt;/strong&gt; Create a &lt;code&gt;.seospiderconfig&lt;/code&gt; file in the GUI (File &amp;gt; Save Configuration), then pass it to &lt;code&gt;crawl_site&lt;/code&gt; via the &lt;code&gt;config_file&lt;/code&gt; parameter. The config stores spider options, excluded patterns, JS rendering settings, and authentication, so every crawl runs with identical settings.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;crawl_site(url=&apos;https://example.com&apos;, config_file=&apos;/path/to/mycrawl.seospiderconfig&apos;, label=&apos;client-site-march-2026&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;How do you filter exports to specific error types?&lt;/strong&gt; Use the exact filter name from the Screaming Frog GUI. For example, &quot;Response Codes:Client Error (4xx)&quot; instead of &quot;Response Codes:All&quot; to get only 4xx errors. The names are case-sensitive and must match the GUI labels exactly. One character off and the export silently produces nothing for that filter.&lt;/p&gt;
&lt;p&gt;For large crawls (5,000+ URLs), the MCP handles pagination. By default, &lt;code&gt;read_crawl_data&lt;/code&gt; returns 100 rows. Use the &lt;code&gt;offset&lt;/code&gt; parameter to fetch the next batch.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# First call: rows 1-100
read_crawl_data(export_id=&apos;...&apos;, file=&apos;Meta Description:All.csv&apos;, limit=100, offset=0)

# Second call: rows 101-200
read_crawl_data(export_id=&apos;...&apos;, file=&apos;Meta Description:All.csv&apos;, limit=100, offset=100)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Your assistant can automate this. Ask: &quot;Read all rows from the meta description export and categorize them by URL pattern (product, blog, category).&quot; Your assistant batches the reads and aggregates the results.&lt;/p&gt;
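&lt;p&gt;If you script the pagination yourself, it&apos;s an offset loop that stops when a batch comes back short. The &lt;code&gt;call_tool&lt;/code&gt; callable below stands in for however your MCP client invokes a tool by name; its argument and return shapes are assumptions, not the documented client API.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def read_all_rows(call_tool, export_id, file, batch=100):
    &quot;&quot;&quot;Page through read_crawl_data until a short batch signals the end.&quot;&quot;&quot;
    rows, offset = [], 0
    while True:
        chunk = call_tool(&quot;read_crawl_data&quot;, {
            &quot;export_id&quot;: export_id, &quot;file&quot;: file, &quot;limit&quot;: batch, &quot;offset&quot;: offset,
        })
        rows.extend(chunk)
        if len(chunk) &amp;lt; batch:
            return rows
        offset += batch
&lt;/code&gt;&lt;/pre&gt;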
&lt;p&gt;To see the complete reference of all available export tabs and filters, ask your assistant: &quot;Show me the screaming-frog://export-reference resource.&quot; This gives you every option available for exporting.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;extending&quot; label=&quot;Extending the MCP server for custom tools&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;For Builders: Extending the MCP&lt;/h2&gt;
&lt;p&gt;The Screaming Frog MCP is open source, built on FastMCP, and clocks in at roughly 905 lines of Python. If you want to add custom analysis tools, fork it and extend it.&lt;/p&gt;
&lt;p&gt;The architecture is straightforward. Each tool is a function decorated with &lt;code&gt;@mcp.tool()&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@mcp.tool()
async def my_custom_tool(param1: str, param2: int = 10) -&amp;gt; str:
    &quot;&quot;&quot;Tool description here. FastMCP exposes this docstring to the AI assistant.&quot;&quot;&quot;
    # Validate inputs (swap _validate_input for your own check)
    if not _validate_input(param1):
        raise ValueError(&quot;Invalid input&quot;)

    # Do the work (swap _do_work for your own analysis)
    result = await _do_work(param1, param2)

    # Return JSON so the assistant can parse the result
    return json.dumps(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A real example. You want a tool that identifies pages with high crawl depth and suggests internal linking improvements. You&apos;d write a function that exports the Internal:All data, filters by crawl depth, and returns the deep pages alongside shallow candidates for internal linking. Add the function to the server, decorate it, and your AI assistant can call it directly.&lt;/p&gt;
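&lt;p&gt;Here&apos;s a sketch of what that tool might look like. It reuses the server&apos;s existing &lt;code&gt;mcp&lt;/code&gt; FastMCP instance and reads an already-exported Internal:All CSV; the &lt;code&gt;Crawl Depth&lt;/code&gt; column name is a guess to verify against your export.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv
import json

@mcp.tool()  # mcp is the server&apos;s existing FastMCP instance
async def find_deep_pages(csv_path: str, max_depth: int = 3) -&amp;gt; str:
    &quot;&quot;&quot;List pages deeper than max_depth clicks from the start URL.&quot;&quot;&quot;
    deep = []
    with open(csv_path, newline=&quot;&quot;) as f:
        for row in csv.DictReader(f):
            depth = int(row.get(&quot;Crawl Depth&quot;) or 0)  # column name assumed
            if depth &amp;gt; max_depth:
                deep.append({&quot;url&quot;: row[&quot;Address&quot;], &quot;depth&quot;: depth})
    deep.sort(key=lambda page: page[&quot;depth&quot;], reverse=True)
    return json.dumps(deep)
&lt;/code&gt;&lt;/pre&gt;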
&lt;p&gt;&lt;strong&gt;What security protections does the MCP include?&lt;/strong&gt; URL validation blocks private IPs, localhost, and cloud metadata endpoints like &lt;code&gt;metadata.google.internal&lt;/code&gt;. Argument validation rejects command-line injection. Database IDs are regex-validated to prevent traversal attacks. CSV reads are sandboxed to the export directory, so paths like &lt;code&gt;../../../etc/passwd&lt;/code&gt; are rejected.&lt;/p&gt;
&lt;p&gt;You can adapt the MCP for other SEO tools. If a tool exposes a CLI or API, the pattern is the same: wrap the interface, expose the data through tools, and let your AI assistant query it conversationally.&lt;/p&gt;
&lt;p&gt;Deploy your extended MCP locally.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx /path/to/your/fork/screaming-frog-mcp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or containerize with Docker.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FROM python:3.11
# The Screaming Frog CLI must also be present in the image,
# since every tool call shells out to it (see the gotcha below)
COPY . /app
WORKDIR /app
RUN pip install -e .
CMD [&quot;screaming-frog-mcp&quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One gotcha for builders. The server runs as a subprocess wrapper around Screaming Frog&apos;s CLI. Every tool call spawns a CLI process. If SF isn&apos;t installed or the license has expired, every call fails. The &lt;code&gt;sf_check&lt;/code&gt; tool exists for exactly this reason. Run it first in any automated setup to verify the CLI is available and licensed.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;ToolSection id=&quot;troubleshooting&quot; label=&quot;Troubleshooting common issues&quot;&amp;gt;&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&quot;The database is locked&quot; error.&lt;/strong&gt;
The Screaming Frog GUI is open. Close it, wait 2-3 seconds for the SQLite lock to release, then retry. The &lt;code&gt;list_crawls&lt;/code&gt; tool is the one exception -- it works with the GUI open because it only reads metadata.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;sf_check&lt;/code&gt; fails or returns &quot;not found.&quot;&lt;/strong&gt;
The &lt;code&gt;SF_CLI_PATH&lt;/code&gt; environment variable points to the wrong location. On macOS, the default path is &lt;code&gt;/Applications/Screaming Frog SEO Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher&lt;/code&gt;. On Linux and Windows, the path depends on your installer. Run &lt;code&gt;which screamingfrogseospider&lt;/code&gt; (Linux) or check your Start Menu shortcut properties (Windows) to find the actual path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Exports return empty or incomplete data.&lt;/strong&gt;
Three common causes. First, you used &lt;code&gt;export_tabs&lt;/code&gt; when you needed &lt;code&gt;bulk_export&lt;/code&gt; (or vice versa). Redirect chains, all inlinks, and all outlinks require &lt;code&gt;bulk_export&lt;/code&gt;. Individual tab views like Response Codes:All use &lt;code&gt;export_tabs&lt;/code&gt;. Second, the filter name doesn&apos;t match the GUI label exactly. Filter names are case-sensitive: &quot;Response Codes:Client Error (4xx)&quot; works, &quot;response codes:client error (4xx)&quot; doesn&apos;t. Third, the crawl hasn&apos;t finished. Check &lt;code&gt;crawl_status&lt;/code&gt; before exporting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Crawl hangs or takes much longer than expected.&lt;/strong&gt;
Headless crawls use default Screaming Frog settings unless you pass a &lt;code&gt;.seospiderconfig&lt;/code&gt; file. If the site has millions of pages and you didn&apos;t set a URL limit, the crawl will keep going. Use the &lt;code&gt;max_urls&lt;/code&gt; parameter on &lt;code&gt;crawl_site&lt;/code&gt; to cap it. For JavaScript-heavy sites, rendering adds significant time per page. Start with a 5,000-URL limit to estimate total crawl time before running a full crawl.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&quot;License expired&quot; or &quot;License not found&quot; errors.&lt;/strong&gt;
The MCP uses Screaming Frog&apos;s CLI, which requires a valid license. A free license crawls up to 500 URLs. For larger crawls, you need a paid license. Run &lt;code&gt;sf_check&lt;/code&gt; to verify your license status. If the license recently expired, renew it in the GUI before running MCP commands.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Two concurrent crawls maximum.&lt;/strong&gt;
The MCP limits you to 2 simultaneous crawls. If you queue a third, it waits. For large portfolios, batch your crawls in groups of 2. Each crawl still runs independently, so a slow site won&apos;t block a fast one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;read_crawl_data&lt;/code&gt; only returns 100 rows.&lt;/strong&gt;
This is the default &lt;code&gt;limit&lt;/code&gt;. Use the &lt;code&gt;offset&lt;/code&gt; parameter to paginate through larger datasets, or ask your AI assistant to read all rows automatically. It will batch the reads and aggregate the results.&lt;/p&gt;
&lt;p&gt;&amp;lt;/ToolSection&amp;gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The Screaming Frog MCP turns crawl analysis from a manual, repetitive process into a conversational one. Export data, ask questions, drill into specifics, and get prioritized output without re-opening spreadsheets between each step.&lt;/p&gt;
&lt;p&gt;Start with a single crawl. Ask your assistant to find missing H1 tags or redirect chains. Get comfortable with the tool sequence: list crawls, export data, read results, analyze. Once you&apos;ve run one audit through the MCP, the patterns are clear and you can scale to weekly audits across multiple sites.&lt;/p&gt;
&lt;p&gt;The biggest shift isn&apos;t speed (though that matters). It&apos;s that follow-up questions cost seconds instead of minutes. &quot;Which of these broken pages have the most inlinks?&quot; is one sentence, not a re-export and a VLOOKUP. That changes how thoroughly you audit, because drilling deeper stops being expensive.&lt;/p&gt;
&lt;p&gt;The source code is on GitHub. If you build something useful on top of it, open a PR or share it on the MCP registry.&lt;/p&gt;
</content:encoded><media:content url="https://technicalseonews.com/images/guides/screaming-frog-mcp.webp" medium="image" type="image/webp"/></item></channel></rss>