robots.txt for AI Bots — 2025 Guide for Local Businesses

robots.txt tells crawlers which paths they may fetch — including GPTBot, ClaudeBot, and PerplexityBot. Blocking AI bots does not hide you from AI assistants; it often makes citations less accurate. Local businesses should allow public service and location pages while blocking admin, cart, and duplicate shells.

robots.txt is the bouncer — AI systems still ask who you are

A dental group in Austin discovers ChatGPT cites old Saturday hours — discontinued six months ago. GBP is correct. The website shows the update. Yet AI keeps quoting Saturday appointments.

The culprit, more often than owners expect: robots.txt or template-level blocks preventing AI-oriented crawlers from fetching /hours or the location template entirely. The model falls back to stale directory snippets or training-era text.

robots.txt is a plain-text file at /robots.txt that tells compliant crawlers which paths they may request. It is not authentication. It is not DRM. Malicious bots ignore it. Legitimate bots — including many used in retrieval-augmented answers — respect it.

This 2025 guide is for local operators, marketers, and devs deciding what to allow, what to block, and how those choices interact with llms.txt and schema, IndexNow freshness, and the review graph that still drives how AI assistants choose businesses.

Honest scope: robots.txt hygiene enables accurate retrieval. It does not replace reviews, NAP, or GBP. Fix the bouncer; still stock the shelves.

How robots.txt works (quick refresher)

Syntax basics

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

User-agent: GPTBot
Allow: /

Sitemap: https://example.com/sitemap.xml
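
  • User-agent names the crawler; * is the wildcard default. A crawler follows only the most specific group that matches its name, so GPTBot above does not inherit the * rules
  • Disallow gives a path prefix crawlers should not fetch
  • Allow carves an exception within a disallowed tree. Under RFC 9309 and Google's parser the longest matching path wins regardless of line order; some simpler parsers take the first match instead
  • Sitemap is an optional hint pointing to sitemap.xml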

Robots.txt governs crawl, not necessarily index. A URL blocked in robots.txt may still appear in search results if linked elsewhere — with sparse snippets. For AI retrieval, blocked often means not fetched — old facts persist elsewhere.
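To make the Allow/Disallow precedence concrete, here is a minimal sketch of longest-match evaluation for a single group, in the spirit of RFC 9309. The function name `is_allowed` and the rule tuples are illustrative assumptions, and the sketch deliberately ignores the `*` and `$` wildcards that real parsers support.

```python
from typing import List, Tuple

# Each rule is ("allow" | "disallow", path_prefix). Hypothetical representation.
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Return True if `path` may be fetched under longest-match precedence."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if prefix == "" or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        is_allow = kind == "allow"
        # The longer prefix wins; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


wp_rules = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]
print(is_allowed(wp_rules, "/wp-admin/admin-ajax.php"))  # True
print(is_allowed(wp_rules, "/wp-admin/options.php"))     # False
print(is_allowed(wp_rules, "/services/drain-cleaning"))  # True
```

Because the longest prefix decides, swapping the two rules in `wp_rules` gives the same answers. A first-match parser would not behave that way, which is worth remembering when a testing tool and a crawler disagree.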

Compliance model

Major AI crawlers generally self-identify and honor robots.txt. They are not obligated by law worldwide; practice varies. Assume honor-system for product crawlers; assume nothing for scrapers.

AI-related user-agents to know (2025)

Names evolve, so confirm them in vendor docs before audits.

User-agent | Operator | Typical purpose
GPTBot | OpenAI | Training and browsing-related fetch
ChatGPT-User | OpenAI | Real-time browsing fetches triggered by user prompts
ClaudeBot | Anthropic | Crawl for Claude products
anthropic-ai | Anthropic | Alternate identifier (check docs)
PerplexityBot | Perplexity | Citation retrieval
Google-Extended | Google | Generative AI product use (distinct from Googlebot)
Applebot-Extended | Apple | Apple Intelligence / Siri-related surfaces
Bytespider | ByteDance | Various AI/search products
FacebookBot | Meta | Social + Meta AI stacks
cohere-ai | Cohere | Enterprise retrieval partners

Googlebot still matters for Maps-adjacent web signals and AI Overviews — separate from Google-Extended policy choices.

Local businesses rarely need bot-by-bot philosophical debates. They need public money pages crawlable and admin junk blocked.

The blocking debate — privacy vs accuracy

Arguments for blocking AI crawlers

  • Reduce training inclusion of proprietary content
  • Limit exposure of unpublished pricing or internal PDFs
  • Enterprise legal caution on bulk ingestion
  • Competitive paranoia — rarely justified for local service pages

Arguments against blanket blocks (local default)

  • Retrieval needs HTML — Perplexity and browsing ChatGPT cite URLs they can fetch
  • Stale citations hurt — models use directories and old snapshots when your site is unreachable
  • You already publish public facts — hours, services, phone — blocking crawlers does not hide them from aggregators
  • Competitors remain crawlable — asymmetric information loss

For most plumbers, dentists, and law firms, the correct posture is: allow public entity pages; block sensitive paths — not Disallow: / for GPTBot.

If you block, do so surgically and with eyes open: when you later diagnose why ChatGPT does not recommend your business, crawl access belongs on the list of suspects.

Recommended allow/disallow patterns for local sites

Always allow (public entity graph)

  • / homepage with NAP and LocalBusiness context
  • /locations/* or /service-area/* geography pages
  • /services/* scope and FAQ content
  • /contact and /about trust pages
  • /llms.txt if deployed
  • /faq or on-page FAQ sections

Always disallow (non-public or low value)

  • /wp-admin/, /admin, /cpanel
  • /cart, /checkout, /my-account
  • /search?, internal site search result URLs
  • /tag/, /author/ archive noise (site-dependent)
  • Staging hostnames — block entire staging via separate robots or auth
  • Parameterized duplicates — ?print=, session IDs where applicable

Example — local service business template

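# Default group: any crawler without a named group below
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Allow: /wp-admin/admin-ajax.php

# Named AI crawlers share one group. A named group replaces the * group
# for that crawler, so the private-path rules are repeated here.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Allow: /wp-admin/admin-ajax.php
Allow: /

Sitemap: https://example.com/sitemap.xml

A crawler follows only the most specific group that matches its name and does not inherit the * rules, which is why the private paths are repeated for the named bots. Multiple groups naming the same crawler are combined. Split a bot into its own group if legal requires stricter rules for it.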
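One detail worth checking in any file that has both a * group and named groups: a crawler that matches a named group follows that group only. The sketch below uses Python's standard `urllib.robotparser` on a shortened, hypothetical file (not the full template) to show the effect; `SomeOtherBot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, shortened robots.txt used only for this demonstration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler with no named group falls back to the * group.
print(parser.can_fetch("SomeOtherBot", "/cart/"))  # False
# GPTBot matches its own group, so the * rules are not consulted.
print(parser.can_fetch("GPTBot", "/cart/"))        # True
```

If the intent is for GPTBot to skip /cart/ as well, the Disallow line has to appear inside the GPTBot group too.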

Common local-business robots.txt mistakes

Mistake 1 — Blanket disallow after plugin install

Security plugins sometimes ship aggressive defaults: Disallow: / for unknown bots. Audit after every plugin update.

Mistake 2 — Blocking /wp-content/ or JS/CSS

Legacy SEO myth. Modern crawlers need assets; blocking can harm render and fetch quality.

Mistake 3 — Accidental block of location subdirectory

Franchise restructure leaves Disallow: /locations/ from old CMS migration notes — AI retrieves homepage only; city pages invisible.

Mistake 4 — Staging robots copied to production

Disallow: / on production — catastrophic. CI should enforce environment-specific files.

Mistake 5 — Blocking llms.txt or sitemap paths

Some teams disallow /llms.txt thinking it hides facts — it hides your version of facts, not Yelp's.

Mistake 6 — Relying on robots.txt for secrets

PII, client portals, and HIPAA forms need auth — not robots hints.

Mistake 7 — Blocking Google-Extended but ignoring Googlebot

Understand which product each rule affects. Local Maps traffic still flows through Google crawl ecosystems.

robots.txt vs meta robots vs HTTP headers

Control | Layer | Effect
robots.txt | Site-wide path crawl permission | Crawler may not fetch body
<meta name="robots" content="noindex"> | Page HTML | Ask not to index after fetch
X-Robots-Tag header | HTTP response | Same as meta, server-side

AI retrieval pipelines may never see noindex if robots.txt blocked fetch — contradictory signals confuse debugging.
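When debugging contradictory signals, it can help to list what a page actually declares at the header and HTML layers. The following is a rough sketch using only the standard library. The function name `index_signals` is an assumption, the caller supplies headers and HTML that were already fetched, and user-agent-scoped header values such as `googlebot: noindex` are not handled.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _MetaRobots(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            parts = attr.get("content", "").split(",")
            self.directives += [p.strip().lower() for p in parts if p.strip()]


def index_signals(headers: Dict[str, str], html: str) -> dict:
    """Report robots directives found in the response headers and the HTML."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [p.strip().lower() for p in header_value.split(",") if p.strip()]
    meta = _MetaRobots()
    meta.feed(html)
    return {
        "header": header_directives,
        "meta": meta.directives,
        "noindex": "noindex" in header_directives or "noindex" in meta.directives,
    }


page = '<html><head><meta name="robots" content="index, follow"></head></html>'
print(index_signals({}, page))
```

None of these signals are visible to a crawler that robots.txt stopped from fetching the page, so run this only against URLs you expect to be crawlable.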

Best practice for public location pages: allow crawl + index (index, follow). Use noindex only on thank-you pages, thin duplicates, and internal search results.

Interaction with AI retrieval and citations

Perplexity-style citation engines

How Perplexity cites local businesses: retrieval fetches candidate URLs. Blocked robots → URL skipped → competitor or directory cited instead.

ChatGPT browsing modes

When browsing triggers, fetch respects robots.txt for identified bots. User-visible answers may omit your site entirely even if GBP is strong.

Google Gemini and AI Overviews

Google-Extended policy is separate from standard indexing. Many local sites allow Googlebot while evaluating Google-Extended — document your choice; Maps relevance still ties to normal Google crawl health.

Offline model knowledge

Training corpora may include older page text regardless of current robots — blocking today does not erase history. Fresh corrections need fetchable pages.

Audit checklist — run quarterly

1. Fetch live file

curl -s https://yourdomain.com/robots.txt

Compare to repo — production drift happens.
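The comparison itself can be scripted. This sketch assumes you already have both versions as text, for example the curl output above and the file from version control; the function name `robots_drift` and the file labels are illustrative.

```python
import difflib
from typing import List


def robots_drift(repo_text: str, live_text: str) -> List[str]:
    """Return unified-diff lines between the repo copy and the live file.

    An empty list means the two copies match, ignoring trailing whitespace.
    """
    repo_lines = [line.rstrip() for line in repo_text.strip().splitlines()]
    live_lines = [line.rstrip() for line in live_text.strip().splitlines()]
    return list(
        difflib.unified_diff(
            repo_lines,
            live_lines,
            fromfile="repo/robots.txt",
            tofile="live/robots.txt",
            lineterm="",
        )
    )


repo = "User-agent: *\nDisallow: /wp-admin/\n"
live = "User-agent: *\nDisallow: /wp-admin/\nDisallow: /\n"
for line in robots_drift(repo, live):
    print(line)
```

A line starting with `+` exists only on the live site, which is the usual signature of a plugin or hosting panel rewriting the file.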

2. Validate sitemap reference

Sitemap URL returns 200 and lists money pages.
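The "lists money pages" half of this check can be sketched as follows, assuming a standard sitemaps.org urlset that has already been downloaded. The helper names and the required prefixes are assumptions; the HTTP 200 check and sitemap index files are out of scope here.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlparse

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_paths(xml_text: str) -> List[str]:
    """Extract the URL path of every <loc> entry in a urlset sitemap."""
    root = ET.fromstring(xml_text.strip())
    return [
        urlparse(loc.text.strip()).path
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]


def missing_prefixes(xml_text: str, required_prefixes: List[str]) -> List[str]:
    """Return the required path prefixes that no sitemap URL starts with."""
    paths = sitemap_paths(xml_text)
    return [
        prefix
        for prefix in required_prefixes
        if not any(path.startswith(prefix) for path in paths)
    ]


# Hypothetical sitemap content for illustration.
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/services/root-canal</loc></url>
</urlset>
"""

print(missing_prefixes(SAMPLE, ["/services/", "/locations/"]))  # ['/locations/']
```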

3. Spot-check AI user-agents

Use Bing Webmaster Tools or a third-party robots.txt tester to simulate GPTBot and PerplexityBot against:

  • Homepage
  • One location page
  • One service page
  • /llms.txt

Expect Allowed for public paths.
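If no tester is at hand, the same spot-check can be approximated locally with Python's standard `urllib.robotparser`. This is a sketch: the function name `blocked_pairs` and the sample paths are assumptions, and Python's parser checks rules in file order rather than by longest match, so its verdict can differ from a real crawler's on files that rely on Allow exceptions.

```python
from typing import List, Tuple
from urllib.robotparser import RobotFileParser


def blocked_pairs(
    robots_text: str, agents: List[str], paths: List[str]
) -> List[Tuple[str, str]]:
    """Return every (agent, path) pair the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [
        (agent, path)
        for agent in agents
        for path in paths
        if not parser.can_fetch(agent, path)
    ]


# Hypothetical file with a leftover location block.
LEFTOVER = "User-agent: *\nDisallow: /locations/\n"
AGENTS = ["GPTBot", "PerplexityBot"]
PUBLIC_PATHS = ["/", "/locations/austin/", "/services/", "/llms.txt"]

for agent, path in blocked_pairs(LEFTOVER, AGENTS, PUBLIC_PATHS):
    print(f"BLOCKED {agent} {path}")
```

An empty result is the expected outcome for public paths; anything printed deserves a look before the next step.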

4. Cross-check Search Console

Look for coverage errors and "Blocked by robots.txt" reports on /services/* URLs, and fix them immediately.

5. Log CMS changes

Record each plugin changelog entry that touches robots.txt in the robots section of the runbook.

6. Sample AI prompts

Monthly: run prompts that should cite your service page URL. If citations point only to directories, crawl access may be failing. Track the result as part of your share-of-AI-voice measurement.

Multi-location and franchise considerations

  • Single robots.txt per domain — standard for most brands
  • Subdomains: locations.brand.com needs its own robots.txt, because the file applies per host
  • Path-based locales: ensure /locations/us/tn/nashville/ is not disallowed by an overly broad wildcard rule such as /*/us/ left over from legacy i18n experiments (a plain /us/ prefix only matches paths that start with /us/)
  • White-label partners — franchisee microsites on unique domains each need audits

Central marketing should ship robots templates in franchise kits — drift is the enemy.

Blocking selectively — enterprise and YMYL

Law firms and clinics sometimes block AI training crawlers but allow search crawlers:

User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

Understand tradeoff: less training inclusion, potentially fewer live citations from OpenAI products. Measure mention rate before and after — anecdotes are not policy.

For patient portals at /portal/, disallow all bots:

User-agent: *
Disallow: /portal/

Public marketing pages remain allowed; the two concerns are separate.

Relationship to llms.txt and schema

llms.txt summarizes entity facts at /llms.txt. If robots blocks that path, models lose publisher-curated grounding.

JSON-LD on allowed pages still helps parsers that fetch HTML. Blocked pages mean invisible schema.

Order of operations:

  1. Fix NAP on listings
  2. Deploy schema + llms.txt on crawlable URLs
  3. Audit robots.txt
  4. Submit IndexNow on updates

DevOps and deployment

  • Store robots.txt in version control
  • Environment-specific builds — never copy staging disallow wholesale
  • Post-deploy smoke test in CI — curl robots + simulate GPTBot on /locations/test-city
  • Alert if production robots contains Disallow: /

Agency handoffs: require robots diff in transition doc.

When blocking makes sense

Legitimate block scenarios:

  • Paid member content or course libraries
  • Unreleased product pages behind marketing holds (use auth instead when possible)
  • Aggressive scraper abuse on /api/ — rate limit plus disallow
  • Regulatory hold on specific document paths

Not legitimate: hiding from AI because competitors might read your public services list — they already can.

Worked example — plumber after site migration

A Raleigh plumber migrates WordPress → headless. Old robots had:

User-agent: *
Disallow: /wp-content/uploads/

New stack serves images from /assets/ — fine. But migration plugin added:

User-agent: GPTBot
Disallow: /

Copied from a blog post about "protecting content from AI."

Symptoms: Perplexity cites Angi listing only; ChatGPT gives wrong emergency line from old Yelp.

Fix:

  1. Remove blanket GPTBot disallow
  2. Allow /services/, /service-area/, /llms.txt
  3. Disallow /api/ and /studio/ (CMS preview)
  4. Resubmit sitemap; IndexNow push on service-area pages
  5. Week-four rescan — citation URLs shift to owned domain

Lesson: ideological blocking without measurement creates silent invisibility.

Policy documentation for stakeholders

Marketing, legal, and dev should share a one-page policy:

  • Public pages: allow major AI crawlers
  • Blocked paths list with rationale
  • Review cadence quarterly
  • Owner for changes

This avoids vague directives like "legal said block AI" that never specify paths.

Staging, pre-production, and preview hosts

Local businesses break AI citations during redesigns when staging leaks into indexes or production copies staging robots.

Staging subdomain pattern:

# staging.example.com/robots.txt
User-agent: *
Disallow: /

Production must never inherit this file. CI guard: fail build if production artifact contains Disallow: / without accompanying Allow rules for public paths.
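A guard along these lines could be sketched as below. It is an assumption-laden simplification rather than a full robots.txt parser: the function names are made up, wildcards are ignored, and a group counts as a blanket block when it contains `Disallow: /` and no non-empty Allow rule.

```python
from typing import List, Tuple

Group = Tuple[List[str], List[Tuple[str, str]]]  # (user-agents, rules)


def parse_groups(robots_text: str) -> List[Group]:
    """Split robots.txt text into groups of user-agents and their rules."""
    groups: List[Group] = []
    agents: List[str] = []
    rules: List[Tuple[str, str]] = []
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if rules:  # a user-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif key in ("allow", "disallow"):
            rules.append((key, value))
    if agents:
        groups.append((agents, rules))
    return groups


def blanket_disallow_agents(robots_text: str) -> List[str]:
    """Return user-agents whose group has `Disallow: /` and no Allow rule."""
    offenders: List[str] = []
    for agents, rules in parse_groups(robots_text):
        has_blanket = ("disallow", "/") in rules
        has_allow = any(kind == "allow" and value for kind, value in rules)
        if has_blanket and not has_allow:
            offenders.extend(agents)
    return offenders


STAGING = "# staging.example.com/robots.txt\nUser-agent: *\nDisallow: /\n"
print(blanket_disallow_agents(STAGING))  # ['*']
```

In CI, the production build would fail when the returned list contains `*`, or any crawler the written policy says must be allowed. A deliberate per-bot block can be listed as an exception there.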

Password-protected staging is better than robots-only protection — robots is a hint, not a lock.

Preview URLs from headless CMS (preview.example.com) should always disallow all crawlers and use noindex — preview NAP experiments have appeared in Perplexity when teams pushed IndexNow against preview hostnames by mistake.
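One way to avoid the IndexNow mistake is to refuse to build a submission that contains any non-production hostname. The sketch below only assembles the request body and sends nothing. The function name and the key value are placeholders, and the field names (host, key, urlList) follow the published IndexNow JSON format, which should be confirmed against the current protocol documentation.

```python
from typing import Dict, List
from urllib.parse import urlparse


def indexnow_payload(urls: List[str], production_host: str, key: str) -> Dict[str, object]:
    """Build an IndexNow request body, rejecting URLs on any other host."""
    rejected = [u for u in urls if urlparse(u).hostname != production_host]
    if rejected:
        raise ValueError(f"refusing to submit non-production URLs: {rejected}")
    return {
        "host": production_host,
        "key": key,  # placeholder; use the key you host for verification
        "urlList": list(urls),
    }


payload = indexnow_payload(
    ["https://example.com/service-area/raleigh/"],
    production_host="example.com",
    key="hypothetical-key",
)
print(payload["urlList"])

try:
    indexnow_payload(
        ["https://preview.example.com/service-area/raleigh/"],
        production_host="example.com",
        key="hypothetical-key",
    )
except ValueError as err:
    print(err)
```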

Document hostname → robots mapping in the agency SOP.

Log analysis — spotting AI bot traffic

Server logs complement robots audits. Filter access logs for:

  • GPTBot
  • ClaudeBot
  • PerplexityBot
  • Applebot-Extended

Healthy pattern after launching /locations/new-city/:

  • First GPTBot or PerplexityBot hit within days to weeks post-launch
  • Repeated fetches after content updates
  • 200 responses on HTML, not 403 from WAF

Red flags:

  • Zero AI bot hits for 90 days while Googlebot hits location URLs — suspect user-agent block
  • 403/503 on all bot traffic — hosting firewall louder than robots.txt
  • Only homepage hits — /locations/ disallowed or unlinked

Share log snippets with dev when mention rate stalls despite strong reviews — crawl access is an under-tested hypothesis.
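A first pass over the logs might look like the sketch below. It assumes the common "combined" access log layout (request, status, size, referrer, user-agent in that order); the regular expression, function name, and sample lines are illustrative and will need adjusting for other log formats.

```python
import re
from collections import Counter
from typing import Iterable

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Applebot-Extended")

# Matches: "METHOD /path PROTO" status size "referrer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def ai_bot_hits(log_lines: Iterable[str]) -> Counter:
    """Count hits per (bot, HTTP status) for the AI crawlers listed above."""
    hits: Counter = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        user_agent = match.group("ua").lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[(bot, int(match.group("status")))] += 1
    return hits


# Hypothetical log lines; the IPs are documentation addresses.
SAMPLE_LOG = [
    '203.0.113.7 - - [12/Mar/2025:10:15:32 +0000] "GET /locations/new-city/ HTTP/1.1" 200 5123 "-" "ExampleUA/1.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [12/Mar/2025:10:16:01 +0000] "GET /locations/new-city/ HTTP/1.1" 403 312 "-" "ExampleUA/1.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.4 - - [12/Mar/2025:10:17:45 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (ordinary browser)"',
]

for (bot, status), count in sorted(ai_bot_hits(SAMPLE_LOG).items()):
    print(bot, status, count)
```

A bot that only ever appears with 403 or 503 points at the firewall rather than robots.txt, which matches the second red flag above.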

Vendor policy pages — stay current

OpenAI, Anthropic, Perplexity, Google, and Apple publish crawler documentation that changes names and scopes. Assign one owner to review quarterly:

  • New user-agent strings to add to monitoring
  • Policy on Google-Extended vs Googlebot
  • Whether opt-out mechanisms beyond robots exist (some vendors offer publisher controls)

robots.txt remains the universal first lever; publisher dashboards are second where offered.

Bottom line

robots.txt for AI bots is not a philosophical war — it is operational clarity. Local businesses should default to allowing compliant AI crawlers on public homepage, location, service, FAQ, and llms.txt paths while blocking admin, account, and staging trees.

Blanket blocks rarely protect secrets; they often deprive retrieval systems of fresh facts and leave AI assistants quoting directories instead of you. Audit after every redesign, align with schema and sitemap strategy, and measure mention rate — not fear headlines.



Frequently asked questions

Should local businesses block GPTBot and other AI crawlers?

Usually no for public marketing pages. Blocking reduces the chance retrieval systems fetch your current hours, services, and geography. Block only sensitive paths — admin, staging, account areas — not your homepage or location pages.

Does blocking AI bots prevent ChatGPT from mentioning my business?

No. Models still use reviews, listings, and third-party directories. Blocking your site can prevent accurate citations and updates when browsing or retrieval runs.

What is the difference between robots.txt and noindex?

robots.txt disallows crawling of paths; it does not remove URLs from all indexes. noindex on a page asks not to index that URL. You need both concepts in a proper technical audit.

Which AI user-agents should I know in 2025?

Common ones include GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, Google-Extended (Google AI products), Applebot-Extended, and Bytespider. Verify current names in vendor documentation — they change.

How often should I audit robots.txt for AI visibility?

After every site redesign, CMS plugin update, or agency handoff — at minimum quarterly. Wrong disallow rules are a silent cause of stale AI answers.
