July 22, 2025 · 10 min read

robots.txt for AI Bots — 2025 Guide for Local Businesses

Scott Tischler, Founder, AIrecommend.ai. AI visibility, AEO & GEO research for local businesses

robots.txt tells crawlers, including GPTBot, ClaudeBot, and PerplexityBot, which paths they may fetch. Blocking AI bots does not hide you from AI assistants; it often makes citations less accurate. Local businesses should allow public service and location pages while blocking admin, cart, and duplicate shells.

robots.txt is the bouncer — AI systems still ask who you are

A dental group in Austin discovers that ChatGPT cites old Saturday hours, discontinued six months ago. The Google Business Profile (GBP) is correct. The website shows the update. Yet AI keeps quoting Saturday appointments.

The culprit, more often than owners expect: robots.txt or template-level blocks preventing AI-oriented crawlers from fetching /hours or the location template entirely. The model falls back to stale directory snippets or training-era text.

robots.txt is a plain-text file at /robots.txt that tells compliant crawlers which paths they may request. It is not authentication. It is not DRM. Malicious bots ignore it. Legitimate bots — including many used in retrieval-augmented answers — respect it.

This 2025 guide is for local operators, marketers, and devs deciding what to allow, what to block, and how those choices interact with llms.txt and schema, IndexNow freshness, and the review graph that still drives how AI assistants choose businesses.

Honest scope: robots.txt hygiene enables accurate retrieval. It does not replace reviews, NAP (name, address, phone) consistency, or GBP. Fix the bouncer; still stock the shelves.

How robots.txt works (quick refresher)

Syntax basics

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

User-agent: GPTBot
Allow: /

Sitemap: https://example.com/sitemap.xml

User-agent: names the crawler; * is the wildcard default
Disallow: path prefix crawlers should not fetch
Allow: exception within a disallowed tree (Google-style parsers apply the longest matching rule regardless of order; some older parsers stop at the first match, so order can still matter there)
Sitemap: optional hint pointing to sitemap.xml

Robots.txt governs crawl, not necessarily index. A URL blocked in robots.txt may still appear in search results if linked elsewhere — with sparse snippets. For AI retrieval, blocked often means not fetched — old facts persist elsewhere.
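
To make the Allow/Disallow interaction concrete, here is a minimal sketch of longest-match evaluation for a single group, in the style RFC 9309 describes. The function name `is_allowed` and the rule tuples are illustrative assumptions, and the `*` and `$` wildcards are deliberately left out, so treat it as a teaching aid rather than a full parser.

```python
from typing import List, Tuple

# (directive, path prefix), e.g. ("disallow", "/wp-admin/")
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Longest-match evaluation for one group (prefix matching only)."""
    best_len = -1
    allowed = True  # no matching rule means the path may be fetched
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty prefix matches nothing
        longer = len(prefix) > best_len
        tie_goes_to_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_goes_to_allow:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


wildcard_group = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(wildcard_group, "/wp-admin/options.php"))     # False
print(is_allowed(wildcard_group, "/wp-admin/admin-ajax.php"))  # True
print(is_allowed(wildcard_group, "/services/drain-cleaning"))  # True
```

Because the longer Allow prefix wins, the result is the same whichever order the two lines appear in.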

Compliance model

Major AI crawlers generally self-identify and honor robots.txt. They are not obligated by law worldwide; practice varies. Assume honor-system for product crawlers; assume nothing for scrapers.

AI-related user-agents to know (2025)

Names change, so confirm them in vendor docs before each audit.

User-agent	Operator	Typical purpose
GPTBot	OpenAI	Crawling for model training; OpenAI documents a separate crawler for search
ChatGPT-User	OpenAI	Real-time browsing fetches triggered by user prompts
ClaudeBot	Anthropic	Crawl for Claude products
anthropic-ai	Anthropic	Alternate identifier — check docs
PerplexityBot	Perplexity	Citation retrieval
Google-Extended	Google	Generative AI product use (distinct from Googlebot)
Applebot-Extended	Apple	Apple Intelligence / Siri-related surfaces
Bytespider	ByteDance	Various AI/search products
FacebookBot	Meta	Social + meta AI stacks
cohere-ai	Cohere	Enterprise retrieval partners

Googlebot still matters for Maps-adjacent web signals and AI Overviews — separate from Google-Extended policy choices.

Local businesses rarely need bot-by-bot philosophical debates. They need public money pages crawlable and admin junk blocked.

The blocking debate — privacy vs accuracy

Arguments for blocking AI crawlers

Reduce training inclusion of proprietary content
Limit exposure of unpublished pricing or internal PDFs
Enterprise legal caution on bulk ingestion
Competitive paranoia — rarely justified for local service pages

Arguments against blanket blocks (local default)

Retrieval needs HTML — Perplexity and browsing ChatGPT cite URLs they can fetch
Stale citations hurt — models use directories and old snapshots when your site is unreachable
You already publish public facts — hours, services, phone — blocking crawlers does not hide them from aggregators
Competitors remain crawlable — asymmetric information loss

For most plumbers, dentists, and law firms, the correct posture is: allow public entity pages; block sensitive paths — not Disallow: / for GPTBot.

If you block, do so surgically and with eyes open: expect any diagnosis of why ChatGPT does not recommend your business to include crawl access.

Recommended allow/disallow patterns for local sites

Always allow (public entity graph)

/ homepage with NAP and LocalBusiness context
/locations/* or /service-area/* geography pages
/services/* scope and FAQ content
/contact and /about trust pages
/llms.txt if deployed
/faq or on-page FAQ sections

Always disallow (non-public or low value)

/wp-admin/, /admin, /cpanel
/cart, /checkout, /my-account
/search?, internal site search result URLs
/tag/, /author/ archive noise (site-dependent)
Staging hostnames — block entire staging via separate robots or auth
Parameterized duplicates — ?print=, session IDs where applicable

Example — local service business template

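User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Allow: /wp-admin/admin-ajax.php

# Named groups do not inherit the * rules, so the AI crawlers
# share one group that repeats the same Disallow lines.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Allow: /llms.txt
Allow: /locations/
Allow: /services/
Allow: /

Sitemap: https://example.com/sitemap.xml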

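A crawler obeys only the most specific group that names it and ignores the * group, so any Disallow lines meant for GPTBot, ClaudeBot, or PerplexityBot must appear inside their own groups. Groups that name the same crawler are merged. Tighten individual groups if legal requires stricter rules for specific bots.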
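
Group selection is easy to check locally. The sketch below uses Python's standard-library `urllib.robotparser` on a deliberately small, hypothetical file (the `SomeOtherBot` name and the example.com URLs are placeholders) to show that a crawler with its own named group does not pick up the `*` rules.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /cart/

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler with no named group falls back to the * rules.
print(parser.can_fetch("SomeOtherBot", "https://example.com/cart/"))  # False
# GPTBot has its own group, so the * Disallow lines do not apply to it.
print(parser.can_fetch("GPTBot", "https://example.com/cart/"))        # True
```

If the second result is not what you want, the Disallow lines need to be repeated inside the named group.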

Common local-business robots.txt mistakes

Mistake 1 — Blanket disallow after plugin install

Security plugins sometimes ship aggressive defaults: Disallow: / for unknown bots. Audit after every plugin update.

Mistake 2 — Blocking `/wp-content/` or JS/CSS

Legacy SEO myth. Modern crawlers need assets; blocking can harm render and fetch quality.

Mistake 3 — Accidental block of location subdirectory

Franchise restructure leaves Disallow: /locations/ from old CMS migration notes — AI retrieves homepage only; city pages invisible.

Mistake 4 — Staging robots copied to production

Disallow: / on production — catastrophic. CI should enforce environment-specific files.

Mistake 5 — Blocking llms.txt or sitemap paths

Some teams disallow /llms.txt thinking it hides facts — it hides your version of facts, not Yelp's.

Mistake 6 — Relying on robots.txt for secrets

PII, client portals, and HIPAA forms need auth — not robots hints.

Mistake 7 — Blocking Google-Extended but ignoring Googlebot

Understand which product each rule affects. Local Maps traffic still flows through Google crawl ecosystems.

robots.txt vs meta robots vs HTTP headers

Control	Layer	Effect
robots.txt	Site-wide path crawl permission	Crawler may not fetch body
`<meta name="robots" content="noindex">`	Page HTML	Ask not to index after fetch
X-Robots-Tag header	HTTP response	Same as meta, server-side

AI retrieval pipelines may never see noindex if robots.txt blocked fetch — contradictory signals confuse debugging.

Best practice for public location pages: allow crawl + index (index, follow). Use noindex only on thank-you pages, thin duplicates, and internal search results.
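
The interaction between the three controls can be reduced to a small decision function. This is a simplified sketch with hypothetical names (`visible_directives` and its boolean inputs are not from any library); it only captures the point that a page-level noindex cannot be read when the fetch itself is disallowed.

```python
def visible_directives(robots_allows_fetch: bool,
                       meta_noindex: bool,
                       header_noindex: bool) -> str:
    """Describe what a compliant crawler can learn about one URL."""
    if not robots_allows_fetch:
        # The body and response headers are never requested,
        # so neither the meta tag nor X-Robots-Tag is read.
        return "not fetched: page-level noindex is never seen"
    if meta_noindex or header_noindex:
        return "fetched: noindex seen"
    return "fetched: indexable"


# A thank-you page that is both disallowed and marked noindex:
print(visible_directives(False, True, False))
# A public location page:
print(visible_directives(True, False, False))
```

When debugging, the first branch is the one to rule out before looking at meta tags or headers.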

Interaction with AI retrieval and citations

Perplexity-style citation engines

Perplexity-style retrieval fetches candidate URLs when it builds an answer. If robots.txt blocks the crawler, the URL is skipped and a competitor or directory is cited instead.

ChatGPT browsing modes

When browsing triggers, fetch respects robots.txt for identified bots. User-visible answers may omit your site entirely even if GBP is strong.

Google Gemini and AI Overviews

Google-Extended policy is separate from standard indexing. Many local sites allow Googlebot while evaluating Google-Extended — document your choice; Maps relevance still ties to normal Google crawl health.

Offline model knowledge

Training corpora may include older page text regardless of current robots — blocking today does not erase history. Fresh corrections need fetchable pages.

Audit checklist — run quarterly

1. Fetch live file

curl -s https://yourdomain.com/robots.txt

Compare to repo — production drift happens.
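
One way to make the comparison repeatable is a unified diff between the committed file and the live response. The sketch below assumes you already have both as strings (for example, the curl output above and the file from your repository); `robots_drift` and the sample contents are illustrative.

```python
import difflib
from typing import List


def robots_drift(repo_text: str, live_text: str) -> List[str]:
    """Return unified-diff lines between the committed and the live file."""
    return list(difflib.unified_diff(
        repo_text.splitlines(),
        live_text.splitlines(),
        fromfile="repo/robots.txt",
        tofile="live/robots.txt",
        lineterm="",
    ))


repo = "User-agent: *\nDisallow: /wp-admin/\n"
live = "User-agent: *\nDisallow: /wp-admin/\n\nUser-agent: GPTBot\nDisallow: /\n"

for line in robots_drift(repo, live):
    print(line)
```

An empty result means production matches the repository; any `+` or `-` line is drift worth explaining.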

2. Validate sitemap reference

Sitemap URL returns 200 and lists money pages.
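
Checking that money pages are listed can be scripted once the sitemap body has been downloaded. This sketch assumes a plain `<urlset>` sitemap (not a sitemap index) passed in as a string; `missing_from_sitemap` and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def missing_from_sitemap(sitemap_xml: str, required_urls: Iterable[str]) -> List[str]:
    """Return the required URLs that the sitemap does not list."""
    root = ET.fromstring(sitemap_xml)
    listed = {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }
    return [url for url in required_urls if url not in listed]


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/services/</loc></url>
</urlset>"""

money_pages = ["https://example.com/", "https://example.com/locations/austin/"]
print(missing_from_sitemap(SAMPLE, money_pages))
# ['https://example.com/locations/austin/']
```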

3. Spot-check AI user-agents

Use Bing Webmaster Tools or a third-party robots tester to simulate GPTBot and PerplexityBot against:

Homepage
One location page
One service page
/llms.txt

Expect Allowed for public paths.
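
The same spot-check can be run offline against the robots.txt text. The sketch below uses the standard-library `urllib.robotparser`; `spot_check`, the sample file, and the URLs are assumptions for illustration. The standard-library parser may not resolve Allow/Disallow conflicts exactly as each vendor's crawler does, so treat the output as a first pass, not a verdict.

```python
from typing import Dict, Iterable, Tuple
from urllib.robotparser import RobotFileParser


def spot_check(robots_txt: str,
               agents: Iterable[str],
               urls: Iterable[str]) -> Dict[Tuple[str, str], bool]:
    """Map each (agent, url) pair to whether the file permits the fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    url_list = list(urls)
    return {
        (agent, url): parser.can_fetch(agent, url)
        for agent in agents
        for url in url_list
    }


SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /wp-admin/

User-agent: GPTBot
Disallow: /locations/
"""

results = spot_check(
    SAMPLE_ROBOTS,
    ["GPTBot", "PerplexityBot"],
    ["https://example.com/", "https://example.com/locations/austin/"],
)
for (agent, url), ok in results.items():
    print(agent, url, "Allowed" if ok else "Blocked")
```

In this sample, GPTBot is blocked from the location page while PerplexityBot is not, which is the kind of leftover rule the checklist is meant to surface.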

4. Cross-check Search Console

If coverage reports show /services/* URLs as blocked by robots.txt, fix them immediately.

5. Log CMS changes

Record every plugin changelog entry that touches robots.txt in the robots section of your runbook.

6. Sample AI prompts

Monthly, run prompts that should cite your service page URL. If citations point only to directories, crawl access may be failing. Track the result as part of your share-of-AI-voice measurement.
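
A simple way to track this over time is the share of cited URLs that sit on your own domain. The sketch below is an assumption about how one might score a hand-collected list of citations; `owned_citation_share` and the sample URLs are hypothetical, not a standard metric.

```python
from typing import Iterable
from urllib.parse import urlparse


def owned_citation_share(cited_urls: Iterable[str], owned_domain: str) -> float:
    """Fraction of cited URLs whose host is the owned domain or a subdomain."""
    urls = list(cited_urls)
    if not urls:
        return 0.0
    owned = owned_domain.lower()

    def is_owned(url: str) -> bool:
        host = (urlparse(url).hostname or "").lower()
        return host == owned or host.endswith("." + owned)

    return sum(is_owned(url) for url in urls) / len(urls)


# Hypothetical citations collected from one month's prompts.
citations = [
    "https://example.com/services/",
    "https://www.example.com/locations/austin/",
    "https://directory-one.test/biz/example",
    "https://directory-two.test/listing/example",
]
print(owned_citation_share(citations, "example.com"))  # 0.5
```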

Multi-location and franchise considerations

Single robots.txt per domain — standard for most brands
Subdomains — locations.brand.com may need separate files
Path-based locales: ensure /locations/us/tn/nashville/ is not caught by an overly broad wildcard rule such as /*/us/ left over from legacy i18n experiments (a plain /us/ rule only matches paths that begin with /us/)
White-label partners — franchisee microsites on unique domains each need audits

Central marketing should ship robots templates in franchise kits — drift is the enemy.

Blocking selectively — enterprise and YMYL

Law firms and clinics sometimes block AI training crawlers but allow search crawlers:

User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

Understand tradeoff: less training inclusion, potentially fewer live citations from OpenAI products. Measure mention rate before and after — anecdotes are not policy.

For patient portals at /portal/, disallow all bots:

User-agent: *
Disallow: /portal/

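Public marketing pages stay allowed; these are separate concerns. Any crawler with its own named group ignores the * group, so repeat Disallow: /portal/ there, and protect the portal with authentication regardless.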

Relationship to llms.txt and schema

llms.txt summarizes entity facts at /llms.txt. If robots blocks that path, models lose publisher-curated grounding.

JSON-LD on allowed pages still helps parsers that fetch HTML. Blocked pages mean invisible schema.

Order of operations:

Fix NAP on listings
Deploy schema + llms.txt on crawlable URLs
Audit robots.txt
Submit IndexNow on updates

DevOps and deployment

Store robots.txt in version control
Environment-specific builds — never copy staging disallow wholesale
Post-deploy smoke test in CI — curl robots + simulate GPTBot on /locations/test-city
Alert if production robots contains Disallow: /

Agency handoffs: require robots diff in transition doc.

When blocking makes sense

Legitimate block scenarios:

Paid member content or course libraries
Unreleased product pages behind marketing holds (use auth instead when possible)
Aggressive scraper abuse on /api/ — rate limit plus disallow
Regulatory hold on specific document paths

Not legitimate: hiding from AI because competitors might read your public services list — they already can.

Worked example — plumber after site migration

A Raleigh plumber migrates WordPress → headless. Old robots had:

User-agent: *
Disallow: /wp-content/uploads/

New stack serves images from /assets/ — fine. But migration plugin added:

User-agent: GPTBot
Disallow: /

Copied from a blog post about "protecting content from AI."

Symptoms: Perplexity cites Angi listing only; ChatGPT gives wrong emergency line from old Yelp.

Fix:

Remove blanket GPTBot disallow
Allow /services/, /service-area/, /llms.txt
Disallow /api/ and /studio/ (CMS preview)
Resubmit sitemap; IndexNow push on service-area pages
Week-four rescan — citation URLs shift to owned domain

Lesson: ideological blocking without measurement creates silent invisibility.

Policy documentation for stakeholders

Marketing, legal, and dev should share a one-page policy:

Public pages: allow major AI crawlers
Blocked paths list with rationale
Review cadence quarterly
Owner for changes

This reduces vague directives such as "legal said block AI" that never specify paths.

Staging, pre-production, and preview hosts

Local businesses break AI citations during redesigns when staging leaks into indexes or production copies staging robots.

Staging subdomain pattern:

# staging.example.com/robots.txt
User-agent: *
Disallow: /

Production must never inherit this file. CI guard: fail build if production artifact contains Disallow: / without accompanying Allow rules for public paths.
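
The CI guard can be sketched as a small parser that flags any group containing `Disallow: /` with no Allow rule. The function name `blanket_blocked_agents` is hypothetical, and the grouping logic is a simplification of real robots.txt parsing (it ignores wildcards and non-standard fields).

```python
from typing import Dict, List, Optional


def blanket_blocked_agents(robots_txt: str) -> List[str]:
    """List user-agents whose group has `Disallow: /` and no Allow rule."""
    groups: List[Dict[str, List[str]]] = []
    current: Optional[Dict[str, List[str]]] = None
    in_agent_lines = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not in_agent_lines:  # a new group starts here
                current = {"agents": [], "allow": [], "disallow": []}
                groups.append(current)
            current["agents"].append(value)
            in_agent_lines = True
        elif field in ("allow", "disallow") and current is not None:
            current[field].append(value)
            in_agent_lines = False
    flagged: List[str] = []
    for group in groups:
        if "/" in group["disallow"] and not any(group["allow"]):
            flagged.extend(group["agents"])
    return flagged


staging = "# staging.example.com/robots.txt\nUser-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /wp-admin/\nAllow: /wp-admin/admin-ajax.php\n"

print(blanket_blocked_agents(staging))     # ['*']
print(blanket_blocked_agents(production))  # []
```

In a pipeline, a non-empty result for the production artifact would be the signal to fail the build.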

Password-protected staging is better than robots-only protection — robots is a hint, not a lock.

Preview URLs from headless CMS (preview.example.com) should always disallow all crawlers and use noindex — preview NAP experiments have appeared in Perplexity when teams pushed IndexNow against preview hostnames by mistake.

Document hostname → robots mapping in the agency SOP.

Log analysis — spotting AI bot traffic

Server logs complement robots audits. Filter access logs for:

GPTBot
ClaudeBot
PerplexityBot
Applebot (Applebot-Extended is a robots.txt control token, so it does not show up in logs as its own crawler)

Healthy pattern after launching /locations/new-city/:

First GPTBot or PerplexityBot hit within days to weeks post-launch
Repeated fetches after content updates
200 responses on HTML, not 403 from WAF

Red flags:

Zero AI bot hits for 90 days while Googlebot hits location URLs — suspect user-agent block
403/503 on all bot traffic — hosting firewall louder than robots.txt
Only homepage hits — /locations/ disallowed or unlinked

Share log snippets with dev when mention rate stalls despite strong reviews — crawl access is an under-tested hypothesis.
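
A log filter along these lines can be sketched in a few lines of Python. It assumes the common "combined" access log format and uses made-up sample lines; the regular expression, the `ai_bot_hits` name, and the user-agent strings are illustrative, so adjust them to your server's actual format.

```python
import re
from collections import Counter
from typing import Iterable

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")

# request line, status, size, referrer, user-agent (combined log format)
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def ai_bot_hits(lines: Iterable[str]) -> Counter:
    """Count hits per (bot, HTTP status) for known AI user-agents."""
    hits: Counter = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        user_agent = match.group("ua").lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[(bot, match.group("status"))] += 1
    return hits


sample_log = [
    '203.0.113.7 - - [10/Jul/2025:13:55:36 +0000] "GET /locations/austin/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [10/Jul/2025:14:01:02 +0000] "GET /services/ HTTP/1.1" '
    '403 312 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.4 - - [10/Jul/2025:14:05:10 +0000] "GET / HTTP/1.1" '
    '200 8041 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for (bot, status), count in sorted(ai_bot_hits(sample_log).items()):
    print(bot, status, count)
```

A 403 next to a bot name, as in the second sample line, points at the firewall rather than robots.txt.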

Vendor policy pages — stay current

OpenAI, Anthropic, Perplexity, Google, and Apple publish crawler documentation that changes names and scopes. Assign one owner to review quarterly:

New user-agent strings to add to monitoring
Policy on Google-Extended vs Googlebot
Whether opt-out mechanisms beyond robots exist (some vendors offer publisher controls)

robots.txt remains the universal first lever; publisher dashboards are second where offered.

Bottom line

robots.txt for AI bots is not a philosophical war — it is operational clarity. Local businesses should default to allowing compliant AI crawlers on public homepage, location, service, FAQ, and llms.txt paths while blocking admin, account, and staging trees.

Blanket blocks rarely protect secrets; they often deprive retrieval systems of fresh facts and leave AI assistants quoting directories instead of you. Audit after every redesign, align with schema and sitemap strategy, and measure mention rate — not fear headlines.

Technical next steps: llms.txt and schema checklist · sitemap.xml for AI · structured data guide · free scan.

Frequently asked questions

Should local businesses block GPTBot and other AI crawlers?

Usually no for public marketing pages. Blocking reduces the chance retrieval systems fetch your current hours, services, and geography. Block only sensitive paths — admin, staging, account areas — not your homepage or location pages.

Does blocking AI bots prevent ChatGPT from mentioning my business?

No. Models still use reviews, listings, and third-party directories. Blocking your site can prevent accurate citations and updates when browsing or retrieval runs.

What is the difference between robots.txt and noindex?

robots.txt disallows crawling of paths; it does not remove URLs from all indexes. noindex on a page asks not to index that URL. You need both concepts in a proper technical audit.

Which AI user-agents should I know in 2025?

Common ones include GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, Google-Extended (Google AI products), Applebot-Extended, and Bytespider. Verify current names in vendor documentation — they change.

How often should I audit robots.txt for AI visibility?

After every site redesign, CMS plugin update, or agency handoff — at minimum quarterly. Wrong disallow rules are a silent cause of stale AI answers.
