What AI Crawlers Actually See When They Visit Your Website
Your website looks great in a browser. But AI crawlers do not use browsers.
When ChatGPT's web search, Perplexity's retrieval system, or Google's AI indexer visits your site, they see the raw HTML your server sends. No JavaScript execution. No client-side rendering. No interactive elements. Just the HTML.
If your important content loads via JavaScript after the page renders, AI crawlers see only the empty HTML shell. Your check-in time, your service descriptions, your pricing. All invisible.
The 73% Problem
Otterly AI's research across their customer base found that 73% of websites have technical barriers blocking AI crawler access. These are not obscure sites. They are businesses actively trying to be found online.
The barriers fall into three categories: CDN-level blocks, robots.txt restrictions, and JavaScript-dependent content.
CDN-Level Blocks
Content delivery networks like Cloudflare and Akamai protect websites from bots. That protection sometimes catches AI crawlers in the crossfire. If your CDN is configured to challenge or block non-browser requests, AI platforms cannot access your content.
The fix: check your CDN settings for bot management rules. Explicitly allow known AI crawlers: GPTBot (OpenAI), Google-Extended (Google/Gemini), PerplexityBot, ClaudeBot (Anthropic). These are documented user agents with legitimate purposes.
Robots.txt Restrictions
Your robots.txt file tells crawlers what they can and cannot access. Many websites block AI crawlers either intentionally (to prevent training data usage) or unintentionally (overly broad disallow rules).
Check your robots.txt right now: go to yourdomain.com/robots.txt and look for rules targeting GPTBot, Google-Extended, or general disallow rules that would catch AI agents.
The decision to block AI crawlers is legitimate. Some businesses do not want their content used for training. But understand the tradeoff: if you block GPTBot, ChatGPT's web search cannot retrieve your pages when generating recommendations. You are invisible not because your content is bad, but because you told the crawler to stay away.
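The same check can be scripted with Python's standard-library robots.txt parser. The robots.txt content below is a hypothetical example, not any real site's file; swap in your own and your real URLs.

```python
# Ask whether known AI crawlers may fetch a given URL under a robots.txt.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot blocked entirely, everyone else
# blocked only from /admin/.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

def crawler_allowed(robots_content: str, user_agent: str, url: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_content.splitlines())
    return parser.can_fetch(user_agent, url)

for bot in ["GPTBot", "PerplexityBot", "Google-Extended"]:
    print(bot, crawler_allowed(robots_txt, bot, "https://example.com/pricing"))
```

With this file, GPTBot is refused everywhere while the other crawlers can still reach /pricing, which is exactly the kind of asymmetry worth catching before you wonder why one platform never cites you.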
JavaScript-Dependent Content
This is the most common and least understood barrier. Modern web frameworks (React, Angular, Vue) often render content client-side. The server sends a minimal HTML shell, and JavaScript fills in the content after the page loads in a browser.
AI crawlers do not execute JavaScript. They see the shell. If your business name, services, prices, and descriptions are all rendered by JavaScript, the AI crawler sees none of it.
How to test: use curl to fetch your homepage and compare what you see to what loads in a browser. Or use Google's Rich Results Test, which shows the rendered vs. raw HTML. If your key content is missing from the raw HTML, AI crawlers cannot see it.
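The comparison can also be expressed as a few lines of code. The two HTML strings below are hypothetical stand-ins: one is the near-empty shell a client-rendered app serves, the other a server-rendered page. The check is simply whether your key phrases appear in the raw HTML.

```python
# Does the key business content appear in the raw HTML, before any JavaScript runs?
# Hypothetical client-rendered shell: the crawler sees an empty div.
js_shell = '<html><head><title>Acme Hotel</title></head><body><div id="root"></div><script src="/app.js"></script></body></html>'
# Hypothetical server-rendered page: the facts are in the initial response.
server_rendered = '<html><body><h1>Acme Hotel</h1><p>Check-in from 14:00. Rooms from $120.</p></body></html>'

def content_visible_to_crawler(raw_html: str, key_phrases: list) -> bool:
    # An AI crawler reads only the raw HTML; phrases absent from it are invisible.
    return all(phrase in raw_html for phrase in key_phrases)

key_phrases = ["Check-in from 14:00", "Rooms from $120"]
print(content_visible_to_crawler(js_shell, key_phrases))         # False: the shell hides everything
print(content_visible_to_crawler(server_rendered, key_phrases))  # True: server-rendered text is visible
```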
The Parsing Tax
Even when AI crawlers can access your site, there is a cost to complexity. Readable (formerly SonicLinker) coined the term "parsing tax": a typical business page might have 500+ lines of CSS, 200+ lines of JavaScript, but only 50 lines of actual meaningful content. AI agents must process everything to find those 50 lines. More noise means more extraction errors.
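One rough way to quantify your own parsing tax is to measure how many of a page's bytes are readable text versus markup, scripts, and styles. This sketch uses Python's standard-library HTML parser; the sample page is hypothetical, and real pages are far noisier.

```python
# Estimate the share of a page's bytes that is readable text.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self._skip = 0  # depth inside <script>/<style>, whose contents are not readable text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

def content_ratio(html: str) -> float:
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.text_parts)
    return len(text) / len(html)

# Hypothetical page: one sentence of content buried in style and script.
page = "<html><head><style>body{margin:0}</style></head><body><script>var x=1;</script><p>Check-in is at 2 PM.</p></body></html>"
print(round(content_ratio(page), 2))
```

Even in this tiny example most of the bytes are not content; on a production page built from a framework template the ratio is typically far worse.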
Page speed compounds this. AI crawlers operate under strict time budgets. Readable's analysis of crawl behavior found that a page with 5-second load time allows an AI to index about 6 pages in a session. A page with 1-second load time allows 30. Faster pages get more of your content into the AI's knowledge base.
The target: time to first byte (TTFB) under 200ms is excellent. Under 500ms is good. Over 2 seconds risks timeout, meaning the AI gives up and moves to the next source.
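The arithmetic behind those crawl figures is simple. The 30-second session budget below is an assumption chosen to reproduce the numbers above (6 pages at 5 seconds, 30 at 1 second); real crawler budgets are not published.

```python
# Back-of-the-envelope crawl-budget math under an assumed fixed session budget.
SESSION_BUDGET_SECONDS = 30  # assumption, not a published figure

def pages_per_session(load_time_seconds: float) -> int:
    # Whole pages the crawler can fetch before the budget runs out.
    return int(SESSION_BUDGET_SECONDS // load_time_seconds)

print(pages_per_session(5.0))  # about 6 pages indexed
print(pages_per_session(1.0))  # about 30 pages indexed
print(pages_per_session(0.5))  # halving load time again doubles coverage
```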
Readable also analyzed 2 million AI-agent requests across 100+ websites. None requested /llms.txt, the proposed standard for guiding AI understanding. AI agents just fetch standard pages: homepage, docs, pricing, blog. Do not waste time on special AI-only files. Make your real pages readable.
What AI Crawlers Want to See
Scrunch's analysis of what works for AI search optimization boils down to a simple principle: serve good HTML with lots of readable text, straight from the server.
Specifically:
- Clear heading structure. H1, H2, H3 tags that describe the content hierarchy. AI parsers use these to understand what each section is about.
- Server-rendered text. Key facts about your business in the initial HTML response. No JavaScript required to read them.
- Schema markup. JSON-LD in the page head that explicitly describes your business entity. LocalBusiness, Hotel, Restaurant, or whatever schema type matches. This is structured data that AI systems parse directly.
- Minimal navigation clutter. AI crawlers extract content from the entire page. If 60% of your page is navigation, footer links, and sidebar widgets, the signal-to-noise ratio drops.
- Text alternatives for media. Alt text on images. Transcripts for videos. AI crawlers cannot see images or play videos.
The Schema Markup Priority
If you do one thing from this article, add JSON-LD schema markup to your homepage. This is the highest-signal, lowest-effort change you can make for AI visibility.
Schema markup is a block of code in your page's <head> that explicitly tells machines: "This is a hotel. It is located here. Check-in is at this time. It has these amenities. Here is the phone number."
You are not hoping the AI will figure this out from your prose. You are telling it directly in a format designed for machine consumption.
Example for a hotel:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Hotel",
  "name": "Your Hotel Name",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "123 Main St",
    "addressLocality": "Your City",
    "addressRegion": "Your State",
    "postalCode": "12345"
  },
  "checkinTime": "14:00",
  "checkoutTime": "11:00",
  "amenityFeature": [
    {"@type": "LocationFeatureSpecification", "name": "Pool"},
    {"@type": "LocationFeatureSpecification", "name": "Restaurant"}
  ],
  "starRating": {"@type": "Rating", "ratingValue": "4"},
  "telephone": "+1-234-567-8900"
}
</script>
For a restaurant, use Restaurant. For a SaaS product, use SoftwareApplication. The schema types exist for nearly every business category.
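Before shipping schema markup, it is worth a quick sanity check that the JSON-LD actually parses and carries the basic fields. This is a minimal sketch: the snippet mirrors the hotel example above, and the required-field list is an illustrative minimum, not a spec.

```python
# Confirm JSON-LD parses and contains the baseline fields machines look for first.
import json

jsonld = """
{
  "@context": "https://schema.org",
  "@type": "Hotel",
  "name": "Your Hotel Name",
  "telephone": "+1-234-567-8900",
  "checkinTime": "14:00"
}
"""

def check_jsonld(raw: str, required: tuple = ("@context", "@type", "name")) -> list:
    data = json.loads(raw)  # raises ValueError if the markup is not valid JSON
    return [field for field in required if field not in data]

missing = check_jsonld(jsonld)
print("missing fields:", missing or "none")
```

This catches the two most common failures (a stray comma that breaks the JSON, and a forgotten @context) before you reach for Google's Rich Results Test for full validation.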
The Audit Checklist
- Check your robots.txt. Are you blocking GPTBot, Google-Extended, or PerplexityBot?
- Check your CDN settings. Is bot protection catching AI crawlers?
- Curl your homepage. Is your key business content in the raw HTML?
- Check for schema markup. Is there JSON-LD in your page head?
- Test with Google's Rich Results Test. Does your structured data validate?
Most of these checks take less than ten minutes. The fixes may take longer, depending on your tech stack. But knowing the problem is the first step.