The AI Visibility Checklist: Everything a Site Needs to Be Crawled, Read, and Cited by AI

Search is turning into answer. Here is the complete, platform-agnostic checklist for making any website discoverable, parseable, and quotable by AI assistants and answer engines, with the exact files and schema that do the work.

Search is quietly becoming answer. A growing share of people never reach a page of blue links. They ask an assistant, read the synthesized reply, and act on it. If a model cannot fetch your site, cannot parse it, or cannot decide what it means, you are not in the running, no matter how good the page looks to a human.

The work to fix that is not mysterious, and it is not tied to any platform or framework. It is a short, concrete checklist. I just took a brand-new site from invisible to cited by working through every item below, so this is the field version, with the exact files and schema, not a theory post.

The order matters. The first item is the one almost everyone gets wrong, and it can quietly cancel out all the others.

1. Let the crawlers in

You can do everything else on this list perfectly and still be invisible if the front door is locked. Two layers control that door, and they disagree more often than you would think.

The first layer is robots.txt. Most sites that want AI visibility should allow everything and name the AI crawlers explicitly so the intent is unambiguous. Allowing them by default is fine; naming them is clearer.

Figure: a robots.txt file allowing all user agents, naming GPTBot, ClaudeBot, PerplexityBot and Google-Extended, and pointing to the sitemap.
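A minimal sketch of that file, generated and then checked with Python's standard-library robots.txt parser. The domain and sitemap path are placeholders, and the agent list is only the four names mentioned here, so extend it to whatever you actually want to welcome.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the text above; the domain is a placeholder.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]
SITEMAP_URL = "https://yourdomain.com/sitemap.xml"


def build_robots_txt(agents=AI_AGENTS, sitemap_url=SITEMAP_URL):
    """Return robots.txt text that allows everyone and names each AI agent."""
    lines = ["User-agent: *", "Allow: /", ""]
    for agent in agents:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


def allowed_agents(robots_txt, url="https://yourdomain.com/"):
    """Parse the text the way a rule-following crawler would and report access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_AGENTS}


robots_txt = build_robots_txt()
print(robots_txt)
print(allowed_agents(robots_txt))
```

This only proves the file says what you meant. It says nothing about what your edge does with the request, which is the next layer.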

The second layer is the one that bites. Your CDN or host sits in front of your origin, and in the last couple of years most of them shipped a one-click feature that blocks “AI scrapers and crawlers.” On many platforms it is on by default for new sites. It works at the edge, before your robots.txt is ever consulted, and it returns a 403 or 429 to the exact user-agents you are trying to court. Your robots.txt says “come in.” The edge says “no.” The edge wins.

Figure: a before-and-after diagram. With the managed block rule on, GPTBot, ClaudeBot and PerplexityBot get 403 and 429 responses; with it set to allow crawlers, all return 200 OK. The caption notes robots.txt said allow the whole time and the edge overrode it.

This is the single highest-leverage fix on the page. Check your CDN or host security settings for anything named “block AI bots,” “AI scrapers and crawlers,” or a “managed robots.txt” that rewrites yours, and turn the blocking off (or scope it precisely). Then verify it, because dashboards lie. Send a real request with a bot user-agent and confirm you get a 200 with real HTML, not a challenge page:

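curl -A "GPTBot" -I https://yourdomain.com/
# expect: HTTP/2 200   (not 403, 429, or a JS challenge)
curl -A "GPTBot" -s https://yourdomain.com/ | head -n 20
# expect: your real HTML (-I only fetches headers, so check the body too)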

Do the same for ClaudeBot, PerplexityBot, OAI-SearchBot, and CCBot. If any come back blocked, you found your problem.
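If you want to run that check for every bot at once, here is a rough sketch using only the standard library. The short user-agent tokens are a simplification (real crawlers send longer strings), and the challenge-page markers are hypothetical, so adjust both for your CDN.

```python
import urllib.error
import urllib.request

# Tokens named in the text above. Real crawlers send longer strings; the
# short token is an assumption that is usually enough to trigger edge rules.
BOT_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "CCBot"]

# Hypothetical markers of a challenge page; tune these for your own host.
CHALLENGE_MARKERS = ("just a moment", "captcha", "verify you are human")


def classify(status, body):
    """Turn a status code and body into 'ok', 'blocked', 'challenge' or 'unexpected'."""
    if status in (403, 429):
        return "blocked"
    text = body.lower()
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "challenge"
    if status == 200 and "<html" in text:
        return "ok"
    return "unexpected"


def fetch_as(agent, url, timeout=10):
    """Fetch url with the given User-Agent and return (status, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the status and body are still readable.
        return error.code, error.read().decode("utf-8", "replace")


def audit(url):
    """Return {agent: verdict} for every bot in BOT_AGENTS."""
    return {agent: classify(*fetch_as(agent, url)) for agent in BOT_AGENTS}
```

Call `audit("https://yourdomain.com/")` and look for anything that is not `ok`. Some edge rules also look at IP ranges, so a pass from your laptop is a good sign but not absolute proof.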

2. Publish an llms.txt

llms.txt is an emerging convention: a small Markdown file at the root of your site that hands an AI assistant a clean, factual summary of who you are plus a curated map of your most important pages. Think of it as a README for machines. It is optional, but it is cheap to write and it can improve how accurately a model describes and links to you.

Figure: an llms.txt file with an H1 site name, a blockquote summary written for an AI to quote, a Core pages section with linked descriptions, and a link to llms-full.txt.

Keep the summary factual and entity-rich. State who you are, your category, your differentiators, and how to reach you, in sentences a model can quote verbatim. Then link your key pages with one-line descriptions. If you publish long-form content, add a companion llms-full.txt that concatenates the full text of those pages, so a model can ingest everything in one fetch. Generate it from your content so it never goes stale.
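One way to keep the file from going stale is to generate it at build time. A small sketch, assuming your pages are available as (title, URL, description) tuples; the site name, summary and URLs below are placeholders, and the "Full text" section heading is my own choice, not something the convention requires.

```python
def build_llms_txt(site_name, summary, pages, full_text_url=None):
    """Build llms.txt: H1 name, blockquote summary, then a linked page list.

    pages is a list of (title, url, one_line_description) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Core pages", ""]
    for title, url, description in pages:
        lines.append(f"- [{title}]({url}): {description}")
    if full_text_url:
        lines += ["", "## Full text", ""]
        lines.append(f"- [llms-full.txt]({full_text_url}): all core pages in one file")
    return "\n".join(lines) + "\n"


# Placeholder content; swap in data pulled from your own CMS or content folder.
example = build_llms_txt(
    "Example Studio",
    "Example Studio is a placeholder design studio used for illustration.",
    [
        ("About", "https://yourdomain.com/about/", "Who we are and what we do"),
        ("Contact", "https://yourdomain.com/contact/", "How to reach us"),
    ],
    full_text_url="https://yourdomain.com/llms-full.txt",
)
print(example)
```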

3. Make your content survive crawling

Here is the assumption that breaks more sites than any other: that crawlers see what a browser sees. Most AI crawlers do not run JavaScript. Retrieval and training crawlers tend to fetch the raw HTML and move on. If your headline, your body copy, and your key facts are injected by client-side JavaScript after load, a human with a browser sees a rich page and the crawler sees an empty shell.

The fix is to server-render or statically generate your pages so the meaningful content is in the HTML on first response. Test it the way a crawler sees it: disable JavaScript in your browser, or curl the URL, and confirm the actual words are there.
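The same test can be scripted. This sketch strips script, style and template content from raw HTML and reports which of your key phrases are missing from what is left. The two sample pages are invented to show the difference between a client-rendered shell and a server-rendered page.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script>, <style> and <template>."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def missing_phrases(html, phrases):
    """Return the phrases that do not appear in the page's raw-HTML text."""
    text = visible_text(html).lower()
    return [phrase for phrase in phrases if phrase.lower() not in text]


# Invented samples: a client-rendered shell and a server-rendered page.
shell = '<html><body><div id="root"></div><script>render("Acme builds rockets")</script></body></html>'
rendered = "<html><body><main><h1>Acme builds rockets</h1></main></body></html>"

print(missing_phrases(shell, ["Acme builds rockets"]))
print(missing_phrases(rendered, ["Acme builds rockets"]))
```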

Once the content is in the HTML, make it easy to parse:

  • One <h1> per page, then a logical <h2> and <h3> outline with no skipped levels. A jump from <h1> straight to <h3> tells a parser the structure is broken.
  • Real semantic landmarks: header, nav, main, article, section, footer. They are how a machine finds the content and ignores the chrome.
  • Descriptive alt text on images, and text equivalents for anything important that lives inside a graphic.
  • Self-contained sentences for your key claims. Write facts that can be lifted out and quoted without the surrounding design. “We doubled a $2B business’s new-customer revenue in eleven months” is extractable. A number floating next to an icon is not.
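The heading rule in the first bullet is easy to lint. A minimal sketch that flags a page without exactly one h1, or with a jump of more than one level going down the outline:

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Record the level of every h1..h6 in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def outline_problems(html):
    """Return a list of human-readable problems; empty means the outline is clean."""
    parser = HeadingOutline()
    parser.feed(html)
    problems = []
    h1_count = parser.levels.count(1)
    if h1_count != 1:
        problems.append(f"expected one <h1>, found {h1_count}")
    for previous, current in zip(parser.levels, parser.levels[1:]):
        if current > previous + 1:
            problems.append(f"<h{previous}> is followed by <h{current}>")
    return problems


print(outline_problems("<h1>A</h1><h2>B</h2><h3>C</h3><h2>D</h2>"))
print(outline_problems("<h1>A</h1><h3>C</h3>"))
```

Moving back up the outline (an h3 followed by an h2) is fine and is not flagged; only skipped levels on the way down are.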

4. Add structured data

Semantic HTML tells a machine where your content is. Structured data tells it what your content means. JSON-LD is the format the major engines read, and it is the highest-leverage thing you can add after unblocking the crawlers.

Start with a site-wide identity graph: a Person or Organization, plus a WebSite, defined once and linked by @id so every page references the same canonical entity instead of redefining it. This is what lets an engine say “all of these pages are the same Jane Doe” with confidence.

Figure: a JSON-LD identity graph in a script tag, with a Person node carrying an @id, name and sameAs, and a WebSite node whose publisher references the Person by @id.
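A sketch of such a graph, built as a plain dictionary and serialized into a script tag. The domain, the `#person` and `#website` fragment identifiers, and the LinkedIn URL are all placeholders; any stable, unique @id scheme would do.

```python
import json

SITE = "https://yourdomain.com"  # placeholder domain
PERSON_ID = f"{SITE}/#person"    # hypothetical @id scheme
WEBSITE_ID = f"{SITE}/#website"

identity_graph = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "Person",
            "@id": PERSON_ID,
            "name": "Jane Doe",
            "url": f"{SITE}/",
            "sameAs": ["https://www.linkedin.com/in/janedoe-example"],
        },
        {
            "@type": "WebSite",
            "@id": WEBSITE_ID,
            "url": f"{SITE}/",
            "name": "Jane Doe",
            # Reference the Person by @id instead of redefining it.
            "publisher": {"@id": PERSON_ID},
        },
    ],
}


def as_script_tag(data):
    """Serialize JSON-LD for embedding; escape '</' so it cannot close the tag."""
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


print(as_script_tag(identity_graph))
```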

Then give every page its own node typed correctly: WebPage as the baseline, ProfilePage for an about page, ContactPage for contact, CollectionPage for an index, and Article or BlogPosting for posts. On articles, link the author and publisher back into the identity graph by @id, and always include an image, a datePublished, and a dateModified.

Figure: a BlogPosting JSON-LD node with headline, datePublished, author and publisher linked by @id, and an image, with a note that a BreadcrumbList, FAQPage and HowTo are added alongside it.
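A per-post node can come from a small helper so the required fields are never forgotten. This is a sketch under the same assumptions as before: a placeholder domain and a hypothetical `#person` identifier for the site-wide entity. The path, image URL and dates in the example are invented.

```python
import json
from datetime import date

SITE = "https://yourdomain.com"  # placeholder domain
PERSON_ID = f"{SITE}/#person"    # hypothetical @id of the site-wide entity


def blog_posting(path, headline, image_url, published, modified=None):
    """Return a BlogPosting node whose author and publisher point at the identity graph."""
    url = f"{SITE}{path}"
    return {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "@id": f"{url}#article",
        "mainEntityOfPage": url,
        "headline": headline,
        "image": image_url,
        "datePublished": published.isoformat(),
        # Fall back to the publish date until the post is actually revised.
        "dateModified": (modified or published).isoformat(),
        "author": {"@id": PERSON_ID},
        "publisher": {"@id": PERSON_ID},
    }


post = blog_posting(
    "/blog/ai-visibility-checklist/",
    "The AI Visibility Checklist",
    f"{SITE}/images/ai-visibility-og.png",
    published=date(2025, 1, 15),  # invented dates for illustration
    modified=date(2025, 2, 1),
)
print(json.dumps(post, indent=2))
```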

Two additions punch above their weight for discovery. A BreadcrumbList gives engines your hierarchy. And FAQPage (plus HowTo for step-by-step content like this checklist) maps your content into exactly the question-and-answer and step shapes that answer engines love to lift.

Figure: a FAQPage JSON-LD node with a Question and an acceptedAnswer, captioned as answer-engine bait.
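Both additions are mechanical enough to generate. A sketch with two helpers, one per type; the breadcrumb URLs are placeholders and the sample question is taken from the FAQ at the end of this article.

```python
import json


def breadcrumb_list(trail):
    """trail is an ordered list of (name, url) pairs from the home page down."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": position, "name": name, "item": url}
            for position, (name, url) in enumerate(trail, start=1)
        ],
    }


def faq_page(pairs):
    """pairs is a list of (question, answer) strings that are visible on the page."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


crumbs = breadcrumb_list([
    ("Home", "https://yourdomain.com/"),
    ("Blog", "https://yourdomain.com/blog/"),
])
faq = faq_page([("Do AI crawlers run JavaScript?", "Assume they do not.")])
print(json.dumps([crumbs, faq], indent=2))
```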

A warning worth stating: the schema has to match what is visible on the page. Marking up questions a user cannot see, or claims you do not make, is the fastest way to get ignored or penalized. This very article carries Article, BreadcrumbList, HowTo, and FAQPage schema, and every one of them mirrors content you can actually read here.
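A cheap guard against that mismatch is to check, at build time, that every marked-up question also appears in the page's visible text. A rough sketch; it does a case-insensitive substring match, which is an approximation and not how any engine verifies markup.

```python
def unmatched_questions(faq_node, page_text):
    """Return FAQ question names that do not occur in the visible page text."""
    normalized_page = " ".join(page_text.split()).lower()
    missing = []
    for question in faq_node.get("mainEntity", []):
        name = " ".join(question["name"].split()).lower()
        if name not in normalized_page:
            missing.append(question["name"])
    return missing


# Invented example: one question is on the page, one exists only in markup.
faq_node = {
    "@type": "FAQPage",
    "mainEntity": [
        {"@type": "Question", "name": "Do AI crawlers run JavaScript?"},
        {"@type": "Question", "name": "Is this the best site ever?"},
    ],
}
page_text = "FAQ\n\nDo AI crawlers run   JavaScript? Assume they do not."
print(unmatched_questions(faq_node, page_text))
```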

5. Be quotable for answer engines

When an assistant decides whether to cite you, and how your link looks when it does, a handful of head tags do the work:

  • A unique, descriptive <title> and a unique meta description per page. Factual, not clickbait.
  • Open Graph and Twitter Card tags, including a 1200x630 image. These control how your URL renders when an AI or a human shares it, and a clean unfurl earns clicks.
  • A self-referential <link rel="canonical"> so there is no ambiguity about the real URL. Do not point a canonical at a URL that redirects somewhere else.
  • Freshness signals, both the visible date and the dateModified in your schema. Answer engines favor content that is demonstrably current.
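Most of the head-tag items above can be checked per page with a short script. This sketch covers the title, the meta description, a few Open Graph and Twitter Card tags, and a self-referential canonical. The list of required keys is my own minimal pick, not an official set, and uniqueness across pages is not checked here.

```python
from html.parser import HTMLParser


class HeadTags(HTMLParser):
    """Collect <title>, <meta name/property> and <link rel="canonical">."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key:
                self.meta[key.lower()] = attrs.get("content") or ""
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# A minimal, assumed set of keys; extend it to match your own standards.
REQUIRED_META = ["description", "og:title", "og:image", "twitter:card"]


def head_problems(html, page_url):
    """Return a list of missing or mismatched head tags for one page."""
    tags = HeadTags()
    tags.feed(html)
    problems = []
    if not tags.title.strip():
        problems.append("missing <title>")
    for key in REQUIRED_META:
        if not tags.meta.get(key):
            problems.append(f"missing meta {key}")
    if tags.canonical != page_url:
        problems.append(f"canonical is {tags.canonical!r}, expected {page_url!r}")
    return problems
```

Feed it the raw HTML from a bot-style fetch, not the DOM from your browser, so you test what a crawler receives.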

6. Give the map

A sitemap is how a crawler discovers everything it would otherwise have to guess at. Publish an XML sitemap, list every public page in it, reference it from robots.txt, and put a lastmod on each entry so engines know what changed and when.

Figure: a single sitemap URL entry showing loc, lastmod, changefreq and priority, with a note that freshness and priority sit on every entry and the sitemap is linked from robots.txt.
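Generating the sitemap from your page list keeps lastmod honest. A minimal sketch with the standard library; it emits only loc and lastmod (changefreq and priority are left out for brevity), and the URLs and dates are placeholders.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """entries is a list of (url, last_modified_date) pairs."""
    # Register the sitemap namespace as the default so tags are unprefixed.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, last_modified in entries:
        node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(node, f"{{{SITEMAP_NS}}}lastmod").text = last_modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


# Placeholder pages and dates; in practice read these from your build system.
sitemap_xml = build_sitemap([
    ("https://yourdomain.com/", date(2025, 1, 15)),
    ("https://yourdomain.com/about/", date(2025, 1, 10)),
])
print(sitemap_xml)
```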

If you migrated or renamed anything, make sure the old URLs 301 to their new homes rather than 404. Dead links and soft-404s that return a 200 on a missing page both erode how much of your site a crawler trusts and keeps.
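To audit a migration, request each old URL without following redirects and compare what comes back with where it should go. A sketch, assuming you keep a mapping of old to new URLs. Treating 308 as equivalent to 301 is my own choice, and a 200 at an old URL is only a hint of a soft 404, not proof.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head(url, timeout=10):
    """Send one HEAD request without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        connection = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        connection = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        connection.request("HEAD", target)
        response = connection.getresponse()
        return response.status, response.getheader("Location")
    finally:
        connection.close()


def redirect_verdict(old_url, status, location, expected_url):
    """Judge the response an old URL gave against the URL it should now point to."""
    # Location may be relative, so resolve it against the URL that was requested.
    target = urljoin(old_url, location) if location else None
    if status in (301, 308):
        return "ok" if target == expected_url else "wrong target"
    if status in (302, 303, 307):
        return "temporary redirect"
    if status in (404, 410):
        return "dead link"
    if status == 200:
        return "possible soft 404"
    return "unexpected"
```

Usage would look like `status, location = head(old)` followed by `redirect_verdict(old, status, location, new)` for each pair in your mapping.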

7. Establish the entity

Discovery gets you read. Authority gets you cited. AI answer engines, like search before them, weigh who is saying something, not just what is said.

  • Use one consistent name and identifier for the entity, on the page, in the schema, and in llms.txt. Inconsistency forces a model to guess, and it will guess wrong.
  • Link authoritative profiles with sameAs: LinkedIn, an org page, a Crunchbase or Wikipedia entry if you have one. This is how a model disambiguates you from everyone who shares your name.
  • Make authorship and expertise explicit, and attribute or source your claims. Self-contained, verifiable statements are the ones that get repeated.

8. Measure and re-check

You cannot improve what you do not test. After the changes are live:

  • Re-fetch with bot user-agents and confirm 200s with real HTML, per item 1.
  • Validate the schema against a structured-data validator and fix every error, then review the warnings as well.
  • Query the answer engines themselves. Ask an assistant about your brand and about the category question you want to own, and read the result critically. Are you cited? Is the description accurate and current, or is the model repeating stale or wrong facts? Misattribution is a fixable problem once you can see it.

Then put this on a cadence. The crawlers, the conventions, and the engines all move. A site that was perfectly visible six months ago can be quietly dropped by a default someone flipped at your CDN.

The ten-minute version

If you do nothing else, do these:

  1. Confirm your CDN or host is not blocking AI crawlers. Verify with curl -A "GPTBot".
  2. Allow AI agents in robots.txt and link your sitemap.
  3. Make sure your real content is in the HTML, not JavaScript-only.
  4. Add a linked JSON-LD identity graph plus per-page Article, BreadcrumbList, and FAQPage.
  5. Publish an llms.txt.
  6. Ship a sitemap with lastmod, fix your redirects, and add Open Graph tags.
  7. Re-fetch as a bot, validate the schema, and ask an AI what it thinks you do.

None of this is exotic. It is hygiene for a web where the most important reader of your site is increasingly not a person. Get the door open, make the content legible, label what it means, and prove who is behind it. The engines will do the rest.

FAQ

Do AI crawlers run JavaScript? Assume they do not. Many retrieval and training crawlers fetch raw HTML and never execute your scripts, so anything injected only by client-side JavaScript is invisible to them. Server-render or statically generate your content.

Is robots.txt enough to allow AI crawlers? No. robots.txt states your intent, but most CDNs and hosts now ship a one-click block-AI-bots feature that returns a 403 or 429 at the edge regardless of what robots.txt says. Always verify with a live request using a bot user-agent.

What is llms.txt and do I need it? It is an emerging convention, a Markdown file at /llms.txt, that gives AI assistants a clean factual summary and a curated map of your most important pages. It is optional but cheap, and it can improve how accurately models describe and cite you.

Does structured data help with AI search specifically? Yes. JSON-LD removes ambiguity about who you are and what each page is, which helps both traditional rich results and AI answer engines extract and attribute facts correctly. A linked identity graph plus per-page Article, FAQ, and Breadcrumb schema is the high-leverage set.

How do I know if it worked? Re-fetch key pages with AI bot user-agents and confirm a 200 with real HTML, validate your schema, and query AI answer engines for your brand and category to see whether you are cited and described accurately.