Where Do AI Chatbots Get Their Data? (2026 Guide)

The short answer

Nearly every major AI model is built on the same foundation: Common Crawl, a nonprofit archive of 100+ billion web pages. On top of that, each platform layers its own mix of books, code, licensed data, and (for tools like Perplexity, Gemini, and Copilot) live web search. The practical takeaway for businesses: if your site is fast, structured, and authoritative, it feeds these systems. If it is thin or blocked, you are invisible to them.

What Training Data Is and Why It Matters

A large language model learns by processing enormous text datasets during a phase called pre-training. It reads billions of documents, learns the statistical patterns between words, and builds an internal map of how language works. After that, the team fine-tunes the model on curated examples to make it more helpful, accurate, and safe.
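To make "statistical patterns between words" concrete, here is a toy sketch. It counts which word follows which in a tiny made-up corpus and predicts the most frequent follower. Real models use neural networks over subword tokens and billions of documents; the corpus, the function names, and the bigram-counting approach are illustrative assumptions, not how any production model is built.

```python
from collections import Counter, defaultdict

def train_bigrams(documents):
    """Count, for each word, how often each other word follows it."""
    followers = defaultdict(Counter)
    for doc in documents:
        words = doc.lower().split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequently seen follower, or None if the word was never read."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical three-document "training set".
corpus = [
    "the plumber fixed the leak",
    "the plumber fixed the sink",
    "the plumber fixed the leak quickly",
]
model = train_bigrams(corpus)
print(predict_next(model, "plumber"))      # fixed
print(predict_next(model, "electrician"))  # None
```

The second call returns nothing because "electrician" never appeared in the corpus, which is the same limitation described in the next paragraph at a much smaller scale.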

The critical implication: the model cannot know what it never read. It cannot be more reliable than its sources allow, and it inherits the biases, gaps, and errors in those sources. Training data is not a technical footnote. It is the foundation of everything the model says, including what it says about your company.

The Dataset Behind Almost Every AI You Have Used

Common Crawl is a nonprofit that has crawled and archived the public web since 2008. As of its August 2025 archive, it holds over 100 billion web pages, adding roughly 2.44 billion pages (424 TiB) every month across 47.5 million hosts. ChatGPT, Claude, Gemini, LLaMA, and most other major models all draw from it.

100B+: Web pages in the Common Crawl archive that feed nearly every LLM. (Common Crawl, Aug 2025)

13T: Tokens of text and code in GPT-4's training dataset. (OpenAI / public reporting)

800M: Weekly ChatGPT users now asking AI instead of searching. (OpenAI, Oct 2025)

What is inside Common Crawl? Personal blogs, forum threads, news articles, product pages, Reddit discussions, academic papers, Wikipedia. Basically the whole public internet, filtered and deduplicated, going back nearly 20 years. If your website has been publicly indexed, it has almost certainly been crawled, and it may already shape how AI describes your industry.
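"Filtered and deduplicated" can be sketched in a few lines. The example below drops very short pages and exact duplicates after light normalization. The word threshold, the URLs, and the function names are assumptions chosen for illustration; real pipelines typically add language detection, quality classifiers, and near-duplicate detection that this sketch does not attempt.

```python
import hashlib

MIN_WORDS = 5  # assumed threshold for "thin" pages, chosen only for illustration

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def clean_corpus(pages):
    """Drop thin pages and exact duplicates; keep the URL of the first copy of each text."""
    seen = set()
    kept = []
    for url, text in pages:
        norm = normalize(text)
        if len(norm.split()) < MIN_WORDS:
            continue  # too thin to keep
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same text as a page already kept
        seen.add(digest)
        kept.append(url)
    return kept

# Hypothetical crawl results: one real page, one tracking-URL copy, one thin page.
pages = [
    ("https://example.com/guide", "How to winterize a sprinkler system in five steps"),
    ("https://example.com/guide?ref=1", "How to  winterize a sprinkler system in FIVE steps"),
    ("https://example.com/tag/misc", "Click here"),
]
print(clean_corpus(pages))  # ['https://example.com/guide']
```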

Where Each Platform Gets Its Data

The raw ingredients overlap, but the recipe differs. Here is the data diet, real-time capability, and knowledge cutoff for each major platform.

ChatGPT · OpenAI

Web crawls weighted toward Reddit

GPT-4 was reportedly trained on roughly 13 trillion tokens: mostly Common Crawl web text, plus WebText2 (pages linked from upvoted Reddit posts, weighted about 5x heavier than raw web), books, and code. That Reddit weighting skews the model toward English, US-based, tech-literate perspectives, a bias baked in at the data layer. Base ChatGPT does not browse; the Search feature adds live results via Bing.

Cutoff: Oct 2023 (GPT-4o) / Aug 2025 (GPT-5) · Real-time: Search add-on (Bing)

Claude · Anthropic

Web and books, plus a written constitution

Anthropic is the most opaque about its exact corpus, but the differentiator is the alignment layer, not the raw data. Claude is trained with Constitutional AI: a written set of principles (drawn from the UN Universal Declaration of Human Rights, Apple's terms of service, and Anthropic's own testing) that the model uses to generate synthetic training examples. The January 2026 revision expanded this to an 80-page framework, open-sourced under Creative Commons.

Cutoff: Jan 2026 (Opus 4.8) · Real-time: No (tool use only)

Gemini · Google

Common Crawl, The Pile, YouTube, and Search

Gemini trains on Common Crawl plus datasets like The Pile (825GB across 22 sources, including 196,640 books), YouTube transcripts, and Google's own search index. A 2024 legal filing revealed Google removed 80 billion of 160 billion tokens after publisher opt-outs. No competitor owns both the world's largest search index and a frontier model, which is why Gemini's live Search integration is so strong.

Cutoff: Jan 2025 (2.5 Pro) · Real-time: Yes (native Search)

Perplexity

Live Bing index, cited every time

Perplexity is not a traditional model. It is an answer engine built on Retrieval-Augmented Generation: it reads your intent, fires a live search (primarily the Bing index), summarizes the retrieved pages, and cites every claim with a numbered source. It does not answer from memory. It answers from the live web, which is why its information is current to the hour.
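The retrieve-then-cite loop described above can be sketched as follows. This is a minimal illustration, not Perplexity's implementation: the in-memory `INDEX` stands in for a live search index, retrieval is plain keyword overlap, and the "summary" step simply quotes the retrieved text instead of calling a language model. All names, URLs, and page texts are hypothetical.

```python
import re

# Hypothetical stand-in for a live search index.
INDEX = [
    {"url": "https://example.com/hours", "text": "The shop is open Monday to Friday from 9 to 5."},
    {"url": "https://example.com/pricing", "text": "Drain cleaning starts at 120 dollars per visit."},
    {"url": "https://example.com/about", "text": "The company was founded in 2010 in St. George."},
]

# Small assumed stopword list so common words do not count as matches.
STOPWORDS = {"a", "the", "is", "to", "in", "at", "when", "from", "was"}

def tokenize(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def retrieve(query, index, k=2):
    """Rank pages by how many query words they share; keep the top k with any overlap."""
    q = tokenize(query)
    scored = [(len(q & tokenize(page["text"])), page) for page in index]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page for _, page in scored[:k]]

def answer(query, index):
    """Build an answer only from retrieved text and number each source."""
    pages = retrieve(query, index)
    if not pages:
        return "No sources found."
    body = " ".join(f"{p['text']} [{i}]" for i, p in enumerate(pages, start=1))
    sources = "\n".join(f"[{i}] {p['url']}" for i, p in enumerate(pages, start=1))
    return f"{body}\n{sources}"

print(answer("When is the shop open?", INDEX))
```

The point of the design is that the answer is assembled from what retrieval returns at query time, so updating a page in the index changes the answer immediately, with no retraining.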

Cutoff: None (real-time) · Real-time: Always live

Microsoft Copilot

OpenAI models on top of the Bing index

Copilot runs on OpenAI's models (Microsoft held roughly a 27% stake in OpenAI after its October 2025 restructuring) plus Bing's live web index. The same Bing crawl feeds Copilot, ChatGPT's Search feature, DuckDuckGo, and Ecosia. One note for businesses evaluating it internally: content processed by Microsoft 365 Copilot is not used for model training.

Cutoff: Model-dependent · Real-time: Yes (Bing index)

Meta LLaMA

The open-source baseline

LLaMA is not a consumer product, but it powers hundreds of downstream tools, and its training data (reproduced as the open RedPajama dataset) is the most transparent in the industry: Common Crawl (878B tokens), C4 (175B), GitHub (59B), arXiv (28B), books (26B), Wikipedia (24B), and StackExchange (20B).
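A quick calculation shows how lopsided that mix is. The sketch below uses only the token counts listed above and converts them to shares of the total; the variable names are illustrative.

```python
# Token counts in billions, as listed above for RedPajama.
tokens_b = {
    "Common Crawl": 878,
    "C4": 175,
    "GitHub": 59,
    "arXiv": 28,
    "Books": 26,
    "Wikipedia": 24,
    "StackExchange": 20,
}

total = sum(tokens_b.values())
print(f"Total: {total}B tokens")  # Total: 1210B tokens
for source, count in tokens_b.items():
    print(f"{source:<14}{count / total:6.1%}")

# C4 is itself a cleaned derivative of Common Crawl, so web-crawl text is the sum of both.
web_share = (tokens_b["Common Crawl"] + tokens_b["C4"]) / total
print(f"Web crawl (Common Crawl + C4): {web_share:.1%}")  # 87.0%
```

On these figures, web-crawl text makes up about 87% of the roughly 1.2 trillion tokens, while Wikipedia and books together contribute about 4%.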

Cutoff: Varies by version · Real-time: No

Side-by-Side Comparison

Platform	Primary source	Real-time?	Cutoff
ChatGPT (GPT-4o)	Common Crawl, WebText2, Books	Add-on (Bing)	Oct 2023
Claude	Web, Books, Constitutional AI	No	Jan 2026
Gemini	Common Crawl, The Pile, YouTube	Yes	Jan 2025 / live
Perplexity	Bing index (RAG, live)	Always	Real-time
Microsoft Copilot	OpenAI models + Bing	Yes	Model-dependent
Meta LLaMA	Common Crawl, C4, Wikipedia	No	Varies

Which Domains Each AI Engine Actually Cites

The labs do not disclose their training-data mix. But we can measure the next best thing: which domains each engine actually pulls into its answers. Profound analyzed 680 million real citations (August 2024 to June 2025). Below is the share each domain holds within each engine's top-10 most-cited sources, plus a March 2026 ranking from Peec AI (30 million sources).

Figure: Top cited domains by AI engine. Share of each engine's top-10 cited sources. Source: Profound 2025, Peec AI 2026.

ChatGPT: top 10 cited domains

Domain	Share of top-10 citations
Wikipedia	47.9%
Reddit	11.3%
Forbes	6.8%
G2	6.7%
TechRadar	5.5%
NerdWallet	5.1%
Business Insider	4.9%
NY Post	4.4%
Toxigon	4.1%
Reuters	3.4%

Google AI Overviews: top 10 cited domains

Domain	Share of top-10 citations
Reddit	21.0%
YouTube	18.8%
Quora	14.3%
LinkedIn	13.0%
Gartner	7.1%
NerdWallet	5.9%
Forbes	5.7%
Wikipedia	5.7%
Business Insider	4.5%
Medium	3.9%

Perplexity: top 10 cited domains

Domain	Share of top-10 citations
Reddit	46.7%
YouTube	13.9%
Gartner	7.0%
Yelp	5.8%
LinkedIn	5.3%
Forbes	5.0%
NerdWallet	4.5%
TripAdvisor	4.1%
G2	4.0%
PCMag	3.7%

Gemini and Google AI Mode (March 2026, Peec AI, 30 million sources): the per-domain share is not published, but the ranked order is. Gemini cites, in order: Reddit, YouTube, Wikipedia, Medium, Forbes. Google AI Mode: YouTube, Reddit, Facebook, LinkedIn, Yelp. Across all five engines in 2026, Reddit and YouTube appear in every single one.

Read this right

These percentages are where each engine pulls its answers from (its citations), not its training-data mix. The exact training composition of GPT-5, Claude, and Gemini is undisclosed. Shares are measured within each engine's top-10 cited domains and shift over time, so treat them as direction, not gospel.
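One way to read the three Profound tables together is to ask which domains show up in every engine's top 10. The sketch below does that with plain set operations on the domain names from the tables above; it covers only the Profound data for ChatGPT, Google AI Overviews, and Perplexity, and the variable names are illustrative.

```python
# Top-10 cited domains per engine, copied from the three tables above.
chatgpt = {"Wikipedia", "Reddit", "Forbes", "G2", "TechRadar", "NerdWallet",
           "Business Insider", "NY Post", "Toxigon", "Reuters"}
ai_overviews = {"Reddit", "YouTube", "Quora", "LinkedIn", "Gartner", "NerdWallet",
                "Forbes", "Wikipedia", "Business Insider", "Medium"}
perplexity = {"Reddit", "YouTube", "Gartner", "Yelp", "LinkedIn", "Forbes",
              "NerdWallet", "TripAdvisor", "G2", "PCMag"}

# Domains present in all three top-10 lists.
in_all_three = chatgpt & ai_overviews & perplexity
print(sorted(in_all_three))  # ['Forbes', 'NerdWallet', 'Reddit']

# Domains that only ChatGPT's top 10 contains.
only_chatgpt = chatgpt - ai_overviews - perplexity
print(sorted(only_chatgpt))  # ['NY Post', 'Reuters', 'TechRadar', 'Toxigon']
```

Only three domains appear in all three lists, which supports reading the shares as engine-specific direction and not as a single ranking.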

What This Means for Your Business

The tools shaping how your customers find answers were built on the same web crawls and forums that shaped the pre-AI internet. That changes the game in five concrete ways, and the data backs each one.

Figure: Get cited or get invisible. The new rules of AI search. Source: Timpson Marketing, 2026.

Your website quality is now an AI signal. Thin content and slow pages get filtered out during data cleaning, long before a human reads the answer.
Real-time tools reward freshness. Perplexity, Gemini, and Copilot pull live data. A competitor's newer page can outrank you in AI results even if you rank higher in classic search.
Citations are the new backlinks. When Google shows an AI Overview, clicks to the #1 result fall 58% (Ahrefs, Dec 2025). Being the cited source is the win now, not ranking alone.
Knowledge cutoffs create openings. If a model's training predates a change in your industry, it is handing users stale information. Publish the updated resource and become the source live tools pull.
Opt-out is not neutral. Google removed 80 billion tokens from opt-out publishers. Blocking AI crawlers in robots.txt can quietly shrink your AI visibility. Revisit that choice deliberately.
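To revisit the robots.txt choice deliberately, it helps to see which AI-related crawlers a given file actually blocks. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt. The file contents and URLs are hypothetical, and the user-agent tokens listed are ones their operators have published; confirm the current names in each operator's documentation before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks two AI-related crawlers site-wide.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

# Published user-agent tokens (assumed current; verify before use).
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot", "PerplexityBot"]

def blocked_crawlers(robots_txt, url, agents=AI_CRAWLERS):
    """Return the agents that this robots.txt forbids from fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]

print(blocked_crawlers(ROBOTS_TXT, "https://example.com/services/"))
# ['GPTBot', 'CCBot']
```

To audit a real site, paste its robots.txt into `ROBOTS_TXT` and test the URLs that matter most, such as service pages and cornerstone guides.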

The businesses that win in AI search are the ones whose content is so thorough, well-sourced, and clearly structured that crawlers, retrieval systems, and training pipelines all treat it as a reliable source. That is what good SEO always was. The audience just got smarter.

Frequently Asked Questions

Does ChatGPT learn from my conversations?

Not in real time. The model does not update itself as you chat. However, OpenAI may use conversations from consumer ChatGPT accounts to train future models unless you opt out in the data controls settings. Business and enterprise accounts are excluded from training by default.

Can I get my content removed from an AI's training data?

It depends on the model. Some datasets like The Pile have formal takedown processes. Google respects robots.txt for Googlebot but does not necessarily honor all opt-out signals for Gemini training data. OpenAI offers a form for training data opt-out requests, though efficacy is contested.

Which AI tool has the most up-to-date information?

Perplexity is the most current for real-time factual queries. It retrieves live web results for every answer and cites them. Gemini 2.5 with AI Mode and Copilot with Bing are close seconds. Base ChatGPT without Search is frozen at its training cutoff.

Why do AI tools sometimes get facts wrong about my business?

Three common causes: the training data about your business was inaccurate or outdated, the AI conflated your business with a similarly named competitor, or the model hallucinated, generating plausible-sounding but false information. Structured data, consistent NAP (name, address, phone) across directories, and authoritative content all reduce the error rate.
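As a hedged illustration of structured data with consistent business details, the sketch below builds schema.org `LocalBusiness` markup from a single canonical record, so the same name, address, and phone number can be reused across the site and directory listings. The business details and the helper name are hypothetical; the property names come from the schema.org vocabulary.

```python
import json

# Hypothetical business details; one canonical record reused everywhere.
NAP = {
    "name": "Example Plumbing Co.",
    "street": "123 Main St",
    "city": "St. George",
    "region": "UT",
    "postal_code": "84770",
    "phone": "+1-435-555-0100",
}

def local_business_jsonld(nap, url):
    """Build a schema.org LocalBusiness object from the canonical record."""
    return {
        "@context": "https://schema.org",
        "@type": "LocalBusiness",
        "name": nap["name"],
        "url": url,
        "telephone": nap["phone"],
        "address": {
            "@type": "PostalAddress",
            "streetAddress": nap["street"],
            "addressLocality": nap["city"],
            "addressRegion": nap["region"],
            "postalCode": nap["postal_code"],
        },
    }

markup = json.dumps(local_business_jsonld(NAP, "https://example.com/"), indent=2)
# Wrap the JSON in the script tag that pages use to embed JSON-LD.
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

Generating the markup from one record, instead of typing it by hand on each page, is one way to keep the details from drifting apart over time.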

How often does AI training data get updated?

Full retraining is expensive and infrequent, typically once every several months to a year for a model like GPT or Claude. Real-time retrieval layers like Bing and Google Search update continuously. The gap between those two layers is where most AI errors about current events occur.

Does an AI read my website in real time?

Only if the tool uses RAG or live search like Perplexity, Gemini AI Mode, or Copilot. Base models like GPT-4o do not read your website at query time. They work from what was captured in training data months or years ago.

Sources

Common Crawl, August 2025 Crawl Archive. commoncrawl.org
Pew Research Center, AI summaries and click-through (2025). pewresearch.org
Ahrefs, AI Overviews reduce clicks (Apr 2025 and Dec 2025 update). ahrefs.com
Nieman Lab, Google and publisher opt-outs (2025). niemanlab.org
Anthropic, Constitutional AI. anthropic.com
Together AI, RedPajama dataset. together.ai
Otterly AI, LLM knowledge cutoff dates (2026). otterly.ai

Author

Joseph Timpson

Founder of Timpson Marketing, a boutique SEO and AI search agency based in St. George, Utah. 15 years of search optimization experience, helping local and regional businesses win visibility in Google and the AI tools now rewriting how people search. Among the earliest practitioners building GEO and LLM-optimization workflows before the discipline had an established name.