You publish an article. It gets indexed. It ranks on Google. Then one day you wonder: did an AI model just... learn from it? Is your content sitting inside ChatGPT's training data, helping it answer questions, without you knowing?
That question led Spawning AI to build Have I Been Trained, a tool that lets you check if your work ended up in AI training datasets. It's a useful tool. But it solves a very specific problem, and it's not the problem most brands think they have.
What Have I Been Trained actually does
Have I Been Trained searches the LAION-5B dataset, the massive open-source image collection used to train models like Stable Diffusion. You upload an image (or search by text), and the tool shows you matching or similar images from the dataset.
If your photos, illustrations, or artwork appear in the results, it means your visual content was included in the training data that powered a generation of image AI models.
The tool was built by Spawning AI, a company focused on creator rights in the age of generative AI. Their thesis is simple: creators should know when their work is used, and they should be able to say no.
Spawning's opt-out ecosystem
Beyond the search tool, Spawning has built a broader set of mechanisms for controlling how AI uses your content:
ai.txt works like robots.txt, but specifically for AI crawlers. You place it on your website to signal which content AI companies can and cannot use for training. Several AI companies have agreed to respect it.
The Do Not Train registry lets creators register their work and declare it off-limits for AI training. Think of it as a global opt-out list.
Platform integrations bring these controls directly into creative platforms. DeviantArt, for example, integrated Spawning's technology so artists can flag their work as not available for AI training.
These tools address a real concern. If you're a photographer, illustrator, or content creator, knowing whether your work was scraped into training data, and having mechanisms to prevent it, matters.
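Spawning's ai.txt uses robots.txt-style directives, but scoped to AI training rather than crawling in general. The fragment below is an illustrative sketch only; the exact directives and wildcards are defined by Spawning's ai.txt generator, so generate the real file there rather than hand-writing one from this example:

```
# ai.txt — illustrative sketch, not the canonical spec.
# Intended effect: opt all content on this site out of AI training.
User-Agent: *
Disallow: *
```

The file is served from the site root (e.g. example.com/ai.txt), next to robots.txt.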
The gap most brands don't see
Here's where it gets interesting. Most businesses searching for "have I been trained" or "is my content in AI" aren't actually worried about training data rights. They're worried about something else entirely: whether AI platforms recommend them.
These are fundamentally different questions.
Training data presence means your content was ingested when the model was built. It's historical. It happened (or didn't) months or years ago. For text-based models like GPT-4 or Claude, there's no public tool equivalent to Have I Been Trained: those training corpora are undisclosed, so there is nothing for you to search.
AI visibility means whether ChatGPT, Perplexity, Gemini, Copilot, or Google AI Mode actively mentions your brand when someone asks a relevant question right now. This is dynamic. It changes as models update, as your online presence evolves, and as competitors shift.
A brand can exist in training data but never get recommended. And a brand can get recommended based entirely on signals the model retrieves from the live web at query time (which is how Perplexity and Google AI Mode work by design).
Why visibility tracking matters more for most brands
If you run a SaaS, a consultancy, an agency, or any business where customers discover you through search, the question that actually affects your revenue is not "was my website in the training data?" It's "when someone asks ChatGPT for a tool like mine, do they hear about me?"
That's the question AI visibility tracking answers. Tools like Mentionable monitor what AI platforms say about your brand across real prompts that your potential customers use. Not once, but daily, across five major AI platforms. You see trends, spot when competitors gain mentions, and catch it when you disappear from a conversation you used to own.
This is closer to how you already think about SEO rankings. You don't just care whether Google crawled your site. You care whether you rank for the queries that matter. Same logic applies to AI.
Both layers matter, for different reasons
For creators and rights holders, Have I Been Trained and Spawning's opt-out tools address legitimate intellectual property concerns. If your images were used to train Stable Diffusion without consent, you deserve to know that and have recourse.
For brands focused on growth, AI visibility tracking addresses the business impact question. When 30% of product research starts with an AI chatbot (and that number keeps climbing), knowing whether you show up in those conversations is no longer optional.
The two layers aren't in conflict. They're complementary. One governs what goes into the models. The other monitors what comes out.
What to do next
If you're a creator concerned about training data: start with Have I Been Trained for images, and review Spawning's opt-out mechanisms. For text content, use robots.txt to disallow the AI crawlers you want to exclude (GPTBot, Google-Extended, CCBot, anthropic-ai). Keep in mind robots.txt is honored voluntarily; it signals your preference rather than technically blocking access.
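For the robots.txt route, the four user-agent strings above are real, documented crawler tokens. A minimal robots.txt that disallows all of them from the entire site looks like this (crawler names do change over time, so verify each vendor's current documentation):

```
# Disallow common AI training crawlers site-wide.
# Compliance is voluntary: well-behaved crawlers honor these rules.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /
```

Note that Google-Extended is a training opt-out token, not a crawler itself; listing it tells Google not to use your content for AI training without affecting normal search indexing.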
If you're a brand concerned about AI recommendations: start by checking what AI actually says about you. Run a free visibility check on your domain, or manually test 10-15 prompts your customers might ask across ChatGPT, Perplexity, and Gemini.
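The manual prompt test above can be semi-automated for a single platform. The sketch below assumes the official OpenAI Python SDK and an OPENAI_API_KEY in the environment; the prompts, the "Acme" brand name, and the mention check are illustrative placeholders, not how Mentionable or any other product works:

```python
# Hedged sketch: ask one AI platform a batch of customer-style prompts
# and check whether each answer mentions your brand.
import re

def brand_mentioned(answer: str, brand: str) -> bool:
    """Case-insensitive whole-word match, so 'Acme' won't match 'acmeify'."""
    return re.search(rf"\b{re.escape(brand)}\b", answer, re.IGNORECASE) is not None

def check_visibility(prompts, brand, model="gpt-4o"):
    """Return {prompt: True/False} for whether the model's answer names the brand.

    Requires `pip install openai` and OPENAI_API_KEY set in the environment.
    """
    from openai import OpenAI  # imported here so the helper above stays standalone
    client = OpenAI()
    results = {}
    for prompt in prompts:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        results[prompt] = brand_mentioned(resp.choices[0].message.content, brand)
    return results
```

Run the same prompt list on a schedule and diff the results, and you have a crude version of the trend tracking described above; the same loop structure ports to other providers' chat APIs.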
Either way, the worst position is not knowing. The AI landscape moves fast, and the brands that track their presence, whether in training data or in live recommendations, are the ones that can actually do something about it.
