Site Crawl & Internal Linking Analysis

Crawl your entire site to detect content cannibalization, discover missing internal links, and map your topic clusters. Built for AI visibility, not just traditional SEO.

Key Takeaways

  • Mentionable crawls your entire website and extracts content, headings, schema markup, and all internal and external links for every page.
  • The cannibalization detector finds pages competing for the same topics, with severity ratings (high, medium, potential) so you can prioritize merges or rewrites.
  • Missing link suggestions show you exactly which pages should link to each other, with recommended anchor text, based on semantic analysis of your content.
  • Crawl limits scale by plan: 500 pages on Starter, and 1,000 pages on Growth, Pro, and Agency.
  • Mentionable checks your robots.txt for AI crawler rules (GPTBot for ChatGPT, Google-Extended for Gemini, ClaudeBot for Claude) and detects whether you have an llms.txt file.

You're tracking your AI visibility across 5 LLMs. You're writing content. But you have this nagging feeling: is your site actually structured in a way that helps LLMs understand what you're an expert at?

Maybe you have three blog posts that all target the same topic without realizing it. Maybe your best content lives in an orphan page with zero internal links pointing to it. Maybe you're accidentally blocking ChatGPT's crawler in your robots.txt. These are the kinds of problems you can't spot by looking at individual pages. You need to see the whole picture.

What the site crawl does

Mentionable crawls your entire website, starting from your homepage and following every internal link. For each page, it extracts:

  • Content and structure: title, meta description, H1, full heading hierarchy, word count, and the page content itself
  • Schema markup: every JSON-LD block found on the page, with type identification
  • Links: all internal links (where they go, what anchor text they use) and all external links (where you're sending traffic)
  • Technical signals: canonical tags, robots directives, hreflang tags, Open Graph data
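
To make the extraction step concrete, here is a minimal sketch using only Python's standard library. It is not Mentionable's implementation: the PageExtractor class, its attribute names, and the sample HTML are all hypothetical, and it covers only the title, headings, and links from the list above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageExtractor(HTMLParser):
    """Hypothetical helper: collects title, headings, and links from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.title = ""
        self.headings = []        # (level, text), e.g. (1, "Pricing")
        self.internal_links = []  # (absolute_url, anchor_text)
        self.external_links = []
        self._text_tag = None     # title or heading tag currently open
        self._text = []
        self._href = None         # absolute URL of the <a> currently open
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in HEADING_TAGS:
            self._text_tag, self._text = tag, []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._anchor = urljoin(self.page_url, href), []

    def handle_data(self, data):
        if self._text_tag:
            self._text.append(data)
        if self._href:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == self._text_tag:
            text = " ".join("".join(self._text).split())
            if tag == "title":
                self.title = text
            else:
                self.headings.append((int(tag[1]), text))
            self._text_tag = None
        elif tag == "a" and self._href:
            # A link is internal when it stays on the same host as the page.
            same_host = urlparse(self._href).netloc == urlparse(self.page_url).netloc
            bucket = self.internal_links if same_host else self.external_links
            bucket.append((self._href, " ".join("".join(self._anchor).split())))
            self._href = None


sample_html = """<html><head><title>Email Guide</title></head><body>
<h1>Email marketing guide</h1>
<h2>Deliverability</h2>
<a href="/deliverability">deliverability tips</a>
<a href="https://other.example/tool">a tool</a>
</body></html>"""

extractor = PageExtractor("https://example.com/guide")
extractor.feed(sample_html)
print(extractor.title, extractor.headings)
print(extractor.internal_links, extractor.external_links)
```

Relative links are resolved against the page URL before they are classified, which is why "/deliverability" ends up in the internal list with its anchor text.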

But the real value isn't the raw extraction. It's what happens after.

Cannibalization detection

Once every page is crawled and analyzed, Mentionable generates semantic embeddings for your content and compares them. When two pages cover highly similar topics, they get flagged as a cannibalization pair.

Each pair gets a severity rating:

  • High: these pages are very likely competing for the same queries and confusing LLMs about which one represents your expertise
  • Medium: significant overlap that could dilute your authority on the topic
  • Potential: enough similarity to watch, but may be intentional (like a product page and a related blog post)

For each pair, you see both pages side by side with their similarity scores. This makes it easy to decide: merge them, differentiate them, or redirect one to the other.
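
As an illustration of how embedding similarity can map to severity ratings, here is a small sketch. It assumes cosine similarity as the comparison measure, and the three thresholds are invented for the example: Mentionable does not publish its cut-offs or its exact similarity metric. The function names and toy vectors are hypothetical too.

```python
import math
from itertools import combinations

# Hypothetical cut-offs, highest first. The real values are not published.
THRESHOLDS = [(0.92, "high"), (0.85, "medium"), (0.78, "potential")]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def severity(score):
    """Return the first label whose cut-off the score reaches, else None."""
    for cutoff, label in THRESHOLDS:
        if score >= cutoff:
            return label
    return None


def cannibalization_pairs(embeddings):
    """embeddings: dict of url -> vector. Returns flagged pairs, most similar first."""
    pairs = []
    for (url_a, vec_a), (url_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        label = severity(score)
        if label:
            pairs.append((url_a, url_b, round(score, 3), label))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
embeddings = {
    "/invoicing-tips": [1.0, 0.0],
    "/best-invoicing-practices": [0.99, 0.1],
    "/time-tracking": [0.0, 1.0],
}
pairs = cannibalization_pairs(embeddings)
for pair in pairs:
    print(pair)
```

Only the two invoicing pages are flagged here; the time-tracking page points in a different direction in the toy vector space and falls below every threshold.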

Why does this matter for AI visibility? When an LLM tries to understand your expertise on a topic and finds three similar pages, it has to pick one. It might pick the wrong one. Or worse, it might conclude that none of them are authoritative enough compared to a competitor who has one definitive page on that topic.

Missing internal links

Internal links are how both search engines and LLMs discover the relationships between your pages. If your "ultimate guide to email marketing" doesn't link to your "email deliverability tips" post, that's a missed opportunity for both.

After crawling your site, Mentionable analyzes the content of every page and identifies pages that cover related topics but don't link to each other. For each suggestion, you get:

  • The source page (where the link should be added)
  • The target page (where the link should point)
  • Suggested anchor text based on the target page's content

Suggestions are ranked by content relevance, so the most impactful links appear first. You get up to 100 suggestions per crawl.
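
The logic behind these suggestions can be sketched in a few lines. This is an assumption-heavy illustration, not the product's algorithm: it scores relevance with cosine similarity, uses the target page's title as anchor text, and applies a made-up minimum relevance of 0.6. Only the cap of 100 suggestions comes from the description above.

```python
import math

MAX_SUGGESTIONS = 100  # cap stated above
MIN_RELEVANCE = 0.6    # hypothetical cut-off for "related topics"


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_links(pages, existing_links):
    """pages: dict of url -> {"title": str, "embedding": list of floats}.
    existing_links: set of (source_url, target_url) pairs found by the crawl."""
    suggestions = []
    for source, source_page in pages.items():
        for target, target_page in pages.items():
            if source == target or (source, target) in existing_links:
                continue
            relevance = cosine_similarity(
                source_page["embedding"], target_page["embedding"]
            )
            if relevance >= MIN_RELEVANCE:
                suggestions.append({
                    "source": source,
                    "target": target,
                    # Assumption: reuse the target's title as anchor text.
                    "anchor_text": target_page["title"],
                    "relevance": round(relevance, 3),
                })
    suggestions.sort(key=lambda s: s["relevance"], reverse=True)
    return suggestions[:MAX_SUGGESTIONS]


pages = {
    "/guide": {"title": "Ultimate guide to email marketing", "embedding": [1.0, 0.2]},
    "/deliverability": {"title": "Email deliverability tips", "embedding": [0.9, 0.4]},
    "/pricing": {"title": "Pricing", "embedding": [0.0, 1.0]},
}
existing_links = {("/deliverability", "/guide")}
suggestions = suggest_links(pages, existing_links)
for suggestion in suggestions:
    print(suggestion)
```

Links are treated as directional: the deliverability post already links to the guide, so only the missing link from the guide to the deliverability post is suggested.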

Topic clusters

Mentionable groups your pages into semantic clusters using embeddings and DBSCAN clustering. This gives you a visual map of how your content is organized by topic.

You might discover that you have 12 pages about "project management" but only 2 about "time tracking," even though both topics matter equally for your business. Or you might find that pages you thought were about different topics actually cluster together, revealing hidden overlap.

Topic clusters also help you plan your content strategy. Gaps in your cluster map point to topics where you need more content. Dense clusters suggest areas where you're already strong and might want to consolidate.
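
For readers who want to see what DBSCAN does with page embeddings, here is a compact pure-Python sketch. The eps and min_samples values, the cosine distance metric, and the toy vectors are assumptions chosen for the example, not Mentionable's settings. In practice you would more likely reach for a library implementation such as scikit-learn's DBSCAN.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps=0.05, min_samples=2):
    """Return one label per vector: 0, 1, ... for clusters, -1 for noise."""
    n = len(vectors)
    # Each point counts as its own neighbor (distance 0).
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


# Toy vectors: three "project management" pages, two "time tracking" pages,
# and one page that sits between the two topics.
vectors = [
    [1.0, 0.1], [0.95, 0.15], [1.0, 0.05],
    [0.1, 1.0], [0.05, 0.9],
    [0.7, 0.7],
]
labels = dbscan(vectors)
print(labels)
```

Unlike k-means, DBSCAN does not need the number of topics in advance, and pages that fit no cluster are labeled as noise (-1) instead of being forced into the nearest group.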

AI crawler accessibility

After crawling your site, Mentionable checks three things:

robots.txt analysis: are you blocking any AI crawlers? Many site owners don't realize their robots.txt blocks GPTBot, Google-Extended, ClaudeBot, or other AI crawlers. If you're blocking them, LLMs can't access your latest content to inform their recommendations.

llms.txt detection: do you have an llms.txt file? This emerging standard helps LLMs understand your site structure and find your most important content. Mentionable checks whether it exists and what it contains.

Sitemap coverage: what percentage of your crawled pages appear in your sitemap? A low coverage percentage means some of your content isn't being explicitly shared with crawlers.
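
You can approximate the robots.txt and sitemap checks yourself with Python's standard library. The sketch below is an illustration under stated assumptions, not Mentionable's code: the function names, the three-crawler list, and the sample files are hypothetical, and it reads only a flat sitemap (no sitemap index files).

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "Google-Extended", "ClaudeBot"]
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_ai_crawlers(robots_txt, url="https://example.com/"):
    """Return the AI crawlers that robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]


def sitemap_coverage(crawled_urls, sitemap_xml):
    """Percentage of crawled URLs that are also listed in the sitemap."""
    crawled = set(crawled_urls)
    if not crawled:
        return 0.0
    root = ET.fromstring(sitemap_xml)
    listed = {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }
    return 100.0 * len(crawled & listed) / len(crawled)


sample_robots = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

sample_sitemap = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/blog/a</loc></url>
</urlset>
"""

crawled = [
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/blog/a",
    "https://example.com/blog/b",
]

blocked = blocked_ai_crawlers(sample_robots)
coverage = sitemap_coverage(crawled, sample_sitemap)
print(blocked, coverage)
```

In the sample, only GPTBot is blocked, and three of the four crawled pages appear in the sitemap, so coverage is 75%.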

Crawl limits by plan

Plan      Price           Max pages per crawl
Starter   EUR 39/month    500 pages
Growth    EUR 79/month    1,000 pages
Pro       EUR 149/month   1,000 pages
Agency    EUR 300/month   1,000 pages

One crawl per month per project. For most solopreneurs and small sites, 500 pages covers the core content. Larger sites with more than 500 pages benefit from the Growth, Pro, or Agency tiers.

How the crawl pipeline works

When you start a crawl, Mentionable runs an 11-step async pipeline. Its main stages are:

  1. Initiates the crawl and begins fetching pages
  2. Extracts content, headings, schema, and links from every page
  3. Stores all page data and link relationships
  4. Scans your robots.txt, llms.txt, and sitemap
  5. Generates semantic embeddings for all page content
  6. Resolves internal link targets and computes link metrics
  7. Groups pages into topic clusters
  8. Detects cannibalization pairs
  9. Generates missing link suggestions
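
Step 6 is worth a closer look, because raw href values are often relative or carry fragments. The following sketch shows one plausible way to resolve link targets and count inbound links per page. The function names and sample data are hypothetical, and Mentionable's actual link metrics are not documented here.

```python
from collections import Counter
from urllib.parse import urldefrag, urljoin


def resolve_links(raw_links):
    """raw_links: iterable of (source_url, href) pairs as found in the HTML.
    Returns a set of (source_url, absolute_target_url) pairs."""
    resolved = set()
    for source, href in raw_links:
        # Make the href absolute, then drop any "#fragment" part.
        target, _fragment = urldefrag(urljoin(source, href))
        if target != source:  # ignore links from a page to itself
            resolved.add((source, target))
    return resolved


def inbound_counts(pages, resolved):
    """Number of distinct pages linking to each crawled page."""
    counts = Counter(target for _source, target in resolved if target in pages)
    return {url: counts[url] for url in pages}


pages = [
    "https://example.com/",
    "https://example.com/guide",
    "https://example.com/pricing",
    "https://example.com/old-post",
]
raw_links = [
    ("https://example.com/", "/guide"),
    ("https://example.com/", "guide#intro"),   # same target as the line above
    ("https://example.com/guide", "pricing"),
    ("https://example.com/guide", "#top"),     # self-link, ignored
]

inbound = inbound_counts(pages, resolve_links(raw_links))
print(inbound)
```

A page with zero inbound links, like the old post in this sample, is a candidate orphan page. The sketch does not normalize trailing slashes or query strings, which a production crawler would also need to handle.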

You can watch the progress in real time from the crawl dashboard. Each step shows a progress indicator so you know where the analysis stands.

Who benefits most

Solopreneurs who've been publishing content for months or years often discover they have cannibalization problems they never knew about. Three blog posts all targeting "best invoicing practices"? That's diluting your authority instead of building it.

Consultants can use site crawl results to show clients the structural issues holding back their AI visibility. A report showing 15 cannibalization pairs and 40 missing internal links is concrete and actionable.

Content-heavy sites with 100+ pages get the most out of clustering and missing link detection. The more content you have, the harder it is to maintain a coherent internal linking structure manually.

Try it yourself

Start your 7-day free trial and run your first site crawl. See how your content clusters, where cannibalization is hurting your authority, and which internal links you're missing. No credit card required.

Frequently Asked Questions

What does Mentionable's site crawl analyze?
Mentionable's site crawl extracts content, heading structure, schema markup (JSON-LD), meta tags, canonical URLs, and all internal and external links for every page. After crawling, it runs semantic analysis to detect content cannibalization, suggest missing internal links, and map your site into topic clusters.

How many pages can Mentionable crawl?
Page limits depend on your Mentionable plan: Starter crawls up to 500 pages, and Growth, Pro, and Agency crawl up to 1,000 pages. The crawler starts from your homepage and follows internal links automatically.

What is content cannibalization and how does Mentionable detect it?
Content cannibalization happens when two or more pages on your site target the same topic or keywords, forcing search engines and LLMs to choose between them. Mentionable uses semantic embeddings to compare page content and flags pairs with high similarity. Each pair gets a severity rating: high, medium, or potential.

How do missing link suggestions work?
After crawling your site and analyzing the content of every page, Mentionable identifies pages that cover related topics but don't link to each other. It suggests specific internal links with recommended anchor text, ordered by content relevance. You get up to 100 suggestions per crawl.

Does Mentionable check if my site blocks AI crawlers?
Yes. After crawling, Mentionable reads your robots.txt file and checks whether you're blocking AI crawlers like GPTBot (ChatGPT), Google-Extended (Gemini), ClaudeBot, or others. It also detects whether you have an llms.txt file, which helps LLMs understand your site.

How often can I run a site crawl?
You can run one crawl per month per project. This limit exists because crawling is resource-intensive and the analysis pipeline involves multiple AI processing steps. Running a crawl after major content or structural changes to your site gives you the most useful results.

How is this different from Screaming Frog or Ahrefs site audit?
Traditional crawlers focus on broken links, missing tags, and technical errors. Mentionable's crawl adds AI-specific analysis: semantic clustering of your content into topic groups, cannibalization detection using embeddings (not just URL patterns), missing internal link suggestions based on content relevance, and AI crawler accessibility checks. It's built for people optimizing their site for LLM visibility, not just Google rankings.
