Crawl Budget and Indexing: Why Google Is Ignoring Some of Your Pages

Google does not crawl every page on your website every time it visits. It works within a crawl budget: the time and resources it is willing to spend on your site. Waste it on junk pages and your important pages get ignored.

That is the blunt version of crawl budget SEO. It sounds nerdy because it is nerdy, but it has a very real effect on whether Google finds, understands, and indexes the pages that actually make you money.

If your site has 20 clean pages, you probably do not need to lose sleep over this. If your site has hundreds or thousands of URLs, a messy WordPress setup, filters, tags, archives, duplicate service pages, redirects, broken links, and old rubbish hanging around from 2016, crawl budget starts to matter.

What crawl budget actually is

Crawl budget is the amount of crawling Googlebot is willing and able to do on your website. It is not a fixed daily allowance where Google says, “Right, Dave’s Plumbing gets exactly 86 crawls today.” It is more fluid than that.

Google looks at two broad things: crawl capacity and crawl demand. Crawl capacity is how much your server can handle without being battered. If your site is slow, unstable, or throwing errors, Googlebot may back off. Crawl demand is how much Google thinks your pages are worth revisiting. Important, popular, frequently updated pages tend to get crawled more often.

For a small brochure website, crawl budget is usually a non-issue. Google can crawl the whole thing without breaking a sweat. For bigger sites, especially e-commerce sites, directories, news sites, or years-old WordPress sites full of archive clutter, crawl budget can become a proper problem.

The issue is simple: if Googlebot spends too much time crawling junk, it may not get to your best pages quickly enough. New pages take longer to appear. Updated pages take longer to refresh. Important commercial pages sit there waiting, like a good salesperson locked in the stockroom.

Why crawl budget matters more for larger or messier sites

Crawl budget matters most when your website has more URLs than Google reasonably wants to crawl often. This can happen because the site is genuinely large, or because it is technically messy and creating URLs you did not realise existed.

An e-commerce site with product filters can accidentally create thousands of near-identical URLs. A WordPress site can do the same with tags, author archives, date archives, attachment pages, feeds, search result pages, and pagination. A local business site can end up with old campaign pages, duplicate town pages, test pages, staging URLs, and redirected rubbish from previous redesigns.

Google may still crawl some of this. That does not mean it should. Crawling useless pages does not help your rankings, and indexing them can make things worse.

This is especially important for sites where trust and accuracy matter. If you run a health, finance, legal, or medical site, Google needs to reach the pages that explain your services, credentials, locations, and governance clearly. A medically supervised weight-loss provider such as Healthy Weight Clinics needs its key treatment and trust pages accessible, not buried beneath thin archives or duplicate parameter URLs.

Bigger sites need discipline. Messy sites need a bloody good tidy-up.

How Google decides which pages to crawl and how often

Google discovers URLs through internal links, XML sitemaps, external links, redirects, canonical tags, hreflang annotations, and previously known URLs. Once it knows a URL exists, it then decides whether to crawl it, how often to return, and whether the page deserves indexing.

Several signals influence that decision:

  • Internal importance: Pages linked from your main navigation, homepage, service hubs, and strong category pages tend to look more important.
  • Freshness: Pages that change regularly may be crawled more often, especially if Google has learned those updates matter.
  • Popularity and authority: Pages with stronger internal and external links often get more crawl attention.
  • Server response: Slow responses, 5xx errors, timeouts, and overloaded hosting can reduce crawl activity.
  • Historical quality: If Google repeatedly finds weak, duplicated, or low-value pages on your site, it may become less enthusiastic.

This is why site architecture matters. Googlebot is not sitting there admiring your homepage hero image. It is following links, reading signals, testing URLs, and deciding what is worth its time.

If your important service page is buried five clicks deep, has no internal links, is missing from your sitemap, and competes with three duplicate versions, do not act shocked when Google treats it like an afterthought.

What wastes crawl budget

The usual crawl budget killers are not exotic. They are boring, common, and usually caused by websites growing without anyone keeping the structure under control.

The biggest offenders are thin pages, duplicate content, parameter URLs, broken internal links, redirect chains, and archive pages that add no value. Googlebot wastes time fetching pages that should never have existed, then your proper pages have to fight for attention.

Here is the short version:

Crawl waste problem | What Google sees | Sensible fix
Thin pages | Lots of weak pages with little unique value | Improve, merge, noindex, or remove them
Duplicate content | Several URLs saying basically the same thing | Canonicalise, consolidate, or redirect
Parameter URLs | Filters, sorting, tracking tags, and search URLs | Block, canonicalise, or control generation
Broken internal links | Crawl paths leading to 404s | Fix links or redirect properly
Paginated archives | Endless low-value archive pages | Noindex where appropriate and improve linking
Redirect chains | Googlebot hops through several URLs | Replace with direct final URLs

Parameter URLs are a classic pain. Things like ?sort=price, ?filter=red, ?utm_source=email, and internal search URLs can multiply quickly. One product category can become hundreds of crawlable URLs if filters are not controlled.
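
To see how quickly this multiplies, here is a rough Python sketch that collapses parameter variants back to one address and counts how many crawlable URLs sit behind each. The parameter list, the example.com URLs, and the function names are all made up for illustration. Which parameters are safe to ignore depends entirely on your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters you have decided should not create separate pages.
IGNORED_PARAMS = {"sort", "filter", "utm_source", "utm_medium", "utm_campaign"}


def normalise(url):
    """Strip ignored parameters so URL variants collapse to one address."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_variants(urls):
    """Map each normalised URL to the list of crawlable variants behind it."""
    groups = {}
    for url in urls:
        groups.setdefault(normalise(url), []).append(url)
    return groups


# Hypothetical sample of URLs found by a crawler.
crawled = [
    "https://example.com/shoes",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?filter=red",
    "https://example.com/shoes?filter=red&sort=price",
    "https://example.com/shoes?utm_source=email",
    "https://example.com/shoes?page=2",
]

for target, variants in group_variants(crawled).items():
    print(f"{len(variants)} URL(s) -> {target}")
```

On this sample, five of the six URLs collapse to the same category page. The ?page=2 URL stays separate because it is not in the ignore list.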

Broken internal links are another daft one. If your site links to 404 pages, Googlebot follows them. That is time wasted, and it also makes your site look poorly maintained. Same with redirect chains. One redirect is usually fine. Four redirects because nobody cleaned up after a redesign is just lazy.
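
Here is a minimal Python sketch of how you might spot chains and dead ends from a crawl export. The redirect map, status codes, and link list are invented sample data, and trace and audit are hypothetical helper names, not part of any crawler's API.

```python
def trace(url, redirects, statuses):
    """Follow redirects from url and return (hops, final_url, final_status)."""
    hops = 0
    seen = {url}
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen:  # we have been here before, so this is a loop
            return hops, url, "loop"
        seen.add(url)
    # Assumption: any URL missing from the export responded with 200.
    return hops, url, statuses.get(url, 200)


def audit(links, redirects, statuses):
    """List internal links that end in a 404, a loop, or a redirect chain."""
    issues = []
    for link in links:
        hops, final, status = trace(link, redirects, statuses)
        if status == "loop":
            issues.append((link, "redirect loop"))
        elif status == 404:
            issues.append((link, "ends in a 404"))
        elif hops > 1:  # a single redirect is usually fine
            issues.append((link, f"{hops} redirects, link straight to {final}"))
    return issues


# Hypothetical crawl data: source URL -> redirect target.
redirects = {
    "/old-services": "/services-2019",
    "/services-2019": "/our-services",
    "/our-services": "/services",
    "/boiler-offer": "/offers/boiler",
}
statuses = {"/services": 200, "/offers/boiler": 404}
internal_links = ["/services", "/old-services", "/boiler-offer", "/contact"]

for link, problem in audit(internal_links, redirects, statuses):
    print(f"{link}: {problem}")
```

The fix in each case is the same as in the table above: point the internal link at the final URL, or remove it if the destination is gone.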

How to check what Google is actually crawling

Do not guess. Guessing is how people delete half their website and then blame “the algorithm”. Start with Google Search Console.

The Page indexing report, which many SEOs still call the Coverage report because old habits die hard, shows which URLs Google has indexed and which it has excluded. Pay attention to patterns, not isolated oddities. A few excluded pages are normal. Hundreds of “Crawled, currently not indexed” pages may suggest quality or duplication issues.

Useful areas to review include:

  • Pages marked “Discovered, currently not indexed”
  • Pages marked “Crawled, currently not indexed”
  • Duplicate pages where Google chose a different canonical
  • 404 errors and soft 404s
  • Indexed pages that you never wanted indexed
  • Sitemap URLs that are submitted but not indexed
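
One rough way to surface patterns in those lists is to bucket the exported URLs by their first folder or query parameter. The Python sketch below assumes you have pasted URLs from a Search Console export into a list. The URLs and the pattern_of helper are illustrative only.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def pattern_of(url):
    """Reduce a URL to a coarse pattern such as '/tag/*' or '?s='."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if params:
        return f"?{params[0][0]}="
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > 1:
        # Treat numeric folders (date archives like /2016/03/) as one pattern.
        first = "#" if segments[0].isdigit() else segments[0]
        return f"/{first}/*"
    return "(single page)"


# Hypothetical sample of excluded URLs.
excluded = [
    "https://example.com/tag/boilers/",
    "https://example.com/tag/boiler/",
    "https://example.com/tag/boiler-repair/",
    "https://example.com/2016/03/",
    "https://example.com/2017/11/",
    "https://example.com/?s=boiler",
    "https://example.com/author/admin/",
    "https://example.com/services",
]

patterns = Counter(pattern_of(url) for url in excluded)
for pattern, count in patterns.most_common():
    print(f"{count:>3}  {pattern}")
```

A pattern with a big count is worth a closer look. A single excluded page on its own usually is not.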

Search Console also has a Crawl stats report. This shows crawl requests by response code, file type, host, and purpose. It will not tell you everything, but it can show if Googlebot is spending a suspicious amount of time on redirects, errors, images, or useless URL patterns.

For deeper work, use server logs. Logs show the actual requests Googlebot made to your server. You can see which URLs were crawled, how often, what status code they returned, and whether Googlebot keeps hammering pointless sections. This is more technical, but for large sites it is gold dust.
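
If you want a feel for what log analysis looks like, here is a small Python sketch. It assumes your server writes the common "combined" access log format. The sample lines, paths, and helper names are stand-ins for your real data. It also trusts the user agent string, which can be faked, so treat the numbers as a first look rather than proof.

```python
import re
from collections import Counter

# Matches the widely used "combined" access log format.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def section(path):
    """Bucket a request path into a rough site section."""
    if "?" in path:
        return "parameter URLs"
    segments = [s for s in path.split("/") if s]
    return f"/{segments[0]}/" if len(segments) > 1 else path


def summarise(lines):
    """Count requests claiming to be Googlebot, by section and status code."""
    by_section, by_status = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Anyone can fake this string. Google documents a reverse DNS check
        # for confirming that a request really came from Googlebot.
        if match and "Googlebot" in match.group("agent"):
            by_section[section(match.group("path"))] += 1
            by_status[int(match.group("status"))] += 1
    return by_section, by_status


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def sample(path, status, agent):
    """Build one made-up log line for the demo."""
    return (
        f'203.0.113.5 - - [10/May/2024:06:12:01 +0000] '
        f'"GET {path} HTTP/1.1" {status} 5120 "-" "{agent}"'
    )


log_lines = [
    sample("/tag/boilers/", 200, GOOGLEBOT),
    sample("/tag/boiler/", 200, GOOGLEBOT),
    sample("/shoes?sort=price", 200, GOOGLEBOT),
    sample("/old-services", 301, GOOGLEBOT),
    sample("/services/boiler-repair/", 200, GOOGLEBOT),
    sample("/services/boiler-repair/", 200, BROWSER),
    sample("/missing-page", 404, GOOGLEBOT),
]

by_section, by_status = summarise(log_lines)
print(by_section.most_common())
print(sorted(by_status.items()))
```

In this made-up sample, the tag archives get twice as many Googlebot hits as the one service page that matters, which is exactly the sort of imbalance you are looking for.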

If this bit makes your eyes bleed, that is exactly the sort of thing covered in a proper technical SEO audit.

How to fix crawl budget and indexing issues

Fixing crawl budget issues is mostly about giving Google fewer stupid choices. You want Googlebot spending more time on useful, indexable, canonical pages and less time wandering through the website equivalent of a cupboard full of old cables.

The main tools are noindex tags, canonical tags, robots.txt, redirects, and deletion. Each has a different job, so do not use them interchangeably.

A noindex tag tells Google not to include a page in the index. This is useful for thin archive pages, internal search pages, thank-you pages, and other pages users may need but searchers do not. Important point: if you block a page in robots.txt, Google cannot crawl it, so it never gets to see the noindex tag, and the URL can still turn up in the index if other pages link to it. Do not block something and expect Google to read instructions it cannot access.
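
As a rough illustration of that conflict, the Python sketch below flags pages that carry a noindex tag but are also blocked in robots.txt. The robots.txt rules, page HTML, and helper names are invented for the example. Python's built-in robots.txt parser does simple prefix matching and will not interpret * or $ wildcards the way Googlebot does, so this only suits plain folder rules.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether the page has a robots meta tag containing noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = (attrs.get("content") or "").lower()
        if name in ("robots", "googlebot") and "noindex" in content:
            self.noindex = True


def has_noindex(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


def hidden_noindex(robots_txt, pages, agent="Googlebot"):
    """Return URLs whose noindex tag cannot be seen because crawling is blocked."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [
        url
        for url, html in pages.items()
        if has_noindex(html) and not rules.can_fetch(agent, url)
    ]


# Hypothetical robots.txt and page HTML.
robots_txt = "User-agent: *\nDisallow: /search/\n"
pages = {
    "https://example.com/search/?q=boiler":
        '<html><head><meta name="robots" content="noindex, follow"></head></html>',
    "https://example.com/thank-you/":
        '<html><head><meta name="robots" content="noindex"></head></html>',
    "https://example.com/services/":
        "<html><head><title>Services</title></head></html>",
}

print(hidden_noindex(robots_txt, pages))
```

Here the internal search page is flagged: it says noindex, but the robots.txt rule stops Google from ever reading it. The thank-you page is fine because it can be crawled.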

Canonical tags help with duplicate or near-duplicate pages. They tell Google which version you prefer. They are a hint, not a magic spell. If your canonicals contradict your internal links, sitemap, and redirects, Google may ignore them.

Robots.txt is best used to stop crawling of sections that should not be crawled at all, such as certain parameter patterns or internal search paths. Be careful. One bad robots.txt rule can hide your whole site. I have seen it. It is not funny when it is your enquiry pipeline.
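
One cheap safety net is to test a draft robots.txt against a list of pages that must stay crawlable before it goes live. This is a sketch using Python's standard library, with made-up rules and URLs. The same caveat applies as with any simple parser: it handles plain prefix rules, not Google's wildcard syntax.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the URLs from the list that the given rules would block."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [url for url in urls if not rules.can_fetch(agent, url)]


# Hypothetical list of pages that must never be blocked.
MUST_CRAWL = [
    "https://example.com/",
    "https://example.com/services/",
    "https://example.com/contact/",
]

careful_rules = "User-agent: *\nDisallow: /search/\n"
careless_rules = "User-agent: *\nDisallow: /\n"  # one character too few

print(blocked_urls(careful_rules, MUST_CRAWL))   # nothing important blocked
print(blocked_urls(careless_rules, MUST_CRAWL))  # the whole site blocked
```

The only difference between the two rule sets is the folder name after the slash, which is how whole sites get hidden by accident.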

Sometimes the best fix is removal. If a page has no traffic, no links, no purpose, and no future, delete it and return a 410 or redirect it to a genuinely relevant replacement.

WordPress crawl budget problems are everywhere

WordPress is brilliant, but it will happily create crawl clutter if left unsupervised. Most small business owners never see it because they only look at the main pages in the admin menu. Googlebot sees more.

Common WordPress crawl budget problems include tag archives, author archives, date archives, attachment pages, category pagination, media URLs, feed URLs, and plugin-generated pages. None of these are automatically bad. The problem is when they create dozens or hundreds of low-value URLs that compete with your useful content.

Tag archives are a favourite mess. Someone adds tags to blog posts for years, using slightly different wording each time, and suddenly the site has 300 tag archive pages with one or two posts on each. Google crawls them, finds little value, and rightly wonders what the hell is going on.
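
If you can export your posts and their tags, a few lines of Python will show how bad it is. This is only a sketch: the post data is invented, thin_tags is a hypothetical helper, and the cut-off of three posts per tag is an assumption you should adjust.

```python
from collections import Counter


def thin_tags(posts, min_posts=3):
    """Return tags used on fewer than min_posts posts, sorted by name."""
    counts = Counter(tag for tags in posts.values() for tag in set(tags))
    return sorted(tag for tag, used in counts.items() if used < min_posts)


# Hypothetical export: post slug -> tags applied to it.
posts = {
    "post-1": ["boilers", "heating"],
    "post-2": ["boiler", "heating"],
    "post-3": ["boiler repair", "heating"],
    "post-4": ["boilers", "plumbing tips"],
}

print(thin_tags(posts))
```

In this sample only "heating" clears the bar. The three near-identical boiler tags are the slightly-different-wording problem in miniature, and each one is its own archive URL.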

Author archives are often pointless on single-author business sites. Date archives are usually pointless for service businesses. Attachment pages can be a disaster if each uploaded image gets its own indexable page with barely any content.

Most WordPress SEO plugins can noindex these areas, but the settings need checking properly. Do not assume installing a plugin fixed it. Plugins do what they are told. If they were set up badly by the previous agency, or not set up at all, they can quietly create problems for years.

For WordPress sites that have grown arms and legs, proper WordPress SEO support can save you from a lot of technical nonsense.

Internal linking controls crawl depth

Internal linking is one of the simplest ways to tell Google which pages matter. It affects discovery, priority, context, and crawl depth.

Crawl depth means how many clicks it takes to reach a page from the homepage or another strong page. A page linked from the main navigation is usually one click deep. A page linked from a category, then a subcategory, then an archive, then an old blog post is buried. Google can still find it, but it may treat it as less important.

Pages buried five clicks deep often get ignored or crawled less often, especially on larger sites. That is not because Googlebot is lazy. It is because your own structure is telling Google, “This page is not very important.”

Your key pages should be easy to reach. For most small business sites, that means core service pages, location pages, case studies, pricing or enquiry pages, and important guides should be no more than two or three clicks from the homepage.

Good internal linking also reduces orphan pages. An orphan page has no internal links pointing to it. It may exist in your sitemap, but if no page links to it, Google gets a mixed message. Sitemaps say “this matters”. Your site structure says “apparently not”.
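
Crawl depth and orphan pages are easy to measure once you have a map of which page links to which. The Python sketch below runs a breadth-first search from the homepage over a made-up link map. The URLs, the three-click limit, and the click_depths helper are assumptions for illustration, and it treats any sitemap page that cannot be reached from the homepage as an orphan.

```python
from collections import deque


def click_depths(links, start="/"):
    """Breadth-first search: fewest clicks from start to every reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical internal link map: page -> pages it links to.
links = {
    "/": ["/services/", "/blog/"],
    "/services/": ["/services/boiler-repair/"],
    "/blog/": ["/blog/2019/"],
    "/blog/2019/": ["/blog/2019/old-post/"],
    "/blog/2019/old-post/": ["/services/emergency-callout/"],
}

# Hypothetical list of pages in the XML sitemap.
sitemap = [
    "/",
    "/services/",
    "/services/boiler-repair/",
    "/services/emergency-callout/",
    "/services/landlord-certificates/",
]

MAX_DEPTH = 3  # assumption: key pages should be within three clicks

depths = click_depths(links)
too_deep = [page for page in sitemap if depths.get(page, 0) > MAX_DEPTH]
orphans = [page for page in sitemap if page not in depths]

print("Too deep:", too_deep)
print("Orphans:", orphans)
```

In this example the emergency callout page is four clicks deep because its only link sits in an old blog post, and the landlord certificates page is in the sitemap with nothing linking to it at all.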

Use internal links like signposts, not confetti. Link where it helps users and gives Google useful context.

When crawl budget genuinely matters

Crawl budget genuinely matters when Google is not discovering or refreshing important pages because your site has too many low-value URLs, poor architecture, or technical waste.

You should take it seriously if you have:

  • Thousands of URLs, especially from products, filters, locations, or archives
  • Lots of “Discovered, currently not indexed” pages in Search Console
  • Lots of “Crawled, currently not indexed” pages that should be valuable
  • Major duplication caused by parameters, categories, tags, or location pages
  • Frequent content updates that Google takes ages to reflect
  • Server log evidence that Googlebot is crawling junk more than key pages
  • A recent migration or redesign that left redirects, 404s, and old URLs behind

E-commerce sites, directories, publishers, marketplace sites, and multi-location businesses are the obvious candidates. But even a small business website can have crawl issues if WordPress has been allowed to spew out rubbish for years.

The important word is “genuinely”. Do not obsess over crawl budget if your site has 40 pages, clean navigation, a sensible sitemap, and no indexation problems. You have bigger fish to fry, like writing better service pages, earning trust, and making the phone ring.

Crawl budget is not an excuse to avoid doing basic SEO. It is a technical constraint that matters once the mess gets big enough.

When crawl budget does not matter much

For most small, clean websites, crawl budget is not the reason you are not ranking. Sorry. It is more likely that your content is weak, your service pages do not match search intent, your local signals are poor, your site is slow, or your competitors are simply doing a better job.

Google has said many times that most small sites do not need to worry about crawl budget. If your website has fewer than a few thousand URLs and it is technically clean, Google can usually crawl it efficiently.

That does not mean indexation is automatic. Google may crawl a page and still choose not to index it. Crawling and indexing are different. Crawling means Google found and fetched the page. Indexing means Google decided the page deserves to be stored and shown in search results.

This distinction matters. If Google crawls your page but does not index it, the problem may be quality, duplication, intent, canonicalisation, or trust. If Google never crawls it, the problem may be discovery, internal linking, sitemap issues, robots.txt, or crawl prioritisation.

So do not diagnose everything as crawl budget. That is like blaming the van because the plumber forgot his tools. Sometimes the page is just not good enough.

A sensible crawl budget cleanup process

If you suspect crawl budget or indexing problems, do not start by blocking random folders in robots.txt. That is how people create expensive disasters.

Use a calm process:

  1. Crawl your own site with a tool such as Screaming Frog, Sitebulb, or a similar crawler.
  2. Export indexed and excluded URL examples from Google Search Console.
  3. Compare your XML sitemap against the pages actually indexed.
  4. Look for repeated URL patterns, such as tags, dates, filters, searches, parameters, and duplicate locations.
  5. Decide which pages should be indexed, which should be noindexed, which should be canonicalised, and which should be removed.
  6. Fix internal links so key pages are easier to reach.
  7. Resubmit clean XML sitemaps containing only canonical, indexable URLs.
  8. Monitor Search Console over several weeks, not five minutes after making changes.
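
Step 3 is the easiest one to script. The sketch below, in Python, reads the URLs out of a sitemap and compares them with a list of indexed URLs. The sitemap content and the indexed set are made-up sample data, and it assumes a plain sitemap file rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by XML sitemaps.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs in a plain XML sitemap."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)}


def compare(sitemap, indexed):
    """Show where the sitemap and the index disagree."""
    return {
        "in sitemap, not indexed": sorted(sitemap - indexed),
        "indexed, not in sitemap": sorted(indexed - sitemap),
    }


# Hypothetical sitemap content.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/services/</loc></url>
  <url><loc>https://example.com/services/boiler-repair/</loc></url>
</urlset>"""

# Hypothetical list of indexed URLs, e.g. from a Search Console export.
indexed = {
    "https://example.com/",
    "https://example.com/services/",
    "https://example.com/tag/boilers/",
}

for label, urls in compare(sitemap_urls(SITEMAP_XML), indexed).items():
    print(label, urls)
```

The first list is where you look for pages Google is ignoring. The second is where you look for junk that got indexed without you asking.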

The goal is not to make Google crawl fewer pages for the sake of it. The goal is to make Google crawl better pages more efficiently.

Think of it like stock control. If your shop is full of broken products, old boxes, and duplicate labels, customers struggle to find the good stuff. Googlebot is no different. Tidy the shelves. Put the profitable products where they can be found. Bin the rubbish.

Frequently Asked Questions

Does crawl budget affect small business websites?

Usually, not much. A clean small business website with 20 to 100 useful pages is normally easy for Google to crawl. Crawl budget becomes more relevant when the site has lots of duplicate, thin, archived, or parameter-based URLs. Small sites can still have crawl problems, but poor content and weak local SEO are more common causes of low visibility.

How do I check what pages Google is crawling?

Start with Google Search Console. Use the Page indexing report, Crawl stats report, XML sitemap data, and URL Inspection tool. These show which pages are indexed, excluded, discovered, or recently crawled. For deeper analysis, check server logs. Server logs show actual Googlebot requests, including URLs, crawl frequency, response codes, and wasted crawl paths.

What is a noindex tag and when should I use it?

A noindex tag is an instruction in a page’s HTML telling search engines not to include that page in search results. Use it for pages users may need but Google should not rank, such as thin archives, internal search results, thank-you pages, and low-value duplicate pages. Do not block the same page in robots.txt if Google needs to see the noindex tag.

Can too many pages hurt my SEO?

Yes, if many of those pages are thin, duplicated, outdated, or low value. Having more pages does not automatically mean more rankings. A bloated site can dilute internal linking, waste crawl activity, confuse Google with duplicate signals, and reduce overall quality perception. It is better to have fewer strong pages than hundreds of weak ones doing bugger all.

Is crawling the same as indexing?

No. Crawling means Googlebot visited and fetched the page. Indexing means Google decided to store the page and make it eligible to appear in search results. A page can be crawled but not indexed if Google thinks it is low quality, duplicated, blocked by canonical signals, irrelevant, or not useful enough compared with other pages.

Should I block junk pages in robots.txt or use noindex?

It depends on the problem. Use noindex when Google needs to access the page but should not show it in search results. Use robots.txt when you want to stop Google crawling a section entirely, such as certain internal search or parameter URLs. Be careful with robots.txt because one bad rule can hide important pages from search engines.

About the author

Matt Warren is the founder of SEO Bridge, a UK-based digital marketing agency specialising in SEO, local SEO, and AI search optimisation including AEO and GEO strategies.