Most websites are bigger and messier than their owners believe. They accumulate pages that were quietly orphaned during a redesign, old URLs that still return a 404, a title tag duplicated across forty product pages, a canonical tag pointing at the wrong version of a page. None of it shows up when you click around the front end, because you only ever visit the pages you remember exist. A site crawl is what surfaces the rest. It is the single most honest thing you can do for a website, because it replaces opinion with a complete inventory of every URL, status code, and on-page tag the search engine will encounter.
Screaming Frog is the tool most SEOs reach for to run that crawl. It is a desktop spider that requests every URL it can find on your site the same way Googlebot does, then lays out the results in a giant filterable table. This guide walks through how to actually use it for a real audit, the issues you should hunt for, and the mindset that separates a useful crawl from a pile of data nobody acts on. The tactics map cleanly onto any reputable crawler, so if you run something else the workflow still holds.
What a Crawler Actually Does
A crawler starts at a seed URL, usually your homepage, downloads the HTML, extracts every link it finds, and queues those links to download next. It repeats that loop until it has visited every reachable page. Along the way it records the response status code, the title tag, meta description, headings, canonical tag, word count, response time, and dozens of other attributes for each URL. The output is a snapshot of your site as a machine sees it, not as a human browsing the navigation sees it.
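The loop described above can be sketched in a few lines of Python. This is an illustration of the general technique, not how Screaming Frog is implemented: the `fetch` callable, the `SITE` dictionary, and the example.com URLs are all stand-ins invented so the sketch runs without a network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, fetch):
    """Breadth-first crawl limited to the seed's host.

    `fetch` is a hypothetical callable: url -> (status_code, html_text).
    Returns {url: status_code} for every URL reachable from the seed.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    results = {}
    while queue:
        url = queue.popleft()
        status, html = fetch(url)
        results[url] = status
        if status != 200:
            continue  # error responses have no links worth following
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return results


# In-memory stand-in for a real site (assumed data).
SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="/old">Old</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="https://other.example/">Partner</a>',
}


def fake_fetch(url):
    return (200, SITE[url]) if url in SITE else (404, "")


report = crawl("https://example.com/", fake_fetch)
```

Here `report` contains the homepage, the about page, and the broken `/old` URL that nobody remembered linking to. A real crawler records far more per URL (titles, canonicals, response times), but the queue-and-visit structure is the same.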
That distinction matters. We built our own headless crawler in-house precisely because the crawl view is where the truth lives. The point of crawling is to find the gap between the site you designed and the site that actually exists on the server, and that gap is almost always wider than anyone expects.
Setting Up Your First Crawl
Open Screaming Frog, paste your domain into the bar at the top, and start the crawl. For a small site that is genuinely all you need. A few configuration choices are worth making before you hit go on anything larger:
- Respect robots.txt or ignore it deliberately. By default the spider obeys your robots file, which mirrors Googlebot. Temporarily ignoring it can reveal pages you accidentally blocked from crawling.
- JavaScript rendering. If your site builds its content client-side, switch the spider into rendered mode so it executes the JavaScript before reading the page. A crawl of raw HTML on a JS-heavy site will look alarmingly empty and mislead you.
- Crawl scope. Decide whether you are crawling a single subdomain, the whole domain, or following external links. Keep it tight for an audit so the data stays about your site.
- Speed limits. Throttle the request rate on a live production server so you do not hammer it during business hours.
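To see what the robots.txt choice means in practice, the sketch below uses Python's standard `urllib.robotparser` to list which URLs a given set of rules would block. The rules and URLs are made-up examples, and `blocked_urls` is a hypothetical helper, not a Screaming Frog feature.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_lines, urls, agent="*"):
    """Return the URLs that the given robots.txt rules disallow for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Assumed robots.txt content and crawl URLs, for illustration only.
ROBOTS = [
    "User-agent: *",
    "Disallow: /cart/",
    "Disallow: /search",
]
URLS = [
    "https://example.com/products/widget",
    "https://example.com/cart/checkout",
    "https://example.com/search/widgets",
]

blocked = blocked_urls(ROBOTS, URLS)
```

Running a list of important URLs through a check like this is a quick way to spot a disallow rule that is broader than intended.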
Save the configuration you trust
Once you have a setup that reflects how search engines see your site, save it as a default so every future audit starts from the same baseline. Consistent configuration is what makes a before-and-after comparison meaningful rather than an apples-to-oranges guess.
Let it run to completion before you start analyzing. A half-finished crawl gives you a half-true picture, and the issues you care most about often cluster in the deeper, slower-to-reach parts of the site.
The Issues Worth Hunting For
Once the crawl finishes you have a table with potentially thousands of rows. The skill is knowing which columns to filter and what each problem costs you. Work through them in roughly this order, because the earlier ones waste the most crawl resources and do the most damage.
Broken links and bad status codes
Filter the crawl by response code. Anything in the 4xx range is a client error, most often a 404, meaning a user or a search engine requested the URL and got an error back. Internal links pointing at 404s are pure waste: they bleed link equity into a dead end and create a poor experience for anyone who clicks them. Screaming Frog's inlinks panel shows you exactly which pages link to each broken URL, so you can go fix the source rather than guessing.
Pay attention to 5xx server errors too. A handful of 500s during a crawl can mean your server is buckling under load or that specific templates are throwing exceptions. Those pages are effectively invisible to search engines for as long as they fail.
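If you export the crawl, the same triage can be reproduced in code. The sketch below groups URLs by status class and builds an inlinks lookup for every failing URL. The `STATUSES` and `LINKS` data are invented, and the structure of a real export will differ.

```python
from collections import defaultdict


def group_by_status_class(statuses):
    """Group URLs under '2xx', '3xx', '4xx', '5xx' keys."""
    groups = defaultdict(list)
    for url, code in statuses.items():
        groups[f"{code // 100}xx"].append(url)
    return dict(groups)


def broken_link_sources(statuses, links):
    """Map each failing URL (status >= 400) to the pages that link to it.

    `links` is an iterable of (source_url, target_url) pairs.
    """
    sources = defaultdict(set)
    for source, target in links:
        if statuses.get(target, 0) >= 400:
            sources[target].add(source)
    return {target: sorted(pages) for target, pages in sources.items()}


# Assumed crawl data, for illustration only.
STATUSES = {"/": 200, "/about": 200, "/old": 404, "/api": 500}
LINKS = [("/", "/about"), ("/", "/old"), ("/about", "/old"), ("/about", "/api")]

by_class = group_by_status_class(STATUSES)
to_fix = broken_link_sources(STATUSES, LINKS)
```

`to_fix` lists, for each broken target, the source pages whose links need editing, which is the same information the inlinks panel gives you.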
Redirect chains and loops
A single redirect is fine. Redirect chains, where URL A points to B which points to C which finally resolves at D, are not. Every hop adds latency, dilutes the signal passed along, and in long chains search engines may simply give up before reaching the destination. Use the redirect-chains report to find any URL that resolves through more than one hop, then update the original link or redirect rule to point straight at the final destination in one step. Redirect loops, where the chain eventually points back at itself, are worse: they trap crawlers and users in a circle and must be broken immediately.
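Chain and loop detection is simple enough to sketch. Assuming you have the redirects as a source-to-target mapping (the `REDIRECTS` data below is invented), the code follows each hop, labels the result, and produces a flattened rule set that points every source straight at its final destination.

```python
def resolve(url, redirects, max_hops=10):
    """Follow redirects from `url`.

    Returns (path, kind), where kind is 'ok' (zero or one hop),
    'chain' (more than one hop), 'loop', or 'too long'.
    """
    path = [url]
    while path[-1] in redirects:
        target = redirects[path[-1]]
        if target in path:
            return path + [target], "loop"
        path.append(target)
        if len(path) - 1 >= max_hops:
            return path, "too long"
    kind = "chain" if len(path) - 1 > 1 else "ok"
    return path, kind


def flatten(redirects):
    """Point every source directly at its final destination, skipping loops."""
    flat = {}
    for source in redirects:
        path, kind = resolve(source, redirects)
        if kind in ("ok", "chain"):
            flat[source] = path[-1]
    return flat


# Assumed redirect rules: one three-hop chain, one clean redirect, one loop.
REDIRECTS = {"/a": "/b", "/b": "/c", "/c": "/d", "/x": "/y", "/p": "/q", "/q": "/p"}

chain = resolve("/a", REDIRECTS)
loop = resolve("/p", REDIRECTS)
flat_rules = flatten(REDIRECTS)
```

The `max_hops` cutoff is an arbitrary safety limit for the sketch, not a documented threshold of any search engine.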
Missing and duplicate titles and meta
Open the page-titles tab and filter it. You are looking for three things: missing titles, duplicate titles, and titles that are truncated or padded out far past what displays in the results. A page with no title tag forfeits one of the strongest on-page relevance signals there is. Duplicate titles across many pages tell the search engine those pages are interchangeable, which is exactly the wrong message if they target different terms.
The same logic applies to meta descriptions and H1 headings. Crawl them, filter for blanks and duplicates, and treat each one as a small ranking and click-through opportunity left on the table. None of these are hard to fix; they are just impossible to find without a crawl.
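The three title checks are easy to express against exported data. In the sketch below, the `TITLES` data is invented and the 60-character `max_len` is an assumed display cutoff rather than an official limit. The same function works unchanged for meta descriptions or H1s with a different cutoff.

```python
from collections import defaultdict


def title_issues(titles, max_len=60):
    """Find missing, duplicate, and overlong titles.

    `titles` maps url -> title text (or None when the tag is absent).
    """
    missing, too_long = [], []
    groups = defaultdict(list)
    for url, title in titles.items():
        text = (title or "").strip()
        if not text:
            missing.append(url)
            continue
        if len(text) > max_len:
            too_long.append(url)
        # Compare case-insensitively so trivial variations still count.
        groups[text.casefold()].append(url)
    duplicates = {text: urls for text, urls in groups.items() if len(urls) > 1}
    return {"missing": missing, "duplicate": duplicates, "too_long": too_long}


# Assumed crawl export, for illustration only.
TITLES = {
    "/": "Acme Widgets",
    "/a": "Blue Widget | Acme",
    "/b": "blue widget | acme ",
    "/c": None,
    "/d": "   ",
    "/e": "W" * 70,
}

issues = title_issues(TITLES)
```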
Duplicate content and thin pages
Duplicate content is the quietest of these problems and often the most damaging. When several URLs serve substantially the same content, search engines have to choose which one to rank and may split signals across all of them. Common culprits are URL parameters, print versions, trailing-slash variants, HTTP and HTTPS both resolving, and faceted navigation generating endless combinations. Screaming Frog can hash page content to flag exact duplicates and compare content similarity to flag near-duplicates, and its address-variant data exposes the parameter and protocol issues. Thin pages, those with almost no unique body content, show up clearly when you sort by word count and deserve either expansion, consolidation, or removal.
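A rough sketch of how these checks work is shown below. It collapses common URL variants to one form, groups pages whose normalized text hashes identically, and lists pages under a word-count threshold. The `TRACKING` parameter list, the decision to force HTTPS, the `min_words` threshold, and the `PAGES` data are all assumptions for illustration. Note that hashing only catches exact duplicates; near-duplicates need a similarity measure, which this sketch does not attempt.

```python
import hashlib
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that never change page content.
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize(url):
    """Collapse protocol, host-case, trailing-slash, and tracking variants."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING])
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(("https", parts.netloc.lower(), path, query, ""))


def duplicate_groups(pages):
    """Group URLs whose whitespace-normalized, lowercased text is identical."""
    groups = defaultdict(list)
    for url, text in pages.items():
        canonical_text = " ".join(text.split()).lower()
        digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


def thin_pages(pages, min_words=150):
    """Return URLs whose body text has fewer than `min_words` words."""
    return [url for url, text in pages.items() if len(text.split()) < min_words]


# Assumed extracted body text, for illustration only.
PAGES = {
    "/shoes": "Red shoes for running and walking.",
    "/shoes?print=1": "Red  shoes for running\nand walking.",
    "/empty": "Coming soon",
}

dupes = duplicate_groups(PAGES)
thin = thin_pages(PAGES, min_words=5)
```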
Indexability and the signals that control it
This is where a crawl earns its keep, because indexability problems silently delete pages from search results without any visible symptom. A page can return a perfect 200 status and still be completely absent from the index because something told the search engine to stay away. Audit every page for the signals that govern whether it can rank:
- Noindex tags. A meta robots or X-Robots-Tag noindex directive removes a page from the index entirely. Crawl for it and confirm every noindexed page is one you genuinely want excluded.
- Canonical tags. Check that each page either self-canonicalizes or points at the correct preferred version. A canonical that points at the wrong URL hands your ranking signals to a different page, and a canonical loop confuses the crawler about which page is authoritative.
- Robots.txt blocks. Disallowed paths are not crawled, so the content of anything important hiding behind a disallow rule is invisible to search engines. Cross-reference the blocked URLs against the pages you actually want ranking.
- Orphan pages. Pages with no internal links pointing to them are hard for crawlers to discover and hard for users to reach. Combine the crawl with your XML sitemap and analytics to find URLs that exist but are linked from nowhere.
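The four checks in this list can be combined into one pass over crawl data. The sketch below is a simplified illustration: the page dictionary keys (`meta_robots`, `x_robots_tag`, `canonical`, `robots_blocked`) are hypothetical field names, and the example pages are invented.

```python
def indexability_reasons(page):
    """Return the reasons a page may be kept out of the index (empty if none)."""
    reasons = []
    if page.get("robots_blocked"):
        reasons.append("blocked by robots.txt")
    if page["status"] != 200:
        reasons.append(f"status {page['status']}")
    directives = f"{page.get('meta_robots') or ''},{page.get('x_robots_tag') or ''}"
    if "noindex" in directives.lower():
        reasons.append("noindex directive")
    canonical = page.get("canonical")
    if canonical and canonical != page["url"]:
        reasons.append(f"canonical points to {canonical}")
    return reasons


def orphan_candidates(sitemap_urls, link_targets):
    """URLs listed in the sitemap that no crawled page links to."""
    return sorted(set(sitemap_urls) - set(link_targets))


# Assumed crawl rows, for illustration only.
home = {"url": "https://example.com/", "status": 200,
        "canonical": "https://example.com/"}
category = {"url": "https://example.com/category", "status": 200,
            "meta_robots": "noindex, follow"}
shoes = {"url": "https://example.com/shoes", "status": 200,
         "canonical": "https://staging.example.com/shoes"}

orphans = orphan_candidates(
    ["/", "/category", "/landing"],  # from the XML sitemap
    ["/", "/category"],              # internal link targets seen in the crawl
)
```

An empty list from `indexability_reasons` does not guarantee the page is indexed. It only means none of these signals is telling search engines to stay away.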
Match indexability to intent
The goal is not zero noindex tags or zero canonicals. The goal is that every one of those signals matches what you actually want. We have seen real audits turn up a single misplaced noindex sitting on a category page that the owner desperately wanted to rank, and a canonical pointing at a staging URL that quietly leaked an entire section out of the index. Neither is visible from the front end. Both are obvious in a crawl.
Crawl Budget and Why Waste Matters
For most small sites you do not need to think about crawl budget at all; search engines will happily fetch every page you have. But on large sites, the number of pages a search engine is willing to crawl in a given window is finite, and every request it wastes on a redirect chain, a parameter duplicate, or a faceted-navigation dead end is a request it did not spend on a page you care about. Cleaning up the issues above is not just hygiene. On a big site it directly redirects crawler attention toward the pages that earn revenue.
This is the connective tissue of a good audit. Broken links, redirect chains, duplicate content, and stray indexability signals are not four separate problems. They are four ways the same site quietly wastes the finite attention a search engine gives it. A crawl shows you all four at once, in one table, ranked by how many URLs each one touches.
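As a back-of-the-envelope illustration of that ranking, the sketch below counts how many URLs each issue type touches and estimates what share of crawled URLs were spent on them. The issue rows and the total of 24 crawled URLs are made-up numbers.

```python
from collections import Counter


def rank_issues(rows):
    """Count URLs per issue type, most frequent first.

    `rows` is an iterable of (url, issue_type) pairs.
    """
    return Counter(issue for _url, issue in rows).most_common()


def wasted_share(total_crawled, rows):
    """Fraction of crawled URLs that carry at least one waste issue."""
    wasted_urls = {url for url, _issue in rows}
    return len(wasted_urls) / total_crawled


# Assumed issue rows, for illustration only.
ROWS = [
    ("/a", "redirect chain"),
    ("/b", "redirect chain"),
    ("/c", "redirect chain"),
    ("/d?x=1", "parameter duplicate"),
    ("/d?x=2", "parameter duplicate"),
    ("/e", "broken link"),
]

ranked = rank_issues(ROWS)
share = wasted_share(24, ROWS)
```

With this data, six of 24 crawled URLs, or 25 percent, went to pages that should not have needed a request at all.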
Want a professional read on your crawl data?
Running the crawl is the easy part. Knowing which of the thousands of rows actually move rankings, and in what order to fix them, is the work. See how we approach technical reviews on our SEO audits page, grab a no-cost starting point with a free site audit, or get in touch to talk through your crawl.
Turning a Crawl Into an Audit
A crawl is data. An audit is a decision about what to do with it. The difference between the two is the mindset you bring, and it is the part no tool can do for you. Four habits turn raw crawl output into something worth acting on.
Prioritize by impact, not by count
It is tempting to fix the issue with the biggest number first. Resist that. One broken canonical on a high-traffic category page matters more than two hundred missing meta descriptions on pages nobody visits. Cross-reference your crawl with traffic and revenue data so you are spending your effort where it pays back, not where the spreadsheet looks ugliest.
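One way to make that cross-reference concrete is to score each issue by severity multiplied by traffic. The sketch below does exactly that. The `SEVERITY` weights, the session counts, and the URLs are invented for illustration, and you would tune the weights to your own site.

```python
# Assumed severity weights; higher means more damaging.
SEVERITY = {
    "wrong canonical": 10,
    "broken link": 5,
    "missing meta description": 1,
}


def prioritize(issues, sessions):
    """Rank (url, issue) pairs by severity weight times monthly sessions.

    Returns a list of (score, url, issue), highest score first.
    """
    scored = [
        (SEVERITY.get(issue, 1) * sessions.get(url, 0), url, issue)
        for url, issue in issues
    ]
    return sorted(scored, reverse=True)


# One serious issue on a busy page versus 200 minor ones on quiet pages.
issues = [("/category/shoes", "wrong canonical")]
issues += [(f"/archive/{i}", "missing meta description") for i in range(200)]

sessions = {"/category/shoes": 12000}
sessions.update({f"/archive/{i}": 3 for i in range(200)})

ranked_fixes = prioritize(issues, sessions)
top = ranked_fixes[0]
rest_total = sum(score for score, _url, _issue in ranked_fixes[1:])
```

In this made-up example the single canonical problem scores 120,000 while all two hundred missing descriptions together score 600, which is the argument of this section in numbers.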
Crawl, fix, re-crawl
An audit is not a one-time event. After you ship fixes, run the crawl again and confirm the numbers actually moved. Re-crawling catches the new problems your fixes introduced, which happens more often than anyone likes to admit. A redirect added to solve one broken link can quietly create a chain somewhere else.
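Comparing two crawls comes down to set arithmetic on (URL, issue) pairs. The sketch below shows the idea with invented data. It assumes both crawls used the same configuration, since otherwise the differences reflect settings rather than fixes.

```python
def compare_crawls(before, after):
    """Split issues into fixed, remaining, and newly introduced.

    `before` and `after` are iterables of (url, issue_type) pairs.
    """
    before, after = set(before), set(after)
    return {
        "fixed": sorted(before - after),
        "remaining": sorted(before & after),
        "introduced": sorted(after - before),
    }


# Assumed issue lists from two crawls, for illustration only.
BEFORE = [("/old-page", "broken link"), ("/pricing", "missing title")]
AFTER = [("/pricing", "missing title"), ("/promo", "redirect chain")]

diff = compare_crawls(BEFORE, AFTER)
```

The `introduced` list is the one to watch: here the fix for the broken link shows up alongside a new redirect chain that did not exist before.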
Combine the crawl with other data
The crawl tells you what is on the site. It cannot tell you what search engines are actually doing with those pages. Pair it with Search Console to see which crawled URLs are getting impressions, which are excluded and why, and where coverage errors line up with the issues your crawl found. The two data sets together are far more powerful than either alone.
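The pairing can be as simple as a lookup between two exports. In the sketch below, `CRAWL` maps each crawled URL to whether the crawl considers it indexable, and `IMPRESSIONS` stands in for a Search Console performance export. Both data sets and their shapes are assumptions, since real exports have more columns.

```python
def cross_check(crawl, impressions):
    """Compare crawl indexability against search impressions.

    `crawl` maps url -> True if the crawl considers it indexable.
    `impressions` maps url -> impression count from a search data export.
    """
    return {
        # Indexable according to the crawl, yet never shown in results.
        "indexable_no_impressions": sorted(
            url for url, ok in crawl.items() if ok and impressions.get(url, 0) == 0
        ),
        # Flagged non-indexable, yet still appearing in results.
        "non_indexable_with_impressions": sorted(
            url for url, ok in crawl.items() if not ok and impressions.get(url, 0) > 0
        ),
        # Known to the search engine but never reached by the crawl.
        "in_search_not_crawled": sorted(set(impressions) - set(crawl)),
    }


# Assumed data, for illustration only.
CRAWL = {"/": True, "/shoes": True, "/old-sale": False}
IMPRESSIONS = {"/": 5400, "/old-sale": 120, "/landing": 75}

findings = cross_check(CRAWL, IMPRESSIONS)
```

Each bucket suggests a different follow-up: the first may point to quality or discovery problems, the second to a recently added directive, and the third to orphan pages.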
Document the why, not just the what
Every fix should have a recorded reason. Six months from now, when someone asks why a page is noindexed or why a redirect exists, the answer should be in writing. Undocumented technical SEO decisions are how sites slowly drift back into the mess the audit cleaned up.
Where to Start
If you have never crawled your own site, do it this week. Point a spider at your homepage, let it finish, and filter for the four things that matter most: broken links, redirect chains, missing titles, and anything that is unintentionally non-indexable. You will almost certainly find something you did not know was broken. That moment, the gap between the site you thought you had and the site the crawl reveals, is the entire reason the audit exists. Fix what matters, re-crawl to confirm, and pair the results with Search Console. The tool is just a flashlight. The value is in knowing where to point it and what to do about what you see.