Content Duplicate Checker

Free duplicate content scanner. Crawls your site and finds near-duplicate pages that may put it at risk of a Google Helpful Content demotion. No sign-up.

About the tool

What is the SBMM Content Duplicate Checker?

The SBMM Content Duplicate Checker is a free internal duplicate content scanner. It crawls up to 30 pages of your site, splits each page body into rolling five-word phrases (shingles), compares every URL pair, and reports each near-duplicate match with a similarity score. It is a fast way to spot the kind of content overlap associated with Helpful Content Update demotions.

Heavy internal duplication can work against a site under Google's Helpful Content Update. When the same paragraph or page appears at multiple URLs, Google has to pick a canonical, and it does not always pick the one you would. The pages compete with each other for ranking, split link equity, and make your content set as a whole look less helpful.

This checker finds the duplicates before Google does. It detects exact duplicates, near duplicates (a product variant page that copies most of its content from the master), boilerplate bleed (footer, sidebar, or template text dominating the body), and thin-content overlap. Every flagged pair shows you exactly which shingles match so you can verify the report and pick the right canonical or merge target.

Step by step

How to use this tool in 3 steps

  1. Step 01

    Enter your website URL

    Enter the homepage URL of any public website you own or have permission to crawl. The scanner respects robots.txt and stays within the free-tier crawl budget (30 pages), so it does not overload your origin server.

  2. Step 02

    Shingle + compare every pair

    The crawler fetches each page, strips boilerplate (header, footer, sidebar, navigation), shingles the body into rolling five-word phrases, and compares the shingle sets between every URL pair using a Jaccard similarity index.

  3. Step 03

    Fix what is flagged

    See every duplicate pair sorted by similarity score, with a list of matched shingles per pair. Choose to 301 redirect, canonicalise, noindex, or merge the duplicate. Re-run the audit after fixing to confirm the count drops.
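
The shingle-and-compare step above can be sketched in a few lines of Python. This is an illustration of the general technique, not the SBMM implementation: the function names (shingles, jaccard, compare_pages) and the tokenising regex are hypothetical, and the sketch assumes boilerplate has already been stripped from each page body.

```python
import re
from itertools import combinations


def shingles(text: str, k: int = 5) -> set:
    """Return the set of rolling k-word phrases (as word tuples) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # A text shorter than k words yields an empty set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_pages(pages: dict, k: int = 5) -> list:
    """Score every URL pair.

    pages maps URL -> body text. Returns a list of
    (score_0_to_100, url_a, url_b, shared_shingles), highest score first.
    """
    sets = {url: shingles(body, k) for url, body in pages.items()}
    results = []
    for a, b in combinations(sorted(sets), 2):
        shared = sets[a] & sets[b]
        score = round(100 * jaccard(sets[a], sets[b]), 1)
        results.append((score, a, b, shared))
    # Sort on the score only, so the shingle sets are never compared.
    return sorted(results, key=lambda r: r[0], reverse=True)


# Hypothetical page bodies for illustration.
pages = {
    "/a": "the quick brown fox jumps over the lazy dog",
    "/b": "the quick brown fox jumps over the sleepy cat",
    "/c": "completely different text about something else entirely here",
}
report = compare_pages(pages)
```

Comparing every pair is quadratic in the number of pages, which is manageable at a 30-page crawl limit (435 pairs); larger crawls would typically need a sketching scheme such as MinHash instead.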

Why this tool

Why use this tool

  • 5-word shingle fingerprinting

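    Each page body is shingled into rolling five-word phrases that capture both exact and near-duplicate overlap. Word-level shingling is a long-established technique for near-duplicate detection, and a five-word window is a practical middle ground: long enough to ignore common stock phrases, short enough to survive light rewording.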

  • Boilerplate detection

    The scanner identifies and excludes repeating template content (header, footer, sidebar, nav, related-posts widgets) so duplicate detection focuses on real body content, not the chrome that legitimately repeats.

  • Similarity score per pair

    Every flagged pair gets a 0 to 100 similarity score. Pairs above 70 are likely duplicates that should be redirected or merged; pairs from 40 to 70 are near-duplicate candidates worth canonicalising; pairs below 40 are usually safe overlap.

  • Crawls up to 30 pages free

    The free tier crawls 30 pages, which catches duplicate patterns on small to mid-sized sites and gives you a clean directional signal on larger sites before upgrading.

  • HCU-aligned reporting

    The findings are organised around the duplication patterns that matter most in a Helpful Content audit: thin-content overlap, near-duplicate product variants, paginated archives, and tag-archive bloat each surface as a distinct category.

  • Free, three runs a day

    Three full duplicate audits per day on the free tier cover most authoring workflows. SBMM Pro raises the crawl limit to 200 pages, adds cross-site comparison (your site vs a competitor), and exports the report as a styled PDF.
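
The similarity score bands listed above (above 70, 40 to 70, below 40) can be expressed as a small lookup. This is a minimal sketch; the function name and the band labels are illustrative, not the tool's actual output strings.

```python
def classify(score: float) -> str:
    """Map a 0-100 similarity score to one of the three bands."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score > 70:
        return "duplicate candidate"
    if score >= 40:
        return "near-duplicate candidate"
    return "usually safe overlap"
```

A score of exactly 70 falls in the middle band here and a score of exactly 40 does too; where the boundaries sit is an assumption of the sketch.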

FAQ

Frequently asked questions

What is duplicate content in SEO?

Duplicate content is text that appears at more than one URL, either within a single domain (internal duplication) or across multiple domains (external duplication). Internal duplication forces Google to pick a canonical, splits link equity between competing URLs, and can contribute to a Helpful Content Update demotion.

Will Google penalise me for duplicate content?

Google does not issue a formal penalty for duplicate content, but significant internal duplication can still cost you visibility under the Helpful Content Update. The practical effect can look much like a penalty: the duplicate pages stop ranking, and in severe cases a whole subfolder or domain loses visibility.

How is this different from Copyscape or Siteliner?

Copyscape is built for external plagiarism detection (someone else copying your content). Siteliner is the closest equivalent to this tool, but larger crawls require a paid plan. The SBMM Duplicate Checker focuses on internal duplication, runs free for small sites, and integrates with the rest of the SBMM SEO toolkit.

How does boilerplate detection work?

The scanner identifies blocks of HTML that repeat verbatim across multiple pages (header, footer, sidebar, navigation, related-posts widgets) and excludes them from the shingle comparison. This prevents the legitimate template from triggering false duplicate alerts in the body content audit.
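
One simple way to approximate this kind of boilerplate filter is to count how many pages each text block appears on and drop the blocks that repeat across most of the crawl. The sketch below assumes each page has already been split into text blocks; the function name and the 0.8 threshold are hypothetical choices, not the scanner's actual settings.

```python
from collections import Counter


def strip_boilerplate(pages: dict, min_share: float = 0.8) -> dict:
    """Drop blocks that appear verbatim on at least min_share of pages.

    pages maps URL -> list of text blocks (e.g. one block per HTML element).
    """
    page_count = len(pages)
    seen_on = Counter()
    for blocks in pages.values():
        seen_on.update(set(blocks))  # count each block once per page
    boilerplate = {
        block
        for block, n in seen_on.items()
        if n >= 2 and n / page_count >= min_share
    }
    return {
        url: [b for b in blocks if b not in boilerplate]
        for url, blocks in pages.items()
    }


# Hypothetical crawl: header and footer repeat everywhere,
# one paragraph is duplicated on two of the three pages.
crawl = {
    "/a": ["Site header", "Body A", "Shared paragraph", "Footer"],
    "/b": ["Site header", "Body B", "Shared paragraph", "Footer"],
    "/c": ["Site header", "Body C", "Footer"],
}
bodies = strip_boilerplate(crawl)
```

The threshold matters: set it too low and a paragraph genuinely duplicated on a handful of pages is discarded as template text, which hides exactly the overlap the audit is meant to find.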

How do I fix near-duplicate pages?

For exact duplicates, pick one canonical URL and 301 redirect the others to it. For intentional near-duplicates (product variants, archive pages), add a rel="canonical" link pointing to the primary version. For thin or low-value duplicates, noindex them or merge their content into a richer page. Re-run the audit after each fix to confirm the count drops, then run our Site Audit Pro to confirm the canonical chain is clean and the redirects do not loop.
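
Before deploying a batch of 301s, the planned redirect map can be checked for loops and long chains offline. The sketch below is an assumption-laden illustration: it works on a plain dictionary of source path to target path, makes no network requests, and is unrelated to how Site Audit Pro performs its checks. The function name and the max_hops limit are hypothetical.

```python
def resolve_redirect(url: str, redirects: dict, max_hops: int = 10) -> tuple:
    """Follow a planned redirect map and return (final_url, hops).

    Raises ValueError if the chain loops or exceeds max_hops.
    """
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or over-long chain at {url}")
        seen.add(url)
    return url, hops


# Hypothetical redirect plan: /old reaches /new in two hops.
plan = {"/old": "/mid", "/mid": "/new"}
final_url, hops = resolve_redirect("/old", plan)
```

Any result with more than one hop is a chain worth flattening, so that each retired URL redirects straight to its final target.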

When should I use canonical instead of 301?

Use rel="canonical" when both URLs need to stay reachable for users (product variants by colour, paginated archives, tracking-parameter URLs) but only one should rank. Use a 301 redirect when the duplicate URL has no reason to exist (an old slug, a content merge, a category rename). A 301 sends both users and link equity to the target; a canonical keeps both URLs live and asks search engines to consolidate ranking signals on one of them.

Does pagination count as duplicate content?

Paginated archives are not duplicate content in the strict sense (each page lists a different set of URLs), but they often share boilerplate intro text that can trigger soft-duplicate flags. The scanner detects pagination patterns and separates them from real duplicate cases in the report.
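
Pagination is often recognisable from the URL alone. As a rough sketch, the helper below looks for two common conventions, a trailing /page/N/ path segment and a ?page=N query parameter. These two patterns and the function name are assumptions made for illustration; the scanner's actual detection rules are not documented here.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Matches a trailing "/page/<number>" path segment, with or without a final slash.
_PATH_PAGE = re.compile(r"/page/(\d+)/?$")


def page_number(url: str) -> Optional[int]:
    """Return the page number encoded in url, or None if it does not look paginated."""
    parts = urlsplit(url)
    match = _PATH_PAGE.search(parts.path)
    if match:
        return int(match.group(1))
    values = parse_qs(parts.query).get("page")
    if values and values[0].isdigit():
        return int(values[0])
    return None
```

URLs that return a page number can then be grouped by their base path and reported separately from ordinary duplicate pairs.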

Can I scan a site I do not own?

Technically yes, because the scanner only fetches public URLs the same way Google does. In practice, you should only audit sites you own or have written permission to audit. Respect the target site's robots.txt, crawl-rate signals, and bandwidth.
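
If you script your own checks, Python's standard library can evaluate robots.txt rules before any page is fetched. The sketch below parses a robots.txt body that has already been downloaded; the user agent name ExampleAuditBot and the helper name are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, user_agent: str = "ExampleAuditBot") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks one directory for all crawlers.
robots = "User-agent: *\nDisallow: /private/\n"
can_crawl_blog = allowed(robots, "https://example.com/blog/")
can_crawl_private = allowed(robots, "https://example.com/private/page")
```

The same parser also exposes crawl_delay(user_agent), which returns the site's requested delay between requests when one is declared.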