We Tested Our Own Score Against Real AI Citations. Here Is What We Found.


Every AI visibility tool on the market implies or outright claims that its score predicts whether AI will recommend your business. We wanted to know if ours does. So we tested it.

The short answer: our structural readiness score does not predict citations on its own. But it reliably measures the part you can fix.

That distinction matters enormously, and the fact that we are telling you this — when our competitors are not — is exactly the point.

What we did

We selected 40 websites across 6 industries: accountancy, legal, marketing, recruitment, financial planning, and IT services. 20 were sites we expected to perform well (established brands, strong content, visible online presence) and 20 were sites we expected to perform poorly (thin content, limited online presence, minimal structure).

We did two things with each site:

Ran our Visibility Score scan — all 44 checks, generating a structural readiness score from 0 to 100.
Tested actual AI citation — we asked Perplexity AI 5 industry-relevant questions per site (e.g., "best accountants in Manchester for R&D tax credits") and recorded whether each site was cited in the response.

Then we compared the two measurements.

The results

Validation results — 40 websites

AUC (structural separation): 0.94
Correlation, score vs. citations: 0.31
High-scoring sites cited by AI: 45%
Low-scoring sites cited by AI: 15%

Two numbers that tell different stories.

AUC 0.94 — the score works for what it measures

The AUC (area under the ROC curve) of 0.94 means our score reliably separates well-built websites from poorly built ones. If you pick a random site from the well-built group and a random site from the poorly built group, our score will rank the well-built site higher 94% of the time. This is a strong result, and it validates the structural assessment methodology.
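To make that 94% concrete, the pairwise reading of AUC can be sketched in a few lines of Python. This is an illustration rather than our production code: the function name pairwise_auc and the ten scores below are made up for the example and are not taken from the study.

```python
from itertools import product


def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, which matches the usual rank-based AUC.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for pos, neg in product(positive_scores, negative_scores):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores for illustration only (not the study's data).
well_built = [82, 75, 91, 68, 55]
poorly_built = [40, 58, 33, 61, 47]

auc = pairwise_auc(well_built, poorly_built)
print(f"AUC = {auc:.2f}")  # 23 of the 25 pairs are ranked correctly
```

An AUC of 0.5 would mean the score ranks such pairs no better than a coin flip, and 1.0 would mean it never gets a pair wrong.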

Correlation 0.31 — structure alone does not predict citations

The correlation between our structural score and actual AI citations was 0.31. This is a weak positive correlation. Sites with better structural readiness are somewhat more likely to be cited, but the relationship is far from deterministic.
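For readers who want to check a figure like this against their own data, here is a minimal sketch of the Pearson calculation in plain Python. The function name and the four data points are hypothetical, chosen only so the arithmetic is easy to follow; they are not the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: structural score (0-100) vs. citation rate (0-1).
scores = [20, 40, 60, 80]
citation_rates = [0.0, 0.2, 0.2, 0.4]

r = pearson(scores, citation_rates)
print(f"r = {r:.2f}")
```

A value near 1 means the two measurements rise together almost in lockstep; a value of 0.31 means they move in the same direction but the score explains only a small share of the variation in citations.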

Why? Because AI citation depends on two independent factors:

The two factors of AI visibility

Structural readiness: Can AI find, read, and interact with your site?
Authority and reputation: Does AI consider you a credible source?

Our score measures the first factor. The second factor — authority — is built through years of publishing quality content, earning backlinks, being mentioned in reputable sources, and building a genuine online presence. You cannot fix authority with a code change. You cannot fix structural readiness without one.

What this means in practice

We found four distinct groups in our data:

The four quadrants

High structure + high authority: cited frequently (85%)
Low structure + high authority: cited sometimes (40%)
High structure + low authority: rarely cited (20%)
Low structure + low authority: almost never cited (5%)

The most important finding: sites with high authority but low structure were cited 40% of the time, against 85% for sites with both. That 45-point gap means they earn fewer than half the citations of comparable well-structured sites. These are well-known businesses with strong reputations whose websites are hard for AI to read because of structural gaps: missing structured data, JavaScript-dependent content, no machine-readable services.

This is the group that benefits most from our scan. They already have authority. They just need their website to be readable by the AI systems that are trying to find them.
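As an illustration of what one such structural gap looks like in code, the sketch below checks a page's raw HTML for JSON-LD structured data. It is a simplified, hypothetical example and not one of our 44 checks as implemented: the class name, the function name, and the sample page are assumptions made for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdFinder(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def structured_data_types(html):
    """Return the @type of each top-level JSON-LD item found in the raw HTML."""
    finder = JsonLdFinder()
    finder.feed(html)
    finder.close()
    types = []
    for block in finder.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is unreadable, so skip it
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "@type" in item:
                types.append(item["@type"])
    return types


# Hypothetical page for illustration.
page = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "AccountingService", '
    '"name": "Example Ltd"}'
    "</script></head><body></body></html>"
)
print(structured_data_types(page))
```

Because the sketch reads only the HTML as served, structured data injected later by JavaScript would not be found, which is the same blind spot a crawler has when it does not execute scripts.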

Why we are publishing this

Most tools in this space would not publish these numbers. A correlation of 0.31 between their score and the thing customers care about (actual citations) would be buried. The AUC of 0.94 would be headline news. The nuance would be quietly ignored.

We are publishing both because we believe the nuance is the most valuable thing we can offer.

If we told you our score predicts citations, we would be lying. You would optimise for the score, see no change in citations, and rightfully lose trust. Instead, we tell you the truth:

Our score measures structural readiness — the part you can fix. Authority is the other half, and it takes time to build. You need both. We measure the one you can change this week.

This is the foundation of trust. When a tool tells you what it cannot do, you can believe what it says it can do.

What the competitors claim

We reviewed the claims made by 8 other AI visibility tools. None of them has published validation data. Several explicitly claim their score "predicts AI citations" or "measures AI visibility" without distinguishing between structural readiness and authority.

We do not name them here because the point is not to attack competitors. The point is that this kind of validation — testing your own score against real-world outcomes and publishing the results honestly — should be the minimum standard. We believe it will become the minimum standard. We are just doing it first.

Methodology

For full transparency, here is exactly how we ran this study:

Sample: 40 UK-based business websites across 6 professional service industries.
Selection: 20 sites predicted to score high (strong online presence, established brands) and 20 predicted to score low (thin content, limited presence). We deliberately sampled the two ends of the range, which makes separation easier to detect and likely yields a higher AUC than a random sample would.
Structural scan: Our standard 44-check scan, run between February and March 2026.
Citation test: 5 industry-relevant queries per site, submitted to Perplexity AI. A citation means the site URL appeared in Perplexity's response or sources. Each query was run 3 times to account for response variability.
AUC calculation: Binary classification — sites with Visibility Score above median vs. below. AUC measures how well the score separates the two groups.
Correlation: Pearson correlation between Visibility Score (0-100) and citation rate (0-1, proportion of queries where the site was cited).
Limitations: 40 sites is a small sample. Perplexity's citation behaviour changes over time. Results may not generalise to ChatGPT, Google AI Overviews, or other engines. We will repeat this study at larger scale as we accumulate more scan data.
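The citation rate described above can be sketched as follows. The methodology states the number of queries and runs but not how the 3 runs of a query are combined, so the min_runs_cited parameter is an assumption: by default a query counts as cited if the site appeared in at least one run. The function name and the sample results are hypothetical.

```python
def citation_rate(runs_per_query, min_runs_cited=1):
    """Proportion of queries in which the site counts as cited.

    runs_per_query: one inner list per query, holding one bool per run
                    (True if the site URL appeared in that run's response).
    min_runs_cited: assumed aggregation rule; a query counts as cited when
                    at least this many of its runs cited the site.
    """
    if not runs_per_query:
        raise ValueError("need at least one query")
    cited_queries = sum(
        1 for runs in runs_per_query if sum(runs) >= min_runs_cited
    )
    return cited_queries / len(runs_per_query)


# Hypothetical results for one site: 5 queries, 3 runs each.
site_runs = [
    [True, True, False],
    [False, False, False],
    [False, True, False],
    [False, False, False],
    [True, True, True],
]

rate_any = citation_rate(site_runs)                         # cited in any run
rate_majority = citation_rate(site_runs, min_runs_cited=2)  # cited in 2+ runs
print(rate_any, rate_majority)
```

The two rules give different rates for the same raw results, so the aggregation rule has to be fixed before citation rates can be compared between studies.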

What happens next

As we scan more websites, we will build a larger validation dataset. We intend to repeat this study quarterly, testing against multiple AI engines (Perplexity, ChatGPT, Google AI Overviews, Claude) and across a wider range of industries.

We will publish every result, whether the numbers improve or not. If our score gets better at predicting citations, we will show that. If it does not, we will explain why structural readiness is still worth measuring even when it does not predict the whole picture.

That is what accountability looks like.

Measure the part you can fix

44 structural checks. Instant results. See exactly what your web developer should change this week.
