I Plugged the Same Site Into 7 AI-Citation Trackers. They Reported 7 Different Numbers.

DEV Community

Why the seven numbers diverge

Read side by side, the vendor docs turn the mystery into a definition problem. The numbers vary on four axes.

Four axes the citation numbers diverge on: what counts as a citation, which LLMs are sampled, how often and dedup rules, and which languages are queried

1. What counts as a citation

This is the big one. Every tool counts a different thing and calls it the same word.

Profound counts a citation only when the answer includes a clickable source link pointing at your domain. Strict, and useful for attribution. It misses any mention where the model talks about your brand without linking.
Peec AI counts any mention of your brand name in the answer text, link or no link. So if Perplexity says "Ken Imoto wrote a useful guide on voice AI," Peec counts that as a citation, even with no link. That is why its number is the biggest.
Otterly AI counts a cited URL, like Profound, but de-duplicates per query per day, which crushes the number down.
Bluefish AI is really running a share-of-voice calculation against competitors, so its "citations" number reads closer to a rank than a count.
Scrunch counts both brand mentions and source links with no dedup, which lands it in the middle-high range.
Semrush only counts when your domain shows up in the URL field of the structured answer, the strictest reading.
My Python script counts whatever I tell it to, which today is "the brand string appears in the answer text, deduped per query, three samples averaged."

This split is not specific to me. The 2026 tooling guides now draw the same line: brand mentions are how often a model says your name, citations are when it links or attributes a source. Some platforms (Profound, Peec AI, AthenaHQ) break out explicit versus implicit citations at the URL level; others report brand-level visibility only. Pick any two definitions and they will not agree. That is the field not having a shared standard yet.
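To make the definition gap concrete, here is a minimal sketch that scores the same five answers under a mention-style rule, a link-style rule, and a combined rule. The `Answer` shape, the function names, the sample answers, and the `example.com` domain are all illustrative assumptions; none of this is any vendor's actual schema or counting logic.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Answer:
    """One sampled LLM answer. Hypothetical shape, not any vendor's schema."""
    text: str
    source_urls: list[str] = field(default_factory=list)


def mentions_brand(answer: Answer, brand: str) -> bool:
    # Mention-style definition: the brand string appears in the answer text.
    return brand.lower() in answer.text.lower()


def links_to_domain(answer: Answer, domain: str) -> bool:
    # Link-style definition: at least one source URL is hosted on the domain.
    for url in answer.source_urls:
        host = urlparse(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            return True
    return False


def count_by_definition(answers: list[Answer], brand: str, domain: str) -> dict[str, int]:
    # Score the same answers three ways; each answer counts at most once per rule.
    return {
        "brand_mention": sum(mentions_brand(a, brand) for a in answers),
        "source_link": sum(links_to_domain(a, domain) for a in answers),
        "mention_or_link": sum(
            mentions_brand(a, brand) or links_to_domain(a, domain) for a in answers
        ),
    }


# Made-up sample answers, for illustration only.
answers = [
    Answer("Ken Imoto wrote a useful guide on voice AI."),
    Answer("This guide covers voice AI.", ["https://example.com/voice-ai"]),
    Answer("Ken Imoto explains LLMO here.", ["https://www.example.com/llmo"]),
    Answer("A post by ken imoto is worth reading."),
    Answer("No relevant sources found.", ["https://other.example.org/x"]),
]

counts = count_by_definition(answers, brand="Ken Imoto", domain="example.com")
print(counts)  # {'brand_mention': 3, 'source_link': 2, 'mention_or_link': 4}
```

Same five answers, three different totals, and nothing about the underlying data changed. Only the predicate did.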

2. Which LLMs they sample

Not every tool covered all five engines I cared about. Peec AI samples all five, which gives it more surface area and is part of why its number is highest. Scrunch samples only ChatGPT and Perplexity, which makes its high number more interesting: more citations from fewer surfaces. If you only care about ChatGPT, your choice of tracker matters less. If you care about Gemini or Claude, you can cross half the list off immediately.

3. How often they sample

Most tools run each query daily. Some run weekly. Otterly runs daily but deduplicates inside a 24-hour window, so a brand mentioned five times in one day counts once. Peec AI runs daily and counts each mention on its own. Over fifteen days and twelve queries, that compounds fast.
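A rough sketch of how the dedup rule compounds over the test window. It assumes the extreme case where every one of the twelve queries produces five mentions on every one of the fifteen days; that uniform rate is a made-up illustration, not measured data, and the two function names are hypothetical.

```python
DAYS = 15
QUERIES = 12
MENTIONS_PER_QUERY_PER_DAY = 5  # assumed uniform rate, for illustration only

# One event per observed mention, recorded as (day, query_id).
events = [
    (day, query_id)
    for day in range(DAYS)
    for query_id in range(QUERIES)
    for _ in range(MENTIONS_PER_QUERY_PER_DAY)
]


def count_every_mention(events: list[tuple[int, int]]) -> int:
    # No dedup: each mention counts on its own.
    return len(events)


def count_deduped_per_query_per_day(events: list[tuple[int, int]]) -> int:
    # Dedup inside a 24-hour window: one count per (day, query) pair.
    return len(set(events))


raw_total = count_every_mention(events)
deduped_total = count_deduped_per_query_per_day(events)

print(raw_total)      # 900
print(deduped_total)  # 180
```

Under these assumptions the two rules land 5x apart before any other difference between tools comes into play.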

4. Whether they sample in your languages at all

I publish in four languages. Most trackers default to English-only sampling unless you configure language sets by hand. Peec AI gave me the most useful multilingual number because it queries in 115+ languages by default. The rest mostly ignored my PT and ES content, which is why their numbers undercount what is actually happening in Brazilian and LatAm search.

Pick the definition, then pick the tool

After two weeks staring at this, I think "which tracker is most accurate" is the wrong question. There is no ground truth for AI citations. Every LLM is a black box that returns slightly different answers to the same prompt depending on time, region, and which datacenter you hit. There is no Search Console for this.

The right question: which definition of "citation" maps to the business outcome you actually care about?

Want attribution traffic (someone clicks a link)? Use Profound or Otterly. They count linked citations only. The numbers stay small, but they map to GA4 referrer events you can verify.
Want brand presence (the model says your name, link or not)? Use Peec AI. The number looks generous, but it is the closest proxy to "ChatGPT says my name out loud."
Want competitive positioning? Use Bluefish or Scrunch. Both run competitor sets natively.
Want the truth on a budget? Write your own script. Mine is 200 lines of Python around the OpenAI, Anthropic, and Perplexity APIs, runs about $8 a month, and hands me raw answer text to grep through, which the commercial tools mostly hide behind charts.
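For the do-it-yourself route, the counting core can be small. The sketch below is one reading of "the brand string appears in the answer text, deduped per query, three samples averaged," and it is not the actual script. The `ask` parameter is a hypothetical callable that takes a query and returns answer text; in a real script it would wrap whichever API client you use. Here it is replaced with canned answers so the example runs offline.

```python
from statistics import mean
from typing import Callable


def mention_rate_per_query(
    queries: list[str],
    brand: str,
    ask: Callable[[str], str],
    samples: int = 3,
) -> dict[str, float]:
    """Fraction of sampled answers per query that contain the brand string."""
    rates = {}
    for query in queries:
        hits = []
        for _ in range(samples):
            answer_text = ask(query)
            # Dedup inside one answer: repeated mentions still count once.
            hits.append(1 if brand.lower() in answer_text.lower() else 0)
        rates[query] = mean(hits)
    return rates


# Canned answers standing in for real API responses (made up for illustration).
canned = iter([
    "Ken Imoto wrote a guide. Ken Imoto also covers WebRTC.",
    "Try the official documentation.",
    "A post by ken imoto is a good start.",
    "No strong sources for this.",
    "No strong sources for this.",
    "No strong sources for this.",
])


def fake_ask(query: str) -> str:
    # Hypothetical stand-in for an API call; returns the next canned answer.
    return next(canned)


rates = mention_rate_per_query(
    ["best voice AI guide", "what is LLMO"], "Ken Imoto", fake_ask
)
total = sum(rates.values())
print(rates)
print(total)
```

Keeping `ask` injectable means the definition lives in one function and can change without touching the API code, and the raw answer text stays available for grepping.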

Until the field agrees on a shared definition, every vendor keeps counting differently under the same word. A shared taxonomy would fix this: a standard for what "citation", "mention", and "source link" mean across tools, so the numbers become comparable. The Citation Signals work at llmoframework.com is one attempt at exactly that vocabulary.

What I actually run

Honest answer: two trackers, not seven.

I kept Otterly because it is cheap and its strict definition lines up with what I can verify in GA4. If Otterly says I got cited and GA4 shows a referrer click, I trust both. I kept my own Python script because it hands me raw text and I can change the definition tomorrow if I want.

I dropped the rest. Not because they are bad. Because paying $499 a month for a number I could not reconcile against a $29 tool was making me dumber, not smarter.

If you are about to spend money on an AI-citation tracker, do this first: write down what "citation" means to you, in one sentence. Then ask each vendor whether their definition matches yours. Most will not answer cleanly. That is your answer.

I wrote a book about exactly this measurement problem, including the Python script I use and the GA4 setup that pairs with it: LLMO: AI Search Optimization.

Top comments (2)

Performance Dev • Jun 2

Really thorough breakdown of the definition problem, Ken. The 8.2x spread isn't surprising once you realize each tool answers a different question — but most teams won't do the cross-comparison work you did.

One subtlety worth adding: crawl topology interacts with citation tracking in ways most vendors don't talk about. If your site has 30,000 pages crawled but only 4,000 indexed, your citation numbers will be artificially depressed because the AI models are sampling a fraction of your content.

The $8/mo Python script is the right call. We built a similar approach and the ability to grep raw answer text catches patterns that commercial tools bury behind aggregation.

Out of curiosity: did you see any variance in what free-tier versus paid-tier responses from the same LLM cited? ChatGPT's free tier cites about 2x more broadly, presumably from a coarser model path.

Performance Dev • Jun 2

Great point about the definition problem being the axis everything depends on — the 4-axis breakdown makes it clear that "citation" doesn't mean the same thing across any two tools. I'd add a corollary to axis #1 (what counts): even when two tools DO define a citation the same way, they're sampling LLM responses at different times, which means they're querying what are effectively different models (ChatGPT's API version from an hour ago vs now can return completely different citations for the same query string). The temporal variance between API endpoints may actually be wider than the definitional variance between tools. Have you tested running the same tracker back-to-back on the same day to isolate the time-of-query variance?

Ken Imoto

WebRTC & Voice AI Engineer at Propel-Lab. Building real-time AI communication systems. Context Engineering & LLMO Framework. Author of "LLMO" & "Practical Claude Code" on Kindle. Qiita 67,000+ PV.

