v2 · The evidence layer

Every score now ships with the evidence that produced it.

A critic gave the original demo an 8/10 with five fair complaints: numbers felt manufactured, voice DNA was vibes, no counterfactual, scores didn't discriminate, no data-freshness signal. Every one is fixed. Every tool now returns provenance, signals, and a counterfactual. Agents stop paraphrasing and start citing.

5 weaknesses → 5 fixes

What changed, point by point.

Fix #1

Numbers come with receipts.

"87/100 — no visible rubric. Power users will smell this."

Every score now ships a signals array. Each signal has a name (snake_case), a value from 0–10, and an evidence field carrying the concrete phrase from the input that drove the score. Agents cite the phrase instead of paraphrasing the number.

"signals": [
  {
    "name": "specificity",
    "value": 9,
    "evidence": "uses 4 concrete numbers (24, $58K, $100K, 12%)"
  },
  {
    "name": "credibility_signal",
    "value": 8,
    "evidence": "'no side hustle' attaches a verifiable constraint"
  },
  ...
]
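A minimal sketch of how an agent can turn that signals array into citations instead of paraphrase. The response shape mirrors the example above; the helper name `format_citations` is illustrative, not part of the Hooklayer API.

```python
def format_citations(signals):
    """Render each signal as a citable line: name, 0-10 value, and the
    exact evidence phrase the scorer pulled from the input."""
    return [
        f'{s["name"]} {s["value"]}/10: "{s["evidence"]}"'
        for s in signals
    ]

# Shape taken from the example response above.
signals = [
    {"name": "specificity", "value": 9,
     "evidence": "uses 4 concrete numbers (24, $58K, $100K, 12%)"},
    {"name": "credibility_signal", "value": 8,
     "evidence": "'no side hustle' attaches a verifiable constraint"},
]

for line in format_citations(signals):
    print(line)
```

Each line the agent emits in chat now carries the source phrase, so the claim is checkable against the input.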
Fix #2

Provenance is a first-class field.

"Are those view counts real and recent? If they're plausible-but-fabricated, the whole output is sand."

Every response includes as_of plus a provenance block. analyze_account exposes data_source, fetched_at, and video_post_dates[]. trend_pulse exposes per-entry sources, signal_window, and signal_strength, plus top-level cache_age_hours and cache_status.

"provenance": {
  "data_sources": [
    "Google Trends RSS (regional + global)",
    "YouTube Data API (velocity)",
    "Reddit hot (cross-subreddit topic surfacing)",
    "Internal viral-video corpus (100K+ videos)"
  ],
  "cache_age_hours": 3,
  "cache_status": "fresh",
  "rubric_version": "trend_pulse.v1"
}
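A sketch of the freshness gate an agent might run before trusting cached trend data. The 6-hour threshold and the helper name are assumptions for illustration, not documented Hooklayer defaults; field names follow the provenance block above.

```python
def freshness(cache_age_hours, max_fresh_hours=6):
    """Classify cached trend data by age.

    max_fresh_hours is an assumed policy knob, not a Hooklayer constant.
    """
    return "fresh" if cache_age_hours <= max_fresh_hours else "stale"

# Cross-check the server's own label against the age it reports.
provenance = {"cache_age_hours": 3, "cache_status": "fresh"}
assert freshness(provenance["cache_age_hours"]) == provenance["cache_status"]
```

An agent that sees "stale" can say so in chat, or re-call the tool, instead of presenting hours-old velocity numbers as current.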
Fix #3

Scores discriminate. No more 82/85/87 clusters.

"Three different scripts, similar scores, similar reasoning. The scoring model isn't discriminating enough to be useful as a decision tool."

Six calibration anchors at 10 / 30 / 50 / 70 / 85 / 95 are now wired into the system prompt with example hooks per anchor. Temperature dropped from 0.5 to 0.2. The model is required to name the closest anchor in its why field. predict_virality adds a calibration_check field that flags when a score is overstated.

"calibration_check": {
  "anchor": "80",
  "agreement": "agrees — strong hook + scene structure + named CTA,
                missing a second pattern interrupt to hit 95"
}
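The "name the closest anchor" rule can be sketched as a nearest-neighbour snap over the six anchors listed above. Pure illustration of the mechanic, not Hooklayer's internal code; the function name is hypothetical.

```python
# The six calibration anchors from the system prompt.
ANCHORS = [10, 30, 50, 70, 85, 95]

def closest_anchor(score):
    """Return the calibration anchor nearest to a 0-100 score."""
    return min(ANCHORS, key=lambda a: abs(a - score))

# A raw score of 82 must be justified against the 85 anchor, not 70.
print(closest_anchor(82))  # prints 85
```

Forcing the model to name the nearest anchor in its why field is what breaks up the 82/85/87 clusters: every score has to defend its distance from a concrete example hook.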
Fix #4

Voice DNA as math, not vibes.

"Voice DNA labels are vibes, not systems. 'Excited Discovery Evangelist' sounds good but isn't reproducible."

match_voice now returns a deterministic voice_metrics block alongside the qualitative profile: type-token ratio (100-word moving window), filler rate per 100 words (matched against 18 filler tokens), average sentence length, and the top recurring 2- and 3-grams with counts. Reproducible across runs, citable as numbers.

"voice_metrics": {
  "vocab_diversity_ttr": 0.62,
  "filler_rate_per_100_words": 1.4,
  "avg_sentence_length_words": 8.2,
  "signature_phrases": [
    { "phrase": "real talk", "count": 4 },
    { "phrase": "you need to", "count": 3 }
  ]
}
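A sketch of how metrics like these can be computed deterministically. The filler list here is a small assumed subset (match_voice uses 18 tokens), the TTR uses a single 100-word window rather than a moving one, and only 2-grams are counted, all for brevity; the output keys mirror the block above.

```python
import re
from collections import Counter

# Assumed subset of filler tokens, for illustration only.
FILLERS = {"like", "um", "uh", "literally", "basically", "honestly"}

def voice_metrics(text):
    """Compute a simplified, reproducible voice_metrics block."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    window = words[:100]  # single 100-word window in this sketch
    ttr = len(set(window)) / len(window) if window else 0.0
    filler_rate = 100 * sum(w in FILLERS for w in words) / max(len(words), 1)
    avg_len = len(words) / max(len(sentences), 1)
    bigrams = Counter(zip(words, words[1:]))  # 2-grams only, for brevity
    top = [(" ".join(g), c) for g, c in bigrams.most_common(2)]
    return {
        "vocab_diversity_ttr": round(ttr, 2),
        "filler_rate_per_100_words": round(filler_rate, 1),
        "avg_sentence_length_words": round(avg_len, 1),
        "signature_phrases": [{"phrase": p, "count": c} for p, c in top],
    }

print(voice_metrics("Real talk. You need to move. Real talk again."))
```

Because every number falls out of counting, two runs over the same transcript produce the same block, which is the whole point: the label "Excited Discovery Evangelist" can now sit on top of figures anyone can recompute.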
Fix #5

Every tool says what would have failed.

"No counterfactual. A real intelligence layer would say 'here's the version that scored 45 and why.'"

score_hook, predict_virality, and analyze_account all return a would_fail_because field. One sentence. Names the closest version that drops 20+ points and the reason. Tells agents what to avoid, not just what to copy.

"would_fail_because": "If this hook said 'crypto secrets' instead
of 'no crypto, no dropshipping', the anti-hype share trigger
disappears and share rate drops roughly 40%."
Why it matters

For the people the demo was for.

Agent builders
Stop paraphrasing. Cite.

When Claude / Cursor / n8n calls a Hooklayer tool, every claim it makes in chat can be backed by the signals + provenance block in the response. AI Twitter stops calling it slop.

Content teams
A real decision tool.

Scores that don't discriminate are useless. Calibrated scores + visible signals + counterfactuals make hook scoring an actual quality gate — not just a vibes meter.

Critics
"Show your work" — done.

Every score is auditable: the model has to name the calibration anchor, list six signals with evidence, and write the counterfactual that would have failed. The intelligence is in the response, not behind it.
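The audit described above can be mechanized agent-side. A sketch of a response validator, assuming the field names shown in this post and the six-signal requirement; the helper name `is_auditable` is illustrative.

```python
def is_auditable(resp):
    """True only if a scored response ships its evidence: a named
    calibration anchor, six signals that each carry an evidence phrase,
    and the counterfactual."""
    signals = resp.get("signals", [])
    return (
        "anchor" in resp.get("calibration_check", {})
        and len(signals) == 6
        and all(s.get("evidence") for s in signals)
        and bool(resp.get("would_fail_because"))
    )
```

An agent (or a CI check on prompt changes) can refuse to surface any score that fails this gate, which keeps "show your work" enforced rather than aspirational.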

See the evidence layer in your agent.

100 free credits. No card. The full 7-tool chain in your client in under 5 minutes.