We can't predict how the market reacts to an ad. We can make sure the ad is written well. So this rubric does one job: judge whether an ad is good enough to ship — and if it is, it's ready to ship on your call. No legal kill-switches. It scores the craft, flags the things that make an ad bad, and hands you one of four verdicts.
It is niche-agnostic — the same bar works across coaching, health and skincare, because the same things make an ad bad in all of them. The expert-specific details (proof names, mechanism, tone) live in each expert's config, not in the rubric. Every shorthand term is spelled out on first use and again in the glossary at the foot.
Every reviewed ad gets one of four labels. They're the output of the rubric, not an invented score band.
Two layers feed it. The no-nos (§2) are the things that make an ad bad — trip one and the verdict drops below green (how far is the operator's call). The quality bar (§3) is what makes an ad great when no no-nos fire. The system flags; you decide. Nothing auto-rejects on a guess about the market — a flagged ad can still be shipped on your call, and a clean ad still gets your eyes before it runs.
These are the real, recurring kill-signals: the patterns that make an ad bad, seen again and again in Amelia's live account, the founder's reviews, and the high-signal advertisers we model. Tripping one caps the verdict below GREEN LIGHT. The top two are by far the most common.
The whole game is trust. An ad that sounds like a hype-marketer destroys it. It's the most common kill, across every expert.
"…selling these for $750 a piece using nothing but a simple AI system. Don't blink…"
Tells: fake exposé framing ("I wasn't supposed to show you this", "before the gatekeepers catch on"), unverifiable press name-drops, "borderline illegal" energy.
If the first line creates no real intrigue, or is so vague that it raises five questions and sends the mind in five different directions, it fails. The fix is grounded specificity, not more hype.
The ad's only job is to earn the click to the free training. Describing the course, or inventing a "community," breaks the funnel.
Manufactured "only a few spots left" pressure reads as scammy to a sophisticated audience.
It's a person talking to camera, not copy on a page. No "as I said above," no fascinations no human would say aloud.
Generic AI filler and "telling" instead of "showing" get cut on sight.
The narrative must match the expert's actual biography and the facts. Inconsistency reads as fabrication.
The named mechanism and any factual claim must be exactly right. This one flags a hard halt — but the call is still yours.
Lower-frequency flags folded into the quality bar below: AI-default names ("the name will be Sarah, it's a pattern"), avatar-inaccessible jargon, numbers that strain belief ("'first month' is more believable than '7 days'"), and pain written as judgment rather than empathy. Proof that lands logically but not emotionally is scored under the Proof dimension in §3.
When no no-nos fire, these dimensions decide how good the ad actually is. Each is grounded in what repeatedly separates a great ad from a weak one. Weights are relative emphasis and stay Provisional — tuned on real performance, never auto-set (see §6). Validated (craft) means the standard is a proven craft rule; Provisional means its weight — not the standard — still awaits performance data, so a Provisional dimension can outweigh a Validated one where the evidence is strongest.
| Dimension | What "great" looks like | Weight | Confidence |
|---|---|---|---|
| Hook intrigue & audience-fit | Grounded intrigue; speaks to the exact person at their exact stage | 15 | Provisional |
| Believable, non-scammy register | "Genuine, not made up"; protects the expert's trust | 13 | Provisional |
| Native, spoken voice | "Looks super native"; sounds like the expert talking, not written | 12 | Validated (craft) |
| Proof: named, specific & emotional | Named person + number + timeframe, landing emotionally not just logically | 13 | Validated (craft) |
| Mechanism: tease the what, protect the how | Branded mechanism that names the enemy; the method's steps stay in the VSL | 10 | Validated (craft) |
| One audience, one big idea, clean arc | A single spine; every sentence earns its place (Chekhov's Gun) | 9 | Validated (craft) |
| Format-execution fit | Tone matched to format/seating; format congruent with the message | 15 | Provisional |
| Believable numbers & claims | Specific but credible; nothing that strains belief or invites legal risk | 3 | Provisional |
| Total | | 90+10* | |
*Reserves ~10 points for an expert-specific dimension an expert config may add (e.g. an angle-class preference). "Validated (craft)" = a proven rule of the playbook. "Provisional" = grounded in review patterns but not yet correlated to performance — flags strongly while its weight earns trust. Weights still await your formal sign-off.
Each dimension is rated 0–10 by the reviewer, multiplied by its weight and divided by 10, and the results are summed to a 0–100 total (the eight core dimensions carry 90 points; an expert config can add one expert-specific dimension worth the last 10). The total ranks quality, but the no-nos override it: trip one and the verdict is capped below GREEN LIGHT however high the score, and a fatal no-no (wrong facts, a scammy core) is a DISMISS on its own. The score ranks; the no-nos and you set the floor.
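Written out, and using the same normalisation as the worked example (each contribution is score × weight ÷ 10), the total looks like this. The symbols are ours, not the rubric's: $s_i$ is the reviewer's 0–10 rating for dimension $i$ and $w_i$ is that dimension's weight.

```latex
T = \sum_{i} \frac{s_i \, w_i}{10}, \qquad 0 \le s_i \le 10
```

With only the eight core dimensions the weights sum to 90, so a perfect core score is 90; the optional expert-specific dimension supplies the remaining 10.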
| Weighted total (no no-nos firing) | Verdict |
|---|---|
| 70–100 | GREEN LIGHT |
| 55–69 | LIGHT TWEAK |
| 40–54 | RESERVE |
| below 40 | DISMISS |
Where each band sits is itself Provisional. The §5 calibration scores ran under a stricter early configuration and read lower than the current rubric would produce, so don't read them as the band thresholds — the monthly performance check (§7) settles the cut-points on real data.
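As a minimal sketch, the band table and the no-no override could be expressed like this. The function names are hypothetical, the thresholds are the Provisional cut-points from the table (treated here as lower bounds, so 69.5 lands in LIGHT TWEAK), and the cap is modelled as "at most LIGHT TWEAK", which is an assumption: the rubric leaves how far to drop to the operator.

```python
# Provisional cut-points from the band table, read as lower bounds (assumption).
BANDS = [
    (70.0, "GREEN LIGHT"),
    (55.0, "LIGHT TWEAK"),
    (40.0, "RESERVE"),
]


def band_for(total: float) -> str:
    """Map a weighted total to its band when no no-nos fire."""
    for floor, label in BANDS:
        if total >= floor:
            return label
    return "DISMISS"


def suggested_verdict(total: float, no_no_fired: bool = False,
                      fatal_no_no: bool = False) -> str:
    """Suggest a verdict; the reviewer still makes the final call."""
    if fatal_no_no:
        # Wrong facts or a scammy core: DISMISS regardless of score.
        return "DISMISS"
    label = band_for(total)
    if no_no_fired and label == "GREEN LIGHT":
        # Assumption: the cap is modelled as one band down.
        # The operator may choose to drop the ad further.
        return "LIGHT TWEAK"
    return label


print(suggested_verdict(73.2))                    # GREEN LIGHT
print(suggested_verdict(73.2, no_no_fired=True))  # LIGHT TWEAK
print(suggested_verdict(95.0, fatal_no_no=True))  # DISMISS
```

The return value is a suggestion for the reviewer, not an automatic rejection: a flagged ad can still ship on your call.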
A well-built ad with no no-nos firing. Each score is the reviewer's 0–10 rating; the contribution is score × weight ÷ 10.
| Dimension | Score /10 | Weight | Contribution |
|---|---|---|---|
| Hook intrigue & audience-fit | 8 | 15 | 12.0 |
| Believable, non-scammy register | 8 | 13 | 10.4 |
| Native, spoken voice | 9 | 12 | 10.8 |
| Proof: named, specific & emotional | 8 | 13 | 10.4 |
| Mechanism: tease what, protect how | 8 | 10 | 8.0 |
| One audience, one big idea, clean arc | 8 | 9 | 7.2 |
| Format-execution fit | 8 | 15 | 12.0 |
| Believable numbers & claims | 8 | 3 | 2.4 |
| Total | | | 73.2 |
Total 73.2, no no-nos → GREEN LIGHT. Had the register scored 4 instead of 8 (it reads a little salesy), the total would fall 5.2 points to 68.0, into LIGHT TWEAK; and had a no-no fired outright, the verdict would be capped below green regardless of the 73.2.
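The arithmetic in this worked example can be checked with a few lines of Python. This is a sketch only: the dictionary layout and the function name are hypothetical, while the weights and scores are copied from the tables.

```python
# Weights from the quality-bar table (eight core dimensions, 90 points).
WEIGHTS = {
    "Hook intrigue & audience-fit": 15,
    "Believable, non-scammy register": 13,
    "Native, spoken voice": 12,
    "Proof: named, specific & emotional": 13,
    "Mechanism: tease the what, protect the how": 10,
    "One audience, one big idea, clean arc": 9,
    "Format-execution fit": 15,
    "Believable numbers & claims": 3,
}

# Reviewer ratings (0-10) from the worked example.
SCORES = {name: 8 for name in WEIGHTS}
SCORES["Native, spoken voice"] = 9


def weighted_total(scores: dict, weights: dict) -> float:
    """Sum score x weight over all dimensions, then divide by 10."""
    return sum(scores[name] * weights[name] for name in weights) / 10


total = weighted_total(SCORES, WEIGHTS)

# What-if: the register dimension scores 4 instead of 8.
what_if_scores = {**SCORES, "Believable, non-scammy register": 4}
what_if = weighted_total(what_if_scores, WEIGHTS)

print(total)    # 73.2
print(what_if)  # 68.0
```

Summing the integer products first and dividing once keeps the result free of accumulated floating-point drift.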
The canonical rule (from the scripter-polish skill): "The ad sells the training. The training sells the method. If a viewer can reconstruct the method from the ad alone, mechanism protection has failed." What's protected is the teaching method's sequence — the steps, phases and order that the video sales letter exists to reveal. Ordinary domain words are fine — naming "floor plans" or "layouts" is not a leak; naming "Step 1 → Step 2 → Step 3 of the method" is.
Every green-lit ad had a named person, a specific number, and a timeframe, never a percentage or "many customers." And the proof has to land, not just compute.
"Take Jenny, a former elementary teacher… she made $1,800 in her first week. Or Leo, who landed his first $2,400 project just 16 days after starting."
"she now says her skin is clear '90% of the time' and she feels like it's continuing to get more and more stable."
The same things make an ad bad and the same shape makes it good across coaching, health and skincare. That shared structure is the rubric; the niche details are config. Amelia is the one expert live today; the same engine is built to serve skincare and health as they launch.
Amelia (live): "If you're passionate about this but think you can't because you don't have a degree…"
Health (Elena) — real: "Your doctor says 'lose weight first, then we'll talk about fertility' — but they won't actually help you lose the weight. And you're sitting there thinking, how am I supposed to do one without the other?"
The spine above is one proven structure, not a mandate that every ad be built pain → agitation → solution. Different ads want different frameworks — and so do hooks. The open design question is how to let the system pick from a small set of structures and hook frameworks without losing its simplicity. That's a dedicated upcoming session, not settled here; this rubric judges quality whatever structure an ad uses.
Before trusting the rubric, we pressure-tested its design against real outcomes. The process: score 10 real ads from Amelia against the quality bar (5 confirmed winners, 5 confirmed losers), blind to their performance, then compare the rubric's read to what the market actually did. The result was a clear gap between winners and losers on average and a correct call on the worst ads from script evidence alone, plus one clear lesson that shaped two of the rubric's design choices (below). That's what the test is for: calibrate the bar on known outcomes before it judges live batches.
Ten was a deliberately clean, balanced sample — 5 clear winners against 5 clear losers, hand-labelled so the blind scoring had unambiguous outcomes to check against. It is not the limit of what we hold. Amelia's Meta account and Hyros carry cost-per-lead, cost-per-booked-call, hook and hold data across her full batch history and beyond — a much larger owned set. Widening the back-test onto that fuller record is the obvious next validation step, and it's a separate piece of work, not something this draft claims to have already done.
| Ad | What actually happened | Rubric | Agree? |
|---|---|---|---|
| Winner 1 · Studio | $8 cost-per-lead · $64 per call — best | 54.5 | over-rejected |
| Format Pair · Selfie | $25 per call — deepest win | 71.3 | over-rejected* |
| Winner 2 · Studio | $10 cost-per-lead · $54 per call | 35.0 | over-rejected |
| Winner 3 · Selfie | $14 cost-per-lead · $70 per call | 49.0 | over-rejected |
| Winner 4 · Studio | $14 cost-per-lead · $58 per call | 47.9 | over-rejected |
| Loser 1 · Selfie | $177 per call — worst | 16.0 | correct |
| Loser 2 · Studio | $28 cost-per-lead · hook rate 38.9 | 17.9 | correct |
| Loser 3 · Studio | $169 per call | 23.6 | correct |
| Loser 4 · Studio | $108 per call | 47.1 | correct |
| Format Pair · Studio | $41 cost-per-lead · $122 per call | 63.9 | missed |
Honest read: a strong filter against bad ideas — 4 of 5 losers caught from the script alone, and the three lowest-scoring ads were all genuine losers. The lesson cuts the other way too: scored against strict structural rules, the winners came in low — the market forgave gaps the rubric was penalising. That's the signal behind the two design choices below. Both push the same direction: ship great ads, don't over-reject them.
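The headline numbers in this read can be recomputed from the table. The sketch below transcribes the ten rows (label, outcome, rubric score, and the "Agree?" column) and summarises them; the tuple layout and variable names are ours. Keep in mind that these scores come from the stricter early configuration, so the pattern matters more than the decimals.

```python
from statistics import mean

# (ad, actual outcome, rubric score, "Agree?" column) from the calibration table.
ADS = [
    ("Winner 1 · Studio", "winner", 54.5, "over-rejected"),
    ("Format Pair · Selfie", "winner", 71.3, "over-rejected"),
    ("Winner 2 · Studio", "winner", 35.0, "over-rejected"),
    ("Winner 3 · Selfie", "winner", 49.0, "over-rejected"),
    ("Winner 4 · Studio", "winner", 47.9, "over-rejected"),
    ("Loser 1 · Selfie", "loser", 16.0, "correct"),
    ("Loser 2 · Studio", "loser", 17.9, "correct"),
    ("Loser 3 · Studio", "loser", 23.6, "correct"),
    ("Loser 4 · Studio", "loser", 47.1, "correct"),
    ("Format Pair · Studio", "loser", 63.9, "missed"),
]

mean_winner = mean(score for _, outcome, score, _ in ADS if outcome == "winner")
mean_loser = mean(score for _, outcome, score, _ in ADS if outcome == "loser")

# Were the three lowest-scoring ads all genuine losers?
lowest_three = sorted(ADS, key=lambda ad: ad[2])[:3]
lowest_three_all_losers = all(outcome == "loser" for _, outcome, _, _ in lowest_three)

# How many of the five losers did the rubric call correctly?
losers_caught = sum(
    1 for _, outcome, _, agree in ADS if outcome == "loser" and agree == "correct"
)

print(f"mean winner score: {mean_winner:.1f}")  # 51.5
print(f"mean loser score:  {mean_loser:.1f}")   # 33.7
print(lowest_three_all_losers, losers_caught)   # True 4
```

The means separate, but the individual scores overlap (Winner 2 sits below Loser 4, and the losing Format Pair version outscores four of the winners), which is the over-rejection pattern the two design choices respond to.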
A strict proof rule — "two-plus stacked proof stories or block" — would reject 3 of the 5 winners, including the deepest win (the Format Pair, Selfie, $25 per call). Those ads had one strong proof story, not zero; the market forgave the incompleteness. So the rubric splits the two: total proof absence is a strong flag (it catches the real loser, Loser 3, which had none), while proof incompleteness is a scored deduction — it lowers the score, it doesn't kill the ad.
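One way to picture the split, as a rough sketch: absence raises a flag, incompleteness only costs points. The function name and return labels are hypothetical, and "complete" is modelled purely as a count of proof stories (two or more, per the strict rule quoted above), which is a simplifying assumption. The size of the deduction is not specified in the rubric, so it is not modelled here.

```python
def proof_check(proof_stories: int) -> str:
    """Classify an ad's proof by how many proof stories it carries.

    Assumption: completeness is judged by story count alone.
    """
    if proof_stories < 0:
        raise ValueError("proof_stories cannot be negative")
    if proof_stories == 0:
        # Total absence: a strong flag for the reviewer (the Loser 3 case).
        return "strong flag"
    if proof_stories == 1:
        # Incomplete: lowers the Proof score but does not kill the ad.
        return "scored deduction"
    # Two or more stacked stories meets the strict rule.
    return "no proof penalty"


for count in (0, 1, 2):
    print(count, proof_check(count))
```

Under this split, the winners that carried a single strong proof story would lose points on the Proof dimension rather than being rejected outright.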
One ad ran in two formats with the same script (the Format Pair above) — Selfie won at $25 per call, Studio lost at $122. A 4.9× swing on format alone. A token weight for format would let the losing version pass on script quality. So format-execution fit is weighted heavily (15%), context-tunable (higher for new/cold tests, lower once a format is proven), and Provisional until performance settles the exact number.
*Scores are from the calibration scoring pass, run under a stricter early configuration than the current weights (§3) — shown to illustrate the pattern, not as exact current-rubric totals. The winner/loser separation and the two lessons are the takeaway, not the decimals.
The Provisional weights are tuned by evidence, never auto-set. You stay in the loop.
Hook rate — the share of viewers who don't scroll past — is a false quality signal. Amelia's two highest hook rates were her two worst-performing ads. No dimension in this rubric uses it as a quality measure, and any older check that still does (a "hook rate below 25% = fail" rule) should be retired.