The Great Ad Rubric — Consolidated V1 (for review)

The one idea

We can't predict how the market reacts to an ad. We can make sure the ad is written well. So this rubric does one job: judge whether an ad is good enough to ship — and if it is, it's ready to ship on your call. No legal kill-switches. It scores the craft, flags the things that make an ad bad, and hands you one of four verdicts.

It is niche-agnostic — the same bar works across coaching, health and skincare, because the same things make an ad bad in all of them. The expert-specific details (proof names, mechanism, tone) live in each expert's config, not in the rubric. Every shorthand term is spelled out on first use and again in the glossary at the foot.

01 The verdict — the four labels

Every reviewed ad gets one of four labels. They're the output of the rubric, not an invented score band.

GREEN LIGHT

Great. Ship as-is. Clears the quality bar, trips no no-nos.

LIGHT TWEAK

Nearly there — one small fix, then ship. A single fixable issue stands between it and green.

RESERVE

Hold it. Usable later, not now. Often where a no-no fires but the idea "can be tested" down the line.

DISMISS

Kill it. A fundamental no-no — scammy at its core, wrong facts, or no real hook.

How the verdict is reached

Two layers feed it. The no-nos (§2) are the things that make an ad bad: trip one and the verdict drops below green (how far is your call). The quality bar (§3) is what makes an ad great when no no-nos fire. The system flags; you decide. Nothing auto-rejects on a guess about the market: a flagged ad can still be shipped on your call, and a clean ad still gets your eyes before it runs.

02 The no-nos — what makes an ad bad

These are the real, recurring kill-signals: the patterns that make an ad bad, seen again and again in Amelia's live account, the founder's reviews, and the high-signal advertisers we model. Tripping one caps the verdict below GREEN LIGHT. The top two are by far the most common.

① Sounds scammy / theatrical — the single most common kill

The whole game is trust. An ad that sounds like a hype-marketer destroys it. It's the most common kill, across every expert.

Dismissed — real (Amelia)

"…selling these for $750 a piece using nothing but a simple AI system. Don't blink…"

Tells: fake exposé framing ("I wasn't supposed to show you this", "before the gatekeepers catch on"), unverifiable press name-drops, "borderline illegal" energy.

② Vanilla / vague hook — a hook that isn't a hook

If the first line creates no real intrigue, or is so vague that it raises five questions and sends the mind in five different directions, it fails. The fix is grounded specificity, not more hype.

③ Sells the program, not the click

The ad's only job is to earn the click to the free training. Describing the course, or inventing a "community," breaks the funnel.

④ Fake urgency / scarcity

Manufactured "only a few spots left" pressure reads as scammy to a sophisticated audience.

⑤ Reads like a sales letter, not a spoken video

It's a person talking to camera, not copy on a page. No "as I said above," no fascinations no human would say aloud.

⑥ AI-tell / fluff — show, don't tell

Generic AI filler and "telling" instead of "showing" gets cut on sight.

⑦ Story contradicts the expert's real life

The narrative must match the expert's actual biography and the facts. Inconsistency reads as fabrication.

⑧ Wrong mechanism name / wrong facts — usually fatal

The named mechanism and any factual claim must be exactly right. This one flags a hard halt — but the call is still yours.

Lower-frequency flags folded into the quality bar below: AI-default names ("the name will be Sarah, it's a pattern"), avatar-inaccessible jargon, numbers that strain belief ("'first month' is more believable than '7 days'"), and pain written as judgment rather than empathy. Proof that lands logically but not emotionally is scored under the Proof dimension in §3.

03 The quality bar — what makes an ad great

When no no-nos fire, these dimensions decide how good the ad actually is. Each is grounded in what repeatedly separates a great ad from a weak one. Weights are relative emphasis and stay Provisional: tuned on real performance, never auto-set (see §7). Validated (craft) means the standard is a proven craft rule; Provisional means its weight (not the standard) still awaits performance data, so a Provisional dimension can outweigh a Validated one where the evidence is strongest.

Dimension	What "great" looks like	Weight	Confidence
Hook intrigue & audience-fit	Grounded intrigue; speaks to the exact person at their exact stage	15	Provisional
Believable, non-scammy register	"Genuine, not made up"; protects the expert's trust	13	Provisional
Native, spoken voice	"Looks super native"; sounds like the expert talking, not written	12	Validated (craft)
Proof: named, specific & emotional	Named person + number + timeframe, landing emotionally not just logically	13	Validated (craft)
Mechanism: tease the what, protect the how	Branded mechanism that names the enemy; the method's steps stay in the VSL (video sales letter)	10	Validated (craft)
One audience, one big idea, clean arc	A single spine; every sentence earns its place (Chekhov's Gun)	9	Validated (craft)
Format-execution fit	Tone matched to format/seating; format congruent with the message	15	Provisional
Believable numbers & claims	Specific but credible; nothing that strains belief or invites legal risk	3	Provisional
Total		90 + 10*

*Reserves ~10 points for an expert-specific dimension an expert config may add (e.g. an angle-class preference). "Validated (craft)" = a proven rule of the playbook. "Provisional" = grounded in review patterns but not yet correlated to performance — flags strongly while its weight earns trust. Weights still await your formal sign-off.

How the score produces a verdict

Each dimension is rated 0–10 by the reviewer, multiplied by its weight and divided by 10, and the contributions are summed to a 0–100 total (the eight core dimensions carry 90 points; an expert config can add one expert-specific dimension worth the last 10). The total ranks quality, but the no-nos override it: trip one and the verdict is capped below GREEN LIGHT however high the score, and a fatal no-no (wrong facts, a scammy core) is a DISMISS on its own. The score ranks; the no-nos and you set the floor.

Weighted total (no no-nos firing)	Verdict
70–100	GREEN LIGHT
55–69	LIGHT TWEAK
40–54	RESERVE
below 40	DISMISS

Where each band sits is itself Provisional. The §5 calibration scores ran under a stricter early configuration and read lower than the current rubric would produce, so don't read them as the band thresholds — the monthly performance check (§7) settles the cut-points on real data.
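A minimal sketch of how the band table and the no-no override could combine into a suggested label. The function name, its arguments, and the treatment of each band's lower bound as an inclusive threshold are assumptions made for illustration. Returning LIGHT TWEAK when a no-no fires is also an assumption: the rubric only says the verdict is capped below GREEN LIGHT, and how far it drops remains your call.

```python
# Provisional cut-points copied from the band table; they still await sign-off.
BANDS = [(70, "GREEN LIGHT"), (55, "LIGHT TWEAK"), (40, "RESERVE")]


def suggest_verdict(total, no_nos=(), fatal=False):
    """Suggest a label from a 0-100 total. The reviewer still makes the call."""
    if fatal:
        # A fatal no-no (wrong facts, scammy core) is a DISMISS on its own.
        return "DISMISS"
    label = "DISMISS"
    for lower_bound, name in BANDS:
        if total >= lower_bound:
            label = name
            break
    if no_nos and label == "GREEN LIGHT":
        # Assumption: return the highest label still allowed under the cap.
        label = "LIGHT TWEAK"
    return label


print(suggest_verdict(73.2))                             # GREEN LIGHT
print(suggest_verdict(73.2, no_nos=["fake urgency"]))    # LIGHT TWEAK
print(suggest_verdict(90.0, fatal=True))                 # DISMISS
```

The output is a suggestion only: a flagged ad can still ship on your call, and a clean one still gets your eyes.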

Worked example — illustrative; the numbers are made up to show the maths, and the bands await your sign-off

A well-built ad with no no-nos firing. Each score is the reviewer's 0–10 rating; the contribution is score × weight ÷ 10.

Dimension	Score /10	Weight	Contribution
Hook intrigue & audience-fit	8	15	12.0
Believable, non-scammy register	8	13	10.4
Native, spoken voice	9	12	10.8
Proof: named, specific & emotional	8	13	10.4
Mechanism: tease what, protect how	8	10	8.0
One audience, one big idea, clean arc	8	9	7.2
Format-execution fit	8	15	12.0
Believable numbers & claims	8	3	2.4
Total			73.2

Total 73.2, no no-nos → GREEN LIGHT. Had the register scored 4 instead of 8 (it reads a little salesy), the total would fall 5.2 points to 68.0, into LIGHT TWEAK; and had a no-no fired outright, the verdict would be capped below green regardless of the 73.2.
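The arithmetic in the worked example can be checked with a few lines of Python. This is a sketch only: the short dimension keys are hypothetical labels for the eight core dimensions, and the weights are the Provisional ones from the table in this section.

```python
# Provisional weights for the eight core dimensions (they sum to 90).
WEIGHTS = {
    "hook": 15, "register": 13, "voice": 12, "proof": 13,
    "mechanism": 10, "arc": 9, "format_fit": 15, "numbers": 3,
}


def weighted_total(scores):
    """Sum of score x weight / 10 over the core dimensions (0-10 ratings)."""
    return sum(scores[name] * weight for name, weight in WEIGHTS.items()) / 10


example = {
    "hook": 8, "register": 8, "voice": 9, "proof": 8,
    "mechanism": 8, "arc": 8, "format_fit": 8, "numbers": 8,
}
total = weighted_total(example)
salesy = weighted_total({**example, "register": 4})
print(total, salesy)  # 73.2 68.0
```

Dropping the register rating by 4 points removes 4 × 13 ÷ 10 = 5.2 points, which is what moves the ad from the GREEN LIGHT band into LIGHT TWEAK.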

Two dimensions worth seeing in full

Mechanism — tease the WHAT, protect the HOW (corrected)

The canonical rule (from the scripter-polish skill): "The ad sells the training. The training sells the method. If a viewer can reconstruct the method from the ad alone, mechanism protection has failed." What's protected is the teaching method's sequence — the steps, phases and order that the video sales letter exists to reveal. Ordinary domain words are fine — naming "floor plans" or "layouts" is not a leak; naming "Step 1 → Step 2 → Step 3 of the method" is.

Great — names the enemy, hides the method

A branded mechanism that frames the problem — a named "system" or "method" the expert coined — without listing how it actually works.

Fail — reveals the method's sequence

Walking through the numbered steps/phases of the method so the viewer no longer needs the training. (The fix is to tease, not teach.)

Proof — named, specific, and emotional

Every green-lit ad had a named person, a specific number, and a timeframe — never a percentage or "many customers." And the proof has to land, not just compute.

Great — real (Amelia)

"Take Jenny, a former elementary teacher… she made $1,800 in her first week. Or Leo, who landed his first $2,400 project just 16 days after starting."

Flagged — real (Alba)

"she now says her skin is clear '90% of the time' and she feels like it's continuing to get more and more stable."

04 The agnostic spine — why one rubric fits every expert

The same things make an ad bad and the same shape makes it good across coaching, health and skincare. That shared structure is the rubric; the niche details are config. Amelia is the one expert live today; the same engine is built to serve skincare and health as they launch.

One spine that recurs — not the only one

Hook on a concrete lived frustration, not a hypothetical.
Named proof + specific number + timeframe, immediately.
Branded mechanism that names the enemy, hides the method.
Objection stack — "no degree / no experience / no equipment."
CTA = click to the free training, never the program.

What changes per expert (the config layer)

Proof metric — income ($/mo) · symptoms (lbs, cycle) · skin clarity.
Tone — aspirational · clinical-empathetic · identity/healing.
The enemy — design schools · dismissive doctors · conventional dermatology.
Off-brand flags — e.g. "no qualifications" (Amelia, legal), jargon (health), percentages (skincare).
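One way to picture the config layer is as a small per-expert record that the shared rubric reads from. The sketch below is hypothetical: the class, its field names, and the niche keys are invented for illustration, and it assumes each list above gives its values in the order coaching, health, skincare (the off-brand flags line names its niches explicitly).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ExpertConfig:
    """Hypothetical per-expert config; the rubric itself stays niche-agnostic."""
    proof_metric: str
    tone: str
    enemy: str
    off_brand_flags: Tuple[str, ...]
    # Optional expert-specific dimension that would use the reserved ~10 points.
    extra_dimension: Optional[str] = None


# Values copied from the lists above; the per-niche ordering is an assumption.
CONFIGS = {
    "coaching": ExpertConfig(
        "income ($/mo)", "aspirational", "design schools",
        ("no qualifications",),
    ),
    "health": ExpertConfig(
        "symptoms (lbs, cycle)", "clinical-empathetic", "dismissive doctors",
        ("jargon",),
    ),
    "skincare": ExpertConfig(
        "skin clarity", "identity/healing", "conventional dermatology",
        ("percentages",),
    ),
}

print(CONFIGS["skincare"].off_brand_flags)
```

Keeping these values outside the scoring logic is what lets the same bar apply to every expert.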

The same spine, real lines — Amelia (live) and Elena (pre-launch, scripted)

Amelia (live): "If you're passionate about this but think you can't because you don't have a degree…"
Elena (health, pre-launch, scripted): "Your doctor says 'lose weight first, then we'll talk about fertility' — but they won't actually help you lose the weight. And you're sitting there thinking, how am I supposed to do one without the other?"

Not every ad should follow one structure

The spine above is one proven structure, not a mandate that every ad be built pain → agitation → solution. Different ads want different frameworks — and so do hooks. The open design question is how to let the system pick from a small set of structures and hook frameworks without losing its simplicity. That's a dedicated upcoming session, not settled here; this rubric judges quality whatever structure an ad uses.

05 Stress-test against live data (Amelia — our only account with results)

Before trusting the rubric, we pressure-tested its design against real outcomes. The process: score 10 real ads from Amelia against the quality bar — 5 confirmed winners, 5 confirmed losers — blind to their performance, then compare the rubric's read to what the market actually did. The result was a clean gap between winners and losers and a correct call on the worst ads from script evidence alone — plus one clear lesson that shaped two of the rubric's design choices (below). That's what the test is for: calibrate the bar on known outcomes before it judges live batches.

Why only 10 — and the data we haven't tapped yet

Ten was a deliberately clean, balanced sample — 5 clear winners against 5 clear losers, hand-labelled so the blind scoring had unambiguous outcomes to check against. It is not the limit of what we hold. Amelia's Meta account and Hyros carry cost-per-lead, cost-per-booked-call, hook and hold data across her full batch history and beyond — a much larger owned set. Widening the back-test onto that fuller record is the obvious next validation step, and it's a separate piece of work, not something this draft claims to have already done.

Mean winner score

51.5

Mean loser score

33.7 (17.8 points below the winner mean)

Winner/loser pairs ordered right

80%

Bottom 3 scores — all losers

3/3, from the script alone

Ad	What actually happened	Rubric	Agree?
Winner 1 · Studio	$8 cost-per-lead (best) · $64 per call	54.5	over-rejected
Format Pair · Selfie	$25 per call — deepest win	71.3	over-rejected*
Winner 2 · Studio	$10 cost-per-lead · $54 per call	35.0	over-rejected
Winner 3 · Selfie	$14 cost-per-lead · $70 per call	49.0	over-rejected
Winner 4 · Studio	$14 cost-per-lead · $58 per call	47.9	over-rejected
Loser 1 · Selfie	$177 per call — worst	16.0	correct
Loser 2 · Studio	$28 cost-per-lead · hook rate 38.9%	17.9	correct
Loser 3 · Studio	$169 per call	23.6	correct
Loser 4 · Studio	$108 per call	47.1	correct
Format Pair · Studio	$41 cost-per-lead · $122 per call	63.9	missed
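The headline figures above (mean scores, the gap, the share of winner/loser pairs ordered correctly, and the bottom-three check) can be reproduced from the Rubric column of the table. A short sketch, with variable names chosen for illustration:

```python
from itertools import product
from statistics import mean

# Rubric scores from the table: five confirmed winners, five confirmed losers.
winners = [54.5, 71.3, 35.0, 49.0, 47.9]
losers = [16.0, 17.9, 23.6, 47.1, 63.9]

mean_w = mean(winners)
mean_l = mean(losers)
gap = mean_w - mean_l

# Every winner is compared with every loser (25 pairs in total).
pairs = list(product(winners, losers))
ordered_share = sum(w > l for w, l in pairs) / len(pairs)

# Were the three lowest-scoring ads all losers?
bottom_three = sorted(winners + losers)[:3]
bottom_three_all_losers = all(score in losers for score in bottom_three)

print(round(mean_w, 1), round(mean_l, 1), round(gap, 1),
      ordered_share, bottom_three_all_losers)  # 51.5 33.7 17.8 0.8 True
```

Of the five mis-ordered pairs, four involve the Format Pair Studio ad (63.9), which outscored four of the five winners.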

Honest read: a strong filter against bad ideas — 4 of 5 losers caught from the script alone, and the three lowest-scoring ads were all genuine losers. The lesson cuts the other way too: scored against strict structural rules, the winners came in low — the market forgave gaps the rubric was penalising. That's the signal behind the two design choices below. Both push the same direction: ship great ads, don't over-reject them.

The two design choices the test drove

Choice 1 · Proof is a scored deduction, not a kill-switch

A strict proof rule — "two-plus stacked proof stories or block" — would reject 3 of the 5 winners, including the deepest win (the Format Pair, Selfie, $25 per call). Those ads had one strong proof story, not zero; the market forgave the incompleteness. So the rubric splits the two: total proof absence is a strong flag (it catches the real loser, Loser 3, which had none), while proof incompleteness is a scored deduction — it lowers the score, it doesn't kill the ad.

Choice 2 · Format-execution fit carries real weight (15%)

One ad ran in two formats with the same script (the Format Pair above) — Selfie won at $25 per call, Studio lost at $122. A 4.9× swing on format alone. A token weight for format would let the losing version pass on script quality. So format-execution fit is weighted heavily (15%), context-tunable (higher for new/cold tests, lower once a format is proven), and Provisional until performance settles the exact number.

*Scores are from the calibration scoring pass, run under a stricter early configuration than the current weights (§3) — shown to illustrate the pattern, not as exact current-rubric totals. The winner/loser separation and the two lessons are the takeaway, not the decimals.

06 The hard questions, pre-answered

"Ten ads is an anecdote, not a dataset — and isn't this just Amelia's rubric with a coat of paint?"

The standards don't come from ten ads, and they aren't one expert's playbook repainted. They're drawn from three places: Amelia's live performance, the founder's own creative reviews, and the high-signal advertisers already winning in our markets. The structural spine is niche-agnostic and carries no expert-specific content; that lives in config. The 10-ad blind test is only the performance check on our one account with numbers, and its three lowest scores were all confirmed losers, called from script evidence alone. Fuller owned data (Amelia's full Meta and Hyros history) is there to widen that check (see §5).

"So nothing gets auto-rejected? That sounds loose."

By design. We can't predict the market, so we don't pretend to. The rubric makes sure an ad is written well and flags what a sharp reviewer would flag; you decide what ships. The no-nos still stop the genuinely bad ones — they just inform your call instead of overriding it.

"What stops it from shipping a scammy ad that happens to score well?"

The no-nos. They sit outside the weighted score — trip one and the verdict is capped below GREEN LIGHT no matter how high the quality score, and a fatal one (wrong facts, scammy core) is a DISMISS on its own. A polished, well-structured ad that reads as hype doesn't get a pass for being well-built.

07 How the weights earn trust

The Provisional weights are tuned by evidence, never auto-set. You stay in the loop.

EDITS

Edit-distance trend (primary). We capture the script the system writes and the version you approve, and measure how much you changed. As that gap shrinks, the rubric and the writer are converging on your taste.

MONTHLY

Performance check. Do high-scoring ads actually beat low-scoring ads on cost-per-booked-call? If a weight doesn't correlate, the system flags it for your review — it never re-weights itself.

FLAGS

No-no frequency. If one no-no fires on 30% of scripts, the upstream writing skill needs reinforcement — the rubric doubles as a diagnostic on the pipeline.
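Two of the signals above lend themselves to simple measurement. The sketch below is illustrative, not the system's actual tooling: the function names are hypothetical, and the edit gap uses the standard library's `difflib.SequenceMatcher` similarity ratio over words as a stand-in for a true edit distance. The 30% threshold is the figure quoted above.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_gap(draft: str, approved: str) -> float:
    """0.0 means approved unchanged; values near 1.0 mean heavily rewritten."""
    matcher = SequenceMatcher(None, draft.split(), approved.split(),
                              autojunk=False)
    return 1.0 - matcher.ratio()


def no_no_rates(flags_per_script):
    """Share of scripts on which each no-no fired at least once."""
    counts = Counter(flag for flags in flags_per_script for flag in set(flags))
    return {flag: n / len(flags_per_script) for flag, n in counts.items()}


# Made-up inputs, purely to show the shape of the data.
print(edit_gap("a b c d", "a b x d"))  # 0.25
rates = no_no_rates([["scammy"], [], ["scammy", "fake urgency"], []])
needs_reinforcement = [flag for flag, rate in rates.items() if rate >= 0.30]
print(needs_reinforcement)  # ['scammy']
```

Tracking `edit_gap` per batch gives the trend line; a falling average suggests the writer and the rubric are converging on your taste.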

A note on hook rate

Hook rate — the share of viewers who don't scroll past — is a false quality signal. Amelia's two highest hook rates were her two worst-performing ads. No dimension in this rubric uses it as a quality measure, and any older check that still does (a "hook rate below 25% = fail" rule) should be retired.

08 Honest limits

Performance validation is one account. Only Amelia has results (the labelled back-test was n = 10 ads, ~2 months, below the 15-call confidence floor; fuller owned data exists, see §5). The other two experts (skincare, health) are pre-launch — the standards transfer, the weights aren't yet validated there.
The standards are qualitative. They're expert judgment, mined faithfully from real reviews — but "sounds scammy" and "lacks emotional power" are calls, not measurements. The rubric makes them explicit and consistent; it doesn't make them objective.
This does not replace human judgment. Replicating a sharp reviewer's line-by-line eye is still beyond any system, and that isn't the aim. The aim is to cover enough ground to reach real predictability — to beat the industry's creative hit rate, and to do it at scale. That is the quality benchmark. Until the system is far more mature, human judgment and review stay essential: the rubric makes the obvious calls fast and flags the rest for your eyes.
It scores the ad, not the funnel. Cost-per-booked-call also depends on the training video and the offer. A bad number can be a VSL problem, not an ad problem.

09 Glossary

The experts: Amelia (a skills-coaching niche), our one live account with performance data, plus two pre-launch experts in skincare and health. The system serves all of them from one agnostic engine. Examples marked "real" are verbatim script lines: Amelia's come from her live account; those credited to the pre-launch experts (Alba, Elena) come from scripts that have not yet run.
High-signal advertisers: Outside brands whose ads are demonstrably winning in our markets at scale. We don't hold their private numbers, so they're indirect evidence — but their proven creative is a strong reference for what "great" looks like. We model the patterns; we never copy the ads.
VSL (video sales letter): The long free-training video the ad drives traffic to. The ad sells the click; the VSL sells the program.
Mechanism: The proprietary "how it works" of an expert's method — the steps, order and sequence. Teased in ads, fully revealed only in the VSL, to protect the reason to watch and the intellectual property.
Hook: The opening line(s). Its job is grounded intrigue aimed at the exact target person, not maximum scroll-stop (see "A note on hook rate" in §7).
Hook rate: The share of viewers who don't immediately scroll past. Feels like quality; isn't — Amelia's two highest hook rates were her two worst ads.
Social proof bridge: The block of named student results (name + obstacle + number + timeframe) that turns "interesting" into "possible for me."
Cost-per-booked-call / cost-per-lead: The money metrics. Cost-per-lead = price of one opt-in; cost-per-booked-call = price of one qualified sales call (the one that matters most). Lower is better.
Selfie vs Studio: Two video formats. Selfie = phone-shot, low-polish, peer feel. Studio = polished, higher production, authority feel. Format is a first-order driver of results, not neutral.
Chekhov's Gun: The conciseness rule. Every sentence must earn its place; if removing it costs nothing, cut it.