Demand Gen vs YouTube Bake-off
The test design, the metrics, and the decision rule for picking between Demand Gen and YouTube Ads.
Who this is for
- Brands at $50K+/mo considering top-of-funnel
- Operators running a YouTube test next quarter
- In-house creative teams making allocation decisions
Why this exists
Demand Gen and YouTube look similar from the outside but read different signals and convert different audiences. The wrong pick burns 4-6 weeks of test budget and tells you nothing useful. This bake-off design ensures you get a clean answer.
Read this first
Demand Gen runs across YouTube, Discover, and Gmail with audience-led targeting. YouTube Ads runs primarily on YouTube proper with placement and topic-led targeting. They overlap on the YouTube surface but read different signal stacks, optimise against different conversion windows, and tend to capture different audience archetypes. The bake-off settles which engine fits your brand without burning budget on a hunch.
Test design
Lock the matched-traffic budget split before launch
Allocate a fixed budget for the test window (typically $30K-$60K total over 4-6 weeks for a brand at $50K+/month). Split 50/50 across Demand Gen and YouTube Ads. Same total spend, same window, same product set. Anything else introduces a confounder you can't tease out at the end.
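The split itself is simple arithmetic, but writing it down catches pacing mistakes before launch. A minimal sketch in Python; the $48K budget and six-week window are hypothetical figures inside the range above, not prescriptions:

```python
# Illustrative budget-split arithmetic for the test window.
# Figures are hypothetical examples within the $30K-$60K range above.
total_budget = 48_000   # fixed test budget, dollars
weeks = 6               # test window, weeks

per_arm = total_budget / 2          # 50/50: Demand Gen vs YouTube Ads
weekly_per_arm = per_arm / weeks    # even pacing across the window

print(f"Per arm: ${per_arm:,.0f} total, ${weekly_per_arm:,.0f}/week")
# Per arm: $24,000 total, $4,000/week
```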
Match the creative across formats
Six scripts shot once, format-cut into vertical (9:16), in-feed (1:1 or 4:5), and 16:9. Both campaigns run the same matched batch. If Demand Gen runs polished verticals while YouTube runs unedited 16:9 from the same source, the test measures format mix more than engine performance. Format parity is non-negotiable.
Set the audience overlap deliberately
Use the same customer match list, the same custom segment URLs, and the same in-market segment on both campaigns. Different audience signal between campaigns means different targeting, which means the test result is contaminated. Same signal on both; let the engines do the work of routing it.
Run for 4-6 weeks, not less
View-through conversions take 7-14 days to register cleanly. Below 4 weeks, post-click data is thin and view-through is half-baked. 4 weeks is the floor for a defensible read. 6 weeks is the recommended minimum if your conversion volume is below 200 per week.
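The duration rule reduces to a single check on weekly conversion volume. A minimal sketch, assuming the thresholds above; the function name is illustrative:

```python
def min_test_weeks(weekly_conversions: int) -> int:
    """Minimum defensible test length in weeks, per the rule above."""
    if weekly_conversions < 200:
        return 6   # thin volume: extend so view-through registers cleanly
    return 4       # the floor for a defensible read

print(min_test_weeks(150))  # 6
print(min_test_weeks(320))  # 4
```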
Measure on the four-metric stack
Measure on four metrics: view-through CPA (audience reach efficiency), post-click ROAS (intent quality), AOV impact (whether the audience is qualified or low-intent), and blended MER lift across the entire account (the only number that survives attribution debate). Pick the engine that wins on at least three of the four; if the result splits 2-2, run both at lower budget.
Decision rule for declaring a winner
Apply the thresholds below to the four-metric stack at the end of the test window. Statistical confidence requires at least 200 conversions per arm; below that, hold the test or extend it by 2 weeks. A sketch of the full rule follows the table.
| Metric | Threshold for a win | What it tells you |
|---|---|---|
| View-through CPA | 30%+ lower than the other arm | Reach efficiency. Cheaper attention. |
| Post-click ROAS | 20%+ higher than the other arm | Intent quality. The audience converted on first click. |
| AOV impact | 10%+ higher AOV than the other arm | Audience qualification. Higher-intent buyers spend more per order. |
| Blended MER lift | Net positive lift on account-level MER vs pre-test baseline | The number that survives attribution debate. The actual business impact. |
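A minimal sketch of the decision rule, encoding the table's thresholds and the 200-conversion confidence floor. The class and function names are illustrative, and the head-to-head MER comparison (positive lift and higher than the other arm) is an assumption layered on the table's net-positive condition:

```python
from dataclasses import dataclass

@dataclass
class ArmMetrics:
    conversions: int          # total conversions in the arm
    view_through_cpa: float   # lower is better
    post_click_roas: float    # higher is better
    aov: float                # higher is better
    mer_lift: float           # account-level MER lift vs pre-test baseline

def metric_wins(a: ArmMetrics, b: ArmMetrics) -> tuple[int, int]:
    """Count threshold-level wins for each arm across the four metrics."""
    checks = [
        # (arm A wins this metric, arm B wins this metric)
        (a.view_through_cpa <= 0.70 * b.view_through_cpa,   # 30%+ lower CPA
         b.view_through_cpa <= 0.70 * a.view_through_cpa),
        (a.post_click_roas >= 1.20 * b.post_click_roas,     # 20%+ higher ROAS
         b.post_click_roas >= 1.20 * a.post_click_roas),
        (a.aov >= 1.10 * b.aov,                             # 10%+ higher AOV
         b.aov >= 1.10 * a.aov),
        (a.mer_lift > 0 and a.mer_lift > b.mer_lift,        # assumption: positive
         b.mer_lift > 0 and b.mer_lift > a.mer_lift),       # and beats the other arm
    ]
    return sum(a_win for a_win, _ in checks), sum(b_win for _, b_win in checks)

def decide(a: ArmMetrics, b: ArmMetrics) -> str:
    if min(a.conversions, b.conversions) < 200:
        return "hold, or extend the window by 2 weeks"  # confidence floor
    a_wins, b_wins = metric_wins(a, b)
    if a_wins >= 3:
        return "arm A wins"
    if b_wins >= 3:
        return "arm B wins"
    return "split result: run both at proportional budgets"

a = ArmMetrics(conversions=260, view_through_cpa=18.0,
               post_click_roas=2.4, aov=92.0, mer_lift=0.06)
b = ArmMetrics(conversions=240, view_through_cpa=30.0,
               post_click_roas=1.9, aov=80.0, mer_lift=0.01)
print(decide(a, b))  # arm A wins
```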
Creative parity checklist (the test fails if these aren't matched)
- Same six scripts shot once, no script-level rewrites for one engine
- Vertical, in-feed, and 16:9 cuts produced from the same master, no separate shoots
- Same hook structure across cuts (the first 2 seconds are the script's first 2 seconds, format-trimmed)
- Same offer + CTA placement (front-loaded or back-loaded) across both arms
- Same end-card design and length
- Same captions and audio mix (subtitles burned in for vertical, optional captions for 16:9)
- Same brand assets (logo, colour palette, brand voice)
- Same approval cycle so neither arm has stale creative when the other refreshes
When the answer is 'run both'
The bake-off splits 2-2 in roughly a third of accounts. That's not the test failing; that's a real signal. It means the brand has audiences that respond differently to the two engines, and consolidating budget into one engine would lose the volume the other was capturing.
When that happens, the right move is to scale both at proportional budget allocations. Demand Gen carries the broad-discovery layer (Discover feed, Gmail promotions, YouTube home feed). YouTube Ads carries the intent-led layer (in-stream pre-roll, YouTube search results, channel page placements). Both serve the funnel; neither replaces the other.
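One way to set the proportional split is to weight each engine by its share of test-window conversions. That weighting basis is an assumption for illustration, not something the bake-off design prescribes:

```python
def split_budget(total: float, dg_conversions: int, yt_conversions: int):
    """Split a budget by each engine's share of test-window conversions."""
    dg_share = dg_conversions / (dg_conversions + yt_conversions)
    return total * dg_share, total * (1 - dg_share)

dg, yt = split_budget(60_000, 240, 160)   # hypothetical test-window counts
print(f"Demand Gen: ${dg:,.0f} / YouTube Ads: ${yt:,.0f}")
# Demand Gen: $36,000 / YouTube Ads: $24,000
```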
The wrong move when the test splits is to pick one engine arbitrarily and call it. The 4-6 weeks of data was telling you something specific. Read it.
What good looks like at the end of the bake-off
You have a written winner (or written 'run both' with proportional allocations). You have a documented metric trail showing why. Creative parity was maintained throughout. The decision survives a 30-minute Q&A with the CFO. From here, the winning engine scales to its target spend with the structure refined on what the test taught you.
External resources
Authoritative references we link to alongside the template. Read them before running the bake-off.
- Google Ads, Demand Gen overview. Official feature reference for Demand Gen.
- Google Ads, video campaigns overview. Reference for YouTube Ads campaign types and targeting options.
- Google Ads, video ad sequencing. Useful for the YouTube Ads arm if the test extends to mid-funnel sequencing.
- Google Skillshop, video advertising certification. Free training; the team running the bake-off should hold current certification.
- Search Engine Journal, video advertising coverage. Topic index for ongoing YouTube and Demand Gen platform changes.