# Samples
_Don't treat small-sample research as quantitative when its value is qualitative. You are not measuring; you are learning._
---
A product team runs user interviews. Ten people.
Seven pick option A, three pick option B.
The readout: "Overwhelming favourite: more than twice as many preferred A over B."
That looks immediately actionable. Mathematically, it isn't.
---
You need far more data than you think to confirm a pattern.
Statistically, the number of observations you need scales inversely with the square of the difference you're trying to detect. Halve the effect size, quadruple the sample.
A useful rule of thumb:
n ≈ 16 × p × (1−p) / d²
where p is your baseline rate and d is the absolute difference you're trying to detect. The 16 bakes in standard statistical confidence (roughly 80% power at a 5% significance level). The d² is what ruins your plans.
This is why small improvements - the kind most product changes actually produce - require enormous samples to confirm, while only dramatic differences show up in small groups.
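The rule of thumb is one line of code. A quick sketch (the function name is my own) that also shows the halve-the-effect, quadruple-the-sample behaviour:

```python
def sample_size(p, d):
    """Rule-of-thumb observations per group: 16 * p * (1 - p) / d**2.

    p: baseline rate, d: absolute difference you want to detect.
    The 16 corresponds to roughly 80% power at 5% significance.
    """
    return 16 * p * (1 - p) / d**2

# Halving the detectable difference quadruples the required sample:
print(round(sample_size(0.5, 0.20)))  # 100
print(round(sample_size(0.5, 0.10)))  # 400
```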
---
You're a product manager for a communications app. You've redesigned the notification screen - cleaner layout, fewer taps to acknowledge a message. You show both versions to 30 users and ask which they prefer.
22 out of 30 prefer the new design. 73%. Your slide deck says "strong preference for the redesign."
But run the formula.
Your baseline expectation is 50% (no preference either way), and you're trying to detect whether the true preference is meaningfully above that.
n ≈ 16 × 0.5 × 0.5 / d²
To detect a 15-point preference difference (i.e. 65% choosing the new layout) with confidence, you'd need about 178 people. You have 30.
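Plugging the interview numbers into the rule of thumb (a sketch; the function name is mine):

```python
def sample_size(p, d):
    """Rule-of-thumb n per group: 16 * p * (1 - p) / d**2."""
    return 16 * p * (1 - p) / d**2

# 50% baseline (no preference), trying to detect a 15-point difference:
n = sample_size(0.5, 0.15)
print(round(n))  # 178 -- against the 30 users you actually have
```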
---
You run a payments platform. Currently 4% of users who land on a top-up page complete a payment. You've streamlined the flow and you want to know if it converts better.
You split traffic 50/50 for two weeks. 1,200 see each version. The old flow converts at 4.0%. The new one converts at 4.8%. That's a 20% relative lift. Feels meaningful.
n ≈ 16 × 0.04 × 0.96 / 0.008² = **9,600 per variant**
To run this test properly at a 4% baseline, you'd either need to run it for several months to accumulate ~20,000 total visitors, or accept that you can only detect very large improvements - say 4% jumping to 6%, which would still need about 1,500 per variant.
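Another way to see the constraint is to invert the rule of thumb: given the traffic you actually have, what is the smallest lift you could reliably detect? A sketch using the example's numbers (the function name is mine):

```python
from math import sqrt

def min_detectable_diff(p, n):
    """Smallest absolute difference the rule of thumb
    (n = 16 * p * (1 - p) / d**2) says n per variant can detect,
    given baseline rate p."""
    return sqrt(16 * p * (1 - p) / n)

# Payments example: 4% baseline, 1,200 visitors per variant.
d = min_detectable_diff(0.04, 1200)
print(f"{d:.3f}")  # 0.023 -- only a ~2.3-point lift is detectable
```

With 1,200 visitors per variant you can only reliably see the flow jumping from 4% to above 6%; the 0.8-point lift you observed is far below that threshold.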
Even if the 20% relative improvement were real, with 1,200 people per variant where you needed 9,600 you would have only a roughly 15% chance of detecting it. This is the hidden cost of underpowered tests: you'd conclude the redesign didn't work, kill the project, and move on.
Small samples don't just risk false confidence; they risk false failure - abandoning changes that were actually working.
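That ~15% figure can be sanity-checked with a small Monte Carlo sketch. This uses the example's numbers and a textbook two-proportion z-test (my choice of test, not anything specified above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_old, p_new, sims = 1200, 0.040, 0.048, 20_000

# Simulate conversions for each variant across many repeated experiments,
# assuming the 4.0% -> 4.8% improvement is genuinely real.
old = rng.binomial(n, p_old, size=sims)
new = rng.binomial(n, p_new, size=sims)

# Standard two-proportion z-test with a pooled standard error.
pooled = (old + new) / (2 * n)
se = np.sqrt(2 * pooled * (1 - pooled) / n)
z = (new / n - old / n) / se

# Fraction of experiments that reach significance (|z| > 1.96):
power = float(np.mean(np.abs(z) > 1.96))
print(f"{power:.2f}")  # comes out near 0.16
```

Roughly five runs in six, the test misses a real improvement, which is exactly the "false failure" scenario.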
---
All is not lost.
Twenty interviews won't tell you what percentage of users prefer option A. But they might reveal the words people actually use ("this feels like it was designed by engineers"), or that twelve people independently mention the same friction point without prompting.
They capture intensity - lukewarm preference from 60% tells you less than passionate enthusiasm from 30%. And they surface surprises you didn't anticipate, unexpected objections and unimagined use cases that are just as informative at n=5 as at n=500.
Many decisions can't wait for statistical significance, and some never could. You will never get 3,000 enterprise buyers into a pricing study.
In those contexts, do qualitative research well and make the decision with appropriate humility about what you don't know.
---