Experiment Design Definition
Process and Ways of Working for Product teams
🧪 The different types of experiments
Product teams run 4 different types of experiments:
🔍 Concept validation (Learning)
Purpose: To quickly and cheaply validate the desirability and viability of a high-level concept, or a specific hypothesis about a user pain or need, before significant development effort. This experiment type is dedicated to reducing uncertainty and risk.
Success criteria:
Clear decision threshold reached: The results provide the necessary confidence to formally invest (move to optimization), pivot (adjust the concept), or kill (stop investment).
Actionable insight: We gain a validated answer to the core learning question (e.g., "Users will pay for X, but only if delivered instantly," or "Users do not understand concept Y").
Quantitative validation: Demand metrics (clicks, sign-ups) exceed a predefined threshold, validating user interest (a minimal threshold check is sketched after the examples below).
Typical examples:
Testing core value prop: Using a "Fake Door" test to assess demand for a new service (e.g., "Instant pay for more visibility").
Measuring desirability: Gauging interest via click-through rates or mock functionality (e.g. profile strength).
Concept comprehension: A/B testing two distinct messaging/design concepts to see which one users understand better (e.g., testing "Profile Strength" vs. "Profile Quality Score").
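To illustrate the predefined-threshold check referenced in the success criteria above, here is a minimal sketch; the 3% click-through bar, the function name, and the example numbers are hypothetical assumptions, not team standards:

```python
# Minimal fake-door demand check (threshold is an assumed example value).

def demand_validated(impressions: int, clicks: int, threshold: float = 0.03) -> bool:
    """True when click-through on the fake door clears the predefined bar."""
    if impressions == 0:
        return False
    return clicks / impressions >= threshold

# 12,000 users saw the "Instant pay" entry point; 480 clicked (CTR 4%).
print(demand_validated(impressions=12_000, clicks=480))  # True
```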
📈 Iterative growth (Optimization)
Purpose: To refine existing features, or build new features/experiences, to drive measurable improvements in key or behavioural metrics through low-risk experiments, often as part of a series of iterations leading to a bigger goal.
Success criteria:
Statistically significant uplift: The primary metric shows a measurable positive increase (e.g., {Conversion Rate} up by 1.5%); a minimal significance check is sketched after the examples below.
Guardrail metrics stable or positive: No negative trade-offs or significant deterioration in guardrail metrics (e.g., {Session Length}, {Error Rate}, or {Behavioural metrics}).
Directional confidence: We gain confidence in directionality, knowing what works and what doesn't.
Positive ROI: The experiment's gain in revenue, retention, or activation exceeds the opportunity cost of the development work.
Typical examples:
Funnel enhancement: Tweaking the steps, content, or interactions in the registration or other key flows.
Design/copy tuning: Tweaking copy, visuals, or placement of key CTAs (e.g. Contact details v1, onboarding toolkit v1 iteration)
Feature evolution: Optimising a feature that can perform better (e.g. trades promise) or iterating on a concept to evolve it further (e.g. from onboarding messages to onboarding assistant)
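For the significance check referenced in the success criteria above, here is a minimal sketch of a one-sided two-proportion z-test using only the Python standard library; the sample counts, the 5% alpha, and the function name are illustrative assumptions:

```python
# Hedged sketch: one-sided two-proportion z-test for a conversion uplift.
# Counts below are hypothetical (control 10.0% vs variant 11.5%, +1.5 pp).
from statistics import NormalDist

def uplift_significant(conv_a: int, n_a: int, conv_b: int, n_b: int,
                       alpha: float = 0.05) -> bool:
    """True when variant B converts significantly better than control A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)          # one-sided: uplift only
    return p_value < alpha

print(uplift_significant(conv_a=2_000, n_a=20_000, conv_b=2_300, n_b=20_000))  # True
```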
🔒 No-harm experiments
Purpose: To safely verify that a necessary new design, system, architectural component, or foundational change can be introduced without causing negative impact to user experience, business metrics, or internal operations. No-harm experiments are driven by higher business goals and focus on de-risking changes of any type prior to user-facing rollout; they are not intended to prove immediate user or business value.
Success criteria:
No degradation in key user or business metrics (conversion, activation, retention, etc.); a minimal non-inferiority check is sketched after the examples below
No increase in system errors, latency, or operational incidents
No increase in customer support contacts or complaints
Typical examples:
Improving caching mechanisms (e.g. job details to show accurate discount amount)
Deploying a backend service that supports future features
Testing infrastructure changes
Testing technical feasibility
Validating that a new design can be introduced without causing negative impact (e.g. Contacts to Inbox, new job details UI)
Rolling out a rebrand
Increasing adherence to design system components
Migrating UI frameworks, accessibility updates, or fixing design debt that changes layout or flows.
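The non-inferiority check referenced above can be sketched as follows: a guardrail "passes" only when the worst plausible effect stays inside a pre-agreed harm margin. The 0.5 percentage-point margin, the counts, and the function name are hypothetical assumptions:

```python
# Illustrative non-inferiority check for a no-harm rollout (assumed margin).
from statistics import NormalDist

def no_harm(conv_ctrl: int, n_ctrl: int, conv_new: int, n_new: int,
            margin: float = 0.005, alpha: float = 0.05) -> bool:
    """True when we can rule out a conversion drop larger than the margin."""
    p_c, p_n = conv_ctrl / n_ctrl, conv_new / n_new
    se = (p_c * (1 - p_c) / n_ctrl + p_n * (1 - p_n) / n_new) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha)   # one-sided bound
    lower_bound = (p_n - p_c) - z * se    # worst plausible effect
    return lower_bound > -margin

# Example: backend migration, ~30k users per arm, 0.5 pp harm margin.
print(no_harm(conv_ctrl=3_050, n_ctrl=30_000, conv_new=3_040, n_new=30_000))  # True
```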
🚀 Full feature rollout experiments
Purpose: To validate a complete, high-impact product or feature intended for rollout, expected to move multiple key metrics positively and deliver clear business/user value. For these experiments, confidence has been built over time through learning experiments and/or user research.
Success criteria:
Statistically significant improvement in the primary metric and a positive uplift in guardrail metrics.
Clear improvements across behavioural metrics.
Clear user adoption and engagement with the new feature.
No major negative trade-offs (e.g. improved conversion but lower satisfaction).
Fit for scale: The experiment validates the feature for full, permanent rollout and long-term maintenance (a minimal decision gate is sketched after the examples below).
Typical examples:
Launching a new recommendation system (e.g. similar leads)
Introducing a new onboarding experience (e.g. onboarding toolkit)
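As an illustration of the "fit for scale" decision gate referenced above, here is a minimal sketch; the metric names, the 20% adoption bar, and the function name are hypothetical, not team standards:

```python
# Hypothetical rollout gate: ship only when the primary metric wins,
# every guardrail holds, and adoption clears a pre-agreed bar.

def fit_for_scale(primary_significant: bool,
                  guardrail_deltas: dict[str, float],
                  adoption_rate: float,
                  min_adoption: float = 0.20) -> bool:
    guardrails_ok = all(delta >= 0 for delta in guardrail_deltas.values())
    return primary_significant and guardrails_ok and adoption_rate >= min_adoption

print(fit_for_scale(
    primary_significant=True,
    guardrail_deltas={"session_length": 0.01, "satisfaction": 0.0},
    adoption_rate=0.27,
))  # True -> recommend permanent rollout
```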
🧑‍🔬 The hypotheses
Every initiative and its corresponding experiment(s) need clear hypotheses to test so that success can be measured unambiguously. These hypotheses are at the core of PRDs. The structure we follow is below:
📌 The "If-Then-Because" Standard
The best practice for hypothesis generation is the scientific, single-sentence format that clearly articulates the action, the measured outcome, and the user-centric reason.
This structure applies to all user-centric experiments:
IF (Action)
Our proposed change (the independent variable).
What we will build/test.
THEN (Prediction)
The expected directional outcome and/or observations.
How the experience and behaviours will shift.
BECAUSE (Rationale)
The underlying user behaviour or insight that drives the prediction.
The user-centric reason/assumption why.
WE WILL KNOW THIS IS TRUE WHEN WE SEE (Success metrics)
The signals and measures that prove the hypothesis right.
How the key, secondary or behavioural metrics will move.
IF we [introduce or simulate a new experience/concept/design], THEN we will learn whether [user group] [understands/values/engages with] it BECAUSE:
It will reveal how users behave (behaviour, engagement insight)
It will reveal how users perceive or interpret [concept] (perception insight)
It will show whether the idea resonates with their goals or pain points (value insight)
It will clarify potential barriers or misconceptions (friction insight)
WE WILL KNOW THIS IS TRUE WHEN WE SEE [qualitative or quantitative signals].
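To make the template concrete, here is a minimal sketch of a hypothesis captured as a structured record for a PRD; the field names and the filled-in "Profile Strength" example are hypothetical:

```python
# Sketch: the If-Then-Because standard as a structured record.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    if_action: str          # IF: the change we make (independent variable)
    then_prediction: str    # THEN: the expected directional outcome
    because_rationale: str  # BECAUSE: the user insight behind the prediction
    success_signals: list[str] = field(default_factory=list)  # WE WILL KNOW...

example = Hypothesis(
    if_action="we add a 'Profile Strength' meter to the dashboard",
    then_prediction="more users will complete their profiles",
    because_rationale="users lack feedback on what a strong profile looks like",
    success_signals=["profile completion rate up", "meter click-through observed"],
)
print(f"IF {example.if_action}, THEN {example.then_prediction}, "
      f"BECAUSE {example.because_rationale}.")
```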