Most experiments don't fail because of statistics. They fail because the definitions, tracking, and guardrails are fuzzy.
An A/B test is supposed to reduce guesswork. In practice it often becomes a confidence machine: see a green number, ship, move on. Two weeks later performance reverts to baseline, and everyone pretends it was "seasonality". It usually wasn't. Usually the measurement was leaking.
If you want honest experiments, the key isn't becoming a statistics professor. The key is: decision first, definitions locked, tracking verified, guardrails in place.
Start with the decision, not the variant
Write one sentence before anyone touches design or copy: "If B wins, we will ____."
If you can't fill in the blank, it's not an experiment. It's entertainment with charts.
Decision template:
If Variant B wins, we will:
- Roll out to 100% / Iterate again / Kill the idea
"Win" criteria:
- Primary metric up >= [X]% within [N] days
- Guardrails healthy (no threshold breached)
Pick 1 primary metric. One.
Teams love the "metric buffet": track 7 metrics, then celebrate whichever happens to look good. That's not learning, that's shopping.
Quick mapping examples:
Onboarding change → Activation Rate (7 days)
Pricing page change → Checkout Start Rate
Paywall change → Paid Conversion Rate
Primary metric definition (lock it in a doc, not in vibes):
primary_metric:
  name: activation_rate_7d
  definition: "% of signups that fire >=1 key_action within 7 days of signup"
  unit: user
  numerator_event: key_action
  denominator_event: signup_completed
  window_days: 7
  filters: [exclude_internal, exclude_test, exclude_bots]
Add guardrails (so you can "win" without wrecking the house)
An experiment can lift the primary metric while quietly breaking the product. Guardrails are the handbrake.
Pick just 2–3:
Refund/chargeback rate (monetization)
Crash/error rate (product)
Latency/perf (web)
Complaint/ticket volume (if you have one)
Guardrail spec:
guardrails:
  - name: crash_rate
    threshold: "<= +10% vs control"
  - name: refund_rate
    threshold: "<= +5% vs control"
  - name: p95_latency_ms
    threshold: "<= +50ms vs control"
Verify measurement before you start (pre-flight QA)
This is where a lot of "wins" are born: double-fired events, missing properties, identity that doesn't stitch, exposure firing more than once.
A minimal schema that makes the experiment auditable:
create table if not exists experiment_assignments (
  experiment_key text not null,
  user_id text not null,
  variant text not null, -- 'A' or 'B'
  assigned_at timestamptz not null,
  assignment_id text, -- optional idempotency key
  primary key (experiment_key, user_id)
);
create table if not exists experiment_exposures (
  experiment_key text not null,
  user_id text not null,
  variant text not null,
  exposure_time timestamptz not null,
  source text,
  primary key (experiment_key, user_id, exposure_time)
);
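Populating experiment_assignments consistently is easier when assignment is deterministic instead of random per request. A minimal sketch, assuming a SHA-256 hash over experiment_key:user_id; assign_variant and the 50/50 split are illustrative, not from the original setup:
import hashlib
def assign_variant(experiment_key: str, user_id: str, treatment_pct: int = 50) -> str:
    """Deterministic bucketing: the same (experiment_key, user_id) always
    lands in the same variant, so retries and re-renders can't flip users."""
    key = f"{experiment_key}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "B" if bucket < treatment_pct else "A"
print(assign_variant("exp_onboarding_v2", "user_123"))  # same input, same output
Because the hash is stable, repeated calls can't move a user between variants, which is exactly what Check 2 below verifies.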
Check 1: exposures should fire once per user (or at least stay "reasonable")
select
  experiment_key,
  variant,
  count(*) as exposure_events,
  count(distinct user_id) as exposed_users,
  (count(*)::numeric / nullif(count(distinct user_id), 0)) as exposures_per_user
from experiment_exposures
where experiment_key = 'exp_onboarding_v2'
  and exposure_time >= now() - interval '7 days'
group by 1, 2
order by 1, 2;
Check 2: assignments are consistent (no user jumps variants)
select
  experiment_key,
  user_id,
  count(distinct variant) as variant_count
from experiment_assignments
where experiment_key = 'exp_onboarding_v2'
group by 1, 2
having count(distinct variant) > 1
limit 50;
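These two checks don't cover the other leaks listed above: double-fired events and missing properties. A rough Python sketch of that extra QA pass, assuming exposure events exported as dicts; the field names and the 5-second window are assumptions, tune them to your schema:
from datetime import datetime, timedelta
REQUIRED_PROPS = {"experiment_key", "user_id", "variant"}  # adjust to your schema
DOUBLE_FIRE_WINDOW = timedelta(seconds=5)  # same event twice within 5s is suspicious
def preflight_event_qa(events):
    """events: dicts with exposure_time as datetime. Returns (missing, doubles)."""
    missing_props, double_fires = [], []
    last_seen = {}
    for ev in sorted(events, key=lambda e: e["exposure_time"]):
        gaps = REQUIRED_PROPS - ev.keys()
        if gaps:
            missing_props.append((ev.get("user_id"), sorted(gaps)))
        key = (ev.get("experiment_key"), ev.get("user_id"))
        prev = last_seen.get(key)
        if prev is not None and ev["exposure_time"] - prev <= DOUBLE_FIRE_WINDOW:
            double_fires.append(key)
        last_seen[key] = ev["exposure_time"]
    return missing_props, double_fires
events = [
    {"experiment_key": "exp_onboarding_v2", "user_id": "u1", "variant": "B",
     "exposure_time": datetime(2025, 1, 1, 10, 0, 0)},
    {"experiment_key": "exp_onboarding_v2", "user_id": "u1", "variant": "B",
     "exposure_time": datetime(2025, 1, 1, 10, 0, 2)},  # fires again 2s later
]
print(preflight_event_qa(events))  # -> ([], [('exp_onboarding_v2', 'u1')])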
Three classic experiment lies (always suspect them)
Lie 1: Sample Ratio Mismatch (SRM)
You targeted 50/50 but got 60/40. That's not "minor". It's a red flag: check assignment, targeting, caching, exclusions.
SRM count:
select
  variant,
  count(*) as assigned_users
from experiment_assignments
where experiment_key = 'exp_onboarding_v2'
  and assigned_at >= now() - interval '7 days'
group by 1
order by 1;
Quick SRM test (Python, chi-square against 50/50):
import math
def chi2_srm(a, b):
    n = a + b
    exp = n / 2
    chi2 = (a - exp) ** 2 / exp + (b - exp) ** 2 / exp
    # approximate p-value for df=1
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
a, b = 6000, 4000  # replace with the numbers from your query
chi2, p = chi2_srm(a, b)
print("chi2:", round(chi2, 3), "p:", p)
Lie 2: Novelty spike
Days 1–2 look amazing, then it slides. Don't ship on day-2 adrenaline.
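A cheap way to see the spike in numbers instead of vibes: compare the uplift in the first days against the rest of the window. A hedged sketch, assuming daily conversion rates per variant are already pulled; novelty_check and the sample numbers are illustrative:
def novelty_check(daily_rates, early_days=2):
    """daily_rates: ordered list of (day, rate_A, rate_B).
    Returns average relative uplift for the first early_days vs the rest."""
    def avg_uplift(rows):
        uplifts = [(b - a) / a for _, a, b in rows if a > 0]
        return sum(uplifts) / len(uplifts) if uplifts else 0.0
    return avg_uplift(daily_rates[:early_days]), avg_uplift(daily_rates[early_days:])
early, late = novelty_check([
    (1, 0.20, 0.26), (2, 0.21, 0.26),                    # hot start
    (3, 0.20, 0.21), (4, 0.21, 0.21), (5, 0.20, 0.205),  # back to earth
])
print(f"early uplift: {early:+.1%}, late uplift: {late:+.1%}")
A big early uplift with a late uplift collapsing toward zero is novelty, not a win.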
Lie 3: The metric moves, the meaning doesn't
"Conversion is up", but the event definition changed or a funnel step stopped firing. Always check raw counts plus step-level sanity.
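A sketch of that raw-count sanity check, assuming daily event counts per step pulled from the warehouse; the 50% drop threshold and 7-day baseline are arbitrary defaults, not a standard:
def flag_broken_steps(daily_counts, drop_threshold=0.5, baseline_days=7):
    """daily_counts: ordered list of (date, count) for one event/step.
    Flags days where volume drops below drop_threshold * trailing average,
    which usually means the event stopped firing, not that users changed."""
    flags = []
    for i in range(baseline_days, len(daily_counts)):
        baseline = sum(c for _, c in daily_counts[i - baseline_days:i]) / baseline_days
        date, count = daily_counts[i]
        if baseline > 0 and count < drop_threshold * baseline:
            flags.append((date, count, round(baseline)))
    return flags
counts = [("d%d" % i, 1000) for i in range(1, 8)] + [("d8", 350)]  # step breaks on d8
print(flag_broken_steps(counts))  # -> [('d8', 350, 1000)]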
Compute the primary metric per variant (warehouse-first)
Example: 7-day Activation Rate, by signup_day cohort.
with cohort as (
  select
    ea.user_id,
    ea.variant,
    date_trunc('day', u.created_at)::date as signup_day,
    u.created_at as signup_time
  from experiment_assignments ea
  join users u on u.user_id = ea.user_id
  where ea.experiment_key = 'exp_onboarding_v2'
    and ea.assigned_at >= now() - interval '14 days'
    and coalesce(u.is_internal, false) = false
    and coalesce(u.is_test, false) = false
),
activated as (
  select
    c.user_id,
    c.variant,
    c.signup_day
  from cohort c
  join events e
    on e.user_id = c.user_id
    and e.event_name = 'key_action'
    and e.event_time >= c.signup_time
    and e.event_time <= c.signup_time + interval '7 days'
  group by 1, 2, 3
)
select
  c.signup_day,
  c.variant,
  count(distinct a.user_id)::numeric / nullif(count(distinct c.user_id), 0) as activation_rate_7d,
  count(distinct c.user_id) as users_in_cohort
from cohort c
left join activated a
  on a.user_id = c.user_id
  and a.variant = c.variant
  and a.signup_day = c.signup_day
group by 1, 2
order by 1 desc, 2;
Example guardrail query (refund rate):
select
  ea.variant,
  count(*) filter (where e.event_name = 'refund')::numeric
    / nullif(count(*) filter (where e.event_name = 'purchase'), 0) as refund_rate
from experiment_assignments ea
join events e on e.user_id = ea.user_id
where ea.experiment_key = 'exp_paywall_v3'
  and e.event_time >= ea.assigned_at
group by 1;
Don't "peek" until you win
If you check the dashboard every hour, randomness will hand you a "win" sooner or later. Set the duration up front (say 7 or 14 days). Stop early only on a clear break: SRM, a tracking bug, crashes.
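Pre-setting the duration is easier if you size the sample first. A minimal sketch using the standard two-proportion sample-size formula (alpha = 0.05 two-sided, 80% power); the 20% baseline and +10% MDE are placeholders:
import math
def sample_size_per_variant(p_base, mde_rel, z_alpha=1.96, z_power=0.84):
    """Standard two-proportion formula. z_alpha -> alpha=0.05 two-sided,
    z_power -> 80% power. mde_rel is the relative lift you want to detect."""
    p_var = p_base * (1 + mde_rel)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_var - p_base) ** 2)
n = sample_size_per_variant(p_base=0.20, mde_rel=0.10)  # detect +10% on a 20% rate
print(n, "users per variant")  # ~6500
Divide n by daily eligible traffic to get the run length, then lock it in before the experiment starts.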
Segment later, and keep it on a leash
Segments are for explaining results, not manufacturing them. Slice 12 ways and one slice will "win". That's not insight, that's probability. If you must test segments formally, correct for the extra comparisons (see the sketch after the spec below).
Predeclare your segments (max 2–3):
segments_allowed:
  - new_vs_returning
  - platform_web_vs_mobile
  - one_acquisition_source
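And when you do test those segments, raise the bar with every extra slice. A minimal sketch of Holm step-down correction over segment p-values; the three p-values are made up for illustration:
def holm_correction(p_values, alpha=0.05):
    """Holm step-down: controls family-wise error across segment slices.
    Returns a list marking which p-values survive the correction."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # once one fails, every larger p-value fails too
        significant[i] = True
    return significant
print(holm_correction([0.01, 0.04, 0.20]))  # -> [True, False, False]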
Ship a "readout memo", not a chart screenshot
The real output of an experiment is a short memo that makes the decision repeatable.
Readout template:
experiment: exp_onboarding_v2
decision: "If B wins, roll out the new onboarding to 100%."
primary_metric:
  name: activation_rate_7d
  definition_locked: true
guardrails:
  crash_rate: "OK"
  refund_rate: "OK"
quality_checks:
  srm: "PASS"
  exposure_once: "PASS"
  tracking_notes: "No major anomalies detected"
results:
  window: "2025-XX-XX to 2025-XX-XX"
  uplift: "+[X]%"
  confidence_note: "No early peeking; duration pre-set"
next_action: "Rollout 25% -> 50% -> 100% with monitoring"
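To fill the memo's uplift and confidence fields, a minimal sketch using a two-proportion z-test (normal approximation); readout_stats and the counts are illustrative, swap in your own query output:
import math
def readout_stats(conv_a, n_a, conv_b, n_b):
    """Relative uplift plus a rough two-sided p-value
    (two-proportion z-test, normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    uplift = (p_b - p_a) / p_a
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return uplift, p_value
uplift, p = readout_stats(conv_a=980, n_a=5000, conv_b=1080, n_b=5000)
print(f"uplift: {uplift:+.1%}, p: {p:.4f}")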
Most A/B tests are really "measurement tests" first. If tracking isn't stable yet, the experiment is expensive fiction. The fastest way to raise experiment velocity isn't running more tests; it's making the numbers trustworthy enough that results can be decided without a 45-minute debate.
If your experiments feel noisy, political, or inconsistent, suspect the same three things: definitions, tracking QA, guardrails. Only then let statistics do its job.