Comparable-strata designer: Continuous-outcome trials (the coefficient-of-variation screen)

Dwyer, William J.

doi:10.5281/zenodo.20709963

In Continuous-outcome trials (the coefficient-of-variation screen), bands are comparable on both the coefficient-of-variation (dispersion) ratio and the outcome level, so a pooled-variance contrast on the log scale is valid. Click the axis to add a cut, drag a cut to move it, click a cut to remove it.

Confidence-interval calculator

Accurate 95% intervals for positive, right-skewed data at small N, built where the model is honest (the log scale, or a gamma GLM's log link) and back-transformed to the natural domain, using the screen's identity σ² = ln(1 + CV²). The eq-7 check selects one of three routes: stay on the natural scale, the log-scale t (log-normal), or a gamma GLM (gamma). For two groups it reports a ratio of means, with its p-value and CI in natural units. Load Example G for a real-data gamma rescue.

Group 1 (or your single sample) — positive numbers, any separators

Group 2 (optional — two-group comparison)

Load an example:

Example G is real public data: serum bilirubin (mg/dL) for patients without vs with ascites in the Mayo Clinic primary-biliary-cirrhosis trial — bilirubin is this paper’s running example, and the small ascites arm reads gamma on the eq-7 check. Dickson et al. (1989); R survival::pbc. Reproduce ↗

Runs in your browser.

What this is doing, and the one thing to watch

Why the log domain. If the data are log-normal, ln(x) is exactly normal, so a t-interval on the logs is valid without the central limit theorem, even at n = 5–10; back-transforming gives an asymmetric natural-domain interval that respects the skew.

The catch. The naive back-transform exp(ȳ ± t·s_y/√n) is a CI for the geometric mean (which equals the median for a log-normal), not the arithmetic mean. For the arithmetic mean this tool uses Cox's method, centred on ȳ + s_y²/2 (Land's method is the exact version).

Heteroscedasticity & the fallback. Constant CV means SD ∝ mean; the log is the variance-stabilising transform for that case. For two groups, ρ = ln(1+CV₂²)/ln(1+CV₁²) selects pooled vs Welch on the logs. When the s_y² ≈ ln(1+CV²) check diverges, or the data contain zeros/negatives, abandon the parametric route for a BCa bootstrap or a wider distribution-free interval.

How to build a confidence interval when data are skewed and N < 30

1. Why the usual interval breaks here

The textbook 95% confidence interval for a mean is

x̄ ± t_{(n−1, .975)} · s / √n

where x̄ is the sample mean, s the sample standard deviation, n the sample size, and t the Student-t critical value. It assumes the sample mean is itself normally distributed. The Central Limit Theorem (CLT) makes that approximately true when n is large; at small n it holds only when the data themselves are close to normal. With strong right-skew and n < 30 the CLT has not arrived: the interval is too short, forced symmetric, and can dip below zero for a positive quantity. So it is not really a 95% interval.

The multiplier, precisely. Because the interval is two-sided, α = 0.05 is split into 0.025 per tail, so t_{(n−1, .975)} is the 0.975 quantile — the value leaving 2.5% in the upper tail — of the t-distribution on n−1 degrees of freedom. We use t rather than z precisely because s is an estimate of σ; that added uncertainty is what gives t its heavier tails, and t → z as n grows. Here s is the sample SD computed with the (n−1) denominator (Bessel's correction) and the standard error is s/√n.

If you have seen the same interval written with z and s/√(N−1), it is not in conflict. The z form is the large-sample approximation to the t form, and the √(N−1) denominator is simply the algebraic equivalent that arises when s is defined with N in the denominator instead of N−1. The two standard errors are mathematically identical:

s_N / √(N−1) = s_(N−1) / √N

So the two notations agree; the more general t form is the one that matters at small n — exactly the regime this calculator is built for.
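A minimal Python sketch of the textbook interval and of the standard-error identity, using the Example A data from the worked example in section 7. The function name naive_t_interval is illustrative only, and the critical value 2.447 is typed in by hand (it is the t_(6,.975) quoted in that example) rather than computed.

```python
import math
import statistics


def naive_t_interval(x, t_crit):
    """Textbook interval: mean +/- t * s / sqrt(n), with s on the (n-1) denominator."""
    n = len(x)
    mean = statistics.fmean(x)
    s = statistics.stdev(x)  # sample SD with Bessel's correction
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Example A from the worked example; 2.447 is the quoted t_(6, .975).
data = [4.2, 5.1, 6.0, 7.3, 9.8, 14.2, 31.0]
n = len(data)
lo, hi = naive_t_interval(data, t_crit=2.447)

# The two standard-error notations give the same number.
se_t_form = statistics.stdev(data) / math.sqrt(n)       # s_(N-1) / sqrt(N)
se_z_form = statistics.pstdev(data) / math.sqrt(n - 1)  # s_N / sqrt(N-1)

print(f"naive 95% CI: [{lo:.2f}, {hi:.2f}]")
print("standard errors agree:", math.isclose(se_t_form, se_z_form))
```

The interval comes out symmetric around the mean, which is the behaviour the rest of this page sets out to replace.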

2. The fix: change the scale

Many positive, skewed quantities (concentrations, costs, durations, biomarkers) are log-normal: their logarithm is normal. If y = ln(x) is exactly normal, a t-interval on the y-values is valid at any sample size — normality is now exact, not approximate. Four steps:

1) y_i = ln(x_i)
2) compute ȳ, s_y
3) ȳ ± t_(n−1,.975) · s_y/√n
4) apply exp( · )

The back-transformed interval is asymmetric in the original units — the upper arm reaches further, as skew demands.
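The four steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the calculator's own code: log_t_interval is a hypothetical name, the t critical value is supplied by the caller, and the data are Example A from section 7.

```python
import math
import statistics


def log_t_interval(x, t_crit):
    """t-interval on ln(x), back-transformed; bounds the geometric mean (median)."""
    if any(v <= 0 for v in x):
        raise ValueError("the log route needs strictly positive data")
    y = [math.log(v) for v in x]                 # step 1
    n = len(y)
    y_bar = statistics.fmean(y)                  # step 2
    s_y = statistics.stdev(y)
    half_width = t_crit * s_y / math.sqrt(n)     # step 3
    # step 4: exponentiate the centre and both limits
    return math.exp(y_bar), math.exp(y_bar - half_width), math.exp(y_bar + half_width)


data = [4.2, 5.1, 6.0, 7.3, 9.8, 14.2, 31.0]
gm, lo, hi = log_t_interval(data, t_crit=2.447)
print(f"geometric mean {gm:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
print("upper arm longer than lower arm:", (hi - gm) > (gm - lo))
```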

3. Which "average" are you bounding? (the key subtlety)

Back-transforming the log-mean does not give the arithmetic mean. exp(ȳ) is the geometric mean, which for a log-normal equals the median. Two different targets:

Median / geometric mean: exp( ȳ ± t · s_y/√n )

Arithmetic mean (Cox): exp( ȳ + s_y²/2 ± t · √( s_y²/n + s_y⁴ / [2(n−1)] ) )

The arithmetic mean of a log-normal is exp(μ + σ²/2), so it needs the extra s_y²/2 term (Land's method is the exact version). Reporting the back-transformed median as the mean is the most common error — this tool shows both, labelled.
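A short sketch of the Cox formula above, again on the Example A data. The name cox_interval is hypothetical and the t critical value is passed in by hand; the code only transcribes the formula as written on this page and makes no claim to match the calculator's internals.

```python
import math
import statistics


def cox_interval(x, t_crit):
    """Cox interval for the arithmetic mean of log-normal data."""
    y = [math.log(v) for v in x]
    n = len(y)
    y_bar = statistics.fmean(y)
    var_y = statistics.variance(y)               # s_y squared, (n-1) denominator
    centre = y_bar + var_y / 2.0                 # log of the arithmetic mean
    se = math.sqrt(var_y / n + var_y ** 2 / (2.0 * (n - 1)))
    half_width = t_crit * se
    return math.exp(centre), math.exp(centre - half_width), math.exp(centre + half_width)


data = [4.2, 5.1, 6.0, 7.3, 9.8, 14.2, 31.0]
mean_est, lo, hi = cox_interval(data, t_crit=2.447)
print(f"arithmetic mean (Cox) {mean_est:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The centre sits above the geometric mean by the factor exp(s_y²/2), which is the gap between the two targets.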

4. Where the coefficient of variation comes in (the screen)

CV = s / x̄ is a unit-free measure of relative spread. For a log-normal there is an exact bridge to the log-scale variance:

σ² = ln( 1 + CV² )

A large CV ⇒ large log-scale spread ⇒ strong skew, which is why the CV is the right trigger. It also gives a free honesty check: estimate the log variance directly as s_y² and from the CV as ln(1+CV²), and compare. A big gap warns the data are not log-normal and the back-transformed interval should not be trusted.
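The honesty check amounts to computing the log-scale variance two ways. A minimal sketch, assuming Example A from section 7 as input; cv_screen is an illustrative name, and the page does not state a numeric threshold for a "big gap", so the code only reports the two values and their relative gap.

```python
import math
import statistics


def cv_screen(x):
    """Return (s_y^2 measured on the logs, ln(1 + CV^2) implied by the CV)."""
    cv = statistics.stdev(x) / statistics.fmean(x)
    direct = statistics.variance([math.log(v) for v in x])
    implied = math.log1p(cv ** 2)                # ln(1 + CV^2)
    return direct, implied


data = [4.2, 5.1, 6.0, 7.3, 9.8, 14.2, 31.0]
direct, implied = cv_screen(data)
relative_gap = abs(direct - implied) / implied
print(f"s_y^2 = {direct:.3f}, ln(1+CV^2) = {implied:.3f}, relative gap = {relative_gap:.0%}")
```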

A gamma companion (the three-way check). The same CV implies a second reference value under the gamma model: ψ′(1/CV²) (trigamma). The two straddle CV² — log-normal ln(1+CV²) just below, gamma ψ′(1/CV²) just above (agreeing to order CV², splitting at order CV⁴) — so comparing s_y² to both makes the check three-way: nearer ln(1+CV²) → log-normal (log-scale t); nearer ψ′(1/CV²) → gamma (a gamma GLM with a log link, no back-transform bias); outside both → heavier/lighter tail, use the bootstrap. The eq-7 line reports the verdict. (See the Polygamma Bridge derivation.)
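A hedged sketch of the three-way comparison. The standard library has no trigamma, so the code uses the usual recurrence ψ′(x) = ψ′(x+1) + 1/x² followed by the asymptotic series. The function three_way_check and its tol parameter are assumptions made for this illustration: the page does not say how far outside the two reference values s_y² must fall before the verdict becomes "bootstrap", so a 25% margin is used as a placeholder.

```python
import math
import statistics


def trigamma(x):
    """psi'(x) for x > 0: shift upward by recurrence, then asymptotic series."""
    if x <= 0:
        raise ValueError("x must be positive")
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    series = inv + inv2 / 2.0 + inv * inv2 * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))
    return total + series


def three_way_check(x, tol=0.25):
    """Compare s_y^2 with the log-normal and gamma reference values (tol is a placeholder)."""
    s_y2 = statistics.variance([math.log(v) for v in x])
    cv2 = (statistics.stdev(x) / statistics.fmean(x)) ** 2
    ref_lognormal = math.log1p(cv2)              # just below CV^2
    ref_gamma = trigamma(1.0 / cv2)              # just above CV^2
    if s_y2 < ref_lognormal * (1 - tol) or s_y2 > ref_gamma * (1 + tol):
        verdict = "bootstrap"
    elif abs(s_y2 - ref_lognormal) <= abs(s_y2 - ref_gamma):
        verdict = "log-normal"
    else:
        verdict = "gamma"
    return verdict, cv2, ref_lognormal, ref_gamma


data = [4.2, 5.1, 6.0, 7.3, 9.8, 14.2, 31.0]
verdict, cv2, ref_lognormal, ref_gamma = three_way_check(data)
print(verdict, round(ref_lognormal, 3), round(cv2, 3), round(ref_gamma, 3))
```

The printed references should bracket CV², with the log-normal value on the low side and the gamma value on the high side.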

5. Heteroscedasticity and comparing two groups

A constant CV means the SD grows with the mean (SD ∝ mean) — heteroscedasticity. The log transform stabilises exactly that. For two groups, compare log-scale variances with

ρ = ln(1 + CV₂²) / ln(1 + CV₁²)

ρ ≈ 1 → equal log spread → pooled (Student) t on the logs; ρ far from 1 with unequal group sizes → Welch t on the logs. The result is a ratio of geometric means, exp( (ȳ₂ − ȳ₁) ± t · SE ) — for skewed data a ratio is meaningful where a difference is not.
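A sketch of the two-group rule, with several assumptions flagged. The page does not define how close to 1 counts as "ρ ≈ 1", so the band (0.5, 2.0) is a placeholder; compare_on_logs and ratio_interval are hypothetical names; and the second group is made up (Example A doubled) purely so the true ratio is known to be 2. The t critical value is supplied by the caller: 2.179 is the standard table value for 12 degrees of freedom.

```python
import math
import statistics


def cv_squared(x):
    return (statistics.stdev(x) / statistics.fmean(x)) ** 2


def compare_on_logs(g1, g2, rho_band=(0.5, 2.0)):
    """Pick pooled or Welch on the logs from rho; return estimate pieces."""
    rho = math.log1p(cv_squared(g2)) / math.log1p(cv_squared(g1))
    y1 = [math.log(v) for v in g1]
    y2 = [math.log(v) for v in g2]
    n1, n2 = len(y1), len(y2)
    v1, v2 = statistics.variance(y1), statistics.variance(y2)
    diff = statistics.fmean(y2) - statistics.fmean(y1)
    # Welch only when rho is far from 1 AND the group sizes differ.
    if not (rho_band[0] <= rho <= rho_band[1]) and n1 != n2:
        a, b = v1 / n1, v2 / n2
        se = math.sqrt(a + b)
        df = (a + b) ** 2 / (a * a / (n1 - 1) + b * b / (n2 - 1))  # Welch-Satterthwaite
        method = "welch"
    else:
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
        df = n1 + n2 - 2
        method = "pooled"
    return {"rho": rho, "method": method, "diff": diff, "se": se, "df": df}


def ratio_interval(result, t_crit):
    """Ratio of geometric means with its back-transformed interval."""
    half_width = t_crit * result["se"]
    d = result["diff"]
    return math.exp(d), math.exp(d - half_width), math.exp(d + half_width)


group1 = [4.2, 5.1, 6.0, 7.3, 9.8, 14.2, 31.0]   # Example A
group2 = [2 * v for v in group1]                 # hypothetical: Example A doubled
result = compare_on_logs(group1, group2)
ratio, lo, hi = ratio_interval(result, t_crit=2.179)  # t_(12, .975)
print(result["method"], round(result["rho"], 3), round(ratio, 3), round(lo, 2), round(hi, 2))
```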

5b. The gamma rescue, and reading either rescue in the natural domain

If the eq-7 check points to gamma rather than log-normal, the logarithm no longer normalizes the data, so the matched rescue is not the log-scale t but a gamma generalized linear model with a log link. It models the mean directly, so the contrast is a ratio of arithmetic means, exp(β), with no back-transformation bias. Its variance, Var(x) = φμ² (the variance function V(μ) = μ² scaled by the dispersion φ), is exactly the SD ∝ mean spread the CV screen detected. Inference is by analysis of deviance (an F-test).
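For the special case of one two-level group indicator, the gamma log-link fit has a closed form: the fitted means are the two sample means, so exp(β̂) is simply the ratio of sample means. The sketch below uses that shortcut with a Pearson estimate of φ and a Wald-type interval on the link scale. That interval is a simplification swapped in for brevity, not the analysis-of-deviance F-test named above, and nothing here is taken from the calculator's code. The name gamma_two_group is hypothetical, and the second group is made up (Example A doubled).

```python
import math
import statistics


def gamma_two_group(g1, g2, t_crit):
    """Two-group gamma GLM (log link) via its closed form; Wald-type interval."""
    n1, n2 = len(g1), len(g2)
    mu1, mu2 = statistics.fmean(g1), statistics.fmean(g2)
    beta = math.log(mu2 / mu1)                   # log ratio of arithmetic means
    # Pearson dispersion: squared relative residuals over residual df.
    pearson = (sum(((v - mu1) / mu1) ** 2 for v in g1)
               + sum(((v - mu2) / mu2) ** 2 for v in g2))
    phi = pearson / (n1 + n2 - 2)
    # With a log link and V(mu) = mu^2 the working weights are 1,
    # so Var(beta) = phi * (1/n1 + 1/n2).
    se = math.sqrt(phi * (1.0 / n1 + 1.0 / n2))
    half_width = t_crit * se
    return {"ratio": math.exp(beta), "phi": phi,
            "lo": math.exp(beta - half_width), "hi": math.exp(beta + half_width)}


group1 = [4.2, 5.1, 6.0, 7.3, 9.8, 14.2, 31.0]   # Example A
group2 = [2 * v for v in group1]                 # hypothetical: Example A doubled
fit = gamma_two_group(group1, group2, t_crit=2.179)  # t_(12, .975)
print(round(fit["ratio"], 3), round(fit["phi"], 3), round(fit["lo"], 2), round(fit["hi"], 2))
```

In this constructed case the Pearson φ̂ equals the squared CV of Example A, which shows concretely how the dispersion parameter and the CV screen measure the same thing.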

Carrying either rescue back to the natural domain. Each rescue is computed where the model is honest, then returned to original units:

p-value: tested on the log / link scale, but H₀ “ratio of means = 1” ⇔ “difference of log-means = 0” — so the rescued p is the natural-domain p (no transform).

interval: formed on the log / link scale, then exp( · ) → an asymmetric natural-domain ratio. Log-normal → ratio of geometric means; gamma → ratio of arithmetic means.

Example G (real data — serum bilirubin by ascites) shows both side by side: the naive raw-scale t gives p ≈ 5×10⁻⁴; the log-normal rescue returns a geometric-mean ratio 3.45 (95% CI 2.11–5.64, p ≈ 2×10⁻⁵), and the gamma rescue an arithmetic-mean ratio 3.33 (95% CI 1.90–5.81, p ≈ 9×10⁻⁷). Same data, two honest natural-domain effects — a median ratio and a mean ratio — each sharper than the naive interval.

6. When NOT to use the log route

If any value is zero or negative, ln() is undefined. If the data are skewed but not log-normal (the s_y² vs ln(1+CV²) check disagrees, or the tail is heavier than log-normal), use a shape-free method: a bias-corrected and accelerated (BCa) bootstrap, or a wider distribution-free interval. You trade precision for honesty.
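A compact sketch of a BCa bootstrap interval for the mean, using only the standard library. It follows the textbook construction (bias correction z0 from the bootstrap distribution, acceleration from the jackknife); the function name, the replicate count, and the fixed seed are choices made for this illustration and are not taken from the tool.

```python
import random
import statistics
from statistics import NormalDist


def bca_interval(x, stat=statistics.fmean, n_boot=4000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated bootstrap interval for stat(x)."""
    x = list(x)
    n = len(x)
    rng = random.Random(seed)
    theta_hat = stat(x)
    boot = sorted(stat(rng.choices(x, k=n)) for _ in range(n_boot))
    nd = NormalDist()

    # Bias correction: where the estimate sits in the bootstrap distribution.
    frac = sum(b < theta_hat for b in boot) / n_boot
    frac = min(max(frac, 1.0 / (n_boot + 1)), n_boot / (n_boot + 1.0))
    z0 = nd.inv_cdf(frac)

    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [stat(x[:i] + x[i + 1:]) for i in range(n)]
    j_bar = statistics.fmean(jack)
    num = sum((j_bar - j) ** 3 for j in jack)
    den = 6.0 * sum((j_bar - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))

    def pick(q):
        idx = min(max(int(q * n_boot), 0), n_boot - 1)
        return boot[idx]

    return pick(adjusted_quantile(alpha / 2)), pick(adjusted_quantile(1 - alpha / 2))


data = [4.2, 5.1, 6.0, 7.3, 9.8, 14.2, 31.0]   # Example A
lo, hi = bca_interval(data)
print(f"BCa 95% CI for the mean: [{lo:.2f}, {hi:.2f}]")
```

Unlike the log route, this runs unchanged on data containing zeros, though at n = 7 any bootstrap interval rests on very few distinct values.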

7. Worked example (Example A, n = 7)

Data: 4.2, 5.1, 6.0, 7.3, 9.8, 14.2, 31.0. t_(6,.975) = 2.447.

Natural: mean 11.09, SD 9.41, CV 0.85, skew 2.0 → naive CI 11.09 ± 2.447×9.41/√7 = [2.38, 19.79] (symmetric).

Logs: ȳ = 2.173, s_y = 0.690. Geometric mean exp(2.173)=8.79 → CI [4.64, 16.63] (asymmetric).

Arithmetic mean (Cox): exp(2.173+0.690²/2)=11.15 → CI [5.42, 22.91].

Check: s_y² = 0.475 vs ln(1+0.85²) = 0.543 — close, so log-normal is reasonable.

In one sentence: small-N accuracy for skewed data is bought from a distributional assumption (log-normality) instead of from sample size (the CLT) — so the tool checks that assumption and tells you to fall back to a bootstrap when it fails.

User guide

About this planner

A positive continuous outcome (concentration, cost, duration, biomarker); bands must match in relative spread so a pooled-variance comparison is valid.

Comparability checks for this data type

Coefficient-of-variation (CV) ratio — SD/mean; the band's max/min CV must stay within tolerance (the dispersion screen)
Outcome level — the band's max/min mean-outcome ratio
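The two checks can be sketched as max/min ratios computed within each band. Everything below is illustrative: the function names, the convention that a cut is the index where a new band starts, the tolerance of 1.5, and the per-level CVs and means are all made up for the example and are not the planner's defaults.

```python
def split_bands(n_levels, cuts):
    """Turn cut positions into (start, stop) index ranges over the ordered levels."""
    edges = [0] + sorted(cuts) + [n_levels]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b > a]


def score_bands(cvs, means, cuts, cv_tol=1.5, level_tol=1.5):
    """Score each band on the CV ratio and the outcome-level ratio."""
    rows = []
    for start, stop in split_bands(len(cvs), cuts):
        cv_ratio = max(cvs[start:stop]) / min(cvs[start:stop])
        level_ratio = max(means[start:stop]) / min(means[start:stop])
        ok = cv_ratio <= cv_tol and level_ratio <= level_tol
        rows.append({"levels": (start, stop - 1),
                     "cv_ratio": cv_ratio,
                     "level_ratio": level_ratio,
                     "status": "comparable" if ok else "review"})
    return rows


# Hypothetical per-level summaries for five ordered index levels.
cvs = [0.50, 0.55, 0.60, 1.20, 1.30]
means = [10.0, 11.0, 12.0, 30.0, 33.0]

one_band = score_bands(cvs, means, cuts=[])
two_bands = score_bands(cvs, means, cuts=[3])
for row in one_band + two_bands:
    print(row)
```

With no cut the single band fails both checks; one cut after the third level leaves two bands that each pass, which is the pooling-induced situation a split resolves.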

Rescue analysis (when a band cannot be made comparable by splitting): a scale-free / Welch (unequal-variance) analysis on the log scale, using each level's own dispersion.

What this tool does

This planner helps you split an ordered index (a covariate, dose, risk level, or — for a group-sequential design — the information fraction) into comparable strata (bands): groups that are alike on the statistical properties that matter, so a single summary effect per band is well defined and the analysis stays calibrated. You design the bands; the tool scores them and suggests a comparable set.

Three ways to load data

Parameters — dial in a published or assumed profile from the sliders and defaults. See the Reference dataset panel above for its source and a live link.
Paste summary — one row per index level with its summary statistics.
Paste pilot — raw pilot observations; the tool fits a model (scoring candidate distributions by AIC where shown) and estimates the per-level statistics.

Reading the charts

Outcome value — the distribution of the outcome (a density or mass function and its cumulative form), with numeric axis scales, so you can see how the data look.
Comparability axes — one panel per check, showing that check's statistic across the index, with your band cut-points overlaid. Green shading marks a comparable band; red marks one that needs review.

Designing the bands

On the index axis you place cut-points (or, for a group-sequential schedule, interim looks): click the axis to add one, drag it to move it, and click an existing one to remove it. Suggest comparable bands computes the comparability-maximizing set automatically. The tolerance sliders make each check stricter or looser; where a planned-N control is shown, the relevant check is sample-size–dependent, so the bands shift as you change N.

Reading the band table and the diagnosis

Each band is scored on every check and marked comparable or review. When a band is flagged, the tool diagnoses why and what to do, distinguishing two cases:

Decomposable (pooling-induced) — the band pools levels that are each acceptable on their own but differ from one another. The fix is structural: decouple (split the band) — no change of analysis.
Irreducible (intrinsic) — a single level fails the check on its own. Re-banding cannot fix it (splitting only isolates it); the fix is analytic — switch that stratum to the rescue analysis named above for this data type.

Key terms

Stratum / band — a contiguous group of index levels analysed together.
Comparability check — a statistic that must stay within tolerance across a band for it to be “comparable.”
Tolerance — the threshold a check may not exceed (a maximum ratio or spread).
Primary vs secondary check — the primary axis defines the sub-optimal single-axis banding; the additional checks tighten it.
Decouple — split a band so each piece is internally homogeneous.
Rescue analysis — the alternative method to use when a band cannot be made comparable by re-banding.