clophfit.testing.synthetic

Synthetic data generation for testing and benchmarking.

This module provides a unified API for generating synthetic pH titration datasets with characteristics matching real experimental data from Tecan plate readers.

Primary functions:

- make_dataset: Unified function for all synthetic data generation
- make_simple_dataset: Simplified interface for unit tests

Classes

TruthParams

Ground truth parameters for synthetic data.

Functions

`make_dataset`([k, s0, s1, is_ph, seed, rng, n_labels, ...])	Generate synthetic pH/Cl titration data with configurable complexity.
`make_simple_dataset`(k, s0, s1, *, is_ph[, noise, ...])	Create a simple synthetic Dataset for unit tests.
`make_benchmark_dataset`([k, n_labels, n_points, ...])	Generate synthetic data for fitter benchmarking.

Module Contents

class clophfit.testing.synthetic.TruthParams

Ground truth parameters for synthetic data.

clophfit.testing.synthetic.make_dataset(k=None, s0=None, s1=None, *, is_ph=True, seed=None, rng=None, n_labels=2, randomize_signals=False, error_model='realistic', noise=0.02, y_err=None, rel_error=0.035, min_error=1.0, buffer_sd=50.0, error_ratio=1.0, low_ph_drop=False, low_ph_drop_magnitude=0.4, low_ph_drop_label='1', n_low_ph_drops=1, saturation_prob=0.0, x_error_large=0.0, x_systematic_offset=0.0, rel_x_err=0.01, n_points=None)

Generate synthetic pH/Cl titration data with configurable complexity.

This is the unified function for all synthetic data generation. It supports per-label error scaling, randomization of signal parameters based on real experimental data distributions, and realistic artifacts like low-pH drops.

Parameters:

k (float | None) – Equilibrium constant (pKa for pH, Kd for Cl). If None and randomize_signals is True, sampled from real data distribution.
s0 (dict[str, float] | float | None) – Signal at unbound state. Use dict for multiple labels: {"1": 700, "2": 1000}. If None and randomize_signals is True, sampled from real data distribution.
s1 (dict[str, float] | float | None) – Signal at bound state. Use dict for multiple labels: {"1": 1200, "2": 200}. If None and randomize_signals is True, sampled from real data distribution.
is_ph (bool) – True for pH titration, False for Cl titration.
seed (int | None) – Random seed for reproducibility. If None (default), the generator is not seeded, so the data differ between calls.
rng (np.random.Generator | None) – Pre-existing random generator (overrides seed if provided).
n_labels (int) – Number of labels (1 or 2). Only used when randomize_signals=True.
randomize_signals (bool) – If True, randomize K, S0, S1 from real L4 data distributions when not provided. Creates y1/y2 dual-channel data with realistic signal magnitudes and ranges.
error_model (str) – Error model to use:

- "simple": Constant noise as fraction of dynamic range (uses noise).
- "uniform": Constant absolute error per label (uses y_err).
- "realistic": Relative error with floor (uses rel_error, min_error).
- "physics": Shot noise + buffer noise (uses buffer_sd, error_ratio).
- "tecan": Calibrated model from MCMC posterior on real L2 data: err = sqrt(gain * signal + sigma_read^2 + (alpha * signal)^2).
  - y1: gain=1.78, sigma_read=8.6, alpha=5.9% → ~49 counts at signal=600.
  - y2: gain≈0, sigma_read=9.78, alpha≈0.2% → ~10 counts (uniform).
noise (float | dict[str, float]) – For "simple" model: relative noise as fraction of dynamic range. Use dict for per-label: {"1": 0.05, "2": 0.02}.
y_err (float | dict[str, float] | None) – For "uniform" model: constant absolute error per label. Use dict for per-label: {"1": 10.0, "2": 3.0}.
rel_error (float | dict[str, float]) – For "realistic" model: relative error as fraction of signal. Use dict for per-label: {"1": 0.07, "2": 0.025} gives a roughly 3x y1/y2 error ratio.
min_error (float | dict[str, float]) – For "realistic" model: minimum error floor (instrument noise).
buffer_sd (float | dict[str, float]) – For "physics" model: base buffer SD where err = sqrt(signal + buffer_sd^2). For y2, scaled by error_ratio if not a dict.
error_ratio (float) – For two-label physics model: ratio of the label "2" buffer SD to the label "1" buffer SD. 1.0 = equal errors, 0.2 = y2 has 1/5 the error of y1.
low_ph_drop (bool) – Simulate acidic tail collapse at lowest pH (realistic artifact).
low_ph_drop_magnitude (float) – Fraction of signal to drop at lowest pH (0-1).
low_ph_drop_label (str) – Which label to apply the pH drop to ("1" or "2").
n_low_ph_drops (int) – Number of low-pH points to drop (default 1). Points are selected by ascending pH (lowest first).
saturation_prob (float) – Probability of masking points (saturation).
x_error_large (float) – Additional random x-error (pH units).
x_systematic_offset (float) – Systematic x-offset (pH units).
rel_x_err (float) – Relative x-error for Cl titrations (ignored for pH).
n_points (int | None) – Number of pH points. If None, uses L2_PH_VALUES (7 points). If specified, generates evenly-spaced pH from 5.5 to 9.0.
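The "tecan" formula and the per-label constants quoted in the parameter list can be checked numerically. The sketch below is not the library's implementation: the function names are hypothetical, gain≈0 for y2 is taken as exactly 0, and the "realistic" floor is assumed to be applied with max(), which the description does not state explicitly.

```python
import math

# Per-label constants as quoted for the "tecan" model (alpha given in percent above).
# gain for label "2" is documented as approximately 0; exactly 0.0 is an assumption.
TECAN_PARAMS = {
    "1": {"gain": 1.78, "sigma_read": 8.6, "alpha": 0.059},
    "2": {"gain": 0.0, "sigma_read": 9.78, "alpha": 0.002},
}


def tecan_error(signal: float, label: str) -> float:
    """err = sqrt(gain * signal + sigma_read^2 + (alpha * signal)^2)."""
    p = TECAN_PARAMS[label]
    variance = p["gain"] * signal + p["sigma_read"] ** 2 + (p["alpha"] * signal) ** 2
    return math.sqrt(variance)


def realistic_error(signal: float, rel_error: float = 0.035, min_error: float = 1.0) -> float:
    """Relative error with a floor; combining the two with max() is an assumption."""
    return max(rel_error * abs(signal), min_error)


err_y1 = tecan_error(600.0, "1")
err_y2 = tecan_error(600.0, "2")
print(round(err_y1), round(err_y2))  # prints: 49 10
```

At signal=600 this reproduces the quoted ~49 counts for y1 and ~10 counts for y2.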

Returns:

Dataset – Generated dataset with specified labels.
TruthParams – Ground truth parameters (K, S0, S1).

Return type:

tuple[clophfit.fitting.data_structures.Dataset, TruthParams]

Examples

Simple single-channel for unit tests:

>>> ds, truth = make_dataset(7.0, 100, 1000, error_model="simple", noise=0.02)
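For the call above, the "simple" model implies a single noise level for the whole curve. A minimal sketch of that arithmetic follows, assuming the dynamic range is |s1 - s0| and the noise is Gaussian; the helper name and the sample signal values are hypothetical and not part of clophfit.

```python
import random


def simple_sigma(s0: float, s1: float, noise: float = 0.02) -> float:
    """Constant noise SD as a fraction of the dynamic range, assumed to be |s1 - s0|."""
    return noise * abs(s1 - s0)


# Same values as the example call: s0=100, s1=1000, noise=0.02.
sigma = simple_sigma(100, 1000, noise=0.02)

# Add Gaussian noise with that constant SD to some illustrative noise-free signals.
rng = random.Random(0)  # seeded so the draw is repeatable
clean = [100.0, 550.0, 1000.0]
noisy = [y + rng.gauss(0.0, sigma) for y in clean]
```

Under these assumptions sigma is 0.02 × 900 = 18 counts at every point, regardless of the local signal.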

Randomized dual-channel matching real data distributions:

>>> ds, truth = make_dataset(randomize_signals=True, seed=42)

Randomized single-channel:

>>> ds, truth = make_dataset(randomize_signals=True, n_labels=1, seed=42)

Physics-based errors with differential noise (y2 5x more precise):

>>> ds, truth = make_dataset(
...     k=7.0,
...     s0={"1": 1000, "2": 800},
...     s1={"1": 200, "2": 300},
...     error_model="physics",
...     buffer_sd=50.0,
...     error_ratio=0.2,
... )
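The per-label errors implied by this call can be worked out by hand from err = sqrt(signal + buffer_sd^2). The sketch below uses a hypothetical helper, evaluates each label at its s0 value, and assumes the label "2" buffer SD is buffer_sd multiplied by error_ratio, as the parameter description suggests.

```python
import math


def physics_error(signal: float, buffer_sd: float) -> float:
    """Shot noise plus buffer noise: sqrt(signal + buffer_sd^2)."""
    return math.sqrt(signal + buffer_sd**2)


buffer_sd_1 = 50.0
error_ratio = 0.2
buffer_sd_2 = buffer_sd_1 * error_ratio  # assumed scaling for label "2"

err_1 = physics_error(1000.0, buffer_sd_1)  # label "1" at s0 = 1000
err_2 = physics_error(800.0, buffer_sd_2)   # label "2" at s0 = 800
ratio = err_2 / err_1
```

Under this reading the 1/5 ratio applies to the buffer term only. Once the shot-noise term is included, the total errors at these signal levels differ by roughly a factor of two rather than five.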

Calibrated Tecan noise model from real L2 MCMC posteriors:

>>> ds, truth = make_dataset(
...     randomize_signals=True,
...     error_model="tecan",
...     seed=42,
... )

Simulate low-pH drop artifact:

>>> ds, truth = make_dataset(
...     randomize_signals=True,
...     low_ph_drop=True,
...     low_ph_drop_magnitude=0.4,
... )
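The low-pH drop can be pictured as scaling the signal at the lowest pH points. The sketch below is an illustration rather than the library's code: both helper names are hypothetical, the drop is assumed to be multiplicative (signal × (1 - magnitude)), and the flat signal is made up. The pH grid follows the documented n_points behaviour (evenly spaced from 5.5 to 9.0).

```python
def evenly_spaced_ph(n_points: int, lo: float = 5.5, hi: float = 9.0) -> list[float]:
    """Evenly spaced pH values from lo to hi inclusive (requires n_points >= 2)."""
    step = (hi - lo) / (n_points - 1)
    return [lo + i * step for i in range(n_points)]


def apply_low_ph_drop(
    x: list[float], y: list[float], magnitude: float = 0.4, n_drops: int = 1
) -> list[float]:
    """Scale the n_drops lowest-pH points by (1 - magnitude); lowest pH first."""
    lowest = sorted(range(len(x)), key=lambda i: x[i])[:n_drops]
    out = list(y)
    for i in lowest:
        out[i] = y[i] * (1.0 - magnitude)
    return out


ph = evenly_spaced_ph(8)  # 5.5, 6.0, ..., 9.0
signal = [1000.0] * 8     # illustrative flat signal, not a real titration curve
dropped = apply_low_ph_drop(ph, signal, magnitude=0.4, n_drops=1)
```

With magnitude=0.4 and one drop, only the pH 5.5 point changes, falling from 1000 to about 600 counts.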

clophfit.testing.synthetic.make_simple_dataset(k, s0, s1, *, is_ph, noise=0.02, seed=None, rel_x_err=0.01)

Create a simple synthetic Dataset for unit tests.

Uses fixed x-values and a simple noise model for backward compatibility with existing tests. Does not set y_err when noise=0, so that fitters can fall back to their default weighting.

Parameters:

k (float)
s0 (dict[str, float] | float)
s1 (dict[str, float] | float)
is_ph (bool)
noise (float)
seed (int | None)
rel_x_err (float)

Return type:

tuple[clophfit.fitting.data_structures.Dataset, TruthParams]

clophfit.testing.synthetic.make_benchmark_dataset(k=7.0, *, n_labels=1, n_points=7, error_ratio=1.0, add_outlier=False, outlier_label='1', outlier_sigma=4.0, n_outliers=1, seed=None, rng=None)

Generate synthetic data for fitter benchmarking.

This is a convenience wrapper around make_dataset that uses the physics error model and defaults suited to benchmarking.

Parameters:

k (float) – True pKa value (default 7.0).
n_labels (int) – Number of labels: 1 or 2 (default 1).
n_points (int) – Number of pH points (default 7).
error_ratio (float) – Ratio of the label "2" buffer SD to the label "1" buffer SD. 1.0 = equal errors, 0.2 = y2 has 1/5 the error of y1.
add_outlier (bool) – If True, add a low-pH drop in the specified label.
outlier_label (str) – Label to add pH drop to ("1" or "2").
outlier_sigma (float) – Magnitude of the low-pH drop. The value is converted to a fraction of signal (0-1): the default 4.0 becomes 0.4 (a 40% drop).
n_outliers (int) – Number of low-pH points to corrupt (default 1). Points are selected by ascending pH (lowest first).
seed (int | None) – Random seed for reproducibility.
rng (np.random.Generator | None) – Pre-existing random generator (overrides seed if provided).

Returns:

Dataset – Generated dataset.
TruthParams – Ground truth parameters.

Return type:

tuple[clophfit.fitting.data_structures.Dataset, TruthParams]

Examples

Single label, clean:

>>> ds, truth = make_benchmark_dataset(k=7.0, n_labels=1)

Two labels with 1:5 error ratio:

>>> ds, truth = make_benchmark_dataset(k=7.0, n_labels=2, error_ratio=0.2)

Two labels with low-pH drop in noisy channel:

>>> ds, truth = make_benchmark_dataset(
...     k=7.0, n_labels=2, error_ratio=0.2, add_outlier=True, outlier_label="1"
... )