clophfit.testing.synthetic

Synthetic data generation for testing and benchmarking.

This module provides a unified API for generating synthetic pH titration datasets with characteristics matching real experimental data from Tecan plate readers.

Primary functions:

- make_dataset: Unified function for all synthetic data generation
- make_simple_dataset: Simplified interface for unit tests

Classes

TruthParams

Ground truth parameters for synthetic data.

Functions

make_dataset([k, s0, s1, is_ph, seed, rng, n_labels, ...])

Generate synthetic pH/Cl titration data with configurable complexity.

make_simple_dataset(k, s0, s1, *, is_ph[, noise, ...])

Create a simple synthetic Dataset for unit tests.

make_benchmark_dataset([k, n_labels, n_points, ...])

Generate synthetic data for fitter benchmarking.

Module Contents

class clophfit.testing.synthetic.TruthParams

Ground truth parameters for synthetic data.

clophfit.testing.synthetic.make_dataset(k=None, s0=None, s1=None, *, is_ph=True, seed=None, rng=None, n_labels=2, randomize_signals=False, error_model='realistic', noise=0.02, y_err=None, rel_error=0.035, min_error=1.0, buffer_sd=50.0, error_ratio=1.0, low_ph_drop=False, low_ph_drop_magnitude=0.4, low_ph_drop_label='y1', saturation_prob=0.0, x_error_large=0.0, x_systematic_offset=0.0, rel_x_err=0.01, n_points=None)

Generate synthetic pH/Cl titration data with configurable complexity.

This is the unified function for all synthetic data generation. It supports per-label error scaling, randomization of signal parameters based on real experimental data distributions, and realistic artifacts like low-pH drops.
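To make the roles of k, s0, and s1 concrete, the noise-free curve behind such a dataset can be sketched with a one-site binding model. This is only an illustration: it assumes a Henderson-Hasselbalch form for pH titrations and a hyperbolic form for Cl titrations, and `one_site_signal` and the pH grid below are hypothetical, not part of clophfit.

```python
def one_site_signal(x, k, s0, s1, *, is_ph=True):
    """Hypothetical helper: noise-free signal of a one-site binding model."""
    if is_ph:
        # x is pH and k is pKa: ratio of protonated (bound) to unbound species.
        ratio = 10.0 ** (k - x)
    else:
        # x is the ligand concentration and k is Kd.
        ratio = x / k
    bound = ratio / (1.0 + ratio)  # fraction in the bound state, between 0 and 1
    return s0 + (s1 - s0) * bound


# Illustrative pH grid (an assumption, not the library's L2_PH_VALUES).
ph_values = [5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0]
curve = [one_site_signal(ph, 7.0, 100.0, 1000.0) for ph in ph_values]
```

Under these assumptions the signal sits halfway between s0 and s1 when x equals k, approaches s1 in the bound limit, and approaches s0 in the unbound limit.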

Parameters:
  • k (float | None) – Equilibrium constant (pKa for pH, Kd for Cl). If None and randomize_signals is True, sampled from real data distribution.

  • s0 (dict[str, float] | float | None) – Signal at unbound state. Use dict for multiple labels: {"y1": 700, "y2": 1000}. If None and randomize_signals is True, sampled from real data distribution.

  • s1 (dict[str, float] | float | None) – Signal at bound state. Use dict for multiple labels: {"y1": 1200, "y2": 200}. If None and randomize_signals is True, sampled from real data distribution.

  • is_ph (bool) – True for pH titration, False for Cl titration.

  • seed (int | None) – Random seed for reproducibility. If None (default), the generated data differ from call to call.

  • rng (np.random.Generator | None) – Pre-existing random generator (overrides seed if provided).

  • n_labels (int) – Number of labels (1 or 2). Only used when randomize_signals=True.

  • randomize_signals (bool) – If True, randomize K, S0, S1 from real L4 data distributions when not provided. Creates y1/y2 dual-channel data with realistic signal magnitudes and ranges.

  • error_model (str) – Error model to use:

      - "simple": Constant noise as fraction of dynamic range (uses noise).
      - "uniform": Constant absolute error per label (uses y_err).
      - "realistic": Relative error with floor (uses rel_error, min_error).
      - "physics": Shot noise + buffer noise (uses buffer_sd).

  • noise (float | dict[str, float]) – For "simple" model: relative noise as fraction of dynamic range. Use dict for per-label: {"y1": 0.05, "y2": 0.02}.

  • y_err (float | dict[str, float] | None) – For "uniform" model: constant absolute error per label. Use dict for per-label: {"y1": 10.0, "y2": 3.0}.

  • rel_error (float | dict[str, float]) – For "realistic" model: relative error as fraction of signal. Use dict for per-label values: {"y1": 0.07, "y2": 0.025} gives y1 roughly 3x the relative error of y2.

  • min_error (float | dict[str, float]) – For "realistic" model: minimum error floor (instrument noise).

  • buffer_sd (float | dict[str, float]) – For "physics" model: base buffer SD where err = sqrt(signal + buffer_sd^2). For y2, scaled by error_ratio if not a dict.

  • error_ratio (float) – For two-label physics model: ratio of y2_buffer_sd to y1_buffer_sd. 1.0 = equal errors, 0.2 = y2 has 1/5 the error of y1.

  • low_ph_drop (bool) – Simulate acidic tail collapse at lowest pH (realistic artifact).

  • low_ph_drop_magnitude (float) – Fraction of signal to drop at lowest pH (0-1).

  • low_ph_drop_label (str) – Which label to apply the pH drop to ("y1" or "y2").

  • saturation_prob (float) – Probability of masking points (saturation).

  • x_error_large (float) – Additional random x-error (pH units).

  • x_systematic_offset (float) – Systematic x-offset (pH units).

  • rel_x_err (float) – Relative x-error for Cl titrations (ignored for pH).

  • n_points (int | None) – Number of pH points. If None, uses L2_PH_VALUES (7 points). If specified, generates evenly-spaced pH from 5.5 to 9.0.

Returns:

  • Dataset – Generated dataset with specified labels.

  • TruthParams – Ground truth parameters (K, S0, S1).

Return type:

tuple[clophfit.fitting.data_structures.Dataset, TruthParams]
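The four error models listed under error_model can be compared side by side with a small sketch. It follows the parameter descriptions above but is not the library's implementation: `error_sigma` is a hypothetical helper, and treating the "realistic" floor as a plain maximum is an assumption.

```python
import math


def error_sigma(signal, model, *, dynamic_range=0.0, noise=0.02, y_err=1.0,
                rel_error=0.035, min_error=1.0, buffer_sd=50.0):
    """Hypothetical helper: 1-sigma error of one point under each error model."""
    if model == "simple":
        # Constant noise as a fraction of the dynamic range |s1 - s0|.
        return noise * dynamic_range
    if model == "uniform":
        # Constant absolute error, independent of the signal.
        return y_err
    if model == "realistic":
        # Relative error with a floor (assumed here to be a plain max).
        return max(rel_error * abs(signal), min_error)
    if model == "physics":
        # Shot noise plus buffer noise: sqrt(signal + buffer_sd^2).
        return math.sqrt(max(signal, 0.0) + buffer_sd**2)
    raise ValueError(f"unknown error model: {model!r}")


# Compare the models for a signal of 1000 and a dynamic range of 900.
for name in ("simple", "uniform", "realistic", "physics"):
    print(name, round(error_sigma(1000.0, name, dynamic_range=900.0), 2))
```

In this sketch only the "realistic" and "physics" errors change with the signal level; the "simple" and "uniform" errors are the same at every point.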

Examples

Simple single-channel for unit tests:

>>> ds, truth = make_dataset(7.0, 100, 1000, error_model="simple", noise=0.02)

Randomized dual-channel matching real data distributions:

>>> ds, truth = make_dataset(randomize_signals=True, seed=42)

Randomized single-channel:

>>> ds, truth = make_dataset(randomize_signals=True, n_labels=1, seed=42)

Physics-based errors with differential noise (y2 5x more precise):

>>> ds, truth = make_dataset(
...     k=7.0,
...     s0={"y1": 1000, "y2": 800},
...     s1={"y1": 200, "y2": 300},
...     error_model="physics",
...     buffer_sd=50.0,
...     error_ratio=0.2,
... )

Simulate low-pH drop artifact:

>>> ds, truth = make_dataset(
...     randomize_signals=True,
...     low_ph_drop=True,
...     low_ph_drop_magnitude=0.4,
... )
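
The low-pH drop artifact can be pictured as a multiplicative loss of signal at the most acidic point. The sketch below assumes that only the single lowest-pH point is affected, which may differ from what make_dataset does; `apply_low_ph_drop` and the sample values are hypothetical.

```python
def apply_low_ph_drop(ph_values, signal, magnitude=0.4):
    """Hypothetical helper: scale down the signal at the lowest pH point."""
    if not 0.0 <= magnitude <= 1.0:
        raise ValueError("magnitude must lie in [0, 1]")
    # Index of the most acidic point.
    lowest = min(range(len(ph_values)), key=ph_values.__getitem__)
    dropped = list(signal)  # copy so the input is left untouched
    dropped[lowest] *= 1.0 - magnitude
    return dropped


# Illustrative values only.
ph_values = [5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0]
y1 = [972.4, 918.2, 783.8, 550.0, 316.2, 181.8, 108.9]
y1_dropped = apply_low_ph_drop(ph_values, y1, magnitude=0.4)
```

With magnitude=0.4 the lowest-pH point keeps 60% of its signal and all other points are unchanged.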

clophfit.testing.synthetic.make_simple_dataset(k, s0, s1, *, is_ph, noise=0.02, seed=None, rel_x_err=0.01)

Create a simple synthetic Dataset for unit tests.

Uses fixed x-values and a simple noise model for backward compatibility with existing tests. y_err is not set when noise=0, so that fitters fall back to their default weighting.

Parameters:
  • k (float)

  • s0 (dict[str, float] | float)

  • s1 (dict[str, float] | float)

  • is_ph (bool)

  • noise (float)

  • seed (int | None)

  • rel_x_err (float)

Return type:

tuple[clophfit.fitting.data_structures.Dataset, TruthParams]
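
The noise=0 behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `add_simple_noise` is hypothetical, the noise is taken to be Gaussian with a standard deviation equal to noise times the dynamic range, and the standard-library `random` module stands in for a NumPy generator.

```python
import random


def add_simple_noise(clean, noise=0.02, seed=None):
    """Hypothetical helper: Gaussian noise scaled by the dynamic range."""
    if noise == 0:
        # No noise requested: return the clean signal and no error estimate,
        # leaving the choice of weights to the fitter.
        return list(clean), None
    rng = random.Random(seed)
    sigma = noise * (max(clean) - min(clean))
    noisy = [value + rng.gauss(0.0, sigma) for value in clean]
    return noisy, [sigma] * len(clean)


clean = [1000.0, 550.0, 100.0]
y, y_err = add_simple_noise(clean, noise=0.0)  # y_err is None here
```

Passing the same seed twice returns identical noisy values, which is what makes such a helper usable in deterministic unit tests.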

clophfit.testing.synthetic.make_benchmark_dataset(k=7.0, *, n_labels=1, n_points=7, error_ratio=1.0, add_outlier=False, outlier_label='y1', outlier_sigma=4.0, seed=None, rng=None)

Generate synthetic data for fitter benchmarking.

This is a convenience wrapper around make_dataset that uses the physics error model and defaults suited to benchmarking.

Parameters:
  • k (float) – True pKa value (default 7.0).

  • n_labels (int) – Number of labels: 1 or 2 (default 1).

  • n_points (int) – Number of pH points (default 7).

  • error_ratio (float) – Ratio of y2_buffer_sd to y1_buffer_sd. 1.0 = equal errors, 0.2 = y2 has 1/5 the error of y1.

  • add_outlier (bool) – If True, add a low-pH drop in the specified label.

  • outlier_label (str) – Label to add pH drop to ("y1" or "y2").

  • outlier_sigma (float) – Magnitude of low-pH drop (fraction of signal, 0-1). Default 4.0 is converted to 0.4 (40% drop).

  • seed (int | None) – Random seed for reproducibility.

  • rng (np.random.Generator | None) – Pre-existing random generator (overrides seed if provided).

Returns:

  • Dataset – Generated dataset.

  • TruthParams – Ground truth parameters.

Return type:

tuple[clophfit.fitting.data_structures.Dataset, TruthParams]

Examples

Single label, clean:

>>> ds, truth = make_benchmark_dataset(k=7.0, n_labels=1)

Two labels with 1:5 error ratio:

>>> ds, truth = make_benchmark_dataset(k=7.0, n_labels=2, error_ratio=0.2)

Two labels with low-pH drop in noisy channel:

>>> ds, truth = make_benchmark_dataset(
...     k=7.0, n_labels=2, error_ratio=0.2, add_outlier=True, outlier_label="y1"
... )
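
How error_ratio interacts with the physics error formula can be sketched as below. The formula err = sqrt(signal + buffer_sd^2) and the scaling of the y2 buffer SD come from the parameter descriptions; `physics_errors` and the signal values are hypothetical and only illustrate the arithmetic.

```python
import math


def physics_errors(signals, buffer_sd=50.0, error_ratio=1.0):
    """Hypothetical helper: per-label errors sqrt(signal + buffer_sd^2)."""
    # y1 keeps the base buffer SD; y2 uses the base value scaled by error_ratio.
    label_sd = {"y1": buffer_sd, "y2": buffer_sd * error_ratio}
    return {
        label: [math.sqrt(max(s, 0.0) + label_sd[label] ** 2) for s in values]
        for label, values in signals.items()
    }


# Illustrative signals only.
signals = {"y1": [1000.0, 600.0, 200.0], "y2": [800.0, 550.0, 300.0]}
errors = physics_errors(signals, buffer_sd=50.0, error_ratio=0.2)
```

Under this formula error_ratio scales only the buffer term, so the ratio of the final y2 and y1 errors equals error_ratio only where the shot-noise term is negligible.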