clophfit.testing.synthetic
==========================

.. py:module:: clophfit.testing.synthetic

.. autoapi-nested-parse::

   Synthetic data generation for testing and benchmarking.

   This module provides a unified API for generating synthetic pH titration
   datasets with characteristics matching real experimental data from Tecan
   plate readers.

   Primary functions:

   - make_dataset: Unified function for all synthetic data generation
   - make_simple_dataset: Simplified interface for unit tests

Classes
-------

.. autoapisummary::

   clophfit.testing.synthetic.TruthParams

Functions
---------

.. autoapisummary::

   clophfit.testing.synthetic.make_dataset
   clophfit.testing.synthetic.make_simple_dataset
   clophfit.testing.synthetic.make_benchmark_dataset

Module Contents
---------------

.. py:class:: TruthParams

   Ground truth parameters for synthetic data.

.. py:function:: make_dataset(k = None, s0 = None, s1 = None, *, is_ph = True, seed = None, rng = None, n_labels = 2, randomize_signals = False, error_model = 'realistic', noise = 0.02, y_err = None, rel_error = 0.035, min_error = 1.0, buffer_sd = 50.0, error_ratio = 1.0, low_ph_drop = False, low_ph_drop_magnitude = 0.4, low_ph_drop_label = 'y1', saturation_prob = 0.0, x_error_large = 0.0, x_systematic_offset = 0.0, rel_x_err = 0.01, n_points = None)

   Generate synthetic pH/Cl titration data with configurable complexity.

   This is the unified function for all synthetic data generation. It supports
   per-label error scaling, randomization of signal parameters based on real
   experimental data distributions, and realistic artifacts such as low-pH drops.

   :param k: Equilibrium constant (pKa for pH, Kd for Cl). If None and
             randomize_signals is True, sampled from the real data distribution.
   :type k: float | None
   :param s0: Signal at the unbound state. Use a dict for multiple labels:
              {"y1": 700, "y2": 1000}. If None and randomize_signals is True,
              sampled from the real data distribution.
   :type s0: dict[str, float] | float | None
   :param s1: Signal at the bound state.
              Use a dict for multiple labels: {"y1": 1200, "y2": 200}. If None
              and randomize_signals is True, sampled from the real data
              distribution.
   :type s1: dict[str, float] | float | None
   :param is_ph: True for a pH titration, False for a Cl titration.
   :type is_ph: bool
   :param seed: Random seed for reproducibility. If None (default), generates
                random data.
   :type seed: int | None
   :param rng: Pre-existing random generator (overrides seed if provided).
   :type rng: np.random.Generator | None
   :param n_labels: Number of labels (1 or 2). Only used when
                    randomize_signals=True.
   :type n_labels: int
   :param randomize_signals: If True, randomize K, S0, and S1 from real L4 data
                             distributions when not provided. Creates y1/y2
                             dual-channel data with realistic signal magnitudes
                             and ranges.
   :type randomize_signals: bool
   :param error_model: Error model to use:

                       - "simple": Constant noise as a fraction of the dynamic
                         range (uses `noise`).
                       - "uniform": Constant absolute error per label (uses
                         `y_err`).
                       - "realistic": Relative error with a floor (uses
                         `rel_error`, `min_error`).
                       - "physics": Shot noise + buffer noise (uses `buffer_sd`).
   :type error_model: str
   :param noise: For the "simple" model: relative noise as a fraction of the
                 dynamic range. Use a dict for per-label values:
                 {"y1": 0.05, "y2": 0.02}.
   :type noise: float | dict[str, float]
   :param y_err: For the "uniform" model: constant absolute error per label.
                 Use a dict for per-label values: {"y1": 10.0, "y2": 3.0}.
   :type y_err: float | dict[str, float] | None
   :param rel_error: For the "realistic" model: relative error as a fraction of
                     the signal. Use a dict for per-label values:
                     {"y1": 0.07, "y2": 0.025} for a 3x y1/y2 ratio.
   :type rel_error: float | dict[str, float]
   :param min_error: For the "realistic" model: minimum error floor
                     (instrument noise).
   :type min_error: float | dict[str, float]
   :param buffer_sd: For the "physics" model: base buffer SD, where
                     err = sqrt(signal + buffer_sd^2). For y2, scaled by
                     error_ratio if not a dict.
   :type buffer_sd: float | dict[str, float]
   :param error_ratio: For the two-label physics model: ratio of y2_buffer_sd
                       to y1_buffer_sd. 1.0 = equal errors; 0.2 = y2 has 1/5
                       the error of y1.
   :type error_ratio: float
   :param low_ph_drop: Simulate acidic-tail collapse at the lowest pH (a
                       realistic artifact).
   :type low_ph_drop: bool
   :param low_ph_drop_magnitude: Fraction of the signal to drop at the lowest
                                 pH (0-1).
   :type low_ph_drop_magnitude: float
   :param low_ph_drop_label: Which label to apply the pH drop to ("y1" or "y2").
   :type low_ph_drop_label: str
   :param saturation_prob: Probability of masking points (saturation).
   :type saturation_prob: float
   :param x_error_large: Additional random x-error (pH units).
   :type x_error_large: float
   :param x_systematic_offset: Systematic x-offset (pH units).
   :type x_systematic_offset: float
   :param rel_x_err: Relative x-error for Cl titrations (ignored for pH).
   :type rel_x_err: float
   :param n_points: Number of pH points. If None, uses L2_PH_VALUES (7 points).
                    If specified, generates evenly spaced pH values from 5.5
                    to 9.0.
   :type n_points: int | None
   :returns: * *Dataset* -- Generated dataset with the specified labels.
             * *TruthParams* -- Ground truth parameters (K, S0, S1).

   .. rubric:: Examples

   Simple single-channel data for unit tests:

   >>> ds, truth = make_dataset(7.0, 100, 1000, error_model="simple", noise=0.02)

   Randomized dual-channel data matching real data distributions:

   >>> ds, truth = make_dataset(randomize_signals=True, seed=42)

   Randomized single-channel data:

   >>> ds, truth = make_dataset(randomize_signals=True, n_labels=1, seed=42)

   Physics-based errors with differential noise (y2 is 5x more precise):

   >>> ds, truth = make_dataset(
   ...     k=7.0,
   ...     s0={"y1": 1000, "y2": 800},
   ...     s1={"y1": 200, "y2": 300},
   ...     error_model="physics",
   ...     buffer_sd=50.0,
   ...     error_ratio=0.2,
   ... )

   Simulate the low-pH drop artifact:

   >>> ds, truth = make_dataset(
   ...     randomize_signals=True,
   ...     low_ph_drop=True,
   ...     low_ph_drop_magnitude=0.4,
   ... )
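To make the parameter roles concrete, the sketch below generates data resembling what make_dataset describes: a single-site pH curve running from s0 (unbound) to s1 (bound) with the "realistic" error model (relative error with an instrument-noise floor). This is an illustrative approximation, not clophfit's internal code; the exact curve parameterization used by the library is an assumption here.

```python
import numpy as np

def titration_curve(ph, k, s0, s1):
    """Assumed single-site binding model: signal moves from s0 (unbound) to s1 (bound)."""
    frac_bound = 1.0 / (1.0 + 10.0 ** (ph - k))  # fraction protonated at this pH
    return s0 + (s1 - s0) * frac_bound

rng = np.random.default_rng(42)
ph = np.linspace(5.5, 9.0, 7)  # mirrors the evenly spaced 7-point grid described above
y_true = titration_curve(ph, k=7.0, s0=100.0, s1=1000.0)

# "realistic" error model: relative error with a minimum floor (defaults from the signature)
rel_error, min_error = 0.035, 1.0
y_err = np.maximum(rel_error * np.abs(y_true), min_error)
y = y_true + rng.normal(0.0, y_err)
```

With s0=100 and s1=1000 the curve spans the full dynamic range across pH 5.5-9.0, which is why the error floor (min_error) only matters where the signal is small.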
.. py:function:: make_simple_dataset(k, s0, s1, *, is_ph, noise = 0.02, seed = None, rel_x_err = 0.01)

   Create a simple synthetic Dataset for unit tests.

   Uses fixed x-values and a simple noise model for backward compatibility with
   existing tests. Does NOT set y_err when noise=0, so that fitters fall back
   to their default weighting.

.. py:function:: make_benchmark_dataset(k = 7.0, *, n_labels = 1, n_points = 7, error_ratio = 1.0, add_outlier = False, outlier_label = 'y1', outlier_sigma = 4.0, seed = None, rng = None)

   Generate synthetic data for fitter benchmarking.

   This is an alias for make_dataset with the physics error model and
   convenient defaults for benchmarking.

   :param k: True pKa value (default 7.0).
   :type k: float
   :param n_labels: Number of labels: 1 or 2 (default 1).
   :type n_labels: int
   :param n_points: Number of pH points (default 7).
   :type n_points: int
   :param error_ratio: Ratio of y2_buffer_sd to y1_buffer_sd. 1.0 = equal
                       errors; 0.2 = y2 has 1/5 the error of y1.
   :type error_ratio: float
   :param add_outlier: If True, add a low-pH drop in the specified label.
   :type add_outlier: bool
   :param outlier_label: Label to add the pH drop to ("y1" or "y2").
   :type outlier_label: str
   :param outlier_sigma: Magnitude of the low-pH drop (fraction of signal,
                         0-1). The default of 4.0 is converted to 0.4 (a 40%
                         drop).
   :type outlier_sigma: float
   :param seed: Random seed for reproducibility.
   :type seed: int | None
   :param rng: Pre-existing random generator (overrides seed if provided).
   :type rng: np.random.Generator | None
   :returns: * *Dataset* -- Generated dataset.
             * *TruthParams* -- Ground truth parameters.

   .. rubric:: Examples

   Single label, clean:

   >>> ds, truth = make_benchmark_dataset(k=7.0, n_labels=1)

   Two labels with a 1:5 error ratio:

   >>> ds, truth = make_benchmark_dataset(k=7.0, n_labels=2, error_ratio=0.2)

   Two labels with a low-pH drop in the noisy channel:

   >>> ds, truth = make_benchmark_dataset(
   ...     k=7.0, n_labels=2, error_ratio=0.2, add_outlier=True, outlier_label="y1"
   ... )
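The "physics" error model that make_benchmark_dataset relies on is specified in make_dataset as err = sqrt(signal + buffer_sd^2), with y2's buffer_sd scaled by error_ratio. A minimal numpy sketch of that formula (the helper name is illustrative, not part of the clophfit API):

```python
import numpy as np

def physics_error(signal, buffer_sd=50.0):
    """Shot noise (variance ~ signal) plus constant buffer noise, per the formula above."""
    return np.sqrt(np.asarray(signal, dtype=float) + buffer_sd**2)

signal = np.array([200.0, 600.0, 1000.0])
err_y1 = physics_error(signal, buffer_sd=50.0)        # baseline channel
err_y2 = physics_error(signal, buffer_sd=50.0 * 0.2)  # error_ratio=0.2: quieter y2 channel
```

Note that error_ratio scales only the buffer term, so at high signal the shot-noise term dominates and the effective y2/y1 error ratio approaches 1 rather than 0.2.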