Sample-Specific Variance in Differential Analysis¶

This tutorial demonstrates how to account for biological variability across samples in differential analysis. This is crucial when analyzing data from multiple biological replicates.

Why Sample Variance Matters¶

When comparing conditions (e.g., treated vs. control), we typically have multiple biological replicates per condition. Biological variation between samples within the same condition is often larger than technical variation, and ignoring it can lead to:

Inflated statistical significance
High false positive rates
Poor reproducibility across studies

Kompot’s sample variance estimation:

✓ Builds separate estimators for each sample
✓ Computes sample-to-sample covariance
✓ Adjusts significance based on inter-sample variability
✓ Produces more conservative and reproducible results

Dataset¶

We’ll use bone marrow data from young and old mice, with 3 biological replicates per age group.

[2]:

import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import palantir
import pandas as pd
import scanpy as sc

import kompot

plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False
plt.rcParams["image.cmap"] = "Spectral_r"

[3]:

DATA_PATH = "../data/murine_bone_marrow_aging.h5ad"
SAMPLE_COLUMN = "Replicate"        # Sample/replicate identifiers
GROUPING_COLUMN = "Age"
CONDITIONS = ["Young", "Old"]
CELL_TYPE_COLUMN = "highres_celltype"
DIMENSIONALITY_REDUCTION = "DM_EigenVectors"
LAYER_FOR_EXPRESSION = "logged_counts"

Load Data¶

[4]:

adata = ad.read_h5ad(DATA_PATH)
palantir.utils.run_diffusion_maps(adata, pca_key="X_pca_harmony", n_components=40)
adata

[4]:

AnnData object with n_obs × n_vars = 8090 × 16285
    obs: 'Compartment', 'Replicate', 'Age', 'Sample', 'Info', 'batch', 'doublet_score', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'S_score', 'G2M_score', 'phase', 'leiden', 'phenograph', 'highres_celltype', 'midres_celltype'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'hb', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'Age_colors', 'Compartment_colors', 'DMEigenValues', 'Info_colors', 'README', 'Replicate_colors', 'Sample_colors', 'batch_colors', 'draw_graph', 'highres_celltype_colors', 'hvg', 'leiden', 'leiden_colors', 'midres_celltype_colors', 'neighbors', 'pca', 'phase_colors', 'umap', 'DM_EigenValues'
    obsm: 'AbCapture', 'DM_EigenVectors', 'HTO', 'X_draw_graph_fa', 'X_pca', 'X_pca_harmony', 'X_pca_noregression', 'X_umap'
    varm: 'PCs'
    layers: 'MAGIC_imputed_data', 'logged_counts', 'normalized_counts', 'raw_counts'
    obsp: 'DM_Kernel', 'connectivities', 'distances', 'DM_Similarity'

Examine Sample Structure¶

[5]:

# Check sample distribution
sample_counts = adata.obs.groupby([GROUPING_COLUMN, SAMPLE_COLUMN]).size().unstack(fill_value=0)
print("Cells per sample:\n", sample_counts)

Cells per sample:
 Replicate     1     2     3
Age
Mid        1035  1022     0
Old        1057  1019  1040
Young       944  1007   966

/loc/scratch/35159533/ipykernel_27996/1340399861.py:2: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  sample_counts = adata.obs.groupby([GROUPING_COLUMN, SAMPLE_COLUMN]).size().unstack(fill_value=0)

[6]:

sc.pl.umap(adata, color=[SAMPLE_COLUMN, GROUPING_COLUMN, CELL_TYPE_COLUMN], wspace=.2, frameon=False)

../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_7_0.png

Standard Analysis (Without Sample Variance)¶

First, we’ll run analyses without accounting for sample variance. This serves as a baseline for comparison.

Differential Abundance¶

[7]:

da_standard = kompot.compute_differential_abundance(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    obsm_key=DIMENSIONALITY_REDUCTION,
)

[2025-10-03 13:22:53,906] [INFO    ] Condition 1 (Young): 2,917 cells
[2025-10-03 13:22:53,907] [INFO    ] Condition 2 (Old): 3,116 cells
[2025-10-03 13:22:53,910] [INFO    ] Fitting density estimator for condition 1...

WARNING:2025-10-03 13:22:54,045:jax._src.xla_bridge:966: An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.

[2025-10-03 13:24:01,899] [INFO    ] Fitting density estimator for condition 2...
[2025-10-03 13:24:36,888] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:24:38,245] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:24:48,611] [INFO    ] This run will have `run_id=0`.

Differential Expression¶

[8]:

de_standard = kompot.compute_differential_expression(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    layer=LAYER_FOR_EXPRESSION,
    obsm_key=DIMENSIONALITY_REDUCTION,
    batch_size=0, # No batching for this small dataset
)

[2025-10-03 13:24:51,212] [INFO    ] Condition 1 (Young): 2,917 cells
[2025-10-03 13:24:51,213] [INFO    ] Condition 2 (Old): 3,116 cells
[2025-10-03 13:24:51,214] [INFO    ] Using 8090 of 8090 cells (100.0%)
[2025-10-03 13:24:51,757] [INFO    ] Preparing null distribution with null_genes=2000, null_seed=42
[2025-10-03 13:24:53,187] [INFO    ] Generated shuffled expression for 2000 null genes
[2025-10-03 13:25:40,733] [INFO    ] Fitting expression estimator for condition 1...
[2025-10-03 13:25:59,745] [INFO    ] Fitting expression estimator for condition 2...
[2025-10-03 13:26:41,529] [INFO    ] Landmark storage skipped (store_landmarks=False). Compute with store_landmarks=True to enable landmark reuse.
[2025-10-03 13:27:29,723] [INFO    ] Using 5,000 landmarks for Mahalanobis computation
[2025-10-03 13:27:56,773] [INFO    ] Computing FDR statistics from null distribution
[2025-10-03 13:27:59,165] [INFO    ] FDR analysis complete: 393/16285 genes significantly DE at FDR < 0.05
[2025-10-03 13:27:59,167] [INFO    ] Mahalanobis distance threshold for FDR < 0.05: 23.1080
[2025-10-03 13:28:05,630] [INFO    ] This run will have `run_id=0`.

Analysis With Sample Variance¶

Now we’ll repeat the analyses while accounting for sample variance by specifying the sample_col parameter.

Differential Abundance¶

When accounting for sample variance, you may want to relax significance thresholds slightly, as the test becomes more conservative:

[9]:

da_sample_var = kompot.compute_differential_abundance(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    obsm_key=DIMENSIONALITY_REDUCTION,
    sample_col=SAMPLE_COLUMN,  # Enable sample variance
    ptp_threshold=0.1,         # Relaxed from default 0.05
)

[2025-10-03 13:28:06,829] [INFO    ] Results with result_key='kompot_da' already exist in the dataset. Previous run was at 2025-10-03T13:24:48.611524 comparing Young to Old. Fields that will be overwritten: obs.kompot_da_Young_to_Old_lfc, obs.kompot_da_Young_log_density, obs.kompot_da_Old_log_density. Note: Only fields NOT affected by sample variance (like lfc, log_density) will be overwritten since they don't use the sample variance suffix. These results will likely be identical if other parameters haven't changed. This is a partial rerun with sample variance added to a previous analysis with matching parameters. Set overwrite=False to prevent overwriting or overwrite=True to silence this message.
[2025-10-03 13:28:06,833] [INFO    ] Condition 1 (Young): 2,917 cells
[2025-10-03 13:28:06,834] [INFO    ] Condition 2 (Old): 3,116 cells
[2025-10-03 13:28:06,839] [INFO    ] Using sample column 'Replicate' for sample variance estimation
[2025-10-03 13:28:06,841] [INFO    ] Found 3 unique sample(s) in condition 1
[2025-10-03 13:28:06,843] [INFO    ] Found 3 unique sample(s) in condition 2
[2025-10-03 13:28:06,843] [INFO    ] Fitting density estimator for condition 1...
[2025-10-03 13:28:27,958] [INFO    ] Fitting density estimator for condition 2...
[2025-10-03 13:28:39,128] [INFO    ] Sample variance estimation automatically enabled due to provided sample indices
[2025-10-03 13:28:39,130] [INFO    ] Setting up sample variance estimation...
[2025-10-03 13:28:39,132] [INFO    ] Found 3 unique groups for variance estimation
[2025-10-03 13:28:39,133] [INFO    ] 3 groups have sufficient cells (>= 2) for variance estimation
[2025-10-03 13:28:39,133] [INFO    ] Training estimator for group 1 with 944 cells
[2025-10-03 13:28:47,487] [INFO    ] Training estimator for group 2 with 1,007 cells
[2025-10-03 13:28:55,294] [INFO    ] Training estimator for group 3 with 966 cells
[2025-10-03 13:29:03,927] [INFO    ] Found 3 unique groups for variance estimation
[2025-10-03 13:29:03,934] [INFO    ] 3 groups have sufficient cells (>= 2) for variance estimation
[2025-10-03 13:29:03,937] [INFO    ] Training estimator for group 1 with 1,057 cells
[2025-10-03 13:29:11,520] [INFO    ] Training estimator for group 2 with 1,019 cells
[2025-10-03 13:29:19,274] [INFO    ] Training estimator for group 3 with 1,040 cells
[2025-10-03 13:29:26,788] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:26,795] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:31,912] [INFO    ] Computing sample-specific variance for condition 1...
[2025-10-03 13:29:33,082] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:33,825] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:34,605] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:35,334] [INFO    ] Computing sample-specific variance for condition 2...
[2025-10-03 13:29:35,814] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:36,484] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:37,193] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:37,883] [INFO    ] This run will have `run_id=1`.

Differential Expression¶

For DE with sample variance, memory requirements increase because Kompot computes gene-specific covariance matrices. We’ll use several strategies to manage this:

Limit to top genes: Analyze only the most significant genes from the standard analysis
Use disk storage: Store large covariance matrices on disk instead of in memory
Disable null genes: Skip FDR computation (set null_genes=0) to avoid computing sample covariances for null genes

Resource Planning¶

Before running, estimate resource requirements:

[10]:

# Select top 200 genes
topn = 200
top_genes = adata.var.sort_values(
    "kompot_de_Young_to_Old_mahalanobis", ascending=False
).head(topn).index

# Check resource requirements
plan = kompot.dry_run_differential_expression(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    layer=LAYER_FOR_EXPRESSION,
    obsm_key=DIMENSIONALITY_REDUCTION,
    sample_col=SAMPLE_COLUMN,
    genes=top_genes,
    store_arrays_on_disk=True,
    null_genes=0,  # Skip FDR to save memory
    batch_size=0, # batching does not effect the most costly covariance tensor
)

================================================================================
RESOURCE USAGE PLAN
================================================================================

System Resources:
  Memory: 359.31 GB available (of 754.59 GB total)
  Disk:   6.04 TB available at /loc/scratch/35159533

Total Requirements:
  Memory: 1.30 GB (0% of available)
  Disk:   74.51 GB (1% of available)

Memory Allocations:
  • Mellon precision matrix L (condition 1, 2,917/3,116 cells) (2917, 2917): 64.92 MB
  • Mellon precision matrix L (condition 2, 2,917/3,116 cells) (3116, 3116): 74.08 MB
  • Imputed expression (condition 1) (8090, 200): 12.34 MB → adata.layers['kompot_de_Young_imputed']
  • Imputed expression (condition 2) (8090, 200): 12.34 MB → adata.layers['kompot_de_Old_imputed']
  • Fold change (8090, 200): 12.34 MB → adata.layers['kompot_de_Young_to_Old_fold_change']
  • Standard deviation (condition 1) (8090, 200): 12.34 MB → adata.layers['kompot_de_Young_std']
  • Standard deviation (condition 2) (8090, 200): 12.34 MB → adata.layers['kompot_de_Old_std']
  • Temporary matrices during predictions (batch_size=0) (8090, 5000) + (8090, 200): 320.95 MB
  • Function predictor covariances (per condition) (5000, 5000): 381.47 MB
  • Combined covariance matrix (5000, 5000): 190.73 MB
  • Cholesky decomposition (for Mahalanobis) (5000, 5000): 190.73 MB
  • Per-sample imputations (3 samples) (3, 5000, 200): 45.78 MB

Disk Storage:
  • Sample covariances (per condition, 3 samples) (5000, 5000, 200): 74.51 GB

Output Fields:
  adata.layers:
    - kompot_de_Young_imputed [OVERWRITES run_id=0]
    - kompot_de_Old_imputed [OVERWRITES run_id=0]
    - kompot_de_Young_to_Old_fold_change [OVERWRITES run_id=0]
    - kompot_de_Young_std
    - kompot_de_Old_std
  adata.var:
    - kompot_de_Young_to_Old_mahalanobis_sample_var
    - kompot_de_Young_to_Old_mean_lfc [OVERWRITES run_id=0]

Info:
  ℹ Cell batch_size (0) processes all 8090 cells at once. Consider reducing batch_size (e.g., batch_size=500) to lower peak memory by 301.12 MB.
  ℹ Disk arrays will be stored at system temp: /loc/scratch/35159533. To use a different location, set disk_storage_dir='/path/to/disk' or export TMPDIR=/path/to/disk

Warnings:
  ⚠ Results with result_key='kompot_de' already exist (run_id=0). Previous run: 2025-10-03T16:44:09.207147 comparing Young to Old (null_genes=2000). Fields that will be overwritten: var.kompot_de_Young_to_Old_mean_lfc, layers.kompot_de_Young_imputed, layers.kompot_de_Old_imputed, layers.kompot_de_Young_to_Old_fold_change

================================================================================
STATUS: ⚠ FEASIBLE WITH WARNINGS - Proceed with caution
================================================================================

The dry run shows:

Memory requirements for sample covariance computation
Disk space needed for storage
Which fields will be overwritten
Previous run details

If resources are sufficient, proceed with the analysis:

[17]:

de_sample_var = kompot.compute_differential_expression(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    layer=LAYER_FOR_EXPRESSION,
    obsm_key=DIMENSIONALITY_REDUCTION,
    sample_col=SAMPLE_COLUMN,  # Enable sample variance
    genes=top_genes,
    store_arrays_on_disk=True,
    null_genes=0,
    batch_size=0, # batching does not effect the most costly covariance tensor
)

[2025-10-03 17:00:33,268] [INFO    ] Differential expression results with result_key='kompot_de' already exist in the dataset. Previous run was at 2025-10-03T16:44:09.207147 comparing Young to Old. Fields that will be overwritten: var.kompot_de_Young_to_Old_mean_lfc, layers.kompot_de_Young_imputed, layers.kompot_de_Old_imputed, layers.kompot_de_Young_to_Old_fold_change. Note: Only fields NOT affected by sample variance (like mean_log_fold_change, imputed data, fold_change) will be overwritten since they don't use the sample variance suffix. These results will likely be identical if other parameters haven't changed. This is a partial rerun with sample variance added to a previous analysis with matching parameters. Set overwrite=False to prevent overwriting or overwrite=True to silence this message.
[2025-10-03 17:00:33,270] [INFO    ] Condition 1 (Young): 2,917 cells
[2025-10-03 17:00:33,271] [INFO    ] Condition 2 (Old): 3,116 cells
[2025-10-03 17:00:33,271] [INFO    ] Using 8090 of 8090 cells (100.0%)
[2025-10-03 17:00:33,815] [INFO    ] Using sample column 'Replicate' for sample variance estimation
[2025-10-03 17:00:33,817] [INFO    ] Found 3 unique sample(s) in condition 1
[2025-10-03 17:00:33,818] [INFO    ] Found 3 unique sample(s) in condition 2
[2025-10-03 17:00:39,966] [INFO    ] Fitting expression estimator for condition 1...
[2025-10-03 17:00:47,408] [INFO    ] Fitting expression estimator for condition 2...
[2025-10-03 17:01:01,966] [INFO    ] Found 3 unique groups for variance estimation
[2025-10-03 17:01:01,967] [INFO    ] 3 groups have sufficient cells (>= 2) for variance estimation
[2025-10-03 17:01:01,968] [INFO    ] Training estimator for group 1 with 944 cells
[2025-10-03 17:01:13,175] [INFO    ] Training estimator for group 2 with 1,007 cells
[2025-10-03 17:01:26,939] [INFO    ] Training estimator for group 3 with 966 cells
[2025-10-03 17:01:39,744] [INFO    ] Found 3 unique groups for variance estimation
[2025-10-03 17:01:39,745] [INFO    ] 3 groups have sufficient cells (>= 2) for variance estimation
[2025-10-03 17:01:39,746] [INFO    ] Training estimator for group 1 with 1,057 cells
[2025-10-03 17:02:00,552] [INFO    ] Training estimator for group 2 with 1,019 cells
[2025-10-03 17:02:36,124] [INFO    ] Training estimator for group 3 with 1,040 cells
[2025-10-03 17:02:59,580] [INFO    ] Landmark storage skipped (store_landmarks=False). Compute with store_landmarks=True to enable landmark reuse.
[2025-10-03 17:03:21,074] [INFO    ] Using 5,000 landmarks for Mahalanobis computation
[2025-10-03 17:03:31,704] [INFO    ] Disk space at /loc/scratch/35159533/kompot_arrays_15545be0_wy9kdwrv:
[2025-10-03 17:03:31,705] [INFO    ]   Total: 6.67 TB | Used: 297.64 GB | Free: 6.04 TB
[2025-10-03 17:03:31,705] [INFO    ]   Expected storage need: 44.70 GB
[2025-10-03 17:03:31,706] [INFO    ] Initialized disk storage at /loc/scratch/35159533/kompot_arrays_15545be0_wy9kdwrv with dask support (namespace: ns_15545be0)
[2025-10-03 17:03:31,706] [INFO    ] Using gene-by-gene disk storage for covariance matrix (shape=(5000, 5000, 200))
[2025-10-03 17:03:33,600] [INFO    ] Using Dask for parallel disk-backed covariance computation (all cores) (shape=(5000, 5000, 200))
[2025-10-03 17:03:34,700] [INFO    ] Removed temporary storage directory /tmp/kompot_arrays_bce0bf5b_egezr1d3
[2025-10-03 17:03:35,380] [INFO    ] Created Dask array for covariance with shape (5000, 5000, 200)
[2025-10-03 17:03:35,385] [INFO    ] Disk space at /loc/scratch/35159533/kompot_arrays_8f0ca8f5_jybq2uhb:
[2025-10-03 17:03:35,386] [INFO    ]   Total: 6.67 TB | Used: 297.64 GB | Free: 6.04 TB
[2025-10-03 17:03:35,386] [INFO    ]   Expected storage need: 44.70 GB
[2025-10-03 17:03:35,387] [INFO    ] Initialized disk storage at /loc/scratch/35159533/kompot_arrays_8f0ca8f5_jybq2uhb with dask support (namespace: ns_8f0ca8f5)
[2025-10-03 17:03:35,387] [INFO    ] Using gene-by-gene disk storage for covariance matrix (shape=(5000, 5000, 200))
[2025-10-03 17:03:37,034] [INFO    ] Using Dask for parallel disk-backed covariance computation (all cores) (shape=(5000, 5000, 200))
[2025-10-03 17:03:38,714] [INFO    ] Created Dask array for covariance with shape (5000, 5000, 200)

[2025-10-03 17:06:23,107] [INFO    ] Computing Mahalanobis distances using gene-specific covariance matrices
[2025-10-03 17:06:23,109] [INFO    ] Computing Mahalanobis distances for 200 genes using dask

[2025-10-03 17:11:27,690] [INFO    ] Successfully computed Mahalanobis distances for 200 genes using gene-specific covariance
[2025-10-03 17:11:59,242] [INFO    ] This run will have `run_id=1`.
[2025-10-03 17:11:59,304] [INFO    ] Removed temporary storage directory /loc/scratch/35159533/kompot_arrays_15545be0_wy9kdwrv
[2025-10-03 17:11:59,414] [INFO    ] Removed temporary storage directory /loc/scratch/35159533/kompot_arrays_8f0ca8f5_jybq2uhb

Comparing Results¶

Differential Abundance¶

Results with sample variance include a _sample_var suffix to distinguish them from standard results.

Compare z-scores with and without sample variance:

[12]:

sc.pl.embedding(
    adata, "umap",
    color=[
        "kompot_da_Young_to_Old_lfc",
        "kompot_da_Young_to_Old_lfc_zscore",
        "kompot_da_Young_to_Old_lfc_zscore_sample_var",
    ],
    color_map="RdBu_r",
    vcenter=0,
    ncols=3,
)

../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_19_0.png

Notice:

Log fold changes (left) are identical - sample variance only affects significance
Z-scores (middle and right) differ - sample variance produces more conservative estimates

Volcano Plots¶

Compare volcano plots from both analyses:

[13]:

# Standard analysis (run_id=0)
kompot.plot.volcano_da(
    adata,
    color=CELL_TYPE_COLUMN,
    run_id=0,
    title="Standard Analysis (Pooled Samples)",
)

[2025-10-03 13:42:32,069] [INFO    ] Found DA run info for run_id=0
[2025-10-03 13:42:32,070] [INFO    ] Found lfc_key='kompot_da_Young_to_Old_lfc' from run info
[2025-10-03 13:42:32,071] [INFO    ] Found ptp_key='kompot_da_Young_to_Old_neg_log10_lfc_ptp' from run info
[2025-10-03 13:42:32,072] [WARNING ] Field 'kompot_da_Young_to_Old_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:32,072] [WARNING ] Field 'kompot_da_Young_to_Old_neg_log10_lfc_ptp' was last written by run 0, but current context expects run 1. The field may have been overwritten.
[2025-10-03 13:42:32,073] [INFO    ] Successfully inferred fields: {'lfc_key': 'kompot_da_Young_to_Old_lfc', 'ptp_key': 'kompot_da_Young_to_Old_neg_log10_lfc_ptp'}
[2025-10-03 13:42:32,074] [WARNING ] Field inference completed with 2 warnings
[2025-10-03 13:42:32,074] [INFO    ] Using inferred thresholds - lfc_threshold: 1.0, ptp_threshold: 0.05

../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_22_1.png

[14]:

# With sample variance (run_id=1)
kompot.plot.volcano_da(
    adata,
    color=CELL_TYPE_COLUMN,
    run_id=1,
    title="With Sample Variance",
)

[2025-10-03 13:42:32,601] [INFO    ] Found DA run info for run_id=1
[2025-10-03 13:42:32,602] [INFO    ] Found lfc_key='kompot_da_Young_to_Old_lfc' from run info
[2025-10-03 13:42:32,602] [INFO    ] Found ptp_key='kompot_da_Young_to_Old_neg_log10_lfc_ptp_sample_var' from run info
[2025-10-03 13:42:32,602] [WARNING ] Field 'kompot_da_Young_to_Old_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:32,603] [INFO    ] Successfully inferred fields: {'lfc_key': 'kompot_da_Young_to_Old_lfc', 'ptp_key': 'kompot_da_Young_to_Old_neg_log10_lfc_ptp_sample_var'}
[2025-10-03 13:42:32,603] [WARNING ] Field inference completed with 1 warnings
[2025-10-03 13:42:32,603] [INFO    ] Using inferred thresholds - lfc_threshold: 1.0, ptp_threshold: 0.1

../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_23_1.png

Note that thresholds such as ptp_threshold and lfc_threshold can be adjusted, and if update_direction=True is passed, then the cell-wise classification in adata.obs["kompot_da_Young_to_Old_lfc_direction_sample_var"] is updated accordingly.

[15]:

kompot.plot.volcano_da(adata, color=CELL_TYPE_COLUMN, run_id=1, ptp_threshold=0.2, update_direction=True)

[2025-10-03 13:42:33,119] [INFO    ] Found DA run info for run_id=1
[2025-10-03 13:42:33,120] [INFO    ] Found lfc_key='kompot_da_Young_to_Old_lfc' from run info
[2025-10-03 13:42:33,121] [INFO    ] Found ptp_key='kompot_da_Young_to_Old_neg_log10_lfc_ptp_sample_var' from run info
[2025-10-03 13:42:33,122] [WARNING ] Field 'kompot_da_Young_to_Old_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:33,123] [INFO    ] Successfully inferred fields: {'lfc_key': 'kompot_da_Young_to_Old_lfc', 'ptp_key': 'kompot_da_Young_to_Old_neg_log10_lfc_ptp_sample_var'}
[2025-10-03 13:42:33,124] [WARNING ] Field inference completed with 1 warnings
[2025-10-03 13:42:33,125] [INFO    ] Updating direction column with new thresholds before plotting
[2025-10-03 13:42:33,126] [INFO    ] Found DA run info for run_id=1
[2025-10-03 13:42:33,126] [INFO    ] Found direction_key='kompot_da_Young_to_Old_lfc_direction_sample_var' from run info
[2025-10-03 13:42:33,127] [INFO    ] Successfully inferred fields: {'direction_key': 'kompot_da_Young_to_Old_lfc_direction_sample_var'}
[2025-10-03 13:42:33,128] [INFO    ] Updating direction column 'kompot_da_Young_to_Old_lfc_direction_sample_var' with thresholds: lfc_threshold=1.0, ptp_threshold=0.2
[2025-10-03 13:42:33,132] [INFO    ] Updated direction counts: up=595, down=495, neutral=7000
[2025-10-03 13:42:33,133] [INFO    ] Using inferred thresholds - lfc_threshold: 1.0, ptp_threshold: 0.2

../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_25_1.png

Key Observations:

Posterior tail probabilities (y-axis) are generally lower with sample variance
Fewer cells pass significance thresholds
Effect sizes (x-axis) remain unchanged
Cell type patterns are preserved but with adjusted confidence

Differential Expression¶

Compare Mahalanobis distances with and without sample variance:

[16]:

# Top genes by standard analysis
comparison_df = adata.var.loc[top_genes, :][
    [
        "kompot_de_Young_to_Old_mean_lfc",
        "kompot_de_Young_to_Old_mahalanobis",
        "kompot_de_Young_to_Old_mahalanobis_sample_var",
    ]
].sort_values("kompot_de_Young_to_Old_mahalanobis", ascending=False)

comparison_df.head(20)

[16]:

	kompot_de_Young_to_Old_mean_lfc	kompot_de_Young_to_Old_mahalanobis	kompot_de_Young_to_Old_mahalanobis_sample_var
H2-Q7	1.235066	71.517323	45.085446
Cd74	0.352121	60.021452	49.584290
H2-Aa	0.473756	59.371135	38.489260
H2-Ab1	0.511123	57.808982	33.768536
Igkc	0.054009	53.604112	38.973926
H2-Eb1	0.411752	53.403829	31.734637
AW112010	0.791367	53.386235	28.091721
S100a9	-0.170317	48.765184	46.502283
Ifitm3	0.367545	47.585813	40.375510
S100a8	-0.165121	47.372815	45.495007
H2-Q6	0.782279	45.991624	29.823029
Cd52	-0.127807	45.309906	42.044778
Aldh1a1	0.490811	44.596577	28.850373
Ifitm1	0.427663	44.320577	38.114722
Ifitm2	0.285836	43.940652	38.424422
Ighm	-0.173726	43.216410	35.535580
Cd79a	-0.004329	43.107792	33.076867
Apoe	-0.131211	42.935707	39.211520
Fos	0.531571	42.282847	37.407770
Gm47283	0.627173	41.956500	23.844169

Notice that some genes show substantially lower significance when accounting for sample variance, indicating high variability across replicates.

[17]:

# Volcano plot: Standard analysis
kompot.plot.volcano_de(
    adata,
    run_id=0,
    title="Standard Analysis (Pooled Samples)",
)

[2025-10-03 13:42:33,702] [INFO    ] Found DE run info for run_id=0
[2025-10-03 13:42:33,703] [INFO    ] Found mean_lfc_key='kompot_de_Young_to_Old_mean_lfc' from run info
[2025-10-03 13:42:33,704] [INFO    ] Found mahalanobis_key='kompot_de_Young_to_Old_mahalanobis' from run info
[2025-10-03 13:42:33,705] [WARNING ] Field 'kompot_de_Young_to_Old_mean_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:33,705] [WARNING ] Field 'kompot_de_Young_to_Old_mahalanobis' was last written by run 0, but current context expects run 1. The field may have been overwritten.
[2025-10-03 13:42:33,706] [INFO    ] Successfully inferred fields: {'mean_lfc_key': 'kompot_de_Young_to_Old_mean_lfc', 'mahalanobis_key': 'kompot_de_Young_to_Old_mahalanobis'}
[2025-10-03 13:42:33,707] [WARNING ] Field inference completed with 2 warnings
[2025-10-03 13:42:33,708] [INFO    ] Using DE run 0: comparing Young to Old
[2025-10-03 13:42:33,719] [INFO    ] Using data columns from var - lfc: 'kompot_de_Young_to_Old_mean_lfc', score: 'kompot_de_Young_to_Old_mahalanobis'
[2025-10-03 13:42:33,725] [INFO    ] Highlighting 393 genes marked as DE (232 up, 161 down)
[2025-10-03 13:42:33,737] [INFO    ] Labeling top 10 genes by score

../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_30_1.png

[18]:

# Volcano plot: With sample variance
kompot.plot.volcano_de(
    adata,
    run_id=1,
    n_top_genes=50,  # No FDR threshold available (null_genes=0)
    gene_labels=20,
    title="With Sample Variance (Top 200 Genes)",
)

[2025-10-03 13:42:33,990] [INFO    ] Found DE run info for run_id=1
[2025-10-03 13:42:33,992] [INFO    ] Found mean_lfc_key='kompot_de_Young_to_Old_mean_lfc' from run info
[2025-10-03 13:42:33,992] [INFO    ] Found mahalanobis_key='kompot_de_Young_to_Old_mahalanobis_sample_var' from run info
[2025-10-03 13:42:33,993] [WARNING ] Field 'kompot_de_Young_to_Old_mean_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:33,994] [INFO    ] Successfully inferred fields: {'mean_lfc_key': 'kompot_de_Young_to_Old_mean_lfc', 'mahalanobis_key': 'kompot_de_Young_to_Old_mahalanobis_sample_var'}
[2025-10-03 13:42:33,995] [WARNING ] Field inference completed with 1 warnings
[2025-10-03 13:42:33,995] [INFO    ] Using DE run 1: comparing Young to Old
[2025-10-03 13:42:34,004] [INFO    ] Using data columns from var - lfc: 'kompot_de_Young_to_Old_mean_lfc', score: 'kompot_de_Young_to_Old_mahalanobis_sample_var'
[2025-10-03 13:42:34,007] [INFO    ] Highlighting top 50 genes by kompot_de_Young_to_Old_mahalanobis_sample_var
[2025-10-03 13:42:34,017] [INFO    ] Labeling top 20 genes by score

../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_31_1.png

Heatmaps¶

Visualize how ranking changes with sample variance:

[19]:

kompot.plot.heatmap(
    adata,
    n_top_genes=20,
    groupby=CELL_TYPE_COLUMN,
    exclude_groups="Plasma cell",
    run_id=1,  # Use sample variance ranking
    vmin="p1",
    vmax="p99",
)

[2025-10-03 13:42:34,288] [INFO    ] Found DE run info for run_id=1
[2025-10-03 13:42:34,289] [INFO    ] Found mahalanobis_key='kompot_de_Young_to_Old_mahalanobis_sample_var' from run info
[2025-10-03 13:42:34,290] [INFO    ] Successfully inferred fields: {'mahalanobis_key': 'kompot_de_Young_to_Old_mahalanobis_sample_var'}
[2025-10-03 13:42:34,291] [INFO    ] Using DE run 1 for heatmap.
[2025-10-03 13:42:34,291] [INFO    ] Inferred score_key='kompot_de_Young_to_Old_mahalanobis_sample_var' from run information
[2025-10-03 13:42:34,295] [INFO    ] Inferred condition_column='Age' from run information
[2025-10-03 13:42:34,295] [INFO    ] Inferred condition1='Young' from run information
[2025-10-03 13:42:34,296] [INFO    ] Inferred condition2='Old' from run information
[2025-10-03 13:42:34,297] [INFO    ] Inferred layer='logged_counts' from run information
[2025-10-03 13:42:34,297] [INFO    ] Creating split heatmap with 20 genes/features
[2025-10-03 13:42:34,299] [INFO    ] Using expression data from layer: 'logged_counts'
[2025-10-03 13:42:34,419] [INFO    ] Excluded 7 cells from groups: Plasma cell
[2025-10-03 13:42:34,426] [INFO    ] Applying gene-wise z-scoring (standard_scale='var')

../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_33_1.png

Understanding Sample-to-Sample Variability¶

To illustrate why sample variance matters, let’s examine genes with high inter-sample variability:

[20]:

# Genes with largest decrease in significance
comparison_df["decrease"] = (
    comparison_df["kompot_de_Young_to_Old_mahalanobis"] -
    comparison_df["kompot_de_Young_to_Old_mahalanobis_sample_var"]
)

variable_genes = comparison_df.nlargest(20, "decrease").index
print("Genes with highest sample-to-sample variability:")
print(variable_genes.tolist())

Genes with highest sample-to-sample variability:
['H2-Q7', 'AW112010', 'H2-Ab1', 'H2-Eb1', 'H2-Aa', 'Gm47283', 'Ifi27', 'Bcl11b', 'Slc7a11', 'Gimap4', 'Trac', 'Cirbp', 'H2-Q6', 'Ccl5', 'Trbc2', 'Aldh1a1', 'Cxcl2', 'Cd3e', 'Cd3g', 'Igkc']

Compare expression between individual replicates:

[21]:

# Compare replicates 1 and 3
kompot.plot.heatmap(
    adata,
    genes=variable_genes,
    groupby=CELL_TYPE_COLUMN,
    condition_column=SAMPLE_COLUMN,
    condition1="1",
    condition1_name="Replicate 1",
    condition2="3",
    condition2_name="Replicate 3",
    vmin="p2",
    vmax="p98",
    cluster_rows=False,
)

[2025-10-03 13:42:35,856] [INFO    ] Inferred layer='logged_counts' from run information
[2025-10-03 13:42:35,857] [INFO    ] Creating split heatmap with 20 genes/features
[2025-10-03 13:42:35,858] [INFO    ] Using display names: 'Replicate 1' for condition1, 'Replicate 3' for condition2
[2025-10-03 13:42:35,859] [INFO    ] Using expression data from layer: 'logged_counts'
[2025-10-03 13:42:35,939] [INFO    ] Applying gene-wise z-scoring (standard_scale='var')

../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_37_1.png

These genes show substantial expression differences between replicates of the same condition, explaining why their significance decreases when accounting for sample variance.

Run Tracking¶

Inspect metadata for both analyses:

[23]:

# Standard analysis
run0 = kompot.RunInfo(adata, run_id=0, analysis_type="de")
run0

[23]:

Run 0 (DE Analysis)

Run Summary

Analysis: DE | Run ID: 0 | Timestamp: 2025-10-03T19:33:09.986415 | Conditions: Young to OldFields Overwritten: 4

Parameter	Value
conditions	Young to Old
obsm_key	DM_EigenVectors
uses_sample_variance	False
layer	logged_counts
timestamp	2025-10-03T19:33:09.986415
Fields Created	9

All Parameters

Parameter	Value
auto_filtered	False
batch_size	0
compute_mahalanobis	True
condition1	Young
condition2	Old
copy	False
eps	1e-08
fdr_threshold	0.05
groupby	Age
inplace	True
jit_compile	False
landmarks	False
layer	logged_counts
ls_factor	10.0
max_memory_ratio	0.8
min_cells	2
n_landmarks	5000
null_genes	2000
null_seed	42
obsm_key	DM_EigenVectors
result_key	kompot_de
sigma	1.0
store_landmarks	False
store_posterior_covariance	False
use_sample_variance	False
used_landmarks	False

Environment

Parameter	Value
hostname	gizmok39
pid	35268
platform	Linux-4.15.0-213-generic-x86_64-with-glibc2.27
python_version	3.12.10
timestamp	2025-10-03T19:33:09.986565
username	dotto

Fields Created by This Run

Total Fields: 9 | Present: 5 | Missing: 0 | Overwritten: 4

Field Name	Location	Description	Status
LAYERS Fields
kompot_de_Old_imputed	layers	[imputed] Imputed expression for Old	Overwritten by Run 1
kompot_de_Young_imputed	layers	[imputed] Imputed expression for Young	Overwritten by Run 1
kompot_de_Young_to_Old_fold_change	layers	[fold_change] Log fold change for each cell and gene	Overwritten by Run 1
OBS Fields
kompot_de_Old_std	obs	[std] Posterior standard deviation of imputed expression for Old (same for all genes)	Present
kompot_de_Young_std	obs	[std] Posterior standard deviation of imputed expression for Young (same for all genes)	Present
VAR Fields
kompot_de_Young_to_Old_is_de	var	[is_de] Boolean indicator of differential expression at local FDR < 0.05	Present
kompot_de_Young_to_Old_mahalanobis	var	[mahalanobis] Mahalanobis distances	Present
kompot_de_Young_to_Old_mahalanobis_local_fdr	var	[mahalanobis_local_fdr] Local FDR values using empirical null estimation similar to R's fdrtool	Present
kompot_de_Young_to_Old_mean_lfc	var	[mean_log_fold_change] Mean log fold change values	Overwritten by Run 1

[25]:

# Compare runs
run0.compare_with(1)

[25]:

Comparison of Run 0 and Run 1

Summary

Analysis Type: DE | Run 0: unknown (2025-10-03T19:33:09.986415) | Run 1: unknown (2025-10-03T19:45:12.097844)

Parameters: 5 different, 32 sameFields: 8 different, 4 same

Aspect	Run Details
Aspect	Run 0	Run 1
conditions	Young to Old	Young to Old
result_key	kompot_de	kompot_de
uses_sample_variance	False	True
timestamp	2025-10-03T19:33:09.986415	2025-10-03T19:45:12.097844
Field Count	9	7

Parameter Differences

Total Parameters: 37 | Different: 5 | Only in Run 0: 0 | Only in Run 1: 0

Different Variance Method

Key Parameter Differences

Parameter	Run 0	Run 1
use_sample_variance	False	True

All Parameter Differences

Parameter	Run 0	Run 1
Different Parameters
genes	None	Index(['H2-Q7', 'Cd74', 'H2-Aa', 'H2-Ab1', 'Igkc', 'H2-Eb1', 'AW112010', 'S100a9', 'Ifitm3', 'S100a8', ... 'Rpl12', 'Dnaja1', 'Meis1', 'H2-T23', 'Pafah1b3', 'Il7r', 'Mmp8', 'Spi1', 'Bcl11a', 'Atpif1'], dtype='object', length=200)
null_genes	2000	0
sample_col	None	Replicate
store_arrays_on_disk	None	True
32 parameters are the same in both runs

Field Differences

Fields in both runs: 4 | Only in Run 0: 5 | Only in Run 1: 3

Field Name	Location	Status	Last Modified By
LAYERS Shared Fields
kompot_de_Old_imputed	layers	Current value from Run 1	Run 1
kompot_de_Young_imputed	layers	Current value from Run 1	Run 1
kompot_de_Young_to_Old_fold_change	layers	Current value from Run 1	Run 1
VAR Shared Fields
kompot_de_Young_to_Old_mean_lfc	var	Current value from Run 1	Run 1
LAYERS Different Fields
kompot_de_Old_std	layers	Only in Run 1	Run 1
kompot_de_Young_std	layers	Only in Run 1	Run 1
OBS Different Fields
kompot_de_Old_std	obs	Only in Run 0	Run 0
kompot_de_Young_std	obs	Only in Run 0	Run 0
VAR Different Fields
kompot_de_Young_to_Old_is_de	var	Only in Run 0	Run 0
kompot_de_Young_to_Old_mahalanobis	var	Only in Run 0	Run 0
kompot_de_Young_to_Old_mahalanobis_local_fdr	var	Only in Run 0	Run 0
kompot_de_Young_to_Old_mahalanobis_sample_var	var	Only in Run 1	Run 1

Note on shared fields: When both runs define the same field, the last run to write to the field overwrites the previous value. The 'Last Modified By' column shows which run's value is currently stored.

Best Practices¶

When to Use Sample Variance¶

Use sample variance when:

You have ≥3 biological replicates per condition
You want conservative, reproducible results
Inter-sample variability is expected (e.g., patient samples)

Skip sample variance when:

Only one sample per condition
Computational resources are limited

Resource Considerations¶

Sample variance increases computation time and memory:

DA: Modest increase (one estimator per sample)
DE: Substantial increase (gene-specific covariances)

For DE with sample variance:

Always run dry_run_differential_expression first
Consider analyzing a subset of genes (top hits from standard analysis)
Use store_arrays_on_disk=True for large datasets
Set null_genes=0 if you don’t need FDR thresholds

Interpretation¶

Results with sample variance are:

More conservative: Fewer significant findings
More reproducible: Robust to sample-specific effects
More biologically meaningful: Account for natural variation

Genes/cell states that remain significant after sample variance correction represent robust biological signals consistent across replicates.

Saving Results¶

[25]:

# Cleanup large layers from both runs
kompot.cleanup(adata)

[2025-10-03 13:42:38,074] [INFO    ] Cleaning up all 2 run(s)
[2025-10-03 13:42:38,087] [INFO    ] Cleaned up 3 field(s) from run 0:
[2025-10-03 13:42:38,088] [INFO    ]   layers (3 field(s)):
[2025-10-03 13:42:38,088] [INFO    ]     - kompot_de_Young_imputed
[2025-10-03 13:42:38,089] [INFO    ]     - kompot_de_Old_imputed
[2025-10-03 13:42:38,090] [INFO    ]     - kompot_de_Young_to_Old_fold_change
[2025-10-03 13:42:38,092] [INFO    ] Cleaned up 2 field(s) from run 1:
[2025-10-03 13:42:38,092] [INFO    ]   layers (2 field(s)):
[2025-10-03 13:42:38,093] [INFO    ]     - kompot_de_Young_std
[2025-10-03 13:42:38,094] [INFO    ]     - kompot_de_Old_std
[2025-10-03 13:42:38,094] [INFO    ] Total: Cleaned up 5 field(s) across 2 run(s)

[26]:

adata.write_h5ad("../data/murine_bone_marrow_aging_with_sample_variance.h5ad")

Summary¶

This tutorial demonstrated:

✓ Why sample variance matters in multi-sample studies
✓ How to enable sample variance with sample_col parameter
✓ Resource planning for sample variance analyses
✓ Comparing results with/without sample variance
✓ Identifying sample-variable genes
✓ Best practices for sample variance analysis

Key Takeaways¶

Sample variance provides more reliable significance estimates when you have biological replicates
Effect sizes remain identical; only statistical significance changes
Results are more conservative but also more reproducible
Use resource planning for DE with sample variance due to increased memory requirements

For more information, see the Kompot documentation.