Sample-Specific Variance in Differential Analysis

This tutorial demonstrates how to account for biological variability across samples in differential analysis. This is crucial when analyzing data from multiple biological replicates.

Why Sample Variance Matters

When comparing conditions (e.g., treated vs. control), we typically have multiple biological replicates per condition. Biological variation between samples within the same condition is often larger than technical variation, and ignoring it can lead to:

  • Inflated statistical significance

  • High false positive rates

  • Poor reproducibility across studies

Kompot’s sample variance estimation:

✓ Builds separate estimators for each sample
✓ Computes sample-to-sample covariance
✓ Adjusts significance based on inter-sample variability
✓ Produces more conservative and reproducible results

Dataset

We’ll use bone marrow data from young and old mice, with 3 biological replicates per age group.

[2]:
import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import palantir
import pandas as pd
import scanpy as sc

import kompot

plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False
plt.rcParams["image.cmap"] = "Spectral_r"
[3]:
DATA_PATH = "../data/murine_bone_marrow_aging.h5ad"
SAMPLE_COLUMN = "Replicate"        # Sample/replicate identifiers
GROUPING_COLUMN = "Age"
CONDITIONS = ["Young", "Old"]
CELL_TYPE_COLUMN = "highres_celltype"
DIMENSIONALITY_REDUCTION = "DM_EigenVectors"
LAYER_FOR_EXPRESSION = "logged_counts"

Load Data

[4]:
adata = ad.read_h5ad(DATA_PATH)
palantir.utils.run_diffusion_maps(adata, pca_key="X_pca_harmony", n_components=40)
adata
[4]:
AnnData object with n_obs × n_vars = 8090 × 16285
    obs: 'Compartment', 'Replicate', 'Age', 'Sample', 'Info', 'batch', 'doublet_score', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'S_score', 'G2M_score', 'phase', 'leiden', 'phenograph', 'highres_celltype', 'midres_celltype'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'hb', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'Age_colors', 'Compartment_colors', 'DMEigenValues', 'Info_colors', 'README', 'Replicate_colors', 'Sample_colors', 'batch_colors', 'draw_graph', 'highres_celltype_colors', 'hvg', 'leiden', 'leiden_colors', 'midres_celltype_colors', 'neighbors', 'pca', 'phase_colors', 'umap', 'DM_EigenValues'
    obsm: 'AbCapture', 'DM_EigenVectors', 'HTO', 'X_draw_graph_fa', 'X_pca', 'X_pca_harmony', 'X_pca_noregression', 'X_umap'
    varm: 'PCs'
    layers: 'MAGIC_imputed_data', 'logged_counts', 'normalized_counts', 'raw_counts'
    obsp: 'DM_Kernel', 'connectivities', 'distances', 'DM_Similarity'

Examine Sample Structure

[5]:
# Check sample distribution
sample_counts = adata.obs.groupby([GROUPING_COLUMN, SAMPLE_COLUMN]).size().unstack(fill_value=0)
print("Cells per sample:\n", sample_counts)
Cells per sample:
 Replicate     1     2     3
Age
Mid        1035  1022     0
Old        1057  1019  1040
Young       944  1007   966
/loc/scratch/35159533/ipykernel_27996/1340399861.py:2: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  sample_counts = adata.obs.groupby([GROUPING_COLUMN, SAMPLE_COLUMN]).size().unstack(fill_value=0)
[6]:
sc.pl.umap(adata, color=[SAMPLE_COLUMN, GROUPING_COLUMN, CELL_TYPE_COLUMN], wspace=.2, frameon=False)
../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_7_0.png

Standard Analysis (Without Sample Variance)

First, we’ll run analyses without accounting for sample variance. This serves as a baseline for comparison.

Differential Abundance

[7]:
da_standard = kompot.compute_differential_abundance(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    obsm_key=DIMENSIONALITY_REDUCTION,
)
[2025-10-03 13:22:53,906] [INFO    ] Condition 1 (Young): 2,917 cells
[2025-10-03 13:22:53,907] [INFO    ] Condition 2 (Old): 3,116 cells
[2025-10-03 13:22:53,910] [INFO    ] Fitting density estimator for condition 1...
WARNING:2025-10-03 13:22:54,045:jax._src.xla_bridge:966: An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
[2025-10-03 13:24:01,899] [INFO    ] Fitting density estimator for condition 2...
[2025-10-03 13:24:36,888] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:24:38,245] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:24:48,611] [INFO    ] This run will have `run_id=0`.

Differential Expression

[8]:
de_standard = kompot.compute_differential_expression(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    layer=LAYER_FOR_EXPRESSION,
    obsm_key=DIMENSIONALITY_REDUCTION,
    batch_size=0, # No batching for this small dataset
)
[2025-10-03 13:24:51,212] [INFO    ] Condition 1 (Young): 2,917 cells
[2025-10-03 13:24:51,213] [INFO    ] Condition 2 (Old): 3,116 cells
[2025-10-03 13:24:51,214] [INFO    ] Using 8090 of 8090 cells (100.0%)
[2025-10-03 13:24:51,757] [INFO    ] Preparing null distribution with null_genes=2000, null_seed=42
[2025-10-03 13:24:53,187] [INFO    ] Generated shuffled expression for 2000 null genes
[2025-10-03 13:25:40,733] [INFO    ] Fitting expression estimator for condition 1...
[2025-10-03 13:25:59,745] [INFO    ] Fitting expression estimator for condition 2...
[2025-10-03 13:26:41,529] [INFO    ] Landmark storage skipped (store_landmarks=False). Compute with store_landmarks=True to enable landmark reuse.
[2025-10-03 13:27:29,723] [INFO    ] Using 5,000 landmarks for Mahalanobis computation
[2025-10-03 13:27:56,773] [INFO    ] Computing FDR statistics from null distribution
[2025-10-03 13:27:59,165] [INFO    ] FDR analysis complete: 393/16285 genes significantly DE at FDR < 0.05
[2025-10-03 13:27:59,167] [INFO    ] Mahalanobis distance threshold for FDR < 0.05: 23.1080
[2025-10-03 13:28:05,630] [INFO    ] This run will have `run_id=0`.

Analysis With Sample Variance

Now we’ll repeat the analyses while accounting for sample variance by specifying the sample_col parameter.

Differential Abundance

When accounting for sample variance, you may want to relax significance thresholds slightly, as the test becomes more conservative:

[9]:
da_sample_var = kompot.compute_differential_abundance(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    obsm_key=DIMENSIONALITY_REDUCTION,
    sample_col=SAMPLE_COLUMN,  # Enable sample variance
    ptp_threshold=0.1,         # Relaxed from default 0.05
)
[2025-10-03 13:28:06,829] [INFO    ] Results with result_key='kompot_da' already exist in the dataset. Previous run was at 2025-10-03T13:24:48.611524 comparing Young to Old. Fields that will be overwritten: obs.kompot_da_Young_to_Old_lfc, obs.kompot_da_Young_log_density, obs.kompot_da_Old_log_density. Note: Only fields NOT affected by sample variance (like lfc, log_density) will be overwritten since they don't use the sample variance suffix. These results will likely be identical if other parameters haven't changed. This is a partial rerun with sample variance added to a previous analysis with matching parameters. Set overwrite=False to prevent overwriting or overwrite=True to silence this message.
[2025-10-03 13:28:06,833] [INFO    ] Condition 1 (Young): 2,917 cells
[2025-10-03 13:28:06,834] [INFO    ] Condition 2 (Old): 3,116 cells
[2025-10-03 13:28:06,839] [INFO    ] Using sample column 'Replicate' for sample variance estimation
[2025-10-03 13:28:06,841] [INFO    ] Found 3 unique sample(s) in condition 1
[2025-10-03 13:28:06,843] [INFO    ] Found 3 unique sample(s) in condition 2
[2025-10-03 13:28:06,843] [INFO    ] Fitting density estimator for condition 1...
[2025-10-03 13:28:27,958] [INFO    ] Fitting density estimator for condition 2...
[2025-10-03 13:28:39,128] [INFO    ] Sample variance estimation automatically enabled due to provided sample indices
[2025-10-03 13:28:39,130] [INFO    ] Setting up sample variance estimation...
[2025-10-03 13:28:39,132] [INFO    ] Found 3 unique groups for variance estimation
[2025-10-03 13:28:39,133] [INFO    ] 3 groups have sufficient cells (>= 2) for variance estimation
[2025-10-03 13:28:39,133] [INFO    ] Training estimator for group 1 with 944 cells
[2025-10-03 13:28:47,487] [INFO    ] Training estimator for group 2 with 1,007 cells
[2025-10-03 13:28:55,294] [INFO    ] Training estimator for group 3 with 966 cells
[2025-10-03 13:29:03,927] [INFO    ] Found 3 unique groups for variance estimation
[2025-10-03 13:29:03,934] [INFO    ] 3 groups have sufficient cells (>= 2) for variance estimation
[2025-10-03 13:29:03,937] [INFO    ] Training estimator for group 1 with 1,057 cells
[2025-10-03 13:29:11,520] [INFO    ] Training estimator for group 2 with 1,019 cells
[2025-10-03 13:29:19,274] [INFO    ] Training estimator for group 3 with 1,040 cells
[2025-10-03 13:29:26,788] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:26,795] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:31,912] [INFO    ] Computing sample-specific variance for condition 1...
[2025-10-03 13:29:33,082] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:33,825] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:34,605] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:35,334] [INFO    ] Computing sample-specific variance for condition 2...
[2025-10-03 13:29:35,814] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:36,484] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:37,193] [WARNING ] The normalization is only effective if the density was trained with d_method="fractal".
[2025-10-03 13:29:37,883] [INFO    ] This run will have `run_id=1`.

Differential Expression

For DE with sample variance, memory requirements increase because Kompot computes gene-specific covariance matrices. We’ll use several strategies to manage this:

  1. Limit to top genes: Analyze only the most significant genes from the standard analysis

  2. Use disk storage: Store large covariance matrices on disk instead of in memory

  3. Disable null genes: Skip FDR computation (set null_genes=0) to avoid computing sample covariances for null genes

Resource Planning

Before running, estimate resource requirements:

[10]:
# Select top 200 genes
topn = 200
top_genes = adata.var.sort_values(
    "kompot_de_Young_to_Old_mahalanobis", ascending=False
).head(topn).index

# Check resource requirements
plan = kompot.dry_run_differential_expression(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    layer=LAYER_FOR_EXPRESSION,
    obsm_key=DIMENSIONALITY_REDUCTION,
    sample_col=SAMPLE_COLUMN,
    genes=top_genes,
    store_arrays_on_disk=True,
    null_genes=0,  # Skip FDR to save memory
    batch_size=0, # batching does not effect the most costly covariance tensor
)
================================================================================
RESOURCE USAGE PLAN
================================================================================

System Resources:
  Memory: 359.31 GB available (of 754.59 GB total)
  Disk:   6.04 TB available at /loc/scratch/35159533

Total Requirements:
  Memory: 1.30 GB (0% of available)
  Disk:   74.51 GB (1% of available)

Memory Allocations:
  • Mellon precision matrix L (condition 1, 2,917/3,116 cells) (2917, 2917): 64.92 MB
  • Mellon precision matrix L (condition 2, 2,917/3,116 cells) (3116, 3116): 74.08 MB
  • Imputed expression (condition 1) (8090, 200): 12.34 MB → adata.layers['kompot_de_Young_imputed']
  • Imputed expression (condition 2) (8090, 200): 12.34 MB → adata.layers['kompot_de_Old_imputed']
  • Fold change (8090, 200): 12.34 MB → adata.layers['kompot_de_Young_to_Old_fold_change']
  • Standard deviation (condition 1) (8090, 200): 12.34 MB → adata.layers['kompot_de_Young_std']
  • Standard deviation (condition 2) (8090, 200): 12.34 MB → adata.layers['kompot_de_Old_std']
  • Temporary matrices during predictions (batch_size=0) (8090, 5000) + (8090, 200): 320.95 MB
  • Function predictor covariances (per condition) (5000, 5000): 381.47 MB
  • Combined covariance matrix (5000, 5000): 190.73 MB
  • Cholesky decomposition (for Mahalanobis) (5000, 5000): 190.73 MB
  • Per-sample imputations (3 samples) (3, 5000, 200): 45.78 MB

Disk Storage:
  • Sample covariances (per condition, 3 samples) (5000, 5000, 200): 74.51 GB

Output Fields:
  adata.layers:
    - kompot_de_Young_imputed [OVERWRITES run_id=0]
    - kompot_de_Old_imputed [OVERWRITES run_id=0]
    - kompot_de_Young_to_Old_fold_change [OVERWRITES run_id=0]
    - kompot_de_Young_std
    - kompot_de_Old_std
  adata.var:
    - kompot_de_Young_to_Old_mahalanobis_sample_var
    - kompot_de_Young_to_Old_mean_lfc [OVERWRITES run_id=0]

Info:
  ℹ Cell batch_size (0) processes all 8090 cells at once. Consider reducing batch_size (e.g., batch_size=500) to lower peak memory by 301.12 MB.
  ℹ Disk arrays will be stored at system temp: /loc/scratch/35159533. To use a different location, set disk_storage_dir='/path/to/disk' or export TMPDIR=/path/to/disk

Warnings:
  ⚠ Results with result_key='kompot_de' already exist (run_id=0). Previous run: 2025-10-03T16:44:09.207147 comparing Young to Old (null_genes=2000). Fields that will be overwritten: var.kompot_de_Young_to_Old_mean_lfc, layers.kompot_de_Young_imputed, layers.kompot_de_Old_imputed, layers.kompot_de_Young_to_Old_fold_change

================================================================================
STATUS: ⚠ FEASIBLE WITH WARNINGS - Proceed with caution
================================================================================

The dry run shows:

  • Memory requirements for sample covariance computation

  • Disk space needed for storage

  • Which fields will be overwritten

  • Previous run details

If resources are sufficient, proceed with the analysis:

[17]:
de_sample_var = kompot.compute_differential_expression(
    adata,
    groupby=GROUPING_COLUMN,
    condition1=CONDITIONS[0],
    condition2=CONDITIONS[1],
    layer=LAYER_FOR_EXPRESSION,
    obsm_key=DIMENSIONALITY_REDUCTION,
    sample_col=SAMPLE_COLUMN,  # Enable sample variance
    genes=top_genes,
    store_arrays_on_disk=True,
    null_genes=0,
    batch_size=0, # batching does not effect the most costly covariance tensor
)
[2025-10-03 17:00:33,268] [INFO    ] Differential expression results with result_key='kompot_de' already exist in the dataset. Previous run was at 2025-10-03T16:44:09.207147 comparing Young to Old. Fields that will be overwritten: var.kompot_de_Young_to_Old_mean_lfc, layers.kompot_de_Young_imputed, layers.kompot_de_Old_imputed, layers.kompot_de_Young_to_Old_fold_change. Note: Only fields NOT affected by sample variance (like mean_log_fold_change, imputed data, fold_change) will be overwritten since they don't use the sample variance suffix. These results will likely be identical if other parameters haven't changed. This is a partial rerun with sample variance added to a previous analysis with matching parameters. Set overwrite=False to prevent overwriting or overwrite=True to silence this message.
[2025-10-03 17:00:33,270] [INFO    ] Condition 1 (Young): 2,917 cells
[2025-10-03 17:00:33,271] [INFO    ] Condition 2 (Old): 3,116 cells
[2025-10-03 17:00:33,271] [INFO    ] Using 8090 of 8090 cells (100.0%)
[2025-10-03 17:00:33,815] [INFO    ] Using sample column 'Replicate' for sample variance estimation
[2025-10-03 17:00:33,817] [INFO    ] Found 3 unique sample(s) in condition 1
[2025-10-03 17:00:33,818] [INFO    ] Found 3 unique sample(s) in condition 2
[2025-10-03 17:00:39,966] [INFO    ] Fitting expression estimator for condition 1...
[2025-10-03 17:00:47,408] [INFO    ] Fitting expression estimator for condition 2...
[2025-10-03 17:01:01,966] [INFO    ] Found 3 unique groups for variance estimation
[2025-10-03 17:01:01,967] [INFO    ] 3 groups have sufficient cells (>= 2) for variance estimation
[2025-10-03 17:01:01,968] [INFO    ] Training estimator for group 1 with 944 cells
[2025-10-03 17:01:13,175] [INFO    ] Training estimator for group 2 with 1,007 cells
[2025-10-03 17:01:26,939] [INFO    ] Training estimator for group 3 with 966 cells
[2025-10-03 17:01:39,744] [INFO    ] Found 3 unique groups for variance estimation
[2025-10-03 17:01:39,745] [INFO    ] 3 groups have sufficient cells (>= 2) for variance estimation
[2025-10-03 17:01:39,746] [INFO    ] Training estimator for group 1 with 1,057 cells
[2025-10-03 17:02:00,552] [INFO    ] Training estimator for group 2 with 1,019 cells
[2025-10-03 17:02:36,124] [INFO    ] Training estimator for group 3 with 1,040 cells
[2025-10-03 17:02:59,580] [INFO    ] Landmark storage skipped (store_landmarks=False). Compute with store_landmarks=True to enable landmark reuse.
[2025-10-03 17:03:21,074] [INFO    ] Using 5,000 landmarks for Mahalanobis computation
[2025-10-03 17:03:31,704] [INFO    ] Disk space at /loc/scratch/35159533/kompot_arrays_15545be0_wy9kdwrv:
[2025-10-03 17:03:31,705] [INFO    ]   Total: 6.67 TB | Used: 297.64 GB | Free: 6.04 TB
[2025-10-03 17:03:31,705] [INFO    ]   Expected storage need: 44.70 GB
[2025-10-03 17:03:31,706] [INFO    ] Initialized disk storage at /loc/scratch/35159533/kompot_arrays_15545be0_wy9kdwrv with dask support (namespace: ns_15545be0)
[2025-10-03 17:03:31,706] [INFO    ] Using gene-by-gene disk storage for covariance matrix (shape=(5000, 5000, 200))
[2025-10-03 17:03:33,600] [INFO    ] Using Dask for parallel disk-backed covariance computation (all cores) (shape=(5000, 5000, 200))
[2025-10-03 17:03:34,700] [INFO    ] Removed temporary storage directory /tmp/kompot_arrays_bce0bf5b_egezr1d3
[2025-10-03 17:03:35,380] [INFO    ] Created Dask array for covariance with shape (5000, 5000, 200)
[2025-10-03 17:03:35,385] [INFO    ] Disk space at /loc/scratch/35159533/kompot_arrays_8f0ca8f5_jybq2uhb:
[2025-10-03 17:03:35,386] [INFO    ]   Total: 6.67 TB | Used: 297.64 GB | Free: 6.04 TB
[2025-10-03 17:03:35,386] [INFO    ]   Expected storage need: 44.70 GB
[2025-10-03 17:03:35,387] [INFO    ] Initialized disk storage at /loc/scratch/35159533/kompot_arrays_8f0ca8f5_jybq2uhb with dask support (namespace: ns_8f0ca8f5)
[2025-10-03 17:03:35,387] [INFO    ] Using gene-by-gene disk storage for covariance matrix (shape=(5000, 5000, 200))
[2025-10-03 17:03:37,034] [INFO    ] Using Dask for parallel disk-backed covariance computation (all cores) (shape=(5000, 5000, 200))
[2025-10-03 17:03:38,714] [INFO    ] Created Dask array for covariance with shape (5000, 5000, 200)
[2025-10-03 17:06:23,107] [INFO    ] Computing Mahalanobis distances using gene-specific covariance matrices
[2025-10-03 17:06:23,109] [INFO    ] Computing Mahalanobis distances for 200 genes using dask
[2025-10-03 17:11:27,690] [INFO    ] Successfully computed Mahalanobis distances for 200 genes using gene-specific covariance
[2025-10-03 17:11:59,242] [INFO    ] This run will have `run_id=1`.
[2025-10-03 17:11:59,304] [INFO    ] Removed temporary storage directory /loc/scratch/35159533/kompot_arrays_15545be0_wy9kdwrv
[2025-10-03 17:11:59,414] [INFO    ] Removed temporary storage directory /loc/scratch/35159533/kompot_arrays_8f0ca8f5_jybq2uhb

Comparing Results

Differential Abundance

Results with sample variance include a _sample_var suffix to distinguish them from standard results.

Compare z-scores with and without sample variance:

[12]:
sc.pl.embedding(
    adata, "umap",
    color=[
        "kompot_da_Young_to_Old_lfc",
        "kompot_da_Young_to_Old_lfc_zscore",
        "kompot_da_Young_to_Old_lfc_zscore_sample_var",
    ],
    color_map="RdBu_r",
    vcenter=0,
    ncols=3,
)
../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_19_0.png

Notice:

  • Log fold changes (left) are identical - sample variance only affects significance

  • Z-scores (middle and right) differ - sample variance produces more conservative estimates

Volcano Plots

Compare volcano plots from both analyses:

[13]:
# Standard analysis (run_id=0)
kompot.plot.volcano_da(
    adata,
    color=CELL_TYPE_COLUMN,
    run_id=0,
    title="Standard Analysis (Pooled Samples)",
)
[2025-10-03 13:42:32,069] [INFO    ] Found DA run info for run_id=0
[2025-10-03 13:42:32,070] [INFO    ] Found lfc_key='kompot_da_Young_to_Old_lfc' from run info
[2025-10-03 13:42:32,071] [INFO    ] Found ptp_key='kompot_da_Young_to_Old_neg_log10_lfc_ptp' from run info
[2025-10-03 13:42:32,072] [WARNING ] Field 'kompot_da_Young_to_Old_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:32,072] [WARNING ] Field 'kompot_da_Young_to_Old_neg_log10_lfc_ptp' was last written by run 0, but current context expects run 1. The field may have been overwritten.
[2025-10-03 13:42:32,073] [INFO    ] Successfully inferred fields: {'lfc_key': 'kompot_da_Young_to_Old_lfc', 'ptp_key': 'kompot_da_Young_to_Old_neg_log10_lfc_ptp'}
[2025-10-03 13:42:32,074] [WARNING ] Field inference completed with 2 warnings
[2025-10-03 13:42:32,074] [INFO    ] Using inferred thresholds - lfc_threshold: 1.0, ptp_threshold: 0.05
../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_22_1.png
[14]:
# With sample variance (run_id=1)
kompot.plot.volcano_da(
    adata,
    color=CELL_TYPE_COLUMN,
    run_id=1,
    title="With Sample Variance",
)
[2025-10-03 13:42:32,601] [INFO    ] Found DA run info for run_id=1
[2025-10-03 13:42:32,602] [INFO    ] Found lfc_key='kompot_da_Young_to_Old_lfc' from run info
[2025-10-03 13:42:32,602] [INFO    ] Found ptp_key='kompot_da_Young_to_Old_neg_log10_lfc_ptp_sample_var' from run info
[2025-10-03 13:42:32,602] [WARNING ] Field 'kompot_da_Young_to_Old_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:32,603] [INFO    ] Successfully inferred fields: {'lfc_key': 'kompot_da_Young_to_Old_lfc', 'ptp_key': 'kompot_da_Young_to_Old_neg_log10_lfc_ptp_sample_var'}
[2025-10-03 13:42:32,603] [WARNING ] Field inference completed with 1 warnings
[2025-10-03 13:42:32,603] [INFO    ] Using inferred thresholds - lfc_threshold: 1.0, ptp_threshold: 0.1
../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_23_1.png

Note that thresholds such as ptp_threshold and lfc_threshold can be adjusted, and if update_direction=True is passed, then the cell-wise classification in adata.obs["kompot_da_Young_to_Old_lfc_direction_sample_var"] is updated accordingly.

[15]:
kompot.plot.volcano_da(adata, color=CELL_TYPE_COLUMN, run_id=1, ptp_threshold=0.2, update_direction=True)
[2025-10-03 13:42:33,119] [INFO    ] Found DA run info for run_id=1
[2025-10-03 13:42:33,120] [INFO    ] Found lfc_key='kompot_da_Young_to_Old_lfc' from run info
[2025-10-03 13:42:33,121] [INFO    ] Found ptp_key='kompot_da_Young_to_Old_neg_log10_lfc_ptp_sample_var' from run info
[2025-10-03 13:42:33,122] [WARNING ] Field 'kompot_da_Young_to_Old_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:33,123] [INFO    ] Successfully inferred fields: {'lfc_key': 'kompot_da_Young_to_Old_lfc', 'ptp_key': 'kompot_da_Young_to_Old_neg_log10_lfc_ptp_sample_var'}
[2025-10-03 13:42:33,124] [WARNING ] Field inference completed with 1 warnings
[2025-10-03 13:42:33,125] [INFO    ] Updating direction column with new thresholds before plotting
[2025-10-03 13:42:33,126] [INFO    ] Found DA run info for run_id=1
[2025-10-03 13:42:33,126] [INFO    ] Found direction_key='kompot_da_Young_to_Old_lfc_direction_sample_var' from run info
[2025-10-03 13:42:33,127] [INFO    ] Successfully inferred fields: {'direction_key': 'kompot_da_Young_to_Old_lfc_direction_sample_var'}
[2025-10-03 13:42:33,128] [INFO    ] Updating direction column 'kompot_da_Young_to_Old_lfc_direction_sample_var' with thresholds: lfc_threshold=1.0, ptp_threshold=0.2
[2025-10-03 13:42:33,132] [INFO    ] Updated direction counts: up=595, down=495, neutral=7000
[2025-10-03 13:42:33,133] [INFO    ] Using inferred thresholds - lfc_threshold: 1.0, ptp_threshold: 0.2
../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_25_1.png

Key Observations:

  • Posterior tail probabilities (y-axis) are generally lower with sample variance

  • Fewer cells pass significance thresholds

  • Effect sizes (x-axis) remain unchanged

  • Cell type patterns are preserved but with adjusted confidence

Differential Expression

Compare Mahalanobis distances with and without sample variance:

[16]:
# Top genes by standard analysis
comparison_df = adata.var.loc[top_genes, :][
    [
        "kompot_de_Young_to_Old_mean_lfc",
        "kompot_de_Young_to_Old_mahalanobis",
        "kompot_de_Young_to_Old_mahalanobis_sample_var",
    ]
].sort_values("kompot_de_Young_to_Old_mahalanobis", ascending=False)

comparison_df.head(20)
[16]:
kompot_de_Young_to_Old_mean_lfc kompot_de_Young_to_Old_mahalanobis kompot_de_Young_to_Old_mahalanobis_sample_var
H2-Q7 1.235066 71.517323 45.085446
Cd74 0.352121 60.021452 49.584290
H2-Aa 0.473756 59.371135 38.489260
H2-Ab1 0.511123 57.808982 33.768536
Igkc 0.054009 53.604112 38.973926
H2-Eb1 0.411752 53.403829 31.734637
AW112010 0.791367 53.386235 28.091721
S100a9 -0.170317 48.765184 46.502283
Ifitm3 0.367545 47.585813 40.375510
S100a8 -0.165121 47.372815 45.495007
H2-Q6 0.782279 45.991624 29.823029
Cd52 -0.127807 45.309906 42.044778
Aldh1a1 0.490811 44.596577 28.850373
Ifitm1 0.427663 44.320577 38.114722
Ifitm2 0.285836 43.940652 38.424422
Ighm -0.173726 43.216410 35.535580
Cd79a -0.004329 43.107792 33.076867
Apoe -0.131211 42.935707 39.211520
Fos 0.531571 42.282847 37.407770
Gm47283 0.627173 41.956500 23.844169

Notice that some genes show substantially lower significance when accounting for sample variance, indicating high variability across replicates.

[17]:
# Volcano plot: Standard analysis
kompot.plot.volcano_de(
    adata,
    run_id=0,
    title="Standard Analysis (Pooled Samples)",
)
[2025-10-03 13:42:33,702] [INFO    ] Found DE run info for run_id=0
[2025-10-03 13:42:33,703] [INFO    ] Found mean_lfc_key='kompot_de_Young_to_Old_mean_lfc' from run info
[2025-10-03 13:42:33,704] [INFO    ] Found mahalanobis_key='kompot_de_Young_to_Old_mahalanobis' from run info
[2025-10-03 13:42:33,705] [WARNING ] Field 'kompot_de_Young_to_Old_mean_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:33,705] [WARNING ] Field 'kompot_de_Young_to_Old_mahalanobis' was last written by run 0, but current context expects run 1. The field may have been overwritten.
[2025-10-03 13:42:33,706] [INFO    ] Successfully inferred fields: {'mean_lfc_key': 'kompot_de_Young_to_Old_mean_lfc', 'mahalanobis_key': 'kompot_de_Young_to_Old_mahalanobis'}
[2025-10-03 13:42:33,707] [WARNING ] Field inference completed with 2 warnings
[2025-10-03 13:42:33,708] [INFO    ] Using DE run 0: comparing Young to Old
[2025-10-03 13:42:33,719] [INFO    ] Using data columns from var - lfc: 'kompot_de_Young_to_Old_mean_lfc', score: 'kompot_de_Young_to_Old_mahalanobis'
[2025-10-03 13:42:33,725] [INFO    ] Highlighting 393 genes marked as DE (232 up, 161 down)
[2025-10-03 13:42:33,737] [INFO    ] Labeling top 10 genes by score
../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_30_1.png
[18]:
# Volcano plot: With sample variance
kompot.plot.volcano_de(
    adata,
    run_id=1,
    n_top_genes=50,  # No FDR threshold available (null_genes=0)
    gene_labels=20,
    title="With Sample Variance (Top 200 Genes)",
)
[2025-10-03 13:42:33,990] [INFO    ] Found DE run info for run_id=1
[2025-10-03 13:42:33,992] [INFO    ] Found mean_lfc_key='kompot_de_Young_to_Old_mean_lfc' from run info
[2025-10-03 13:42:33,992] [INFO    ] Found mahalanobis_key='kompot_de_Young_to_Old_mahalanobis_sample_var' from run info
[2025-10-03 13:42:33,993] [WARNING ] Field 'kompot_de_Young_to_Old_mean_lfc' has been written by 2 different runs, indicating potential overwrites
[2025-10-03 13:42:33,994] [INFO    ] Successfully inferred fields: {'mean_lfc_key': 'kompot_de_Young_to_Old_mean_lfc', 'mahalanobis_key': 'kompot_de_Young_to_Old_mahalanobis_sample_var'}
[2025-10-03 13:42:33,995] [WARNING ] Field inference completed with 1 warnings
[2025-10-03 13:42:33,995] [INFO    ] Using DE run 1: comparing Young to Old
[2025-10-03 13:42:34,004] [INFO    ] Using data columns from var - lfc: 'kompot_de_Young_to_Old_mean_lfc', score: 'kompot_de_Young_to_Old_mahalanobis_sample_var'
[2025-10-03 13:42:34,007] [INFO    ] Highlighting top 50 genes by kompot_de_Young_to_Old_mahalanobis_sample_var
[2025-10-03 13:42:34,017] [INFO    ] Labeling top 20 genes by score
../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_31_1.png

Heatmaps

Visualize how ranking changes with sample variance:

[19]:
kompot.plot.heatmap(
    adata,
    n_top_genes=20,
    groupby=CELL_TYPE_COLUMN,
    exclude_groups="Plasma cell",
    run_id=1,  # Use sample variance ranking
    vmin="p1",
    vmax="p99",
)
[2025-10-03 13:42:34,288] [INFO    ] Found DE run info for run_id=1
[2025-10-03 13:42:34,289] [INFO    ] Found mahalanobis_key='kompot_de_Young_to_Old_mahalanobis_sample_var' from run info
[2025-10-03 13:42:34,290] [INFO    ] Successfully inferred fields: {'mahalanobis_key': 'kompot_de_Young_to_Old_mahalanobis_sample_var'}
[2025-10-03 13:42:34,291] [INFO    ] Using DE run 1 for heatmap.
[2025-10-03 13:42:34,291] [INFO    ] Inferred score_key='kompot_de_Young_to_Old_mahalanobis_sample_var' from run information
[2025-10-03 13:42:34,295] [INFO    ] Inferred condition_column='Age' from run information
[2025-10-03 13:42:34,295] [INFO    ] Inferred condition1='Young' from run information
[2025-10-03 13:42:34,296] [INFO    ] Inferred condition2='Old' from run information
[2025-10-03 13:42:34,297] [INFO    ] Inferred layer='logged_counts' from run information
[2025-10-03 13:42:34,297] [INFO    ] Creating split heatmap with 20 genes/features
[2025-10-03 13:42:34,299] [INFO    ] Using expression data from layer: 'logged_counts'
[2025-10-03 13:42:34,419] [INFO    ] Excluded 7 cells from groups: Plasma cell
[2025-10-03 13:42:34,426] [INFO    ] Applying gene-wise z-scoring (standard_scale='var')
../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_33_1.png

Understanding Sample-to-Sample Variability

To illustrate why sample variance matters, let’s examine genes with high inter-sample variability:

[20]:
# Genes with largest decrease in significance
comparison_df["decrease"] = (
    comparison_df["kompot_de_Young_to_Old_mahalanobis"] -
    comparison_df["kompot_de_Young_to_Old_mahalanobis_sample_var"]
)

variable_genes = comparison_df.nlargest(20, "decrease").index
print("Genes with highest sample-to-sample variability:")
print(variable_genes.tolist())
Genes with highest sample-to-sample variability:
['H2-Q7', 'AW112010', 'H2-Ab1', 'H2-Eb1', 'H2-Aa', 'Gm47283', 'Ifi27', 'Bcl11b', 'Slc7a11', 'Gimap4', 'Trac', 'Cirbp', 'H2-Q6', 'Ccl5', 'Trbc2', 'Aldh1a1', 'Cxcl2', 'Cd3e', 'Cd3g', 'Igkc']

Compare expression between individual replicates:

[21]:
# Compare replicates 1 and 3
kompot.plot.heatmap(
    adata,
    genes=variable_genes,
    groupby=CELL_TYPE_COLUMN,
    condition_column=SAMPLE_COLUMN,
    condition1="1",
    condition1_name="Replicate 1",
    condition2="3",
    condition2_name="Replicate 3",
    vmin="p2",
    vmax="p98",
    cluster_rows=False,
)
[2025-10-03 13:42:35,856] [INFO    ] Inferred layer='logged_counts' from run information
[2025-10-03 13:42:35,857] [INFO    ] Creating split heatmap with 20 genes/features
[2025-10-03 13:42:35,858] [INFO    ] Using display names: 'Replicate 1' for condition1, 'Replicate 3' for condition2
[2025-10-03 13:42:35,859] [INFO    ] Using expression data from layer: 'logged_counts'
[2025-10-03 13:42:35,939] [INFO    ] Applying gene-wise z-scoring (standard_scale='var')
../../_build/doctrees/nbsphinx/notebooks_03_sample_variance_37_1.png

These genes show substantial expression differences between replicates of the same condition, explaining why their significance decreases when accounting for sample variance.

Run Tracking

Inspect metadata for both analyses:

[23]:
# Standard analysis
run0 = kompot.RunInfo(adata, run_id=0, analysis_type="de")
run0
[23]:

Run 0 (DE Analysis)

Run Summary
Analysis: DE  |  Run ID: 0  |  Timestamp: 2025-10-03T19:33:09.986415  |  Conditions: Young to OldFields Overwritten: 4
ParameterValue
conditionsYoung to Old
obsm_keyDM_EigenVectors
uses_sample_varianceFalse
layerlogged_counts
timestamp2025-10-03T19:33:09.986415
Fields Created9
All Parameters
ParameterValue
auto_filteredFalse
batch_size0
compute_mahalanobisTrue
condition1Young
condition2Old
copyFalse
eps1e-08
fdr_threshold0.05
groupbyAge
inplaceTrue
jit_compileFalse
landmarksFalse
layerlogged_counts
ls_factor10.0
max_memory_ratio0.8
min_cells2
n_landmarks5000
null_genes2000
null_seed42
obsm_keyDM_EigenVectors
result_keykompot_de
sigma1.0
store_landmarksFalse
store_posterior_covarianceFalse
use_sample_varianceFalse
used_landmarksFalse
Environment
ParameterValue
hostnamegizmok39
pid35268
platformLinux-4.15.0-213-generic-x86_64-with-glibc2.27
python_version3.12.10
timestamp2025-10-03T19:33:09.986565
usernamedotto
Fields Created by This Run
Total Fields: 9  |  Present: 5  |  Missing: 0  |  Overwritten: 4
Field NameLocationDescriptionStatus
LAYERS Fields
kompot_de_Old_imputedlayers[imputed] Imputed expression for OldOverwritten by Run 1
kompot_de_Young_imputedlayers[imputed] Imputed expression for YoungOverwritten by Run 1
kompot_de_Young_to_Old_fold_changelayers[fold_change] Log fold change for each cell and geneOverwritten by Run 1
OBS Fields
kompot_de_Old_stdobs[std] Posterior standard deviation of imputed expression for Old (same for all genes)Present
kompot_de_Young_stdobs[std] Posterior standard deviation of imputed expression for Young (same for all genes)Present
VAR Fields
kompot_de_Young_to_Old_is_devar[is_de] Boolean indicator of differential expression at local FDR < 0.05Present
kompot_de_Young_to_Old_mahalanobisvar[mahalanobis] Mahalanobis distancesPresent
kompot_de_Young_to_Old_mahalanobis_local_fdrvar[mahalanobis_local_fdr] Local FDR values using empirical null estimation similar to R's fdrtoolPresent
kompot_de_Young_to_Old_mean_lfcvar[mean_log_fold_change] Mean log fold change valuesOverwritten by Run 1
[25]:
# Compare runs
run0.compare_with(1)
[25]:

Comparison of Run 0 and Run 1

Summary
Analysis Type: DE  |  Run 0: unknown (2025-10-03T19:33:09.986415)  |  Run 1: unknown (2025-10-03T19:45:12.097844)
Parameters: 5 different, 32 sameFields: 8 different, 4 same
AspectRun Details
Run 0Run 1
conditionsYoung to OldYoung to Old
result_keykompot_dekompot_de
uses_sample_varianceFalseTrue
timestamp2025-10-03T19:33:09.9864152025-10-03T19:45:12.097844
Field Count97
Parameter Differences
Total Parameters: 37  |  Different: 5  |  Only in Run 0: 0  |  Only in Run 1: 0
Different Variance Method
Key Parameter Differences
ParameterRun 0Run 1
use_sample_varianceFalseTrue
All Parameter Differences
ParameterRun 0Run 1
Different Parameters
genesNoneIndex(['H2-Q7', 'Cd74', 'H2-Aa', 'H2-Ab1', 'Igkc', 'H2-Eb1', 'AW112010', 'S100a9', 'Ifitm3', 'S100a8', ... 'Rpl12', 'Dnaja1', 'Meis1', 'H2-T23', 'Pafah1b3', 'Il7r', 'Mmp8', 'Spi1', 'Bcl11a', 'Atpif1'], dtype='object', length=200)
null_genes20000
sample_colNoneReplicate
store_arrays_on_diskNoneTrue
32 parameters are the same in both runs
Field Differences
Fields in both runs: 4  |  Only in Run 0: 5  |  Only in Run 1: 3
Field NameLocationStatusLast Modified By
LAYERS Shared Fields
kompot_de_Old_imputedlayersCurrent value from Run 1Run 1
kompot_de_Young_imputedlayersCurrent value from Run 1Run 1
kompot_de_Young_to_Old_fold_changelayersCurrent value from Run 1Run 1
VAR Shared Fields
kompot_de_Young_to_Old_mean_lfcvarCurrent value from Run 1Run 1
LAYERS Different Fields
kompot_de_Old_stdlayersOnly in Run 1Run 1
kompot_de_Young_stdlayersOnly in Run 1Run 1
OBS Different Fields
kompot_de_Old_stdobsOnly in Run 0Run 0
kompot_de_Young_stdobsOnly in Run 0Run 0
VAR Different Fields
kompot_de_Young_to_Old_is_devarOnly in Run 0Run 0
kompot_de_Young_to_Old_mahalanobisvarOnly in Run 0Run 0
kompot_de_Young_to_Old_mahalanobis_local_fdrvarOnly in Run 0Run 0
kompot_de_Young_to_Old_mahalanobis_sample_varvarOnly in Run 1Run 1
Note on shared fields: When both runs define the same field, the last run to write to the field overwrites the previous value. The 'Last Modified By' column shows which run's value is currently stored.

Best Practices

When to Use Sample Variance

Use sample variance when:

  • You have ≥3 biological replicates per condition

  • You want conservative, reproducible results

  • Inter-sample variability is expected (e.g., patient samples)

Skip sample variance when:

  • Only one sample per condition

  • Computational resources are limited

Resource Considerations

Sample variance increases computation time and memory:

  • DA: Modest increase (one estimator per sample)

  • DE: Substantial increase (gene-specific covariances)

For DE with sample variance:

  1. Always run dry_run_differential_expression first

  2. Consider analyzing a subset of genes (top hits from standard analysis)

  3. Use store_arrays_on_disk=True for large datasets

  4. Set null_genes=0 if you don’t need FDR thresholds

Interpretation

Results with sample variance are:

  • More conservative: Fewer significant findings

  • More reproducible: Robust to sample-specific effects

  • More biologically meaningful: Account for natural variation

Genes/cell states that remain significant after sample variance correction represent robust biological signals consistent across replicates.

Saving Results

[25]:
# Cleanup large layers from both runs
kompot.cleanup(adata)
[2025-10-03 13:42:38,074] [INFO    ] Cleaning up all 2 run(s)
[2025-10-03 13:42:38,087] [INFO    ] Cleaned up 3 field(s) from run 0:
[2025-10-03 13:42:38,088] [INFO    ]   layers (3 field(s)):
[2025-10-03 13:42:38,088] [INFO    ]     - kompot_de_Young_imputed
[2025-10-03 13:42:38,089] [INFO    ]     - kompot_de_Old_imputed
[2025-10-03 13:42:38,090] [INFO    ]     - kompot_de_Young_to_Old_fold_change
[2025-10-03 13:42:38,092] [INFO    ] Cleaned up 2 field(s) from run 1:
[2025-10-03 13:42:38,092] [INFO    ]   layers (2 field(s)):
[2025-10-03 13:42:38,093] [INFO    ]     - kompot_de_Young_std
[2025-10-03 13:42:38,094] [INFO    ]     - kompot_de_Old_std
[2025-10-03 13:42:38,094] [INFO    ] Total: Cleaned up 5 field(s) across 2 run(s)
[26]:
adata.write_h5ad("../data/murine_bone_marrow_aging_with_sample_variance.h5ad")

Summary

This tutorial demonstrated:

✓ Why sample variance matters in multi-sample studies
✓ How to enable sample variance with sample_col parameter
✓ Resource planning for sample variance analyses
✓ Comparing results with/without sample variance
✓ Identifying sample-variable genes
✓ Best practices for sample variance analysis

Key Takeaways

  1. Sample variance provides more reliable significance estimates when you have biological replicates

  2. Effect sizes remain identical; only statistical significance changes

  3. Results are more conservative but also more reproducible

  4. Use resource planning for DE with sample variance due to increased memory requirements

For more information, see the Kompot documentation.