AnnData Integration

The AnnData Integration module provides high-level convenience functions that work directly with AnnData objects. These functions handle the data flow, parameter management, and result storage automatically, making it easy to perform differential analysis with minimal setup.

When to use AnnData Integration:
- You want a simple, one-function-call approach to differential analysis
- You’re working primarily with AnnData objects in your workflow
- You want automatic result storage and metadata tracking
- You prefer convenience over fine-grained control

Key advantages:
- Automatic parameter validation and data preparation
- Built-in result storage with run history tracking
- Seamless integration with plotting functions
- Handles complex data structures (layers, embeddings) automatically

Differential Abundance

Differential abundance analysis for AnnData objects.

kompot.anndata.differential_abundance.compute_differential_abundance(adata, groupby: str, condition1: str, condition2: str, obsm_key: str = 'DM_EigenVectors', n_landmarks: int | None = None, landmarks: ndarray | None = None, sample_col: str | None = None, log_fold_change_threshold: float = 1.0, ptp_threshold: float = 0.05, ls_factor: float = 10.0, jit_compile: bool = False, random_state: int | None = None, copy: bool = False, inplace: bool = True, result_key: str = 'kompot_da', batch_size: int | None = None, overwrite: bool | None = None, store_landmarks: bool = False, return_full_results: bool = False, allow_single_condition_variance: bool = False, **density_kwargs) → Dict[str, ndarray] | AnnData | Tuple[Dict[str, ndarray], AnnData]

Compute differential abundance between two conditions directly from an AnnData object.

This function is a scverse-compatible wrapper around the DifferentialAbundance class that operates directly on AnnData objects.

Parameters:

adata (AnnData) – AnnData object containing cells from both conditions.
groupby (str) – Column in adata.obs containing the condition labels.
condition1 (str) – Label in the groupby column identifying the first condition.
condition2 (str) – Label in the groupby column identifying the second condition.
obsm_key (str, optional) – Key in adata.obsm containing the cell states (e.g., PCA, diffusion maps), by default “DM_EigenVectors”.
n_landmarks (int, optional) – Number of landmarks to use for approximation. If None, use all points, by default None. Ignored if landmarks is provided.
landmarks (np.ndarray, optional) – Pre-computed landmarks to use. If provided, n_landmarks will be ignored. Shape (n_landmarks, n_features).
sample_col (str, optional) – Column name in adata.obs containing sample labels. If provided, these will be used to compute sample-specific variance and will automatically enable sample variance estimation.
allow_single_condition_variance (bool, optional) – If True, allows variance estimation with only one condition having multiple samples. By default False, which requires both conditions to have multiple samples.
log_fold_change_threshold (float, optional) – Threshold for considering a log fold change significant, by default 1.0.
ptp_threshold (float, optional) – Threshold for considering a PTP (Posterior Tail Probability) significant, by default 0.05. The posterior tail probability is a significance measure similar to p-values.
ls_factor (float, optional) – Multiplication factor to apply to length scale when it’s automatically inferred, by default 10.0. Only used when length scale is not explicitly provided.
jit_compile (bool, optional) – Whether to use JAX just-in-time compilation, by default False.
random_state (int, optional) – Random seed for reproducible landmark selection when n_landmarks is specified. Controls the random selection of points when using approximation, by default None.
copy (bool, optional) – If True, return a copy of the AnnData object with results added, by default False.
inplace (bool, optional) – If True, modify adata in place, by default True.
result_key (str, optional) – Key in adata.uns where results will be stored, by default “kompot_da”.
batch_size (int, optional) – Number of samples to process at once during density estimation to manage memory usage. If None or 0, all samples will be processed at once. If processing all at once causes a memory error, a default batch size of 500 will be used automatically. Default is None.
overwrite (bool, optional) –
Controls behavior when results with the same result_key already exist:
- If None (default): Behaves contextually:
  - For partial reruns with sample variance added where other parameters match, logs an informative message at INFO level and proceeds with overwriting
  - For other cases, warns about existing results but proceeds with overwriting
- If True: Silently overwrite existing results
- If False: Raise an error if results would be overwritten
Note: When running with sample_col for a subset of cells that were previously analyzed without sample variance, only fields affected by sample variance will be modified. Fields unaffected by sample variance are rewritten, but their values do not change if the parameters match.
store_landmarks (bool, optional) – Whether to store landmarks in adata.uns for future reuse, by default False. Setting to True will allow reusing landmarks with future analyses but may significantly increase the AnnData file size.
return_full_results (bool, optional) – If True, return the full results dictionary including the differential model, by default False. If False, and if copy=True, only return the AnnData object.
**density_kwargs (dict) – Additional arguments to pass to the DensityEstimator.
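
The three settings of the overwrite parameter described above can be sketched in plain Python. This is an illustration only, not kompot's implementation: the helper name check_overwrite, the use of ValueError, and the use of the warnings module are assumptions, and the special INFO-level path for partial reruns with sample variance is not modeled.

```python
import warnings


def check_overwrite(existing_keys, result_key, overwrite=None):
    """Hypothetical helper mirroring the documented overwrite rules.

    Returns True when the run may proceed.
    """
    if result_key not in existing_keys:
        return True  # nothing stored under this key yet
    if overwrite is False:
        # Documented behavior: raise an error (the exception type is an assumption)
        raise ValueError(f"Results for '{result_key}' already exist.")
    if overwrite is None:
        # Documented behavior: warn about existing results, then proceed
        warnings.warn(f"Overwriting existing results for '{result_key}'.")
    # overwrite=True proceeds silently
    return True
```

Passing overwrite=False is the conservative choice in scripted pipelines, since an accidental rerun then fails loudly instead of replacing earlier results.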

Returns:

Return value depends on copy and return_full_results parameters:

If copy=True and return_full_results=False: Returns modified AnnData object
If copy=True and return_full_results=True: Returns tuple (results_dict, adata)
If copy=False and return_full_results=False: Returns None (modifies in place)
If copy=False and return_full_results=True: Returns results_dict
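
The four combinations above can be summarized as a small lookup. The function below is a hypothetical stand-in that returns a description of what the call hands back; it does not call kompot.

```python
def expected_return(copy=False, return_full_results=False):
    """Describe the documented return value for a given flag combination."""
    if copy and return_full_results:
        return "(results_dict, adata)"
    if copy:
        return "adata"
    if return_full_results:
        return "results_dict"
    return None  # results are written into the input AnnData in place


for copy in (True, False):
    for full in (True, False):
        print(f"copy={copy}, return_full_results={full}: {expected_return(copy, full)}")
```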

The results_dict contains the following keys for programmatic access:

"table": pandas DataFrame with cell-level results, indexed by cell names. Columns: lfc (log fold change), lfc_zscore (z-scores), neg_log10_ptp (negative log10 posterior tail probabilities), direction (labels: ‘up’, ‘neutral’, ‘down’)
"model": The fitted DifferentialAbundance object for additional analyses
"landmarks": Computed landmarks array if applicable (n_landmarks, n_features)
"field_names": Dictionary mapping result types to their AnnData field names

The model object provides access to the complete density estimation model, enabling additional downstream analyses such as computing gradients, accessing density estimators, or performing custom predictions on new data.

Return type:

Union[Dict[str, np.ndarray], AnnData, Tuple[Dict[str, np.ndarray], AnnData]]

Notes

Results are stored in various components of the AnnData object:

adata.obs[f"{result_key}_{condition1}_to_{condition2}_lfc"]: Log fold change values for each cell
adata.obs[f"{result_key}_{condition1}_to_{condition2}_lfc_zscore"]: Z-scores for each cell
adata.obs[f"{result_key}_{condition1}_to_{condition2}_neg_log10_lfc_ptp"]: Negative log10 PTPs (Posterior Tail Probabilities) for each cell
adata.obs[f"{result_key}_{condition1}_to_{condition2}_lfc_direction"]: Direction of change ('up', 'neutral', 'down')
adata.uns[f"{result_key}_lfc_direction_colors"]: Color mapping for direction categories
adata.uns[result_key]: Dictionary with additional information and parameters
If landmarks are computed, they are stored in adata.uns[result_key]['landmarks'] for potential reuse in other analyses.
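
To look up these columns programmatically, the naming pattern listed above can be reproduced with a small helper. The function name da_obs_fields is hypothetical; in practice the "field_names" entry of the results dictionary is the more reliable source, since it reflects what a given run actually wrote.

```python
def da_obs_fields(result_key, condition1, condition2):
    """Build the adata.obs column names following the documented pattern."""
    prefix = f"{result_key}_{condition1}_to_{condition2}"
    return {
        "lfc": f"{prefix}_lfc",
        "lfc_zscore": f"{prefix}_lfc_zscore",
        "neg_log10_ptp": f"{prefix}_neg_log10_lfc_ptp",
        "direction": f"{prefix}_lfc_direction",
    }


fields = da_obs_fields("kompot_da", "Young", "Old")
print(fields["lfc"])  # kompot_da_Young_to_Old_lfc
```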

The color scheme used for directions is:
- "up": "#d73027" (red)
- "down": "#4575b4" (blue)
- "neutral": "#d3d3d3" (light gray)
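
The sketch below shows one plausible way the two thresholds and the stored negative log10 PTP values could combine into a direction label, and how the labels map to the colors above. The decision rule is an assumption made for illustration (a cell is labeled only when it passes both thresholds); the documentation does not spell out the exact rule, and classify_direction is a hypothetical name.

```python
import math

# Color mapping as documented for the direction categories
DIRECTION_COLORS = {"up": "#d73027", "down": "#4575b4", "neutral": "#d3d3d3"}


def classify_direction(lfc, neg_log10_ptp, lfc_threshold=1.0, ptp_threshold=0.05):
    """Assumed rule: significant PTP and |lfc| above the threshold."""
    # PTP values are stored as -log10(ptp), so compare on that scale
    significant = neg_log10_ptp > -math.log10(ptp_threshold)
    if significant and lfc > lfc_threshold:
        return "up"
    if significant and lfc < -lfc_threshold:
        return "down"
    return "neutral"


label = classify_direction(lfc=2.0, neg_log10_ptp=3.0)
print(label, DIRECTION_COLORS[label])  # up #d73027
```

Because the PTP is stored on a negative log10 scale, a threshold of 0.05 corresponds to a cutoff of about 1.3 on the stored values.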

Differential Expression

Differential expression analysis for AnnData objects.

kompot.anndata.differential_expression.compute_differential_expression(adata, groupby: str, condition1: str, condition2: str, obsm_key: str = 'DM_EigenVectors', layer: str | None = None, genes: List[str] | None = None, n_landmarks: int | None = 5000, landmarks: ndarray | None = None, sample_col: str | None = None, sigma: float = 1.0, ls: float | None = None, ls_factor: float = 10.0, compute_mahalanobis: bool = True, jit_compile: bool = False, eps: float = 1e-08, random_state: int | None = None, batch_size: int = 100, store_arrays_on_disk: bool | None = None, disk_storage_dir: str | None = None, max_memory_ratio: float = 0.8, cell_filter: str | List[str] | Dict[str, Any] | List[Dict[str, Any]] | None = None, groups: str | Dict[str, Any] | List[Dict[str, Any]] | Series | ndarray | List[ndarray] | None = None, min_cells: int = 2, min_percentage: float | None = None, check_representation: bool | None = None, copy: bool = False, inplace: bool = True, result_key: str = 'kompot_de', overwrite: bool | None = None, store_landmarks: bool = False, return_full_results: bool = False, store_posterior_covariance: bool = False, allow_single_condition_variance: bool = False, progress: bool = True, null_genes: int | List[int] | None = 2000, null_seed: int | None = 42, fdr_threshold: float = 0.05, store_additional_stats: bool = False, **function_kwargs) → Dict[str, ndarray] | Any

Compute differential expression between two conditions directly from an AnnData object.

This function is a scverse-compatible wrapper around the DifferentialExpression class that operates directly on AnnData objects.

Parameters:

adata (AnnData) – AnnData object containing cells from both conditions.
groupby (str) – Column in adata.obs containing the condition labels.
condition1 (str) – Label in the groupby column identifying the first condition.
condition2 (str) – Label in the groupby column identifying the second condition.
obsm_key (str, optional) – Key in adata.obsm containing the cell states (e.g., PCA, diffusion maps), by default “DM_EigenVectors”.
layer (str, optional) – Layer in adata.layers containing gene expression data. If None, use adata.X, by default None.
genes (List[str], optional) – List of gene names to include in the analysis. If None, use all genes, by default None.
n_landmarks (int, optional) – Number of landmarks to use for approximation. If None, use all points, by default 5000. Ignored if landmarks is provided.
landmarks (np.ndarray, optional) – Pre-computed landmarks to use. If provided, n_landmarks will be ignored. Shape (n_landmarks, n_features).
sample_col (str, optional) – Column name in adata.obs containing sample labels. If provided, these will be used to compute sample-specific variance and will automatically enable sample variance estimation.
allow_single_condition_variance (bool, optional) – If True, allows variance estimation with only one condition having multiple samples. By default False, which requires both conditions to have multiple samples.
sigma (float, optional) – Noise level for function estimator, by default 1.0.
ls (float, optional) – Length scale for the GP kernel. If None, it will be estimated, by default None.
ls_factor (float, optional) – Multiplication factor to apply to length scale when it’s automatically inferred, by default 10.0. Only used when ls is None.
compute_mahalanobis (bool, optional) – Whether to compute Mahalanobis distances for gene ranking, by default True.
jit_compile (bool, optional) – Whether to use JAX just-in-time compilation, by default False.
eps (float, optional) – Small constant for numerical stability in covariance matrices, by default 1e-8. Increase this value if Cholesky decomposition fails during Mahalanobis distance computation.
random_state (int, optional) – Random seed for reproducible landmark selection when n_landmarks is specified. Controls the random selection of points when using approximation, by default None.
batch_size (int, optional) – Number of cells to process at once during prediction and Mahalanobis distance computation to manage memory usage. If None or 0, all samples will be processed at once. Default is 100.
store_arrays_on_disk (bool, optional) – Whether to store large arrays on disk instead of in memory, by default None. If None, it will be determined based on disk_storage_dir (True if provided, False otherwise). This is useful for very large datasets with many genes, where covariance matrices would otherwise exceed available memory.
disk_storage_dir (str, optional) – Directory to store arrays on disk. If provided and store_arrays_on_disk is None, store_arrays_on_disk will be set to True. If store_arrays_on_disk is False and this is provided, a warning will be logged and disk storage will not be used.
max_memory_ratio (float, optional) – Maximum fraction of available memory that arrays should occupy before triggering warnings or enabling disk storage, by default 0.8 (80%).
cell_filter (str, List[str], Dict, List[Dict], optional) –
Specification for cells or groups to exclude from the analysis. Will be interpreted in the following ways:
- If str and groups is provided: Interpreted as a group name to exclude from the groups defined by the groups parameter.
- If List[str] and groups is provided: Multiple group names to exclude from the groups defined by the groups parameter.
- If Dict: Keys are column names in adata.obs, and values are specific values to exclude.
- If List[Dict]: Multiple dictionaries specifying different exclusion criteria.
Cells matching any of the specified exclusion criteria will be excluded from the analysis. The string and list of strings formats are only valid when the groups parameter is also provided, as they refer to excluding groups from the subset analysis. The dictionary formats work independently to exclude cells based on their metadata.
groups (str, Dict, Dict[str, Dict], List[Dict], pd.Series, np.ndarray, List[np.ndarray], optional) –
Specification for subsetting or grouping cells for additional analysis. Will be interpreted in the following ways:
- If str: Used as column name in adata.obs.
  - If column is boolean: True values form a subset.
  - If column is categorical or string: Each unique value forms a subset.
  - If column doesn’t allow grouping (e.g., floats): Raises an error.
- If Dict: Interpreted as a filter with keys being column names of adata.obs and values being allowed values in this column. Example: {‘category’: [‘cat1’, ‘cat2’], ‘is_selected’: True} creates a subset of cells where category is ‘cat1’ or ‘cat2’ AND is_selected is True.
- If Dict[str, Dict]: Dict of filters for different subgroups, where outer dict keys are used as subset names. Each inner dict defines a filter as above. Example: {‘control_group’: {‘treatment’: ‘control’}, ‘high_dose’: {‘treatment’: ‘drug’, ‘dose’: ‘high’}} creates two named subsets using the provided names as identifiers.
- If List[Dict]: Each dictionary specifies a different subset using the same filtering mechanism as above, but subset names are auto-generated.
- If pd.Series or np.ndarray: Interpreted like a column specified with a string.
- If array of appropriate shape with boolean values: Each row specifies a subset.
- If List of vectors/series: Each element is processed as above.
When subsetting is defined, the global comparison is still run first, followed by analyses on each subset. Only the ‘mean_log_fold_change’ and ‘mahalanobis_distances’ metrics are saved for each subset with appropriate name suffixes.
min_cells (int, optional) – Minimum number of cells required for a condition to be considered adequately represented within each group, by default 2.
min_percentage (float, optional) – Minimum percentage of cells required for a condition within each group, relative to total cells in the group. If None, uses 10% divided by the number of conditions, by default None.
check_representation (None or bool, optional) – Controls checking for underrepresentation when groups are specified, by default None.
- If None: Checks and warns about underrepresentation but does not filter automatically
- If True: Checks for underrepresentation and automatically applies the filter
- If False: Skips the underrepresentation check entirely
copy (bool, optional) – If True, return a copy of the AnnData object with results added, by default False.
inplace (bool, optional) – If True, modify adata in place, by default True.
result_key (str, optional) – Key in adata.uns where results will be stored, by default “kompot_de”.
overwrite (bool, optional) –
Controls behavior when results with the same result_key already exist:
- If None (default): Behaves contextually:
  - For partial reruns with sample variance added where other parameters match, logs an informative message at INFO level and proceeds with overwriting
  - For other cases, warns about existing results but proceeds with overwriting
- If True: Silently overwrite existing results
- If False: Raise an error if results would be overwritten
Note: When running with sample_col for a subset of cells that were previously analyzed without sample variance, only fields affected by sample variance will be modified. Fields unaffected by sample variance are rewritten, but their values do not change if the parameters match.
store_landmarks (bool, optional) – Whether to store landmarks in adata.uns for future reuse, by default False. Setting to True will allow reusing landmarks with future analyses but may significantly increase the AnnData file size.
return_full_results (bool, optional) – If True, return the full results dictionary including the differential model, by default False. If False, and if copy=True, only return the AnnData object.
store_posterior_covariance (bool, optional) – Whether to store the posterior covariance matrix in adata.obsp. Only available when not using sample variance (sample_col=None). The covariance matrix can be quite large, as it is of shape (n_cells, n_cells), so this should be used carefully with large datasets. Default is False.
progress (bool, optional) – Whether to show progress bars during computation. When True, displays tqdm.auto progress bars for all batch processing operations including prediction, uncertainty computation, and Mahalanobis distance calculations. When False, all progress bars are disabled. Default is True.
null_genes (int, List[int], or None, optional) –
Specification for generating null distribution to compute FDR-corrected p-values:
- If int: Number of genes to randomly sample for null distribution
- If List[int]: Specific gene indices to use for null distribution
- If None or 0: Disable FDR calculation (no p-values computed)
Default is 2000 (uses 2000 randomly sampled null genes for FDR estimation).

Null genes have their expression values shuffled between conditions to break the association with cell state, creating a background distribution for statistical testing.
null_seed (int, optional) – Random seed for reproducible null gene selection and expression shuffling. Ensures consistent results across runs when using random null gene sampling. If None, results will vary between runs. Default is 42.
fdr_threshold (float, optional) – FDR threshold for identifying significantly differentially expressed genes. Genes with FDR < fdr_threshold will be marked as significantly DE in a boolean column. Also used for reporting the Mahalanobis distance threshold in logs. Default is 0.05.
store_additional_stats (bool, optional) –
Whether to store additional statistical measures as .var columns beyond the default local FDR and is_de boolean. When True, stores:
- Raw p-values from empirical null distribution
- Tail-based FDR (Benjamini-Hochberg correction)
- PTP (posterior tail probability from chi-squared)
- Fold change z-scores
When False (default), only stores local FDR and is_de boolean, which are the primary significance measures. All fields follow the same naming and field tracking logic as other results. Default is False.
**function_kwargs (dict) – Additional arguments to pass to the FunctionEstimator.
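
The dictionary form of the groups parameter combines its keys with AND and treats a list value as a set of allowed values, as in the example given for that parameter. The sketch below reproduces that matching logic on plain dictionaries standing in for rows of adata.obs; matches_filter is a hypothetical helper, not part of kompot.

```python
def matches_filter(cell, spec):
    """True if the cell's metadata satisfies every key of the filter dict."""
    for column, allowed in spec.items():
        # A scalar is treated as a single allowed value
        if isinstance(allowed, (list, tuple, set)):
            allowed_values = allowed
        else:
            allowed_values = [allowed]
        if cell[column] not in allowed_values:
            return False
    return True


# Stand-in for adata.obs rows
cells = [
    {"category": "cat1", "is_selected": True},
    {"category": "cat2", "is_selected": False},
    {"category": "cat3", "is_selected": True},
]

# Single filter: category is 'cat1' or 'cat2' AND is_selected is True
spec = {"category": ["cat1", "cat2"], "is_selected": True}
subset = [c for c in cells if matches_filter(c, spec)]

# Named subsets (the Dict[str, Dict] form): one filter per subset name
named_specs = {"picked": {"is_selected": True}, "cat2_only": {"category": "cat2"}}
named_subsets = {
    name: [c for c in cells if matches_filter(c, s)]
    for name, s in named_specs.items()
}
```

Here subset contains only the first cell, because the second fails the is_selected condition and the third fails the category condition.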

Returns:

Return value depends on copy and return_full_results parameters:

If copy=True and return_full_results=False: Returns modified AnnData object
If copy=True and return_full_results=True: Returns tuple (results_dict, adata)
If copy=False and return_full_results=False: Returns None (modifies in place)
If copy=False and return_full_results=True: Returns results_dict

The results_dict contains the following keys for programmatic access:

"table": pandas DataFrame with gene-level results, indexed by gene names. Columns: mean_lfc (mean log fold change). If FDR is computed (null_genes > 0), also includes: pvalue, local_fdr, tail_fdr, is_de (significant at threshold). If compute_mahalanobis=True: mahalanobis. If compute_ptp=True: ptp.
"model": The fitted DifferentialExpression object for additional analyses
"landmarks": Computed landmarks array if applicable (n_landmarks, n_features)
"field_names": Dictionary mapping result types to their AnnData field names

If groups is specified, additional group-specific keys are included:

"group_mean_log_fold_change": Mean LFC per group (n_genes, n_groups)
"group_mahalanobis_distances": Mahalanobis distances per group (n_genes, n_groups)
"group_names": List of group names corresponding to columns

The model object provides access to the complete Gaussian Process model, enabling additional downstream analyses such as computing gradients, accessing kernel parameters, or performing custom predictions.

Return type:

Union[Dict[str, np.ndarray], AnnData, Tuple[Dict[str, np.ndarray], AnnData]]

Notes

Results are stored in various components of the AnnData object:

Always stored:
- adata.var[f"{result_key}_mahalanobis"]: Mahalanobis distance for each gene (if compute_mahalanobis is True)
- adata.var[f"{result_key}_mean_lfc"]: Mean log fold change for each gene
- adata.var[f"{result_key}_mahalanobis_local_fdr"]: Local FDR values using empirical null estimation similar to R’s fdrtool (if null_genes is not None)
- adata.var[f"{result_key}_is_de"]: Boolean indicator of differential expression at specified local FDR threshold (if null_genes is not None)
- adata.layers[f"{result_key}_condition1_imputed"]: Imputed expression for condition 1
- adata.layers[f"{result_key}_condition2_imputed"]: Imputed expression for condition 2
- adata.layers[f"{result_key}_fold_change"]: Log fold change for each cell and gene

Stored only when store_additional_stats=True:
- adata.var[f"{result_key}_ptp"]: Posterior tail probability from chi-squared distribution (if compute_mahalanobis is True)
- adata.var[f"{result_key}_mahalanobis_pvalue"]: P-values from empirical null distribution (if null_genes is not None)
- adata.var[f"{result_key}_mahalanobis_tail_fdr"]: Tail-based FDR values using Benjamini-Hochberg correction (if null_genes is not None)
- adata.layers[f"{result_key}_fold_change_zscores"]: Z-scores of log fold changes accounting for uncertainty (and sample variance if sample_col is provided)

Optional:
- adata.obsp["posterior_covariance"]: The posterior covariance matrix, stored if store_posterior_covariance=True and conditions are met. Shape (n_cells, n_cells).

adata.uns[result_key]: Dictionary with additional information and parameters

Posterior standard deviations of imputed expression values are stored in:
- If sample_col is not None (with sample variance):
  - adata.layers[f"{result_key}_{condition1}_std"]: Cell-wise standard deviation for condition 1 (sparse matrix)
  - adata.layers[f"{result_key}_{condition2}_std"]: Cell-wise standard deviation for condition 2 (sparse matrix)
- If sample_col is None (without sample variance):
  - adata.obs[f"{result_key}_{condition1}_std"]: Cell-wise standard deviation for condition 1 (same for all genes)
  - adata.obs[f"{result_key}_{condition2}_std"]: Cell-wise standard deviation for condition 2 (same for all genes)

If landmarks are computed, they are stored in adata.uns[result_key]['landmarks'] for potential reuse in other analyses.
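
Because the posterior standard deviations land in a different AnnData slot depending on whether sample variance is used, downstream code may need to branch on sample_col. A minimal sketch of that lookup, using a hypothetical helper name and the key pattern from the notes above:

```python
def std_field_locations(result_key, condition1, condition2, sample_col=None):
    """Return (slot, key) pairs for the posterior standard deviations."""
    # With sample variance the values are per cell and gene (layers);
    # without it they are per cell only (obs).
    slot = "layers" if sample_col is not None else "obs"
    return [
        (slot, f"{result_key}_{condition1}_std"),
        (slot, f"{result_key}_{condition2}_std"),
    ]


print(std_field_locations("kompot_de", "Young", "Old"))
print(std_field_locations("kompot_de", "Young", "Old", sample_col="donor_id"))
```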

Resource Estimation

Before running resource-intensive differential expression analyses, you can use the dry run utility to estimate memory and disk requirements, check for field overwrites, and verify parameters.

Key features:

Memory and disk estimation: Calculates expected resource usage for all intermediate arrays and final results
Null genes accounting: Correctly estimates resource inflation from null distribution genes (default 2000 additional genes)
Field overwrite detection: Shows which fields will be overwritten, including their run_id and previous run details
Sample variance impact: Estimates additional memory for sample-specific covariance tensors
Disk storage planning: Estimates disk space needed when using store_arrays_on_disk=True

import kompot as kp

# Run a dry run before actual computation
plan = kp.dry_run_differential_expression(
    adata,
    condition1='Young',
    condition2='Old',
    groupby='age',
    use_sample_variance=True,
    sample_column='donor_id',
    verbose=True
)

# Examine the report
print(plan.format_report(verbose=True))

The dry run output shows:

System Resources: Available memory and disk space
Total Requirements: Memory and disk needed with percentage of available
Memory Allocations: Detailed breakdown of each array (precision matrices, imputed expression, covariances)
Output Fields: All fields that will be created, with [OVERWRITES run_id=X] markers for existing fields
Warnings: Field overwrite information showing previous run timestamp, conditions, and parameters
Status: Whether the analysis is feasible given available resources
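
For intuition about the Memory Allocations entries, the back-of-the-envelope sketch below estimates the size of one dense cells-by-genes array and shows how the null genes inflate it. It assumes float64 values (8 bytes each) and that null genes add columns to such arrays; both are assumptions for illustration, the helper names are hypothetical, and the dry run report remains the authoritative estimate.

```python
def dense_array_bytes(n_cells, n_genes, null_genes=2000, bytes_per_value=8):
    """Bytes needed for a dense (n_cells, n_genes + null_genes) array."""
    return n_cells * (n_genes + null_genes) * bytes_per_value


def to_gib(n_bytes):
    """Convert bytes to GiB for display."""
    return n_bytes / 1024**3


n_cells, n_genes = 100_000, 20_000
without_null = dense_array_bytes(n_cells, n_genes, null_genes=0)
with_null = dense_array_bytes(n_cells, n_genes, null_genes=2000)
print(f"without null genes: {to_gib(without_null):.2f} GiB")
print(f"with 2000 null genes: {to_gib(with_null):.2f} GiB")
```

With these numbers the null genes add 10% to the array size, since 2000 extra columns are appended to 20,000 real genes.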

kompot.dry_run_differential_expression(adata, groupby: str, condition1: str, condition2: str, verbose: bool = True, **kwargs) → ResourcePlan

Estimate resource requirements for differential expression analysis.

This is a planning tool that lets you explore different parameter configurations and understand resource requirements BEFORE attempting an actual run. Use this to:

Compare memory usage with/without sample_variance
Decide whether to use disk_storage_dir
Choose appropriate landmark subsampling
Understand which fields will be created/overwritten with run_id tracking
See detailed previous run information for fields that will be overwritten
Check if you have sufficient resources

The actual kompot.compute_differential_expression() also checks resources, but this dry-run lets you experiment with parameters without waiting for the full computation.

Field Overwrite Detection:

The dry run shows which fields will be overwritten with their run_id in the Output Fields section (e.g., [OVERWRITES run_id=0]). The warnings section provides detailed information about the previous run including timestamp, conditions, and parameters like use_sample_variance and null_genes.

Parameters:

adata (AnnData) – Annotated data object
groupby (str) – Column in adata.obs that defines groups
condition1 (str) – First condition to compare
condition2 (str) – Second condition to compare
verbose (bool) – If True, print the resource plan report (default: True)
**kwargs – All other parameters that would be passed to compute_differential_expression (use_sample_variance, sample_column, disk_storage_dir, landmarks, etc.)

Returns:

Complete resource plan with requirements, availability, and feasibility check

Return type:

ResourcePlan

Examples

Compare different configurations:

>>> from kompot.resource_estimation import dry_run_differential_expression
>>>
>>> # Option 1: In-memory sample variance (highest memory)
>>> plan1 = dry_run_differential_expression(
...     adata, groupby='condition', condition1='treated', condition2='control',
...     use_sample_variance=True,
...     sample_column='donor_id'
... )
>>>
>>> # Option 2: Disk-backed sample variance (lower memory, needs disk space)
>>> plan2 = dry_run_differential_expression(
...     adata, groupby='condition', condition1='treated', condition2='control',
...     use_sample_variance=True,
...     sample_column='donor_id',
...     disk_storage_dir='/scratch/de_analysis'
... )
>>>
>>> # Option 3: No sample variance (lowest resources)
>>> plan3 = dry_run_differential_expression(
...     adata, groupby='condition', condition1='treated', condition2='control',
...     use_sample_variance=False
... )
>>>
>>> # Compare memory usage
>>> print(f"Option 1 memory: {plan1.requirements[0].size_human}")
>>> print(f"Option 2 memory: {plan2.requirements[0].size_human}")
>>> print(f"Option 3 memory: {plan3.requirements[0].size_human}")

Check field overwrites:

>>> plan = dry_run_differential_expression(
...     adata, groupby='age', condition1='Young', condition2='Old',
...     result_key='kompot_de'
... )
>>> # The output shows which fields will be overwritten with their run_id:
>>> # Output Fields:
>>> #   adata.layers:
>>> #     - kompot_de_Young_imputed [OVERWRITES run_id=0]
>>> #     - kompot_de_Old_imputed [OVERWRITES run_id=0]
>>> #   adata.var:
>>> #     - kompot_de_Young_to_Old_mean_lfc [OVERWRITES run_id=0]
>>> #
>>> # Warnings:
>>> #   ⚠ Results with result_key='kompot_de' already exist (run_id=0).
>>> #     Previous run: 2025-10-02T12:30:00 comparing Young to Old
>>> #     (null_genes=2000). Fields that will be overwritten: ...

Use with landmarks:

>>> # Test with different landmark counts to find sweet spot
>>> import numpy as np
>>> for n_landmarks in [500, 1000, 2000]:
...     landmarks = adata.obsm['X_pca'][::adata.n_obs//n_landmarks][:n_landmarks]
...     plan = dry_run_differential_expression(
...         adata, groupby='condition', condition1='A', condition2='B',
...         landmarks=landmarks,
...         use_sample_variance=True,
...         sample_column='donor',
...         verbose=False
...     )
...     print(f"{n_landmarks} landmarks: {plan.total_memory_required / 1024**3:.2f} GB")
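
The landmark selection in the loop above relies on strided slicing: take every k-th row, then truncate to the requested count. The sketch below isolates that step on a plain list so its behavior is easy to check; strided_subset is a hypothetical name, and evenly strided rows are only a quick stand-in for a proper landmark selection.

```python
def strided_subset(rows, n_landmarks):
    """Pick n_landmarks rows at a regular stride, as in the example above."""
    # Integer step; this is 0 (and slicing raises ValueError) when
    # n_landmarks exceeds the number of rows.
    step = len(rows) // n_landmarks
    return rows[::step][:n_landmarks]


rows = list(range(10))
print(strided_subset(rows, 3))  # [0, 3, 6]
```

The final truncation matters whenever the row count is not an exact multiple of n_landmarks, because the strided slice alone can return more rows than requested.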

Notes

This function only estimates resources. The actual kompot.compute_differential_expression() will perform its own checks before running. Use this dry-run to explore options and make informed decisions about parameters.

Utilities

class kompot.anndata.utils.RunInfo(adata, run_id: int | None = None, analysis_type: str | None = None)

Bases: object

Class for accessing run information for differential analysis.

Provides access to run history, parameters, and result fields.

adata

AnnData object containing the run history

Type: AnnData

run_id

Requested run ID (may be negative for relative indexing)

Type: int

adjusted_run_id

Actual run ID after adjusting for negative indexing

Type: int

analysis_type

Type of analysis: ‘de’ for differential expression or ‘da’ for differential abundance

Type: str

storage_key

Key for accessing the analysis data in adata.uns

Type: str

run_info

Dictionary with all information about the run

Type: dict

field_names

Dictionary with field names used in this run

Type: dict

params

The parameters used for this analysis

Type: dict

environment

Information about the environment where the analysis was run

Type: dict

overwritten_fields

List of fields that were overwritten by newer runs

Type: list

missing_fields

List of fields that are missing/deleted from the AnnData object

Type: list

__init__(adata, run_id: int | None = None, analysis_type: str | None = None)

Initialize a RunInfo object.

Parameters:

adata (AnnData) – AnnData object containing run history
run_id (int, optional) – Run ID to retrieve. Negative indices count from the end. If None, uses the most recent run (-1).
analysis_type (str, optional) – Type of analysis: ‘de’ for differential expression or ‘da’ for differential abundance. If None, attempts to detect.
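
The relationship between run_id and adjusted_run_id follows the usual Python convention for negative indices. The sketch below shows that conversion under the assumption that runs are stored in a list-like history of known length; adjust_run_id is a hypothetical helper and the IndexError is an assumed failure mode.

```python
def adjust_run_id(run_id, n_runs):
    """Convert a possibly negative run ID into an absolute index."""
    if run_id is None:
        run_id = -1  # documented default: the most recent run
    adjusted = run_id + n_runs if run_id < 0 else run_id
    if not 0 <= adjusted < n_runs:
        raise IndexError(f"run_id {run_id} is out of range for {n_runs} runs")
    return adjusted


print(adjust_run_id(None, 3))  # 2
print(adjust_run_id(-3, 3))    # 0
```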

compare_with(other_run_id: int) → RunComparison

Compare this run with another run.

Parameters: other_run_id (int) – Run ID to compare with
Returns: Object containing comparison results with nice display methods
Return type: RunComparison

get_data() → Dict[str, Any]

Get all data related to this run.

Returns: Dictionary with all run data
Return type: Dict[str, Any]

get_summary() → Dict[str, Any]

Get a summary of this run with key information.

Returns: Dictionary with run summary
Return type: Dict[str, Any]

Cleanup Utilities

Remove large data (layers, obsp, varm) from differential analysis results.

This function helps reduce AnnData object size by removing large arrays like imputed expression layers, fold change layers, and posterior covariance matrices while retaining the statistical results in var/obs columns.

Parameters:

adata (AnnData) – AnnData object with differential analysis results
run_ids (int, list of int, or None, optional) – Run ID(s) to clean up. Negative indices count from the end.
- If None (default): Cleans up ALL runs
- If int: Cleans up a single run
- If list: Cleans up the specified runs
analysis_type (str, default 'de') – Type of analysis: ‘de’ for differential expression or ‘da’ for differential abundance
keep_layers (bool or list of str, optional) –
- If None (default): Remove all layers from specified run(s)
- If False: Remove all layers from specified run(s)
- If True: Keep all layers from specified run(s)
- If list: Keep only the specified layer types
keep_var_fields (bool or list of str, optional) –
- If True (default): Keep all var fields from specified run(s)
- If False: Remove all var fields from specified run(s)
- If list: Keep only the specified var field types
keep_obs_fields (bool or list of str, optional) –
- If True (default): Keep all obs fields from specified run(s)
- If False: Remove all obs fields from specified run(s)
- If list: Keep only the specified obs field types
keep_obsp_fields (bool or list of str, optional) –
- If None (default): Remove all obsp fields from specified run(s)
- If False: Remove all obsp fields from specified run(s)
- If True: Keep all obsp fields from specified run(s)
- If list: Keep only the specified obsp field types
keep_varm_fields (bool or list of str, optional) –
- If None (default): Remove all varm fields from specified run(s)
- If False: Remove all varm fields from specified run(s)
- If True: Keep all varm fields from specified run(s)
- If list: Keep only the specified varm field types
inplace (bool, default True) – If True, modify adata in place. If False, return a copy.

Returns:

AnnData or None – If inplace=False, returns a modified copy. If inplace=True, returns None.

Field Types

Layer field types:
- 'imputed': Imputed expression for each condition
- 'fold_change': Log fold change for each cell and gene
- 'fold_change_zscores': Z-scores of log fold changes (requires store_additional_stats=True)
- 'std_with_sample_var': Posterior standard deviations with sample variance

Var field types:
- 'mean_log_fold_change': Mean log fold change values
- 'mahalanobis': Mahalanobis distances
- 'ptp': Posterior tail probability (requires store_additional_stats=True)
- 'mahalanobis_pvalue': P-values from empirical null (requires store_additional_stats=True)
- 'mahalanobis_local_fdr': Local FDR values (primary significance measure)
- 'mahalanobis_tail_fdr': Tail-based FDR values (requires store_additional_stats=True)
- 'is_de': Boolean indicator of differential expression
- 'weighted_mean_log_fold_change': Weighted mean log fold change (with differential abundance)

Obs field types:
- 'std': Posterior standard deviations (without sample variance, same for all genes)

Obsp field types:
- 'covariance': Posterior covariance matrices for fold changes

Varm field types:
- 'mean_log_fold_change': Mean log fold change per group (when using groups parameter)
- 'mahalanobis': Mahalanobis distances per group
- 'weighted_mean_log_fold_change': Weighted mean log fold change per group
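The keep_* parameters all follow the same pattern: True keeps everything, False removes everything, and a list keeps only the named field types. The sketch below restates that rule as a plain function so the outcome of each setting is easy to check. The helper resolve_fields_to_remove and its default_keep argument are hypothetical illustrations and not part of kompot; default_keep stands in for the documented per-parameter default (remove for layers, obsp, and varm; keep for var and obs).

```python
def resolve_fields_to_remove(keep, available, default_keep):
    """Hypothetical helper mirroring the documented keep_* semantics.

    keep: None, True, False, or a list of field types to keep.
    available: all field types that exist in this storage location.
    default_keep: what None means for this parameter (True = keep all).
    """
    if keep is None:
        keep = default_keep
    if keep is True:
        return set()  # keep everything, remove nothing
    if keep is False:
        return set(available)  # remove everything
    # A list: keep only the named types, remove the rest.
    return set(available) - set(keep)


layer_types = ["imputed", "fold_change", "fold_change_zscores", "std_with_sample_var"]

# keep_layers=None (default) removes every layer type.
removed_by_default = resolve_fields_to_remove(None, layer_types, default_keep=False)

# keep_layers=['fold_change'] removes every layer type except fold_change.
removed_keep_fc = resolve_fields_to_remove(["fold_change"], layer_types, default_keep=False)
```

Under this reading, passing a list is a whitelist: anything not named in it is removed from that location.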

Examples

# Remove all layers from all runs (default behavior)
cleanup(adata)

# Remove layers from specific run
cleanup(adata, run_ids=0)

# Remove layers from multiple specific runs
cleanup(adata, run_ids=[0, 2, 5])

# Keep only fold change layer, remove everything else large
cleanup(adata, keep_layers=['fold_change'])

# Remove all layers and obsp covariance matrices
cleanup(adata, keep_layers=False, keep_obsp_fields=False)

# Keep only essential statistical fields from run 0
cleanup(
    adata,
    run_ids=0,
    keep_layers=False,
    keep_var_fields=['mahalanobis', 'mahalanobis_local_fdr', 'is_de', 'mean_log_fold_change'],
    keep_obs_fields=False,
)

Notes

- By default, cleans up ALL runs to maximize space savings.
- By default, keeps all var/obs fields but removes layers, obsp fields, and varm fields.
- Large data is typically stored in layers (imputed, fold_change) and obsp (covariance).
- This does NOT modify the run history; deleted fields are marked as missing.
- Use RunInfo to check which fields are present and which have been deleted.

kompot.get_field_status(adata: AnnData, run_id: int | None = None, analysis_type: str = 'de') → Dict[str, Dict[str, Dict[str, bool]]]

Get the status of all fields from a differential analysis run.

Shows which fields are present vs missing/deleted.

Parameters:

adata (AnnData) – AnnData object with differential analysis results
run_id (int, optional) – Run ID to check. If None, uses most recent run.
analysis_type (str, default 'de') – Type of analysis: ‘de’ or ‘da’

Returns:

Nested dictionary with structure: {location: {field_type: {field_name: is_present}}}

Return type:

dict

Examples

>>> status = get_field_status(adata)
>>> print(status['layers']['imputed'])
{'result_A_imputed': True, 'result_B_imputed': False}
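Because the return value is a three-level dictionary, finding everything that has been deleted takes a nested loop. The sketch below walks a hand-written dictionary shaped like the documented {location: {field_type: {field_name: is_present}}} structure. The helper list_missing_fields is hypothetical, and the 'var' entry with the name result_mahalanobis is made up for illustration; in practice the dictionary would come from get_field_status(adata).

```python
def list_missing_fields(status):
    """Hypothetical helper: collect (location, field_type, field_name) for absent fields."""
    missing = []
    for location, field_types in status.items():
        for field_type, fields in field_types.items():
            for field_name, is_present in fields.items():
                if not is_present:
                    missing.append((location, field_type, field_name))
    return missing


# Hand-written stand-in for the return value of get_field_status(adata).
status = {
    "layers": {"imputed": {"result_A_imputed": True, "result_B_imputed": False}},
    "var": {"mahalanobis": {"result_mahalanobis": True}},  # illustrative field name
}

missing = list_missing_fields(status)
```

With the stand-in above, missing contains a single entry for the deleted result_B_imputed layer.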

Representation Analysis¶

kompot.check_underrepresentation(adata: AnnData, groupby: str, groups: str | dict | list | ndarray, conditions: List[str] | None = None, min_cells: int = 30, min_percentage: float | None = None, warn: bool = True, print_summary: bool = False) → Dict[str, Any]

Check if any condition is underrepresented in any group.

Parameters:

adata (AnnData) – AnnData object containing cells/observations
groupby (str) – Column in adata.obs defining conditions to check
groups (str, dict, list, np.ndarray) – Groups to check for representation, either:
- str: Column name in adata.obs defining groups
- dict: Mapping from group names to boolean masks or indices
- list, np.ndarray: Boolean mask or indices for a single group
conditions (List[str], optional) – List of condition values to check, by default None (uses all values in groupby column)
min_cells (int, optional) – Minimum number of cells required for each condition in each group, by default 30
min_percentage (float, optional) – Minimum percentage of cells for each condition in each group, by default None
warn (bool, optional) – Whether to emit warnings for underrepresentation, by default True
print_summary (bool, optional) – Whether to print a summary of underrepresentation results, by default False

Returns:

Dictionary with underrepresentation data. It contains:
- __underrepresentation_data: Dict mapping groups to underrepresented conditions
- group_key: List of group names (if groups was a string column name)
- Other metadata depending on groups type

Return type:

Dict[str, Any]
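The min_cells criterion can be illustrated without AnnData. The sketch below counts cells per (group, condition) pair and flags any condition with fewer than min_cells cells in a group, including conditions that are absent from a group altogether. It is a simplified stand-in, not kompot's implementation: the function find_underrepresented is hypothetical, it takes plain label lists instead of adata.obs columns, it ignores min_percentage, and the strict "fewer than" comparison is an assumption.

```python
from collections import Counter


def find_underrepresented(group_labels, condition_labels, min_cells=30, conditions=None):
    """Hypothetical sketch: map each group to the conditions with too few cells."""
    if len(group_labels) != len(condition_labels):
        raise ValueError("group_labels and condition_labels must have equal length")
    # Count cells for every (group, condition) pair; missing pairs count as 0.
    counts = Counter(zip(group_labels, condition_labels))
    if conditions is None:
        conditions = sorted(set(condition_labels))
    flagged = {}
    for group in sorted(set(group_labels)):
        low = [c for c in conditions if counts[(group, c)] < min_cells]
        if low:
            flagged[group] = low
    return flagged


# Toy data: group "B" has only one "stim" cell.
groups = ["T"] * 5 + ["B"] * 4
conds = ["ctrl", "ctrl", "ctrl", "stim", "stim"] + ["ctrl", "ctrl", "ctrl", "stim"]

flagged = find_underrepresented(groups, conds, min_cells=2)
```

With min_cells=2, only the "stim" condition in group "B" is flagged, since every other pair has at least two cells.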