AnnData Integration
The AnnData Integration module provides high-level convenience functions that work directly with AnnData objects. These functions handle the data flow, parameter management, and result storage automatically, making it easy to perform differential analysis with minimal setup.
When to use AnnData Integration:
- You want a simple, one-function-call approach to differential analysis
- You’re working primarily with AnnData objects in your workflow
- You want automatic result storage and metadata tracking
- You prefer convenience over fine-grained control
Key advantages:
- Automatic parameter validation and data preparation
- Built-in result storage with run history tracking
- Seamless integration with plotting functions
- Handles complex data structures (layers, embeddings) automatically
Differential Abundance
Differential abundance analysis for AnnData objects.
- kompot.anndata.differential_abundance.da(adata, groupby: str, condition1: str, condition2: str, obsm_key: str = 'DM_EigenVectors', sample_col=None, gp: GPSettings | None = None, threshold: DAThresholdSettings | None = None, storage: StorageSettings | None = None, output: OutputSettings | None = None, model: ModelSettings | None = None, **density_kwargs) → Dict[str, ndarray] | Any
Run differential abundance analysis on an AnnData object.
The most common call is just:
kompot.da(adata, "condition", "Young", "Old")
Advanced options are available through the settings dataclasses (GPSettings, DAThresholdSettings, StorageSettings, OutputSettings). Any field left at its default is equivalent to omitting it entirely. Extra **density_kwargs are forwarded to mellon’s DensityEstimator.
- Parameters:
adata (AnnData) – AnnData object containing cells from both conditions.
groupby (str) – Column in adata.obs with condition labels.
condition1 (str) – Labels identifying the two conditions.
condition2 (str) – Labels identifying the two conditions.
obsm_key (str) – Key in adata.obsm for cell-state coordinates.
sample_col (str, optional) – Column with biological-replicate labels.
gp (GPSettings, optional) – GP model parameters (ls_factor, n_landmarks, landmarks, batch_size, jit_compile, random_state).
threshold (DAThresholdSettings, optional) – Significance thresholds for abundance changes.
storage (StorageSettings, optional) – Where and how results are stored.
output (OutputSettings, optional) – Return-value control.
model (ModelSettings, optional) – Pre-fitted models or predictors to inject. For DA, only density_predictor1/2 and variance_predictor1/2 are used. See ModelSettings.
**density_kwargs – Forwarded to DensityEstimator.
- Returns:
Return value depends on copy and return_full_results in OutputSettings.
- Return type:
Union[Dict[str, np.ndarray], AnnData, Tuple[Dict[str, np.ndarray], AnnData]]
Differential Expression
Differential expression analysis for AnnData objects.
- kompot.anndata.differential_expression.de(adata, groupby: str, condition1: str, condition2: str, obsm_key: str = 'DM_EigenVectors', layer=None, genes=None, sample_col=None, gp: GPSettings | None = None, fdr: FDRSettings | None = None, filter: FilterSettings | None = None, storage: StorageSettings | None = None, output: OutputSettings | None = None, model: ModelSettings | None = None, dry_run: bool = False, **function_kwargs) → Dict[str, ndarray] | Any
Run differential expression analysis on an AnnData object.
The most common call is just:
kompot.de(adata, "condition", "Young", "Old")
Advanced options are available through the settings dataclasses (GPSettings, FDRSettings, FilterSettings, StorageSettings, OutputSettings). Any field left at its default is equivalent to omitting it entirely. Extra **function_kwargs are forwarded to mellon’s FunctionEstimator.
- Parameters:
adata (AnnData) – AnnData object containing cells from both conditions.
groupby (str) – Column in adata.obs with condition labels.
condition1 (str) – Labels identifying the two conditions.
condition2 (str) – Labels identifying the two conditions.
obsm_key (str) – Key in adata.obsm for cell-state coordinates.
layer (str, optional) – Layer with expression data (None → adata.X).
genes (list of str, optional) – Subset of genes to analyse.
sample_col (str, optional) – Column with biological-replicate labels.
gp (GPSettings, optional) – GP model parameters (sigma, ls, n_landmarks, etc.).
fdr (FDRSettings, optional) – FDR / null-distribution parameters.
filter (FilterSettings, optional) – Cell filtering and group-subsetting.
storage (StorageSettings, optional) – Where and how results are stored.
output (OutputSettings, optional) – Return-value and progress-bar control.
model (ModelSettings, optional) – Pre-fitted models or predictors to inject. When provided, fitting is skipped for the corresponding components. See ModelSettings.
dry_run (bool, optional) – If True, estimate resource requirements and print a report instead of running the analysis. Returns a ResourcePlan.
**function_kwargs – Forwarded to FunctionEstimator.
- Returns:
Return value depends on copy and return_full_results in OutputSettings.
- Return type:
Union[Dict[str, np.ndarray], AnnData, Tuple[Dict[str, np.ndarray], AnnData]]
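Because the documented return type of da() and de() is a union of three shapes, calling code that toggles OutputSettings may want to normalise the result before using it. The helper below is a minimal sketch of that idea; the name split_result and the stand-in FakeAnnData class are illustrative assumptions and are not part of kompot. It does not say which settings produce which shape, only how to unpack each one.

```python
from typing import Any, Dict, Optional, Tuple


def split_result(result: Any) -> Tuple[Optional[Dict[str, Any]], Optional[Any]]:
    """Normalise a da()/de() style return value to (results_dict, adata).

    Hypothetical helper: handles the three documented shapes
    (dict, AnnData-like object, or a (dict, AnnData) tuple).
    """
    if isinstance(result, tuple):
        if len(result) != 2:
            raise ValueError("expected a (results, adata) pair")
        results, adata = result
        return results, adata
    if isinstance(result, dict):
        return result, None
    # Anything else is treated as the AnnData-like object (or None).
    return None, result


class FakeAnnData:
    """Stand-in for anndata.AnnData so the sketch runs without dependencies."""


fake = FakeAnnData()
print(split_result({"scores": [0.1, 0.9]}))
print(split_result(({"scores": [0.1, 0.9]}, fake))[1] is fake)
print(split_result(fake)[0] is None)
```

With such a helper, downstream code can always work with a `(results, adata)` pair and check each element for None.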
Smooth Expression
Expression smoothing for AnnData objects.
- kompot.anndata.smooth.smooth_expression(adata, groupby: str | None = None, condition: str | None = None, obsm_key: str = 'DM_EigenVectors', layer: str | None = None, genes: List[str] | None = None, sample_col: str | None = None, gp: GPSettings | None = None, storage: StorageSettings | None = None, output: OutputSettings | None = None, model: ModelSettings | None = None, **function_kwargs) → Dict[str, Any] | Any | None
Smooth gene expression for a single condition using GP regression.
Fits an ExpressionModel on the selected cells and evaluates it on all cells in adata. This means every cell gets a smoothed value and uncertainty estimate, even if it was not part of the training condition. Stores the smoothed values, posterior standard deviations, and (optionally) empirical and sample variance layers in adata.
The most common call is just:
kompot.smooth_expression(adata, "condition", "Young")
Advanced options are available through the settings dataclasses (GPSettings, StorageSettings, OutputSettings). Any field left at its default is equivalent to omitting it entirely. Extra **function_kwargs are forwarded to mellon’s FunctionEstimator.
- Parameters:
adata (AnnData) – AnnData object.
groupby (str, optional) – Column in adata.obs identifying conditions. Required when condition is specified.
condition (str, optional) – Which group in groupby to smooth. If None and groupby is None, all cells are used.
obsm_key (str) – Key in adata.obsm for cell-state coordinates.
layer (str, optional) – Layer to use as expression input. None means adata.X.
genes (list of str, optional) – Subset of genes to smooth. None means all genes.
sample_col (str, optional) – Column in adata.obs with biological-replicate labels.
gp (GPSettings, optional) – GP model parameters (sigma, ls, n_landmarks, etc.).
storage (StorageSettings, optional) – Output storage parameters (result_key, overwrite).
output (OutputSettings, optional) – Return behavior (copy, inplace, return_full_results, progress).
model (ModelSettings, optional) – Pre-fitted ExpressionModel to inject via model1. When provided, skips internal fitting.
**function_kwargs – Forwarded to mellon.FunctionEstimator.
- Returns:
None when results are stored in-place. If return_full_results is True, a dictionary with keys "model", "table", and "field_names".
- Return type:
None or dict
Resource Estimation
Before running resource-intensive differential expression analyses, you can use the dry run utility to estimate memory and disk requirements, check for field overwrites, and verify parameters.
Key features:
- Memory and disk estimation: Calculates expected resource usage for all intermediate arrays and final results
- Null genes accounting: Correctly estimates resource inflation from null distribution genes (default 2000 additional genes)
- Field overwrite detection: Shows which fields will be overwritten, including their run_id and previous run details
- Sample variance impact: Estimates additional memory for sample-specific covariance tensors
- Disk storage planning: Estimates disk space needed when using store_arrays_on_disk=True
import kompot as kp

# Run a dry run before actual computation
plan = kp.de(
    adata,
    groupby='age',
    condition1='Young',
    condition2='Old',
    sample_col='donor_id',
    dry_run=True,
)
The dry run output shows:
- System Resources: Available memory and disk space
- Total Requirements: Memory and disk needed with percentage of available
- Memory Allocations: Detailed breakdown of each array (precision matrices, smoothed expression, covariances)
- Output Fields: All fields that will be created, with [OVERWRITES run_id=X] markers for existing fields
- Warnings: Field overwrite information showing previous run timestamp, conditions, and parameters
- Status: Whether the analysis is feasible given available resources
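To get a feel for the order of magnitude a dry run reports, the size of a single dense cells-by-genes array can be estimated by hand. The sketch below is a back-of-the-envelope calculation only: the function name, the float64 (8 bytes per value) assumption, and the dataset sizes are illustrative and not taken from kompot; only the default of 2000 null genes comes from the description above. The real dry run accounts for many more arrays, such as precision matrices and covariances.

```python
def dense_array_gib(
    n_cells: int,
    n_genes: int,
    n_null_genes: int = 2000,
    bytes_per_value: int = 8,
) -> float:
    """Rough size in GiB of one dense (cells x genes) array.

    Null-distribution genes are added to the gene count, which is
    why they inflate the estimate.
    """
    total_values = n_cells * (n_genes + n_null_genes)
    return total_values * bytes_per_value / 1024**3


# Hypothetical dataset: 50,000 cells and 18,000 genes.
without_null = dense_array_gib(50_000, 18_000, n_null_genes=0)
with_null = dense_array_gib(50_000, 18_000)
print(f"one layer without null genes: {without_null:.2f} GiB")
print(f"one layer with null genes:    {with_null:.2f} GiB")
```

Each stored layer of this shape costs roughly the same again, so trimming the gene list is a direct way to shrink every such array.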
Utilities
- class kompot.anndata.utils.RunInfo(adata, run_id: int | None = None, analysis_type: str | None = None)
Bases: object
Class for accessing run information for differential analysis or smoothing.
Provides access to run history, parameters, and result fields.
- adata
AnnData object containing the run history
- Type:
AnnData
- run_id
Requested run ID (may be negative for relative indexing)
- Type:
int
- adjusted_run_id
Actual run ID after adjusting for negative indexing
- Type:
int
- analysis_type
Type of analysis: ‘de’, ‘da’, or ‘smooth’
- Type:
str
- storage_key
Key for accessing the analysis data in adata.uns
- Type:
str
- run_info
Dictionary with all information about the run
- Type:
dict
- field_names
Dictionary with field names used in this run
- Type:
dict
- params
The parameters used for this analysis
- Type:
dict
- environment
Information about the environment where the analysis was run
- Type:
dict
- overwritten_fields
List of fields that were overwritten by newer runs
- Type:
list
- missing_fields
List of fields that are missing/deleted from the AnnData object
- Type:
list
- __init__(adata, run_id: int | None = None, analysis_type: str | None = None)
Initialize a RunInfo object.
- Parameters:
adata (AnnData) – AnnData object containing run history
run_id (int, optional) – Run ID to retrieve. Negative indices count from the end. If None, uses the most recent run (-1).
analysis_type (str, optional) – Type of analysis: ‘de’, ‘da’, or ‘smooth’. If None, attempts to detect from adata.uns.
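The relationship between run_id and adjusted_run_id follows the usual Python convention for negative indices. The following is a minimal sketch of that mapping, assuming the run history behaves like a list of length n_runs; the helper name resolve_run_id and its error behaviour are illustrative and are not kompot's implementation.

```python
from typing import Optional


def resolve_run_id(run_id: Optional[int], n_runs: int) -> int:
    """Map a possibly negative or missing run ID to a non-negative index."""
    if n_runs <= 0:
        raise ValueError("run history is empty")
    if run_id is None:
        run_id = -1  # default: the most recent run
    adjusted = run_id + n_runs if run_id < 0 else run_id
    if not 0 <= adjusted < n_runs:
        raise IndexError(f"run_id {run_id} is out of range for {n_runs} runs")
    return adjusted


# With four recorded runs, None and -1 both refer to index 3.
for requested in (None, -1, -2, 1):
    print(requested, "->", resolve_run_id(requested, 4))
```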
- call_args() → Dict[str, Any]
Build kwargs that reproduce this run via da() / de().
The returned dict contains top-level arguments (groupby, condition1, …) and Settings objects (gp, fdr, …). All values are mutable, so you can edit them before passing to de() or da():
kwargs = run.call_args()
kwargs["fdr"].threshold = 0.01  # tighten FDR
kompot.de(adata, **kwargs)
- Returns:
Ready for kompot.de(adata, **result) or kompot.da(adata, **result).
- Return type:
dict
- compare_with(other_run_id: int) → RunComparison
Compare this run with another run.
- Parameters:
other_run_id (int) – Run ID to compare with
- Returns:
Object containing comparison results with nice display methods
- Return type:
RunComparison
- get_data() → Dict[str, Any]
Get all data related to this run.
- Returns:
Dictionary with all run data
- Return type:
Dict[str, Any]
- get_summary() → Dict[str, Any]
Get a summary of this run with key information.
- Returns:
Dictionary with run summary
- Return type:
Dict[str, Any]
- to_settings() → Dict[str, Any]
Reconstruct Settings dataclass objects from stored parameters.
- Returns:
{"gp": GPSettings(…), "fdr": FDRSettings(…), …}, containing only the Settings that were recorded for this run.
- Return type:
dict
Examples
>>> run = kompot.RunInfo(adata, run_id=0, analysis_type="de")
>>> settings = run.to_settings()
>>> settings["gp"].sigma
1.0
Cleanup Utilities
- kompot.cleanup(adata: AnnData, run_ids: int | List[int] | None = None, analysis_type: str = 'de', keep_layers: bool | List[str] | None = None, keep_var_fields: bool | List[str] | None = True, keep_obs_fields: bool | List[str] | None = True, keep_obsp_fields: bool | List[str] | None = None, keep_varm_fields: bool | List[str] | None = None, inplace: bool = True) → AnnData | None
Remove large data (layers, obsp, varm) from differential analysis results.
This function helps reduce AnnData object size by removing large arrays like smoothed expression layers, fold change layers, and posterior covariance matrices while retaining the statistical results in var/obs columns.
- Parameters:
adata (AnnData) – AnnData object with differential analysis results
run_ids (int, list of int, or None, optional) – Run ID(s) to clean up. Negative indices count from the end.
If None (default): Cleans up ALL runs
If int: Cleans up single run
If list: Cleans up specified runs
analysis_type (str, default 'de') – Type of analysis: ‘de’ for differential expression, ‘da’ for differential abundance, or ‘smooth’ for expression smoothing
keep_layers (bool or list of str, optional) –
If None (default): Remove all layers from specified run(s)
If False: Remove all layers from specified run(s)
If True: Keep all layers from specified run(s)
If list: Keep only the specified layer types
keep_var_fields (bool or list of str, optional) –
If True (default): Keep all var fields from specified run(s)
If False: Remove all var fields from specified run(s)
If list: Keep only the specified var field types
keep_obs_fields (bool or list of str, optional) –
If True (default): Keep all obs fields from specified run(s)
If False: Remove all obs fields from specified run(s)
If list: Keep only the specified obs field types
keep_obsp_fields (bool or list of str, optional) –
If None (default): Remove all obsp fields from specified run(s)
If False: Remove all obsp fields from specified run(s)
If True: Keep all obsp fields from specified run(s)
If list: Keep only the specified obsp field types
keep_varm_fields (bool or list of str, optional) –
If None (default): Remove all varm fields from specified run(s)
If False: Remove all varm fields from specified run(s)
If True: Keep all varm fields from specified run(s)
If list: Keep only the specified varm field types
inplace (bool, default True) – If True, modify adata in place. If False, return a copy.
- Returns:
If inplace=False, returns modified copy. If inplace=True, returns None.
- Return type:
AnnData or None
Notes
Layer field types:
'smoothed': Smoothed expression for each condition
'fold_change': Log fold change for each cell and gene
'fold_change_zscores': Z-scores of log fold changes
'std_with_sample_var': Posterior standard deviations with sample variance
Var field types:
'mean_log_fold_change': Mean log fold change values
'mahalanobis': Mahalanobis distances
'ptp': Posterior tail probability
'mahalanobis_pvalue': P-values from empirical null
'mahalanobis_local_fdr': Local FDR values
'mahalanobis_tail_fdr': Tail-based FDR values
'is_de': Boolean indicator of differential expression
'weighted_mean_log_fold_change': Weighted mean log fold change
Obs field types:
'std': Posterior standard deviations
Obsp field types:
'covariance': Posterior covariance matrices for fold changes
Varm field types:
'mean_log_fold_change': Mean log fold change per group
'mahalanobis': Mahalanobis distances per group
'weighted_mean_log_fold_change': Weighted mean log fold change per group
Examples
>>> cleanup(adata) # Remove all layers from all runs
>>> cleanup(adata, run_ids=0) # Remove layers from specific run
>>> cleanup(adata, run_ids=[0, 2, 5]) # Multiple runs
>>> cleanup(adata, keep_layers=['fold_change']) # Keep only fold change
>>> # Remove all layers and obsp covariance matrices
>>> cleanup(adata, keep_layers=False, keep_obsp_fields=False)
>>> # Keep only essential statistical fields from run 0
>>> cleanup(
...     adata,
...     run_ids=0,
...     keep_layers=False,
...     keep_var_fields=['mahalanobis', 'mahalanobis_local_fdr', 'is_de', 'mean_log_fold_change'],
...     keep_obs_fields=False,
... )
Notes
By default, cleans up ALL runs to maximize space savings
By default, keeps all statistical results (var/obs fields) but removes layers
Large data typically in: layers (smoothed, fold_change), obsp (covariance)
This does NOT modify the run history - deleted fields are marked as missing
Use RunInfo to check which fields are present vs deleted
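The None / False / True / list rules documented for keep_layers, keep_obsp_fields, and keep_varm_fields can be summarised in a few lines. The sketch below only models those documented rules on plain lists of field-type names; the function name types_to_remove is an illustrative assumption and this is not how cleanup() is implemented internally. The handling of None for keep_var_fields and keep_obs_fields is not described above, so it is not modelled here.

```python
from typing import List, Optional, Union

Keep = Optional[Union[bool, List[str]]]

# Layer field types as listed in the Notes for cleanup().
LAYER_TYPES = ["smoothed", "fold_change", "fold_change_zscores", "std_with_sample_var"]


def types_to_remove(keep: Keep, available: List[str]) -> List[str]:
    """Field types that would be removed under the documented keep rules."""
    if keep is None or keep is False:
        return list(available)  # remove everything
    if keep is True:
        return []  # keep everything
    wanted = set(keep)  # a list names the types to keep
    return [name for name in available if name not in wanted]


for keep in (None, True, ["fold_change"]):
    print(repr(keep), "->", types_to_remove(keep, LAYER_TYPES))
```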
- kompot.get_field_status(adata: AnnData, run_id: int | None = None, analysis_type: str = 'de') → Dict[str, Dict[str, Dict[str, bool]]]
Get the status of all fields from a differential analysis run.
Shows which fields are present vs missing/deleted.
- Parameters:
adata (AnnData) – AnnData object with differential analysis results
run_id (int, optional) – Run ID to check. If None, uses most recent run.
analysis_type (str, default 'de') – Type of analysis: ‘de’, ‘da’, or ‘smooth’
- Returns:
Nested dictionary with structure: {location: {field_type: {field_name: is_present}}}
- Return type:
dict
Examples
>>> status = get_field_status(adata)
>>> print(status['layers']['smoothed'])
{'result_A_smoothed': True, 'result_B_smoothed': False}
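Because the return value is a plain nested dictionary, listing the missing fields is a short traversal. The sketch below works on a hand-written status dictionary shaped like the documented return value rather than calling kompot; the helper name missing_fields and the field name 'result_mahalanobis' are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Status = Dict[str, Dict[str, Dict[str, bool]]]


def missing_fields(status: Status) -> List[Tuple[str, str, str]]:
    """Collect (location, field_type, field_name) for every absent field."""
    return [
        (location, field_type, field_name)
        for location, by_type in status.items()
        for field_type, fields in by_type.items()
        for field_name, is_present in fields.items()
        if not is_present
    ]


# Hand-written example in the {location: {field_type: {field_name: is_present}}} shape.
status = {
    "layers": {"smoothed": {"result_A_smoothed": True, "result_B_smoothed": False}},
    "var": {"mahalanobis": {"result_mahalanobis": True}},
}
print(missing_fields(status))
```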
Representation Analysis
- kompot.check_underrepresentation(adata: AnnData, groupby: str, groups: str | dict | list | ndarray, conditions: List[str] | None = None, min_cells: int = 30, min_percentage: float | None = None, warn: bool = True, print_summary: bool = False) → Dict[str, Any]
Check if any condition is underrepresented in any group.
- Parameters:
adata (AnnData) – AnnData object containing cells/observations
groupby (str) – Column in adata.obs defining conditions to check
groups (str, dict, list, np.ndarray) – Groups to check for representation, either:
str: Column name in adata.obs defining groups
dict: Mapping from group names to boolean masks or indices
list, np.ndarray: Boolean mask or indices for a single group
conditions (List[str], optional) – List of condition values to check, by default None (uses all values in groupby column)
min_cells (int, optional) – Minimum number of cells required for each condition in each group, by default 30
min_percentage (float, optional) – Minimum percentage of cells for each condition in each group, by default None
warn (bool, optional) – Whether to emit warnings for underrepresentation, by default True
print_summary (bool, optional) – Whether to print a summary of underrepresentation results, by default False
- Returns:
Dictionary with underrepresentation data, contains:
__underrepresentation_data: Dict mapping groups to underrepresented conditions
group_key: List of group names (if groups was a string column name)
Other metadata depending on groups type
- Return type:
Dict[str, Any]
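At its core, a check of this kind counts the cells of each condition within each group and compares the count with min_cells. The pure-Python sketch below illustrates that idea under simplifying assumptions: conditions and groups are passed as plain parallel lists rather than AnnData columns, min_percentage is ignored, and the output is a simple mapping. The function name and output shape are illustrative and differ from kompot's actual return dictionary described above.

```python
from collections import Counter
from typing import Dict, List, Sequence


def underrepresented(
    conditions: Sequence[str],
    groups: Sequence[str],
    min_cells: int = 30,
) -> Dict[str, List[str]]:
    """Map each group to the conditions with fewer than min_cells cells in it."""
    if len(conditions) != len(groups):
        raise ValueError("conditions and groups must have one entry per cell")
    counts = Counter(zip(groups, conditions))
    all_conditions = sorted(set(conditions))
    result: Dict[str, List[str]] = {}
    for group in sorted(set(groups)):
        # A missing (group, condition) pair counts as zero cells.
        low = [c for c in all_conditions if counts[(group, c)] < min_cells]
        if low:
            result[group] = low
    return result


# Toy data: 75 T cells (40 Young, 35 Old) and 55 B cells (50 Young, 5 Old).
conditions = ["Young"] * 40 + ["Old"] * 35 + ["Young"] * 50 + ["Old"] * 5
groups = ["T cell"] * 75 + ["B cell"] * 55
print(underrepresented(conditions, groups))
```

In the toy data only the Old condition among B cells falls below the default threshold of 30 cells, so it is the only entry reported.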