Utilities

Utility functions for the Kompot package.

kompot.utils.build_graph(X: ndarray, n_neighbors: int = 15, random_state: int | None = 42) → Tuple[List[Tuple[int, int]], NNDescent]

Build a graph from a dataset using approximate nearest neighbors.

Parameters:

X (np.ndarray) – Data matrix of shape (n_samples, n_features).
n_neighbors (int, optional) – Number of neighbors for graph construction, by default 15.
random_state (int or None, optional) – Seed passed to pynndescent.NNDescent for the approximate nearest-neighbor construction, by default 42. Passing None lets pynndescent draw its own entropy (non-reproducible).

Returns:

A tuple containing:

- edges: List of (source, target) tuples defining the graph.
- index: The nearest-neighbor index for future queries.

Return type:

Tuple[List[Tuple[int, int]], pynndescent.NNDescent]
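
To make the shape of the returned edge list concrete, the following is a minimal sketch that builds (source, target) pairs from an exact, brute-force nearest-neighbor search in NumPy. It is not the library's implementation: build_graph uses the approximate pynndescent.NNDescent index, and the helper name knn_edges, the exclusion of self-edges, and the edge ordering are assumptions made for illustration.

```python
import numpy as np


def knn_edges(X, n_neighbors=15):
    """Exact k-nearest-neighbor edge list (brute force, O(n^2) memory).

    Hypothetical helper for illustration; kompot itself uses an
    approximate pynndescent index instead of this exact search.
    """
    X = np.asarray(X, dtype=float)
    # Squared Euclidean distances between all pairs of rows.
    sq_norms = np.sum(X**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # assumption: no self-edges
    k = min(n_neighbors, X.shape[0] - 1)
    # Indices of the k closest points for every row.
    nbrs = np.argsort(d2, axis=1)[:, :k]
    return [(i, int(j)) for i in range(X.shape[0]) for j in nbrs[i]]


rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
edges_demo = knn_edges(X_demo, n_neighbors=5)
```

Each of the 50 points contributes 5 outgoing edges, so edges_demo holds 250 tuples. The exact search is only practical for small inputs, which is the usual reason to prefer an approximate index.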

kompot.utils.compute_mahalanobis_distance(diff_values: ndarray, covariance_matrix: ndarray, eps: float = 1e-08, jit_compile: bool = True) → float

Compute the Mahalanobis distance for a vector given a covariance matrix.

This is a convenience function for computing a single Mahalanobis distance. For multiple vectors, use compute_mahalanobis_distances for better performance.

Parameters:

diff_values (np.ndarray) – The difference vector for which to compute the Mahalanobis distance.
covariance_matrix (np.ndarray) – The covariance matrix.
eps (float, optional) – Small constant for numerical stability, by default 1e-8.
jit_compile (bool, optional) – Whether to use JAX just-in-time compilation, by default True.

Returns:

The Mahalanobis distance.

Return type:

float
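
For reference, the Mahalanobis distance of a difference vector d under covariance Σ is sqrt(dᵀ Σ⁻¹ d). The sketch below computes it with plain NumPy. It assumes that eps is added to the diagonal of the covariance matrix before solving; the source only states that eps is a stability constant, so the exact regularization used by kompot may differ. The name mahalanobis_single is hypothetical.

```python
import numpy as np


def mahalanobis_single(diff, cov, eps=1e-8):
    """Mahalanobis distance of one vector (illustrative, not kompot's code)."""
    diff = np.asarray(diff, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Assumption: eps is added to the diagonal to keep the matrix invertible.
    cov_reg = cov + eps * np.eye(cov.shape[0])
    # Solve cov_reg @ z = diff instead of forming the inverse explicitly.
    z = np.linalg.solve(cov_reg, diff)
    return float(np.sqrt(diff @ z))


# With an identity covariance the result is close to the Euclidean norm.
d_demo = mahalanobis_single(np.array([3.0, 4.0]), np.eye(2))
```

With an identity covariance, d_demo is approximately 5, differing from the Euclidean norm only through the small eps term.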

kompot.utils.compute_mahalanobis_distances(diff_values: ndarray, covariance: ndarray | Array | da.Array, batch_size: int = 500, jit_compile: bool = True, eps: float = 1e-08, progress: bool = True, diagonal_variance: ndarray | None = None) → ndarray

Compute Mahalanobis distances for multiple difference vectors efficiently.

This function computes the Mahalanobis distance for each difference vector using the provided covariance matrix or tensor. It handles both a single shared covariance matrix and gene-specific covariance tensors.

Parameters:

diff_values (np.ndarray) – The difference vectors for which to compute Mahalanobis distances. Shape should be (n_samples, n_features) or (n_features, n_samples).
covariance (np.ndarray, jnp.ndarray, or dask.array.Array) – Covariance matrix or tensor:

- If 2D with shape (n_points, n_points): shared covariance for all vectors.
- If 3D with shape (n_points, n_points, n_genes): gene-specific covariance matrices.
- Can be a dask array for lazy/distributed computation.
batch_size (int, optional) – Number of vectors to process at once, by default 500.
jit_compile (bool, optional) – Whether to use JAX just-in-time compilation, by default True.
eps (float, optional) – Small constant for numerical stability, by default 1e-8.
progress (bool, optional) – Whether to show a progress bar for calculations, by default True.
diagonal_variance (np.ndarray, optional) – Per-gene diagonal variance to add to the shared covariance matrix. Shape (n_genes, n_points). When provided, uses a factor trick to efficiently incorporate per-gene heteroscedastic noise without constructing per-gene covariance matrices. Only used with 2D (shared) covariance matrices. By default None.

Returns:

Array of Mahalanobis distances for each input vector.

Return type:

np.ndarray
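
The shared-covariance (2D) case can be sketched as follows: factor the regularized covariance once, then process the difference vectors in batches. This is an illustrative NumPy version under stated assumptions, not the library's JAX implementation. It assumes each row of the input is one difference vector whose length matches the covariance dimension, and that eps is added to the diagonal. The name mahalanobis_batched is hypothetical, and the 3D and diagonal_variance paths are not covered.

```python
import numpy as np


def mahalanobis_batched(diffs, cov, batch_size=500, eps=1e-8):
    """Batched Mahalanobis distances with one shared covariance (sketch)."""
    diffs = np.asarray(diffs, dtype=float)  # (n_vectors, n_points)
    cov = np.asarray(cov, dtype=float)      # (n_points, n_points)
    # Factor once: cov + eps * I = L @ L.T (assumed regularization).
    L = np.linalg.cholesky(cov + eps * np.eye(cov.shape[0]))
    out = np.empty(diffs.shape[0])
    for start in range(0, diffs.shape[0], batch_size):
        block = diffs[start:start + batch_size]
        # Solve L @ Z = block.T; column i of Z then satisfies
        # ||Z[:, i]||^2 = d_i^T (cov + eps * I)^-1 d_i.
        # A general solver is used here; a triangular solver would be faster.
        Z = np.linalg.solve(L, block.T)
        out[start:start + batch_size] = np.sqrt(np.sum(Z**2, axis=0))
    return out


cov_demo = np.diag([4.0, 1.0])
diffs_demo = np.array([[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]])
# batch_size=2 forces two batches so the loop is exercised.
dists_demo = mahalanobis_batched(diffs_demo, cov_demo, batch_size=2)
```

Factoring once and reusing the factor for every batch is what makes the multi-vector path cheaper than repeated single-vector calls.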

kompot.utils.find_landmarks(X: ndarray, n_clusters: int = 200, n_neighbors: int = 15, tol: float = 0.1, max_iter: int = 10, random_state: int | None = None) → Tuple[ndarray, ndarray]

Identify landmark points representing clusters in the dataset.

Parameters:

X (np.ndarray) – Data matrix of shape (n_samples, n_features).
n_clusters (int, optional) – Desired number of clusters/landmarks, by default 200.
n_neighbors (int, optional) – Number of neighbors for graph construction, by default 15.
tol (float, optional) – Tolerance for the deviation from the target number of clusters, by default 0.1.
max_iter (int, optional) – Maximum number of iterations for resolution search, by default 10.
random_state (int or None, optional) – Seed for reproducible landmark selection, by default None. The Leiden community detection underlying landmark discovery draws from igraph’s global random number generator, which is otherwise left unseeded, so the returned landmarks vary run-to-run even for identical input. Passing an int seeds both the nearest-neighbor construction and the Leiden step, making find_landmarks fully reproducible through the public API (same X + same random_state yields identical landmark indices and coordinates). The default (None) preserves the historical non-deterministic behavior, so existing callers are unaffected.

Returns:

A tuple containing:

- landmarks: Matrix of shape (n_clusters, n_features) containing landmark coordinates.
- landmark_indices: Indices of landmarks in the original dataset.

Return type:

Tuple[np.ndarray, np.ndarray]
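
The relationship between the two returned arrays can be illustrated with a small sketch that picks one representative data point per cluster. The selection rule used here (the member closest to its cluster centroid) is an assumption for illustration; the source does not state how find_landmarks chooses the representative, and the cluster labels are supplied by hand rather than produced by Leiden. The name landmarks_from_labels is hypothetical.

```python
import numpy as np


def landmarks_from_labels(X, labels):
    """Pick one landmark per cluster (assumed rule: nearest to centroid)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    indices = []
    for lab in np.unique(labels):
        members = np.flatnonzero(labels == lab)
        centroid = X[members].mean(axis=0)
        # Choose the cluster member closest to the cluster centroid.
        dist = np.linalg.norm(X[members] - centroid, axis=1)
        indices.append(members[np.argmin(dist)])
    landmark_indices = np.array(indices)
    # Landmark coordinates are rows of X, so they index back into the data.
    return X[landmark_indices], landmark_indices


X_lm = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                 [10.0, 10.0], [10.0, 12.0], [10.0, 11.0]])
labels_lm = np.array([0, 0, 0, 1, 1, 1])
landmarks_demo, landmark_indices_demo = landmarks_from_labels(X_lm, labels_lm)
```

In this toy input the landmarks are rows 1 and 5 of X_lm, so landmarks_demo equals X_lm[landmark_indices_demo], matching the documented pairing of coordinates and indices.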

kompot.utils.find_optimal_resolution(edges: List[Tuple[int, int]], n_obs: int, n_clusters: int, tol: float = 0.1, max_iter: int = 10, random_state: int | None = None) → Tuple[float, any]

Find an optimal resolution for Leiden clustering to achieve a target number of clusters.

Parameters:

edges (List[Tuple[int, int]]) – List of edges defining the graph.
n_obs (int) – Number of observations (nodes) in the graph.
n_clusters (int) – Desired number of clusters.
tol (float, optional) – Tolerance for the deviation from the target number of clusters, by default 0.1.
max_iter (int, optional) – Maximum number of iterations for the search, by default 10.
random_state (int or None, optional) – Seed for igraph’s Leiden community detection, by default None. igraph’s community_leiden draws from a global random number generator that is otherwise left unseeded, so results vary run-to-run. Passing an int seeds that generator (via igraph.set_random_number_generator) for the duration of this call and restores the default afterwards, making the clustering reproducible. Passing None preserves the historical non-deterministic behavior.

Returns:

A tuple containing:

- optimal_resolution: The resolution value that best approximates the desired number of clusters.
- best_partition: The clustering partition at the optimal resolution.

Return type:

Tuple[float, any]
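
The seed-then-restore behavior and the bounded search described above can be sketched without igraph. The example below uses Python's global random module as a stand-in for igraph's global generator, and a toy function in place of a Leiden run. Several details are assumptions not confirmed by the source: the search is a bisection, tol is read as a relative deviation from n_clusters, and the names seeded_global_rng, search_resolution, toy_count, lo, and hi are hypothetical. Unlike the real function, the sketch returns the cluster count rather than a partition.

```python
import random
from contextlib import contextmanager


@contextmanager
def seeded_global_rng(seed):
    """Seed the global `random` generator, then restore its prior state."""
    if seed is None:
        yield  # leave the global generator untouched
        return
    saved = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(saved)


def search_resolution(count_clusters, n_clusters, tol=0.1, max_iter=10,
                      random_state=None, lo=0.0, hi=10.0):
    """Bisect on resolution until the count is within tol of the target."""
    best_res, best_n = None, None
    with seeded_global_rng(random_state):
        for _ in range(max_iter):
            mid = 0.5 * (lo + hi)
            n = count_clusters(mid)
            # Keep the resolution whose count is closest to the target so far.
            if best_n is None or abs(n - n_clusters) < abs(best_n - n_clusters):
                best_res, best_n = mid, n
            if abs(n - n_clusters) <= tol * n_clusters:
                break
            if n < n_clusters:
                lo = mid  # too few clusters: raise the resolution
            else:
                hi = mid  # too many clusters: lower the resolution
    return best_res, best_n


def toy_count(resolution):
    """Toy stand-in for a clustering run: noisy, increasing in resolution."""
    return int(round(40 * resolution)) + random.randint(0, 1)


state_before = random.getstate()
res_a, n_a = search_resolution(toy_count, n_clusters=200, random_state=0, hi=16.0)
res_b, n_b = search_resolution(toy_count, n_clusters=200, random_state=0, hi=16.0)
state_after = random.getstate()
```

Two calls with the same random_state return the same result, and the global generator state is unchanged afterwards, which mirrors the documented contract for callers that rely on their own seeding.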