Utilities
Utility functions for the Kompot package.
- kompot.utils.build_graph(X: ndarray, n_neighbors: int = 15) → Tuple[List[Tuple[int, int]], NNDescent]
Build a graph from a dataset using approximate nearest neighbors.
- Parameters:
X (np.ndarray) – Data matrix of shape (n_samples, n_features).
n_neighbors (int, optional) – Number of neighbors for graph construction, by default 15.
- Returns:
A tuple containing:
  - edges: List of (source, target) tuples defining the graph.
  - index: The nearest neighbor index for future queries.
- Return type:
Tuple[List[Tuple[int, int]], pynndescent.NNDescent]
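To make the shape of the returned edge list concrete, here is a minimal sketch that builds (source, target) edges with an exact brute-force neighbor search in NumPy. This is a stand-in technique for illustration only: the function itself returns a pynndescent.NNDescent index and uses approximate search, and the helper name `knn_edges_sketch` is hypothetical, not part of Kompot.

```python
import numpy as np


def knn_edges_sketch(X, n_neighbors=15):
    """Exact k-nearest-neighbor edge list (illustrative; not kompot's code)."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    # A point cannot have more neighbors than there are other points.
    k = min(n_neighbors, n_samples - 1)
    # Pairwise squared Euclidean distances, shape (n_samples, n_samples).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Exclude each point from its own neighbor list.
    np.fill_diagonal(sq_dists, np.inf)
    neighbors = np.argsort(sq_dists, axis=1)[:, :k]
    # One directed edge per (point, neighbor) pair.
    return [(i, int(j)) for i in range(n_samples) for j in neighbors[i]]


# Three points on a line; each gets its single nearest neighbor.
edges = knn_edges_sketch(np.array([[0.0], [1.0], [10.0]]), n_neighbors=1)
```

Brute force costs memory and time quadratic in n_samples, which is why an approximate index is the practical choice for large datasets.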
- kompot.utils.compute_mahalanobis_distance(diff_values: ndarray, covariance_matrix: ndarray, eps: float = 1e-08, jit_compile: bool = True) → float
Compute the Mahalanobis distance for a vector given a covariance matrix.
This is a convenience function for computing a single Mahalanobis distance. For multiple vectors, use compute_mahalanobis_distances for better performance.
- Parameters:
diff_values (np.ndarray) – The difference vector for which to compute the Mahalanobis distance.
covariance_matrix (np.ndarray) – The covariance matrix.
eps (float, optional) – Small constant for numerical stability, by default 1e-8.
jit_compile (bool, optional) – Whether to use JAX just-in-time compilation, by default True.
- Returns:
The Mahalanobis distance.
- Return type:
float
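For reference, the Mahalanobis distance of a difference vector d under covariance Σ is sqrt(dᵀ Σ⁻¹ d). The sketch below computes it with plain NumPy. It assumes that eps is added to the diagonal of the covariance matrix before solving, which this page does not confirm. The helper name `mahalanobis_sketch` is hypothetical and the code is not Kompot's JAX implementation.

```python
import numpy as np


def mahalanobis_sketch(diff, cov, eps=1e-8):
    """sqrt(d^T Sigma^-1 d) for one vector (illustrative; not kompot's code)."""
    diff = np.asarray(diff, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Assumption: eps regularizes the diagonal so the system stays solvable.
    regularized = cov + eps * np.eye(cov.shape[0])
    # Solve Sigma x = d instead of forming the explicit inverse.
    solved = np.linalg.solve(regularized, diff)
    return float(np.sqrt(diff @ solved))


# With an identity covariance and eps=0 this reduces to the Euclidean norm.
distance = mahalanobis_sketch(np.array([3.0, 4.0]), np.eye(2), eps=0.0)
```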
- kompot.utils.compute_mahalanobis_distances(diff_values: ndarray, covariance: ndarray | Array | da.Array, batch_size: int = 500, jit_compile: bool = True, eps: float = 1e-08, progress: bool = True, diagonal_variance: ndarray | None = None) → ndarray
Compute Mahalanobis distances for multiple difference vectors efficiently.
This function computes the Mahalanobis distance for each difference vector using the provided covariance matrix or tensor. It handles both a single shared covariance matrix and gene-specific covariance tensors.
- Parameters:
diff_values (np.ndarray) – The difference vectors for which to compute Mahalanobis distances. Shape should be (n_samples, n_features) or (n_features, n_samples).
covariance (np.ndarray, jnp.ndarray, or dask.array.Array) – Covariance matrix or tensor:
  - If 2D with shape (n_points, n_points): shared covariance for all vectors.
  - If 3D with shape (n_points, n_points, n_genes): gene-specific covariance matrices.
  - Can be a dask array for lazy/distributed computation.
batch_size (int, optional) – Number of vectors to process at once, by default 500.
jit_compile (bool, optional) – Whether to use JAX just-in-time compilation, by default True.
eps (float, optional) – Small constant for numerical stability, by default 1e-8.
progress (bool, optional) – Whether to show a progress bar for calculations, by default True.
diagonal_variance (np.ndarray, optional) – Per-gene diagonal variance to add to the shared covariance matrix. Shape (n_genes, n_points). When provided, uses a factor trick to efficiently incorporate per-gene heteroscedastic noise without constructing per-gene covariance matrices. Only used with 2D (shared) covariance matrices. By default None.
- Returns:
Array of Mahalanobis distances for each input vector.
- Return type:
np.ndarray
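The benefit of the batched function over repeated single-vector calls can be illustrated for the shared (2D) covariance case: the covariance is factored once and every batch reuses the factor. The sketch below uses a Cholesky factorization in NumPy, which is an assumed technique, not a description of Kompot's internals. It also assumes each row of the input is one difference vector whose length matches the covariance side, and that eps is added to the diagonal. It does not cover 3D covariance tensors, dask inputs, or diagonal_variance. The name `mahalanobis_batched_sketch` is hypothetical.

```python
import numpy as np


def mahalanobis_batched_sketch(diffs, cov, batch_size=500, eps=1e-8):
    """Row-wise Mahalanobis distances with one shared covariance (illustrative)."""
    diffs = np.asarray(diffs, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Factor Sigma = L L^T once; then d^T Sigma^-1 d = ||L^-1 d||^2.
    chol = np.linalg.cholesky(cov + eps * np.eye(cov.shape[0]))
    out = np.empty(diffs.shape[0])
    for start in range(0, diffs.shape[0], batch_size):
        batch = diffs[start:start + batch_size]
        # Solve L z = d for every vector in the batch (vectors as columns).
        z = np.linalg.solve(chol, batch.T)
        out[start:start + batch_size] = np.sqrt(np.sum(z * z, axis=0))
    return out


diffs_demo = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
cov_demo = np.diag([1.0, 4.0])
# batch_size=2 forces two batches so the loop is actually exercised.
distances = mahalanobis_batched_sketch(diffs_demo, cov_demo, batch_size=2, eps=0.0)
```

Processing in batches bounds the size of the intermediate array `z`, which is the usual reason for exposing a batch_size parameter.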
- kompot.utils.find_landmarks(X: ndarray, n_clusters: int = 200, n_neighbors: int = 15, tol: float = 0.1, max_iter: int = 10) → Tuple[ndarray, ndarray]
Identify landmark points representing clusters in the dataset.
- Parameters:
X (np.ndarray) – Data matrix of shape (n_samples, n_features).
n_clusters (int, optional) – Desired number of clusters/landmarks, by default 200.
n_neighbors (int, optional) – Number of neighbors for graph construction, by default 15.
tol (float, optional) – Tolerance for the deviation from the target number of clusters, by default 0.1.
max_iter (int, optional) – Maximum number of iterations for resolution search, by default 10.
- Returns:
A tuple containing:
  - landmarks: Matrix of shape (n_clusters, n_features) containing landmark coordinates.
  - landmark_indices: Indices of landmarks in the original dataset.
- Return type:
Tuple[np.ndarray, np.ndarray]
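Because landmark_indices point into the original dataset, each landmark is an actual data point rather than a synthetic centroid. This page does not say how the representative of each cluster is chosen, so the sketch below makes an explicit assumption: it picks the cluster member closest to the cluster mean. It takes cluster labels as given instead of running Leiden clustering, and the name `landmarks_from_labels_sketch` is hypothetical.

```python
import numpy as np


def landmarks_from_labels_sketch(X, labels):
    """Pick one representative data point per cluster (illustrative)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    indices = []
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        centroid = X[members].mean(axis=0)
        # Assumption: the representative is the member nearest the centroid.
        dists = np.linalg.norm(X[members] - centroid, axis=1)
        indices.append(int(members[np.argmin(dists)]))
    landmark_indices = np.array(indices)
    # Landmarks are rows of X, so coordinates and indices stay consistent.
    return X[landmark_indices], landmark_indices


X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
landmarks, landmark_indices = landmarks_from_labels_sketch(X_demo, [0, 0, 0, 1])
```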
- kompot.utils.find_optimal_resolution(edges: List[Tuple[int, int]], n_obs: int, n_clusters: int, tol: float = 0.1, max_iter: int = 10) → Tuple[float, any]
Find an optimal resolution for Leiden clustering to achieve a target number of clusters.
- Parameters:
edges (List[Tuple[int, int]]) – List of edges defining the graph.
n_obs (int) – Number of observations (nodes) in the graph.
n_clusters (int) – Desired number of clusters.
tol (float, optional) – Tolerance for the deviation from the target number of clusters, by default 0.1.
max_iter (int, optional) – Maximum number of iterations for the search, by default 10.
- Returns:
A tuple containing:
  - optimal_resolution: The resolution value that best approximates the desired number of clusters.
  - best_partition: The clustering partition at the optimal resolution.
- Return type:
Tuple[float, any]
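The roles of tol and max_iter can be illustrated with a generic bisection over the resolution parameter. The sketch below is built on assumptions this page does not confirm: that the search is a bisection, that the cluster count grows with resolution, that tol is a relative tolerance (fraction of n_clusters), and that the search range is [0, 10]. The clustering step is replaced by a hypothetical callable `count_clusters`, and unlike the documented function the sketch returns only the resolution, not the partition.

```python
def search_resolution_sketch(count_clusters, n_clusters, tol=0.1,
                             max_iter=10, lo=0.0, hi=10.0):
    """Bisection over resolution toward a target cluster count (illustrative)."""
    best_resolution, best_gap = None, float("inf")
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        found = count_clusters(mid)
        gap = abs(found - n_clusters)
        # Remember the closest attempt in case tolerance is never reached.
        if gap < best_gap:
            best_resolution, best_gap = mid, gap
        # Assumption: tol is relative to the target number of clusters.
        if gap <= tol * n_clusters:
            break
        # Assumption: more resolution yields more clusters.
        if found < n_clusters:
            lo = mid
        else:
            hi = mid
    return best_resolution


def toy_count(resolution):
    """Hypothetical stand-in for a clustering run: count rises with resolution."""
    return int(round(50 * resolution))


resolution = search_resolution_sketch(toy_count, n_clusters=200)
```

With max_iter as a hard cap, the search returns its closest attempt even when no resolution lands inside the tolerance band.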