Utilities

Utility functions for the Kompot package.

kompot.utils.build_graph(X: ndarray, n_neighbors: int = 15) → Tuple[List[Tuple[int, int]], NNDescent]

Build a graph from a dataset using approximate nearest neighbors.

Parameters:
  • X (np.ndarray) – Data matrix of shape (n_samples, n_features).

  • n_neighbors (int, optional) – Number of neighbors for graph construction, by default 15.

Returns:

A tuple containing:
  • edges: List of (source, target) tuples defining the graph.

  • index: The nearest neighbor index for future queries.

Return type:

Tuple[List[Tuple[int, int]], pynndescent.NNDescent]
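
To illustrate the shape of the edges return value, here is a minimal sketch that builds a k-nearest-neighbor edge list with exact brute-force search in NumPy. It is a stand-in, not Kompot's implementation: the real function uses an approximate index (pynndescent.NNDescent). The helper name knn_edges is hypothetical, and whether Kompot's edges are directed or exclude self-matches is an assumption here.

```python
from typing import List, Tuple

import numpy as np


def knn_edges(X: np.ndarray, n_neighbors: int = 15) -> List[Tuple[int, int]]:
    """Exact k-nearest-neighbor edge list (brute force, O(n^2) memory).

    Hypothetical helper for illustration; not part of kompot.utils.
    """
    # Pairwise squared Euclidean distances via the expansion
    # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # assumption: a point is not its own neighbor
    k = min(n_neighbors, X.shape[0] - 1)
    neighbors = np.argsort(d2, axis=1)[:, :k]
    # One directed (source, target) edge per neighbor relation.
    return [(i, int(j)) for i in range(X.shape[0]) for j in neighbors[i]]


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
edges = knn_edges(X, n_neighbors=5)
```

Brute force is only practical for small datasets. An approximate index trades a little accuracy for much better scaling, and it can also be reused for later queries.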

kompot.utils.compute_mahalanobis_distance(diff_values: ndarray, covariance_matrix: ndarray, eps: float = 1e-08, jit_compile: bool = True) → float

Compute the Mahalanobis distance for a vector given a covariance matrix.

This is a convenience function for computing a single Mahalanobis distance. For multiple vectors, use compute_mahalanobis_distances for better performance.

Parameters:
  • diff_values (np.ndarray) – The difference vector for which to compute the Mahalanobis distance.

  • covariance_matrix (np.ndarray) – The covariance matrix.

  • eps (float, optional) – Small constant for numerical stability, by default 1e-8.

  • jit_compile (bool, optional) – Whether to use JAX just-in-time compilation, by default True.

Returns:

The Mahalanobis distance.

Return type:

float
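
For reference, the Mahalanobis distance of a difference vector d under covariance S is sqrt(dᵀ S⁻¹ d). The NumPy sketch below shows that computation. It is not Kompot's JAX implementation: the helper name mahalanobis is hypothetical, and adding eps to the diagonal of the covariance matrix is an assumption about how the stability constant is applied.

```python
import numpy as np


def mahalanobis(diff: np.ndarray, cov: np.ndarray, eps: float = 1e-8) -> float:
    """Mahalanobis distance sqrt(diff^T cov^{-1} diff) for one vector.

    Hypothetical helper for illustration; not part of kompot.utils.
    """
    # Assumption: eps regularizes the diagonal so the solve stays well-posed.
    cov_reg = cov + eps * np.eye(cov.shape[0])
    # Solve cov_reg @ x = diff instead of forming the explicit inverse.
    solved = np.linalg.solve(cov_reg, diff)
    return float(np.sqrt(diff @ solved))


diff = np.array([3.0, 4.0])
d_identity = mahalanobis(diff, np.eye(2))              # close to the Euclidean norm
d_scaled = mahalanobis(diff, np.diag([9.0, 16.0]))     # each axis rescaled by its std
```

With an identity covariance the result reduces to the Euclidean norm, up to the small eps perturbation.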

kompot.utils.compute_mahalanobis_distances(diff_values: ndarray, covariance: ndarray | Array | da.Array, batch_size: int = 500, jit_compile: bool = True, eps: float = 1e-08, progress: bool = True, diagonal_variance: ndarray | None = None) → ndarray

Compute Mahalanobis distances for multiple difference vectors efficiently.

This function computes the Mahalanobis distance for each difference vector using the provided covariance matrix or tensor. It handles both a single shared covariance matrix and gene-specific covariance tensors.

Parameters:
  • diff_values (np.ndarray) – The difference vectors for which to compute Mahalanobis distances. Shape should be (n_samples, n_features) or (n_features, n_samples).

  • covariance (np.ndarray, jnp.ndarray, or dask.array.Array) – Covariance matrix or tensor:
      - If 2D with shape (n_points, n_points): shared covariance for all vectors.
      - If 3D with shape (n_points, n_points, n_genes): gene-specific covariance matrices.
      - Can be a dask array for lazy/distributed computation.

  • batch_size (int, optional) – Number of vectors to process at once, by default 500.

  • jit_compile (bool, optional) – Whether to use JAX just-in-time compilation, by default True.

  • eps (float, optional) – Small constant for numerical stability, by default 1e-8.

  • progress (bool, optional) – Whether to show a progress bar for calculations, by default True.

  • diagonal_variance (np.ndarray, optional) – Per-gene diagonal variance to add to the shared covariance matrix. Shape (n_genes, n_points). When provided, uses a factor trick to efficiently incorporate per-gene heteroscedastic noise without constructing per-gene covariance matrices. Only used with 2D (shared) covariance matrices. By default None.

Returns:

Array of Mahalanobis distances for each input vector.

Return type:

np.ndarray
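
The shared-covariance case can be sketched in NumPy as follows: factor the covariance once with a Cholesky decomposition, then process the difference vectors in batches. This is an illustration, not Kompot's JAX code. The helper name mahalanobis_batch is hypothetical, the input is assumed to hold one difference vector per row with length n_points, and eps is assumed to be added to the diagonal. The diagonal_variance option and the 3D gene-specific case are not covered.

```python
import numpy as np


def mahalanobis_batch(
    diffs: np.ndarray, cov: np.ndarray, batch_size: int = 500, eps: float = 1e-8
) -> np.ndarray:
    """Mahalanobis distance for each row of diffs under one shared covariance.

    Hypothetical helper for illustration; not part of kompot.utils.
    """
    n_points = cov.shape[0]
    # Factor once: cov_reg = L @ L.T, so d^T cov_reg^{-1} d = |L^{-1} d|^2.
    chol = np.linalg.cholesky(cov + eps * np.eye(n_points))
    out = np.empty(diffs.shape[0])
    for start in range(0, diffs.shape[0], batch_size):
        block = diffs[start:start + batch_size]      # (batch, n_points)
        z = np.linalg.solve(chol, block.T)           # (n_points, batch)
        out[start:start + batch_size] = np.sqrt(np.sum(z**2, axis=0))
    return out


cov = np.diag([4.0, 1.0])
diffs = np.array([[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]])
# batch_size=2 forces two passes through the loop on this tiny input.
dists = mahalanobis_batch(diffs, cov, batch_size=2)
```

Reusing one factorization across all batches is what makes the shared-covariance case cheaper than calling a single-vector routine in a loop.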

kompot.utils.find_landmarks(X: ndarray, n_clusters: int = 200, n_neighbors: int = 15, tol: float = 0.1, max_iter: int = 10) → Tuple[ndarray, ndarray]

Identify landmark points representing clusters in the dataset.

Parameters:
  • X (np.ndarray) – Data matrix of shape (n_samples, n_features).

  • n_clusters (int, optional) – Desired number of clusters/landmarks, by default 200.

  • n_neighbors (int, optional) – Number of neighbors for graph construction, by default 15.

  • tol (float, optional) – Tolerance for the deviation from the target number of clusters, by default 0.1.

  • max_iter (int, optional) – Maximum number of iterations for resolution search, by default 10.

Returns:

A tuple containing:
  • landmarks: Matrix of shape (n_clusters, n_features) containing landmark coordinates.

  • landmark_indices: Indices of landmarks in the original dataset.

Return type:

Tuple[np.ndarray, np.ndarray]
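
One plausible way to turn a clustering into landmarks is to pick, for each cluster, the data point closest to the cluster centroid. The sketch below assumes that rule. The page does not state how Kompot selects the representative point, and the helper name landmarks_from_labels is hypothetical. It takes precomputed cluster labels rather than running the graph construction and Leiden steps.

```python
from typing import Tuple

import numpy as np


def landmarks_from_labels(X: np.ndarray, labels: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Pick one representative data point per cluster.

    Hypothetical helper for illustration; not part of kompot.utils.
    Assumption: the representative is the member nearest the cluster centroid.
    """
    indices = []
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        centroid = X[members].mean(axis=0)
        dist_to_centroid = np.linalg.norm(X[members] - centroid, axis=1)
        indices.append(members[np.argmin(dist_to_centroid)])
    landmark_indices = np.array(indices)
    return X[landmark_indices], landmark_indices


X = np.array([
    [0.0, 0.0], [1.0, 0.0], [2.0, 0.0],        # cluster 0
    [10.0, 10.0], [10.0, 12.0], [10.0, 11.0],  # cluster 1
])
labels = np.array([0, 0, 0, 1, 1, 1])
landmarks, landmark_indices = landmarks_from_labels(X, labels)
```

Because every landmark is an actual row of X, the returned indices can be used to look up any per-observation metadata for the landmarks.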

kompot.utils.find_optimal_resolution(edges: List[Tuple[int, int]], n_obs: int, n_clusters: int, tol: float = 0.1, max_iter: int = 10) → Tuple[float, any]

Find an optimal resolution for Leiden clustering to achieve a target number of clusters.

Parameters:
  • edges (List[Tuple[int, int]]) – List of edges defining the graph.

  • n_obs (int) – Number of observations (nodes) in the graph.

  • n_clusters (int) – Desired number of clusters.

  • tol (float, optional) – Tolerance for the deviation from the target number of clusters, by default 0.1.

  • max_iter (int, optional) – Maximum number of iterations for the search, by default 10.

Returns:

A tuple containing:
  • optimal_resolution: The resolution value that best approximates the desired number of clusters.

  • best_partition: The clustering partition at the optimal resolution.

Return type:

Tuple[float, any]
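
A resolution search of this kind can be sketched as a bisection over the resolution parameter. The version below rests on several assumptions that the page does not confirm: the cluster count grows with resolution, tol is a relative deviation from n_clusters, and the search bounds are 0.01 and 10.0. The helper search_resolution and the toy count_clusters callable are hypothetical. In practice the callable would run Leiden clustering on the graph and return the number of clusters found.

```python
from typing import Callable, Tuple


def search_resolution(
    count_clusters: Callable[[float], int],
    n_clusters: int,
    tol: float = 0.1,
    max_iter: int = 10,
    low: float = 0.01,
    high: float = 10.0,
) -> Tuple[float, float]:
    """Bisection over resolution; returns (best_resolution, relative_gap).

    Hypothetical helper for illustration; not part of kompot.utils.
    """
    best_res, best_gap = low, float("inf")
    for _ in range(max_iter):
        mid = 0.5 * (low + high)
        found = count_clusters(mid)
        gap = abs(found - n_clusters) / n_clusters  # assumption: tol is relative
        if gap < best_gap:
            best_res, best_gap = mid, gap
        if gap <= tol:
            break
        # Assumption: more resolution means more clusters.
        if found < n_clusters:
            low = mid
        else:
            high = mid
    return best_res, best_gap


def toy_count(resolution: float) -> int:
    """Stand-in for a clustering run: cluster count grows linearly."""
    return int(round(40 * resolution))


res, gap = search_resolution(toy_count, n_clusters=30)
```

Tracking the best resolution seen so far means the search still returns a usable value when max_iter is exhausted before the tolerance is met.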