Poster Presentation 46th Lorne Genome Conference 2025

The specious use of nearest neighbors for evaluating data structure retention (#149)

Charles Herring 1, Ryan Lister 1
  1. Centre for Medical Research, University of Western Australia, Perth, WA, Australia

Visualizing data taps into our innate ability to quickly and intuitively recognize patterns, relationships, and outliers. However, visualization beyond three dimensions is challenging, so dimensionality reduction techniques like UMAP are widely used in single-cell data analysis. These methods make high-dimensional data accessible in 2D, supporting qualitative exploration of data partitioning (e.g., cell types, batch effects). Despite their popularity, dimensionality reduction methods remain controversial. A recent prominent study by Chari and Pachter (2023) questions the fundamental capability of these methods to preserve high-dimensional structure, even suggesting that 2D embeddings are as arbitrary as randomly shaped embeddings. However, this study, like the field at large, relies primarily on nearest-neighbor (NN) retention as a measure of structure preservation, treating low NN overlap as indicative of poor structure preservation. Surprisingly, NN retention itself remains unproven as a measure of data structure.
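For readers unfamiliar with NN retention, the following is a minimal sketch of how such a score is commonly computed: for each point, take its k nearest neighbors in the original space and in the embedding, and report the mean fraction shared. It is a toy illustration under stated assumptions, not the analysis or the metric used in this study. The function names (knn_indices, nn_retention, same_cluster_fraction), the synthetic two-cluster data, and the choice of k = 5 are all hypothetical.

```python
import math
import random


def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank every other point by distance to p and keep the closest k.
        others = (j for j in range(len(points)) if j != i)
        ranked = sorted(others, key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(ranked[:k]))
    return neighbors


def nn_retention(high_dim, low_dim, k):
    """Mean fraction of each point's k nearest neighbors shared by both spaces."""
    if len(high_dim) != len(low_dim):
        raise ValueError("both spaces must contain the same points in the same order")
    hi = knn_indices(high_dim, k)
    lo = knn_indices(low_dim, k)
    return sum(len(a & b) / k for a, b in zip(hi, lo)) / len(hi)


def same_cluster_fraction(neighbors, labels):
    """Fraction of neighbor pairs whose two points carry the same label."""
    total = sum(len(nbrs) for nbrs in neighbors)
    same = sum(labels[i] == labels[j] for i, nbrs in enumerate(neighbors) for j in nbrs)
    return same / total


# Hypothetical toy data: two well-separated clusters of 30 points in 10 dimensions.
rng = random.Random(0)
labels = [0] * 30 + [1] * 30
high = [[20.0 * lab + rng.gauss(0, 1) for _ in range(10)] for lab in labels]

# A hypothetical 2D "embedding" that keeps the two clusters apart but places
# points at fresh random positions within each cluster.
low = [[20.0 * lab + rng.gauss(0, 1), rng.gauss(0, 1)] for lab in labels]

k = 5
identity_score = nn_retention(high, high, k)    # a space compared with itself
scrambled_score = nn_retention(high, low, k)    # clusters kept, neighbors reshuffled
low_purity = same_cluster_fraction(knn_indices(low, k), labels)

print(f"NN retention (identity):  {identity_score:.2f}")
print(f"NN retention (scrambled): {scrambled_score:.2f}")
print(f"Same-cluster neighbors in 2D: {low_purity:.2f}")
```

In this toy case the cluster-level partitioning is fully retained in 2D while the exact identity of each point's neighbors is not, so the NN retention score can be low even though the coarse structure survives. The brute-force neighbor search is quadratic in the number of points and is only meant for small examples.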

In this study, we demonstrate that NN similarity is a poor metric for quantifying data structure similarity and introduce a more reliable graph-based metric for measuring structure preservation. We use this metric to benchmark a range of dimensionality reduction approaches, showing that high-dimensional structure is well preserved in the 2D spaces of UMAP and t-SNE but poorly preserved in arbitrarily shaped embeddings. Additionally, using time-series data, we show that large-scale trends, such as cell differentiation and development, are also well preserved in the 2D embeddings of UMAP and PHATE.

Over-parameterization is a common criticism of dimensionality reduction, so we further test UMAP, the best-performing method, and demonstrate that our findings hold across a reasonable range of parameter settings. Finally, we introduce a scoring metric that can be presented alongside a dimensionality reduction to quantify the degree of structure preservation in the visualization.

Our findings provide a framework for directly quantifying structure preservation and reinforce the utility of low-dimensional embeddings for exploring high-dimensional data.

  1. Chari T, Pachter L (2023) The specious art of single-cell genomics. PLoS Comput Biol 19(8): e1011288. https://doi.org/10.1371/journal.pcbi.1011288