Parameter screening

Many nonlinear dimensionality reduction (DR) methods expose parameters that control the balance between local and global structure preservation. The optimal parameter values are data-dependent and may vary substantially with dataset size, density, and noise.

kmer-ord therefore supports parameter screening, where multiple embeddings are generated across a predefined parameter grid instead of relying on a single configuration.

This allows users to:

assess embedding stability
compare local vs global structure preservation
select parameters appropriate for dataset scale and analysis goals

What is parameter screening?

Parameter screening refers to the systematic evaluation of multiple hyperparameter combinations for a given DR method.

Rather than producing a single embedding, kmer-ord generates a set of embeddings, each corresponding to a different parameter configuration.

Each embedding is saved independently and can be inspected visually or quantitatively downstream.
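One way to inspect saved embeddings quantitatively is to check how much of each point's neighborhood survives from one parameter configuration to the next. The sketch below is an illustration only and is not part of kmer-ord: `knn_sets` and `neighbor_overlap` are hypothetical helpers, and the brute-force neighbor search is practical only for small samples or subsamples.

```python
import math

def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbours."""
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < number of points")
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result.append(set(others[:k]))
    return result

def neighbor_overlap(emb_a, emb_b, k=2):
    """Mean Jaccard overlap of k-NN sets between two embeddings of the same samples."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must describe the same samples")
    sets_a, sets_b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return sum(scores) / len(scores)

# Two toy embeddings of six samples: the second is a rotated, rescaled copy.
emb_1 = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
emb_2 = [(0, 2 * x) for x, _ in emb_1]
print(neighbor_overlap(emb_1, emb_2, k=2))  # 1.0
```

Values near 1 indicate that two configurations agree on local structure. Lower values point to parameter regions where the embedding is changing.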

Why parameter screening is important

Nonlinear DR methods are inherently ill-posed: many different low-dimensional embeddings can faithfully represent aspects of the same high-dimensional data.

Key trade-offs controlled by parameters include:

Local vs global structure preservation
Cluster compactness vs continuity
Noise sensitivity
Scalability to large datasets

There is no universally optimal parameter setting.

When should I use parameter screening?

Parameter screening is recommended when:

you are exploring a new dataset
embeddings appear unstable
cluster structure is unclear
the dataset is large or highly imbalanced

For routine analysis, scale-aware defaults are often sufficient.

UMAP parameter screening

UMAP exposes two primary parameters that strongly affect the embedding:

n_neighbors
min_dist

kmer-ord screens both parameters jointly when --screen_params is enabled.

`n_neighbors`

Controls:
The size of the local neighborhood used to construct the manifold graph.

Interpretation:

Smaller values emphasize local structure
Larger values incorporate more global structure

Effect on embeddings:

`n_neighbors`	Effect
Small (5–30)	Tight local clusters, potential fragmentation
Medium (50–100)	Balanced local and global structure
Large (100–200+)	Smoother embeddings, improved global continuity

Relation to dataset size:

Small datasets benefit from smaller n_neighbors
Large datasets require larger values to avoid over-fragmentation

`min_dist`

Controls:
The minimum distance allowed between embedded points.

Interpretation:

Smaller values allow points to pack tightly
Larger values enforce separation between points

Effect on embeddings:

`min_dist`	Effect
0–0.1	Very compact clusters
0.1–0.3	Moderate separation
0.5–1.0	Emphasis on global structure

Key insight:
min_dist does not change neighborhood relationships; it changes how densely points are packed in the embedding.
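A sketch of the mechanism, assuming the formulation used by the umap-learn reference implementation: min_dist enters only through the curve that converts embedding distances into similarities. The constants a and b are fitted so that a smooth curve approximates a target that is flat up to min_dist and decays beyond it. The `spread` parameter is not discussed elsewhere in this document and appears here only for completeness.

```latex
\Phi(d) = \frac{1}{1 + a\, d^{2b}}
\quad\text{fitted to}\quad
\Psi(d) =
\begin{cases}
1, & d \le \texttt{min\_dist} \\
\exp\!\bigl(-(d - \texttt{min\_dist}) / \texttt{spread}\bigr), & d > \texttt{min\_dist}
\end{cases}
```

The neighborhood graph is built before this step and depends on n_neighbors, not on min_dist.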

Interaction between `n_neighbors` and `min_dist`

These parameters interact strongly:

Large n_neighbors + small min_dist → globally coherent but dense clusters
Small n_neighbors + large min_dist → fragmented but separated structure

Parameter screening is particularly useful because good combinations are dataset-specific.

UMAP screening grid in kmer-ord

When screening is enabled, kmer-ord evaluates the following grid:

n_neighbors ∈ {5, 10, 50, 100, 150}
min_dist    ∈ {0.0, 0.1, 0.25, 0.5, 1.0}

The grid contains 5 × 5 = 25 combinations, and each one produces a separate embedding saved to disk.
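The grid can be enumerated as a plain cross product. The following is a minimal sketch rather than kmer-ord's actual code: the helper names and the output file naming scheme are hypothetical, and only the parameter values come from the grid above.

```python
from itertools import product

# Grid values taken from the screening grid listed above.
N_NEIGHBORS = (5, 10, 50, 100, 150)
MIN_DIST = (0.0, 0.1, 0.25, 0.5, 1.0)

def umap_grid():
    """Return every (n_neighbors, min_dist) combination as a parameter dict."""
    return [
        {"n_neighbors": n, "min_dist": d}
        for n, d in product(N_NEIGHBORS, MIN_DIST)
    ]

def output_name(params, prefix="umap"):
    """Hypothetical file name encoding the parameters of one embedding."""
    return f"{prefix}_nn{params['n_neighbors']}_md{params['min_dist']:g}.csv"

grid = umap_grid()
print(len(grid))             # 25
print(output_name(grid[0]))  # umap_nn5_md0.csv
```

With umap-learn installed, each dict could be passed on as `umap.UMAP(**params).fit_transform(X)`, where `X` stands for the feature matrix being embedded.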

Dataset scale and parameter defaults

kmer-ord provides scale-aware defaults when parameter screening is disabled.

Dataset scale	`n_neighbors`	`min_dist`	Rationale
small	30	0.05	Emphasize fine structure
medium	100	0.1	Balance local and global
large	150	0.2	Stabilize global structure

These defaults are heuristics, not guarantees.

Comparison with other methods

Parameter screening is implemented for multiple DR methods, each exposing different structural trade-offs:

Method	Screened parameters	Primary structure
t-SNE	`perplexity`, `learning_rate`	Local
UMAP	`n_neighbors`, `min_dist`	Local + global
TRIMAP	`n_inliers`, `weight_temp`	Global
PaCMAP	`FP_ratio`, `MN_ratio`	Multi-scale
LocalMAP	`FP_ratio`, `n_neighbors`	Strong local
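The table can be read as a small registry of which parameters are screened for each method. The sketch below shows one possible way to validate and expand a user-supplied grid against it. `SCREENED` and `expand_grid` are hypothetical names, and the candidate values in the example are made up for illustration.

```python
from itertools import product

# Screened parameter names per method, copied from the table above.
SCREENED = {
    "t-SNE": ("perplexity", "learning_rate"),
    "UMAP": ("n_neighbors", "min_dist"),
    "TRIMAP": ("n_inliers", "weight_temp"),
    "PaCMAP": ("FP_ratio", "MN_ratio"),
    "LocalMAP": ("FP_ratio", "n_neighbors"),
}

def expand_grid(method, values):
    """Cross all candidate values of the screened parameters for one method."""
    names = SCREENED[method]
    if set(values) != set(names):
        raise ValueError(f"{method} screens {names}, got {sorted(values)}")
    combos = product(*(values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

# Illustrative candidate values only.
configs = expand_grid("TRIMAP", {"n_inliers": [10, 50], "weight_temp": [0.3, 0.5]})
print(len(configs))  # 4
print(configs[0])    # {'n_inliers': 10, 'weight_temp': 0.3}
```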

t-SNE

t-SNE emphasizes local neighborhood preservation and is not designed to faithfully preserve global distances.

`perplexity`

Controls:
The effective number of neighbors considered for each point.

Interpretation:
Perplexity approximates the scale at which local structure is modeled.
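The phrase "effective number of neighbors" has a precise meaning: perplexity is 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution, and t-SNE tunes each point's kernel bandwidth until that distribution reaches the requested perplexity. The sketch below computes only the perplexity of a given distribution. The function name is illustrative and the bandwidth search is omitted.

```python
import math

def perplexity(probabilities):
    """Perplexity 2**H of a discrete distribution, with H the entropy in bits."""
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    entropy = 0.0
    for p in probabilities:
        p = p / total  # normalise so the values sum to 1
        if p > 0:
            entropy -= p * math.log2(p)
    return 2.0 ** entropy

# A uniform distribution over 4 neighbours has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
# A distribution dominated by one neighbour has a much lower perplexity.
print(perplexity([0.97, 0.01, 0.01, 0.01]))
```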

Perplexity	Effect
Low (5–30)	Very local structure, tight clusters
Medium (30–100)	Moderate neighborhood scope
High (100–200+)	More global coherence, risk of crowding

Dataset size guidance:

Small datasets: lower perplexity
Large datasets: higher perplexity required for stability

Rule of thumb: perplexity << n_samples
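The rule of thumb can be turned into a simple filter on candidate values. The sketch below uses the condition 3 × perplexity < n_samples − 1, a bound applied by several t-SNE implementations. Whether kmer-ord applies it is not stated here, and both the function name and the candidate list are hypothetical. scikit-learn's `TSNE` is less strict and only requires perplexity to be smaller than n_samples.

```python
def feasible_perplexities(candidates, n_samples):
    """Keep candidate perplexities satisfying 3 * perplexity < n_samples - 1."""
    return [p for p in candidates if 3 * p < n_samples - 1]

# Hypothetical candidate list, filtered for a dataset of 400 samples.
print(feasible_perplexities([5, 30, 100, 200], n_samples=400))  # [5, 30, 100]
```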

`learning_rate`

Controls:
Step size of the gradient descent optimization.

Interpretation:
Affects convergence speed and embedding stability.

Learning rate	Effect
Too small	Slow convergence, poor separation
Too large	Instability, distorted structure

Parameter screening helps identify stable regions.
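A screening range for learning_rate needs a starting point. One option, shown here as an assumption rather than as kmer-ord's behavior, is the heuristic behind scikit-learn's `learning_rate='auto'` setting, which scales the step size with the number of samples. The multiplicative factors used to build candidates are hypothetical.

```python
def auto_learning_rate(n_samples, early_exaggeration=12.0):
    """scikit-learn's 'auto' heuristic: max(N / early_exaggeration / 4, 50)."""
    return max(n_samples / early_exaggeration / 4.0, 50.0)

def learning_rate_candidates(n_samples, factors=(0.5, 1.0, 2.0)):
    """Hypothetical screening values spread around the heuristic."""
    base = auto_learning_rate(n_samples)
    return [base * factor for factor in factors]

print(auto_learning_rate(1_000))         # 50.0
print(learning_rate_candidates(48_000))  # [500.0, 1000.0, 2000.0]
```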

TRIMAP

TRIMAP explicitly emphasizes global structure preservation by incorporating triplet constraints.

`n_inliers`

Controls:
Number of nearest neighbors treated as “inliers” in triplet constraints.

Interpretation:

Larger values incorporate broader neighborhood information
This tends to improve global consistency

`n_inliers`	Effect
Small	Strong local structure
Large	Improved global layout

Dataset size guidance:

Increase n_inliers with dataset size

`weight_temp`

Controls:
Relative weighting between inlier and outlier triplets.

Interpretation:

Lower values emphasize local constraints
Higher values strengthen global repulsion

`weight_temp`	Effect
Low (≤0.3)	Local emphasis
Medium (0.4–0.6)	Balanced
High (>1.0)	Strong global structure

PaCMAP

PaCMAP is designed to preserve structure at multiple scales simultaneously.

`MN_ratio` (Mid-near ratio)

Controls:
Proportion of mid-range neighbors relative to nearest neighbors.

Effect:

Stabilizes intermediate-scale structure
Prevents over-fragmentation

kmer-ord usually keeps MN_ratio fixed; all PaCMAP scale presets use 0.5.

`FP_ratio` (Further point ratio)

Controls:
Number of further (distant, non-neighbor) pairs relative to nearest neighbors, which sets the strength of repulsive forces from distant points.

Interpretation:

Low values → local structure
High values → improved global separation

`FP_ratio`	Effect
Low	Compact clusters
High	Enhanced global layout

Dataset size guidance:

Increase FP_ratio for large datasets
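Both ratios become concrete when converted to pair counts per point. The sketch below assumes the convention of the pacmap reference implementation, where each ratio is multiplied by n_neighbors (10 by default there) and rounded. The function name is hypothetical.

```python
def pacmap_pair_counts(n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0):
    """Mid-near and further pairs sampled per point for the given ratios."""
    n_mid_near = int(round(n_neighbors * MN_ratio))
    n_further = int(round(n_neighbors * FP_ratio))
    return n_mid_near, n_further

print(pacmap_pair_counts())              # (5, 20)
print(pacmap_pair_counts(FP_ratio=5.0))  # (5, 50)
```

Under this assumption, raising FP_ratio from 2 to 5 increases the number of repulsive pairs per point from 20 to 50, which is why larger values push the layout toward global separation.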

LocalMAP

LocalMAP is a PaCMAP-derived method optimized for local neighborhood fidelity.

`n_neighbors`

Controls:
Neighborhood size used to define local structure.

Effect:

Smaller values isolate fine-scale structure
Larger values smooth embeddings

`FP_ratio`

Controls:
Global repulsion strength.

Interpretation:

LocalMAP typically uses lower FP_ratio values than PaCMAP to avoid disrupting local continuity.

`FP_ratio`	Effect
Low	Strong local preservation
High	Increased global separation

Summary

Parameter screening exposes the structure–scale trade-off inherent to nonlinear dimensionality reduction. Rather than hiding this complexity, kmer-ord makes it explicit and reproducible.

Treat embeddings as exploratory models, not definitive representations.

Scale presets in kmer-ord

Method	Scale	Hyperparameters
UMAP	default	`n_neighbors=15`, `min_dist=0.1`
	small	`n_neighbors=30`, `min_dist=0.05`
	medium	`n_neighbors=100`, `min_dist=0.1`
	large	`n_neighbors=150`, `min_dist=0.2`
t-SNE	default	`init=pca`, `random_state=42`
	small	`perplexity=30`, `init=pca`, `random_state=42`
	medium	`perplexity=100`, `init=pca`, `random_state=42`
	large	`perplexity=200`, `init=pca`, `random_state=42`
TRIMAP	default	`n_inliers=10`, `weight_temp=0.5`
	small	`n_inliers=50`, `weight_temp=0.3`
	medium	`n_inliers=100`, `weight_temp=0.4`
	large	`n_inliers=150`, `weight_temp=0.5`
PaCMAP	default	`MN_ratio=0.5`, `FP_ratio=2`
	small	`MN_ratio=0.5`, `FP_ratio=2`
	medium	`MN_ratio=0.5`, `FP_ratio=3`
	large	`MN_ratio=0.5`, `FP_ratio=5`
LocalMAP	default	`MN_ratio=0.5`, `FP_ratio=0.5`
	small	`MN_ratio=0.3`, `FP_ratio=0.5`
	medium	`MN_ratio=0.5`, `FP_ratio=0.7`
	large	`MN_ratio=0.7`, `FP_ratio=1.0`
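For scripting, the table can be expressed as a nested lookup. This is a sketch only: `PRESETS` and `get_preset` are hypothetical names and do not describe how kmer-ord stores its presets, although the values are copied from the table above.

```python
# Values copied from the scale table; the structure is illustrative.
TSNE_BASE = {"init": "pca", "random_state": 42}

PRESETS = {
    "UMAP": {
        "default": {"n_neighbors": 15, "min_dist": 0.1},
        "small": {"n_neighbors": 30, "min_dist": 0.05},
        "medium": {"n_neighbors": 100, "min_dist": 0.1},
        "large": {"n_neighbors": 150, "min_dist": 0.2},
    },
    "t-SNE": {
        "default": dict(TSNE_BASE),
        "small": {"perplexity": 30, **TSNE_BASE},
        "medium": {"perplexity": 100, **TSNE_BASE},
        "large": {"perplexity": 200, **TSNE_BASE},
    },
    "TRIMAP": {
        "default": {"n_inliers": 10, "weight_temp": 0.5},
        "small": {"n_inliers": 50, "weight_temp": 0.3},
        "medium": {"n_inliers": 100, "weight_temp": 0.4},
        "large": {"n_inliers": 150, "weight_temp": 0.5},
    },
    "PaCMAP": {
        "default": {"MN_ratio": 0.5, "FP_ratio": 2},
        "small": {"MN_ratio": 0.5, "FP_ratio": 2},
        "medium": {"MN_ratio": 0.5, "FP_ratio": 3},
        "large": {"MN_ratio": 0.5, "FP_ratio": 5},
    },
    "LocalMAP": {
        "default": {"MN_ratio": 0.5, "FP_ratio": 0.5},
        "small": {"MN_ratio": 0.3, "FP_ratio": 0.5},
        "medium": {"MN_ratio": 0.5, "FP_ratio": 0.7},
        "large": {"MN_ratio": 0.7, "FP_ratio": 1.0},
    },
}

def get_preset(method, scale="default"):
    """Return a copy of the hyperparameters for one method and scale."""
    try:
        return dict(PRESETS[method][scale])
    except KeyError:
        raise ValueError(f"unknown method/scale: {method!r}, {scale!r}") from None

print(get_preset("UMAP", "large"))  # {'n_neighbors': 150, 'min_dist': 0.2}
```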
