Parameter screening
Many nonlinear dimensionality reduction (DR) methods expose parameters that control the balance between local and global structure preservation. The optimal parameter values are data-dependent and may vary substantially with dataset size, density, and noise.
kmer-ord therefore supports parameter screening, where multiple
embeddings are generated across a predefined parameter grid instead of relying
on a single configuration.
This allows users to:
- assess embedding stability
- compare local vs global structure preservation
- select parameters appropriate for dataset scale and analysis goals
What is parameter screening?
Parameter screening refers to the systematic evaluation of multiple hyperparameter combinations for a given DR method.
Rather than producing a single embedding, the algorithm generates a set of embeddings, each corresponding to a different parameter configuration.
Each embedding is saved independently and can be inspected visually or quantitatively downstream.
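As a rough illustration of the idea, the sketch below expands a parameter grid into individual configurations and derives one output filename per configuration. The grid values, the `expand_grid` and `output_name` helpers, and the filename pattern are hypothetical and are not taken from kmer-ord.

```python
from itertools import product


def expand_grid(grid):
    """Yield one dict per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def output_name(method, params):
    """Build a filename that encodes the configuration (hypothetical pattern)."""
    tag = "_".join(f"{key}-{value}" for key, value in params.items())
    return f"{method}_{tag}.tsv"


# Hypothetical example grid; not the grid kmer-ord actually uses.
example_grid = {"n_neighbors": [15, 50], "min_dist": [0.1, 0.5]}

configs = list(expand_grid(example_grid))
filenames = [output_name("umap", cfg) for cfg in configs]
for filename in filenames:
    print(filename)
```

Encoding the parameters in the filename keeps each embedding traceable to its configuration without a separate index file.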
Why parameter screening is important
Nonlinear DR methods are inherently ill-posed: many different low-dimensional embeddings can faithfully represent aspects of the same high-dimensional data.
Key trade-offs controlled by parameters include:
- Local vs global structure preservation
- Cluster compactness vs continuity
- Noise sensitivity
- Scalability to large datasets
There is no universally optimal parameter setting.
When should I use parameter screening?
Parameter screening is recommended when:
- exploring a new dataset
- embeddings appear unstable
- cluster structure is unclear
- dataset size is large or highly imbalanced

For routine analysis, scale-aware defaults are often sufficient.
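One way to put a number on "unstable" is to compare the nearest-neighbor sets of the same samples across two embeddings. The sketch below computes a mean Jaccard overlap in plain Python. The metric choice, the helper names, and the toy coordinates are assumptions made for illustration, not part of kmer-ord, and the quadratic loop is only suitable for small inputs.

```python
from math import dist


def knn_sets(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(order[:k]))
    return neighbors


def knn_overlap(emb_a, emb_b, k):
    """Mean Jaccard overlap of k-nearest-neighbor sets between two embeddings."""
    sets_a = knn_sets(emb_a, k)
    sets_b = knn_sets(emb_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return sum(scores) / len(scores)


# Toy 2-D embeddings of the same five samples (made-up coordinates).
emb_1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0), (11.0, 10.0)]
emb_2 = [(2.0 * x, 2.0 * y) for x, y in emb_1]  # uniformly rescaled copy
emb_3 = [emb_1[0], emb_1[3], emb_1[2], emb_1[1], emb_1[4]]  # two samples swapped

print(knn_overlap(emb_1, emb_2, k=2))  # rescaling keeps every neighbor set
print(knn_overlap(emb_1, emb_3, k=2))  # swapping samples changes neighbor sets
```

Values near 1 suggest that two configurations agree on local structure, while low values point to parameter regions where the embedding changes substantially.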
UMAP parameter screening
UMAP exposes two primary parameters that strongly affect the embedding:
- n_neighbors
- min_dist
kmer-ord screens both parameters jointly when --screen_params is enabled.
n_neighbors
Controls:
The size of the local neighborhood used to construct the manifold graph.
Interpretation:
- Smaller values emphasize local structure
- Larger values incorporate more global structure
Effect on embeddings:
| n_neighbors | Effect |
|---|---|
| Small (5–30) | Tight local clusters, potential fragmentation |
| Medium (50–100) | Balanced local and global structure |
| Large (100–200+) | Smoother embeddings, improved global continuity |
Relation to dataset size:
- Small datasets benefit from smaller n_neighbors
- Large datasets require larger values to avoid over-fragmentation
min_dist
Controls:
The minimum distance allowed between embedded points.
Interpretation:
- Smaller values allow points to pack tightly
- Larger values enforce separation between points
Effect on embeddings:
| min_dist | Effect |
|---|---|
| 0–0.1 | Very compact clusters |
| 0.1–0.3 | Moderate separation |
| 0.5–1.0 | Emphasis on global structure |
Key insight:
min_dist does not change neighborhood relationships — it changes how
densely points are packed in the embedding.
Interaction between n_neighbors and min_dist
These parameters interact strongly:
- Large n_neighbors + small min_dist → globally coherent but dense clusters
- Small n_neighbors + large min_dist → fragmented but separated structure
Parameter screening is particularly useful because good combinations are dataset-specific.
UMAP screening grid in kmer-ord
When screening is enabled, kmer-ord evaluates the following grid:
Each combination produces a separate embedding saved to disk.
Dataset scale and parameter defaults
kmer-ord provides scale-aware defaults when parameter screening is disabled.
| Dataset scale | n_neighbors | min_dist | Rationale |
|---|---|---|---|
| small | 30 | 0.05 | Emphasize fine structure |
| medium | 100 | 0.1 | Balance local and global |
| large | 150 | 0.2 | Stabilize global structure |
These defaults are heuristics, not guarantees.
Comparison with other methods
Parameter screening is implemented for multiple DR methods, each exposing different structural trade-offs:
| Method | Screened parameters | Primary structure |
|---|---|---|
| t-SNE | perplexity, learning_rate | Local |
| UMAP | n_neighbors, min_dist | Local + global |
| TRIMAP | n_inliers, weight_temp | Global |
| PaCMAP | FP_ratio, MN_ratio | Multi-scale |
| LocalMAP | FP_ratio, n_neighbors | Strong local |
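The table above can also be read as a small lookup structure. The sketch below transcribes it into a dictionary and checks a user-supplied grid against it. The `check_grid` helper is hypothetical and only illustrates how such a table might be used to catch misnamed parameters before a screening run.

```python
# Screened parameters per method, transcribed from the table above.
SCREENED_PARAMS = {
    "t-SNE": ("perplexity", "learning_rate"),
    "UMAP": ("n_neighbors", "min_dist"),
    "TRIMAP": ("n_inliers", "weight_temp"),
    "PaCMAP": ("FP_ratio", "MN_ratio"),
    "LocalMAP": ("FP_ratio", "n_neighbors"),
}


def check_grid(method, grid):
    """Raise ValueError if `grid` names a parameter not screened for `method`."""
    allowed = set(SCREENED_PARAMS[method])
    unknown = sorted(set(grid) - allowed)
    if unknown:
        raise ValueError(f"{method} does not screen: {', '.join(unknown)}")
    return True


# Hypothetical grid values, used only to exercise the check.
print(check_grid("UMAP", {"n_neighbors": [15, 50], "min_dist": [0.1]}))
```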
t-SNE
t-SNE emphasizes local neighborhood preservation and is not designed to faithfully preserve global distances.
perplexity
Controls:
The effective number of neighbors considered for each point.
Interpretation:
Perplexity approximates the scale at which local structure is modeled.
| Perplexity | Effect |
|---|---|
| Low (5–30) | Very local structure, tight clusters |
| Medium (30–100) | Moderate neighborhood scope |
| High (100–200+) | More global coherence, risk of crowding |
Dataset size guidance:
- Small datasets: lower perplexity
- Large datasets: higher perplexity required for stability
Rule of thumb:
perplexity << n_samples
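The rule of thumb can be turned into a simple filter on candidate perplexities. The sketch below drops values that are too large for a given sample count. The cutoff `3 * perplexity < n_samples - 1` is a heuristic assumed for this example; the text above only requires that perplexity be much smaller than the number of samples, and `usable_perplexities` is a hypothetical helper.

```python
def usable_perplexities(candidates, n_samples):
    """Keep only perplexities that are small relative to the sample count.

    Assumption: the cutoff 3 * perplexity < n_samples - 1 is a heuristic
    chosen for this sketch, not a rule taken from kmer-ord.
    """
    limit = (n_samples - 1) / 3
    return [p for p in candidates if p < limit]


# Candidate values matching the per-scale perplexities listed in this document.
candidates = [30, 100, 200]
print(usable_perplexities(candidates, n_samples=250))
print(usable_perplexities(candidates, n_samples=5000))
```

Filtering the grid up front avoids spending time on configurations that the optimizer would reject or that would blur all local structure.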
learning_rate
Controls:
Step size of the gradient descent optimization.
Interpretation:
Affects convergence speed and embedding stability.
| Learning rate | Effect |
|---|---|
| Too small | Slow convergence, poor separation |
| Too large | Instability, distorted structure |
Parameter screening helps identify stable regions.
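A reasonable centre for a learning-rate grid can be derived from the dataset size. The sketch below uses the formula that scikit-learn's `TSNE` applies when `learning_rate="auto"`, namely `max(n_samples / early_exaggeration / 4, 50)`, and brackets it with a few multiplicative factors. Whether kmer-ord uses scikit-learn or this heuristic is an assumption, and the factors are hypothetical.

```python
def auto_learning_rate(n_samples, early_exaggeration=12.0):
    """Heuristic starting point: max(n_samples / early_exaggeration / 4, 50)."""
    return max(n_samples / early_exaggeration / 4.0, 50.0)


def learning_rate_candidates(n_samples, factors=(0.5, 1.0, 2.0)):
    """Bracket the heuristic value with multiplicative factors (hypothetical)."""
    base = auto_learning_rate(n_samples)
    return [base * factor for factor in factors]


print(auto_learning_rate(1_000))
print(learning_rate_candidates(48_000))
```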
TRIMAP
TRIMAP explicitly emphasizes global structure preservation by incorporating triplet constraints.
n_inliers
Controls:
Number of nearest neighbors treated as “inliers” in triplet constraints.
Interpretation:
- Larger values incorporate broader neighborhood information
- Improves global consistency
| n_inliers | Effect |
|---|---|
| Small | Strong local structure |
| Large | Improved global layout |
Dataset size guidance:
- Increase n_inliers with dataset size
weight_temp
Controls:
Relative weighting between inlier and outlier triplets.
Interpretation:
- Lower values emphasize local constraints
- Higher values strengthen global repulsion
| weight_temp | Effect |
|---|---|
| Low (≤0.3) | Local emphasis |
| Medium (0.4–0.6) | Balanced |
| High (>1.0) | Strong global structure |
PaCMAP
PaCMAP is designed to preserve structure at multiple scales simultaneously.
MN_ratio (Mid-near ratio)
Controls:
Proportion of mid-range neighbors relative to nearest neighbors.
Effect:
- Stabilizes intermediate-scale structure
- Prevents over-fragmentation
Usually kept fixed in kmer-ord.
FP_ratio (Further point ratio)
Controls:
Strength of repulsive forces from distant points.
Interpretation:
- Low values → local structure
- High values → improved global separation
| FP_ratio | Effect |
|---|---|
| Low | Compact clusters |
| High | Enhanced global layout |
Dataset size guidance:
- Increase FP_ratio for large datasets
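To make the two ratios concrete, the sketch below converts them into per-point pair counts. It assumes that, as in the reference PaCMAP implementation as far as we understand it, the number of mid-near and further pairs is obtained by multiplying `n_neighbors` by the respective ratio and rounding. The value `n_neighbors=10` and the helper name are assumptions for illustration.

```python
def pacmap_pair_counts(n_neighbors, MN_ratio, FP_ratio):
    """Number of near, mid-near, and further pairs sampled per point."""
    return {
        "neighbors": n_neighbors,
        "mid_near": int(round(n_neighbors * MN_ratio)),
        "further": int(round(n_neighbors * FP_ratio)),
    }


# FP_ratio values matching the per-scale PaCMAP settings in this document;
# n_neighbors=10 is an assumed value.
for fp_ratio in (2, 3, 5):
    print(fp_ratio, pacmap_pair_counts(n_neighbors=10, MN_ratio=0.5, FP_ratio=fp_ratio))
```

Under this reading, raising FP_ratio means each point is pushed away from more distant points, which is why higher values favour global separation.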
LocalMAP
LocalMAP is a PaCMAP-derived method optimized for local neighborhood fidelity.
n_neighbors
Controls:
Neighborhood size used to define local structure.
Effect:
- Smaller values isolate fine-scale structure
- Larger values smooth embeddings
FP_ratio
Controls:
Global repulsion strength.
Interpretation:
LocalMAP typically uses lower FP_ratio values than PaCMAP to avoid disrupting
local continuity.
| FP_ratio | Effect |
|---|---|
| Low | Strong local preservation |
| High | Increased global separation |
Summary
Parameter screening exposes the structure–scale trade-off inherent to
nonlinear dimensionality reduction. Rather than hiding this complexity,
kmer-ord makes it explicit and reproducible.
Treat embeddings as exploratory models, not definitive representations.
scale in kmer-ord
| Method | Scale | Hyperparameters |
|---|---|---|
| UMAP | default | n_neighbors=15, min_dist=0.1 |
| UMAP | small | n_neighbors=30, min_dist=0.05 |
| UMAP | medium | n_neighbors=100, min_dist=0.1 |
| UMAP | large | n_neighbors=150, min_dist=0.2 |
| t-SNE | default | init=pca, random_state=42 |
| t-SNE | small | perplexity=30, init=pca, random_state=42 |
| t-SNE | medium | perplexity=100, init=pca, random_state=42 |
| t-SNE | large | perplexity=200, init=pca, random_state=42 |
| TRIMAP | default | n_inliers=10, weight_temp=0.5 |
| TRIMAP | small | n_inliers=50, weight_temp=0.3 |
| TRIMAP | medium | n_inliers=100, weight_temp=0.4 |
| TRIMAP | large | n_inliers=150, weight_temp=0.5 |
| PaCMAP | default | MN_ratio=0.5, FP_ratio=2 |
| PaCMAP | small | MN_ratio=0.5, FP_ratio=2 |
| PaCMAP | medium | MN_ratio=0.5, FP_ratio=3 |
| PaCMAP | large | MN_ratio=0.5, FP_ratio=5 |
| LocalMAP | default | MN_ratio=0.5, FP_ratio=0.5 |
| LocalMAP | small | MN_ratio=0.3, FP_ratio=0.5 |
| LocalMAP | medium | MN_ratio=0.5, FP_ratio=0.7 |
| LocalMAP | large | MN_ratio=0.7, FP_ratio=1.0 |
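For scripted workflows, the same table can be expressed as a nested dictionary with a small lookup helper. The values below are transcribed from the table above. The `resolve_params` function and its `overrides` argument are hypothetical and do not describe kmer-ord's internal API.

```python
# Per-scale settings transcribed from the table above.
SCALE_PARAMS = {
    "UMAP": {
        "default": {"n_neighbors": 15, "min_dist": 0.1},
        "small": {"n_neighbors": 30, "min_dist": 0.05},
        "medium": {"n_neighbors": 100, "min_dist": 0.1},
        "large": {"n_neighbors": 150, "min_dist": 0.2},
    },
    "t-SNE": {
        "default": {"init": "pca", "random_state": 42},
        "small": {"perplexity": 30, "init": "pca", "random_state": 42},
        "medium": {"perplexity": 100, "init": "pca", "random_state": 42},
        "large": {"perplexity": 200, "init": "pca", "random_state": 42},
    },
    "TRIMAP": {
        "default": {"n_inliers": 10, "weight_temp": 0.5},
        "small": {"n_inliers": 50, "weight_temp": 0.3},
        "medium": {"n_inliers": 100, "weight_temp": 0.4},
        "large": {"n_inliers": 150, "weight_temp": 0.5},
    },
    "PaCMAP": {
        "default": {"MN_ratio": 0.5, "FP_ratio": 2},
        "small": {"MN_ratio": 0.5, "FP_ratio": 2},
        "medium": {"MN_ratio": 0.5, "FP_ratio": 3},
        "large": {"MN_ratio": 0.5, "FP_ratio": 5},
    },
    "LocalMAP": {
        "default": {"MN_ratio": 0.5, "FP_ratio": 0.5},
        "small": {"MN_ratio": 0.3, "FP_ratio": 0.5},
        "medium": {"MN_ratio": 0.5, "FP_ratio": 0.7},
        "large": {"MN_ratio": 0.7, "FP_ratio": 1.0},
    },
}


def resolve_params(method, scale="default", overrides=None):
    """Return a copy of the settings for `method` at `scale`, with overrides applied."""
    params = dict(SCALE_PARAMS[method][scale])
    params.update(overrides or {})
    return params


print(resolve_params("UMAP", "large"))
print(resolve_params("t-SNE", "small", overrides={"perplexity": 50}))
```

Returning a copy keeps the table itself unchanged when a caller overrides individual values.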