kmer-ord cluster
Perform high-dimensional embedding followed by unsupervised clustering.
- construct a high-dimensional representation of k-mer space
- detect intrinsic structure using clustering algorithms (DBSCAN, HDBSCAN, Leiden)
- integrate results into an existing SQLite/SpatiaLite database
Because clustering results are written to an existing SQLite database, we suggest running kmer-ord project first. The clustering results can then be visualised on 2D/3D graphs for inspection.
Usage
Input / Output
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| Input file | -i, --input | path | — | FASTA/FASTQ input (supports .gz) |
| Output directory | -o, --output | path | — | Directory for all outputs |
| Force overwrite | -f, --force | bool | False | Recompute outputs even if they already exist |
| Database path | --db | path | None | Optional existing database to append results |
k-mer
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| K-mer size | -k, --kmer | int | 6 | Length of k-mers |
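To make the role of the k-mer size concrete, the sketch below counts overlapping k-mers in a single sequence and returns a fixed-length profile with one entry per possible k-mer (4^k entries, so 4096 for the default k = 6). It is an illustration only, not kmer-ord's implementation: the function name `kmer_profile` is hypothetical, and the sketch assumes that k-mers containing characters other than A, C, G, T are ignored and that reverse complements are not merged.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=6):
    """Return counts of every possible DNA k-mer, in lexicographic order.

    Assumption (not taken from kmer-ord): windows containing anything
    other than A, C, G or T are simply not counted.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed vocabulary of 4**k k-mers gives every sequence the same dimensions.
    vocabulary = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts.get(kmer, 0) for kmer in vocabulary]


profile = kmer_profile("ACGTACGT", k=6)
print(len(profile), sum(profile))
```

Larger values of k give a finer-grained profile, but the number of dimensions grows as 4^k.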
Embedding options
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| DR method(s) | --dr | string | umap | Comma-separated methods (e.g. umap,tsne) |
| Embedding dimensions | -d, --dims | int | 15 | High-dimensional embedding size |
| Dataset scale preset | -s, --scale | string | auto | Hyperparameter preset (auto, small, medium, large, default) |
| Normalisation | --norm | string | clr | Normalisation method (raw, relative, log, clr, zscore) |
| PCA pre-processing | --pca-pre | bool | False | Apply PCA before embedding |
| Number of PCs | --keep-pcs | int | None | Fixed number of principal components |
| Variance threshold | --keep-variance | float | None | Retain fraction of variance (e.g. 0.9) |
| Parameter screening | --screen_params | bool | False | Sweep DR hyperparameters |
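The default normalisation, clr, is the centred log-ratio transform: each count is log-transformed and the mean of the logs for that sequence is subtracted. The sketch below shows the transform on one k-mer count vector. It is a minimal illustration rather than the tool's own code: the function name `clr` and the `pseudocount` parameter are assumptions, since how kmer-ord handles zero counts is not stated here.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centred log-ratio transform of one count vector.

    The pseudocount (an assumption of this sketch) avoids log(0)
    for k-mers that never occur in the sequence.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log centres the vector around zero.
    return [value - mean_log for value in logs]


transformed = clr([3, 0, 1, 12])
print([round(v, 3) for v in transformed])
```

By construction, the transformed values of each sequence sum to zero, which removes the effect of sequence length on the profile.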
Clustering options
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| Clustering methods | --cluster | string | hdbscan | Comma-separated methods (hdbscan,leiden,dbscan) |
| HDBSCAN sweep | --hdbscan-sweep | bool | False | Explore min_cluster_size |
| Leiden sweep | --leiden-sweep | bool | False | Explore resolution parameter |
| DBSCAN sweep | --dbscan-sweep | bool | False | Explore eps values |
Performance
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| Threads | -t, --threads | int | 4 | Number of CPU threads |
Supported clustering methods
| Method | Description |
|---|---|
| hdbscan | Density-based clustering (robust default) |
| leiden | Graph-based community detection |
| dbscan | Density clustering with fixed radius |
Start with HDBSCAN — it requires minimal tuning and handles noise well.
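To illustrate what "density clustering with fixed radius" means, here is a small pure-Python sketch of the textbook DBSCAN procedure. It is not the implementation used by kmer-ord, and it is far too slow for real datasets; the names `dbscan`, `eps` and `min_samples` follow common usage and are assumptions here. A point with at least `min_samples` neighbours (itself included) within distance `eps` starts or extends a cluster, and points that belong to no cluster receive the label -1.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for noise (naive O(n^2) sketch)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # All points within the fixed radius eps, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nearby = neighbours(i)
        if len(nearby) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nearby if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby_j = neighbours(j)
            if len(nearby_j) >= min_samples:
                queue.extend(nearby_j)  # core point: keep growing the cluster
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=3))
```

Because a single `eps` applies to the whole dataset, clusters of very different densities are hard to recover with one setting.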
Output
- embedding coordinates
- clustering assignments
- SQLite/SpatiaLite database
<output_dir>/
├── kmer/
├── matrices/
├── dr/
│ └── <norm>/<method>/
├── clustering/
│ └── *.tsv
└── kmer-ord.sqlite
A cluster label of -1 typically indicates noise (unassigned points).
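When inspecting a clustering TSV, it is often useful to check how many sequences ended up as noise. The sketch below summarises cluster sizes and counts the -1 labels. The column names `sequence_id` and `cluster`, the helper `summarise_labels`, and the example rows are all hypothetical; check the header of the actual files under clustering/ before adapting it.

```python
import csv
import io
from collections import Counter


def summarise_labels(handle, label_column="cluster"):
    """Return (cluster sizes, number of noise points) from a TSV handle.

    label_column is a hypothetical column name, not taken from kmer-ord.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    sizes = Counter(int(row[label_column]) for row in reader)
    noise = sizes.pop(-1, 0)  # label -1 marks unassigned points
    return sizes, noise


# Made-up example data standing in for a real clustering/*.tsv file.
example = io.StringIO(
    "sequence_id\tcluster\n"
    "read1\t0\n"
    "read2\t0\n"
    "read3\t1\n"
    "read4\t-1\n"
)
sizes, noise = summarise_labels(example)
print(dict(sizes), noise)
```

A large noise fraction can be a hint that the clustering parameters deserve a second look, for example via one of the sweep options.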
Database integration
All embeddings and clustering results are automatically stored in the database:
- Clustering table (merged results)
See: Database schema
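Since the output is a regular SQLite file, its contents can be explored with Python's built-in `sqlite3` module. The sketch below lists the tables in a database without assuming anything about their layout. The in-memory database and its `clustering` table (with columns `sequence_id` and `hdbscan`) are placeholders invented for the example; for real data, connect to the kmer-ord.sqlite file in the output directory and consult the database schema for the actual table and column names.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an SQLite database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


# Placeholder database; the table layout here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clustering (sequence_id TEXT, hdbscan INTEGER)")
conn.execute("INSERT INTO clustering VALUES ('read1', 0), ('read2', -1)")

print(list_tables(conn))
```

Listing the tables first is a cautious way to discover what a given run has written before querying individual columns.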
When to use cluster
Use this command when you want to:
- identify structure in large datasets
- explore unsupervised binning
We highly recommend running kmer-ord project first, which allows clustering results to be visualised in 2D/3D.
Clustering predictions can be inspected and refined using kmer-ord project. Although unsupervised clustering may in some cases yield a faithful partition of the data, we recommend careful curation.
See also: