Skip to content

kmer-ord cluster

Perform high-dimensional embedding followed by unsupervised clustering.

  • construct a high-dimensional representation of k-mer space
  • detect intrinsic structure using clustering algorithms (DBSCAN, HDBSCAN, Leiden)
  • integrate results into an existing SQLite/SpatiaLite database

Given that the clustering results are integrated in an existing SQLite database, we suggest to run kmer-ord project first. This allows clustering results to be visualised on 2D/3D grahps for inspection.


Usage

kmer-ord cluster [OPTIONS]

Input / Output

Parameter Flag Type Default Description
Input file -i, --input path FASTA/FASTQ input (supports .gz)
Output directory -o, --output path Directory for all outputs
Force overwrite -f, --force bool False Recompute outputs even if they already exist
Database path --db path None Optional existing database to append results

k-mer

Parameter Flag Type Default Description
K-mer size -k, --kmer int 6 Length of k-mers

Embedding options

Parameter Flag Type Default Description
DR method(s) --dr string umap Comma-separated methods (e.g. umap,tsne)
Embedding dimensions -d, --dims int 15 High-dimensional embedding size
Dataset scale preset -s, --scale string auto Hyperparameter preset (auto, small, medium, large, default)
Normalisation --norm string clr Normalisation method (raw, relative, log, clr, zscore)
PCA pre-processing --pca-pre bool False Apply PCA before embedding
Number of PCs --keep-pcs int None Fixed number of principal components
Variance threshold --keep-variance float None Retain fraction of variance (e.g. 0.9)
Parameter screening --screen_params bool False Sweep DR hyperparameters

Clustering options

Parameter Flag Type Default Description
Clustering methods --cluster string hdbscan Comma-separated methods (hdbscan,leiden,dbscan)
HDBSCAN sweep --hdbscan-sweep bool False Explore min_cluster_size
Leiden sweep --leiden-sweep bool False Explore resolution parameter
DBSCAN sweep --dbscan-sweep bool False Explore eps values

Performance

Parameter Flag Type Default Description
Threads -t, --threads int 4 Number of CPU threads

Supported clustering methods

Method Description
hdbscan Density-based clustering (robust default)
leiden Graph-based community detection
dbscan Density clustering with fixed radius

Start with HDBSCAN — it requires minimal tuning and handles noise well.


Output

  • embedding coordinates
  • clustering assignments
  • SQLite/SpatiaLite database
<output_dir>/
├── kmer/
├── matrices/
├── dr/
   └── <norm>/<method>/
├── clustering/
   └── *.tsv
└── kmer-ord.sqlite
<output_dir>/clustering/<method>.tsv
sequence_id    hdbscan_cluster
read_001       1
read_002       -1
read_003       2

-1 typically indicates noise (unassigned points)


Database integration

All embeddings and clustering results are automatically stored in the database:

Clustering table (merged results)

See: Database schema


When to use cluster

Use this command when you want to:

  • identify structure in large datasets
  • explore unsupervised binning

We highly recommend to run kmer-ord project first, which will allow visualisation of clustering results in 2D/3D.

Clustering predictions can be inspected and refined using kmer-ord project. Alhough in some cases, unsupervised clustering may yield faithful clustering of the data, we do suggest careful curation.


See also: