Advanced Usage

This pae covers advanced configuration and modular usage of kmer-ord, including fine-tuning the pipeline for your datasets or needs, as well as building custom workflows

Projection pipeline (`project`)

Basic usage

kmer-ord project \
  -i reads.fastq.gz \
  -o output_dir

Example

kmer-ord project \
  -i reads.fastq.gz \
  -o output_dir \
  -k 6 \
  -t 8 \
  --dr umap,localmap,pacmap \
  --norm clr \
  --dims 2 \
  --pca-pre \
  --keep-variance 0.9 \
  --scale large \
  --screen_params

key options

input/output

Option	Description
`-i, --input`	FASTA/FASTQ input (supports `.gz`)
`-o, --output`	Output directory
`-f, --force`	Recompute all steps

k-mer configuration

Option	Description
`-k, --kmer`	K-mer length (default: 6)

Dimensionality Reduction (DR)

Option	Description
`--dr`	Methods (comma-separated, e.g. `umap,tsne`)
`-d, --dims`	Output dimensions (2 or 3 typical)
`--scale`	Preset tuning (`auto`, `small`, `medium`, `large`)
`--screen_params`	Parameter sweep for DR

Preprocessing

Option	Description
`--norm`	Normalisation (`raw`, `relative`, `log`, `clr`, `zscore`)
`--pca-pre`	Apply PCA before DR
`--keep-pcs`	Fixed number of PCs
`--keep-variance`	Retain variance threshold (e.g. `0.9`)

Performance

Option	Description
`-t, --threads`	Number of threads

Notes

Multiple DR methods will generate multiple embeddings. Specify multiple DR methods as comma-separated string (--dr umap,localmap,pacmap)
PCA preprocessing is entirly optional, but usually not needed
--scale parameter can be used to apply dataset size dependend hyperparameters for DR methods. --scale auto adapts hyperparameters to dataset size automatically.

Clustering pipeline (`cluster`)

The cluster command perfoms high-dimensional embedding followed by clustering.

Example

kmer-ord cluster \
  -i reads.fastq.gz \
  -o output_dir \
  --dr umap \
  --dims 15 \
  --cluster hdbscan,leiden \
  --hdbscan-sweep \
  --threads 8

Clustering methods

Method	Description
`hdbscan`	Density-based clustering (robust default)
`leiden`	Graph-based clustering
`dbscan`	Density clustering with epsilon

parameter sweeps

Option	Description
`--hdbscan-sweep`	Explore `min_cluster_size`
`--leiden-sweep`	Explore resolution
`--dbscan-sweep`	Explore `eps` values

output

high-dimensional embeddings
cluster assignments
SQLite/SpatiaLite database

Visualisation (`visualise`)

Generate plots from a database

kmer-ord visualise \
  -d results/kmer-ord.sqlite \
  --embedding-mode all \
  --max-categories 15

Options

Option	Description
`--embeddings / --no-embeddings`	Toggle embedding plots
`--features / --no-features`	Toggle feature plots
`--embedding-mode`	`density`, `categorical`, `continuous`, `all`
`--max-categories`	Limit categorical plot complexity

Output

All plots are saved to:

<output_dir>/plots/

Modular Workflow

Each pipeline step can be run independently

fastq → fasta

kmer-ord fastq-to-fasta -i reads.fastq.gz -o reads.fasta

Sequence statistics

kmer-ord fasta-stats -i reads.fasta -o results/

k-mer counting

kmer-ord kmer-count -i reads.fasta -o results/ -k 6 -t 8

k-mer metrics

kmer-ord kmer-metrics -i kmer_matrix.tsv -o results/

Dimensionality reduction

kmer-ord dr \
  -i kmer_matrix.tsv \
  -o results/ \
  -m umap,tsne \
  --norm clr \
  --dims 2

Clustering

kmer-ord clustering \
  -i results/ \
  -o results/ \
  --method hdbscan

Tiara classification

kmer-ord run-tiara -i reads.fasta -o results/ -t 8

Database construction

kmer-ord build-db -i results/ -o results/