Projection pipeline

The project command is the main entry point for generating read embeddings from raw sequence data.

It performs:

  • k-mer counting
  • normalisation
  • dimensionality reduction (DR)
  • optional parameter screening


Basic usage

kmer-ord project -i reads.fastq.gz -o output_dir

Example

kmer-ord project \
  -i reads.fastq.gz \
  -o output_dir \
  -k 6 \
  -t 8 \
  --dr umap,localmap,pacmap \
  --norm clr \
  --dims 2 \
  --pca-pre \
  --keep-variance 0.9 \
  --scale large \
  --screen_params

What this produces

  • k-mer count table
  • embedding coordinates (e.g. UMAP, PaCMAP)
  • feature table
  • SQLite database (used for visualisation and interactive binning)

Info

  • Use multiple DR methods for robustness
  • --scale large works well for most datasets
  • PCA preprocessing is usually not required

Some extra guidance

1. k-mer count table

kmer-ord project produces a k-mer count table as a tab-separated matrix:

<output_dir>/kmer/<basename>_<k>mer_matrix.tsv

For example, reads_6mer_matrix.tsv:

Sequence_ID   AAA...   AAC...   AAG...   ...
read_001      12       0        4
read_002      7        2        1

  • Rows correspond to samples (reads, contigs, or assemblies)
  • Columns correspond to canonicalised k-mers
  • Values are raw k-mer counts
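
As a rough illustration of what "canonicalised" means here, the following Python sketch counts canonical k-mers for a single sequence. It assumes that a canonical k-mer is the lexicographically smaller of a k-mer and its reverse complement, and that windows containing non-ACGT characters are skipped. The function names are illustrative and are not part of kmer-ord; the tool's own counting code may differ.

```python
from collections import Counter
from itertools import product

# Translation table used to complement DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(seq: str, k: int) -> Counter:
    """Count canonical k-mers in one sequence, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def canonical_columns(k: int) -> list[str]:
    """All canonical k-mers in sorted order (one possible column layout)."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


if __name__ == "__main__":
    columns = canonical_columns(3)
    row = count_canonical_kmers("ACGTACGTTT", 3)
    # Print a header line and one row in the same tab-separated layout as above.
    print("Sequence_ID", *columns, sep="\t")
    print("read_001", *(row[c] for c in columns), sep="\t")
```

Collapsing each k-mer with its reverse complement roughly halves the number of columns and makes the counts independent of which strand a read came from.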

For more details on:

  • k-mer counting, see: kmer_counting.py(concepts/kmer_counting.md)
  • kmer-ord project command-line details: project(reference/project.md)

2. Choosing normalisation

k-mer count matrices are compositional, as total counts depend on sequence length. Normalisation is therefore recommended.

By default, Centered Log-Ratio (CLR) normalisation is applied. Alternative strategies can be specified:

kmer-ord project \
  --input reads.fq.gz \
  --methods umap \
  --normalisation clr,tss

Each normalisation method is applied independently.
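
To make the two options concrete, here is a minimal pure-Python sketch of both transforms applied to one row of counts. It assumes that tss denotes total-sum scaling and that a pseudocount is added before taking logs in CLR so that zero counts are defined. The pseudocount value and the function names are assumptions made for illustration, not kmer-ord's documented behaviour.

```python
import math


def tss(counts: list[float]) -> list[float]:
    """Total-sum scaling: divide each count by the row total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot scale an all-zero row")
    return [c / total for c in counts]


def clr(counts: list[float], pseudocount: float = 1.0) -> list[float]:
    """Centred log-ratio: log of each value minus the mean log of the row.

    The pseudocount is an assumption of this sketch; it keeps log(0) out of the
    calculation when a k-mer is absent from a sequence.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


if __name__ == "__main__":
    row = [12, 0, 4]  # counts for read_001 in the example table
    print(tss(row))
    print(clr(row))
```

CLR values in a row always sum to zero, and without a pseudocount they are unchanged when every count in the row is multiplied by the same factor, which is what removes the dependence on sequence length.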

For more detail on normalisation: - Concepts: Compositional data - kmer-ord project command-line details: project(reference/project.md)


3. PCA pre-reduction (optional)

For high-dimensional matrices, PCA can be applied prior to nonlinear embedding:

kmer-ord project \
  --input reads.fq.gz \
  --methods umap \
  --pca_dim_red \
  --keep_variance 0.9

This reduces dimensionality while retaining 90% of the variance before embedding.
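
The variance threshold can be read as a rule for choosing how many principal components to keep. The sketch below shows that rule in isolation, assuming the per-component variances (eigenvalues) are already available. The function name is hypothetical, and how kmer-ord computes the PCA internally is not specified here.

```python
def components_to_keep(variances: list[float], keep_variance: float = 0.9) -> int:
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches keep_variance."""
    if not 0 < keep_variance <= 1:
        raise ValueError("keep_variance must be in (0, 1]")
    ordered = sorted(variances, reverse=True)  # largest variance first
    total = sum(ordered)
    cumulative = 0.0
    for n, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= keep_variance:
            return n
    return len(ordered)  # fallback for floating-point rounding at 1.0


if __name__ == "__main__":
    # Hypothetical component variances, for illustration only.
    print(components_to_keep([5.0, 3.0, 1.0, 1.0], keep_variance=0.85))
```

For comparison, scikit-learn's PCA applies the same kind of rule when n_components is given as a fraction between 0 and 1 with the full SVD solver.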


4. Running DR

kmer-ord project then runs DR. Multiple DR methods can be specified.

Run a single method:

kmer-ord project \
  -i reads.fastq.gz \
  -o output_dir \
  --dr umap

Run multiple methods:

kmer-ord project \
  -i reads.fastq.gz \
  -o output_dir \
  --dr umap,localmap,pacmap,tsne

Run all supported methods:

kmer-ord project \
  -i reads.fastq.gz \
  -o output_dir \
  --dr all

5. Parameter screening

Dimensionality reduction is sensitive to hyperparameters. kmer-ord project provides automated parameter screening to explore how different hyperparameter choices affect the resulting embeddings. Screening focuses on parameters that control the balance between local and global structure preservation, which becomes particularly important as dataset size increases.

kmer-ord project \
  -i reads.fastq.gz \
  -o output_dir \
  -k 6 \
  --dr umap,localmap,pacmap \
  --screen_params

When enabled, the script evaluates multiple parameter combinations and writes each embedding to disk.

  • t-SNE screens multiple perplexity and learning-rate values
  • UMAP screens multiple n_neighbors and min_dist values
  • TRIMAP screens multiple n_inliers and weight_temp values
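
Conceptually, screening amounts to enumerating a grid of values per method and producing one embedding per combination. The sketch below shows only the enumeration step. The parameter names follow the list above, but the candidate values are hypothetical placeholders, not the grids kmer-ord actually uses.

```python
from itertools import product

# Hypothetical candidate values; the real screening grids are not given here.
SCREEN_GRID = {
    "tsne": {"perplexity": [10, 30, 50], "learning_rate": [100, 200]},
    "umap": {"n_neighbors": [15, 50], "min_dist": [0.0, 0.1, 0.5]},
    "trimap": {"n_inliers": [10, 20], "weight_temp": [0.5, 1.0]},
}


def parameter_combinations(method: str) -> list[dict]:
    """Return every combination of the screened parameters for one method."""
    grid = SCREEN_GRID[method]
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


if __name__ == "__main__":
    for method in SCREEN_GRID:
        combos = parameter_combinations(method)
        print(method, len(combos), "combinations, e.g.", combos[0])
```

Because the number of embeddings is the product of the grid sizes, even small grids multiply the run time, which is why screening is best kept for exploratory runs.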

Note

  • Screening is computationally expensive for large datasets
  • Parameter screening is intended for exploratory analysis
  • For routine analysis, scale-dependent presets (via --scale) provide reasonable hyperparameter settings without exhaustive screening.
  • Parameter screening, see: concepts/parameter_screening.md
  • kmer-ord project command-line details: project(reference/project.md)

6. Hyperparameter presets by dataset scale

In addition to explicit parameter screening, kmer-ord project provides predefined hyperparameter presets that adapt DR methods to different dataset sizes. Use --scale auto:

python kmer-ord.py \
  --input kmer_matrix.tsv \
  --methods umap \
  --scale auto

Parameter       Flag      Type     Default   Description
Dataset scale   --scale   string   default   Select a predefined hyperparameter preset

Allowed values are default, small, medium, large, and auto.

Each preset maps to method-specific hyperparameters chosen to suit the corresponding dataset size (e.g. n_neighbors for UMAP, perplexity for t-SNE, or FP_ratio for PaCMAP and LocalMAP).

When --scale auto is used, the scale category is inferred from the number of sequences in the input matrix:

  • small: fewer than 100,000 sequences
  • medium: 100,000 to 1,000,000 sequences
  • large: more than 1,000,000 sequences
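
The thresholds above translate directly into a small lookup. The following sketch mirrors them, treating both 100,000 and 1,000,000 as medium, as the list states. The function name is illustrative and is not taken from kmer-ord.

```python
def infer_scale(n_sequences: int) -> str:
    """Map a sequence count to a scale category using the documented thresholds."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    if n_sequences < 100_000:
        return "small"
    if n_sequences <= 1_000_000:
        return "medium"
    return "large"


if __name__ == "__main__":
    for n in (5_000, 250_000, 3_000_000):
        print(n, infer_scale(n))
```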

Note

When screening is enabled via --screen_params, the preset values selected by --scale are ignored; all relevant parameter combinations are evaluated and saved.

