Skip to content

kmer-ord project

Convert sequences (FASTQ/FASTA) into k-mer feature space, compute sequence-level metrics, and generate a low-dimensional (2D/3D) embedding that captures geometric relationships in k-mer space. Results are stored in the database for dowstream exploration and annotation.

fastq -> fasta -> sequence stats -> kmer-counting -> DR -> database

Usage

kmer-ord project [OPTIONS]

Options

Input / Output

Parameter Flag Type Default Description
Input file -i, --input path required Input FASTA/FASTQ file (supports .gz)
Output directory -o, --output path required Directory where all results are written
Force recomputation -f, --force bool False Overwrite existing outputs and recompute all steps

K-mer

Parameter Flag Type Default Description
K-mer length -k, --kmer int 6 Length of k-mers used to construct the feature matrix

Smaller k captures coarse composition, larger k captures finer sequence structure but increases dimensionality.

Dimensionality Reduction

Parameter Flag Type Default Description
DR methods --dr string umap Comma-separated methods (e.g. umap,tsne,pacmap)
Dimensions -d, --dims int 2 Number of output dimensions (typically 2 or 3)
Scale preset -s, --scale string auto Dataset-aware hyperparameter preset (auto, small, medium, large, default)
Parameter screening --screen_params bool False Perform parameter sweeps for supported DR methods

Common methods include: umap, tsne, pacmap, trimap, localmap, pca

Use multiple DR methods to assess robustness of structure across embeddings.

Preprocessing

Parameter Flag Type Default Description
Normalisation --norm string clr Feature normalisation (raw, relative, log, clr, zscore)
PCA pre-reduction --pca-pre bool False Apply PCA before dimensionality reduction
Number of PCs --keep-pcs int None Retain a fixed number of principal components
Variance threshold --keep-variance float None Retain PCs explaining given variance (e.g. 0.9)

Use either --keep-pcs or --keep-variance, not both.

PCA is usually unnecessary, but may help with very high-dimensional or large datasets.

Performance

Parameter Flag Type Default Description
Threads -t, --threads int 4 Number of CPU threads used for computation

Outputs

All outputs are written to the specified --output directory.


Output structure

<output_dir>/
├── kmer/
│   └── <basename>_<k>mer_matrix.tsv
├── matrices/
│   └── <basename>_<k>mer_matrix_<norm>.npy
├── dr/
│   └── <norm>/
│       └── <method>/
│           ├── <embedding files>
│           └── parameter_screen/ (optional)
└── kmerord.sqlite

1. k-mer count table

/kmer-ord project produces a k-mer count table as tab-separated matrix:

<output_dir>/kmer/<basename>_<k>mer_matrix.tsv
reads_6mer_matrix.tsv
1
2
3
Sequence_ID   AAA...   AAC...   AAG...   ...
read_001     12        0         4
read_002     7         2         1
  • Columns: canonical k-mers (lexicographically sorted)
  • Rows: sequences from the input FASTA
  • Values: uint32 counts

2. Preprocessed matrices

For each normalisation method, the preprocessed feature matrix is saved as:

<output_dir>/matrices/<basename>_<k>mer_matrix_<normalisation>.npy

These files store the transformed feature matrices used as input for dimensionality reduction.


3. Projection outputs

Each file contains coordinates for all reads in the selected embedding space.

Single embedding

Low-dimensional projections are written to:

<output_dir>/dr/<normalisation>/<dr_method>/

for example

results/dr/clr/umap/

reads_6mer_matrix_clr_umap_2D.tsv

reads_6mer_matrix_clr_umap_2D.tsv
1
2
3
4
5
6
sequence_id umap_1  umap_2
read_1  21.173244   7.4076867
read_2  20.183973   8.162238
read_3  22.094797   7.7953444
read_4  21.380056   9.051636
read_5  22.158167   8.300437

Merged embeddings

in case multiple methods were ran together, a merged embedding file is created

<output_dir>/dr/<normalisation>/<basename>_<k>mer_<normalisation>_<d>D_merged_embeddings.tsv

for example

reads_6mer_matrix_clr_2D_merged_embeddings.tsv
1
2
3
4
5
6
sequence_id tsne_1  tsne_2  umap_1  umap_2  localmap_1  localmap_2  pacmap_1    pacmap_2    trimap_1    trimap_2    pca_1   pca_2
read_1  50.68971    -2.4937842  21.173244   7.4076867   15.127128   4.2103953   17.75506    -6.4360504  73.977356   1.4536395   211.96341   3.347329
read_2  63.401493   -4.7595797  20.183973   8.162238    14.252548   -12.957433  18.174234   -2.8368852  79.05072    -18.2425    237.61957   -230.95534
read_3  50.066235   7.268653    22.094797   7.7953444   17.74772    4.29285 19.611523   -6.8463845  63.427784   -2.5171304  247.15794   22.478891
read_4  65.61936    9.432227    21.380056   9.051636    20.69087    -7.443064   21.496445   -2.955438   65.022255   -20.538568  250.27419   116.60115
read_5  54.19407    12.485539   22.158167   8.300437    24.474653   1.8577204   21.308455   -5.744142   61.44304    -10.609265  263.355 60.49537

4. Parameter screening (optional)

If --screen_params is enabled, additional embedding files are generated:

results/dr/clr/umap/parameter_screen

Each file corresponds to a different hyperparameter configuration.

Examples:

reads_6mer_matrix_clr_umap_n5_min1.0_2D.tsv
reads_6mer_matrix_clr_umap_n10_min1.0_2D.tsv
reads_6mer_matrix_clr_umap_n15_min1.0_2D.tsv
reads_6mer_matrix_clr_umap_n10_min1.0_2D.tsv
reads_6mer_matrix_clr_umap_n10_min1.0_2D.tsv

This allows comparison of embedding quality across parameter settings.

5. Database

The pipeline generates a SQLite/SpatiaLite database:

<output_dir>/kmerord.sqlite

See: Database schema


See also: