kmer-ord project
Convert sequences (FASTQ/FASTA) into k-mer feature space, compute sequence-level metrics, and generate a low-dimensional (2D/3D) embedding that captures geometric relationships in k-mer space. Results are stored in a database for downstream exploration and annotation.
fastq -> fasta -> sequence stats -> kmer-counting -> DR -> database
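The first stage of this pipeline, FASTQ to FASTA conversion, can be sketched as follows. This is a minimal illustration, not the project's own code: the function names `open_maybe_gzip`, `fastq_to_fasta`, and `write_fasta` are hypothetical, and the parser assumes well-formed four-line FASTQ records with non-empty read IDs.

```python
import gzip
import io


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fastq_to_fasta(handle):
    """Yield (read_id, sequence) pairs from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality string, dropped for FASTA
        # Keep only the first whitespace-delimited token as the read ID.
        yield header[1:].split()[0], sequence


def write_fasta(records, out):
    """Write (read_id, sequence) pairs in FASTA format."""
    for read_id, sequence in records:
        out.write(f">{read_id}\n{sequence}\n")


# Usage on an in-memory record; a real file would go through open_maybe_gzip().
demo = io.StringIO("@r1 desc\nACGT\n+\nIIII\n")
out = io.StringIO()
write_fasta(fastq_to_fasta(demo), out)
print(out.getvalue(), end="")
```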
Usage
Options
Input / Output
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| Input file | -i, --input | path | required | Input FASTA/FASTQ file (supports .gz) |
| Output directory | -o, --output | path | required | Directory where all results are written |
| Force recomputation | -f, --force | bool | False | Overwrite existing outputs and recompute all steps |
K-mer
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| K-mer length | -k, --kmer | int | 6 | Length of k-mers used to construct the feature matrix |
Smaller k captures coarse composition; larger k captures finer sequence structure but increases dimensionality, because the number of possible k-mers grows as 4^k.
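To get a feel for how quickly the feature space grows with k, the number of columns can be computed directly. The sketch below assumes the feature columns are canonical k-mers over the ACGT alphabet, where a k-mer and its reverse complement share one column; the function names are illustrative only.

```python
def n_kmers(k):
    """Number of distinct k-mers over the alphabet ACGT."""
    return 4 ** k


def n_canonical_kmers(k):
    """Number of canonical k-mers (a k-mer and its reverse complement count once).

    Only even-length k-mers can equal their own reverse complement; there are
    4**(k/2) of those, and every other k-mer pairs up with a distinct partner.
    """
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + palindromes) // 2


for k in (4, 6, 8, 10):
    print(f"k={k}: {n_kmers(k)} k-mers, {n_canonical_kmers(k)} canonical")
```

Under these assumptions the default k = 6 gives 2080 canonical features, while k = 10 already exceeds half a million.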
Dimensionality Reduction
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| DR methods | --dr | string | umap | Comma-separated methods (e.g. umap,tsne,pacmap) |
| Dimensions | -d, --dims | int | 2 | Number of output dimensions (typically 2 or 3) |
| Scale preset | -s, --scale | string | auto | Dataset-aware hyperparameter preset (auto, small, medium, large, default) |
| Parameter screening | --screen_params | bool | False | Perform parameter sweeps for supported DR methods |
Common methods include: umap, tsne, pacmap, trimap, localmap, pca
Use multiple DR methods to assess robustness of structure across embeddings.
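One simple way to quantify that robustness is to check how well local neighbourhoods are preserved between two embeddings of the same reads. The sketch below is not part of kmer-ord; it is a brute-force, standard-library illustration that assumes both embeddings list the same reads in the same order and contain at least two points.

```python
import math


def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbours (brute force)."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours.append({j for _, j in dists[:k]})
    return neighbours


def mean_knn_overlap(emb_a, emb_b, k=5):
    """Mean Jaccard overlap of k-nearest-neighbour sets between two embeddings."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must contain the same reads in the same order")
    sets_a, sets_b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return sum(scores) / len(scores)


# Two toy embeddings of four reads: the second swaps reads 1 and 2.
emb_umap = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
emb_tsne = [(0.0, 0.0), (10.0, 0.0), (1.0, 0.0), (11.0, 0.0)]
print(mean_knn_overlap(emb_umap, emb_umap, k=1))
print(mean_knn_overlap(emb_umap, emb_tsne, k=1))
```

A score near 1 means both methods place the same reads next to each other; a score near 0 suggests the apparent structure depends on the method. The brute-force search is quadratic in the number of reads, so it is only practical for small subsets.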
Preprocessing
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| Normalisation | --norm | string | clr | Feature normalisation (raw, relative, log, clr, zscore) |
| PCA pre-reduction | --pca-pre | bool | False | Apply PCA before dimensionality reduction |
| Number of PCs | --keep-pcs | int | None | Retain a fixed number of principal components |
| Variance threshold | --keep-variance | float | None | Retain PCs explaining given variance (e.g. 0.9) |
Use either --keep-pcs or --keep-variance, not both.
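A constraint like this is commonly enforced with an `argparse` mutually exclusive group. The sketch below only illustrates that pattern with the flag names from the table above; it is an assumption about how such a check could look, not the project's actual argument parser, and the program name `kmer-ord-sketch` is made up.

```python
import argparse


def build_parser():
    """Build a toy parser in which --keep-pcs and --keep-variance exclude each other."""
    parser = argparse.ArgumentParser(prog="kmer-ord-sketch")
    parser.add_argument("--pca-pre", action="store_true",
                        help="apply PCA before dimensionality reduction")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--keep-pcs", type=int, default=None,
                       help="retain a fixed number of principal components")
    group.add_argument("--keep-variance", type=float, default=None,
                       help="retain PCs explaining this fraction of variance")
    return parser


args = build_parser().parse_args(["--pca-pre", "--keep-variance", "0.9"])
print(args.keep_pcs, args.keep_variance)  # None 0.9
```

Passing both flags to such a parser makes `argparse` print an error and exit instead of silently preferring one of them.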
PCA is usually unnecessary, but may help with very high-dimensional or large datasets.
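For reference, the default normalisation (clr, the centred log-ratio) can be sketched for a single row of k-mer counts as below. The pseudocount used to avoid log(0) is an assumption made for this illustration; how kmer-ord itself handles zero counts is not specified here.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centred log-ratio transform of one row of k-mer counts.

    Each value becomes log(x_i) minus the mean of log(x) over the row.
    The pseudocount (an assumption of this sketch) avoids log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


row = [3, 0, 7, 1]
transformed = clr(row)
print([round(v, 3) for v in transformed])
print(round(sum(transformed), 12))  # CLR values of a row sum to zero (up to rounding)
```

Because CLR works on log-ratios, it removes the effect of total count per sequence, so reads of different lengths become comparable in composition.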
Performance
| Parameter | Flag | Type | Default | Description |
|---|---|---|---|---|
| Threads | -t, --threads | int | 4 | Number of CPU threads used for computation |
Outputs
All outputs are written to the specified --output directory.
Output structure
<output_dir>/
├── kmer/
│   └── <basename>_<k>mer_matrix.tsv
├── matrices/
│   └── <basename>_<k>mer_matrix_<norm>.npy
├── dr/
│   └── <norm>/
│       └── <method>/
│           ├── <embedding files>
│           └── parameter_screen/ (optional)
└── kmerord.sqlite
1. k-mer count table
kmer-ord produces a k-mer count table as a tab-separated matrix:
- Columns: canonical k-mers (lexicographically sorted)
- Rows: sequences from the input FASTA
- Values: uint32 counts
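The layout described above can be illustrated with a small counting sketch. It assumes the usual definition of a canonical k-mer (the lexicographically smaller of a k-mer and its reverse complement) and skips windows containing non-ACGT characters; both choices, and all function names, are assumptions of this example rather than documented kmer-ord behaviour.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(sequence, k):
    """Count canonical k-mers in one sequence, skipping windows with non-ACGT characters."""
    counts = Counter()
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def to_row(counts, columns):
    """Arrange counts into a matrix row following a fixed, sorted column order."""
    return [counts.get(column, 0) for column in columns]


counts = count_canonical_kmers("ACGT", 2)
print(to_row(counts, ["AA", "AC", "CG"]))  # [0, 2, 1]
```

In the example, the windows AC and GT collapse into the same canonical column AC, which is why that column holds a count of 2.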
2. Preprocessed matrices
For each normalisation method, the preprocessed feature matrix is saved as matrices/<basename>_<k>mer_matrix_<norm>.npy.
These files store the transformed feature matrices used as input for dimensionality reduction.
3. Projection outputs
Each file contains coordinates for all reads in the selected embedding space.
Single embedding
Low-dimensional projections are written to dr/<norm>/<method>/, for example:
reads_6mer_matrix_clr_umap_2D.tsv
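When scripting against these outputs, it can help to build the expected path from its components. The helper below is hypothetical, and the filename pattern `<basename>_<k>mer_matrix_<norm>_<method>_<dims>D.tsv` is inferred from the example filename and the output structure above rather than from a documented specification.

```python
from pathlib import Path


def embedding_path(output_dir, basename, k, norm, method, dims):
    """Build the expected path of a single-method embedding file (inferred naming)."""
    filename = f"{basename}_{k}mer_matrix_{norm}_{method}_{dims}D.tsv"
    return Path(output_dir) / "dr" / norm / method / filename


path = embedding_path("results", "reads", 6, "clr", "umap", 2)
print(path.as_posix())  # results/dr/clr/umap/reads_6mer_matrix_clr_umap_2D.tsv
```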
Merged embeddings
If multiple methods were run together, a merged embedding file is created,
for example
4. Parameter screening (optional)
If --screen_params is enabled, additional embedding files are generated under dr/<norm>/<method>/parameter_screen/.
Each file corresponds to a different hyperparameter configuration.
Examples:
reads_6mer_matrix_clr_umap_n5_min1.0_2D.tsv
reads_6mer_matrix_clr_umap_n10_min1.0_2D.tsv
reads_6mer_matrix_clr_umap_n15_min1.0_2D.tsv
This allows comparison of embedding quality across parameter settings.
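To compare screening results programmatically, the hyperparameter values can be recovered from the filenames. The regular expression below is inferred from the example names above and is only a sketch; in particular, reading `n` and `min` as UMAP's `n_neighbors` and `min_dist` is an assumption, and other DR methods may use different tokens.

```python
import re

# Pattern inferred from names such as reads_6mer_matrix_clr_umap_n5_min1.0_2D.tsv
SCREEN_NAME = re.compile(
    r"^(?P<basename>.+)_(?P<k>\d+)mer_matrix_(?P<norm>[^_]+)_(?P<method>[^_]+)"
    r"_n(?P<n>\d+)_min(?P<min>\d+(?:\.\d+)?)_(?P<dims>\d+)D\.tsv$"
)


def parse_screen_name(filename):
    """Split a screening filename into its components, or return None if it does not match."""
    match = SCREEN_NAME.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "basename": fields["basename"],
        "k": int(fields["k"]),
        "norm": fields["norm"],
        "method": fields["method"],
        "n": int(fields["n"]),
        "min": float(fields["min"]),
        "dims": int(fields["dims"]),
    }


names = [
    "reads_6mer_matrix_clr_umap_n15_min1.0_2D.tsv",
    "reads_6mer_matrix_clr_umap_n5_min1.0_2D.tsv",
    "reads_6mer_matrix_clr_umap_n10_min1.0_2D.tsv",
]
for info in sorted(map(parse_screen_name, names), key=lambda d: d["n"]):
    print(info["method"], info["n"], info["min"])
```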
5. Database
The pipeline generates a SQLite/SpatiaLite database, kmerord.sqlite, in the output directory.
See: Database schema
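A quick way to see what the database contains, without knowing the schema in advance, is to list its tables with Python's built-in `sqlite3` module. This is a generic inspection sketch with a hypothetical helper name; it makes no assumptions about kmer-ord's table names. Plain table listing should not need the SpatiaLite extension, although spatial queries would.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite database, sorted alphabetically.

    Note: sqlite3.connect() creates an empty database if db_path does not exist.
    """
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Example call; the path assumes the output directory layout shown earlier:
# print(list_tables("results/kmerord.sqlite"))
```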