Database schema

Overview

kmer-ord stores all processed data in a SQLite database with SpatiaLite extensions.

This database is the central data structure used by:

the project pipeline (creation)
the cluster pipeline (adding clustering results)
the visualise command (plot generation)
the bin interface (interactive exploration)

Why a database?

The database provides:

a single source for all reads and features
efficient querying of large datasets
spatial indexing for fast embedding queries
integration with the interactive Dash app

Tip

You can inspect the database using standard SQLite tools, Python, R etc...

Database structure

The database consists of several core tables:

Table	Description
`fasta`	Raw sequences and headers
`features`	Per-read features
`coordinates`	Spatial embeddings (SpatiaLite geometry)
`embedding_<method>`	Raw embedding tables
`Clustering`	Cluster assignments

1. `fasta` table

Stores raw sequences from input FASTA/FASTQ.

Column	Type	Description
`header`	TEXT (PK)	Short sequence ID
`full_header`	TEXT	Original FASTA/FASTQ header
`sequence`	TEXT	Nucleotide sequence
`qualities`	TEXT	Quality string (FASTQ only)

2. `features` table

Stores all computed features per sequence.

Column	Type	Description
`sequence_id`	TEXT (PK)	Sequence identifier
`<feature>`	INTEGER / REAL / TEXT	Feature values

Examples of features:

sequence length
GC content
k-mer statistics
external annotations (e.g. Tiara)

3. `coordinates` table (SpatiaLite)

Stores embeddings as geometry columns.

Column	Type	Description
`sequence_id`	TEXT (PK)	Sequence identifier
`<method>`	GEOMETRY	Embedding coordinates

Each dimensionality reduction method becomes a geometry column, for example:

umap
tsne
pacmap

These are stored as:

POINT(x y) for 2D
POINT Z(x y z) for 3D

Important

Spatial indices are created automatically for fast querying.

4. Embedding tables

Each embedding is also stored as a flat table:

Example:

Column	Description
`sequence_id`	ID
`umap_1`, `umap_2`	Coordinates

These tables are useful for:

exporting data
debugging
external analysis

5. `Clustering` table

Stores clustering results from one or more methods.

Column	Description
`sequence_id`	ID
`<method>`	Cluster assignment

Example columns:

hdbscan
leiden
dbscan

Multiple clustering outputs are merged into a single table.

Relationships

All tables are linked via sequence_id

Spatial capabilities

Because SpatiaLite is used, you can run spatial queries:

SELECT *
FROM coordinates
WHERE ST_Within(
    umap,
    ST_GeomFromText('POLYGON((...))')
);

This is how lasso selection works in the interactive app (which is basically a GUI for spatial SQL queries)

Inspecting the database

from kmer_ord.db.inspect import inspect_database

inspect_database("results/kmer-ord.sqlite", limit=5)

SQLite CLI

sqlite3 results/kmer-ord.sqlite

List tables:

.tables

Preview data:

SELECT * FROM features LIMIT 5;

Working with SpatiaLite geometry

Embedding coordinates are stored as geometry columns, not plain numeric columns

Its hence not possible to directly SELECT umap_1, umap_2. coordinates must be extracted using spatial function: ST_X(), ST_Y(), ST_Y() and ST_AsText()

Geometry columns require the SpatialLite extension to be loaded

Using Python

Basic connection

import sqlite3

conn = sqlite3.connect("results/kmer-ord.sqlite")
cursor = conn.cursor()

# enable spatialite
conn.enable_load_extension(True)
cursor.execute("SELECT load_extension('mod_spatialite');")

Extract coordinates

import pandas as pd

query = """
SELECT
    sequence_id,
    ST_X(umap) AS umap_1,
    ST_Y(umap) AS umap_2
FROM coordinates
LIMIT 5;
"""

df = pd.read_sql_query(query, conn)
print(df.head())

Extract mutliple embessings

query = """
SELECT
    sequence_id,
    ST_X(umap) AS umap_1,
    ST_Y(umap) AS umap_2,
    ST_X(tsne) AS tsne_1,
    ST_Y(tsne) AS tsne_2
FROM coordinates;
"""

df = pd.read_sql_query(query, conn)

Join with features

query = """
SELECT
    f.sequence_id,
    f.gc_content,
    f.length,
    ST_X(c.umap) AS umap_1,
    ST_Y(c.umap) AS umap_2
FROM features f
JOIN coordinates c USING(sequence_id);
"""

df = pd.read_sql_query(query, conn)

Using R

Connect and load SpatiaLite

library(RSQLite)

con <- dbConnect(SQLite(), "results/kmer-ord.sqlite")

# enable extensions
dbExecute(con, "PRAGMA enable_load_extension = 1;")

# load spatialite (path may vary)
dbExecute(con, "SELECT load_extension('mod_spatialite')")

you may need to locate mod_spatialite manually

find /usr -name "mod_spatialite*"

and provide the full path if needed

dbExecute(con, "SELECT load_extension('/path/mod_spatialite')")