Compositional data

Normalisation methods

The centred log-ratio (CLR) transform is defined as

\[ \operatorname{clr}(x_i) = \log\left(\frac{x_i}{g(\mathbf{x})}\right), \qquad g(\mathbf{x}) = \left( \prod_{j=1}^{D} x_j \right)^{1/D} \]

where

\( \mathbf{x} = (x_1, \dots, x_D) \) is the k-mer count vector for a single sample,
\( x_i \) is the count of the \(i\)-th k-mer,
\( D \) is the number of distinct k-mers (the length of \( \mathbf{x} \)), and
\( g(\mathbf{x}) \) is the geometric mean of the components of \( \mathbf{x} \).
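
The CLR definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `clr`, the example counts, and in particular the `pseudocount` parameter are assumptions introduced here. The pseudocount is needed because a zero count makes both \( \log x_i \) and \( g(\mathbf{x}) \) degenerate, a case the formula does not cover.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform of one sample's k-mer counts.

    `pseudocount` is an assumption of this sketch: it is added to every
    count so that zero counts do not produce log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    # log g(x) equals the arithmetic mean of the log components
    log_gmean = sum(logs) / len(logs)
    return [v - log_gmean for v in logs]


sample = [12, 0, 3, 45]  # hypothetical counts for D = 4 k-mers
transformed = clr(sample)
print(transformed)
```

Because each value is a log count minus the mean log count, the transformed components of a sample sum to zero up to floating-point rounding.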

Total sum scaling (TSS) normalises each sample to unit sum, which removes differences in sequencing depth between samples:

\[ x_i^{\mathrm{TSS}} = \frac{x_i}{\sum_{j=1}^{D} x_j} \]
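
As a small sketch of the TSS formula, the snippet below scales two hypothetical samples that have the same composition but a 100-fold difference in depth. The function name `tss` and the example counts are illustrative assumptions.

```python
def tss(counts):
    """Total sum scaling: divide each count by the sample total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts, so TSS is undefined")
    return [c / total for c in counts]


shallow = [2, 1, 1]        # hypothetical low-depth sample
deep = [200, 100, 100]     # same composition at 100x the depth

print(tss(shallow))  # [0.5, 0.25, 0.25]
print(tss(deep))     # [0.5, 0.25, 0.25]
```

Both samples map to the same proportions, which is the sense in which TSS removes the depth effect.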

Log-TSS takes the logarithm of the TSS-normalised value:

\[ x_i^{\log\text{-}\mathrm{TSS}} = \log\left( \frac{x_i}{\sum_{j=1}^{D} x_j} \right) \]
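
The sketch below computes log-TSS and checks numerically how it relates to CLR: since \( \log(x_i / \sum_j x_j) \) and \( \log x_i \) differ only by a constant within a sample, subtracting the sample mean from the log-TSS values gives the CLR values. The function names and the example counts are assumptions for illustration, and the example uses strictly positive counts because neither formula is defined for zeros.

```python
import math


def log_tss(counts):
    """Log of the TSS-normalised counts (requires all counts > 0)."""
    total = sum(counts)
    return [math.log(c / total) for c in counts]


def clr(counts):
    """Centred log-ratio transform (requires all counts > 0)."""
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


sample = [12, 7, 3, 45]  # hypothetical strictly positive counts

lt = log_tss(sample)
mean_lt = sum(lt) / len(lt)
centred = [v - mean_lt for v in lt]  # log-TSS with its sample mean removed

# Compare against CLR component by component, allowing for rounding
agree = all(
    math.isclose(a, b, abs_tol=1e-12) for a, b in zip(centred, clr(sample))
)
print(agree)
```

In other words, under these assumptions the two transforms differ by a per-sample offset only, so they rank the k-mers within a sample identically.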

The log transformation is written as:

\[ x_i^{\log} = \log\left( \frac{x_i}{\sum_{j=1}^{D} x_j} \right) \]

(As written, this expression is identical to the log-TSS formula, so the two transformations are mathematically equivalent and differ only conceptually.)