Skip to content

Compositional data

Normalisation methods

Centered log-ratio (CLR) transformation

\[ \operatorname{clr}(x_i) = \log\left(\frac{x_i}{g(\mathbf{x})}\right), \qquad g(\mathbf{x}) = \left( \prod_{j=1}^{D} x_j \right)^{1/D} \]

where

  • \( \mathbf{x} = (x_1, \dots, x_D) \) is the k-mer count vector for a single sample
  • \( x_i \) is the count of the \(i\)-th k-mer
  • \( D \) is the total number of k-mers
  • \( g(\mathbf{x}) \) is the geometric mean of the components of \( \mathbf{x} \)

Total Sum Scaling (TSS)

TSS normalizes each sample to unit sum, removing sequencing depth effects.

\[ x_i^{\mathrm{TSS}} = \frac{x_i}{\sum_{j=1}^{D} x_j} \]

where

  • \( x_i \) is the count of the \(i\)-th k-mer
  • \( D \) is the number of k-mers
  • \( \sum_{j=1}^{D} x_j \) is the total count of all k-mers in the sample

Log-TSS transformation

\[ x_i^{\log\text{-}\mathrm{TSS}} = \log\left( \frac{x_i}{\sum_{j=1}^{D} x_j} \right) \]

where

  • \( x_i \) is the count of the \(i\)-th k-mer
  • \( D \) is the number of k-mers
  • the logarithm is applied element-wise

Row-wise log normalisation

\[ x_i^{\log} = \log\left( \frac{x_i}{\sum_{j=1}^{D} x_j} \right) \]

where

  • \( x_i \) is the count of the \(i\)-th k-mer
  • normalization is performed independently for each sample

(This transformation is mathematically equivalent to log-TSS?? check, either just conceptually different.)