Generalized Embedding and Cluster Assignment

Abraheem ‘Abe’ Khouqeer

2024-12-16

Overview

This script provides a generalized method to:

Generate an embedding for a reference dataset using UMAP.
Project new samples onto the existing embedding space.
Assign new samples to clusters in the reference dataset based on similarity metrics.

Required Libraries

library(dplyr)
library(umap)
library(lsa) # For cosine similarity

Define Parameters

# Parameters for the script
columns_to_embed <- c("feature1", "feature2", "feature3") # Adjust as needed

These are the markers used to create the UMAP embedding. You can adjust the columns_to_embed vector to include the features you want to embed. In my case, I used the markers from my AML panel to generate the embedding.

Separate Reference and New Data

# Separate reference and new datasets
reference_data <- full_dataset %>% filter(dataset_type == "Reference")
new_data <- full_dataset %>% filter(dataset_type == "New")

This assuming that the full_dataset contains both the reference (healthy donor) and new data (AML), with a column dataset_type indicating which samples belong to the reference set and which are new. For my data, I imported all the FCS files into R, and made a large dataframe with a column indicating the dataset type.

Generate UMAP Embedding for Reference Data

# Create UMAP embedding for the reference dataset
umap_model <- umap(reference_data %>% select(all_of(columns_to_embed)))

# Add UMAP coordinates to the reference dataset
reference_data <- reference_data %>% mutate(
  umap1 = umap_model$layout[, 1],
  umap2 = umap_model$layout[, 2]
)

Euclidean Distance for Anchoring and Projecting

The UMAP embedding minimizes the discrepancy between high-dimensional distances and low-dimensional projections. For two points \(\mathbf{u}\) and \(\mathbf{v}\) in the original space, the Euclidean distance is calculated as: \[ d(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^n (u_i - v_i)^2} \]

This distance is preserved as much as possible when projecting onto the UMAP embedding space, ensuring local neighborhoods in the high-dimensional space are represented in the lower-dimensional space. The model learns a transformation that minimizes the discrepancy between these distances.

The transformation matrix \(T\) derived from the UMAP model is used to project new data points onto the existing UMAP space. Specifically, it maps a high-dimensional input vector \(\mathbf{x}\) to its corresponding low-dimensional representation \(\mathbf{y}\):

\[ \mathbf{y} = T(\mathbf{x}) \]

Where:

\(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\) is the high-dimensional input vector with \(n\) features.
\(T\) is the transformation matrix, which has dimensions \(m \times n\), where \(m\) is the number of dimensions in the low-dimensional space (e.g., 2 for a 2D UMAP projection), and \(n\) is the number of dimensions in the high-dimensional space (e.g., the number of features in your dataset).

The transformation matrix \(T\) is ‘learned’ during the UMAP model training (using the healthy donor embedding) and contains weights that transform the input space. It can be expressed as:

\[ T = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1n} \\ t_{21} & t_{22} & \cdots & t_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ t_{m1} & t_{m2} & \cdots & t_{mn} \end{bmatrix} \]

Where:

\(t_{ij}\) represents the element in the \(i\)-th row and \(j\)-th column of the transformation matrix.
\(\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}\) is the resulting low-dimensional representation, typically of size \(m\) (e.g., 2D in the case of a 2D UMAP embedding).

The low-dimensional embedding \(\mathbf{y}\) is calculated by performing matrix multiplication:

\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1n} \\ t_{21} & t_{22} & \cdots & t_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ t_{m1} & t_{m2} & \cdots & t_{mn} \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \]

This matrix multiplication maps the high-dimensional point \(\mathbf{x}\) into the lower-dimensional space \(\mathbf{y}\), ensuring that the new data points are represented in the same embedding space as the reference data. This is done by the predict() function in the next step.

Project New Data onto the UMAP Space

predict_embedding <- function(new_data, umap_model) {
  # Predict embeddings for new data
  embedding <- predict(umap_model, new_data %>% select(all_of(columns_to_embed)))
  new_data <- new_data %>% mutate(
    umap1 = embedding[, 1],
    umap2 = embedding[, 2]
  )
  return(new_data)
}

# Apply the function to project new data
new_data <- predict_embedding(new_data, umap_model)

The predict() function takes the transformation matrix from the original UMAP model and applies it to the new dataset. This matrix \(T\) was derived during training (using the healthy donor data) and maps high-dimensional points \(\mathbf{x}\) to their corresponding low-dimensional coordinates \(\mathbf{y}\): \[ \mathbf{y} = T(\mathbf{x}) \] The new points are embedded in a manner consistent with the original reference embedding. In other words, the AML cells are projected into the same UMAP space as the healthy donor cells.

Assign Clusters to New Data

I use the centroid of each reference cluster (healthy donor clusters) as a representative point and assign new samples to the cluster with the closest centroid. The cosine similarity metric is used to measure the similarity between the UMAP coordinates of the AML samples and the centroids of the reference clusters. This is streamlined by using the lsa package in R, specifically the cosine() function.

Compute Centroids of Reference Clusters

# Calculate cluster centroids in the reference dataset
reference_centroids <- reference_data %>%
  group_by(phenograph_cluster) %>%
  summarize(
    centroid_umap1 = mean(umap1),
    centroid_umap2 = mean(umap2)
  )

Define Cosine Similarity Function

cosine_distance <- function(x, y) {
  1 - cosine(as.numeric(x), as.numeric(y))
}

Assign Clusters Based on Centroid Similarity

# Initialize cluster assignment column
new_data$assigned_cluster <- NA

# Assign clusters based on similarity to centroids
for (i in seq_len(nrow(new_data))) {
  # Extract the UMAP coordinates of the current sample
  sample_point <- new_data[i, c("umap1", "umap2")]

  # Compute cosine distances to all centroids
  distances <- apply(reference_centroids[, c("centroid_umap1", "centroid_umap2")], 1, function(centroid) {
    cosine_distance(sample_point, centroid)
  })

  # Assign the cluster of the nearest centroid
  nearest_cluster <- reference_centroids$phenograph_cluster[which.min(distances)]
  new_data$assigned_cluster[i] <- nearest_cluster
}