Overview
This script provides a generalized method to:
- Generate an embedding for a reference dataset using UMAP.
- Project new samples onto the existing embedding space.
- Assign new samples to clusters in the reference dataset based on similarity metrics.
Required Libraries
library(dplyr)
library(umap)
library(lsa) # For cosine similarity
Define Parameters
# Parameters for the script
columns_to_embed <- c("feature1", "feature2", "feature3") # Adjust as needed
These are the markers used to create the UMAP embedding. You can
adjust the columns_to_embed vector to include the features
you want to embed. In my case, I used the markers from my AML panel to
generate the embedding.
Separate Reference and New Data
# Separate reference and new datasets
reference_data <- full_dataset %>% filter(dataset_type == "Reference")
new_data <- full_dataset %>% filter(dataset_type == "New")
This assuming that the full_dataset contains both the
reference (healthy donor) and new data (AML), with a column
dataset_type indicating which samples belong to the
reference set and which are new. For my data, I imported all the FCS
files into R, and made a large dataframe with a column indicating the
dataset type.
Generate UMAP Embedding for Reference Data
# Create UMAP embedding for the reference dataset
umap_model <- umap(reference_data %>% select(all_of(columns_to_embed)))
# Add UMAP coordinates to the reference dataset
reference_data <- reference_data %>% mutate(
umap1 = umap_model$layout[, 1],
umap2 = umap_model$layout[, 2]
)
Euclidean Distance for Anchoring and Projecting
The UMAP embedding minimizes the discrepancy between high-dimensional distances and low-dimensional projections. For two points \(\mathbf{u}\) and \(\mathbf{v}\) in the original space, the Euclidean distance is calculated as: \[ d(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^n (u_i - v_i)^2} \]
This distance is preserved as much as possible when projecting onto the UMAP embedding space, ensuring local neighborhoods in the high-dimensional space are represented in the lower-dimensional space. The model learns a transformation that minimizes the discrepancy between these distances.
The transformation matrix \(T\) derived from the UMAP model is used to project new data points onto the existing UMAP space. Specifically, it maps a high-dimensional input vector \(\mathbf{x}\) to its corresponding low-dimensional representation \(\mathbf{y}\):
\[ \mathbf{y} = T(\mathbf{x}) \]
Where:
- \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\) is the high-dimensional input vector with \(n\) features.
- \(T\) is the transformation matrix, which has dimensions \(m \times n\), where \(m\) is the number of dimensions in the low-dimensional space (e.g., 2 for a 2D UMAP projection), and \(n\) is the number of dimensions in the high-dimensional space (e.g., the number of features in your dataset).
The transformation matrix \(T\) is ‘learned’ during the UMAP model training (using the healthy donor embedding) and contains weights that transform the input space. It can be expressed as:
\[ T = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1n} \\ t_{21} & t_{22} & \cdots & t_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ t_{m1} & t_{m2} & \cdots & t_{mn} \end{bmatrix} \]
Where:
- \(t_{ij}\) represents the element in the \(i\)-th row and \(j\)-th column of the transformation matrix.
- \(\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}\) is the resulting low-dimensional representation, typically of size \(m\) (e.g., 2D in the case of a 2D UMAP embedding).
The low-dimensional embedding \(\mathbf{y}\) is calculated by performing matrix multiplication:
\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1n} \\ t_{21} & t_{22} & \cdots & t_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ t_{m1} & t_{m2} & \cdots & t_{mn} \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \]
This matrix multiplication maps the high-dimensional point \(\mathbf{x}\) into the lower-dimensional
space \(\mathbf{y}\), ensuring that the
new data points are represented in the same embedding space as the
reference data. This is done by the predict() function in
the next step.
Project New Data onto the UMAP Space
predict_embedding <- function(new_data, umap_model) {
# Predict embeddings for new data
embedding <- predict(umap_model, new_data %>% select(all_of(columns_to_embed)))
new_data <- new_data %>% mutate(
umap1 = embedding[, 1],
umap2 = embedding[, 2]
)
return(new_data)
}
# Apply the function to project new data
new_data <- predict_embedding(new_data, umap_model)
The predict() function takes the transformation matrix
from the original UMAP model and applies it to the new dataset. This
matrix \(T\) was derived during
training (using the healthy donor data) and maps high-dimensional points
\(\mathbf{x}\) to their corresponding
low-dimensional coordinates \(\mathbf{y}\): \[
\mathbf{y} = T(\mathbf{x})
\] The new points are embedded in a manner consistent with the
original reference embedding. In other words, the AML cells are
projected into the same UMAP space as the healthy donor cells.
Assign Clusters to New Data
I use the centroid of each reference cluster (healthy donor clusters)
as a representative point and assign new samples to the cluster with the
closest centroid. The cosine similarity metric is used to measure the
similarity between the UMAP coordinates of the AML samples and the
centroids of the reference clusters. This is streamlined by using the
lsa package in R, specifically the cosine()
function.
Compute Centroids of Reference Clusters
# Calculate cluster centroids in the reference dataset
reference_centroids <- reference_data %>%
group_by(phenograph_cluster) %>%
summarize(
centroid_umap1 = mean(umap1),
centroid_umap2 = mean(umap2)
)
Define Cosine Similarity Function
cosine_distance <- function(x, y) {
1 - cosine(as.numeric(x), as.numeric(y))
}
Assign Clusters Based on Centroid Similarity
# Initialize cluster assignment column
new_data$assigned_cluster <- NA
# Assign clusters based on similarity to centroids
for (i in seq_len(nrow(new_data))) {
# Extract the UMAP coordinates of the current sample
sample_point <- new_data[i, c("umap1", "umap2")]
# Compute cosine distances to all centroids
distances <- apply(reference_centroids[, c("centroid_umap1", "centroid_umap2")], 1, function(centroid) {
cosine_distance(sample_point, centroid)
})
# Assign the cluster of the nearest centroid
nearest_cluster <- reference_centroids$phenograph_cluster[which.min(distances)]
new_data$assigned_cluster[i] <- nearest_cluster
}