Implementing Partial Least Squares Regression with Base R

Introduction

As data analysis and machine learning continue to advance in fields such as medicine, finance, and climate science, the need for effective statistical models to predict outcomes from large datasets has become increasingly important. Among these models is Partial Least Squares Regression (PLS), a widely used technique for predicting continuous responses from multiple predictor variables.

In this blog post, we will explore how to implement PLS regression using only base R and no additional packages. This task may seem daunting at first due to the complexity of the algorithm involved, but with a clear understanding of its components and implementation details, it is definitely achievable.

Background

PLS regression combines elements of principal component analysis and multiple linear regression. Rather than regressing the response directly on the original predictors, it extracts a small number of latent components: linear combinations of the predictor variables chosen to maximize their covariance with the response. In practice these components can be obtained from a singular value decomposition (SVD) of the cross-covariance matrix between the predictors and the response. The extracted components are then used to predict the target variable.

The model underlying PLS regression can be stated as follows:

X = T P' + E
Y = T Q' + F

where X is the n x p matrix of predictor variables, Y is the target response, T is the matrix of latent component scores, P and Q are the corresponding loading matrices, and E and F are residual matrices.
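
To make this concrete, here is a minimal sketch on simulated data (the data and variable names are purely illustrative) showing that the first PLS weight vector can be read off from an SVD of the cross-covariance matrix using base R's built-in svd(); later in the post we will replace the built-in svd() with our own implementation.

```r
# Illustrative only: simulated data, base R's built-in svd().
set.seed(1)
X <- scale(matrix(rnorm(100 * 5), nrow = 100), scale = FALSE)  # centred predictors
Y <- X[, 1] + 0.5 * X[, 2] + rnorm(100)
Y <- Y - mean(Y)                                               # centred response

w  <- svd(crossprod(X, Y))$u[, 1]  # first PLS weight vector (unit length)
t1 <- X %*% w                      # scores of the first latent component
```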

Implementation

We will implement PLS regression manually, using base R without any additional packages. We'll write three pieces: a function that calculates the explained sum of squares for each extracted component, a manual SVD routine used to compute the component weights from the cross-covariance of X and Y, and a search that determines which combination of original predictor variables maximizes the explained variance in the target response variable.

Calculating Explained Sum of Squares

The first step is calculating how much of the variance in the response each component explains. We'll achieve this by taking the ratio of the explained sum of squares (total minus residual) to the total sum of squares of the response.
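
As a tiny worked example (the numbers are made up), suppose we have four observed responses and the fitted values from some model; the proportion of variance explained is then:

```r
y     <- c(2, 4, 6, 8)                  # observed responses
y_hat <- c(2.5, 3.5, 6.5, 7.5)          # fitted values from some model
ss_total     <- sum((y - mean(y))^2)    # total sum of squares: 20
ss_residual  <- sum((y - y_hat)^2)      # residual sum of squares: 1
ss_explained <- ss_total - ss_residual  # explained sum of squares: 19
ss_explained / ss_total                 # proportion of variance explained: 0.95
```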

## Step 1: Define the PLS Helper Functions
```r
# Total sum of squares of a response vector: squared deviations from its mean.
total_sum_squares <- function(y) {
  sum((y - mean(y))^2)
}

# Explained sums of squares for a PLS fit of Y on X.
# For each extracted component, returns the proportion of the variance in Y
# that the component explains.
pls_explained_sums_of_squares <- function(X, Y, n_components = 2) {
  X <- scale(as.matrix(X), center = TRUE, scale = FALSE)  # centre the predictors
  Y <- as.numeric(Y) - mean(Y)                            # centre the response

  n_components <- min(n_components, ncol(X))
  ss_total_Y <- total_sum_squares(Y)
  var_explained <- numeric(n_components)

  for (i in seq_len(n_components)) {
    # Weight vector for this component: the leading left singular vector of
    # the cross-covariance t(X) %*% Y, computed with the manual SVD routine
    # defined in Step 2 below.
    w <- manual_svd(crossprod(X, Y))$u[, 1]

    # Scores (the latent component) and the corresponding loadings.
    t_scores <- X %*% w
    p_load   <- crossprod(X, t_scores) / sum(t_scores^2)
    q_load   <- sum(Y * t_scores) / sum(t_scores^2)

    # Deflate X and Y so the next component is orthogonal to this one.
    Y_hat <- as.numeric(t_scores) * q_load
    X <- X - t_scores %*% t(p_load)
    Y <- Y - Y_hat

    # Proportion of the original variance in Y explained by this component.
    var_explained[i] <- sum(Y_hat^2) / ss_total_Y
  }

  list(var_explained = var_explained,
       cumulative_var_explained = cumsum(var_explained))
}
```

Performing SVD Manually

To perform the singular value decomposition (SVD) of a matrix without a dedicated package, we can lean on base R's eigen() function together with matrix multiplication: the right singular vectors of A are the eigenvectors of t(A) %*% A, the singular values are the square roots of the corresponding eigenvalues, and the left singular vectors follow from U = A V D^-1.

## Step 2: Define a Manual SVD Function
```r
# Perform a (reduced) SVD of a matrix A using base R's eigen().
# The function is called manual_svd() so it does not mask base R's built-in svd().
manual_svd <- function(A) {
  A <- as.matrix(A)

  # Right singular vectors and singular values come from the eigen
  # decomposition of t(A) %*% A.
  eig <- eigen(crossprod(A), symmetric = TRUE)
  d <- sqrt(pmax(eig$values, 0))
  v <- eig$vectors

  # Drop directions with (numerically) zero singular values, then recover
  # the left singular vectors from U = A V D^-1.
  keep <- d > .Machine$double.eps * max(d)
  d <- d[keep]
  v <- v[, keep, drop = FALSE]
  u <- sweep(A %*% v, 2, d, "/")

  list(d = d, u = u, v = v)
}
```
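
As a quick sanity check (a sketch using a small random matrix), the output of manual_svd() should agree with base R's built-in svd() up to the sign of the singular vectors:

```r
set.seed(42)
A <- matrix(rnorm(6 * 3), nrow = 6)

manual  <- manual_svd(A)
builtin <- svd(A)

all.equal(manual$d, builtin$d, tolerance = 1e-6)            # singular values
all.equal(abs(manual$u), abs(builtin$u), tolerance = 1e-6)  # left vectors, up to sign
all.equal(abs(manual$v), abs(builtin$v), tolerance = 1e-6)  # right vectors, up to sign
```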

Finding the Best Combination of Predictor Variables

The goal is to find the combination of predictor variables that explains the most variance in our target response variable Y.

## Step 3: Iterate Through Different Combinations of Original Predictors

We will use a simple search over candidate subsets of the original predictor variables. For this example, we'll assume our dataset X is a matrix in which each column is an explanatory variable (for instance, one variable per wavelength of a MIR spectrum). Rather than trying every possible subset, we evaluate the nested subsets made up of the first i columns of X, starting from two predictors and adding one column at a time until all columns are included.

```r
# Function to check whether a set of predictors should be considered at all.
pls_predictor_inclusion <- function(predictors) {
  # We need at least one predictor column to fit a model.
  length(predictors) > 0
}

# Main function to find the best combination of original predictor variables.
pls_find_best_model <- function(X, Y, n_components = 2) {
  max_r_squared <- -Inf
  best_predictors <- NULL

  # Evaluate the nested subsets made up of the first i columns of X.
  for (i in 2:ncol(X)) {
    predictors <- seq_len(i)
    if (!pls_predictor_inclusion(predictors)) next

    fit <- pls_explained_sums_of_squares(X[, predictors, drop = FALSE], Y,
                                         n_components = n_components)
    r_squared <- sum(fit$var_explained)

    # Keep the subset with the largest total explained variance.
    if (r_squared > max_r_squared) {
      max_r_squared <- r_squared
      best_predictors <- predictors
    }
  }

  list(r_squared = max_r_squared, predictors = best_predictors)
}
```

We can now find our optimal model by calling pls_find_best_model(X, Y). It returns the total proportion of variance in Y explained by the fitted components for the best subset of predictors found, together with the column indices of those predictors.
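
Here is an end-to-end usage sketch on simulated data (the data-generating process is invented purely for illustration):

```r
set.seed(123)
X <- matrix(rnorm(200 * 6), nrow = 200)
colnames(X) <- paste0("x", 1:6)
# Only the first three columns actually drive the response.
Y <- 2 * X[, 1] - X[, 2] + 0.5 * X[, 3] + rnorm(200, sd = 0.5)

result <- pls_find_best_model(X, Y, n_components = 2)
result$predictors   # column indices of the selected predictors
result$r_squared    # total proportion of variance in Y explained
```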

This example code illustrates one possible approach to performing a multi-variable analysis without relying on specialized packages such as pls or caret. Note, however, that it lacks the convenience features (such as automatic selection of the number of components) and validation steps typically found in more robust packages.

You can extend this process by adding additional checks for model validity, handling non-normal residuals, and incorporating techniques like cross-validation for model comparison.
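
For instance, here is a minimal sketch of how k-fold cross-validation could be bolted on for a single PLS component; cv_r_squared() and its internals are illustrative assumptions rather than part of the functions defined above:

```r
# Hypothetical helper: cross-validated R^2 for a one-component PLS fit.
cv_r_squared <- function(X, Y, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(X)))
  press <- 0
  for (f in 1:k) {
    train <- folds != f
    # One-component PLS fit on the training fold.
    Xc <- scale(X[train, , drop = FALSE], scale = FALSE)
    Yc <- Y[train] - mean(Y[train])
    w  <- crossprod(Xc, Yc); w <- w / sqrt(sum(w^2))
    t1 <- Xc %*% w
    q  <- sum(Yc * t1) / sum(t1^2)
    beta <- as.numeric(w) * q                          # implied coefficients
    # Predict the held-out fold using the training-fold centring.
    X_test <- scale(X[!train, , drop = FALSE],
                    center = attr(Xc, "scaled:center"), scale = FALSE)
    pred <- mean(Y[train]) + X_test %*% beta
    press <- press + sum((Y[!train] - pred)^2)
  }
  1 - press / sum((Y - mean(Y))^2)                     # cross-validated R^2
}
```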


Last modified on 2025-02-13