Performing the Cramer-Von Mises Test: A Step-by-Step Guide for Comparing Two Distributions in R

Understanding Cramer-Von Mises Test

The Cramer-Von Mises test is a non-parametric method for comparing two distributions, meaning it makes no assumption about which distributional family the data come from. It is most useful for continuous data, where it picks up differences in shape anywhere along the two distributions rather than only a difference in means.

Cramer-Von Mises Test Formula

The Cramer-Von Mises statistic is built from empirical cumulative distribution functions (ECDFs), not from binned frequencies. At every observed value it takes the difference between the two ECDFs, squares it, and adds the squared differences together.

For example, let’s say we have two samples A and B with ECDFs $F_A$ and $F_B$, and a combined dataset size N. Let $x_1, \dots, x_N$ be the combined (joint) sample containing every observation from both. The twosamples package used below describes its statistic as the sum of squared ECDF differences over the joint sample:

$$D_c = \sum_{i=1}^{N} \left( F_A(x_i) - F_B(x_i) \right)^2$$

The classical textbook form multiplies the same sum by a constant that depends only on the two sample sizes, so it does not change a p-value obtained by permuting the data.
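To make the idea of comparing two samples through their cumulative distributions concrete, here is a minimal base R sketch of an ECDF-based statistic: the sum of squared differences between the two empirical CDFs, evaluated at every point of the combined sample. The helper name cvm_stat_sketch is hypothetical, and the scaling is an assumption: textbook definitions and packages may multiply this sum by a constant that depends on the sample sizes, so treat the absolute value as illustrative.

```r
# Hypothetical helper: sum of squared ECDF differences over the joint sample.
# This is a sketch of the idea, not the twosamples implementation.
cvm_stat_sketch <- function(a, b) {
  joint <- c(a, b)   # every observation from both samples
  F_a <- ecdf(a)     # empirical CDF of sample a (a step function)
  F_b <- ecdf(b)     # empirical CDF of sample b
  sum((F_a(joint) - F_b(joint))^2)
}

a <- c(1, 2, 3)
b <- c(4, 5, 6)

cvm_stat_sketch(a, b)  # completely separated samples give a large value
cvm_stat_sketch(a, a)  # identical samples give 0
```

The statistic on its own says little; the test turns it into a p-value by asking how often a value this large arises when the group labels are shuffled.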

Two-Sample Cramer-Von Mises Test in R

The two-sample Cramer-Von Mises test can be performed in R using the cvm_test function from the twosamples package. This function compares the distributions of two independent samples.

Start by loading the package:

library(twosamples)

This makes cvm_test available in the session. If the package is not installed yet, run install.packages("twosamples") first.
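Before looping, it can help to run the function once on two simulated samples. The sketch below assumes the twosamples package is installed and leaves the result as NULL otherwise; the simulated data, the seed, and the variable names are illustrative only.

```r
set.seed(42)               # the p-value is resampling-based, so fix the seed
x <- rnorm(50)             # sample from N(0, 1)
y <- rnorm(50, mean = 1)   # sample shifted up by one unit

# Run the test only if the package is available; otherwise keep NULL.
result <- if (requireNamespace("twosamples", quietly = TRUE)) {
  twosamples::cvm_test(x, y)
} else {
  NULL
}

print(result)  # when the package is present: a test statistic and a p-value
```

With a shift this large relative to the spread, a small p-value would be expected, but the exact number depends on the seed and the package version.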

The Problem: Running the Test in a Loop

In our problem, we want to compare each column of a matrix against a single reference vector. That means looping over the columns and running the two-sample Cramer-Von Mises test on each one individually.

Let’s explore how this could be done:

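# Create a sample matrix for demonstration purposes.
# matrix() fills column by column, so with nrow = 5 each line of
# five values below becomes one of the six columns.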
matrix_1 <- matrix(c(27, 38, 94, 40, 4,
                      69, 16, 85, 2, 15,
                      30, 35, 64, 95, 6,
                      20, 33, 77, 98, 55,
                      20, 44, 60, 33, 89,
                      12, 88, 87, 44, 38), nrow = 5)

# Define the vector distribution
vector_a <- c(1000/3, 1000/2.5, 1000/1.8, 1000/1.6, 1000/2)

# Run the two-sample Cramer-Von Mises test in a loop over columns of matrix_1
sapply(1:ncol(matrix_1), function(i) cvm_test(as.vector(matrix_1[,1:i]), vector_a))

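However, this approach has two problems. First, matrix_1[,1:i] selects columns 1 through i, so every call after the first pools several columns together instead of testing column i alone. Second, each call returns both the test statistic and the p-value, so sapply hands back both values for every column rather than just the p-values.

Solving the Error

To test one column at a time, index with matrix_1[, i]; the as.vector() wrapper is no longer needed because a single column is already a vector. To keep only the p-value, take the second element of the result returned by cvm_test.

The corrected approach:

sapply(1:ncol(matrix_1), function(i) cvm_test(matrix_1[, i], vector_a)[[2]])

Single brackets also work, but they keep the element's name on each p-value:

sapply(1:ncol(matrix_1), function(i) cvm_test(matrix_1[, i], vector_a)[2])

Either call returns one p-value per column. Because cvm_test estimates the p-value by resampling, the values vary slightly between runs unless you call set.seed() first.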
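Note that matrix_1[, 1:i] and matrix_1[, i] select different data: the first pools columns 1 through i, the second picks column i alone. This is easy to check in base R without running any test. The sketch below uses a hypothetical stand-in matrix, demo_matrix, with the same 5 x 6 shape as matrix_1.

```r
demo_matrix <- matrix(1:30, nrow = 5)    # stand-in: 5 rows, 6 columns, filled by column

i <- 3
pooled <- as.vector(demo_matrix[, 1:i])  # columns 1 to 3 flattened into one vector
single <- demo_matrix[, i]               # column 3 only

length(pooled)  # 15 values: three columns pooled
length(single)  # 5 values: one column

# Sample size that would be passed to the test for each i under both styles
sizes_pooled <- sapply(seq_len(ncol(demo_matrix)),
                       function(i) length(demo_matrix[, 1:i]))
sizes_single <- sapply(seq_len(ncol(demo_matrix)),
                       function(i) length(demo_matrix[, i]))
```

Here sizes_pooled grows with every step while sizes_single stays at five, which is what a column-by-column comparison needs.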

Understanding How the Results Were Returned

The cvm_test function returns a named numeric vector of length two: the test statistic first and the p-value second. Because the result is a vector rather than a list, both [[2]] and [2] reach the p-value; [[2]] returns the bare number, while [2] returns it with its name attached. If you’re not familiar with how R distinguishes the two bracket forms, this can be confusing.

cvm_test(matrix_1[, 1], vector_a)

This returns both values for the first column. Call names() on the result to see the exact labels your installed version uses.
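The two bracket forms behave differently on any named vector, which a small base R example shows. The vector mock_result below is a hypothetical stand-in for a test result; its names and values are made up for illustration and are not output from cvm_test.

```r
# Hypothetical stand-in for a two-element test result (values are invented)
mock_result <- c(statistic = 0.8, p_value = 0.12)

mock_result[[2]]           # bare number, name dropped
mock_result[2]             # same number, name kept
mock_result[["p_value"]]   # extraction by name also works

# Collecting [[2]] in a loop gives an unnamed numeric vector
p_values <- vapply(1:3, function(i) mock_result[[2]], numeric(1))
```

vapply is used here as a stricter alternative to sapply: the numeric(1) template makes R raise an error if any iteration returns something other than a single number.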

Conclusion

The Cramer-Von Mises test is a useful non-parametric method for comparing two distributions, and it is best suited to continuous data.

Running this test in a loop to compare each column of a matrix against a reference vector involves using the cvm_test function from R’s twosamples package and looping over the columns with sapply. The two details that matter are selecting a single column with matrix_1[, i] and keeping only the second element of each result.

The p-value returned by the cvm_test function can be accessed with either [2] or [[2]]: use [[2]] when you want the bare number and [2] when you want to keep its name.

Last modified on 2023-09-20