cmahalanobis: An R Package for Computing the Mahalanobis Distance Between Factors

Introduction

Statistical matching or statistical fusion of data is a technique widely used in various fields, such as impact evaluation, public policy analysis, market research, biostatistics and others. This technique consists of integrating two data sources that refer to the same target population and that share some variables, but not all. The aim is to obtain a synthetic dataset that contains all the variables of interest from both sources, and that preserves the statistical properties of the original data.

One of the key steps in statistical matching or statistical fusion of data is to measure the similarity or dissimilarity between the units of the two data sources on the basis of their common variables. A common choice is the Mahalanobis distance, a measure of dissimilarity between two vectors of multivariate random variables that is based on the covariance matrix. This distance takes the correlation between variables into account and, because it rescales by the inverse covariance matrix, gives less weight to variables that have more variance.
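
For reference, the squared Mahalanobis distance can be written as follows. The symbols are notation introduced here for illustration: $x$ is an observation vector, $\mu$ a mean vector, $\Sigma$ a covariance matrix, and $\sigma_k^2$ the variance of the $k$-th of $p$ variables.

```latex
D^2(x, \mu) = (x - \mu)^{\top} \Sigma^{-1} (x - \mu)

% Special case: if \Sigma is diagonal with variances \sigma_1^2, \dots, \sigma_p^2
D^2(x, \mu) = \sum_{k=1}^{p} \frac{(x_k - \mu_k)^2}{\sigma_k^2}
```

In the diagonal case each squared difference is divided by the variance of its variable, so a given deviation counts for less on a variable that varies more.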

However, calculating the Mahalanobis distance requires complex mathematical operations, such as inverting the covariance matrix, which can be difficult to implement and computationally expensive, especially when working with large amounts of data or many variables. Furthermore, the calculation of the Mahalanobis distance can be affected by problems such as missing data, nonnormality of variables, or multicollinearity. These problems can lead to inaccurate or unreliable results, and appropriate methods are needed to handle them.

We present cmahalanobis, an R package that provides a function to compute the Mahalanobis distance between every pair of groups (for example, species) in a list of data frames. Each data frame contains the observations of one group on a common set of variables. The package is based on the formula for the Mahalanobis distance and relies on the mahalanobis function of the R “stats” package for the matrix computations. It offers several options for handling missing data, standardizing variables, and selecting relevant variables, and it differs from similar packages in its simplicity, flexibility, and speed.

We provide an effective and practical tool for calculating the Mahalanobis distance, and we show some applications in real and simulated cases. We illustrate the results with graphs and tables, and we comment on the implications and limitations of our approach. We conclude that the cmahalanobis package is a useful and valuable resource for statistical matching or statistical fusion of data.

Advantages of the cmahalanobis package

The main advantages of the package are ease of use, good documentation, automatic handling of NA and NaN values, automatic plotting, and the ability to compute the Mahalanobis distances among several groups in a single call, which saves time.

Other properties:

The “cmahalanobis” function can be used when NA and/or NaN values are present: it replaces each of them with the mean of the observed values of the same variable.

First, we install the “cmahalanobis” package from CRAN with the following code:

install.packages("cmahalanobis")

Then we load the package and compute the distances for two small groups that contain missing values:

library(cmahalanobis)
# Two small groups with missing values (NA)
group1 <- data.frame(a = c(1,2,NA,4), b = c(3,4,5,NA))
group2 <- data.frame(a = c(2,5,NA,3), b = c(3,4,5,5))
groups <- list(group1, group2)
distances <- cmahalanobis(groups, plot = TRUE, p.value = TRUE)

distances
#> $distances
#>          [,1]     [,2]
#> [1,] 1.500000 2.580882
#> [2,] 1.880282 1.500000
#> 
#> $p_values
#>           [,1]      [,2]
#> [1,]        NA 0.2751494
#> [2,] 0.3905728        NA

# The same groups with NaN instead of NA give the same result
group1 <- data.frame(a = c(1,2,NaN,4), b = c(3,4,5,NaN))
group2 <- data.frame(a = c(2,5,NaN,3), b = c(3,4,5,5))
groups <- list(group1, group2)
distances <- cmahalanobis(groups, plot = FALSE, p.value = TRUE)
distances
#> $distances
#>          [,1]     [,2]
#> [1,] 1.500000 2.580882
#> [2,] 1.880282 1.500000
#> 
#> $p_values
#>           [,1]      [,2]
#> [1,]        NA 0.2751494
#> [2,] 0.3905728        NA
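
The values printed above can be reproduced with base R alone. The sketch below is not the package's source code. It assumes that missing values are replaced by the column mean of the observed values, and that entry [i, j] is the average squared Mahalanobis distance of the observations in group j from the mean of group i, computed with the covariance matrix of group i. Both assumptions are inferred from the printed output, and the helper names impute_mean and pairwise_mean_distance are hypothetical.

```r
# Hypothetical helper: replace NA/NaN in each column by the mean of the
# observed values (is.na() is TRUE for both NA and NaN).
impute_mean <- function(df) {
  df[] <- lapply(df, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  df
}

# Hypothetical helper: entry [i, j] is the mean of the squared Mahalanobis
# distances of the rows of group j from the mean of group i, using the
# covariance of group i (an assumption inferred from the printed output).
pairwise_mean_distance <- function(groups) {
  groups <- lapply(groups, impute_mean)
  k <- length(groups)
  out <- matrix(NA_real_, k, k)
  for (i in seq_len(k)) {
    center <- colMeans(groups[[i]])
    covariance <- cov(groups[[i]])
    for (j in seq_len(k)) {
      out[i, j] <- mean(stats::mahalanobis(groups[[j]], center, covariance))
    }
  }
  out
}

group1 <- data.frame(a = c(1, 2, NA, 4), b = c(3, 4, 5, NA))
group2 <- data.frame(a = c(2, 5, NaN, 3), b = c(3, 4, 5, 5))
d <- pairwise_mean_distance(list(group1, group2))
round(d, 6)
```

Under these assumptions the rounded matrix agrees with the $distances element shown above, whether the missing entries are coded as NA or as NaN.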

Advantages on particular datasets

The “cmahalanobis” package is particularly suitable for datasets that contain multiple groups of multivariate data for which you want to calculate the Mahalanobis distance. Here is a scenario in which “cmahalanobis” proves to be very useful:

“cmahalanobis” is useful when you need a function that automatically handles the calculation of means and covariance matrices for each group, simplifying the process of calculating distances between multiple groups. This makes it an excellent tool for analyses involving multi-group comparisons in a multivariate context.

Application

Application of cmahalanobis to the iris dataset

Once the package is installed, we can apply our function to the iris dataset, which is a built-in dataset in R. The iris dataset contains 150 observations of three species of iris flowers (setosa, versicolor, and virginica), with four variables: sepal length, sepal width, petal length, and petal width. Before we can use our function, we have to create a list with one data frame per species, using the following code:

iris_list <- split(iris, iris$Species)

This code splits the iris data frame into three data frames, one for each species, and stores them in a list called “iris_list”. We can print “iris_list” to inspect its structure and content:

# Print iris_list
iris_list
#> $setosa
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> 11          5.4         3.7          1.5         0.2  setosa
#> 12          4.8         3.4          1.6         0.2  setosa
#> 13          4.8         3.0          1.4         0.1  setosa
#> 14          4.3         3.0          1.1         0.1  setosa
#> 15          5.8         4.0          1.2         0.2  setosa
#> 16          5.7         4.4          1.5         0.4  setosa
#> 17          5.4         3.9          1.3         0.4  setosa
#> 18          5.1         3.5          1.4         0.3  setosa
#> 19          5.7         3.8          1.7         0.3  setosa
#> 20          5.1         3.8          1.5         0.3  setosa
#> 21          5.4         3.4          1.7         0.2  setosa
#> 22          5.1         3.7          1.5         0.4  setosa
#> 23          4.6         3.6          1.0         0.2  setosa
#> 24          5.1         3.3          1.7         0.5  setosa
#> 25          4.8         3.4          1.9         0.2  setosa
#> 26          5.0         3.0          1.6         0.2  setosa
#> 27          5.0         3.4          1.6         0.4  setosa
#> 28          5.2         3.5          1.5         0.2  setosa
#> 29          5.2         3.4          1.4         0.2  setosa
#> 30          4.7         3.2          1.6         0.2  setosa
#> 31          4.8         3.1          1.6         0.2  setosa
#> 32          5.4         3.4          1.5         0.4  setosa
#> 33          5.2         4.1          1.5         0.1  setosa
#> 34          5.5         4.2          1.4         0.2  setosa
#> 35          4.9         3.1          1.5         0.2  setosa
#> 36          5.0         3.2          1.2         0.2  setosa
#> 37          5.5         3.5          1.3         0.2  setosa
#> 38          4.9         3.6          1.4         0.1  setosa
#> 39          4.4         3.0          1.3         0.2  setosa
#> 40          5.1         3.4          1.5         0.2  setosa
#> 41          5.0         3.5          1.3         0.3  setosa
#> 42          4.5         2.3          1.3         0.3  setosa
#> 43          4.4         3.2          1.3         0.2  setosa
#> 44          5.0         3.5          1.6         0.6  setosa
#> 45          5.1         3.8          1.9         0.4  setosa
#> 46          4.8         3.0          1.4         0.3  setosa
#> 47          5.1         3.8          1.6         0.2  setosa
#> 48          4.6         3.2          1.4         0.2  setosa
#> 49          5.3         3.7          1.5         0.2  setosa
#> 50          5.0         3.3          1.4         0.2  setosa
#> 
#> $versicolor
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 51           7.0         3.2          4.7         1.4 versicolor
#> 52           6.4         3.2          4.5         1.5 versicolor
#> 53           6.9         3.1          4.9         1.5 versicolor
#> 54           5.5         2.3          4.0         1.3 versicolor
#> 55           6.5         2.8          4.6         1.5 versicolor
#> 56           5.7         2.8          4.5         1.3 versicolor
#> 57           6.3         3.3          4.7         1.6 versicolor
#> 58           4.9         2.4          3.3         1.0 versicolor
#> 59           6.6         2.9          4.6         1.3 versicolor
#> 60           5.2         2.7          3.9         1.4 versicolor
#> 61           5.0         2.0          3.5         1.0 versicolor
#> 62           5.9         3.0          4.2         1.5 versicolor
#> 63           6.0         2.2          4.0         1.0 versicolor
#> 64           6.1         2.9          4.7         1.4 versicolor
#> 65           5.6         2.9          3.6         1.3 versicolor
#> 66           6.7         3.1          4.4         1.4 versicolor
#> 67           5.6         3.0          4.5         1.5 versicolor
#> 68           5.8         2.7          4.1         1.0 versicolor
#> 69           6.2         2.2          4.5         1.5 versicolor
#> 70           5.6         2.5          3.9         1.1 versicolor
#> 71           5.9         3.2          4.8         1.8 versicolor
#> 72           6.1         2.8          4.0         1.3 versicolor
#> 73           6.3         2.5          4.9         1.5 versicolor
#> 74           6.1         2.8          4.7         1.2 versicolor
#> 75           6.4         2.9          4.3         1.3 versicolor
#> 76           6.6         3.0          4.4         1.4 versicolor
#> 77           6.8         2.8          4.8         1.4 versicolor
#> 78           6.7         3.0          5.0         1.7 versicolor
#> 79           6.0         2.9          4.5         1.5 versicolor
#> 80           5.7         2.6          3.5         1.0 versicolor
#> 81           5.5         2.4          3.8         1.1 versicolor
#> 82           5.5         2.4          3.7         1.0 versicolor
#> 83           5.8         2.7          3.9         1.2 versicolor
#> 84           6.0         2.7          5.1         1.6 versicolor
#> 85           5.4         3.0          4.5         1.5 versicolor
#> 86           6.0         3.4          4.5         1.6 versicolor
#> 87           6.7         3.1          4.7         1.5 versicolor
#> 88           6.3         2.3          4.4         1.3 versicolor
#> 89           5.6         3.0          4.1         1.3 versicolor
#> 90           5.5         2.5          4.0         1.3 versicolor
#> 91           5.5         2.6          4.4         1.2 versicolor
#> 92           6.1         3.0          4.6         1.4 versicolor
#> 93           5.8         2.6          4.0         1.2 versicolor
#> 94           5.0         2.3          3.3         1.0 versicolor
#> 95           5.6         2.7          4.2         1.3 versicolor
#> 96           5.7         3.0          4.2         1.2 versicolor
#> 97           5.7         2.9          4.2         1.3 versicolor
#> 98           6.2         2.9          4.3         1.3 versicolor
#> 99           5.1         2.5          3.0         1.1 versicolor
#> 100          5.7         2.8          4.1         1.3 versicolor
#> 
#> $virginica
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 101          6.3         3.3          6.0         2.5 virginica
#> 102          5.8         2.7          5.1         1.9 virginica
#> 103          7.1         3.0          5.9         2.1 virginica
#> 104          6.3         2.9          5.6         1.8 virginica
#> 105          6.5         3.0          5.8         2.2 virginica
#> 106          7.6         3.0          6.6         2.1 virginica
#> 107          4.9         2.5          4.5         1.7 virginica
#> 108          7.3         2.9          6.3         1.8 virginica
#> 109          6.7         2.5          5.8         1.8 virginica
#> 110          7.2         3.6          6.1         2.5 virginica
#> 111          6.5         3.2          5.1         2.0 virginica
#> 112          6.4         2.7          5.3         1.9 virginica
#> 113          6.8         3.0          5.5         2.1 virginica
#> 114          5.7         2.5          5.0         2.0 virginica
#> 115          5.8         2.8          5.1         2.4 virginica
#> 116          6.4         3.2          5.3         2.3 virginica
#> 117          6.5         3.0          5.5         1.8 virginica
#> 118          7.7         3.8          6.7         2.2 virginica
#> 119          7.7         2.6          6.9         2.3 virginica
#> 120          6.0         2.2          5.0         1.5 virginica
#> 121          6.9         3.2          5.7         2.3 virginica
#> 122          5.6         2.8          4.9         2.0 virginica
#> 123          7.7         2.8          6.7         2.0 virginica
#> 124          6.3         2.7          4.9         1.8 virginica
#> 125          6.7         3.3          5.7         2.1 virginica
#> 126          7.2         3.2          6.0         1.8 virginica
#> 127          6.2         2.8          4.8         1.8 virginica
#> 128          6.1         3.0          4.9         1.8 virginica
#> 129          6.4         2.8          5.6         2.1 virginica
#> 130          7.2         3.0          5.8         1.6 virginica
#> 131          7.4         2.8          6.1         1.9 virginica
#> 132          7.9         3.8          6.4         2.0 virginica
#> 133          6.4         2.8          5.6         2.2 virginica
#> 134          6.3         2.8          5.1         1.5 virginica
#> 135          6.1         2.6          5.6         1.4 virginica
#> 136          7.7         3.0          6.1         2.3 virginica
#> 137          6.3         3.4          5.6         2.4 virginica
#> 138          6.4         3.1          5.5         1.8 virginica
#> 139          6.0         3.0          4.8         1.8 virginica
#> 140          6.9         3.1          5.4         2.1 virginica
#> 141          6.7         3.1          5.6         2.4 virginica
#> 142          6.9         3.1          5.1         2.3 virginica
#> 143          5.8         2.7          5.1         1.9 virginica
#> 144          6.8         3.2          5.9         2.3 virginica
#> 145          6.7         3.3          5.7         2.5 virginica
#> 146          6.7         3.0          5.2         2.3 virginica
#> 147          6.3         2.5          5.0         1.9 virginica
#> 148          6.5         3.0          5.2         2.0 virginica
#> 149          6.2         3.4          5.4         2.3 virginica
#> 150          5.9         3.0          5.1         1.8 virginica

Now that we have created our list of data frames, we can apply the “cmahalanobis()” function.

res <- cmahalanobis(iris_list, p.value = TRUE)

Then, we print the result:

res
#> $distances
#>          [,1]      [,2]      [,3]
#> [1,]   3.9200 335.19989 727.42056
#> [2,] 107.1736   3.92000  26.71618
#> [3,] 171.7689  16.88654   3.92000
#> 
#> $p_values
#>              [,1]         [,2]          [,3]
#> [1,]           NA 2.748687e-71 4.023276e-156
#> [2,] 2.915001e-22           NA  2.268568e-05
#> [3,] 4.363119e-36 2.033555e-03            NA

This output shows the Mahalanobis distances among the iris species present in the iris data frame: setosa, versicolor, and virginica. The main diagonal of the matrix shows the Mahalanobis distance between each species and itself, which is always 3.92. The other elements show the distances between different species. Each row, from 1 to 3, gives the distance between the mean of that row’s group and the observations of the group in each column. For example, element [1,2] is the distance between the mean of group 1 (setosa) and all points in group 2 (versicolor), which is 335.19989; element [3,2] is the distance between the mean of group 3 (virginica) and all points in group 2 (versicolor), which is 16.88654; and so on. The matrix is therefore not symmetric. These values indicate that versicolor and virginica are more similar to each other than either is to setosa, according to the variables measured in the iris data frame.
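
The constant diagonal can be checked with base R. Assume that a diagonal entry is the mean squared Mahalanobis distance of a group's observations from their own mean, computed with that group's sample covariance. This is an inference from the output, not a documented detail of the package. For n observations and p variables that mean always equals p(n - 1)/n, which is 4 × 49 / 50 = 3.92 for each iris species:

```r
# Numeric columns of one species (50 observations, 4 variables)
setosa <- iris[iris$Species == "setosa", 1:4]

n <- nrow(setosa)
p <- ncol(setosa)

# Mean squared Mahalanobis distance of the observations from their own mean,
# using the sample covariance of the same observations
own_distance <- mean(stats::mahalanobis(setosa, colMeans(setosa), cov(setosa)))

# Both entries should agree
c(computed = own_distance, formula = p * (n - 1) / n)
```

The same identity gives 10 × 18 / 19 and 10 × 12 / 13 for groups of 19 and 13 observations on 10 variables, and 5 × 99 / 100 for 100 observations on 5 variables.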

Figure 1 shows the same results as the matrix, in graphical form.

The p-value matrix obtained by applying our package to the iris dataset indicates whether the Mahalanobis distances between the groups are statistically significant. We calculated the p-values using the chi-squared distribution. Every off-diagonal p-value is below 0.05; the largest, 2.033555e-03, corresponds to the mean of virginica and the observations of versicolor. The diagonal entries are reported as NA.

In summary, our p-value matrix suggests that all three groups in the ‘iris’ dataset are significantly different from each other based on the provided measurements. This is consistent with what one would expect from the ‘iris’ dataset, which contains measurements of three different species of iris flowers (setosa, versicolor, and virginica), each with distinctive characteristics.
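
The printed p-values are consistent with an upper-tail chi-squared probability whose degrees of freedom equal the number of variables (four for iris). The degrees of freedom are an assumption checked against the output above, not a documented detail of the package. A minimal sketch using two of the printed distances:

```r
# Distances copied from the $distances output above
d_virginica_versicolor <- 16.88654  # entry [3, 2]
d_versicolor_virginica <- 26.71618  # entry [2, 3]

# Upper-tail chi-squared probability with 4 degrees of freedom (assumed)
p_32 <- pchisq(d_virginica_versicolor, df = 4, lower.tail = FALSE)
p_23 <- pchisq(d_versicolor_virginica, df = 4, lower.tail = FALSE)

signif(c(p_32, p_23), 4)
```

Up to the rounding of the copied distances, the two probabilities agree with entries [3,2] and [2,3] of the p-value matrix.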

Application of cmahalanobis to the mtcars dataset

Next, we apply the “cmahalanobis” function to the mtcars dataset, grouping the cars by the variable “am”. First, we split the data into cars with am = 0 (automatic transmission) and cars with am = 1 (manual transmission):

# Create a data frame containing only the cars with am = 0
auto <- subset(mtcars, am == 0)
# Remove the column "am" (column 9), which is now constant
auto <- auto[, -9]
# Create a data frame containing only the cars with am = 1
manual <- subset(mtcars, am == 1)
# Remove the column "am" (column 9), which is now constant
manual <- manual[, -9]
# Create a list with the two groups of cars
groups <- list(auto, manual)
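
As an alternative sketch, the same two groups can be built in one step with split(), as was done for iris. The names predictors and groups_by_am are introduced here for illustration. Unlike the list above, the resulting list is named by the levels of am ("0" and "1").

```r
# Drop the grouping column by name, then split the rows by transmission type
predictors <- mtcars[, names(mtcars) != "am"]
groups_by_am <- split(predictors, mtcars$am)

# Number of cars in each group ("0" = automatic, "1" = manual)
sapply(groups_by_am, nrow)
```

Removing the column by name rather than by position keeps the code correct if the column order of the data frame changes.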

The list “groups” contains:

# Print groups
groups
#> [[1]]
#>                      mpg cyl  disp  hp drat    wt  qsec vs gear carb
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0    3    4
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0    3    2
#> 
#> [[2]]
#>                 mpg cyl  disp  hp drat    wt  qsec vs gear carb
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0    4    4
#> Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1    4    1
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1    4    1
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1    4    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0    5    8
#> Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1    4    2

Finally, we apply the function and print the result:

res <- cmahalanobis(groups, plot = TRUE, p.value = TRUE)

res
#> $distances
#>            [,1]       [,2]
#> [1,]   9.473684 156.116255
#> [2,] 735.591909   9.230769
#> 
#> $p_values
#>               [,1]         [,2]
#> [1,]            NA 2.050145e-28
#> [2,] 1.429549e-151           NA

Figure 2 shows that the distance between the mean of group 1 (“am = 0”) and its own observations is 9.47, and that the corresponding distance for group 2 (“am = 1”) is 9.23. The distance between the mean of the “am = 0” cars and the observations of the “am = 1” cars is 156.116255. In the other direction, the distance between the mean of the “am = 1” cars and the observations of the “am = 0” cars is 735.591909, which is much higher. Therefore, cars with “am = 1” and cars with “am = 0” differ markedly on the other variables, namely “mpg”, “cyl”, “disp”, “hp”, “drat”, “wt”, “qsec”, “vs”, “gear”, and “carb”. The p-value matrix, whose off-diagonal values are far below 0.05, confirms this.

Simulation study

Here is an example of a simulation study using the “cmahalanobis” package in R. In this study, we generated three simulated data groups, each with a set of normally distributed variables, and then calculated the Mahalanobis distance between these groups using the “cmahalanobis” function.

# Load cmahalanobis package
library(cmahalanobis)
# Define the number of observations and variables for each group
num_observations <- 100
num_variables <- 5
# We generate three groups of simulated data with normal distribution
set.seed(123) # For the reproducibility of results
group1 <- as.data.frame(matrix(rnorm(num_observations * num_variables), 
                               nrow = num_observations))
group2 <- as.data.frame(matrix(rnorm(num_observations * num_variables), 
                               nrow = num_observations))
group3 <- as.data.frame(matrix(rnorm(num_observations * num_variables), 
                               nrow = num_observations))
# Create a list of three groups of data
groups <- list(group1, group2, group3)
# Calculate Mahalanobis distance with cmahalanobis function
distances <- cmahalanobis(groups, p.value = TRUE)

# Visualize the distance matrix
distances
#> $distances
#>          [,1]     [,2]     [,3]
#> [1,] 4.950000 5.639257 5.567479
#> [2,] 4.722923 4.950000 5.029954
#> [3,] 5.329901 5.783087 4.950000
#> 
#> $p_values
#>           [,1]      [,2]      [,3]
#> [1,]        NA 0.3429174 0.3506032
#> [2,] 0.4506217        NA 0.4122355
#> [3,] 0.3769584 0.3279009        NA

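In this output, the distance matrix shows the Mahalanobis distance between the mean of the group in each row and all the data of the group in each column. All off-diagonal p-values are above 0.05, so the distances are small and the groups do not differ significantly. This is expected, because all three groups were drawn from the same standard normal distribution.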

This simulation study can serve as a starting point for more complex analyses, such as group comparisons in biodiversity studies, pattern analysis in multivariate data, or testing the effectiveness of statistical methods on simulated data. Note that the results would differ each time the script is run, because the data are generated at random, unless a fixed seed is set with “set.seed”, as done above.
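
The effect of the seed can be checked without the package. The following sketch generates a simulated group twice with the same seed and once with a different seed. The helper name simulate_group and the second seed value are hypothetical choices made for illustration.

```r
# Hypothetical helper: one simulated group of standard normal variables
simulate_group <- function(num_observations = 100, num_variables = 5) {
  as.data.frame(matrix(rnorm(num_observations * num_variables),
                       nrow = num_observations))
}

set.seed(123)
first_run <- simulate_group()

set.seed(123)
second_run <- simulate_group()

set.seed(456)
third_run <- simulate_group()

# Same seed: identical data. Different seed: different data.
identical(first_run, second_run)
identical(first_run, third_run)
```

Because the distances are computed from the generated data, fixing the seed before generating the groups is what makes the printed matrices repeatable.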

Conclusion

In this work, I have presented my package “cmahalanobis”, which I created to calculate the Mahalanobis distance between two or more groups of multivariate data. The Mahalanobis distance is a measure of dissimilarity between two vectors of multivariate random variables, based on the covariance matrix. This distance is useful for statistical matching or statistical data fusion, i.e., integrating two data sources that refer to the same target population and share some variables. My main goal was to compare the Mahalanobis distances between different types of data and to explore patterns and relationships in the data.

To create and use my package “cmahalanobis”, I used the programming language R and followed the guidelines for writing R packages provided by CRAN. I implemented my function “cmahalanobis”, which takes a list of data frames as input and returns a matrix of the Mahalanobis distances between the data frames, a plot, and a p-value matrix. I also created the documentation for my package using the “roxygen2” package.

To apply my function “cmahalanobis”, I used three datasets: iris, mtcars, and a simulated dataset. The iris dataset contains measurements of sepal and petal length and width for three species of iris. The mtcars dataset contains the characteristics of 32 cars, including the type of transmission. The results showed that the Mahalanobis distance between groups varied depending on the variables considered and on the direction of the comparison. For example, in the iris dataset, the highest distance concerns the mean of setosa and all data of virginica; the lowest concerns the mean of virginica and all data of versicolor. In the mtcars dataset, the Mahalanobis distances between cars with automatic and manual transmission are large and statistically significant, indicating a strong dissimilarity between the two types of cars based on the other measured variables. The simulation study showed how the package behaves on random data.

My conclusion is that the Mahalanobis distance is a useful metric for statistical matching or statistical data fusion because it accounts for the correlations between variables and for the ellipsoidal shape of the data.