Statistical matching or statistical fusion of data is a technique widely used in various fields, such as impact evaluation, public policy analysis, market research, biostatistics and others. This technique consists of integrating two data sources that refer to the same target population and that share some variables, but not all. The aim is to obtain a synthetic dataset that contains all the variables of interest from both sources, and that preserves the statistical properties of the original data.
One of the key steps in statistical matching or statistical fusion of data is to assess the similarity or dissimilarity between the units of the two data sources, based on the common variables. A widely used measure of dissimilarity is the Mahalanobis distance, a measure of dissimilarity between two vectors of multivariate random variables based on the covariance matrix. This distance takes the correlation between variables into account and rescales each variable by its variance, so that variables with larger variance carry less weight.
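For reference, the Mahalanobis distance between two vectors $\mathbf{x}$ and $\mathbf{y}$ from a distribution with covariance matrix $\Sigma$ is $d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{\top} \Sigma^{-1} (\mathbf{x} - \mathbf{y})}$.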
However, calculating the Mahalanobis distance requires complex mathematical operations, such as inverting the covariance matrix, which can be difficult to implement and computationally expensive, especially when working with large amounts of data or many variables. Furthermore, the calculation of the Mahalanobis distance can be affected by problems such as missing data, non-normality of the variables, or multicollinearity. These problems can lead to inaccurate or unreliable results, and appropriate methods are needed to handle them.
We present the cmahalanobis package, an R package that provides a function to compute the Mahalanobis distance between every pair of groups (for example, species) in a list of data frames, where each data frame contains the observations of one group on a set of variables. The cmahalanobis package is based on the formula for the Mahalanobis distance and relies on the mahalanobis() function of the “stats” R package for the matrix computations. The cmahalanobis package offers several options for handling missing data, standardizing variables, and selecting relevant variables. The cmahalanobis package differs from other similar packages in terms of its simplicity, flexibility, and speed.
We provide an effective and practical tool for calculating the Mahalanobis distance, and we show some applications in real and simulated cases. We illustrate the results with graphs and tables, and we comment on the implications and limitations of our approach. We conclude that the cmahalanobis package is a useful and valuable resource for statistical matching or statistical fusion of data.
The main advantages of the package are its ease of use, thorough documentation, automatic handling of NA or NaN values, automatic plotting, and the ability to calculate many Mahalanobis distances in less time.
Other properties:
Automation: The “cmahalanobis” function automates the calculation of the mean vector and covariance matrix for each data group, simplifying the process;
Convenience: You don’t need to manually compute the center or covariance matrix for each data group, making the function more convenient for situations where you have multiple data groups to compare;
Ease of use: “cmahalanobis” is easier to use when you want to calculate Mahalanobis distances across multiple data groups without having to handle the calculation details.
Automatic plot: plots are created as soon as users apply the function to a list of data frames; if they do not want plots, they must specify “plot = FALSE”;
Automatic p-values: p-values are printed when users print the object returned by applying the function to a list of data frames, provided they set “p.value = TRUE”;
The “cmahalanobis” function allows the calculation in the presence of NA and/or NaN values, replacing each of them with the mean of the observed values of the corresponding variable (see the sketch below).
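Conceptually, this mean imputation can be sketched as follows; this is a hypothetical illustration of the idea, not the package’s internal code:

# Hypothetical sketch of mean imputation for a single data frame:
# each NA or NaN is replaced by the mean of the observed values of its column
impute_mean <- function(df) {
  as.data.frame(lapply(df, function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)  # is.na() is TRUE for both NA and NaN
    x
  }))
}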
First, we have to install the “cmahalanobis” package from CRAN with the following code:
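# Install the package from CRAN (needed only once)
install.packages("cmahalanobis")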
Then:
library(cmahalanobis)
# Two small example groups containing NA values
group1 <- data.frame(a = c(1,2,NA,4), b = c(3,4,5,NA))
group2 <- data.frame(a = c(2,5,NA,3), b = c(3,4,5,5))
groups <- list(group1, group2)
distances <- cmahalanobis(groups, plot = TRUE, p.value = TRUE)
#> $distances
#> [,1] [,2]
#> [1,] 1.500000 2.580882
#> [2,] 1.880282 1.500000
#>
#> $p_values
#> [,1] [,2]
#> [1,] NA 0.2751494
#> [2,] 0.3905728 NA
group1 <- data.frame(a = c(1,2,NaN,4), b = c(3,4,5,NaN))
group2 <- data.frame(a = c(2,5,NaN,3), b = c(3,4,5,5))
groups <- list(group1, group2)
distances <- cmahalanobis(groups, plot = FALSE, p.value = TRUE)
distances
#> $distances
#> [,1] [,2]
#> [1,] 1.500000 2.580882
#> [2,] 1.880282 1.500000
#>
#> $p_values
#> [,1] [,2]
#> [1,] NA 0.2751494
#> [2,] 0.3905728 NA
The “cmahalanobis” package is particularly suitable for datasets that contain multiple groups of multivariate data between which you want to calculate the Mahalanobis distance. A typical scenario where “cmahalanobis” proves very useful is the following:
“cmahalanobis” is useful when you need a function that automatically handles the calculation of the means and covariance matrices for each group, simplifying the process of calculating distances between multiple groups. This makes it an excellent tool for analyses involving multi-group comparisons in a multivariate context.
Once the installation is successful, we can apply our function to the iris dataset, which is a built-in dataset in R. The iris dataset contains 150 observations of three species of iris flowers (setosa, versicolor, and virginica), with four variables: sepal length, sepal width, petal length, and petal width. Before we can use our function, we have to create a list of data frames, one for each species, with the following code:
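# Split the iris data frame by species into a list of three data frames
iris_list <- split(iris, iris$Species)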
This code splits the iris data frame into three data frames, one for each species, and stores them in a list called “iris_list”. We can print “iris_list” to inspect its structure and content:
#> $setosa
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> 11 5.4 3.7 1.5 0.2 setosa
#> 12 4.8 3.4 1.6 0.2 setosa
#> 13 4.8 3.0 1.4 0.1 setosa
#> 14 4.3 3.0 1.1 0.1 setosa
#> 15 5.8 4.0 1.2 0.2 setosa
#> 16 5.7 4.4 1.5 0.4 setosa
#> 17 5.4 3.9 1.3 0.4 setosa
#> 18 5.1 3.5 1.4 0.3 setosa
#> 19 5.7 3.8 1.7 0.3 setosa
#> 20 5.1 3.8 1.5 0.3 setosa
#> 21 5.4 3.4 1.7 0.2 setosa
#> 22 5.1 3.7 1.5 0.4 setosa
#> 23 4.6 3.6 1.0 0.2 setosa
#> 24 5.1 3.3 1.7 0.5 setosa
#> 25 4.8 3.4 1.9 0.2 setosa
#> 26 5.0 3.0 1.6 0.2 setosa
#> 27 5.0 3.4 1.6 0.4 setosa
#> 28 5.2 3.5 1.5 0.2 setosa
#> 29 5.2 3.4 1.4 0.2 setosa
#> 30 4.7 3.2 1.6 0.2 setosa
#> 31 4.8 3.1 1.6 0.2 setosa
#> 32 5.4 3.4 1.5 0.4 setosa
#> 33 5.2 4.1 1.5 0.1 setosa
#> 34 5.5 4.2 1.4 0.2 setosa
#> 35 4.9 3.1 1.5 0.2 setosa
#> 36 5.0 3.2 1.2 0.2 setosa
#> 37 5.5 3.5 1.3 0.2 setosa
#> 38 4.9 3.6 1.4 0.1 setosa
#> 39 4.4 3.0 1.3 0.2 setosa
#> 40 5.1 3.4 1.5 0.2 setosa
#> 41 5.0 3.5 1.3 0.3 setosa
#> 42 4.5 2.3 1.3 0.3 setosa
#> 43 4.4 3.2 1.3 0.2 setosa
#> 44 5.0 3.5 1.6 0.6 setosa
#> 45 5.1 3.8 1.9 0.4 setosa
#> 46 4.8 3.0 1.4 0.3 setosa
#> 47 5.1 3.8 1.6 0.2 setosa
#> 48 4.6 3.2 1.4 0.2 setosa
#> 49 5.3 3.7 1.5 0.2 setosa
#> 50 5.0 3.3 1.4 0.2 setosa
#>
#> $versicolor
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 51 7.0 3.2 4.7 1.4 versicolor
#> 52 6.4 3.2 4.5 1.5 versicolor
#> 53 6.9 3.1 4.9 1.5 versicolor
#> 54 5.5 2.3 4.0 1.3 versicolor
#> 55 6.5 2.8 4.6 1.5 versicolor
#> 56 5.7 2.8 4.5 1.3 versicolor
#> 57 6.3 3.3 4.7 1.6 versicolor
#> 58 4.9 2.4 3.3 1.0 versicolor
#> 59 6.6 2.9 4.6 1.3 versicolor
#> 60 5.2 2.7 3.9 1.4 versicolor
#> 61 5.0 2.0 3.5 1.0 versicolor
#> 62 5.9 3.0 4.2 1.5 versicolor
#> 63 6.0 2.2 4.0 1.0 versicolor
#> 64 6.1 2.9 4.7 1.4 versicolor
#> 65 5.6 2.9 3.6 1.3 versicolor
#> 66 6.7 3.1 4.4 1.4 versicolor
#> 67 5.6 3.0 4.5 1.5 versicolor
#> 68 5.8 2.7 4.1 1.0 versicolor
#> 69 6.2 2.2 4.5 1.5 versicolor
#> 70 5.6 2.5 3.9 1.1 versicolor
#> 71 5.9 3.2 4.8 1.8 versicolor
#> 72 6.1 2.8 4.0 1.3 versicolor
#> 73 6.3 2.5 4.9 1.5 versicolor
#> 74 6.1 2.8 4.7 1.2 versicolor
#> 75 6.4 2.9 4.3 1.3 versicolor
#> 76 6.6 3.0 4.4 1.4 versicolor
#> 77 6.8 2.8 4.8 1.4 versicolor
#> 78 6.7 3.0 5.0 1.7 versicolor
#> 79 6.0 2.9 4.5 1.5 versicolor
#> 80 5.7 2.6 3.5 1.0 versicolor
#> 81 5.5 2.4 3.8 1.1 versicolor
#> 82 5.5 2.4 3.7 1.0 versicolor
#> 83 5.8 2.7 3.9 1.2 versicolor
#> 84 6.0 2.7 5.1 1.6 versicolor
#> 85 5.4 3.0 4.5 1.5 versicolor
#> 86 6.0 3.4 4.5 1.6 versicolor
#> 87 6.7 3.1 4.7 1.5 versicolor
#> 88 6.3 2.3 4.4 1.3 versicolor
#> 89 5.6 3.0 4.1 1.3 versicolor
#> 90 5.5 2.5 4.0 1.3 versicolor
#> 91 5.5 2.6 4.4 1.2 versicolor
#> 92 6.1 3.0 4.6 1.4 versicolor
#> 93 5.8 2.6 4.0 1.2 versicolor
#> 94 5.0 2.3 3.3 1.0 versicolor
#> 95 5.6 2.7 4.2 1.3 versicolor
#> 96 5.7 3.0 4.2 1.2 versicolor
#> 97 5.7 2.9 4.2 1.3 versicolor
#> 98 6.2 2.9 4.3 1.3 versicolor
#> 99 5.1 2.5 3.0 1.1 versicolor
#> 100 5.7 2.8 4.1 1.3 versicolor
#>
#> $virginica
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 101 6.3 3.3 6.0 2.5 virginica
#> 102 5.8 2.7 5.1 1.9 virginica
#> 103 7.1 3.0 5.9 2.1 virginica
#> 104 6.3 2.9 5.6 1.8 virginica
#> 105 6.5 3.0 5.8 2.2 virginica
#> 106 7.6 3.0 6.6 2.1 virginica
#> 107 4.9 2.5 4.5 1.7 virginica
#> 108 7.3 2.9 6.3 1.8 virginica
#> 109 6.7 2.5 5.8 1.8 virginica
#> 110 7.2 3.6 6.1 2.5 virginica
#> 111 6.5 3.2 5.1 2.0 virginica
#> 112 6.4 2.7 5.3 1.9 virginica
#> 113 6.8 3.0 5.5 2.1 virginica
#> 114 5.7 2.5 5.0 2.0 virginica
#> 115 5.8 2.8 5.1 2.4 virginica
#> 116 6.4 3.2 5.3 2.3 virginica
#> 117 6.5 3.0 5.5 1.8 virginica
#> 118 7.7 3.8 6.7 2.2 virginica
#> 119 7.7 2.6 6.9 2.3 virginica
#> 120 6.0 2.2 5.0 1.5 virginica
#> 121 6.9 3.2 5.7 2.3 virginica
#> 122 5.6 2.8 4.9 2.0 virginica
#> 123 7.7 2.8 6.7 2.0 virginica
#> 124 6.3 2.7 4.9 1.8 virginica
#> 125 6.7 3.3 5.7 2.1 virginica
#> 126 7.2 3.2 6.0 1.8 virginica
#> 127 6.2 2.8 4.8 1.8 virginica
#> 128 6.1 3.0 4.9 1.8 virginica
#> 129 6.4 2.8 5.6 2.1 virginica
#> 130 7.2 3.0 5.8 1.6 virginica
#> 131 7.4 2.8 6.1 1.9 virginica
#> 132 7.9 3.8 6.4 2.0 virginica
#> 133 6.4 2.8 5.6 2.2 virginica
#> 134 6.3 2.8 5.1 1.5 virginica
#> 135 6.1 2.6 5.6 1.4 virginica
#> 136 7.7 3.0 6.1 2.3 virginica
#> 137 6.3 3.4 5.6 2.4 virginica
#> 138 6.4 3.1 5.5 1.8 virginica
#> 139 6.0 3.0 4.8 1.8 virginica
#> 140 6.9 3.1 5.4 2.1 virginica
#> 141 6.7 3.1 5.6 2.4 virginica
#> 142 6.9 3.1 5.1 2.3 virginica
#> 143 5.8 2.7 5.1 1.9 virginica
#> 144 6.8 3.2 5.9 2.3 virginica
#> 145 6.7 3.3 5.7 2.5 virginica
#> 146 6.7 3.0 5.2 2.3 virginica
#> 147 6.3 2.5 5.0 1.9 virginica
#> 148 6.5 3.0 5.2 2.0 virginica
#> 149 6.2 3.4 5.4 2.3 virginica
#> 150 5.9 3.0 5.1 1.8 virginica
Now that we have created our list of data frames, we can apply the “cmahalanobis()” function to it. Then, we use a call of the following form:
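# Apply cmahalanobis() to the list of species data frames; this call is a
# plausible reconstruction that matches the output below. If the function
# requires purely numeric columns, the Species factor can be dropped first,
# e.g. iris_list <- lapply(iris_list, function(df) df[, -5])
distances <- cmahalanobis(iris_list, plot = TRUE, p.value = TRUE)
distances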
#> $distances
#> [,1] [,2] [,3]
#> [1,] 3.9200 335.19989 727.42056
#> [2,] 107.1736 3.92000 26.71618
#> [3,] 171.7689 16.88654 3.92000
#>
#> $p_values
#> [,1] [,2] [,3]
#> [1,] NA 2.748687e-71 4.023276e-156
#> [2,] 2.915001e-22 NA 2.268568e-05
#> [3,] 4.363119e-36 2.033555e-03 NA
This code shows the Mahalanobis distance between each pair of iris species in the iris data frame: setosa, versicolor, and virginica. The main diagonal of the matrix shows the distance between each species and itself, which is always 3.92. The off-diagonal elements give the Mahalanobis distance between different species. For example, the distance between setosa and versicolor is 335.19989, whereas the distance between virginica and versicolor is 16.88654. This means that versicolor and virginica are more similar to each other than either is to setosa, according to the variables measured in the iris data frame. Each row, from 1 to 3, reports the distance between the mean of that row’s group and the observations of each group in the columns. For example, element [1,2] represents the distance between the mean of group 1 and the observations of group 2; element [3,2] represents the distance between the mean of group 3 and the observations of group 2, and so on.
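The diagonal value 3.92 is consistent with a known identity: if each diagonal entry is the average squared Mahalanobis distance of a group’s $n$ observations from their own mean, computed with the group’s sample covariance matrix $S$, then $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{\top} S^{-1}(x_i - \bar{x}) = \frac{p(n-1)}{n}$, which for the iris species ($p = 4$ variables, $n = 50$ observations per group) gives $4 \cdot 49/50 = 3.92$.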
Figure 1 shows the same results as those for the matrix, but graphically.
The p-value matrix that we obtained by applying our package to the iris dataset provides statistically meaningful information about the Mahalanobis distances between the groups of data. We calculated the p-values using the chi-squared distribution. Here is what the p-values show:
P-value under 0.05: A very small p-value (such as those we see in our matrix) indicates that the Mahalanobis distance between two groups is statistically significant. In other words, the compared groups are significantly different from each other in terms of their measured characteristics.
NA (Not Applicable): NA values on the diagonal indicate that no p-value has been calculated for the comparison of a group with itself, because such a p-value would be meaningless.
In summary, our p-value matrix suggests that all three groups in the ‘iris’ dataset are significantly different from each other based on the provided measurements. This is consistent with what one would expect from the ‘iris’ dataset, which contains measurements of three different species of iris flowers (setosa, versicolor, and virginica), each with distinctive characteristics.
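For illustration, the chi-squared calculation can be sketched as follows; this assumes the squared distance is compared with a chi-squared distribution whose degrees of freedom equal the number of variables, an assumption that reproduces the values in the p-value matrix above:

# Sketch of the p-value for one entry of the iris distance matrix:
# upper tail of a chi-squared distribution with df = number of variables (4)
D2 <- 26.71618   # distance between the mean of versicolor and the virginica data
pchisq(D2, df = 4, lower.tail = FALSE)
#> [1] 2.268568e-05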
Now we apply the function “cmahalanobis” to the mtcars dataset, using the variable “am”. First, we have to split the data into am = 0 (automatic transmission) and am = 1 (manual transmission):
# Create a data frame containing only the cars with am = 0
auto <- subset(mtcars, am == 0)
# Remove the "am" column (column 9)
auto <- auto[, -9]
# Create a data frame containing only the cars with am = 1
manual <- subset(mtcars, am == 1)
# Remove the "am" column (column 9)
manual <- manual[, -9]
# Create a list with the two groups of cars
groups <- list(auto, manual)
The contents of the list “groups” are:
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs gear carb
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 3 1
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 3 2
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 3 1
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 3 4
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 4 2
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 4 4
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 4 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 3 3
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 3 3
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 3 4
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 3 1
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 3 2
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 3 2
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 3 4
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 3 2
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 4 1
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 4 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 5 8
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 4 2
Finally, we apply a call of the following form:
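# Apply cmahalanobis() to the two transmission groups; a plausible call that
# matches the output below (plot = TRUE produces Figure 2 discussed later)
distances <- cmahalanobis(groups, plot = TRUE, p.value = TRUE)
distances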
#> $distances
#> [,1] [,2]
#> [1,] 9.473684 156.116255
#> [2,] 735.591909 9.230769
#>
#> $p_values
#> [,1] [,2]
#> [1,] NA 2.050145e-28
#> [2,] 1.429549e-151 NA
Figure 2 shows that the distance between the mean of group 1 and its own observations is 9.47; the same distance for group 2 is 9.23. The distance between cars with am = 0 and cars with am = 1 is 156.12 when measured from the mean of the am = 0 group to the observations of the am = 1 group, whereas the distance from the mean of the am = 1 group to the observations of the am = 0 group is 735.59, which is high. Therefore, cars with am = 1 and cars with am = 0 are truly different with respect to the other variables, that is, “mpg”, “cyl”, “disp”, “hp”, “drat”, “wt”, “qsec”, “vs”, “gear”, and “carb”. The p-value matrix, whose values are far below 0.05, confirms this.
Here is an example of a simulation study using the “cmahalanobis” package in R. In this study, we generated three simulated data groups, each with a set of normally distributed variables, and then calculated the Mahalanobis distance between these groups using the “cmahalanobis” function.
# Load cmahalanobis package
library(cmahalanobis)
# Define the number of observations and variables for each group
num_observations <- 100
num_variables <- 5
# We generate three groups of simulated data with normal distribution
set.seed(123) # For the reproducibility of results
group1 <- as.data.frame(matrix(rnorm(num_observations * num_variables),
nrow = num_observations))
group2 <- as.data.frame(matrix(rnorm(num_observations * num_variables),
nrow = num_observations))
group3 <- as.data.frame(matrix(rnorm(num_observations * num_variables),
nrow = num_observations))
# Create a list of three groups of data
groups <- list(group1, group2, group3)
# Calculate Mahalanobis distance with cmahalanobis function
distances <- cmahalanobis(groups, p.value = TRUE)
#> $distances
#> [,1] [,2] [,3]
#> [1,] 4.950000 5.639257 5.567479
#> [2,] 4.722923 4.950000 5.029954
#> [3,] 5.329901 5.783087 4.950000
#>
#> $p_values
#> [,1] [,2] [,3]
#> [1,] NA 0.3429174 0.3506032
#> [2,] 0.4506217 NA 0.4122355
#> [3,] 0.3769584 0.3279009 NA
In this script, the distances matrix shows the Mahalanobis distance between the mean of each group (rows) and the observations of each group (columns). The p-value matrix contains only values above 0.05, so the distances are small and the groups are not significantly different, as expected for groups drawn from the same distribution.
This simulation study can serve as a starting point for more complex analyses, such as group comparisons in biodiversity studies, pattern analysis in multivariate data, or testing the effectiveness of statistical methods on simulated data. Note that, because of the random nature of the data generation, the results would differ on each run if a fixed seed were not set with “set.seed()”.
In this work, I have presented my package “cmahalanobis”, which I created to calculate the Mahalanobis distance between two or more groups of multivariate data. The Mahalanobis distance is a measure of dissimilarity between two vectors of multivariate random variables, based on the covariance matrix. This distance is useful for statistical matching or statistical data fusion, i.e., integrating two data sources that refer to the same target population and share some variables. My main goal was to compare the Mahalanobis distances between different types of data and to explore patterns and relationships in the data.

To create the “cmahalanobis” package, I used the programming language R and followed the guidelines for writing R packages provided by CRAN. I implemented the function “cmahalanobis”, which takes a list of data frames as input and returns a matrix of Mahalanobis distances between the data frames, a plot, and a matrix of p-values. I also created the documentation for the package using the “roxygen2” package.

To illustrate the function “cmahalanobis”, I used three datasets: iris, mtcars, and a simulated dataset. The iris dataset contains measurements of sepal and petal length and width for three species of iris. The mtcars dataset contains the characteristics of 32 cars, including the type of transmission. The results showed that the Mahalanobis distance between different types of data varied depending on the variables considered and the direction of comparison. For example, in the iris dataset, the highest distance was between the mean of setosa and the observations of virginica, while the lowest off-diagonal distance was between the mean of virginica and the observations of versicolor. In the mtcars dataset, the Mahalanobis distance between cars with automatic and manual transmission was large, indicating a strong dissimilarity between the two types of cars based on the other measured variables. The simulation study showed how the package behaves on random data. My conclusion is that the Mahalanobis distance is a useful metric for statistical matching or statistical data fusion because it accounts for variable correlations and the ellipsoidal shape of the data.