An R package for random-forest-empowered imputation of missing data

`RfEmpImp` is an R package for multiple imputation using chained random forests (RF).

This R package provides prediction-based and node-based multiple
imputation algorithms using random forests, and currently operates under
the multiple imputation computation framework `mice`.

For more details on the implemented imputation algorithms, please refer
to arXiv:2004.14823 (further updates soon).

Users can install the released version of `RfEmpImp` from CRAN, or the
latest development version of `RfEmpImp` from GitHub:

```
# Install from CRAN
install.packages("RfEmpImp")
# Install from GitHub online
if(!"remotes" %in% installed.packages()) install.packages("remotes")
remotes::install_github("shangzhi-hong/RfEmpImp")
# Install from released source package
install.packages(path_to_source_file, repos = NULL, type = "source")
# Attach
library(RfEmpImp)
```

For data with mixed types of variables, users can call the function
`imp.rfemp()` to use the `RfEmp` method, which applies the `RfPred.Emp`
method to continuous variables and the `RfPred.Cate` method to
categorical variables (of type `logical` or `factor`, etc.).

Starting with version `2.0.0`, the names of parameters were further
simplified; please refer to the documentation for details.

For continuous variables, the `RfPred.Emp` method uses the empirical
distribution of the random forest’s out-of-bag prediction errors when
constructing the conditional distributions of the variable under
imputation, providing conditional distributions with better quality.
Users can set `method = "rfpred.emp"` in the call to `mice` to use it.
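The idea behind `RfPred.Emp` can be illustrated with a minimal sketch that calls `ranger` directly. This is a simplified illustration under stated assumptions, not the package's actual implementation: the simulated data, the variable names (`dat`, `mis`, `oobErr`) and the choice of 100 trees are all hypothetical.

```r
library(ranger)

set.seed(2020)
n <- 200
# Hypothetical simulated data: y depends on x1 and x2
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

# Assumption for illustration: treat the last 20 values of y as missing
mis <- seq_len(n) > 180
obs <- !mis

# Fit a regression forest on the observed rows only
rf <- ranger(y ~ x1 + x2, data = dat[obs, ], num.trees = 100)

# For a fitted regression forest, rf$predictions holds out-of-bag predictions,
# so these differences are out-of-bag prediction errors
oobErr <- dat$y[obs] - rf$predictions
oobErr <- oobErr[!is.na(oobErr)]  # drop rows that were never out-of-bag

# Impute: point prediction plus an error drawn from the empirical OOB errors
pred <- predict(rf, data = dat[mis, ])$predictions
imputed <- pred + sample(oobErr, size = sum(mis), replace = TRUE)
```

Sampling from the observed errors, rather than from a fitted normal distribution, lets the imputed values inherit any skewness or heavy tails present in the forest's errors.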

Also, the `RfPred.Norm` method assumes normality for RF prediction
errors, as proposed by Shah *et al.*, and users can set
`method = "rfpred.norm"` in the call to `mice` to use it.
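As a minimal sketch, the method name can be passed straight to `mice()`. The example assumes the `nhanes` data shipped with `mice` (all columns numeric) and picks `m = 5` and `maxit = 5` purely for illustration.

```r
library(mice)
library(RfEmpImp)

# A single method string is applied to every incomplete variable
imp <- mice(nhanes, method = "rfpred.norm", m = 5, maxit = 5,
            printFlag = FALSE)

# Inspect the first completed data set
head(complete(imp, 1))
```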

For categorical variables, the `RfPred.Cate` method uses probability
machine theory, and the predictions of missing categories are based on
the predicted probabilities for each missing observation. Users can set
`method = "rfpred.cate"` in the call to `mice` to use it.
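When a data set mixes continuous and categorical variables, the methods can also be assigned per variable in the call to `mice`. The following is a sketch under the assumption that `bmi` and `chl` are treated as continuous and `hyp` as categorical; the `meth` vector is built with `make.method()` from `mice`.

```r
library(mice)
library(RfEmpImp)

# Convert age and hyp to factors so they are treated as categorical
df <- conv.factor(nhanes, c("age", "hyp"))

# Start from the mice defaults ("" for complete variables), then override
meth <- make.method(df)
meth[c("bmi", "chl")] <- "rfpred.emp"  # continuous variables
meth["hyp"] <- "rfpred.cate"           # categorical variable

imp <- mice(df, method = meth, m = 5, maxit = 5, printFlag = FALSE)
```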

```
# Prepare data
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfemp(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```

For continuous or categorical variables, the observations under the
predicting nodes of random forest are used as candidates for
imputation.

Two methods are now available for the `RfNode` algorithm series.

It should be noted that categorical variables should be of type
`logical` or `factor`, etc.

Users can call the function `imp.rfnode.cond()` to use the `RfNode.Cond`
method, performing imputation using the conditional distribution formed
by the prediction nodes.

The weight changes of observations caused by the bootstrapping of random
forest are considered, and only the “in-bag” observations are used as
candidates for imputation.

Also, users can set `method = "rfnode.cond"` in the call to `mice` to
use it.

Users can call the function `imp.rfnode.prox()` to use the `RfNode.Prox`
method, performing imputation using the proximity matrices of random
forests.

All the observations that fall under the same predicting nodes are used
as candidates for imputation, including the out-of-bag ones.

Also, users can set `method = "rfnode.prox"` in the call to `mice` to
use it.
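As a minimal sketch, both node-based methods can be selected through the `method` argument of `mice()`. The settings `m = 5` and `maxit = 5` are illustrative assumptions, not recommendations.

```r
library(mice)
library(RfEmpImp)

# Convert age and hyp to factors so they are treated as categorical
df <- conv.factor(nhanes, c("age", "hyp"))

# Conditional-distribution variant (in-bag observations as candidates)
impCond <- mice(df, method = "rfnode.cond", m = 5, maxit = 5,
                printFlag = FALSE)

# Proximity variant (all observations in the same predicting nodes)
impProx <- mice(df, method = "rfnode.prox", m = 5, maxit = 5,
                printFlag = FALSE)
```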

```
# Prepare data
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfnode.cond(df)
# Or: imp <- imp.rfnode.prox(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```

| Type | Impute function | Univariate sampler | Variable type |
|---|---|---|---|
| Prediction-based imputation | `imp.rfemp()` | `mice.impute.rfemp()` | Mixed |
| | / | `mice.impute.rfpred.emp()` | Continuous |
| | / | `mice.impute.rfpred.norm()` | Continuous |
| | / | `mice.impute.rfpred.cate()` | Categorical |
| Node-based imputation | `imp.rfnode.cond()` | `mice.impute.rfnode.cond()` | Mixed |
| | `imp.rfnode.prox()` | `mice.impute.rfnode.prox()` | Mixed |
| | / | `mice.impute.rfnode()` | Mixed |

The figure below shows how the imputation functions are organized in
this R package.

Random forests can be compute-intensive on their own, and during the
multiple imputation process, random forest models are built for each
variable containing missing data over a certain number of iterations
(usually 5 to 10), and this is repeated for each imputation performed
(usually 5 to 20). Thus, computational efficiency is of crucial
importance for multiple imputation using chained random forests,
especially for large data sets.

In `RfEmpImp`, the random forest model building process is therefore
accelerated using parallel computation powered by `ranger`.
The `ranger` R package provides support for parallel computation using
native C++. In our simulations, parallel computation provided a
substantial performance boost for the imputation process (about 4x faster
on a quad-core laptop).
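The effect of multithreading can be explored with `ranger` directly through its `num.threads` argument. The sketch below uses hypothetical simulated data and is only an illustration of `ranger`'s threading option, not a reproduction of the simulations mentioned above; actual timings depend on the machine and the data.

```r
library(ranger)

set.seed(1)
n <- 5000
p <- 20
# Hypothetical simulated predictors V1..V20 and a continuous outcome y
X <- as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
X$y <- X$V1 + rnorm(n)

# Fit the same forest with one thread and with two threads
t1 <- system.time(
  ranger(y ~ ., data = X, num.trees = 200, num.threads = 1)
)
t2 <- system.time(
  ranger(y ~ ., data = X, num.trees = 200, num.threads = 2)
)

# Compare elapsed wall-clock time in seconds
print(c(one_thread = t1[["elapsed"]], two_threads = t2[["elapsed"]]))
```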

- Hong, Shangzhi, et al. “Multiple imputation using chained random forests.” Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
- Zhang, Haozhe, et al. “Random forest prediction intervals.” The American Statistician (2019): 1-15.
- Wright, Marvin N., and Andreas Ziegler. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77.i01 (2017).
- Shah, Anoop D., et al. “Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.” American Journal of Epidemiology 179.6 (2014): 764-774.
- Doove, Lisa L., Stef Van Buuren, and Elise Dusseldorp. “Recursive partitioning for missing data imputation in the presence of interaction effects.” Computational Statistics & Data Analysis 72 (2014): 92-104.
- Malley, James D., et al. “Probability machines.” Methods of information in medicine 51.01 (2012): 74-81.
- Van Buuren, Stef, and Karin Groothuis-Oudshoorn. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45.i03 (2011).