Missing data are a common issue in many fields of empirical research. An popular approach to handling missing data is the method of multiple imputation (MI). Multiple imputation involves replacing missing values by a number of imputations, creating multiple imputed datasets. Each completed dataset is then analysed as usual, and estimates and standard errors are combined across imputations using rules developed by Rubin.

The most popular approach to imputation uses parametric models for the missing variables given the observed. Multiple imputation gives valid inferences provided that the missing data satisfy the so called missing at random (MAR) assumption and that the imputation models used are correctly specified.

When multiple variables are affected by missingness, the traditional approach to imputation is to specify a joint (or multivariate model) for the partially observed variables. One of the earliest examples of this was MI using the multivariate normal model. Rather than specifying a joint model directly, a popular alternative is the fully conditional specification (FCS), or chained equations approach. In FCS MI separate conditional models are specified for each partially observed variable. In each of these conditional models, by default all of the variables serve as predictors. For an overview of the FCS MI and an implementation of it in R, see van Buuren and Groothuis-Oudshoorn.

When missing values are imputed from a misspecified model, in general
invalid inferences will result. One way in which misspecification can
occur is when the imputation and substantive (analysis) model of
interest are incompatible. Loosely speaking, this means there exists no
joint model which contains the imputation model and the substantive
model as the corresponding conditionals. In this case, as described by
Bartlett *et
al* (2015), assuming that the substantive model is correctly
specified, unless the imputation and substantive models can be made
compatible by imposing a restriction on the imputation model,
incompatibility implies the imputation model is misspecified.

Such incompatibility between the imputation model used to impute a partially observed covariate and the substantive/outcome model can arise for example when the latter includes interactions or non-linear effects of variables. A further example is when the substantive model is a Cox proportional hazards model for a censored time to event outcome. In these cases, it may be difficult or impossible to specify an imputation model for a covariate which is compatible with the model for the outcome (the substantive model) using standard imputation models as available in existing packages.

The substantive model compatible modification of FCS MI (SMC-FCS),
proposed by Bartlett
*et al* (2015), ensures that each partially observed variable
is imputed from an imputation model which is compatible with a user
specified model for the outcome (which is typically the substantive
model of interest, although see below regarding auxiliary variables). As
described in further detail in the linked paper, for each partially
observed variable, e.g. `x1`

, in SMC-FCS a model is specified
for the conditional distribution of `x1`

given the other
partially observed variables `x2,x3,..,xp`

and fully observed
covariates `z`

. This, together with the specified substantive
model (a model for the outcome `y`

) defines an imputation
model for `x1`

which is guaranteed to be compatible with this
specified substantive model.

Unfortunately, the resulting imputation model for each partially
observed variable generally does not belong to a standard parametric
family, complicating the imputation of missing values. To overcome this,
`smcfcs`

uses the method of rejection sampling, which is more
computationally intensive than direct sampling methods.

SMC-FCS ensures compatibility between each partially observed
covariate’s imputation model with the substantive model. However, when
there is more than one partially observed variable, it does not
guarantee that the corresponding different imputation models are
mutually compatible. Consequently, as described further by
Bartlett *et
al* (2015), only in special cases does SMC-FCS generate
imputations from a well defined Bayesian joint model. Nonetheless, by
ensuring compatibility between each partially observed variable’s
imputation model and the substantive model, it arguably overcomes
(compared to standard FCS MI) the type of model incompatibility which is
most likely to adversely affect inferences.

In certain situations it may be advantageous to use SMC-FCS rather
than traditional FCS MI. Important examples, as mentioned previously,
include situations where the substantive (outcome) model includes
interactions or non-linear effects of some of the covariates, or where
the outcome model is itself non-linear, such as a Cox proportional
hazards model. See
Bartlett *et
al* (2015) for simulation results comparing the two approaches
in these situations.

`smcfcs`

packageThe `smcfcs`

function in the `smcfcs`

package
implements the SMC-FCS procedure. Currently linear, logistic and Cox
proportional hazards substantive models. Competing risks outcome data
can also be accommodated, with a Cox proportional hazards model used to
model each cause specific hazard function. Partially observed variables
can be imputed using normal linear regression, logistic regression (for
binary variables), proportional odds regression (sometimes known as
ordinal logistic regression, suitable for ordered categorical
variables), multinomial logistic regression (for unordered categorical
variables), and Poisson regression (for count variables). In the
following we describe some of the important aspects of using
`smcfcs`

by way of an example data frame.

To illustrate the package, we use the simple example data frame
`ex_linquad`

, which is included with the package. This data
frame was simulated for `n=1000`

independent rows. For each
row, variables `y,x,z,v`

were intended to be collected, but
there are missing values in `x`

. The values have been made
artificially missing, with the probability of missingness dependent on
(the fully observed) `y`

variable. Below the first 10 rows of
the data frame are shown:

```
## y z x xsq v
## 1 -0.3404639 -1.2053334 -1.2070657 1.45700772 -2.18088437
## 2 2.1699185 0.3014667 0.2774292 0.07696698 0.17779805
## 3 2.0293128 -1.5391452 1.0844412 1.17601267 0.97370618
## 4 6.6311247 0.6353707 -2.3456977 5.50229771 -1.15350311
## 5 3.9096291 0.7029518 0.4291247 0.18414800 -1.22676124
## 6 -0.5019313 -1.9058829 NA NA -0.53958740
## 7 0.5816303 0.9389214 NA NA -2.31497909
## 8 1.0236009 -0.2244921 NA NA -0.03351108
## 9 -1.2942170 -0.6738168 NA NA -1.01040885
## 10 1.9041271 0.4457874 -0.8900378 0.79216734 -2.72923160
```

As shown, the `xsq`

variable is equal to the square of the
`x`

variable. Since the latter has missing values, so does
the former. We now impute the missing values in `x`

and
`xsq`

, compatibly with a substantive model for the outcome
`y`

which is specified as a linear regression, with
`z`

, `x`

and `xsq`

as covariates:

```
set.seed(123)
# impute missing values in x, compatibly with quadratic substantive model
imps <- smcfcs(ex_linquad, smtype = "lm", smformula = "y~z+x+xsq", method = c("", "", "norm", "x^2", ""))
```

```
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation 1"
## [1] "Imputing: x using z plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
```

As demonstrated here, the minimal arguments to pass to
`smcfcs`

are the data frame to be used, the substantive model
type, the substantive model formula, and a method vector. The
substantive model type specifies whether the model for the outcome is
linear, logistic or Cox regression, or a competing risks analysis (see
documentation). The `smformula`

specifies the linear
predictor of the substantive/outcome model. Here we specified that the
outcome `y`

is assumed to follow a linear regression model,
with `z`

, `x`

and `xsq`

as
predictors.

Lastly, we passed a vector of strings as the `method`

argument. This specifies, for each column in the data frame, the method
to use for imputation. As in the example, empty strings should be passed
for those columns which are fully observed and thus are not to be
imputed. For `x`

we specify `norm`

, in order to
impute using a normal linear regression model. See the help for
`smcfcs`

for the syntax for other imputation model types. For
`xsq`

we specify `"x^2"`

as the imputation method.
This instructs `smcfcs`

to impute `xsq`

by simply
squaring the imputed values of `x`

. Such a specification
could also be used with the `mice`

package, which implements
standard FCS MI. Note however that here, through specifying the
substantive model as including an effect of `xsq`

,
`smcfcs`

is imputing the missing values in `x`

which allows for a quadratic effect on `y`

.

Having generated the imputed datasets, we can now fit our substantive
model of interest. Here we make use of the `mitools`

package
to fit our substantive model to each imputed dataset, collect the
results, and combine them using Rubin’s rules:

```
# fit substantive model
library(mitools)
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ z + x + xsq))
summary(MIcombine(models))
```

```
## Multiple imputation results:
## with(impobj, lm(y ~ z + x + xsq))
## MIcombine.default(models)
## results se (lower upper) missInfo
## (Intercept) 0.9516536 0.04140617 0.8701207 1.033187 13 %
## z 1.0052919 0.03394341 0.9386482 1.071936 8 %
## x 0.9881325 0.03831183 0.9105610 1.065704 36 %
## xsq 1.0366961 0.02365518 0.9897408 1.083651 22 %
```

Here the data were simulated such that the coefficients of
`z`

, `x`

and `xsq`

are all 1. The
estimates we have obtained are (reassuringly) close to these true
parameter values. To illustrate the dangers of imputing a covariate
using an imputation model which is not compatible with the substantive
model, we now re-impute `x`

, but this time imputing
compatibly with a model for `y`

which does not allow for the
quadratic effect:

```
# impute missing values in x, compatibly with model for y which omits the quadratic effect
imps <- smcfcs(ex_linquad, smtype = "lm", smformula = "y~z+x", method = c("", "", "norm", "x^2", ""))
```

```
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation 1"
## [1] "Imputing: x using z plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
```

`smcfcs`

has issued some warnings about rejection
sampling. We discuss this later in this vignette, but here we will
continue and proceed to fit a model for `y`

which includes
both `x`

and `xsq`

(plus `z`

) as
covariates:

```
# fit substantive model
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ z + x + xsq))
summary(MIcombine(models))
```

```
## Multiple imputation results:
## with(impobj, lm(y ~ z + x + xsq))
## MIcombine.default(models)
## results se (lower upper) missInfo
## (Intercept) 1.2643033 0.08816139 1.0768640 1.4517425 56 %
## z 1.0608663 0.05966953 0.9412078 1.1805248 30 %
## x 0.8667122 0.06640497 0.7221981 1.0112263 63 %
## xsq 0.5599486 0.07246169 0.3777946 0.7421025 89 %
```

Now we have an estimate of the coefficient of `xsq`

of
0.60, which is considerably smaller than the true value 1 used to
simulate the data. This bias is due to the imputation model we have just
used for `x`

being misspecified. In particular, it was
misspecified due to the fact it wrongly assumed a linear dependence of
`y`

on `x`

, rather than allowing a quadratic
dependence.

`smcfcs`

One of the strengths of multiple imputation in general is the possibility to use variables in imputation models which are subsequently not involved in the substantive model. This may be useful in order to condition or adjust for variables which are predictive of missingness, but which are not used in the substantive model of interest. Moreover, adjusting for auxiliary variables which are strongly correlated with one or more variables which are being imputed improves efficiency.

When using `smcfcs`

to impute missing covariates,
auxiliary variables `v`

can be included by adding them as an
additional covariate in the substantive model, as passed using the
`smformula`

argument. Here we are imputing `x`

compatibly with a certain specification of model for the outcome. Our
substantive model of interest is then a simpler model which omits
`v`

. For example, in the quadratic example dataset, we can
add the auxiliary variable `v`

using:

```
# impute, including v as a covariate in the substantive/outcome model
imps <- smcfcs(ex_linquad, smtype = "lm", smformula = "y~z+x+xsq+v", method = c("", "", "norm", "x^2", ""))
```

```
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z,v"
## [1] "Imputation 1"
## [1] "Imputing: x using z,v plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
```

```
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 1 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
```

```
# fit substantive model, which omits v
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ z + x + xsq))
summary(MIcombine(models))
```

```
## Multiple imputation results:
## with(impobj, lm(y ~ z + x + xsq))
## MIcombine.default(models)
## results se (lower upper) missInfo
## (Intercept) 0.9240718 0.04254518 0.8397715 1.008372 20 %
## z 1.0268391 0.03383699 0.9603464 1.093332 10 %
## x 0.9919787 0.03875471 0.9130131 1.070944 39 %
## xsq 1.0377357 0.02230502 0.9938523 1.081619 12 %
```

For outcome models other than linear regression, this approach is not
entirely justifiable due to the lack of collapsibility of non-linear
models. For example, if a Cox model is assumed for a failure time given
variables `x`

and `v`

, the hazard function given
only `x`

(i.e. omitting `v`

from the model) is no
longer a Cox model. Further research is warranted to explore how this
might affect the resulting inferences.

It is also possible to include the auxiliary variable `v`

without adding it to the outcome model (as given in the
`smformula`

argument), through specification of the
`predictorMatrix`

argument. Doing so conditions on
`v`

, but assumes that the outcome is independent of
`v`

, conditional on whatever covariates are specified in
`smformula`

. This should thus only be used when the latter
assumption is justified. When it is, inferences will in general be more
efficient. To make this assumption when imputing `x`

in the
`ex_linquad`

data, we define a `predictorMatrix`

which will specify that `x`

be imputed using both
`z`

and `v`

, but we omit `v`

from the
`smformula`

argument:

```
predMatrix <- array(0, dim = c(ncol(ex_linquad), ncol(ex_linquad)))
predMatrix[3, ] <- c(0, 1, 0, 0, 1)
imps <- smcfcs(ex_linquad, smtype = "lm", smformula = "y~z+x+xsq", method = c("", "", "norm", "x^2", ""), predictorMatrix = predMatrix)
```

```
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation 1"
## [1] "Imputing: x using z,v plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
```

```
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ z + x + xsq))
summary(MIcombine(models))
```

```
## Multiple imputation results:
## with(impobj, lm(y ~ z + x + xsq))
## MIcombine.default(models)
## results se (lower upper) missInfo
## (Intercept) 0.9411917 0.04212991 0.8579961 1.024387 17 %
## z 1.0159744 0.03395149 0.9492893 1.082660 9 %
## x 0.9962831 0.03602444 0.9244417 1.068124 26 %
## xsq 1.0394828 0.02227231 0.9957422 1.083223 8 %
```

Sometimes when running `smcfcs`

you may receive warnings
that the rejection sampling that `smcfcs`

uses has failed to
draw from the required distribution on a couple of occasions. Upon
receiving this warning, it is generally good idea to re-run
`smcfcs`

, specifying a value for `rjlimit`

which
is larger than the default, until the warning is no longer issued.
Having said that, when only a small number of warnings are issued, it
may be fine to ignore the warnings, especially when the dataset is
large.

Like standard chained equations or FCS imputation, the SMC-FCS
algorithm must be run for a sufficient number of iterations for the
process to converge to its stationary distribution. The default number
of iterations used is 10, but this may not be sufficient in any given
dataset and model specification To assess convergence, the object
returned by `smcfcs`

includes an object called
`smCoefIter`

. This matrix contains the parameter estimates of
the substantive model, and is indexed by imputation number, parameter
number, and iteration number. To assess convergence, one can call smcfcs
with `m=1`

and `numit`

suitably chosen
(e.g. `numit=100`

). The values in the resulting smCoefIter
matrix can then be plotted to assess convergence. To illustrate, we
re-run the imputation model used previously with the example data, but
asking for only `m=1`

imputation to be generated, and with
100 iterations.

```
# impute once with a larger number of iterations than the default 10
imps <- smcfcs(ex_linquad, smtype = "lm", smformula = "y~z+x+xsq", method = c("", "", "norm", "x^2", ""), predictorMatrix = predMatrix, m = 1, numit = 100)
```

```
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation 1"
## [1] "Imputing: x using z,v plus outcome"
```

```
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 1 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
```

The plot shows that the process appears to converge rapidly, such
that the default choice of `numit=10`

is probably fine
here.

Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Statistical Methods in Medical Research, 2015; 24(4):462-487

van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 2011; 45(3)