Multiple Imputation via Chained Equations (MICE)
I think it might be a wise idea to learn about MICE (since I already have knowledge of multiple imputation) just in case I encounter a missing data problem in the future that I can’t easily handle with inverse probability weighting (IPW), e.g. several variables with missing values, systematically missing data, etc.
As a caveat (and something that took me way too long to learn), missing data methods are primarily meant to reduce bias (especially in a missing at random scenario) while keeping the uncertainty (variance) in line with the amount of data actually observed, i.e. imputation adds no new information, so it cannot give us more statistical power than the observed data support.
The setup
Imagine a simple scenario with three variables \((C_1, C_2, B_1)\), where C is continuous and B is binary, with some degree of missingness in all three (not systematic, or else that would be MNAR I think). We will look at one imputed dataset, but these steps need to be done for \(m\) copies of the original data (as I understand it, a common rule of thumb is to set \(m\) to at least the percentage of incomplete cases, rounded up). Before we get started, each variable gets a temporary imputation for every missing value. For continuous data, the mean is a solid choice. For binary, I would also think the mean (i.e. the proportion of 1’s), but I am not entirely sure; a random draw from the observed values is another common starting point and keeps the values in \(\{0, 1\}\).
The following steps should be done for each feature with missing data, using \(C_1\) as the example.
- Get rid of the temporary imputations for \(C_1\)
- Sample rows with replacement (a bootstrap) from the data where \(C_1\) is observed (I think? This is kind of what happens in Multiple Imputation too)
- Fit a linear regression model \(C_1 \sim C_2 + B_1\) on that sample to be the imputation model for \(C_1\) for this imputed dataset. Notice that \(C_2\) and \(B_1\) still have the temporary imputations.
- Use the fitted model to predict the missing \(C_1\) values; these predictions take the place of the temporary imputations.
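The steps above for \(C_1\) can be sketched in Python with NumPy. This is only a rough illustration under my own assumptions, not how any particular MICE package does it: the toy data, the 20% missingness rate, and the helper name `update_c1` are all made up, and adding residual noise to the predictions is my choice (so the imputed values are not all sitting exactly on the regression line).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: columns are C_1, C_2 (continuous) and B_1 (binary).
n = 200
c2 = rng.normal(size=n)
b1 = (rng.random(n) < 0.4).astype(float)
c1 = 1.0 + 2.0 * c2 - 1.5 * b1 + rng.normal(scale=0.5, size=n)
data = np.column_stack([c1, c2, b1])

# Assumed missingness: each cell is missing with probability 0.2 (NaN = missing).
missing = rng.random(data.shape) < 0.2
data[missing] = np.nan

# Temporary imputations: the observed mean of each column.
filled = np.where(missing, np.nanmean(data, axis=0), data)


def update_c1(filled, missing, rng):
    """One chained-equation update for column 0 (C_1)."""
    obs = ~missing[:, 0]
    # Predictors: intercept, C_2, B_1 (both still hold temporary imputations).
    X = np.column_stack([np.ones(len(filled)), filled[:, 1], filled[:, 2]])
    y = filled[:, 0]

    # Bootstrap the rows where C_1 was actually observed.
    obs_idx = np.flatnonzero(obs)
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)

    # Fit C_1 ~ C_2 + B_1 by least squares on the bootstrap sample.
    beta, *_ = np.linalg.lstsq(X[boot], y[boot], rcond=None)
    resid = y[boot] - X[boot] @ beta
    sigma = np.sqrt(resid @ resid / (boot.size - X.shape[1]))

    # Overwrite the temporary C_1 values with predictions plus residual noise.
    out = filled.copy()
    n_mis = int((~obs).sum())
    out[~obs, 0] = X[~obs] @ beta + rng.normal(scale=sigma, size=n_mis)
    return out


updated = update_c1(filled, missing, rng)
```

Only the missing entries of \(C_1\) change; the observed \(C_1\) values and the other two columns are left alone until their own turn comes.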
Iterate this process through the three variables. The reason why we call this “chained equations” is that the imputed values for \(C_1\) are used as predictors when imputing the remaining variables, replacing the temporary imputations. One pass over all three variables is a single cycle, and the cycle is usually repeated several times so the imputations settle down; the end result is one imputed dataset. Repeating the whole procedure \(m\) times gives us \(m\) imputed datasets. We then fit the same statistical model to each one and combine the parameter estimates via Rubin’s rules to get our final parameter estimate. Hooray!
I think this is what people mean when they say MICE uses correlations between variables to predict missing values rather than relying on a complete-case analysis, provided the data are MAR. (MICE is itself one way of doing MI, not an alternative to it.) It’s an iterative process.
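For the pooling step, here is a small sketch of Rubin’s rules for a single parameter, using only the Python standard library. The function name `pool_rubin` and the example numbers are made up for illustration; in practice the inputs would be the estimate and its squared standard error from the same model fitted to each of the \(m\) imputed datasets.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)   # pooled point estimate
    w = mean(variances)       # within-imputation variance
    b = variance(estimates)   # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b   # total variance of the pooled estimate
    return q_bar, t


# Hypothetical estimates and squared standard errors from m = 5 imputed datasets.
est = [1.9, 2.1, 2.0, 2.2, 1.8]
var = [0.04, 0.05, 0.04, 0.06, 0.05]
pooled, total_var = pool_rubin(est, var)
```

The between-imputation term is where the extra uncertainty from not knowing the missing values shows up, which is why the pooled variance is larger than the average of the per-dataset variances.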
Some confusions I need to square away
Since I am not doing much with MICE (and I doubt I will), I have some questions that I still need to get answered. Not a huge priority, but it’s there for whenever I pick this back up!
- The output of MI or MICE or whatever is \(m\) imputed datasets. We don’t immediately need to fit a model. I think I knew this, but I had some confusion about this over the summer… oops.
- Does 80% missing data mean that one column can have 80% missing data, or that 80% of observations in the overall data can have some missing data? I don’t think it can be the first one because MICE with one variable is essentially the same as MI, which can only go up to 80% in very specific scenarios.
- Is there a specific order in which to impute the variables? Least to most missing data?
- What about other types of variables? Do we just use GLMs as the imputation model? Also, is the temporary choice for imputation always the mean? Does it matter?