Multiple Imputation of Missing Data: A Simulation Study on a Binary Response
Key takeaways
With n = 200, the authors would not recommend imputing more than 40% of the data. They do not simulate datasets larger than 200 cases, but generally speaking, larger datasets should tolerate a higher proportion of missing data.
Bibliography: Hardt, J., Herke, M., Brian, T., Laubach, W., 2013. Multiple Imputation of Missing Data: A Simulation Study on a Binary Response. Open Journal of Statistics 3, 370–378. https://doi.org/10.4236/ojs.2013.35043
Authors:: Jochen Hardt, Max Herke, Tamara Brian, Wilfried Laubach
Collections:: Methods
First-page:
Currently, a growing number of programs become available in statistical software for multiple imputation of missing values. Among others, two algorithms are mainly implemented: Expectation Maximization (EM) and Multiple Imputation by Chained Equations (MICE). They have been shown to work well in large samples or when only small proportions of missing data are to be imputed. However, some researchers have begun to impute large proportions of missing data or to apply the method to small samples. A simulation was performed using MICE on datasets with 50, 100 or 200 cases and four or eleven variables. A varying proportion of data (3% - 63%) was set as missing completely at random and subsequently substituted using multiple imputation by chained equations. In a logistic regression model, four coefficients, i.e. non-zero and zero main effects as well as non-zero and zero interaction effects were examined. Estimations of all main and interaction effects were unbiased. There was a considerable variance in the estimates, increasing with the proportion of missing data and decreasing with sample size. The imputation of missing data by chained equations is a useful tool for imputing small to moderate proportions of missing data. The method has its limits, however. In small samples, there are considerable random errors for all effects.
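A minimal sketch of the simulation design described in the abstract, assuming scikit-learn's IterativeImputer as a stand-in for MICE (IterativeImputer is chained-equations-style but yields one imputed dataset per fit; the covariate distributions, coefficients, and missingness fraction below are illustrative assumptions, not the paper's values):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200  # sample sizes in the paper: 50, 100, 200

# Four covariates and a binary response from a logistic model with one
# non-zero main effect, one zero main effect, and one interaction.
X = rng.normal(size=(n, 4))
logit = 1.0 * X[:, 0] + 0.0 * X[:, 1] + 0.5 * X[:, 0] * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Set a fraction of covariate values missing completely at random (MCAR);
# the paper varies this from 3% to 63%.
p_miss = 0.20
X_miss = np.where(rng.random(X.shape) < p_miss, np.nan, X)

# Chained-equations-style imputation, then refit the logistic regression.
X_imp = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(X_miss)
design = sm.add_constant(np.column_stack([X_imp, X_imp[:, 0] * X_imp[:, 1]]))
print(sm.Logit(y, design).fit(disp=0).params)
```

For a proper multiple imputation, this fit-and-estimate step would be repeated m times with different random states and the coefficients pooled (see the pooling sketch further down).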
content: "@hardtMultipleImputationMissing2013" -file:@hardtMultipleImputationMissing2013
Reading notes
Annotations
(08/05/2024, 21:07:38)
“The proportion of missing values varies widely between and within studies, ranging from almost zero to far above 50% for some variables in some studies.” (Hardt et al., 2013, p. 370)
“While multiple imputations were originally developed for larger datasets with small proportions of missing data, i.e. in data for public use, [e.g. 6,7] today they are also considered for application in moderate to small samples of n = 100 to n = 20 [8,9], or when the rates of missing data are extremely high [up to 95%: 10].” (Hardt et al., 2013, p. 370)
“currently it is still unclear how large a sample needs to be so that these advantages become apparent, how much missing data can be substituted, and how far complex estimates, such as coefficients for interaction terms, are affected by the substitution.” (Hardt et al., 2013, p. 370)
“Methods for single substitution replace the missing data with the mean (or mode), conditional mean or other prognostic equations. They have been shown to bias regression coefficients and to underestimate the variances [e.g. 12,13].” (Hardt et al., 2013, p. 371)
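A quick numeric illustration of the variance underestimation under mean substitution (the data and 30% missingness fraction are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=1000)

# Knock out 30% of values completely at random.
x_miss = x.copy()
x_miss[rng.random(x.size) < 0.30] = np.nan

# Mean substitution: every missing value becomes the observed mean,
# so the imputed column's variance shrinks by roughly the missing fraction.
x_mean_imp = np.where(np.isnan(x_miss), np.nanmean(x_miss), x_miss)

print(np.var(x), np.var(x_mean_imp))  # second value is ~30% smaller
```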
“in datasets with very small proportions of missing data, e.g. less than 5% per variable, they may perform well enough for most practical applications in the social sciences and medicine [5,14,15].” (Hardt et al., 2013, p. 371)
“Hot-deck imputations substitute every missing datum with the nearest observed value of a neighbour—they vary based on how the latter is defined” (Hardt et al., 2013, p. 371)
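Nearest-neighbour hot-deck can be approximated with scikit-learn's KNNImputer using a single neighbour (a rough stand-in: classical hot-deck copies a donor's observed value, which one neighbour reproduces, while k > 1 would average donors):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [1.1, np.nan],
              [8.0, 9.0]])

# With one neighbour, each missing entry is copied from the closest case
# on the observed variables (here, row 0 donates 2.0 to row 1).
imputer = KNNImputer(n_neighbors=1)
print(imputer.fit_transform(X))
```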
“Rubin [6, p. 114] suggested that a number of m = 3 data-sets with imputed missing data serve well for most purposes but recent studies have sometimes suggested that more imputed datasets may perform better [20,21]. Meng (1995) recommended creating 30 datasets.” (Hardt et al., 2013, p. 371)
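Whatever m is chosen, results from the m imputed datasets are combined via Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance adds the between-imputation variance B (inflated by 1/m) to the within-imputation variance W. A minimal sketch, with function and variable names of my own:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine one coefficient across m imputed datasets via Rubin's rules.

    estimates : length-m array, the coefficient from each imputed dataset
    variances : length-m array, its squared standard error from each dataset
    """
    m = len(estimates)
    q_bar = np.mean(estimates)       # pooled point estimate
    w = np.mean(variances)           # within-imputation variance
    b = np.var(estimates, ddof=1)    # between-imputation variance
    t = w + (1 + 1 / m) * b          # total variance
    return q_bar, np.sqrt(t)         # estimate and pooled standard error
```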
“It can be seen that the simulation revealed generally unbiased estimates. However, the more missing data were introduced, the lower the precision of the estimates became. The distribution of the estimated coefficients was approximately normal. When 63% of data were missing, one would probably prefer not to perform any analysis with data from the present condition, the standard deviation lies around one then. This means that about one third of the observed regression coefficients would be much lower or higher than the true value.” (Hardt et al., 2013, p. 374)
“Regarding a main effect with a sample size of n = 200, we would recommend substituting no more than 40% of the missing data under the conditions simulated here.” (Hardt et al., 2013, p. 374)
“With n = 100, the simulation results became completely imprecise when 42% of the data were missing, and we would suggest the following limit: imputations should be performed with a maximum of missing data of about 30%. With n = 50, the breakdown occurred at 33%, and whether to substitute if more than 20% of the data are missing in such a small sample should be considered carefully.” (Hardt et al., 2013, p. 374)
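The two annotations above amount to rough rules of thumb by sample size. A tiny helper encoding them (the cutoffs are the paper's recommendations; the function itself is my own framing):

```python
def max_missing_fraction(n):
    """Rough ceiling on the proportion of missing data still reasonable
    to impute, per Hardt et al. (2013, p. 374)."""
    if n >= 200:
        return 0.40
    if n >= 100:
        return 0.30
    if n >= 50:
        return 0.20
    return 0.0  # below n = 50 the paper offers no guidance

print(max_missing_fraction(150))  # 0.3
```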
“In the present simulations, it resulted in unbiased estimates even when relatively large proportions of the data were missing, this held true for non-zero and zero coefficients, and for main effects as well as for interactions. However, depending on the proportion of missing data, a large amount of random variance became visible. Only a small amount of this variance stemmed from the imputation, the major part came from the missing data themselves (data not shown).” (Hardt et al., 2013, pp. 375–376)
“The latter cannot be controlled by the researcher, and can lead to severe misinterpretation of the data. This is particularly the case for interaction effects, and the probability of estimating them precisely decreases drastically when large amounts of data are missing—and the one for finding false positive effects increases.” (Hardt et al., 2013, p. 376)
“However, including too many auxiliary variables leads to over-parameterization, which will lead to a situation where all associations become biased downwards.” (Hardt et al., 2013, p. 376)