The proportion of missing data should not be used to guide decisions on multiple imputation

#Methods #MissingData #MultipleImputation

Key takeaways

With data missing at random (MAR) and 90% missingness, an appropriately specified MI model can achieve a 99.97% reduction in standard error bias compared with complete case analysis (CCA). That comparison means little, though, if both the CCA and MI models are themselves poor.

(file:///C:\Users\scott\Zotero\storage\VKNAG3LU\Madley-Dowd%20et%20al_2019_The%20proportion%20of%20missing%20data%20should%20not%20be%20used%20to%20guide%20decisions%20on.pdf)

Bibliography: Madley-Dowd, P., Hughes, R., Tilling, K., Heron, J., 2019. The proportion of missing data should not be used to guide decisions on multiple imputation. Journal of Clinical Epidemiology 110, 63–73. https://doi.org/10.1016/j.jclinepi.2019.02.016

Authors:: Paul Madley-Dowd, Rachael Hughes, Kate Tilling, Jon Heron

Collections:: Methods

First-page:

Abstract

Objectives: Researchers are concerned whether multiple imputation (MI) or complete case analysis should be used when a large proportion of data are missing. We aimed to provide guidance for drawing conclusions from data with a large proportion of missingness. Study Design and Setting: Via simulations, we investigated how the proportion of missing data, the fraction of missing information (FMI), and availability of auxiliary variables affected MI performance. Outcome data were missing completely at random or missing at random (MAR). Results: Provided sufficient auxiliary information was available; MI was beneficial in terms of bias and never detrimental in terms of efficiency. Models with similar FMI values, but differing proportions of missing data, also had similar precision for effect estimates. In the absence of bias, the FMI was a better guide to the efficiency gains using MI than the proportion of missing data. Conclusion: We provide evidence that for MAR data, valid MI reduces bias even when the proportion of missingness is large. We advise researchers to use FMI to guide choice of auxiliary variables for efficiency gain in imputation analyses, and that sensitivity analyses including different imputation models may be needed if the number of complete cases is small. © 2019 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Citations

content: "@madley-dowdProportionMissingData2019" -file:@madley-dowdProportionMissingData2019

Reading notes

Annotations

(08/05/2024, 21:27:58)

“Researchers are concerned whether multiple imputation (MI) or complete case analysis should be used when a large proportion of data are missing. We aimed to provide guidance for drawing conclusions from data with a large proportion of missingness.” (Madley-Dowd et al., 2019, p. 63)

“Provided sufficient auxiliary information was available; MI was beneficial in terms of bias and never detrimental in terms of efficiency. Models with similar FMI values, but differing proportions of missing data, also had similar precision for effect estimates. In the absence of bias, the FMI was a better guide to the efficiency gains using MI than the proportion of missing data.” (Madley-Dowd et al., 2019, p. 63)

“We provide evidence that for MAR data, valid MI reduces bias even when the proportion of missingness is large. We advise researchers to use FMI to guide choice of auxiliary variables for efficiency gain in imputation analyses, and that sensitivity analyses including different imputation models may be needed if the number of complete cases is small.” (Madley-Dowd et al., 2019, p. 63)

“Unbiased results can be obtained even with large proportions of missing data (up to 90% shown in our simulation study), provided the imputation model is properly specified and data are missing at random.” (Madley-Dowd et al., 2019, p. 64)

“The fraction of missing information was better as a guide to the efficiency gains from MI than the proportion of missing data.” (Madley-Dowd et al., 2019, p. 64)

“Increasing the number of auxiliary variables included in an imputation model does not always result in efficiency gains.” (Madley-Dowd et al., 2019, p. 64)

“The proportion of missing data should not be used as a guide to inform decisions about whether to perform multiple imputation or not. The fraction of missing information should be used to guide the choice of auxiliary variables in imputation analyses.” (Madley-Dowd et al., 2019, p. 64)

“Researchers in a variety of fields often ask what proportion of missing data warrants the use of MI [12–15]. Varying guidance exists; in the literature, 5% missingness has been suggested as a lower threshold below which MI provides negligible benefit [16]. In contrast, one online tutorial has stated that 5% missing data is the maximum upper threshold for large data sets [17]. Statistical guidance articles have stated that bias is likely in analyses with more than 10% missingness and that if more than 40% data are missing in important variables then results should only be considered as hypothesis generating [18,19].” (Madley-Dowd et al., 2019, p. 64)

“A small number of studies have investigated bias and efficiency in data sets with increasing proportions of missing data. This has commonly been done with a maximum of 50% missing data in studies that showed increasing variability of effect estimates with increased missingness [20–22]; mixed results were found for bias.” (Madley-Dowd et al., 2019, p. 64)

“The proportion of missing data is a common measure of how much information has been lost because of missing values in a data set. However, it does not reflect the information retained by auxiliary variables. Alternative measures such as the fraction of missing information (FMI) may be more useful as a tool for determining potential efficiency gains from MI. The FMI is a parameter-specific measure that is able to quantify the loss of information due to missingness, while accounting for the amount of information retained by other variables within a data set [11,26].” (Madley-Dowd et al., 2019, p. 64)
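A toy simulation makes this point concrete: hold the proportion of missing outcome data fixed at 50% and vary only how strongly a fully observed auxiliary variable predicts the outcome. This is a minimal sketch under my own assumptions, not the authors' simulation design. The function name `fmi_for_mean`, the sample size, the MCAR mechanism, the bootstrap-based imputation step, and the large-m approximation to the FMI are all illustrative choices.

```python
import numpy as np

def fmi_for_mean(rho, prop_missing, n=2000, m=50, seed=0):
    """Approximate FMI for the mean of y when y is partly missing (MCAR)
    and a fully observed auxiliary x has correlation rho with y.
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    missing = rng.random(n) < prop_missing
    observed_idx = np.flatnonzero(~missing)

    estimates = np.empty(m)
    variances = np.empty(m)
    for j in range(m):
        # Bootstrap the complete cases so each imputation reflects
        # uncertainty in the regression of y on x (an approximation
        # to drawing parameters from their posterior).
        boot = rng.choice(observed_idx, size=observed_idx.size, replace=True)
        slope, intercept = np.polyfit(x[boot], y[boot], 1)
        resid = y[boot] - (intercept + slope * x[boot])
        sigma = resid.std(ddof=2)
        # Impute missing y from the fitted line plus residual noise.
        y_imp = y.copy()
        noise = sigma * rng.standard_normal(missing.sum())
        y_imp[missing] = intercept + slope * x[missing] + noise
        estimates[j] = y_imp.mean()
        variances[j] = y_imp.var(ddof=1) / n

    within = variances.mean()
    between = estimates.var(ddof=1)
    total = within + (1 + 1 / m) * between
    return (1 + 1 / m) * between / total

# Same 50% missingness, different auxiliary information.
fmi_strong = fmi_for_mean(rho=0.9, prop_missing=0.5)
fmi_weak = fmi_for_mean(rho=0.1, prop_missing=0.5)
print(f"strong auxiliary: FMI ~ {fmi_strong:.2f}")
print(f"weak auxiliary:   FMI ~ {fmi_weak:.2f}")
```

Under these assumptions the strong-auxiliary run should give a clearly smaller FMI than the weak-auxiliary run, even though exactly the same proportion of outcome values is missing in both.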

“The FMI, derived from MI theory [5,27], can be interpreted as the fraction of the total variance (including both between and within imputation variance, see Supplementary material) of a parameter, such as a regression coefficient, that is attributable to between imputation variance, for large numbers of imputations m.” (Madley-Dowd et al., 2019, p. 64)
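Written out, this reading of the FMI matches the standard Rubin's rules variance decomposition. The notation below ($\hat{\theta}_j$ for the estimate from imputed data set $j$, $U_j$ for its squared standard error, and $W$, $B$, $T$ for within, between, and total variance) is my own rather than the paper's. The exact finite-$m$ definition, which adds a degrees-of-freedom correction, is left to the Supplementary material that the quote points to.

$$
\begin{aligned}
\bar{\theta} &= \frac{1}{m}\sum_{j=1}^{m}\hat{\theta}_j, \qquad
W = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{\theta}_j-\bar{\theta}\bigr)^2, \\
T &= W + \Bigl(1+\frac{1}{m}\Bigr)B, \qquad
\mathrm{FMI} \approx \frac{B}{W+B} \quad \text{for large } m.
\end{aligned}
$$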

“A large FMI (close to 1) indicates high variability between imputed data sets; that is, the observed data in the imputation model do not provide much information about the missing values.” (Madley-Dowd et al., 2019, p. 64)

Note: a really important point about the FMI to integrate into the thesis.
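To see how between-imputation variability drives the FMI towards 1, here is a small worked example that pools made-up estimates with Rubin's rules. It is a sketch for my own notes: the function name `pool_rubin` and the numbers are hypothetical, and the FMI is computed as the share of total variance due to between-imputation variance, without the finite-m degrees-of-freedom correction.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and squared standard errors.
    Hypothetical helper; returns the pooled estimate, the variance
    components, and an approximate FMI."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    within = u.mean()                       # average sampling variance
    between = q.var(ddof=1)                 # spread across imputations
    total = within + (1 + 1 / m) * between  # Rubin's total variance
    fmi = (1 + 1 / m) * between / total     # approximate FMI
    return {"estimate": q.mean(), "within": within,
            "between": between, "total": total, "fmi": fmi}

# Made-up numbers: same within-imputation variance in both cases.
tight = pool_rubin([0.50, 0.54, 0.46, 0.52, 0.48], [0.01] * 5)
dispersed = pool_rubin([0.10, 0.90, 0.30, 0.70, 0.50], [0.01] * 5)

print(f"tight:     FMI ~ {tight['fmi']:.3f}")      # about 0.107
print(f"dispersed: FMI ~ {dispersed['fmi']:.3f}")  # about 0.923
```

Both sets of estimates pool to the same point estimate (0.5), but the dispersed set has an FMI near 1 because the imputed data sets disagree with one another, which is the situation the quote describes.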

“MI can be used to provide unbiased estimates with improved efficiency compared to CCA at any proportion of missing data and (2) the utility of the FMI as a guide to the likely efficiency gains from using MI.” (Madley-Dowd et al., 2019, p. 64)

“99.97%” (Madley-Dowd et al., 2019, p. 67)