Toward Best Practices in Analyzing Datasets with Missing Data: Comparisons and Recommendations
Key takeaways
A bit basic, but it covers some introductory concepts that may be worth revisiting.
Bibliography: Johnson, D.R., Young, R., 2011. Toward Best Practices in Analyzing Datasets with Missing Data: Comparisons and Recommendations. Journal of Marriage and Family 73, 926–945. https://doi.org/10.1111/j.1741-3737.2011.00861.x
Authors:: David R. Johnson, Rebekah Young
Tags: #missing-data, #multiple-imputation, #maximum-likelihood, #methods, #National-Survey-of-Families-and-Households, #regression
Collections:: Methods
First-page:
Although several methods have been developed to allow for the analysis of data in the presence of missing values, no clear guide exists to help family researchers in choosing among the many options and procedures available. We delineate these options and examine the sensitivity of the findings in a regression model estimated in three random samples from the National Survey of Families and Households (n = 250–2,000). These results, combined with findings from simulation studies, are used to guide answers to a set of 10 common questions asked by researchers when selecting a missing data approach. Modern missing data techniques were found to perform better than traditional ones, but differences between the types of modern approaches had minor effects on the estimates and substantive conclusions. Our findings suggest that the researcher has considerable flexibility in selecting among modern options for handling missing data.
content: "@johnsonBestPracticesAnalyzing2011" -file:@johnsonBestPracticesAnalyzing2011
Reading notes
Annotations
(08/05/2024, 21:17:59)
“There appears to be an emerging consensus in recent literature that the application of MI and FIML methods are superior to other approaches when analyzing datasets with missing values (Acock, 2005; Howell, 2008; Schafer & Graham, 2002).” (Johnson and Young, 2011, p. 928)
“When choosing between these missing data strategies, a researcher must make many decisions according to the specific research situation. For example, researchers must select software, which variables to include in the model, the number of imputed datasets, whether to tailor the model to the measurement level of the variables (or to use the fully normal model), and whether to impute the dependent variable.” (Johnson and Young, 2011, p. 928)
“Close examination of the two modern approaches found them to be remarkably consistent in the values of b-coefficients, the magnitude of standard errors, and the level of significance obtained. For researchers worried that MI is tantamount to ‘making up data,’ the nearly identical results produced by MI and FIML, which does not impute values, should alleviate that concern.” (Johnson and Young, 2011, p. 932)
“FIML deals with missing data and parameter estimation in one step, eliminating the need to create imputed values.” (Johnson and Young, 2011, p. 932)
“To date, the MI approach has the advantages of being more flexible and applicable to a wider variety of models.” (Johnson and Young, 2011, p. 933)
“we conclude that the FIML approach or any of the multiple imputation software approaches tested here would yield similar substantive conclusions, at least in data analyses with the degree of missing data found in many of the large, national family surveys.” (Johnson and Young, 2011, p. 935)
“Our results suggest that using more than 10 imputed datasets can improve the stability of estimates, but the researcher is unlikely to make errors in the substantive interpretation of the findings even if as few as 5 are used with a large sample size.” (Johnson and Young, 2011, p. 936)
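Side note on why the number of imputed datasets matters for stability: a minimal sketch of pooling one coefficient across m imputed datasets with Rubin's rules. The paper does not give code; the function name `pool_rubin` and the toy numbers are my own assumptions. The between-imputation variance is inflated by (1 + 1/m), so the penalty for a small m shrinks quickly as m grows.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets (Rubin's rules).

    estimates: the coefficient estimated in each imputed dataset
    variances: its squared standard error in each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


if __name__ == "__main__":
    # Hypothetical coefficients and squared SEs from three imputed datasets.
    b, se = pool_rubin([0.5, 0.6, 0.4], [0.01, 0.01, 0.01])
    print(f"pooled b = {b:.3f}, pooled SE = {se:.3f}")
```

With m = 5 the inflation factor is 1.2 and with m = 10 it is 1.1, which fits the quote: more imputations tighten the estimate somewhat, but the substantive reading rarely changes.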
“The general consensus is that the missing data model should be at least as complete as the analysis model (Acock, 2005; Collins et al., 2001; Graham, 2003). When a variable in the analysis model is not used to inform the missing data estimates, the imputed values for that variable are uncorrelated with other variables in the model and the covariances are underestimated.” (Johnson and Young, 2011, p. 936)
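A small simulation to make the point above concrete, under my own assumptions (not from the paper): two standard-normal-ish variables with a true correlation of about 0.6, 40% of y missing completely at random, and a single stochastic imputation rather than full MI. Imputing y without using x pulls the x-y correlation toward zero; imputing y from a regression on x plus a resampled residual roughly preserves it. All names (`simulate`, `corr`) are hypothetical.

```python
import random
from statistics import mean


def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def simulate(n=2000, miss_rate=0.4, seed=0):
    """Return corr(x, y) for complete data, x-ignoring and x-informed imputation."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.6 * xi + rng.gauss(0, 0.8) for xi in x]
    missing = [rng.random() < miss_rate for _ in range(n)]

    obs_x = [xi for xi, m in zip(x, missing) if not m]
    obs_y = [yi for yi, m in zip(y, missing) if not m]

    # Regression of y on x, fitted on the observed cases only.
    mx, my = mean(obs_x), mean(obs_y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(obs_x, obs_y))
             / sum((a - mx) ** 2 for a in obs_x))
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(obs_x, obs_y)]

    # Imputation model that ignores x: draw from the observed y values.
    y_ignore = [rng.choice(obs_y) if m else yi for yi, m in zip(y, missing)]
    # Imputation model that uses x: prediction plus a resampled residual.
    y_inform = [intercept + slope * xi + rng.choice(resid) if m else yi
                for xi, yi, m in zip(x, y, missing)]

    return corr(x, y), corr(x, y_ignore), corr(x, y_inform)


if __name__ == "__main__":
    full, ignored, informed = simulate()
    print(f"complete data: {full:.2f}")
    print(f"imputed without x: {ignored:.2f}")
    print(f"imputed with x: {informed:.2f}")
```

This is only an illustration of the attenuation the authors describe; a real analysis would use an established MI or FIML routine rather than this hand-rolled imputation.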
“Although the literature suggests the importance of including auxiliary variables (Collins et al., 2001), especially those that are highly correlated with variables in the model, our analysis with a dataset and variables commonly used by family researchers found little difference in the substantive conclusions that would be drawn with or without taking auxiliary variables into account.” (Johnson and Young, 2011, p. 938)
“FIML and MI methods perform well even when the proportion missing is substantially higher than in our example. Many simulation studies test missing data approaches with 50% or more missing values on variables in the model (e.g., Allison, 2001; Collins et al., 2001).” (Johnson and Young, 2011, p. 941)