Handbook of survey research

Key takeaways

This chapter includes a good discussion of 'reverse imputation', where data at wave 5 can and should be used to impute data from waves <4. I don't like this, but it has a strong methodological argument behind it, and until I find something saying otherwise I have to accept it.

(file:///C:\Users\scott\Downloads\Allison_MissingData_Handbook.pdf)

Bibliography: Marsden, P.V., Wright, J.D. (Eds.), 2010. Handbook of Survey Research, 2nd ed. Emerald Group Publishing, Bingley.

Authors:: Peter V. Marsden, James D. Wright

Collections:: Methods

First-page:


Reading notes

Annotations

(08/05/2024, 21:34:11)

“Three broad classes of missing data methods have good statistical properties: maximum likelihood (ML), multiple imputation (MI), and inverse probability weighting. ML and MI can handle a wide array of applications, and many commercial software packages implement some version of these methods. As of this writing, inverse probability weighting is much more limited in its applications, and easy-to-use software is not readily available.” (“Handbook of survey research”, 2010, p. 631)

“Suppose that only one variable Y has missing data, and that another set of variables, represented by the vector X, is always observed. The data are missing completely at random (MCAR) if the probability that Y is missing does not depend on X or on Y itself (Rubin, 1976).” (“Handbook of survey research”, 2010, p. 635)

“A natural question to ask at this point is, what variables can be or should be in the X vector? The answer is quite simple. The only variables that must be in X are those that are part of the model to be estimated. Suppose, for example, that we seek only to estimate the mean income for some population, and 20% of the cases are missing data on income. In that case, we need not consider any X variables for the MCAR condition.” (“Handbook of survey research”, 2010, p. 635)

“How can we test the MCAR assumption? Testing for whether missingness on Y depends on some observed variable X is easy. For example, we can test whether missingness on income depends on gender by testing whether the proportions of men and women who report their income differ. More generally, we could run a logistic regression in which the dependent variable is the response indicator R and the independent variables are all X variables in the model to be estimated. Significant coefficients would suggest a violation of MCAR.” (“Handbook of survey research”, 2010, p. 635)
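Allison's simple version of this check (comparing reporting rates of income by gender) can be sketched as a two-proportion z-test. A minimal sketch with simulated data and made-up variable names; the full version of the check would regress the response indicator R on all the X variables with logistic regression, as the quote describes.

```python
import math
import random

random.seed(0)

# Simulated survey: gender (0/1) fully observed; income sometimes missing.
# Missingness here is independent of everything, so MCAR holds by construction.
n = 1000
gender = [random.randint(0, 1) for _ in range(n)]
reported = [random.random() > 0.2 for _ in range(n)]  # R: did they report income?

# Proportion reporting income among each gender group
def report_rate(g):
    obs = [r for r, sex in zip(reported, gender) if sex == g]
    return sum(obs) / len(obs)

p0, p1 = report_rate(0), report_rate(1)
n0, n1 = gender.count(0), gender.count(1)

# Two-proportion z-test: under MCAR the two reporting rates should not differ
p_pool = sum(reported) / n
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n0 + 1 / n1))
z = (p0 - p1) / se
print(f"report rate g=0: {p0:.3f}, g=1: {p1:.3f}, z = {z:.2f}")
```

A large |z| (roughly > 2) would suggest that missingness on income depends on gender, contradicting MCAR. As Allison notes, this only tests the observable half of the assumption.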

“On the other hand, it is not so easy to test the other part of MCAR, that missingness on Y does not depend on Y itself. For example, the only way to test whether people with high incomes are less likely to report their incomes is to find some other measure of income (e.g., tax records) that has no missing data. But this is rarely possible.” (“Handbook of survey research”, 2010, p. 635)

“The MCAR assumption is very strong, and is unlikely to be completely satisfied unless data are missing by design (Graham, Hofer, & MacKinnon, 1996).” (“Handbook of survey research”, 2010, p. 635)

“A considerably weaker (but still strong) assumption is that data are missing at random (MAR). Again, this is most easily defined in the case where only a single variable Y has missing data, and another set of variables X has no missing data. We say that data on Y are missing at random if the probability that Y is missing does not depend on Y, once we control for X.” (“Handbook of survey research”, 2010, p. 636)

“As with MCAR, the only variables that must go into X are the variables in the model to be estimated. But under MAR, there can be substantial gains from including other variables as well. Suppose, for example, that we believe that people with high income are less likely to report their income. That would violate both MCAR and MAR. However, by adjusting for other variables that are correlated with income — for example, education, occupation, gender, age, mean income in zipcode — we may be able to greatly reduce the dependence of missingness of income on income itself.” (“Handbook of survey research”, 2010, p. 636)

“We say that data are not missing at random (NMAR) if the MAR assumption is violated, that is, if the probability that Y is missing depends on Y itself, after adjusting for X. There are often strong reasons for suspecting that the data are” (“Handbook of survey research”, 2010, p. 636)

“NMAR, for example, people who have been arrested may be less likely to report their arrest status.” (“Handbook of survey research”, 2010, p. 637)

“Another popular approach to handling missing data on predictors in regression analysis is dummy variable adjustment (Cohen & Cohen, 1985). Its mechanics are simple and intuitive” (“Handbook of survey research”, 2010, p. 638)

“The appeal of this method is that it deletes no cases, and incorporates all available information into the regression model. But Jones (1996) proved that dummy variable adjustment yields biased parameter estimates even when the data are MCAR, which pretty much rules it out. Jones also demonstrated that a related method for nominal predictors produces biased estimates. That method treats missing cases for a categorical variable simply as another category, creating an additional dummy variable for that category.” (“Handbook of survey research”, 2010, p. 639)

“It is well known that mean substitution produces biased estimates for most parameters, even under MCAR (Haitovsky, 1968).” (“Handbook of survey research”, 2010, p. 639)

“The three basic steps to multiple imputation are as follows: 1. Introduce random variation into the imputation process, and generate several data sets, each with slightly different imputed values. 2. Perform an analysis on each of the data sets. 3. Combine the results into a single set of parameter estimates, standard errors, and test statistics.” (“Handbook of survey research”, 2010, p. 640)
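The three steps can be sketched with stochastic regression imputation on toy data. Everything here (variable names, the choice of m = 5, using the mean of Y as the "analysis") is illustrative, not from the chapter; real analyses would use a dedicated MI package.

```python
import random
import statistics

random.seed(1)
n, m = 200, 5  # sample size and number of imputed data sets

# Fully observed X; Y missing for roughly 25% of cases
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 + 3 * xi + random.gauss(0, 1) for xi in x]
y_obs = [yi if random.random() > 0.25 else None for yi in y]

# Regression of Y on X among complete cases (simple OLS by hand)
pairs = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
mx = statistics.mean(p[0] for p in pairs)
my = statistics.mean(p[1] for p in pairs)
b = sum((xi - mx) * (yi - my) for xi, yi in pairs) / \
    sum((xi - mx) ** 2 for xi, _ in pairs)
a = my - b * mx
resid_sd = statistics.pstdev(yi - (a + b * xi) for xi, yi in pairs)

estimates = []
for _ in range(m):
    # Step 1: impute with random residual draws, so each data set differs slightly
    y_imp = [yi if yi is not None else a + b * xi + random.gauss(0, resid_sd)
             for xi, yi in zip(x, y_obs)]
    # Step 2: run the analysis on each completed data set (here: the mean of Y)
    estimates.append(statistics.mean(y_imp))

# Step 3: combine into a single point estimate by averaging
print("pooled mean of Y:", statistics.mean(estimates))
```

(This sketch omits the "proper" MI refinement of also drawing the regression parameters from their posterior before imputing, which the full procedure requires for valid standard errors.)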

“When I do MI, I like every df to be at least 100. At that point, the t distribution approaches the normal distribution, and little is to be gained from additional data sets. For our divorce example, the lowest df was 179, suggesting no need for additional data sets.” (“Handbook of survey research”, 2010, p. 645)

“A general principle of MI is that any population quantity can be estimated by simply averaging its estimates over the repeated data sets. Besides regression coefficients, this includes summary statistics like R2 and root mean squared error, although MI software often does not report these. It is never correct to average test statistics, like t, F, or chi-square statistics, however. Special methods are required to combine such statistics across multiple data sets.” (“Handbook of survey research”, 2010, p. 645)
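The standard combining rules behind step 3 (Rubin, 1987) can be written out directly: average the estimates, and combine within- and between-imputation variance for the standard error. The numbers below are made up for illustration; the df formula at the end is the quantity Allison says should ideally exceed 100.

```python
import statistics

est = [1.92, 2.10, 2.01, 1.88, 2.05]   # estimate from each imputed data set
se  = [0.30, 0.28, 0.31, 0.29, 0.30]   # its standard error in that data set

m = len(est)
qbar = statistics.mean(est)                  # pooled point estimate
w = statistics.mean(s ** 2 for s in se)      # within-imputation variance
b = statistics.variance(est)                 # between-imputation variance
t = w + (1 + 1 / m) * b                      # total variance (Rubin's rule)
pooled_se = t ** 0.5

# Degrees of freedom for the pooled t statistic (Rubin, 1987)
df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
print(f"pooled estimate {qbar:.3f}, SE {pooled_se:.3f}, df {df:.1f}")
```

Note that the pooling happens on the estimates and variances, never on the t statistics themselves, consistent with the warning in the quote above.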

“For MI to perform optimally, the model used to impute the data must be ‘‘congenial’’ in some sense with the model intended for analysis (Rubin, 1987; Meng, 1994). The models need not be identical, but the imputation model must generate imputations that reproduce the major features of the data that are the focus of the analysis. That is the main reason I recommend that the imputation model include all variables in the model of interest” (“Handbook of survey research”, 2010, p. 646)

“Nevertheless, I prefer ML whenever it can be implemented for several reasons. First, ML produces a deterministic result while MI gives a different result every time it is used, because of its random draws from posterior distributions. Second, MI is often vulnerable to bias introduced by lack of congeniality between the imputation model and the analysis model. No such conflict is possible with ML because its estimates are based on a single, comprehensive model. Third, ML is generally a much ‘‘cleaner’’ method, requiring many fewer decisions about implementation.” (“Handbook of survey research”, 2010, p. 648)

“One popular method for maximizing the likelihood when data are missing is the EM algorithm (Dempster, Laird, & Rubin, 1977). This iterative algorithm consists of two steps: 1. In the E (expectation) step, one finds the expected value of the log-likelihood, where the expectation is taken over the variables with missing data, based on the current values of the parameters. 2. In the M (maximization) step, the expected log-likelihood is maximized to produce new values of the parameters.” (“Handbook of survey research”, 2010, p. 649)
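A toy version of the two steps, for a linear regression of Y on X with some Y values missing at random given X (all names and the data-generating setup are illustrative, not from the chapter):

```python
import random
import statistics

random.seed(2)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
# Y missing with probability 0.3 when X > 0 (MAR, not MCAR)
y_obs = [yi if random.random() > 0.3 * (xi > 0) else None
         for xi, yi in zip(x, y)]

mx = statistics.mean(x)
a, b, s2 = 0.0, 0.0, 1.0  # starting values: intercept, slope, residual variance

for _ in range(50):
    # E step: replace each missing Y with its expected value a + b*x
    # under the current parameter values
    ey = [yi if yi is not None else a + b * xi for xi, yi in zip(x, y_obs)]
    # M step: re-estimate the regression from the completed expectations
    mey = statistics.mean(ey)
    b = sum((xi - mx) * (e - mey) for xi, e in zip(x, ey)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = mey - b * mx
    # Residual variance: observed cases contribute squared residuals;
    # missing cases contribute their conditional variance s2
    n_miss = sum(yi is None for yi in y_obs)
    ss_obs = sum((yi - a - b * xi) ** 2
                 for xi, yi in zip(x, y_obs) if yi is not None)
    s2 = (ss_obs + n_miss * s2) / n

print(f"EM estimates: intercept {a:.2f}, slope {b:.2f} (true values 1 and 2)")
```

Iterating the two steps climbs the observed-data likelihood until the parameter values stop changing.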

“A better method is direct ML, also known as raw ML (because it requires raw data rather than a covariance matrix) or full-information ML (Arbuckle, 1996; Allison, 2003). This method directly specifies the likelihood for the model to be estimated, and then maximizes it by conventional numerical methods (like Newton–Raphson) that produce standard errors as a by-product.” (“Handbook of survey research”, 2010, p. 650)

“Longitudinal studies are particularly prone to missing data problems because it is difficult to follow individuals over substantial periods of time (see Stafford, this volume). Some people stop participating, some cannot be located, others may be away at the time of re-contact. Either MI or ML can handle missing data in longitudinal studies quite well. These methods usually can be implemented for longitudinal data in a fairly straightforward manner.” (“Handbook of survey research”, 2010, p. 652)

“Whatever method is used, it is important to use all available information over time in order to minimize bias and standard errors. For example, suppose that one wishes to estimate a random-effects regression model using panel data for 1000 people and five time points per person, but some predictor variables have missing data. Most random-effects software requires a separate observational record for each person and point in time (the so-called ‘‘long’’ form), with a common ID number for all observations for the same person. One should not do multiple imputation on those 5000 records. That would impute missing values using only information obtained at the same point in time.” (“Handbook of survey research”, 2010, p. 652)

“A much better method is to restructure the data so that there is one record per person (the ‘‘wide’’ form): a variable like income measured at five points in time would be represented by five different variables. Then multiple imputation with a variable list including all the variables in the model at all five time points would impute values for any variable with missing data using all the other variables at all time points, including (especially) the same variable measured at other time points.” (“Handbook of survey research”, 2010, p. 652)
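The long-to-wide restructuring Allison describes is a one-liner in pandas (column and variable names below are illustrative):

```python
import pandas as pd

# "Long" form: one record per person per wave, as random-effects software wants it
long_df = pd.DataFrame({
    "id":      [1, 1, 1, 2, 2, 2],
    "wave":    [1, 2, 3, 1, 2, 3],
    "income":  [50, 52, None, 40, None, 44],
    "depress": [3, 4, None, 6, None, 5],
})

# "Wide" form: one record per person; each variable at each wave
# becomes its own column, so imputation can use all time points
wide = long_df.pivot(index="id", columns="wave", values=["income", "depress"])
wide.columns = [f"{var}_w{w}" for var, w in wide.columns]
print(wide)
```

After running multiple imputation on the wide file, one would reshape back to long form (e.g. with `pd.wide_to_long` or `melt`) before fitting the random-effects model.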

“Making imputations using data from all time points can substantially reduce the standard errors of the parameter estimates, and also reduce bias. If, for example, the dependent variable is a measure of depression, and people who are depressed at time 1 are more likely to drop out at time 2, then imputing the time 2 depression score using depression at time 1 can be very helpful in correcting for possible selection bias. Using data at later time points to impute missing data at earlier ones may seem unsettling, since it seems to violate conventional notions of causal direction. But imputation has nothing to do with causality. It merely seeks to generate imputed values that are consistent with all the observed relationships among the variables.” (“Handbook of survey research”, 2010, p. 652) I'm not entirely convinced by this argument but don't have an adequate response to Allison just yet.

“Some may also be troubled by the fact that this method generates imputed values for all variables at all time points, even if a person dropped out after the first interview. Is that too much imputation? If a person died after the first interview, a reasonable case could be made for excluding records for times after death. But if someone simply dropped out of the study, imputing all subsequent missing data is better because selection bias may be substantially reduced. Remember that both MI and ML account completely for the fact that some data are imputed when calculating standard errors, so the imputation of later waves does not artificially inflate the sample size.” (“Handbook of survey research”, 2010, p. 652)