@vandenakkerPreregistrationSecondaryData2019

Preregistration of secondary data analysis: A template and tutorial

(2019) - Olmo Van den Akker, Sara J Weston, Lorne Campbell, William J. Chopik, Rodica I. Damian, Pamela Davis-Kean, Andrew Nolan Hall, Jessica Elizabeth Kosie, Elliott Tyler Kruse, Jerome Olsen, Stuart James Ritchie, Kathrene D Valentine, Anna Elisabeth van 't Veer, Marjan Bakker

Journal::
Link:: https://osf.io/hvfmr
DOI:: 10.31234/osf.io/hvfmr
Links::
Tags:: #paper #Pre-Analysis
Cite Key:: [@vandenakkerPreregistrationSecondaryData2019]

Abstract

Preregistration has been lauded as one of the solutions to the so-called ‘crisis of confidence’ in the social sciences and has therefore gained popularity in recent years. However, the current guidelines for preregistration have been developed primarily for studies where new data will be collected. Yet, preregistering secondary data analyses, where new analyses are proposed for existing data, is just as important, given that researchers’ hypotheses and analyses may be biased by their prior knowledge of the data. The need for proper guidance in this area is especially desirable now that data is increasingly shared publicly. In this tutorial, we present a template specifically designed for the preregistration of secondary data analyses and provide comments and a worked example that may help with using the template effectively. Through this illustration, we show that completing such a template is feasible, helps limit researcher degrees of freedom, and may make researchers more deliberate in their data selection and analysis efforts.

Notes

"researchers frequently analyze the same dataset multiple times to answer different research questions. Researchers are therefore not likely to come to a dataset with completely fresh eyes, and may have insight regarding associations between at least some of the variables in the dataset." (Van den Akker et al 2019:2)

"Such prior knowledge may steer the researchers toward a hypothesis that they already know is in line with the data. This practice is called HARKing (Hypothesizing After Results Are Known; Kerr, 1998) and can lead to false positive results (Rubin, 2017)." (Van den Akker et al 2019:2)

"our template includes specific questions about defining and handling outliers, and the specification of robustness checks, both of which give leeway for data-driven decisions in secondary data analyses (Weston et al., 2019)" (Van den Akker et al 2019:3)

"Second, our template comes with elaborate comments and a worked example that we hope makes the preregistration of secondary data analysis more concrete" (Van den Akker et al 2019:3)

"Question 1: Provide the working title of your study." (Van den Akker et al 2019:3)

"Question 2: Name the authors of this preregistration." (Van den Akker et al 2019:3)

"Question 3: List each research question included in this study." (Van den Akker et al 2019:3)

"Question 4: Please provide the hypotheses of your secondary data analysis. Make sure they are specific and testable, and make it clear what your statistical framework is (e.g., Bayesian inference, NHST). In case your hypothesis is directional, do not forget to state the direction. Please also provide a rationale for each hypothesis." (Van den Akker et al 2019:3)

"Question 5: Name and describe the dataset(s), and if applicable, the subset(s) of the data you plan to use. Useful information to include here is the type of data (e.g., cross-sectional or longitudinal), the general content of the questions, and some details about the respondents. In the case of longitudinal data, information about the survey's waves is useful as well." (Van den Akker et al 2019:4)

"Question 6: Specify the extent to which the dataset is open or publicly available. Make note of any barriers to accessing the data, even if it is publicly available" (Van den Akker et al 2019:4)

"Question 7: How can the data be accessed? Provide a persistent identifier or link if the data are available online or give a description of how you obtained the dataset." (Van den Akker et al 2019:5)

"Question 8: Specify the date of download and/or access for each author." (Van den Akker et al 2019:5)

"Question 9: If the data collection procedure is well documented, provide a link to that information. If the data collection procedure is not well documented, describe, to the best of your ability, how data were collected" (Van den Akker et al 2019:5)

"Question 10: Some studies offer codebooks to describe their data. If such a codebook is publicly available, link to it here or upload the document. If not, provide other available documentation. Also provide guidance on what parts of the codebook or other documentation are most relevant" (Van den Akker et al 2019:6)

"Question 11: If you are going to use any manipulated variables, identify them here. Describe the variables and the levels or treatment arms of each" (Van den Akker et al 2019:6)

"variable (note that this is not applicable for observational studies and meta-analyses). If you are collapsing groups across variables this should be explicitly stated, including the relevant formula. If your further analysis is contingent on a manipulation check, describe your decisions rules here" (Van den Akker et al 2019:6)

"Question 12: If you are going to use measured variables, identify them here. Describe both outcome measures as well as predictors and covariates and label them accordingly. If you are using a scale or an index, state the construct the scale/index represents, which items the scale/index will consist of, how these items will be aggregated, and whether this aggregation is based on a recommendation from the study codebook or validation research. When the aggregation of the items is based on exploratory factor analysis (EFA) or confirmatory factor analysis (CFA), also specify the relevant details (EFA: rotation, how the number of factors will be determined, how best fit will be selected, CFA: how loadings will be specified, how fit will be assessed, which residuals variance terms will be correlated). If you are using any categorical variables, state how you will code them in the statistical analyses." (Van den Akker et al 2019:6)

"Question 13: Which units of analysis (respondents, cases, etc.) will be included or excluded in your study? Taking these inclusion/exclusion criteria into account, indicate the (expected) sample size of the data you'll be using for your statistical analyses to the best of your knowledge. In the next few questions, you will be asked to refine this sample size estimation based on your judgments about missing data and outliers" (Van den Akker et al 2019:8)

"Question 14: What do you know about missing data in the dataset (i.e., overall missingness rate, information about differential dropout)? How will you deal with incomplete or missing data? Based on this information, provide a new expected sample size." (Van den Akker et al 2019:9)

"Question 15: If you plan to remove outliers, how will you define what a statistical outlier is in your data? Please also provide a new expected sample size. Note that this will be the definitive expected sample size for your study and you will use this number to do any power analyses" (Van den Akker et al 2019:11)

"Question 16: Are there sampling weights available with this dataset? If so, are you using them or are you using your own sampling weights?" (Van den Akker et al 2019:11)

"Question 17: List the publications, working papers (in preparation, unpublished, preprints), and conference presentations (talks, posters) you have worked on that are based on the dataset you will use. For each work, list the variables you analyzed, but limit yourself to variables that are relevant to the proposed analysis. If the dataset is longitudinal, also state which wave of the dataset you analyzed. Importantly, some of your team members may have used this dataset, and others may not have. It is therefore important to specify the previous works for every co-author separately. Also mention relevant work on this dataset by researchers you are affiliated with as their knowledge of the data may have been spilled over to you. When the provider of the data also has an overview of all the work that has been done using the dataset, link to that overview." (Van den Akker et al 2019:11)

"Question 18: What prior knowledge do you have about the dataset that may be relevant for the proposed analysis? Your prior knowledge could stem from working with the data first-hand, from reading previously published research, or from codebooks. Also provide any relevant knowledge of subsets of the data you will not be using. Provide prior knowledge for every author separately" (Van den Akker et al 2019:12)

"Question 19: For each hypothesis, describe the statistical model you will use to test the hypothesis. Include the type of model (e.g., ANOVA, multiple regression, SEM) and the specification of the model. Specify any interactions and post-hoc analyses and remember that any test not included here must be labeled as an exploratory test in the final paper." (Van den Akker et al 2019:13)

"Question 20: If applicable, specify a predicted effect size or a minimum effect size of interest for all the effects tested in your statistical analyses." (Van den Akker et al 2019:13)

"Question 21: Present the statistical power available to detect the predicted effect size(s) or the smallest effect size(s) of interest OR present the accuracy that will be obtained for estimation. Use the sample size after updating for missing data and outliers, and justify the assumptions and parameters used (e.g., give an explanation of why anything smaller than the smallest effect size of interest would be theoretically or practically unimportant" (Van den Akker et al 2019:14)

"Question 22: What criteria will you use to make inferences? Describe the information you will use (e.g., specify the p-values, effect sizes, confidence intervals, Bayes factors, specific model fit indices), as well as cut-off criteria, where appropriate. Will you be using oneor two-tailed tests for each of your analyses? If you are comparing multiple conditions or testing multiple hypotheses, will you account for this, and if so, how?" (Van den Akker et al 2019:14)

"Question 23: What will you do should your data violate assumptions, your model not converge, or some other analytic problem arises?" (Van den Akker et al 2019:15)

"Question 24: Provide a series of decisions about evaluating the strength, reliability, or robustness of your focal hypothesis test. This may include within-study replication attempts, additional covariates, cross-validation efforts (out-of-sample replication, split/hold-out sample), applying weights, selectively applying constraints in an SEM context (e.g., comparing model fit statistics), overfitting adjustment techniques used (e.g., regularization approaches such as ridge regression), or some other" (Van den Akker et al 2019:15)

"simulation/sampling/bootstrapping method" (Van den Akker et al 2019:15)

"stion 25: If you plan to explore your dataset to look for unexpected differences or relationships, describe those tests here, or add them to the final paper under a heading that clearly differentiates this exploratory part of your study from the confirmatory part." (Van den Akker et al 2019:15)

"Part 6: Statement of integrity" (Van den Akker et al 2019:15)