
The goal of data cleaning is to clean individual data points and to make the dataset easily usable and understandable for the research team and external users. The quality of the analysis will never be better than the quality of data cleaning. There is no such thing as an exhaustive list of what to do during data cleaning: each project will have individual cleaning needs. This article provides a very good place to start. See this data cleaning checklist to ensure that common cleaning actions have been completed.

The data cleaning process seeks to fulfill two goals: (1) to ensure valid analysis by cleaning individual data points that bias the analysis, and (2) to make the dataset easily usable and understandable for researchers both within and outside of the research team. A really good data cleaning process should also result in documented insights about the data and data collection to inform future data collection, either for a different round of the same project or for other future projects.

RCT analysis typically relies on regressions to test for statistical differences between the means of the control and treatment groups. In essence, one can think of regression analysis as an advanced comparison of means. While this is, of course, an extreme simplification, it may provide a useful framework and perspective to an RA cleaning a dataset for the first time. While it may be difficult to have an intuition for the math behind a regression, it is easy to have an intuition for the math behind a mean. Anything that biases a mean will bias a regression: outliers, missing values, typos, erroneous survey codes, illogical values, duplicates, etc. While many more things can also bias a regression, this conceptualization provides a good starting place for anyone cleaning a dataset for the first time. The researcher leading the analysis is trained in the other granular details and knowledge necessary for the specific regression models.

Making the Dataset Usable and Understandable

The second goal of data cleaning is to code and document the dataset to make it as self-explanatory as possible. At the time of the data collection and data cleaning, you know the dataset much better than you will at any time in the future. Carefully documenting this knowledge often makes the difference between a good analysis and a great analysis. A usable and understandable dataset will not only help you and your research team in the future, but also other researchers who use the dataset down the road.

Research Assistants (RAs) and Field Coordinators (FCs) should prioritize their time on identifying and documenting irregularities in the data rather than correcting them. It is never bad to suggest corrections to irregularities. However, many RAs or FCs spend too much time trying to fix irregularities and, in turn, do not have enough time to identify and document them completely. This is often inefficient, as different regression models and/or Principal Investigator (PI) preferences may require different corrections. In such cases, time-consuming corrections may not be valid given the regression model used in the analysis. Eventually, the PI and the RA or FC will have a common understanding of which correction decisions can be made without involving the PI.
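
The advice to identify and document irregularities rather than correct them, and the point that anything that biases a mean will bias a regression, can be illustrated with a minimal sketch. Everything in it is hypothetical and not taken from any real project: the household IDs, the income values, the `-999` "don't know" survey code, and the `flag_irregularities` helper. It uses Python for illustration only; the same logic can be written in whatever statistical software the team uses.

```python
from statistics import mean

# Hypothetical survey code for "don't know"; real codes depend on the questionnaire.
DONT_KNOW = -999

# Hypothetical raw records: (household id, reported monthly income).
records = [
    ("hh01", 120.0),
    ("hh02", 95.0),
    ("hh03", DONT_KNOW),  # survey code left in a numeric variable
    ("hh04", 110.0),
    ("hh04", 110.0),      # duplicate household id
    ("hh05", None),       # missing value
    ("hh06", 105.0),
]


def flag_irregularities(rows):
    """Return (row_index, hh_id, issue) tuples; the raw rows are never modified."""
    issues = []
    seen_ids = set()
    for i, (hh_id, income) in enumerate(rows):
        if hh_id in seen_ids:
            issues.append((i, hh_id, "duplicate id"))
        seen_ids.add(hh_id)
        if income is None:
            issues.append((i, hh_id, "missing value"))
        elif income == DONT_KNOW:
            issues.append((i, hh_id, "survey code in numeric variable"))
    return issues


issues = flag_irregularities(records)
flagged_rows = {i for i, _, _ in issues}

# Mean computed on the raw values, with the survey code and duplicate left in.
naive_mean = mean(v for _, v in records if v is not None)

# Mean computed on unflagged rows only, to show how sensitive the mean is.
unflagged_mean = mean(
    v for i, (_, v) in enumerate(records) if i not in flagged_rows
)

for row_index, hh_id, issue in issues:
    print(f"row {row_index} ({hh_id}): {issue}")
print(f"mean with irregularities: {naive_mean}")   # -76.5
print(f"mean of unflagged rows:   {unflagged_mean}")  # 107.5
```

The sketch only produces a list of documented issues and a sensitivity comparison. Dropping the flagged rows is not presented here as the right correction: how each irregularity should be treated remains a decision for the researcher leading the analysis.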
