Data cleaning is arguably the most critical step in running statistical analyses. A common rule of thumb is to spend 80% of your time cleaning data and the remaining 20% analyzing it. Careful cleaning matters because a single error can affect the results of your analyses. At Magnolia Consulting, driven by our values of integrity, excellence, and utilization of results, we have developed processes to ensure that we provide our clients with valid and reliable findings.
Based on our experience, here are some key tips for effective and consistent data cleaning:
1. Create a checklist. We recommend creating a data cleaning checklist for two reasons. First, following a checklist helps ensure that you take every necessary step in the process. Without one, it is easy to skip a step or overlook an error. Second, different people may approach data cleaning differently. A checklist streamlines the process and keeps it consistent across team members. At Magnolia Consulting, having a checklist has helped us align our data cleaning approaches, making it easier to address issues and check each other’s work.
2. Check your data early. Give yourself plenty of time to explore the data and identify questions. By checking the data early, you improve your chances of obtaining any missing data or clarifying inconsistencies. Sometimes you cannot avoid receiving data late, but checking it as soon as it arrives gives you as much time as possible to identify potential issues rather than letting crucial ones go unnoticed.
3. Take your time. Take time to fully understand the context of your data. This includes knowing what to expect before you receive the data. For example, will you be looking at student assessment data, student demographic data, pre- and post-test data, or something else? Understanding the context will make it easier for you to understand and identify any inconsistencies, such as duplicate cases.
4. Consult with others. While cleaning data, you are often forced into “playing detective.” Before making a judgment call, gather everything you know about the issue so that you can have a fruitful conversation with your team. These conversations also help acquaint other team members with the data should they be involved in the analysis or reporting phases.
5. Keep a thorough record. Data cleaning can involve a significant number of changes and decisions, making it difficult to remember everything you did. Even the smallest action might matter later, when you need to answer questions or revisit previous decisions. Ultimately, keeping a detailed record will save you time and spare you frustration. It will also allow you to replicate your data cleaning process in the future. Versioning is also useful: saving every version of your database makes the process more efficient. If you make a mistake, you can return to a previous version instead of starting over.
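For teams that clean data with a script, the checklist and the cleaning record described above could be combined along the following lines. This is only a minimal sketch, not a description of Magnolia Consulting's actual process: the sample records, the `student_id` field, and the function names are all hypothetical, chosen purely for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical records; "student_id" is an assumed key field.
records = [
    {"student_id": "S01", "pre": 12, "post": 15},
    {"student_id": "S02", "pre": 9, "post": None},
    {"student_id": "S01", "pre": 12, "post": 15},
]

def find_duplicate_ids(rows, key="student_id"):
    """Return the key values that appear more than once, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def find_missing(rows):
    """Return (row_index, field) pairs where the value is None."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field, value in row.items()
        if value is None
    ]

# The checklist: each step is a name plus the check that implements it.
CHECKLIST = [
    ("duplicate IDs", find_duplicate_ids),
    ("missing values", find_missing),
]

def run_checklist(rows):
    """Run every checklist step in order and return a dated log of findings."""
    log = []
    for name, check in CHECKLIST:
        log.append({
            "date": date.today().isoformat(),
            "step": name,
            "findings": check(rows),
        })
    return log

cleaning_log = run_checklist(records)
for entry in cleaning_log:
    print(entry["step"], "->", entry["findings"])
```

Because every step lives in one list, no check is skipped by accident, and the returned log doubles as a record of what was examined and when. The checks only report issues; deciding how to resolve them still calls for the judgment and team conversations described in the tips.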