According to a 2016 CrowdFlower survey, data scientists spend 80% of their time cleaning data. It may not be glamorous work, but cleaning data properly is critical: even small errors can produce unreliable results that distort the decisions made from them. In this post, I will offer five checks that any data scientist or analyst can use to catch mistakes and to question the assumptions that creep in during data cleaning.
1. Perform a Uniqueness Check
A uniqueness check is useful for understanding the structure of your data and ensuring that it contains no duplicate observations. For example, if customers purchase multiple products from a store every day, it is reasonable to assume that the daily data will be unique at the customer-product level. However, if one of the products sold by the store carries the same label but comes in different colors, the data will actually be unique at the customer-product-color level, not at the customer-product level we assumed. Checking for uniqueness before merging data is also useful, because merging on a non-unique key can silently create duplicate rows.
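As a sketch of what this check might look like in pandas, using made-up sales data where color turns out to be part of the true key:

```python
import pandas as pd

# Hypothetical daily sales data: customer A buys the "same" shirt in two
# colors, so customer-product alone is not a unique key.
sales = pd.DataFrame({
    "customer": ["A", "A", "B"],
    "product":  ["shirt", "shirt", "shirt"],
    "color":    ["red", "blue", "red"],
})

# Check uniqueness at the assumed level: customer-product.
dupes_cp = sales.duplicated(subset=["customer", "product"]).any()

# Check uniqueness at the finer level: customer-product-color.
dupes_cpc = sales.duplicated(subset=["customer", "product", "color"]).any()

print(dupes_cp, dupes_cpc)  # True False -> the true key includes color
```

If the assumed key shows duplicates while a finer key does not, the finer key is the real level of uniqueness, and it is the one to merge on.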
2. Detect and Properly Handle Outliers
This seems obvious, but while correcting a couple of analyses at work I noticed that results change dramatically when outliers are removed. I won't go into the specifics, but outliers matter even more in trend analysis and prediction. Depending upon the nature of the analysis, outliers will need to be treated differently; either way, it is a useful check to keep in mind while reading results. Simple summary statistics of key variables are often enough to reveal them.
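A minimal illustration in pandas, using hypothetical revenue figures and the common 1.5 × IQR rule of thumb as one possible way to flag outliers:

```python
import pandas as pd

# Hypothetical daily revenue figures with one obvious outlier.
revenue = pd.Series([120, 130, 125, 118, 122, 5000])

# Summary statistics already hint at the problem: the max dwarfs the median.
print(revenue.describe())

# Rule of thumb: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = revenue.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = revenue[(revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [5000]
```

Whether you drop, cap, or keep the flagged points depends on the analysis; the check itself just makes them visible.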
3. Identify Time Series Variation
Whether or not I am predicting time series data, one of the first things I look for when dealing with any kind of temporal data is how the outcome variable varies across time periods. For time series prediction, this is essentially the seasonality that must be incorporated into the forecasts, and having a sense of it from the beginning is always advantageous. Even for descriptive and insight-driven analytics, however, seasonality can determine how you need to set up the data to get accurate results.
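One quick way to eyeball this in pandas, shown here on an invented monthly series with a December spike:

```python
import pandas as pd

# Hypothetical two years of monthly sales; the outcome peaks every December.
idx = pd.date_range("2015-01-01", periods=24, freq="MS")
sales = pd.Series([100] * 24, index=idx)
sales[idx.month == 12] = 300  # December spike

# Average the outcome by calendar month to surface seasonality.
monthly_means = sales.groupby(sales.index.month).mean()
print(monthly_means)
```

A plot of these period averages makes the seasonal pattern obvious before any modeling begins.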
4. Use Descriptive Checks on Categorical Variables
Descriptive checks are particularly useful for deciding how many categories to keep in the analysis, or whether it makes sense to combine some categories into one. These decisions are mainly guided by the sample counts for each category, which are also a good guide for choosing a reference category in regression analyses. Most machine learning algorithms cope reasonably well with uneven sample sizes across the levels of a categorical variable. However, depending upon the data, pre-processing categorical variables can sometimes help improve results.
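A small sketch in pandas of counting categories and lumping sparse levels together; the data and the threshold of two observations are both made up for illustration:

```python
import pandas as pd

# Hypothetical categorical variable from survey data.
channel = pd.Series(["web", "web", "web", "store", "store", "phone", "fax"])

# Sample counts per category guide whether to combine sparse levels.
counts = channel.value_counts()
print(counts)

# Lump categories with fewer than 2 observations into "other".
rare = counts[counts < 2].index
cleaned = channel.where(~channel.isin(rare), "other")
print(cleaned.value_counts())
```

The largest category revealed by the counts is also a natural candidate for the reference level in a regression.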
5. Check Correlation
Correlation checks, in my opinion, are one of the most underrated statistical tools in data science. One reason is that non-parametric models are not affected by correlation between the variables used in predictions: non-parametric statistics are either free of assumptions about the distribution of parameters or have distributions without specified parameters. Despite that, I like to keep the correlation results handy before implementing any model so that I have a more intuitive sense of the data and the prediction results. I also find it good practice to run the same check during non-predictive analyses, where highly correlated variables can greatly affect which features you model.
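For example, a quick pairwise correlation check in pandas, on a toy frame where x2 is an exact multiple of x1 and therefore redundant:

```python
import pandas as pd

# Hypothetical feature matrix: x2 is a linear function of x1, so the two
# carry the same information for a model.
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [5, 3, 8, 1, 7],
})

# Pairwise Pearson correlations; values near +/-1 flag redundant features.
corr = df.corr()
print(corr.round(2))
```

Glancing at this matrix before modeling takes seconds and makes surprises in coefficient estimates or feature importances much easier to interpret.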
Whether you use R, Python, or any other language, chances are there are multiple ways to perform these checks. When working with big data, however, it is worth benchmarking the different methods on a smaller subset of the data first. This will help you reduce the overall time your code takes to run.
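As a sketch, Python's standard timeit module can compare two approaches on a small sample before you commit to one; the deduplication task here is just an illustration:

```python
import timeit

# Hypothetical small sample: a list with every value duplicated once.
sample = list(range(10_000)) * 2

# Time two ways of deduplicating it over a fixed number of runs.
t_set = timeit.timeit(lambda: list(set(sample)), number=20)
t_dict = timeit.timeit(lambda: list(dict.fromkeys(sample)), number=20)

print(f"set: {t_set:.4f}s  dict.fromkeys: {t_dict:.4f}s")
```

Whichever method wins on the sample is usually the safer bet when the same check has to run over the full dataset.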
These checks have helped me discover numerous errors in the past. Now with the New Year kicking off, it’s as good a time as any to do some housekeeping!