With more of our decisions and activities becoming data driven, we need to ensure the quality of the data that we’re using. Data cleaning (or data cleansing, data scrubbing) broadly refers to the processes that have been developed to help organizations have better data. These processes have a wide range of benefits for any organization that chooses to implement them, but better decision making may be the one that comes to mind first.
In this post, we cover the most common questions related to data cleaning: what it is, why it matters, and how to get started.
What Exactly is Data Cleaning?
Data cleaning, or cleansing, is the process of correcting and deleting inaccurate records from a database or table. Broadly speaking, it consists of identifying and replacing incomplete, inaccurate, irrelevant, or otherwise problematic (‘dirty’) data and records.
With effective cleansing, all data sets should be consistent and free of any errors that could be problematic during later use or analysis.
Why do we need data cleaning?
Data is arguably one of the most vital assets an organization has to help support and guide its success. Some examples of problems that can arise out of inaccurate data are:
- Marketing: An ad campaign built on low-quality data reaches out to users with irrelevant offers. This not only reduces customer satisfaction but also misses significant sales opportunities.
- Sales: A sales representative fails to contact previous customers because their records are incomplete or inaccurate.
- Compliance: An online business receives penalties from regulators for failing to meet data privacy rules for its customers. Facebook faced exactly such penalties in the wake of the Cambridge Analytica scandal.
- Operations: Configuring robots and other production machines based on low-quality operational data can cause major problems for manufacturing companies.
The Benefits of Data Cleaning
Better quality data impacts every activity that involves data, and almost all modern business processes do. As a result, when data cleaning is treated as an important organizational effort, it can lead to a wide range of benefits for all. Some of the biggest advantages include:
- Streamlined business practices: Imagine if there are no duplicates, errors, or inconsistencies in any of your records. How much more efficient would all of your key daily activities become?
- Increased productivity: Being able to focus on key work tasks instead of hunting for the right data or making corrections caused by incorrect data is essential. Having access to clean, high-quality data, supported by effective knowledge management, can be a game changer.
- Faster sales cycle: Marketing decisions depend on data. Giving your marketing department the best quality data possible means better and more leads for your sales team to convert. The same concept applies to B2C relationships too!
- Better decisions: We touched on this before, but it’s important enough that it’s worth repeating. Better data = better decisions.
These different benefits in conjunction generally lead to a business that is more profitable. This is not only because of better external sales efforts, but also because of more efficient internal efforts and operations.
Defining High Quality Data
There are a few criteria that help to qualify data as high quality. They are:
- Validity: How closely the data meets defined business rules or constraints. Some common constraints include:
- Mandatory constraints: Certain columns cannot be empty
- Data-type constraints: Values in a column must be of a certain data type
- Range constraints: Minimum and maximum values for numbers or dates
- Foreign-key constraints: Values in a column must appear in a column of another table that contains unique values
- Unique constraints: A field or fields must be unique in a dataset
- Accuracy: How closely data conforms to a standard or a true value.
- Completeness: How thoroughly the data and related measures are known
- Consistency: The equivalency of measures across systems and subjects
- Uniformity: Ensuring that the same units of measure are used in all systems
- Traceability: Being able to find (and access) the source of the data
- Timeliness: How quickly and recently the data has been updated
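As a minimal sketch, several of the validity constraints above (mandatory, data-type, range, and unique) can be checked in a few lines of plain Python. The `email` and `age` fields and their rules are hypothetical examples, not a prescribed schema:

```python
# Sketch: validating rows against mandatory, data-type, range, and
# unique constraints. Field names and rules are illustrative assumptions.
def validate(rows):
    errors = []
    seen_emails = set()
    for i, row in enumerate(rows):
        if not row.get("email"):                      # mandatory constraint
            errors.append((i, "email is required"))
        elif row["email"] in seen_emails:             # unique constraint
            errors.append((i, "duplicate email"))
        else:
            seen_emails.add(row["email"])
        age = row.get("age")
        if not isinstance(age, int):                  # data-type constraint
            errors.append((i, "age must be an integer"))
        elif not 0 <= age <= 120:                     # range constraint
            errors.append((i, "age out of range"))
    return errors

rows = [
    {"email": "a@example.com", "age": 34},
    {"email": "a@example.com", "age": 150},   # duplicate + out of range
    {"email": "", "age": "n/a"},              # missing + wrong type
]
print(validate(rows))
```

In practice these rules usually live in the database schema or a validation library, but the idea is the same: each constraint is a cheap, mechanical check.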
5 Steps to Cleaner Data
Want cleaner data? Sometimes it can be helpful to consider bringing on an outside consultant to help you begin. However, before you do that, there are a few general steps that any organization can follow to start getting into a better data cleaning mindset:
1. Develop a data quality plan. It is essential to first understand where the majority of errors occur so that the root cause can be identified and a plan built to manage it. Remember that effective data cleaning practices will have an overarching impact throughout an organization, so it is important to remain as open and communicative as possible. A plan needs to include:
- Owners: A C-level executive, ideally the Chief Data Officer (CDO) if the company has appointed one. Additionally, business and technical owners need to be assigned for different data domains.
- Metrics: Ideally, data quality should be summarizable as a single number on a 1-100 scale. While different data can have different quality levels, having an overall number helps the organization track its continuous improvement. This overall number can give more weight to data that is critical to the company's success, helping prioritize data quality initiatives that impact important data.
- Actions: A clear set of actions should be identified to kick off the data quality plan. Over time, these actions will need to be updated as data quality changes and as company priorities change.
2. Correct data at the source: If data can be fixed before it becomes an erroneous (or duplicated) entry in the system, it saves hours of time and stress down the line. For example, if your forms are overcrowded and require too many fields to be filled, you will get data quality issues from those forms. Given that businesses are constantly producing more data, it is crucial to fix data at the source.
3. Measure data accuracy: Invest in the time, tools, and research necessary to measure the accuracy of your data in real time.
4. Manage data and duplicates. If some duplicates do sneak past your new entry practices, be sure to actively detect and remove them. After removing any duplicate entries, it is important to also consider the following:
- Standardizing: Confirming that the same type of data exists in each column.
- Normalizing: Ensuring that all data is recorded consistently.
- Merging: When data is scattered across multiple datasets, merging is the act of combining relevant parts of those datasets to create a new file.
- Aggregating: Sorting data and expressing it in a summary form.
- Filtering: Narrowing down a dataset to only include the information we want
5. Append data. An append is a process that helps organizations to define and complete missing information. Reliable third party sources are often one of the best options for managing this practice.
Upon completing these 5 steps, your data will be ready to be exported and analyzed as needed. Remember that with large datasets, 100% cleanliness is next to impossible to achieve.
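The deduplication and normalization work described in step 4 can be sketched in plain Python. The field names and the country mapping below are illustrative assumptions, not a fixed recipe:

```python
# Sketch: removing duplicates and normalizing inconsistent records.
# Field names and normalization rules are illustrative assumptions.
def normalize(record):
    # Normalizing: record values consistently (trimmed, lowercased email,
    # one canonical country code).
    country_map = {"usa": "US", "u.s.": "US", "united states": "US"}
    return {
        "email": record["email"].strip().lower(),
        "country": country_map.get(record["country"].strip().lower(),
                                   record["country"].strip().upper()),
    }

def deduplicate(records):
    # Dedup: keep only the first occurrence of each normalized email.
    seen, clean = set(), []
    for rec in map(normalize, records):
        if rec["email"] not in seen:
            seen.add(rec["email"])
            clean.append(rec)
    return clean

raw = [
    {"email": "Ada@Example.com ", "country": "USA"},
    {"email": "ada@example.com", "country": "United States"},  # duplicate
    {"email": "bob@example.com", "country": "u.s."},
]
print(deduplicate(raw))
```

Note that normalizing before deduplicating matters: the first two records only reveal themselves as duplicates once casing and whitespace are standardized.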
Making Big Data, Clean Data
As with many other data management tasks, ensuring the cleanliness of big data presents its own unique set of considerations. As a result, a number of techniques have been developed to assist in cleaning big data:
Conversion tables: When certain data issues are already known (for example, that the names included in a dataset are written in several ways), it can be sorted by the relevant key and then lookups can be used in order to make the conversion.
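A conversion table is essentially a lookup keyed on the known variants. As a minimal sketch (the company-name variants here are made up):

```python
# Sketch: a conversion table mapping known spelling variants of a name
# to one canonical form. The variants are hypothetical examples.
conversion = {
    "Acme Inc": "Acme Inc.",
    "Acme Incorporated": "Acme Inc.",
    "ACME": "Acme Inc.",
}

names = ["ACME", "Acme Inc.", "Acme Incorporated"]
canonical = [conversion.get(n, n) for n in names]
print(canonical)  # every known variant collapses to "Acme Inc."
```

Values already in canonical form (or unknown variants) pass through unchanged, which keeps the conversion safe to re-run.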
Histograms: These allow for the identification of values that occur less frequently and may be invalid.
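A quick frequency count is often all a histogram needs to be for this purpose. In this sketch (with made-up country codes and a deliberate typo), rare values surface immediately:

```python
from collections import Counter

# Sketch: a frequency histogram flags rare values that may be invalid.
# The country codes, and the typo "USS", are illustrative.
values = ["US"] * 40 + ["DE"] * 25 + ["FR"] * 10 + ["USS"]
histogram = Counter(values)

suspects = [v for v, count in histogram.items() if count <= 2]
print(suspects)  # rare values worth a manual look
```

The threshold (here, two or fewer occurrences) depends on the dataset; rare values are not always wrong, only worth inspecting.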
Tools: Every day major vendors are coming out with new and better tools to manage big data and the complexities that can accompany it.
Algorithms: Spell-check or phonetic algorithms (such as Soundex) can be useful, but they can also make the wrong suggestion.
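A fuzzy-matching suggestion, here via Python's standard `difflib`, illustrates both the value and the risk: it proposes the closest known value, which may or may not be the right correction. The vocabulary and the misspelling are made up:

```python
import difflib

# Sketch: suggesting corrections against a list of known-good values.
# The city list and the misspelling "Bostom" are illustrative.
known_cities = ["Berlin", "Boston", "Barcelona", "Budapest"]
suggestions = difflib.get_close_matches("Bostom", known_cities, n=1)
print(suggestions)  # a plausible fix, but worth a human review
```

This is why such algorithms work best as suggestion engines with a review step, rather than as fully automatic corrections.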
About Manual Data Interventions
Today it is almost never economical to manually edit data for improvement. However, in the case of extremely valuable data, or when millions of labeled data points are needed (as with image recognition systems), manual data updates may make sense. If manual updates will be made to the data, some best practices to keep in mind include:
- Be sure to sort data by different attributes
- In the case of larger datasets, try breaking them down into smaller sets to increase iteration speed
- Consider creating a set of utility functions, such as remapping values based on a CSV file or regex search-and-replace
- Keep records of every cleaning operation
- Sampling can be a great way to assess quality. Once you know your data quality tolerance limits, they can help you decide on a sample size. For example, if you have 1,000 rows and need to make sure that a data quality problem is no more common than 5%, checking a random sample of about 100 rows (10%) and finding few or no issues gives reasonable confidence that you are within that limit
- Analyze summary statistics such as standard deviation or number of missing values to quickly locate the most common issues
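The last point above can be sketched with Python's standard `statistics` module. The `amounts` column and its values are made up; the idea is that a missing-value count and a suspiciously large standard deviation both surface problems quickly:

```python
import statistics

# Sketch: quick summary statistics to surface common data issues.
# The "amounts" values are illustrative, including one obvious outlier.
amounts = [10.0, 12.5, None, 11.0, 9.5, None, 10000.0]

missing = sum(1 for a in amounts if a is None)
present = [a for a in amounts if a is not None]
stdev = statistics.stdev(present)

print(f"missing values: {missing}")
print(f"std deviation: {stdev:.1f}")  # a huge stdev hints at an outlier
```

Neither number proves anything on its own, but both tell you where to look first.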
Best Practices in Data Cleaning
There are several best practices that should be kept in mind throughout any data cleaning endeavor. They are:
- Consider your data in the most holistic way possible – thinking about not only who will be doing the analysis but also who will be using the results derived from it
- Increased controls on database inputs can ensure that cleaner data is what ends up being used in the system
- Choose software solutions that are able to highlight and potentially even resolve faulty data before it becomes problematic
- In the case of large datasets, be sure to limit your sample size in order to minimize prep time and accelerate performance
- Spot check throughout to prevent any errors from being replicated
Challenges with Data Cleaning
Data cleaning, though essential for the ongoing success of your organization, is not without its own challenges. Some of the most common include:
- Limited knowledge about what is causing anomalies, making it difficult to design the right transformations
- Data deletion, where a loss of information leads to incomplete data that cannot be accurately ‘filled in’
- Ongoing maintenance can be expensive and time consuming
- It is difficult to build a data cleansing graph to assist with the process ahead of time
Having clean data is just one part of a more comprehensive data management effort, but its importance for the best decision making possible should not be underestimated. If you are interested in learning more about how technology, data, and organizations are all evolving, be sure to check out our blog’s data section.