LUTZ BUSINESS INSIGHTS
Dirty Data? 6 Ways to Cleanse a Database
Tony DeSantis, Data Analytics Director
Organizations can only make truly effective strategic decisions when the data they review is dependably accurate. That level of accuracy is hard to obtain and maintain, as data integrity tends to degrade over time. Inaccuracies creep in through a lack of controls within a system or process, staff error, or inaccurate changes made by users, customers, and other parties. Businesses must be proactive to keep their database information clean and accurate.
What is Dirty Data?
Dirty data is the name given to inaccurate or unreliable information held in a database. It may contain spelling or punctuation errors, or it may be outdated, incomplete, or missing altogether. Bad data can also be duplicated or inconsistent, such as when the name of a state is spelled out in some records but abbreviated in others.
How Dirty Data Causes Issues
Once you understand what constitutes dirty data, it becomes clear how it can cause major issues for firms, most of which rely heavily on this information for their strategic decision-making. With bad data, even minor reports may be significantly inaccurate, data security and privacy issues can arise, information becomes harder to aggregate, and larger accounting and financial reports can contain significant inaccuracies.
How Accurate is Your Data?
If your firm seems to be struggling with data issues, you are not alone. Many organizations face problems that arise from dirty data. Companies that struggle with nonexistent data, unreliable data, or data discrepancies (no single source of truth) should consider whether their databases have lost their accuracy.
Strategizing the Data Cleansing Process
After recognizing you have a problem with dirty data, the next step is to develop a data cleansing strategy. Some points to consider when organizing this initiative include:
- Deciding how clean is “clean enough.”
- Identifying what data is actually dirty and how it affects reporting and analytics strategies.
- Identifying the cause(s) of dirty data.
Not every company will require the same level of data cleanliness. For example, some will be willing to accept street addresses that abbreviate words such as “Street” or “Drive.” Others will have stricter requirements, such as expanding all abbreviated street addresses to their full form.
Companies must pay particular attention to data that feeds their reporting and analysis efforts. Some instances of bad data may not affect reporting and analytics enough to justify a stringent cleaning effort. In other cases, it may be “mission-critical” to thoroughly cleanse bad data and to employ methods that keep the newly cleansed data accurate going forward.
Lastly, if data is important enough to go through the cleansing process, determine how it became dirty in the first place. Any process that allowed the insertion of bad data must be corrected. The most common underlying issue is inaccurate data entry. Other areas to examine include whether multiple sources capture the same data, how data is captured (e.g., free-form text fields vs. drop-down lists and checkboxes), and other processes with “holes” that allowed the input of bad data.
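To illustrate how one such hole can be closed at the point of entry, here is a minimal Python sketch that validates a free-form state field against a canonical list, mimicking the control a drop-down list would provide. The field names and the sample state list are illustrative assumptions, not part of any particular system:

```python
# Illustrative sketch: reject free-form state entries that are not in a
# canonical list, so inconsistent spellings never enter the database.

# Sample subset of canonical abbreviations (assumption for illustration).
CANONICAL_STATES = {"NE": "Nebraska", "IA": "Iowa", "IL": "Illinois"}

def validate_state(raw: str) -> str:
    """Return the canonical state abbreviation, or raise if unrecognized."""
    value = raw.strip()
    # Accept the abbreviation directly, ignoring case.
    if value.upper() in CANONICAL_STATES:
        return value.upper()
    # Also accept the spelled-out name, mapping it back to the abbreviation.
    for abbr, name in CANONICAL_STATES.items():
        if value.lower() == name.lower():
            return abbr
    raise ValueError(f"Unrecognized state entry: {raw!r}")

print(validate_state("Nebraska"))  # -> NE
print(validate_state("ia"))        # -> IA
```

Whether entries are normalized to abbreviations or full names matters less than picking one convention and enforcing it at every input point.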
6 Ways to Clean Data
Not every step in the list below will be necessary for every company that needs to cleanse its database. The methods a company decides to employ will depend on the type of data it is trying to scrub.
How to clean a database:
- Leverage in-house skill sets to write programs or scripts to address bad data. Tools such as Python, R, and SQL can all be used to merge, update, and delete data.
- Employ automated programs to identify and merge duplicate fields, records, etc.
- Consider the value of purging a database of specific records at a pre-determined time. This can help clean the remaining data and reduce the scope of data to be evaluated.
- Update outdated records and fields with newer data and determine whether it is necessary to retain old data. If you are deleting old information, create a change log or history file to store the outdated data.
- Enrich missing data by employing easily accessible in-house data. Some firms may want to consider employing a data service that already has the needed data, to allow for quicker updates.
- Standardize data by developing naming convention standards for data sources across the board. Also, determine which fields should be standardized, then employ tools that automatically look for opportunities to standardize fields (e.g., St vs. St. vs. Street). Once clean data is in place, ensure the input processes for the newly cleansed fields will only allow for the insertion of data in a specified format.
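Two of the steps above, standardizing field values and merging duplicates, can be sketched in a few lines of Python. The records, field names, and abbreviation map below are made-up illustrations of the approach, not a production routine:

```python
# Minimal sketch: expand street abbreviations to a standard form, then
# drop records that become duplicates once standardized.

# Illustrative abbreviation map (assumption, not an exhaustive list).
ABBREVIATIONS = {"st": "Street", "dr": "Drive", "ave": "Avenue"}

def standardize_address(address: str) -> str:
    """Expand a trailing abbreviation such as 'St' or 'St.' to its full form."""
    words = address.split()
    last = words[-1].rstrip(".").lower()
    if last in ABBREVIATIONS:
        words[-1] = ABBREVIATIONS[last]
    return " ".join(words)

def dedupe(records):
    """Keep the first record for each standardized (name, address) pair."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["name"].lower(), standardize_address(rec["address"]).lower())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"name": "Acme Co", "address": "123 Main St"},
    {"name": "ACME CO", "address": "123 Main Street"},  # duplicate once cleansed
]
print(dedupe(records))  # only the first record remains
```

Note that the duplicates only become visible after standardization, which is why the two steps are usually run together.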
In summary, having accurate data to draw from when making business decisions is vital. By diligently addressing all the issues associated with dirty data, companies will greatly improve the accuracy of their reporting and analysis efforts, both now and in the future.
ABOUT THE AUTHOR
TONY DESANTIS + DATA ANALYTICS DIRECTOR
Tony DeSantis is the Data Analytics Director at Lutz. He began his career in 1999. He is responsible for interpreting and analyzing data, as well as designing report visuals in support of client engagements. In addition, he specializes in data management and the application of artificial intelligence to simplify business processes.
AREAS OF FOCUS
- Data Analytics
- Data Visualization
- Data Management
- Artificial Intelligence
- Forensic Analytics
- BS in Finance and Operations, Minor in Management Information Systems, University of Delaware, Newark, DE
- Junior Achievement, Volunteer
- Gilda's Club Chicago, Past Board Member