So you have a bioengineering vision for a data warehouse, you have executive support, and you are sitting down to write up your budget proposal. Once you have enough of a subject area scope defined to know roughly which source systems, and which tables your DW is going to extract from, the very next thing you need to do is some data profiling.
Data profiling should be done as early as possible.
Poor data quality is a serious issue that can undermine every aspect of a DW project. First and foremost, if the data has quality issues the actual analysis done by users of the data warehouse will not be of value.
But more directly important to a project manager is the impact data quality issues in source systems can have on the duration and quality of ETL (extract, transform and load) job development.
Blindly trusting the meta data and estimating ETL development time without looking at the source system data risks having development schedules run over as the developers discover that the simple extraction code they have written fails when it comes into contact with real world, messy data.
It is one thing to know that a given text field is supposed to contain product category code- it is quite another to know how many rows are actually populated, and if the values conform to established categories, or fall into the correct hierarchy.
At a minimum, all data sources should be profiled for the following:
The most efficient way to profile data is through the use of an automated data profiling tool. While some custom SQL queries may be necessary for very specific tests, attempting to do all checks by developing queries is extremely inefficient compared to the use of a data profiler.Strategies based on the results
Once the result of the profiling is available, it is critical to be realistic about what your analysis tells you.
Faced with data quality issues, you have a number of options:
Reduce scope to avoid data that simply is not viable
Nobody will want to hear that some key information that was advertised during the project justification phase is simply not viable, but its better to break the news now, than after spending thousands developing ETL jobs and reports that are worthless. Cut your losses if you have to.
Isolate problem areas and mark them
In some cases, its not necessary to abandon bad data completely- it might be possible to isolate bad records and ensure that they are clearly identified. If users know that for certain analysis they are only able to use subsets of the data, it might still be of value to them to run certain analysis on a portion. For example, if sales data from a subsidiary is missing certain fields, at least the full analysis can be run on a portion of the sales not affected by the problem. Reports need to clearly state “Excluding XYZ”.
Cleanse issue areas
It may be possible to cleanse data with issues early.
For example:
Revise your project budget if neededIf you have established a project budget before you undertake your data quality investigation, as a project manager you need to honestly look yourself in the mirror and answer the question “Based on what I now know, is the approved budget still viable?”
Ideally, don’t set the detailed budget until AFTER you have done the data quality investigation. Without understanding the raw materials available, it is difficult to design the data warehouse, and even harder to accurately estimate the effort involved.
Profile your data early and often. Data quality is an often overlooked key aspect of data warehouse project success.
James Standen is the Founder of nModal Solutions Inc., the creator of the Datamartist tool. Datamartist is an easy to use, visual data profiling and transformation tool.