How to Use Data Profiling to Avoid a Data Warehouse Quality Train Wreck

So you have a bioengineering vision for a data warehouse, you have executive support, and you are sitting down to write up your budget proposal. Once you have enough of a subject area scope defined to know roughly which source systems, and which tables your DW is going to extract from, the very next thing you need to do is some data profiling.

Data profiling should be done as early as possible.

Poor data quality is a serious issue that can undermine every aspect of a DW project. First and foremost, if the data has quality issues the actual analysis done by users of the data warehouse will not be of value.

But more directly important to a project manager is the impact data quality issues in source systems can have on the duration and quality of ETL (extract, transform and load) job development.

Blindly trusting the meta data and estimating ETL development time without looking at the source system data risks having development schedules run over as the developers discover that the simple extraction code they have written fails when it comes into contact with real world, messy data.

It is one thing to know that a given text field is supposed to contain product category code- it is quite another to know how many rows are actually populated, and if the values conform to established categories, or fall into the correct hierarchy.

At a minimum, all data sources should be profiled for the following:

  1. Value Completeness- is there a value in the field.
  2. Historical Completeness- is the historical data complete- missing data can corrupt results, but might not be noticed at a summary level.
  3. Data format – are phone numbers phone numbers? Are email addresses properly formed? Do postal codes have the correct structure?
  4. Cross column consistency- Even if all the values in the columns are correct, its important to profile across columns. Do product codes match the product categories? Are geographical tags consistent? Do the same products appear under different categories depending on transaction date?
  5. Value outliers- By looking for extremely high or low values, “killer rows” can be identified. One row with a value that is orders of magnitude off will skew averages and totals and surprisingly might not be noticed through the normal course of report validation.

The most efficient way to profile data is through the use of an automated data profiling tool. While some custom SQL queries may be necessary for very specific tests, attempting to do all checks by developing queries is extremely inefficient compared to the use of a data profiler.Strategies based on the results

Once the result of the profiling is available, it is critical to be realistic about what your analysis tells you.

Faced with data quality issues, you have a number of options:

Reduce scope to avoid data that simply is not viable

Nobody will want to hear that some key information that was advertised during the project justification phase is simply not viable, but its better to break the news now, than after spending thousands developing ETL jobs and reports that are worthless. Cut your losses if you have to.

Isolate problem areas and mark them

In some cases, its not necessary to abandon bad data completely- it might be possible to isolate bad records and ensure that they are clearly identified. If users know that for certain analysis they are only able to use subsets of the data, it might still be of value to them to run certain analysis on a portion. For example, if sales data from a subsidiary is missing certain fields, at least the full analysis can be run on a portion of the sales not affected by the problem. Reports need to clearly state “Excluding XYZ”.

Cleanse issue areas

It may be possible to cleanse data with issues early.

For example:

  1. Remove duplicates from Master files such as the customer list using matching software
  2. Identify missing information and launch data collection and/or entry sub-projects to populate the key tables
  3. Use external data sources to enhance existing or sparse data sets. For example, with an external service it might be possible to populate missing longitude and latitude information from street address to enable map based visualizations.

Revise your project budget if neededIf you have established a project budget before you undertake your data quality investigation, as a project manager you need to honestly look yourself in the mirror and answer the question “Based on what I now know, is the approved budget still viable?”

Ideally, don’t set the detailed budget until AFTER you have done the data quality investigation. Without understanding the raw materials available, it is difficult to design the data warehouse, and even harder to accurately estimate the effort involved.

Profile your data early and often. Data quality is an often overlooked key aspect of data warehouse project success.

James Standen is the Founder of nModal Solutions Inc., the creator of the Datamartist tool. Datamartist is an easy to use, visual data profiling and transformation tool.

 

Comments are closed.