A few weeks ago one of our users told me he was suffering from "blank page syndrome" — a phenomenon usually associated with writers, but in his case it concerned data governance.
"I know Quality Gates is a great tool for monitoring data quality, but I am not sure what kind of validations I should run…" he said.
Therefore, I decided to write this post for him and the rest of our users, explaining best practices for data governance and data quality, both during development and in ongoing operations.
Quality Gates Best Practice in BI Projects:
The illustration below demonstrates the 9 steps we implement in almost every project. This can of course be extended or reduced, depending on the implementation.
To make things clearer, I will give a brief explanation of each step:
Step 1: Get Table Statistics
This step retrieves table statistics to give a rapid overview of the values without running too many queries. I usually use it when exploring a new table to load, and again after a table has been loaded into the data warehouse. Running this validation lets me answer the following questions:
- Which columns are possible candidates for a functional key?
- Do I have null or empty values?
- How many distinct values does a column contain? If there is only a single value, the column may not be worth loading.
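The questions above can be sketched as a small statistics pass over sample rows. This is a minimal illustration with made-up data and column names, not the Quality Gates implementation itself:

```python
# Per column: count null/empty values, count distinct values, and flag
# functional-key candidates (distinct count equals row count).
rows = [
    {"client_id": "C1", "country": "FR", "note": None},
    {"client_id": "C2", "country": "FR", "note": ""},
    {"client_id": "C3", "country": "DE", "note": None},
]

def column_stats(rows):
    stats = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        non_empty = [v for v in values if v not in (None, "")]
        stats[col] = {
            "nulls_or_empty": len(values) - len(non_empty),
            "distinct": len(set(non_empty)),
            # all values present and unique -> candidate functional key
            "key_candidate": len(set(non_empty)) == len(rows),
        }
    return stats

stats = column_stats(rows)
```

Here `client_id` comes out as a key candidate, while `note` is entirely empty — exactly the kind of column that may not be worth loading.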
Step 2: Duplicate Entities
Most tables contain a primary key, but this does not mean there are no duplicate entities in the table. A client can be registered in the system several times, each time with a different technical key.
Using Quality Gates allows me to:
- Identify whether a functional key is duplicated
- Validate primary keys in tables that do not enforce them as constraints; this is usually the case in Big Data implementations
- Verify that a parent-child relation is valid (no two parents for a single child)
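A duplicate-entity check boils down to grouping rows by the functional key and flagging any key that appears more than once. A minimal sketch, with an assumed client table where `national_id` is the functional key:

```python
# Each row has a distinct technical key, yet two rows describe the
# same client: the functional key "national_id" is duplicated.
from collections import Counter

clients = [
    {"tech_key": 1, "national_id": "A-100"},
    {"tech_key": 2, "national_id": "B-200"},
    {"tech_key": 3, "national_id": "A-100"},  # same client, new technical key
]

def duplicated_keys(rows, key):
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

dupes = duplicated_keys(clients, "national_id")
```

The same function also validates a non-enforced primary key (run it on the primary-key column and expect an empty result).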
Step 3: Data Existence
This test makes sure the source data we are loading actually contains data.
For example, I would expect at least one record per hour from an operational system. If that is not the case, I want to check whether the source system is working properly and whether my connection to it is still alive.
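The "at least one record per hour" expectation can be sketched as a scan over recent event timestamps. The timestamps and lookback window here are invented for illustration:

```python
# Report every hour in the lookback window that received no records.
from datetime import datetime, timedelta

now = datetime(2024, 1, 1, 12, 0)
# hypothetical record timestamps pulled from the operational source
events = [now - timedelta(minutes=m) for m in (5, 130)]

def empty_hours(events, now, lookback_hours=3):
    missing = []
    for h in range(lookback_hours):
        end = now - timedelta(hours=h)
        start = end - timedelta(hours=1)
        if not any(start <= e < end for e in events):
            missing.append(start)
    return missing

gaps = empty_hours(events, now)  # the 10:00-11:00 hour has no records
```

A non-empty result is the trigger to check the source system and the connection to it.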
Step 4: Data Validation
In many cases the source system contains bad values (empty values, bad formats, out-of-range values, etc.). To track all those cases and notify the source system owner, I use a Quality Gates test. With this test I often discover:
- Null and empty values
- Values that are out of range (e.g. negative values)
- Badly formatted values (e.g. email addresses, postal codes)
- Default values
- Invalid relations between two columns (e.g. if column X equals some value, then column Y should have a correlated value)
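Checks like these are easy to express as a rule set applied row by row. The columns and rules below are assumptions chosen to cover the categories above, not a real schema:

```python
# Rule-based validation: range check, format check, and a cross-column rule.
import re

orders = [
    {"id": 1, "amount": 50.0, "email": "a@example.com",
     "status": "SHIPPED", "ship_date": "2024-01-02"},
    {"id": 2, "amount": -10.0, "email": "not-an-email",
     "status": "SHIPPED", "ship_date": None},
]

def validate(row):
    errors = []
    if row["amount"] < 0:
        errors.append("negative amount")            # out-of-range value
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", row["email"] or ""):
        errors.append("bad email format")           # bad format
    if row["status"] == "SHIPPED" and not row["ship_date"]:
        errors.append("shipped without ship_date")  # cross-column rule
    return errors

problems = {row["id"]: validate(row) for row in orders if validate(row)}
```

The output maps each offending row to its list of violations — a convenient shape for a notification to the source system owner.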
Step 5: Consistency between Sources
Some BI and data warehouse platforms contain several layers (e.g. Operational – ERP – ODS – Staging – Data Warehouse – Data Marts – Cube). In many cases you will find that the numbers differ between those layers. The reasons for such gaps vary (I list some of them below), and in most cases we only become aware of the problem weeks or months after the project has gone live. To be proactive on this matter, I apply a consistency validation between each pair of sources, and on a weekly, daily, or hourly basis I am alerted whether the different layers are aligned.
Here are some of the reasons why gaps can appear:
- Different data types between sources cause truncation of values
- Incremental/delta loading did not use a good key, so some data was not loaded
- A wrong query implementation when loading into a new source
- A query/cube join with referential integrity issues causes missing numbers
- Data is not refreshed consistently in one of the sources
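A consistency validation typically compares a few control measures (row count, sum of a key numeric column) between two layers. A minimal sketch with invented staging and warehouse figures:

```python
# Compare control totals between two layers; any non-zero gap means
# the layers are not aligned.
staging = [("2024-01-01", 100.0), ("2024-01-02", 250.0)]
warehouse = [("2024-01-01", 100.0)]  # one day missing after the load

def control_totals(rows):
    return {"row_count": len(rows), "amount_sum": sum(a for _, a in rows)}

gap = {
    k: control_totals(staging)[k] - control_totals(warehouse)[k]
    for k in ("row_count", "amount_sum")
}
consistent = all(v == 0 for v in gap.values())
```

Run on a schedule, a `consistent == False` result is exactly the early warning that otherwise only surfaces weeks after go-live.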
Step 6: Validating Reference Values
As I mentioned above, there are cases where a fact table contains a value with no reference in a dimension table (e.g. a Client Id in the Sales fact does not exist in the Client dimension!).
Some projects already have a mechanism to complete missing reference values, but there are still cases where a value was not completed, for various technical reasons.
Implementing this step always gives me good awareness of referential integrity problems in the project, and I can take immediate action without waiting for an error from the cube or missing numbers in the reports.
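At its core this check is a set difference: fact keys that do not appear in the dimension. A sketch with assumed table layouts:

```python
# Orphan detection: fact rows whose client_id has no match in the dimension.
fact_sales = [
    {"client_id": "C1", "amount": 10},
    {"client_id": "C9", "amount": 20},  # no such client in the dimension
]
dim_client = {"C1", "C2", "C3"}

orphans = sorted({row["client_id"] for row in fact_sales} - dim_client)
```

Any orphan found here would otherwise surface later as a cube processing error or as missing numbers in a report.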
Step 7: Monitor Reject and Log Tables
I often ask BI managers whether they use reject tables. Some of them proudly say that they do. My second question is usually: how often do you monitor those tables? At that point the eyes usually start to wander around the room… The same goes for the log tables that are supposed to monitor the different processes in the BI system.
The best practice, as we know, is to monitor those log and reject tables and get an alert when new problematic records appear in them. I can also control the record-count threshold, getting an alert only when there are more than 10, 20, or 100 problematic records…
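The threshold logic itself is simple: compare the current reject-table count against the count from the previous check, and alert only when the growth exceeds the configured threshold. The names and counts below are illustrative:

```python
# Alert only when the number of NEW rows in a reject/log table since the
# last check exceeds the threshold.
def should_alert(current_count, last_seen_count, threshold=10):
    new_rows = current_count - last_seen_count
    return new_rows > threshold

fired = should_alert(current_count=137, last_seen_count=120, threshold=10)
```

Raising the threshold (20, 100, …) trades alert noise for reaction time, which is exactly the knob described above.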
Step 8: Report Validation
This step makes sure the users are getting the right numbers in their reports.
The morning meeting is the best time to find out whether the users are satisfied with their reports and dashboards. I often hear about two recurring problems: 1) a user immediately notices that the numbers in a report are out of range and cannot be correct; 2) an area or a product has suddenly disappeared from a report!
To prevent this situation, we take the user's knowledge and implement it as a test. The test performs the same instinctive validation the user would do, before the user ever sees the report in the morning meeting.
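The user's instinctive checks can be encoded as two rules: totals must stay inside an expected range, and no area present yesterday may vanish today. The report figures and range bounds here are invented:

```python
# Encode the user's sanity checks: range check on the total, plus
# detection of areas that disappeared since the previous run.
yesterday = {"North": 1200.0, "South": 900.0}
today = {"North": 1150.0}  # "South" vanished from the report

def report_issues(today, yesterday, low=500.0, high=5000.0):
    issues = []
    total = sum(today.values())
    if not (low <= total <= high):
        issues.append("total out of range")
    for area in yesterday:
        if area not in today:
            issues.append(f"area disappeared: {area}")
    return issues

issues = report_issues(today, yesterday)
```

When this test fires before the morning meeting, the problem is fixed (or at least explained) before the user opens the report.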
Step 9: Alert!
Alerts are perhaps the most important step. We want to make sure the relevant project managers and developers get the alerts in time to handle a problem before the business users notice it. Business users can subscribe to alerts as well and be aware of the situation in real time – in many organizations I have seen this work as a positive practice that keeps the business users part of the process and increases confidence.