Welcome! GovernYourData.com is an open peer-to-peer community of data governance practitioners, evangelists, thought leaders, bloggers, analysts and vendors.

The goal of this community is to share best practices, methodologies, frameworks, education, and other tools to help data governance leaders succeed in their efforts.

Is it possible to have 100% trust in your data?

I’ve been having discussions in recent months with colleagues about what it means to trust your data.  If trust is an important element, then we should be able to take it beyond a qualitative definition. In other words, what exactly does it mean to “trust” data in concrete terms? My conclusion is that in order to trust data you need four things: Transparency, Accountability, Verification and Change Control. These words sound like “motherhood and apple pie”, so let’s get more specific.

Transparency:

  • All business terms published in a searchable glossary
  • System of Record (SoR) or System of Source (SoS) and data lineage identified for each term

In plain English, this means that we know what the data means and we know where it came from.  We should not have to call a meeting to figure out what a data item means, and we shouldn’t have to dig through a host of Word or Excel documents stored in a shared folder (or someone’s desk drawer) to figure out how the data got to the warehouse.  This information should be readily available to anyone who needs to know, using simple tools (like a web browser) with powerful search capabilities.  What data means and where it comes from should not be hard to find and should not be a mystery.
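
To make this concrete, here is a minimal sketch (in Python) of what a searchable glossary entry might look like, with the SoR/SoS and lineage recorded alongside the definition.  The term, systems and lineage hops below are invented for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class GlossaryTerm:
        """One entry in the searchable business glossary (illustrative only)."""
        name: str              # business term, e.g. "Net Revenue"
        definition: str        # plain-English meaning
        system_of_record: str  # SoR/SoS where the value originates
        lineage: list = field(default_factory=list)  # hops from source to warehouse

    terms = [
        GlossaryTerm(
            name="Net Revenue",
            definition="Gross revenue minus returns, discounts and allowances.",
            system_of_record="Billing System",
            lineage=["Billing System", "ETL Hub", "Enterprise Data Warehouse"],
        ),
    ]

    def search(keyword):
        """Simple keyword search across names and definitions."""
        kw = keyword.lower()
        return [t for t in terms if kw in t.name.lower() or kw in t.definition.lower()]

    for t in search("revenue"):
        print(t.name, "->", " -> ".join(t.lineage))

With something like this behind a web-browser search box, “what does it mean?” and “where did it come from?” can be answered without calling a meeting.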

Accountability:

  • Business Owner and IT Owner assigned for every SoR/SoS
  • Data Steward assigned to each enterprise business term

Trust in data requires that there is clear ownership for data that is captured and stored by each application in the enterprise.  The owners are primarily responsible for data quality – if a DQ problem pops up, we need someone to be accountable for fixing it.  The business owner is responsible for business processes that impact data quality and for justifying resources (i.e. investments) for making changes.  The IT owner is responsible for managing the change process and ensuring that system service levels are maintained.

The Data Stewards on the other hand are responsible for data that is shared across applications. If a particular data element is used in one and only one application, then the business and IT owner for the system is sufficient, but as soon as the data is shared across multiple systems, you need someone to take explicit responsibility.

And oh by the way, the data warehouse is a system, and so is a data mart, the ETL hub, and the Enterprise Service Bus.  Each of these systems must have a business and IT owner.  Sometimes you may have an IT person play both roles, in particular when the system in question is one that is serving the needs of an IT process (such as an IT project management system), but in general each system should have two owners.
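
One simple way to picture the accountability model is a registry that records both owners for every system and a steward for every shared term.  Here is a rough sketch; the system and people names are made up.

    from dataclasses import dataclass

    @dataclass
    class SystemOwnership:
        system: str          # SoR/SoS, warehouse, mart, ETL hub, ESB ...
        business_owner: str  # accountable for business processes and funding fixes
        it_owner: str        # accountable for change management and service levels

    @dataclass
    class StewardAssignment:
        term: str            # enterprise business term shared across systems
        data_steward: str    # accountable once the data crosses system boundaries

    systems = [
        SystemOwnership("Billing System", "J. Doe (Finance)", "A. Smith (IT)"),
        SystemOwnership("Enterprise Data Warehouse", "M. Lee (Analytics)", "R. Patel (IT)"),
    ]
    stewards = [StewardAssignment("Net Revenue", "K. Nguyen")]

    # Quick completeness check: every system must have both owners assigned.
    unowned = [s.system for s in systems if not (s.business_owner and s.it_owner)]
    print("Systems missing an owner:", unowned or "none")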

Verification:

  • Data quality scorecard for individual elements
  • SLA monitoring and alerting for SoR/SoS and middleware

IT systems operate in an extremely dynamic environment – the data itself can drive changes in how it is processed.  Even if we trust the data today, we need to constantly monitor it to make sure that we can still trust it tomorrow.  If a particular BI report relies on data being available at 8:00 am every day, then the consumers of the information need to be notified on days when the deadline is missed, and the owners of the SoR and SoS need to be notified so that they can take appropriate actions.
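
As a rough illustration of the SLA side of verification, here is a small sketch of that 8:00 am scenario.  The feed location and the notification mechanism are hypothetical; in practice the alert would go to email, chat or whatever monitoring tool you already run.

    from datetime import datetime, time
    from pathlib import Path

    SLA_DEADLINE = time(8, 0)                              # data due by 8:00 am
    FEED_FILE = Path("/data/warehouse/daily_revenue.csv")  # hypothetical feed location

    def notify(audience, message):
        # Placeholder for email/chat/alerting integration.
        print(f"[ALERT to {audience}] {message}")

    def check_sla(now):
        """Alert consumers and SoR/SoS owners if today's feed missed its deadline."""
        if now.time() < SLA_DEADLINE:
            return  # too early to judge
        if not FEED_FILE.exists():
            notify("BI consumers", "Daily revenue feed missed the 8:00 am SLA.")
            notify("SoR/SoS owners", "Daily revenue feed not delivered; please investigate.")

    check_sla(datetime.now())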

Change Control:

  • Impact analysis and controls for end-to-end data lineage

The enterprise system-of-systems is constantly changing, and because it is a complex adaptive system, by its very nature it will demonstrate some unpredictable behaviors; a small change to one system may cause a problem with what had appeared to be a totally unrelated system.  In order to trust data, we need a robust change control process that a) strives to minimize “black swan” events (major unpredictable outages), and b) recovers quickly when changes do cause problems.
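
One way to picture the impact-analysis half of change control is a downstream walk over the end-to-end lineage graph: given a system you intend to change, list everything it feeds, directly or indirectly.  The systems below are invented; a real implementation would read the edges from your metadata repository.

    from collections import defaultdict, deque

    # Lineage edges: upstream system -> systems it feeds (hypothetical example).
    lineage = defaultdict(list)
    for upstream, downstream in [
        ("Billing System", "ETL Hub"),
        ("ETL Hub", "Data Warehouse"),
        ("Data Warehouse", "Finance Mart"),
        ("Finance Mart", "Revenue Dashboard"),
    ]:
        lineage[upstream].append(downstream)

    def impacted_by(changed_system):
        """Breadth-first walk downstream: everything a change could touch."""
        seen, queue = set(), deque([changed_system])
        while queue:
            for nxt in lineage[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return sorted(seen)

    print(impacted_by("ETL Hub"))
    # ['Data Warehouse', 'Finance Mart', 'Revenue Dashboard']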

In summary, if we have these seven things: 1) a searchable glossary, 2) SoR metadata and lineage, 3) Business and IT owners for SoRs, 4) Data Stewards for enterprise data, 5) data quality scorecards, 6) SLA monitoring, and 7) change controls, then we should have 100% trust in data.  This doesn’t mean that data won’t be wrong sometimes, but when it is, we will know about it quickly and will have clear accountability for correcting it.  What more could we ask for?

To the readers of this article, do you agree?  Have I missed anything?

Replies to This Discussion

Excellent discussion, John.

I definitely agree that it is possible to have 100% trust in data as long as trust is not equated with perfection.

As you explained, trusted data is not perfect data. Trusted data is transparent data, honest about its imperfections. You can achieve trust through transparency with pervasive data quality monitoring, since you can’t fix what you can’t see. Even more important, concealing or ignoring known data quality issues will only decrease users’ trust in the data.

As you also noted, comprehensive metadata management allows us to see our data quality challenges as they truly are, and facilitates the communication and collaboration necessary for success.

Documenting the trail of digital breadcrumbs allows tracing the data from a report all the way back to the source, providing an overview of any data transformations or data quality rules applied along the way. That visibility is essential for troubleshooting existing issues and performing impact analysis on proposed changes; without it, dangerous assumptions can be made about how well the collaborative team understands the business problems and related data challenges it is trying to solve.

You can also achieve a mutual understanding through transparency with, as you noted, searchable glossaries providing a clear picture of the terminology (both business and technical) surrounding the data and its processes (again, both business and technical).

Often the root cause of untrusted data can be traced to the lack of a shared understanding of the roles and responsibilities involved, many of which you described (e.g., owners and stewards).

Data governance policies can help establish ownership and accountability for those roles, and provide the framework for establishing a pervasive program for ensuring that data is of sufficient quality so that it can be trusted to meet the current and evolving business needs of the organization.

Best Regards,

Jim Harris

Hi John,

Jim is right about perfection. As your data is a model of reality, you can be confident that your data is the best it can be. You can measure and profile it. You can manipulate, migrate and retire it. It is possible to know anything and everything about it. You have more control over your data than you have over reality.

My point is that as reality is always changing, there is always a latency between changes in reality and the model being updated to reflect them. As long as you are confident that your model is regularly checked against reality, and that your ability to measure reality is as good as your ability to measure your data, then you can be confident in your data. But it won't be perfect.

However, confidence in data covers many other subjects. For instance:

How confident are you that your data won't be stolen? Are you confident that your system is free of viruses, worms, etc.? Are you confident that only the right people are allowed to view your sensitive reports? Jim rightly touched upon roles and responsibilities. Are you confident that your data is archived and destroyed appropriately?

How about usage of data? How do you know that your data is being used correctly? How many different versions of the same report exist? Are you confident that your processes for transferring data are secure? 

To me, confidence comes not just from the content of the data, but also from how it is managed from creation to destruction.

Great question, John. Interested what others think.

Hi John,

Great thread here.  I have often wondered if data quality perfection is achievable or even a good thing to try to achieve.  There is a peak in the cost-return curve here.

A good near-term business goal in my opinion would be to enable business-IT collaboration so that business people could provide context to IT, including the business impact of bad data in a project they are specifying.  Something along the lines of:

  • If this piece of data is wrong by 5%, it could put us out of business
  • If this piece of data is wrong by 5%, it would be annoying
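
One lightweight way to capture that kind of business context, as a sketch, would be an impact register that pairs each element with a tolerance and a plain-English consequence.  The element names and numbers below are invented for illustration.

    # Hypothetical impact register: tolerance plus plain-English consequence per element.
    impact_register = {
        "customer_credit_limit": {"tolerance_pct": 5, "impact": "could put us out of business"},
        "marketing_segment_label": {"tolerance_pct": 5, "impact": "would be annoying"},
    }

    def prioritize(observed_error_pct):
        """Flag elements whose observed error rate exceeds the business tolerance."""
        return [
            element
            for element, error in observed_error_pct.items()
            if element in impact_register and error > impact_register[element]["tolerance_pct"]
        ]

    print(prioritize({"customer_credit_limit": 7.2, "marketing_segment_label": 1.0}))
    # ['customer_credit_limit']

That way IT can see which fixes are worth the cost and which sit past the peak of the cost-return curve.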

Great comments!

Jim, I like the explicit connection you made to say that 100% trust doesn’t equate to perfection. Just like with human beings, you can have total trust in someone but that doesn’t mean they won’t mess up occasionally.

Richard, you make a couple of really good points. First is that data is a “model” of reality, and as such it is imperfect and there is often a lag before the model catches up with reality.  In some scenarios we can eliminate the lag by having reality follow the model.  For example, if we want to deploy a new BI report, we could model the changes in metadata, then have tools push the new report to production based on the manifest.  In other words, we change reality by changing the model first.  There are limited scenarios where this could work, but it’s a start.
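
As a sketch of that “change the model first” idea, imagine the new report being described entirely by a metadata manifest that a deployment tool validates and then pushes to production.  The manifest fields and the deploy step below are hypothetical.

    # Hypothetical manifest describing a new BI report entirely in metadata.
    manifest = {
        "report": "Regional Margin",
        "source": "Finance Mart",
        "columns": ["region", "gross_margin_pct"],
        "refresh_schedule": "daily 06:00",
    }

    REQUIRED = {"report", "source", "columns", "refresh_schedule"}

    def deploy(m):
        """Validate the metadata first, then (in a real tool) push it to production."""
        missing = REQUIRED - m.keys()
        if missing:
            raise ValueError(f"Manifest incomplete, cannot deploy: {sorted(missing)}")
        print(f"Deploying '{m['report']}' from {m['source']} on schedule {m['refresh_schedule']}")

    deploy(manifest)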

The other points you highlighted are security and data life-cycle considerations such as archival and destruction.  These were implied in my original list, but I think they are important enough that they should be listed explicitly. So that brings the list of 7 up to 9 key points.

Your final point about appropriate use (or misuse) of data is an interesting one.  I’m not sure how we can monitor and protect against stupidity, but it is a valid concern.  In Lean circles we talk about the 7 wastes as originally described by Taiichi Ohno, but there is also an 8th waste that we shouldn’t ignore: the waste of unused human talent.  Maybe that becomes our 10th component for 100% trust in data – to not waste or misuse good data.

I'm involved in a data lineage initiative.  I'm interested in whether people feel that retaining versioning is important for data lineage.

Here's an example.

Let's say my data lineage shows me that my report named "Monthly Revenue" comes from Data Warehouse A.  Data Warehouse A crunches data that comes from Old Legacy System Z.

But that was only true until today.  Today, Old Legacy System Z was replaced by New System, and my lineage (in English) becomes "New System feeds Data Warehouse A.  Data Warehouse A adds 5 lines of business together and gives me the Monthly Revenue Report".

My question:  Next year, should I be able to go back in time and see where this change in my lineage took place?

Renee, the short answer is YES, but it also depends on the use case.  For the KPI scenario you described, it would indeed be important to explain differences in a Monthly Revenue Report if the data sources or reference hierarchies changed.  For regulatory or compliance reporting it would be even more critical.  However, if you use metadata for other scenarios, such as root cause analysis of a production outage or impact analysis of a change to a current system, historical metadata is not relevant.
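
For what it's worth, here is a minimal sketch of what versioned lineage could look like: each mapping carries effective dates, so last year's report still traces back to Old Legacy System Z while this year's traces to New System.  The dates below are invented.

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class LineageVersion:
        report: str
        source_system: str
        valid_from: date
        valid_to: Optional[date]  # None means still current

    history = [
        LineageVersion("Monthly Revenue", "Old Legacy System Z", date(2010, 1, 1), date(2013, 6, 30)),
        LineageVersion("Monthly Revenue", "New System", date(2013, 7, 1), None),
    ]

    def source_as_of(report, as_of):
        """Which source fed the report on a given date?"""
        for v in history:
            if v.report == report and v.valid_from <= as_of and (v.valid_to is None or as_of <= v.valid_to):
                return v.source_system
        return None

    print(source_as_of("Monthly Revenue", date(2012, 3, 1)))  # Old Legacy System Z
    print(source_as_of("Monthly Revenue", date(2014, 3, 1)))  # New System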

My understanding is that IMM does not support metadata versioning.  That would be useful in the case of regulatory reporting, as you mention.  Do you know if this is something that IMM would support in future?

Thanks.

IMM reflects the production environment - that is its strength.  Some users who have had a need for historical data lineage have implemented a simple procedure to do periodic extracts (typically quarterly) of production metadata and save them in a separate database. This also enables more sophisticated analytics and reports showing trends and changes over time.  Whether or not this functionality is built into a future release is something I can't talk about in a public blog forum. But what I can say is that at any point in time, there will ALWAYS be some functionality that some users want that isn't available "out of the box". This has been the case ever since I started working with metadata almost 20 years ago, and it will continue to be the case going forward. Which is exactly why IMM is designed to be flexible and extensible, and why successful metadata implementations typically have a Metadata Management Office that includes metadata development skills on the team in addition to metadata administration and general management.
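
To illustrate the periodic-extract approach generically (this is not IMM functionality, just a sketch with an invented table layout), a quarterly job could copy the current lineage into a history table stamped with the extract date:

    import sqlite3
    from datetime import date

    # Separate database holding the historical snapshots.
    conn = sqlite3.connect("metadata_history.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS lineage_snapshot (
        extract_date TEXT, report TEXT, source_system TEXT)""")

    def snapshot(current_lineage):
        """Save today's production lineage so trends and changes can be analyzed later."""
        today = date.today().isoformat()
        conn.executemany(
            "INSERT INTO lineage_snapshot VALUES (?, ?, ?)",
            [(today, report, source) for report, source in current_lineage],
        )
        conn.commit()

    snapshot([("Monthly Revenue", "New System")])
    print(conn.execute("SELECT * FROM lineage_snapshot").fetchall())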
