Welcome to the latest installment of Mind the Gap, a monthly column exploring practical approaches to improving data understanding and data utilization (and whatever else seems interesting enough to share). Last month, we explored the rise of the data product. This month, we'll look at data quality vs. data fitness.
Everyone likes a pithy definition. Marketers describe them as "sticky," or easy to remember. Of course, that doesn't always mean they're useful or entirely accurate. Data management has a couple. Metadata is almost universally described as "data about data," but I'd be willing to bet that you rolled your eyes just now. How many times have we seen metadata introduced that way, with the presenter or author immediately apologizing and then moving on to a more useful description?
Similarly, the data quality bumper sticker reads “fit for purpose.” You can probably already guess that I’m not a fan. Let’s pull out our DMBoK and see what it says:
The term data quality refers both to the characteristics associated with high quality data and the processes used to measure or improve the quality of data. [DMBoK-2, 644]
Characteristics and processes. Sounds good so far. Continuing:
Data is of high quality to the degree that it meets the expectations and needs of data consumers. That is, if the data is fit for the purposes to which they want to apply it. It is of low quality if it is not fit for those purposes. Data quality is thus dependent on context and on the needs of the data consumer. [DMBoK-2, 644; emphasis added]
This definition has deep roots in the field of quality management, incorporating concepts articulated by some of its giants, including Joseph Juran (“fitness for use”), Philip Crosby (“conformance to requirements”), and W. Edwards Deming (“meeting or exceeding the customer’s expectation”). Their common thread is the focus on consumption, driven by precise and complete specifications. Quality improves as requirements and processes improve. Discipline in this approach is common with engineered and manufactured items like missiles, automobiles, and consumer goods.
It is less common with data.
Far be it from me to challenge the accumulated knowledge of our field, but I very strongly disagree with defining data quality as "fit for purpose."
Imagine you’re looking to purchase a used car to get you back and forth to work. You don’t have a lot of money, but you have to drive only a couple miles each way. You find a car that’s extremely inexpensive, but the engine overheats after running for about a half hour. You buy it despite the engine problem because it satisfies your requirements: really low price and five-mile commute. It is fit for your purpose, and therefore, by the DMBoK definition, it is “high quality.”
One day, you want to visit family a couple hundred miles away. You set out in your “high-quality” car and haven’t even completed 10% of the trip when you have to stop and let the engine cool. At this rate, the journey will take days. You curse this piece of junk. The car is now “low quality” because it does not satisfy the new purpose to which you wanted to apply it.
The car was evaluated as both high quality and low quality, even though nothing about the car changed.
It was your perception of the car’s quality relative to a new purpose that changed.
When talking about data quality, we must therefore be clear about whose purpose, what requirements, established when, and by whom.
Within the context of the DMBoK definition, the answer is that every consumer evaluates the quality of a data set independently. Data is considered to be of high quality when it is fit for my purpose, satisfies my requirements, established by me when I need the data.
Data quality, defined in this way, is truly in the eye of the beholder.
Furthermore, data quality analyses cannot be leveraged by new consumers. For decades, we in decision support have been selling the benefits of leveraging data across applications and analyses. It has been the fundamental justification for data warehouses, data lakes, data lakehouses, etc. But misalignment between the purpose for which data was created and the purpose for which it is being used may not be immediately apparent. Especially when the data is not well understood. The consequences are faulty models and erroneous analyses. We reflexively blame the quality of the data, but that’s not where the problem lies.
This is not data quality.
It is data fitness.
The DMBoK doesn’t recognize data fitness as a specific knowledge area but mentions it as part of data profiling:
Assessing the fitness of the data for a particular use requires documenting business rules and measuring how well the data meets those business rules. [DMBoK-2, 418; emphasis added]
But this sounds an awful lot like “data is of high quality to the degree that it meets the expectations and needs of data consumers.” It seems like quality and fitness are being conflated.
And confused.
I’m confused.
As a friend recently commented, “We need quality for the definition of quality.”
Let’s go back to the data headwaters: the customer for whom the data was created in the first place. The needs and utilization context for that customer were:
- Expressed in their requirements, epics, features, and/or user stories
- Captured in the data definitions, expected content, and other quality dimensions
- Implemented in the application
The needs of additional downstream consumers known a priori may also have been considered, but most of these uses and users emerge after the application is deployed.
This original set of requirements is the only standard against which data quality should be measured. This allows us to definitively answer the questions of whose purpose, what requirements, established when, and by whom.
Data quality is the degree to which data conforms to the requirements for which it was created (definition, expected content, etc.).
We know how to do that. The DMBoK lists several data quality dimensions, each with objective measures. The standard is now clear.
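To make that concrete, here is a minimal sketch in Python of what "measuring conformance to the original requirements" can look like, scoring a couple of common quality dimensions (completeness, validity). The file name, column names, and rules are hypothetical stand-ins for whatever was captured when the source application was built, not anything prescribed by the DMBoK.

```python
# Minimal sketch: measure data quality as conformance to the ORIGINAL
# requirements (definition, expected content) the data was created against.
# Column names, rules, and the file are illustrative assumptions.

import csv
from datetime import datetime


def _is_iso_date(value: str) -> bool:
    """Validity check: value must be a YYYY-MM-DD date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Rules captured from the original requirements, one per column.
ORIGINAL_RULES = {
    "customer_id": lambda v: v.strip() != "",               # completeness
    "signup_date": _is_iso_date,                            # validity
    "country_code": lambda v: len(v) == 2 and v.isalpha(),  # validity
}


def measure_quality(rows: list[dict]) -> dict[str, float]:
    """Return, per column, the fraction of rows that conform to the
    original rule for that column -- a simple, objective quality score."""
    scores = {}
    for column, rule in ORIGINAL_RULES.items():
        passed = sum(1 for row in rows if rule(row.get(column, "")))
        scores[column] = passed / len(rows) if rows else 0.0
    return scores


if __name__ == "__main__":
    with open("customers.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    for column, score in measure_quality(rows).items():
        print(f"{column}: {score:.1%} conformant to original requirements")
```

The point of the sketch is only that the standard is fixed: the rules come from the requirements the data was built to satisfy, so the score means the same thing to every consumer who reads it.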
The definition of data fitness also becomes clear.
Data fitness is the degree to which data conforms to the requirements for which it is being considered for use.
Data fitness, not data quality, is evaluated by each new potential consumer. The question being asked is, “Does this data satisfy my needs?” not, “Is this data of high quality?”
We know how to do that too.
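Here is the same measurement pattern applied to data fitness, again as a sketch with hypothetical rules: a new consumer (say, an analytics team) evaluates the same rows against their own requirements rather than the originals.

```python
# Minimal sketch: data FITNESS reuses the measurement pattern above, but the
# rules come from the new consumer's requirements, not the originals.
# The rule set below is a hypothetical example for an analytics use case.

ANALYTICS_RULES = {
    # The analytics team needs phone numbers, which the source application
    # never promised to collect -- a fitness gap, not a quality defect.
    "phone_number": lambda v: v.strip() != "",
    # They also need reasonably recent signups (ISO dates compare as strings).
    "signup_date": lambda v: v >= "2020-01-01",
}


def measure_fitness(rows: list[dict], rules: dict) -> dict[str, float]:
    """Fraction of rows meeting EACH of this consumer's rules."""
    return {
        column: (sum(1 for row in rows if rule(row.get(column, ""))) / len(rows)
                 if rows else 0.0)
        for column, rule in rules.items()
    }


# Usage, reusing `rows` loaded in the quality sketch above:
#   fitness = measure_fitness(rows, ANALYTICS_RULES)
# A low score here says "not fit for THIS purpose," not "low quality."
```

Same data, same mechanics, different standard: a low fitness score is a statement about the consumer's purpose, not a defect in the data.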
Finally, consumers can request upstream application changes to accommodate their specific requirements. Don’t frame these requests as quality improvements, though. This might at least partially explain why development teams are less than excited to hear from us when we approach them with “data quality” issues related to our expectations, not their requirements.
I hate to introduce (or reintroduce) vocabulary into a field that drops new terms like a hay baler, but I believe that it is worthwhile to more clearly differentiate between data fitness and data quality. Each has a different meaning and a different purpose. Each is a separate knowledge area.
Something to consider for DMBoK-3.