
Data Has No Moat! | Towards Data Science


From the early days of AI and data-driven projects, the importance of data and its quality has been recognized as critical to a project's success. Some might even say that these projects used to have a single point of failure: data!

The infamous "garbage in, garbage out" was probably the first expression to take the data industry by storm (closely followed by "data is the new oil"). We all knew that if data wasn't well structured, cleaned, and validated, the results of any analysis and any downstream application were doomed to be inaccurate and dangerously incorrect.

For that reason, over the years, numerous studies and researchers have focused on defining the pillars of data quality and the metrics that can be used to assess it.

A 1991 research paper identified 20 different data quality dimensions, all closely aligned with the dominant data usage of the time: structured databases. Fast forward to 2020, and the research on the Dimensions of Data Quality (DDQ) identified an astonishing number of dimensions (around 65!), reflecting not just how the definition of data quality must constantly evolve, but also how data usage itself has changed.

Dimensions of Data Quality: Toward Quality Data by Design (Wang, 1991)

Nonetheless, with the rise of the Deep Learning hype, the idea that data quality no longer mattered lingered in the minds of even the most tech-savvy engineers. The desire to believe that models and engineering alone were enough to deliver powerful solutions has been around for quite some time. Happily for us enthusiastic data practitioners, 2021/2022 marked the rise of Data-Centric AI! The concept isn't far from the classic "garbage in, garbage out": it reinforces the idea that in AI development, if we treat data as the element of the equation that needs tweaking, we'll achieve better performance and results than by tuning the models alone (oops! after all, it's not all about hyperparameter tuning).

So why are we hearing, once again, rumors that data has no moat?

Large Language Models' (LLMs) capacity to mirror human reasoning has stunned us. Trained on immense corpora and backed by the computational power of GPUs, LLMs are able to generate not only good content, but content that resembles our tone and way of thinking. Because they do it so remarkably well, often with minimal context, this has led many to a bold conclusion:

“Data has no moat.”
“We no longer need proprietary data to differentiate.”
“Just use a better model.”

Does data quality stand a chance against LLMs and AI Agents?

In my opinion, absolutely yes! In fact, regardless of the current belief that data offers no differentiation in the age of LLMs and AI Agents, data remains essential. I'll even go further: the more capable agents become, and the more responsibility they take on, the more critical their dependency on good data becomes!

So, why does data quality still matter?

Starting with the most obvious: garbage in, garbage out. It does not matter how much smarter your models and agents get if they can't tell the difference between good and bad inputs. If bad data or low-quality inputs are fed into a model, you will get wrong answers and misleading results. LLMs are generative models, which means that, ultimately, they simply reproduce patterns they have encountered. What is more concerning than ever is that the validation mechanisms we once relied on are no longer in place in many use cases, which makes misleading results harder to catch.

Furthermore, these models have no real-world awareness, just like the generative models that dominated before them. If something is outdated or even biased, they simply won't recognize it unless they are trained to do so, and that starts with high-quality, validated, and carefully curated data.

When it comes to AI agents in particular, which often rely on tools like memory or document retrieval to work across tasks, the importance of great data is even more obvious. If their knowledge is based on unreliable information, they won't be able to make good decisions. You'll get an answer or an outcome, but that doesn't mean it's a useful one!

Why is data still a moat?

While barriers like computational infrastructure, storage capacity, and specialized expertise are all relevant to staying competitive in a future dominated by AI Agents and LLM-based applications, data accessibility remains one of the most frequently cited requirements for competitiveness. Here's why:

  1. Access is Power
    In domains with restricted or proprietary data, such as healthcare, law, enterprise workflows, or user interaction data, AI agents can only be built by those with privileged access to that data. Without it, the resulting applications will be flying blind.
  2. The public web won't be enough
    Free and abundant public data is fading, not because it is no longer available, but because its quality is degrading quickly. High-quality public datasets have been heavily mined and increasingly diluted with algorithm-generated content, and some of what is left sits behind paywalls or API restrictions.
    Moreover, major platforms are increasingly closing off access in favor of monetization.
  3. Data poisoning is the new attack vector
    As the adoption of foundation models grows, attacks are shifting from model code to the training and fine-tuning data itself. Why? It is easier to do and harder to detect!
    We are entering an era where adversaries don't have to break the system; they just need to pollute the data. From subtle misinformation to malicious labeling, data poisoning attacks are a reality that organizations looking to adopt AI Agents will need to be prepared for. Controlling data origin, pipelines, and integrity is now essential to building trustworthy AI.
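The kind of origin and integrity control described in point 3 can start very simply: record a content hash for every dataset artifact when it enters the pipeline, and refuse to train on anything that no longer matches. A minimal sketch in Python (the function names and manifest format are illustrative assumptions, not any particular tool's API):

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large dataset files don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest: dict[str, str], data_dir: Path) -> list[str]:
    """Compare each file against its recorded hash.

    Returns the names of files whose contents changed since the manifest
    was built, i.e. candidates for tampering or silent corruption.
    """
    tampered = []
    for name, expected in manifest.items():
        if file_sha256(data_dir / name) != expected:
            tampered.append(name)
    return tampered
```

Hashing alone obviously won't catch poisoned data that was malicious from the start, but it does pin down *when* and *where* an artifact changed, which is the prerequisite for any audit trail.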

What are the data strategies for trustworthy AI?

To keep ahead of innovation, we must rethink how we treat data. Data is no longer just an element of the process, but rather core infrastructure for AI. Building and deploying AI is about code and algorithms, but also about the data lifecycle: how data is collected, filtered, cleaned, protected, and, most importantly, used. So, what strategies can we adopt to make better use of data?

  1. Data Management as core infrastructure
    Treat data with the same relevance and priority as you would cloud infrastructure or security. This means centralizing governance, implementing access controls, and ensuring data flows are traceable and auditable. AI-ready organizations design systems where data is an intentional, managed input, not an afterthought.
  2. Active Data Quality Mechanisms
    The quality of your data defines how reliable and performant your agents are! Establish pipelines that automatically detect anomalies or divergent records, enforce labeling standards, and monitor for drift or contamination. Data engineering is foundational to AI and will only grow in importance. Data needs not only to be collected but, more importantly, curated!
  3. Synthetic Data to Fill Gaps and Preserve Privacy
    When real data is limited, biased, or privacy-sensitive, synthetic data offers a powerful alternative. From simulation to generative modeling, synthetic data allows you to create high-quality datasets to train models. It’s key to unlocking scenarios where ground truth is expensive or restricted.
  4. Defensive Design Against Data Poisoning
    Security in AI now starts at the data layer. Implement measures such as source verification, versioning, and real-time validation to guard against poisoning and subtle manipulation, not only for the data sources but also for any prompts that enter the system. This is especially important in systems learning from user input or external data feeds.
  5. Data feedback loops
    Data should not be treated as immutable in your AI systems; it should evolve and adapt over time! Feedback loops are essential to let the data keep pace with the system it feeds. When paired with strong quality filters, these loops make your AI-based solutions smarter and more aligned over time.
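As a minimal sketch of the "active data quality mechanisms" idea in strategy 2 (the z-score heuristic, thresholds, and function names here are illustrative assumptions, not a prescribed method), an automated check can flag out-of-range records and crude distribution drift before a batch ever reaches an agent:

```python
from statistics import mean, stdev

def flag_outliers(values: list[float], z_threshold: float = 3.0) -> list[int]:
    """Return indices of values more than z_threshold std devs from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]

def mean_shift_drift(reference: list[float],
                     incoming: list[float],
                     tolerance: float = 0.25) -> bool:
    """Crude drift check: has the incoming batch's mean moved by more than
    `tolerance` reference standard deviations? True means drift is suspected
    and the batch should be quarantined for review, not silently ingested."""
    sigma = stdev(reference)
    if sigma == 0:
        return mean(incoming) != mean(reference)
    return abs(mean(incoming) - mean(reference)) > tolerance * sigma
```

In production you would reach for a dedicated validation framework rather than hand-rolled z-scores, but the shape is the same: every batch passes through explicit, automated checks, and failures block ingestion instead of quietly degrading the agent.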

In summary, data is the moat and the foundation of an AI solution's defensibility. Data-Centric AI is more important than ever, even if the hype says otherwise. So, should AI be all about the hype? Only the systems that actually reach production can see beyond it.

