As AI technologies continue to advance, the demand for relevant and accurate data has intensified, pushing organizations to capture, integrate, and harness data from many different sources. However, beneath the surface, significant data integration challenges must be addressed to realize the potential of AI systems.
One of the primary hurdles in data integration for AI systems is the issue of data quality and consistency. AI models heavily rely on accurate and reliable data to produce meaningful insights and predictions. Yet, integrating data from various origins often leads to data disparities, inconsistencies in data formats, and errors. Cleaning and processing this consolidated data is a critical task, demanding significant time and effort from data engineers and data scientists. Failure to address these data quality concerns can lead to biased AI models or misleading results, jeopardizing the integrity of the entire AI system.
Another complex challenge lies in data privacy and security. With the integration of diverse datasets, the risk of exposing sensitive information and violating privacy regulations escalates. AI systems must adhere to strict data protection protocols to ensure that personally identifiable information (PII) and other confidential data remain secure. Data anonymization and encryption techniques can offer some solutions, but striking a balance between data utility and privacy preservation remains an intricate task.
An often unseen challenge is that combining data from multiple sources can give the combined dataset aspects of personally identifiable information, or, in the case of confidential and proprietary data, levels of confidentiality and classification that the original datasets on their own don’t have. These accidental “upclassing”, “PII-additive”, or “deanonymization” problems cause significant issues, especially in environments where data needs to be held securely or confidentially, or is required by regulation to be kept private.
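A toy sketch can make this concrete. In the hypothetical example below, neither dataset is sensitive on its own, but joining them on a shared key links quasi-identifiers to a diagnosis. All names and values are invented for illustration.

```python
# Hypothetical illustration: two datasets that are individually innocuous
# can become identifying (or more sensitive) once joined on a shared key.

hr_records = [  # directory data, no medical info
    {"employee_id": 101, "zip": "20301", "birth_year": 1984},
    {"employee_id": 102, "zip": "20301", "birth_year": 1991},
]
clinic_visits = [  # "anonymized": no names, just an internal id
    {"employee_id": 101, "diagnosis": "hypertension"},
]

def join_on(key, left, right):
    """Inner-join two lists of dicts on a shared key."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

combined = join_on("employee_id", hr_records, clinic_visits)
# The joined record ties zip + birth year (quasi-identifiers) to a
# diagnosis -- information neither dataset exposed on its own.
print(combined)
```

The same mechanism drives the “upclassing” problem: the join, not either input, is what crosses the sensitivity threshold.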
On a recent GovFuture podcast, Stuart Wagner, Chief Digital Transformation Officer at the US Department of the Air Force, shared some of the unique and unexpected challenges that data integration poses when the data is used for advanced applications such as analytics and AI.
The Unintended Side Effects of Data Integration: “Up Classing”
Stuart explains, “Data that comes from a wide range of systems, especially telemetry and internet of things (IoT) data, needs to connect and communicate with a wide range of systems and requires the ability to understand the state of a system. What I realized was the need to be able to combine data. In my second week at the Department of Defense, I requested to join two datasets together for a use case that I was increasingly learning about in the role I was tasked to do. I went and asked the head of the technology team, ‘How do I join these two datasets?’ He said, ‘You can’t do that.’ And I said, ‘Why not?’ He said, ‘You can talk to security about it, but basically, we’re afraid of what you would learn from joining those two datasets together.’ So I went and talked to the Security Officer and learned more about it. And what I began to realize was that, number one, we are afraid to learn from our data because of the risk of it ‘up classing’. Basically, by aggregating or compiling data together, it is possible to learn new things, and those new things could be more classified.
This is something that never occurred to me before joining the Department of Defense. This is an unobvious problem. And so I said, ‘How is this determined?’ And the Security Officer said to me, ‘Well, you know when you see it.’ And I realized at that moment that I was on to a pretty serious problem. The problem was an arbitrary determination of whether or not you can combine data together.”
Stuart continues, “I quickly realized that, to get to the artificial intelligence capabilities being described, and against the backdrop of our significant missions and objectives, we’re never going to be able to combine critical weapons system data if we’re not able to rapidly determine the classification of data.”
The “Battering Ram”
To address this challenge of the unintended consequences of data integration, Stuart and his team developed something called the “Battering Ram”, which they demonstrated at a GovFuture Forum DC event in June 2023. The core idea of the Battering Ram is to attempt to join data together to see how that changes its classification before actually joining that data together.
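The “check before you join” idea can be sketched in a few lines: evaluate what classification a combined dataset *would* carry before any actual join happens. This is a minimal sketch, not the actual Battering Ram implementation; the field names, levels, and aggregation rules are all invented for illustration.

```python
# Hypothetical sketch of a pre-join classification check: given the fields
# each dataset contains, determine what classification the join WOULD
# acquire under a set of aggregation rules, without touching the data.
# Rule contents and level names are invented for illustration.

LEVELS = ["UNCLASSIFIED", "SECRET", "TOP SECRET"]

# Invented rules: combining these field sets raises the classification.
AGGREGATION_RULES = [
    ({"aircraft_tail_number", "sortie_schedule"}, "SECRET"),
    ({"sortie_schedule", "munitions_load"}, "TOP SECRET"),
]

def classification_of_join(fields_a, fields_b, base_level="UNCLASSIFIED"):
    """Return the classification a dataset joining fields_a and fields_b would carry."""
    combined = set(fields_a) | set(fields_b)
    level = base_level
    for trigger_fields, resulting_level in AGGREGATION_RULES:
        if trigger_fields <= combined and LEVELS.index(resulting_level) > LEVELS.index(level):
            level = resulting_level
    return level

# Dry-run the join before performing it:
print(classification_of_join({"aircraft_tail_number"}, {"sortie_schedule"}))  # -> SECRET
```

The hard part in practice is not this lookup but producing a consistent, machine-readable rule table from thousands of prose classification guides, which is what the next section describes.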
Stuart explains, “That is actually what Battering Ram is focused on. I’m still working on problems I discovered in week two at the Department of Defense. A battering ram was designed to break down the walls of a fortress by applying significant pressure to a weak area of a castle, producing a small hole that enables those seeking to enter the castle to obtain its resources. What we realized is that, in starting to work on this problem, we were building a battering ram of our own. The castles represent the silos of data that exist across the Department of Defense, and the inability to access that data because we’re not able to rapidly and easily determine the classification of data. The way this works is that the policy governing security classification, which is often unobvious, is located in what they call a security classification guide. There are thousands of these across the Department of Defense at the unclassified and secret levels.
Each of these is hundreds of pages long, written in a vacuum, disconnected from other security classification guides and from other programs in the Air Force. They’re supposed to describe what happens to the data when you combine it together. This is the critical problem: they don’t. It’s impossible to do this manually. It would actually be an n-squared problem: you would have to compare every piece of information that can exist in the DoD with every other one. And it actually gets worse (technically it would be factorial), but just for two pieces of data, it’s n squared. So the challenge is: how do you make sense of all this policy to produce deterministic classification? The way we’ve addressed this is to view the policy itself as data: we ingest it, produce a knowledge graph from it, and then allow people to automatically query it and discover contradictions in the policy. Once we can produce a non-contradictory policy, our intention is to provide a pathway to deterministically automate classification policy decisions. Maybe we won’t turn this on today or tomorrow, but it provides that pathway.”
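The contradiction-discovery step Stuart describes can be sketched as a pairwise scan over extracted policy rules, which also makes the n-squared comparison cost visible. This is a simplified stand-in for the knowledge-graph query he mentions; the rule tuples and guide names are invented for illustration.

```python
# Hypothetical sketch: treat classification policy as data, then look for
# pairs of rules that assign different levels to the same field combination.
# Rule contents and guide names are invented for illustration.

from itertools import combinations

# (source guide, fields the rule covers, classification it assigns)
rules = [
    ("guide_A", frozenset({"launch_site", "launch_window"}), "SECRET"),
    ("guide_B", frozenset({"launch_site", "launch_window"}), "UNCLASSIFIED"),
    ("guide_C", frozenset({"payload_mass"}), "UNCLASSIFIED"),
]

def find_contradictions(rules):
    """Return pairs of rules that cover the same fields but assign different levels."""
    return [
        (a, b)
        for a, b in combinations(rules, 2)  # the pairwise, n-squared comparison
        if a[1] == b[1] and a[2] != b[2]
    ]

for a, b in find_contradictions(rules):
    print(f"{a[0]} and {b[0]} disagree on {sorted(a[1])}: {a[2]} vs {b[2]}")
```

Only once such a scan comes back empty, producing a non-contradictory rule set, would automated, deterministic classification decisions be defensible.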
Clearly Stuart and his team are addressing core issues of data integration that, surprisingly, still aren’t being solved by even the most advanced data technology suppliers and vendors. Even more surprisingly, these are most likely issues faced by any organization whose data privacy, security, confidentiality, or regulatory compliance could be compromised by the simple act of combining data together. To learn more, listen to the GovFuture Podcast interview with Stuart Wagner on this topic.
Disclosure: Ronald Schmelzer is an Executive Director at GovFuture.