Large companies are finding a way to identify AI data they can trust

Data is the fuel of artificial intelligence. It’s also a sticking point for large companies, as they are reluctant to fully embrace the technology without knowing more about the data used to build AI programs.

Now a consortium of companies has developed standards for describing the origins, history and legal rights to data. The standards are essentially a labeling system that identifies where, when and how data was collected and generated, as well as its intended uses and limitations.

The data provenance standards, announced Thursday, were developed by the Data & Trust Alliance, a non-profit consortium of about twenty mostly large companies and organizations, including American Express, Humana, IBM, Pfizer, UPS and Walmart, as well as some start-ups.

Alliance members believe the data labeling system will work much like food safety standards, which require basic information such as where food comes from, who produced and grew it, and who handled it on its way to the supermarket shelf.

More clarity and more information about the data used in AI models, executives say, will boost companies’ confidence in the technology. How widely the proposed standards will be used is uncertain, and much will depend on how easily the standards can be applied and automated. But standards have accelerated the use of every major technology, from electricity to the Internet.

“This is a step toward managing data as an asset, which is what everyone in the industry is trying to do today,” said Ken Finnerty, president of information technology and data analytics at UPS. “To do that, you need to know where the data was created, under what circumstances, its intended purpose and where it is legal to use it or not.”

Research shows that there is a need for greater trust in data and improved efficiency in data processing. In a poll of business leaders, a majority cited “concerns about the lineage or provenance of data” as a major barrier to AI adoption. And a survey of data scientists found that they spent almost 40 percent of their time on data preparation tasks.

The data initiative is mainly intended for business data that companies use to create their own AI programs or data that they can selectively insert into AI systems from companies such as Google, OpenAI, Microsoft and Anthropic. The more accurate and reliable the data, the more reliable the AI-generated answers.

Companies have been using AI for years in applications ranging from tailoring product recommendations to predicting when jet engines need maintenance.

But the rise in the past year of so-called generative AI powering chatbots like OpenAI’s ChatGPT has heightened concerns about the use and misuse of data. These systems can generate text and computer code with human-like fluency, yet they often make things up – “hallucinate,” as researchers put it – depending on the data they access and collect.

Companies typically do not allow their employees to freely use the consumer versions of the chatbots. But they are using their own data in pilot projects that use the generative capabilities of the AI systems to help write business reports, presentations and computer code. And that business data can come from many sources, including customers, suppliers, weather and location data.

“The secret sauce isn’t the model,” says Rob Thomas, IBM’s senior vice president of software. “It’s the data.”

In the new system, there are eight basic standards, including origin, source, legal rights, data type and generation method. Then there are more detailed descriptions for most standards – for example noting that the data comes from social media or industrial sensors.
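To make the labeling idea concrete, here is a minimal sketch of what a provenance label covering categories like those described above might look like. The field names and values are hypothetical illustrations, not the consortium's actual schema.

```python
# Hypothetical data provenance label. Field names follow the categories
# mentioned in the article (origin, source, legal rights, data type,
# generation method); the consortium's real schema may differ.
provenance_label = {
    "origin": "acme-logistics",                       # organization where the data originated
    "source": "industrial sensors",                   # e.g. social media, industrial sensors
    "legal_rights": "licensed for internal analytics only",
    "data_type": "time series",
    "generation_method": "automated sensor capture",
    "collection_date": "2023-11-01",
    "intended_use": "predictive maintenance models",
    "restrictions": "no resale; no consumer profiling",
}

def missing_fields(label, required=("origin", "source", "legal_rights",
                                    "data_type", "generation_method")):
    """Return the required provenance fields that are absent or empty."""
    return [field for field in required if not label.get(field)]

print(missing_fields(provenance_label))   # a complete label returns []
print(missing_fields({"origin": "x"}))    # lists the fields still missing
```

A check like `missing_fields` is the kind of automated validation that would let a company reject or flag incoming datasets whose labels are incomplete before the data reaches a model-training pipeline.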

The data documentation can be performed in a variety of commonly used technical formats. Companies in the data consortium have been testing the standards to improve and refine them, and the plan is to make them available to the public early next year.

Individual companies and industries already label data by type, date and source. But the consortium says these are the first detailed standards designed to be used across all sectors.

“My whole life I’ve been drowning in data, trying to figure out what I can use and what is accurate,” says Thi Montalvo, data scientist and vice president of reporting and analytics at Transcarent.

Transcarent, a member of the data consortium, is a startup that relies on data analytics and machine learning models to personalize healthcare and accelerate payment to healthcare providers.

The benefit of the data standards, according to Ms. Montalvo, comes from greater transparency for everyone in the data supply chain. That workflow often starts with negotiating contracts with insurers for access to claims data and continues with the startup’s data scientists, statisticians and health economists building predictive models to guide patient treatment.

At each stage, knowing more about the data earlier should increase efficiency and eliminate repetitive work, potentially reducing time spent on data projects by 15 to 20 percent, Ms. Montalvo estimates.

The data consortium says the AI market today needs the clarity that the group’s data labeling standards can provide. “This could help solve some of the AI problems everyone is talking about,” said Chris Hazard, co-founder and chief technology officer of Howso, a start-up that creates data analysis tools and AI software.
