Data Quality: The Invisible Villain of Machine Learning
What are the main tasks of a modern machine learning engineer?
This seems like an easy question with a simple answer:
Build machine learning models and analyze data.
In reality, this answer rarely reflects how they actually spend their time.
Efficient use of data is essential to a successful modern business. However, transforming data into tangible business results requires it to undergo a journey: it must be acquired, shared securely, and analyzed within its own development cycle.
The explosion of cloud computing in the mid-to-late 2000s and the adoption of machine learning by enterprises a decade later effectively addressed the beginning and end of this journey. Unfortunately, companies often encounter obstacles in the middle stages related to data quality, which is typically not on the radar of most executives.
Solution Advisor at Ataccama.
How Poor Data Quality Impacts Business
Poor-quality, unusable data is a burden for those at the end of the data journey: the data consumers who use it to build models and drive other profitable activities.
Too often, data scientists are the people hired to “build machine learning models and analyze data,” but bad data prevents them from doing so. Organizations put so much effort and attention into getting access to this data, but no one thinks to verify that the data going “into” the model is usable. If the input data is bad, the output models and analysis will be too.
It’s estimated that data scientists spend between 60 and 80 percent of their time cleaning data so that their project results are reliable. This cleaning process can involve guessing at the meaning of data, inferring gaps, and inadvertently removing potentially valuable data from models. The result is frustrating and inefficient, because this dirty data prevents data scientists from doing the valuable part of their job: solving business problems.
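To make the cleanup burden concrete, here is a minimal sketch of the kind of routine work this involves, using pandas on a hypothetical customer table (the column names and rules are illustrative, not from any real project): deduplicating records, dropping impossible values, and flagging, rather than guessing at, missing entries.

```python
import pandas as pd

# Hypothetical customer dataset with typical quality problems:
# a duplicate record, missing values, and an impossible age.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, -5, 52],
    "country": ["US", "DE", "DE", "US", None],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset="customer_id")          # remove duplicate records
    df = df[df["age"].between(0, 120) | df["age"].isna()]  # drop impossible ages, keep unknowns
    # Flag missing countries explicitly instead of guessing a value
    return df.assign(country=df["country"].fillna("UNKNOWN")).reset_index(drop=True)

cleaned = clean(raw)
```

Even this toy version embeds judgment calls (is a missing age acceptable? should a negative age be dropped or corrected?), which is exactly the context-dependent work that is hard to centralize.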
These huge, often invisible costs slow projects down and diminish their results.
The problem is compounded when data cleanup tasks are performed in repetitive silos. Just because one person noticed and cleaned up a problem in one project doesn’t mean it is solved for all their colleagues and their respective projects.
Even if a data engineering team can perform a large-scale cleanup, it can’t be done all at once. Furthermore, they may not fully understand the context of the task and why they are performing it.
The Impact of Data Quality on Machine Learning
Clean data is especially important for machine learning projects. Whether the work involves classification or regression, supervised or unsupervised learning, or deep neural networks, once an ML model goes into production its builders need to continually evaluate it against new data.
A critical part of the machine learning lifecycle is managing data drift to ensure the model remains effective and continues to deliver business value. After all, data is an ever-changing landscape. Source systems can merge after an acquisition, new governance can come into play, or the commercial landscape can change.
This means that previous assumptions about the data may no longer hold. While tools like Databricks/MLflow, AWS SageMaker or Azure ML Studio effectively cover model promotion, testing and retraining, they are less well equipped to investigate which parts of the data have changed and why, and then fix the problems, which can be tedious and time-consuming.
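In practice, investigating "which part of the data has changed" often starts with comparing the live distribution of a feature against its training-time baseline. One common heuristic (not specific to any of the tools named above) is the population stability index, where values above roughly 0.2 are conventionally treated as a drift alarm. A minimal sketch with NumPy, on synthetic data:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare two samples of one feature; PSI > 0.2 is a common drift alarm."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty buckets to avoid log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)  # distribution the model was trained on
shifted = rng.normal(0.5, 1.0, 10_000)   # live data after a mean shift
```

Running such a check per feature on every batch of incoming data is what turns "the model got worse" into "this specific input changed", which is the tedious investigation step the monitoring platforms largely leave to the team.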
Being data-driven prevents these issues from arising in machine learning projects, but it’s not just a matter for the technical teams building pipelines and models; the entire business needs to be aligned. In practice, this might mean routing data through a business workflow with a named approver, or having a front-office, non-technical stakeholder contribute knowledge early in the data journey.
The Obstacle in Building ML Models
Incorporating business users as consumers of their organization’s data is increasingly possible with AI. Natural language processing enables non-technical users to query data and extract contextual insights.
AI is expected to grow 37 percent between 2023 and 2030; 72 percent of executives see it as the most important business advantage, and AI-mature companies expect 20 percent of their EBIT to be generated by AI in the future.
Data quality is the backbone of AI. It improves the performance of algorithms and enables them to produce reliable predictions, recommendations, and classifications. Poor data quality is the reason cited by the 33 percent of companies that report failed AI projects. Conversely, organizations that focus on data quality can achieve greater AI effectiveness across the business.
But data quality isn’t just a box to check off. Organizations that make it an integral part of their operations can achieve tangible business results, generating more machine learning models per year and producing more reliable, predictable business outcomes through greater confidence in those models.
How to Overcome Data Quality Barriers
Data quality shouldn’t be a matter of waiting for a problem to happen in production and then rushing to fix it. Data should be tested constantly, wherever it is, against an ever-expanding pool of known issues. All stakeholders should contribute, and all data should have clear, well-defined data owners. So when a data scientist is asked what they do, they can finally say: build machine learning models and analyze data.
This article was produced as part of TechRadarPro’s Expert Insights channel, where we showcase the best and brightest minds in the technology sector today. The views expressed here are those of the author and do not necessarily represent those of TechRadarPro or Future plc.