Closing the door to the library of lost data
Most companies are undergoing a transformation into a ‘data company that does [insert their business] better than anyone else’. Modern businesses are not only data-, digital- and cloud-native; they are also finding ways to differentiate themselves through their data and monetize it as an additional revenue stream. Furthermore, the only way to keep pace with the rapid advances in AI and machine learning (ML) is to make strategic investments in stabilizing the underlying data infrastructure. But what happens when the immense amount of data stored today is not managed properly?
Imagine trying to find a specific book in a library without knowing its location, title, or even its author. Oh, and there’s no catalog or librarian to ask, so you wander around asking other visitors for help, hoping they’ll point you in the right direction or simply hand you a book. Similarly, unmanaged data buries itself in a dark corner of a “library,” but in most cases it no longer resembles the book it once was, and the author is unknown. This often happens through data silos, redundant or duplicate platform services, conflicting data stores and definitions, and more, all of which add unnecessary cost and complexity.
While the ideal scenario would be to ensure that all data assets are discoverable in the first place, there are ways to untangle the mess once it has accumulated, and it is a mess that every enterprise struggles with. Individual teams often have their own access to infrastructure services, and not all data events—including sharing, copying, exporting, and enriching—on those platforms are monitored at the enterprise level. As a result, the challenge persists and expands, and the library of data continues to grow without consistent governance or control.
The cost of lost data
The consequences of untraceable data can be far-reaching: it can undermine an organization’s operations and strategic goals, hinder decision-making, jeopardize operational efficiency, and increase vulnerability to compliance failures and data breaches. For decision-making in particular, the insights essential for informed choices become unreliable or simply inaccessible.
This lack of visibility and trust delays identifying and acting on trends, responding to customer needs, and reacting quickly to market changes, ultimately hampering long-term competitiveness and agility. When data is scattered across uncontrolled silos or duplicated across cloud services without centralized oversight, it’s like having books sitting in different corners of a library without a central cataloging system.
Additionally, the inability to locate and secure sensitive data increases the likelihood of unauthorized access or inadvertent exposure, further raising the risk of privacy breaches and intellectual property theft. Ask any engineer or analyst and they’ll likely point to the challenge of managing data that can be exported to spreadsheets. Solving that download problem is harder than knowing what data is on the platform in the first place, but with that baseline visibility you can at least see that a download has occurred and who can help with any post-hoc audit.
Righting the wrong
For organizations that need to correct course, one of the most scalable solutions is to ensure “compliance as code.” Simply put, this is ensuring that every data event—from delivering services to enriching data within them—is recorded, monitored, and traceable. Most importantly, these events are visible to any stakeholder responsible for data protection or oversight.
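As a rough illustration of what “recorded, monitored, and traceable” can mean in practice, the sketch below shows one way a data event might be captured as a structured record in Python. The field names and the record_event helper are hypothetical rather than a standard schema; a real implementation would publish the payload to a message bus or audit log.

```python
# Minimal sketch of a structured, auditable data event (hypothetical schema).
import json
import uuid
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class DataEvent:
    action: str    # e.g. "provision", "copy", "export", "enrich"
    resource: str  # e.g. "warehouse.sales.orders" or an object-store path
    actor: str     # who or what performed the action
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def record_event(event: DataEvent) -> str:
    """Serialize the event; in practice this payload would go to an audit topic or log."""
    return json.dumps(asdict(event))

print(record_event(DataEvent(action="export",
                             resource="warehouse.sales.orders",
                             actor="analyst@example.com")))
```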
By ensuring these events are sent to a common metadata catalog (for example, published via pub/sub to enrich an enterprise catalog), enterprises can more effectively monitor and control their data. Any non-compliant resource can then, at least in theory, be removed immediately, reducing the chance of data loss or untraceability. Anyone who spins up an object store, compute service, or similar resource therefore has a record for auditability, events available for lineage and traceability, and ideally a path back to the lineage of the data itself.
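Continuing that sketch under the same assumptions, the snippet below shows how a stream of such events could enrich a catalog and expose resources that were never registered. The in-memory dictionary stands in for a real metadata catalog, and the function names are illustrative rather than any particular product’s API.

```python
# Illustrative only: an in-memory stand-in for an enterprise metadata catalog.
from typing import Dict, List

catalog: Dict[str, dict] = {}  # resource name -> owner and lineage of events

def apply_event(event: dict) -> None:
    """Upsert the resource an event touched so it stays discoverable and traceable."""
    entry = catalog.setdefault(event["resource"], {"owner": event["actor"], "lineage": []})
    entry["lineage"].append({"action": event["action"],
                             "actor": event["actor"],
                             "at": event["occurred_at"]})

def non_compliant(resources_in_use: List[str]) -> List[str]:
    """Resources present on a platform but never seen in the event stream."""
    return [r for r in resources_in_use if r not in catalog]

apply_event({"action": "provision", "resource": "warehouse.sales.orders",
             "actor": "data-eng@example.com", "occurred_at": "2024-01-01T00:00:00Z"})
print(non_compliant(["warehouse.sales.orders", "s3://adhoc-exports/"]))
# -> ['s3://adhoc-exports/']  (a candidate for removal or remediation)
```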
Once data has gone astray, tools like BigID act as an advanced library catalog, providing a bottom-up view of the ecosystem and helping organizations understand what data lives where and which systems are using it. Tools that provide governance and compliance for data glossaries and workflow management, and that adopt open patterns like the Apache Iceberg table format, not only lower switching costs today and tomorrow but also make it easier to integrate the many functional catalogs and platforms across the enterprise. The goal here is to create value quickly while simplifying data management in the future.
Companies need to gain visibility into their data landscape, identify potential compliance issues, and take corrective action before data becomes unmanageable, let alone before they try to build a system that scales. This will generally be the responsibility of a central team, or at best shared with functional leaders once governance is fully democratized. To be clear, not all of these tools are required to get started. Instead, understanding your current state (or starting point) will determine which use cases to prioritize for modernization. You need to balance quick wins with bigger, fundamental changes that allow the transformation to progress in the medium term, maintaining momentum and continually building trust.
An effective parallel strategy is to build microservices or bots that perform continuous scanning, auditing, and compliance assurance. These microservices can perform a range of functions, from basic compliance checks to full anomaly detection that compares asset usage against normal service delivery and role patterns. By continuously monitoring data events and usage patterns, they can detect anomalies and potential compliance violations in real time, allowing for rapid corrective action. As mentioned above, all data sources and events should be automatically logged upon provisioning, so that any data that is not cataloged can be immediately removed by the bot as non-compliant.
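A minimal sketch of such a bot, assuming the platforms and the catalog expose inventory APIs, might look like the following. The list_provisioned_resources, cataloged_resources, and quarantine functions are placeholders for those real APIs, and the hourly loop is purely illustrative.

```python
# Hedged sketch of a continuous compliance bot; the three helpers below are
# placeholders for real platform-inventory, catalog, and remediation APIs.
import time
from typing import Set

def list_provisioned_resources() -> Set[str]:
    """Placeholder: query each platform for everything currently provisioned."""
    return {"warehouse.sales.orders", "s3://adhoc-exports/"}

def cataloged_resources() -> Set[str]:
    """Placeholder: query the enterprise metadata catalog for registered assets."""
    return {"warehouse.sales.orders"}

def quarantine(resource: str) -> None:
    """Placeholder: restrict access to (or remove) the resource and notify its owner."""
    print(f"Non-compliant resource flagged: {resource}")

def scan_once() -> None:
    """Anything provisioned but not cataloged is treated as non-compliant."""
    for resource in list_provisioned_resources() - cataloged_resources():
        quarantine(resource)

if __name__ == "__main__":
    while True:
        scan_once()
        time.sleep(3600)  # run hourly; a scheduler or event trigger would also work
```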
The next chapter
Like a well-organized library where every book is cataloged and easily accessible, a well-managed data environment enables businesses to thrive. Avoiding data chaos requires a proactive and strategic approach to data management that doesn’t also create more friction or processes for users. By implementing compliance as code, leveraging data visibility tools, and building microservices for continuous compliance, businesses can ensure their data assets remain discoverable, secure, and valuable. With these strategies in place, businesses can navigate the complexities of data management and drive continued growth and innovation.
Finally, it is vital to foster a culture of data stewardship within the organization. Educating employees on the importance of data management and establishing clear data handling protocols can significantly reduce the risk to the company. Regular training sessions and updates on best practices ensure that all team members are aligned with the company’s data governance goals.
This article was produced as part of TechRadarPro’s Expert Insights channel, where we showcase the best and brightest minds in the technology sector today. The views expressed here are those of the author and do not necessarily represent those of TechRadarPro or Future plc. If you’re interested in contributing, you can read more here.