The Data Mesh as AI-Enabler

For any AI-driven company, it’s critically important to have a foundation for extracting value from historical analytical data at scale. Over the last decade, data warehousing and big data technologies have evolved at a tremendous pace. This has allowed many organizations to successfully adopt new paradigms, such as data lakes, to store ever-increasing amounts of data and to train analytical models.

However, enterprise analytics ecosystems are very often centered on a single team that is responsible for everything data-related, from providing the data infrastructure and integrating data from various sources to data governance. As enterprise data grows in volume and complexity, this setup does not scale and often becomes a critical bottleneck, hindering the rapid development of AI skills.

What is the Data Mesh?

A data mesh is a distributed architectural paradigm that addresses the shortcomings of the monolithic patterns typically applied for analytics and data provisioning within an organization. It enables every team to produce and consume high-quality, read-optimized versions of operational data, continuously and independently. It is built around four core principles that aim to (a) drastically lower the cost and specialization needed to produce data products, and (b) ensure a healthy ecosystem and seamless adoption of analytics in every part of an organization.

Domain Ownership

Following the domain-driven design paradigm, data ownership is assigned to the business domains. This means the responsibility is distributed to the people who are closest to the systems producing the operational data, and who therefore understand it best. Each domain is responsible for making both the operational data it produces and the analytical data products it builds available to the rest of the organization. In that way, the production and consumption of data can scale with the number of domains and service delivery teams.

Analytical Data as a Product

Providing analytical data as a product within an organization means that the data needs to be:

discoverable (users can find it)
addressable (users know how to interact with it)
self-describing (users understand what it is about)
secure and trustworthy (users trust the data quality)
interoperable (users can combine various data products)

The domains’ data product owners are responsible for ensuring these properties. This principle ensures that data is no longer treated as a second-class citizen but is regarded as a high-value asset that users can understand and use.
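To make the five properties listed above more concrete, here is a minimal Python sketch of how a data product could describe itself to a catalog. Everything in it is an assumption made for illustration: the DataProduct and Catalog classes, the field names, the example products, and the lake:// address scheme are hypothetical and are not taken from any particular data mesh implementation.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    """Hypothetical descriptor for one analytical data product."""
    name: str                    # discoverable: the key used for catalog lookup
    domain: str                  # owning business domain
    owner: str                   # accountable data product owner
    address: str                 # addressable: where consumers read the data
    schema: dict                 # self-describing: column name -> type
    description: str             # self-describing: what the data is about
    quality_checks: tuple = ()   # trustworthy: checks the owner commits to running
    join_keys: tuple = ()        # interoperable: keys shared with other products


class Catalog:
    """In-memory stand-in for a data catalog (illustrative only)."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        # Refuse products that cannot describe themselves.
        if not product.description or not product.schema:
            raise ValueError(f"{product.name}: description and schema are required")
        self._products[product.name] = product

    def search(self, term):
        # Case-insensitive match on name or description.
        term = term.lower()
        return [p for p in self._products.values()
                if term in p.name.lower() or term in p.description.lower()]

    def joinable(self, a, b):
        # Join keys that two registered products have in common.
        return sorted(set(self._products[a].join_keys)
                      & set(self._products[b].join_keys))


catalog = Catalog()
catalog.register(DataProduct(
    name="orders_daily", domain="sales", owner="sales-data-owner",
    address="lake://sales/orders_daily",  # made-up address scheme
    schema={"customer_id": "string", "order_total": "double"},
    description="Daily order totals per customer",
    quality_checks=("no_null_customer_id",),
    join_keys=("customer_id",),
))
catalog.register(DataProduct(
    name="support_tickets", domain="support", owner="support-data-owner",
    address="lake://support/tickets",
    schema={"customer_id": "string", "ticket_id": "string"},
    description="Support tickets opened by customers",
    join_keys=("customer_id", "ticket_id"),
))

found = [p.name for p in catalog.search("order")]
shared = catalog.joinable("orders_daily", "support_tickets")
print(found, shared)
```

In this sketch, discoverability is the search call, addressability is the address field, and interoperability shows up as join keys that two domains agree to share. A real catalog would of course persist this metadata and enforce access control.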

Self-serve Platform

For people in the various business domains to be able to autonomously build and maintain their data products, it is essential to have access to a high-level abstraction of infrastructure that removes the complexity and friction of managing the lifecycle of data products. This means giving them access to operational data, as well as the tooling and capabilities they need to work autonomously.

For example, running a Spark job on historical data might be a difficult task for inexperienced teams. A self-serve platform aims to enable anyone in the organization to implement their own analytics use cases and train their own models (or use pre-trained ones) with minimal specialized knowledge.

Federated Computational Governance

To enable domains to derive value by correlating independent data products, you must have a governance model that embraces decentralization and interoperability through global standardization. A federation — consisting of data product owners, business representatives, security, compliance officers, and others — is responsible for defining standards that all domains and data products must follow. These standards are computationally baked into the platform, and automatically applied.
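As an illustration of what "computationally baked into the platform" could look like, the following Python sketch expresses global standards as small policy functions that run automatically whenever a domain publishes a data product. The specific policies (owner, PII classification, retention), the metadata layout, and the publish function are assumptions chosen for the example, not standards prescribed by the data mesh paradigm.

```python
from typing import Callable, Optional

# A policy inspects a product's metadata and returns a violation
# message, or None when the product complies.
Policy = Callable[[dict], Optional[str]]


def require_owner(meta):
    return None if meta.get("owner") else "missing owner"


def require_pii_classification(meta):
    # Every column must carry an explicit "pii" flag (True or False).
    unclassified = [c["name"] for c in meta.get("columns", []) if "pii" not in c]
    if unclassified:
        return f"columns without PII classification: {unclassified}"
    return None


def require_retention(meta):
    days = meta.get("retention_days")
    if isinstance(days, int) and days > 0:
        return None
    return "retention_days must be a positive integer"


# Standards agreed on by the federation, applied to every domain alike.
GLOBAL_POLICIES = [require_owner, require_pii_classification, require_retention]


def evaluate(meta, policies=GLOBAL_POLICIES):
    """Return the list of violations for one data product."""
    return [msg for policy in policies if (msg := policy(meta)) is not None]


def publish(meta):
    """Hypothetical platform hook: the check runs on every publish."""
    violations = evaluate(meta)
    if violations:
        raise ValueError(f"{meta.get('name', '<unnamed>')} rejected: "
                         + "; ".join(violations))
    return f"{meta['name']} published"


compliant = {
    "name": "orders_daily",
    "owner": "sales-data-owner",
    "retention_days": 365,
    "columns": [{"name": "customer_id", "pii": True},
                {"name": "order_total", "pii": False}],
}
noncompliant = {"name": "raw_clicks", "columns": [{"name": "email"}]}

print(publish(compliant))
print(evaluate(noncompliant))
```

The design point is that the federation owns the list of policies while the platform owns their enforcement, so individual domains cannot opt out and do not have to re-implement the checks themselves.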

A Key Foundation for AI-driven Companies

As we steadily march into a new era where “every company will be an AI-driven company,” continuously collecting and historizing every operational data source is more important than ever. High-quality, trustworthy data is a necessity whether you are training in-house expert ML models or fine-tuning pre-trained foundation models, such as Large Language Models (LLMs).

A data mesh architecture provides the solid foundation upon which AI models can be trained and operationalized quickly and reliably. Adopting this framework, together with a mature MLOps paradigm, is a key choice for unlocking the potential of AI and a differentiating factor for organizations that aim for rapid AI skill development.

Watch for our follow-up blog post, where we’ll describe how we built the ION Data Mesh, our Azure-native data mesh, and discuss design choices and implementation details, as well as practical challenges we faced.

Article By

Iris Safaka
Principal Data Scientist

Iris Safaka is Principal Data Scientist for Ontinue and has more than 10 years of experience working in cybersecurity, machine learning and analytics. She has published scientific papers at top security venues, actively serves as a member of Technical Program Committees, participates in expert panels on AI, data and security, and has been invited to give talks at USENIX, IEEE and ACM. Iris earned a PhD in computer and communication sciences from the Swiss Federal Institute of Technology (EPFL).
