
Architecting modern data platforms: A deep dive on Azure

In this episode of the Azure on Air podcast, Erwin Stall, an experienced Azure architect from Xpirit, shares insights into modernizing a monolithic data platform on Azure. The conversation covers data storage and management, centralized data storage balanced with team flexibility, choosing the right data platform, Microsoft Fabric’s limitations and potential, and automating data pipelines and infrastructure for multiple teams.

Modernizing a Monolithic Data Platform on Azure

The speaker recounts a project initiated about 18 months ago, when his team was brought in to assess and revamp a monolithic data platform. The core challenge was changing a massive, tightly integrated platform whose size and coupling hindered collaboration, automation, and security.

Data Storage and Management for a New Platform

The assessment revealed several inherent issues with the legacy platform’s centralized, “all eggs in one basket” approach. As a result, Erwin’s team recommended starting fresh to build a new federated platform aligning with modern cloud best practices.

They decided to retain centralized raw data storage in an Azure Data Lake for governance and accessibility while providing individual teams with dedicated landing zones. This allowed for a more agile, modular approach, ensuring each team had autonomy while adhering to governance rules.
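As an illustrative sketch of this hub-and-spoke layout (the names and structure here are assumptions for illustration, not the actual platform's code), the model could be captured declaratively so that provisioning automation has one source of truth: a single central raw-data store, with each team's landing zone referencing it under governance rules.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LandingZone:
    """A team's isolated environment, connected to the central data lake."""
    team: str
    subscription: str            # each spoke is billed via its own subscription
    allowed_services: tuple      # governance: services the team may deploy

@dataclass
class DataPlatform:
    """Hub-and-spoke model: one central raw-data store, many team spokes."""
    raw_data_lake: str
    zones: list = field(default_factory=list)

    def add_zone(self, zone: LandingZone) -> None:
        # Every spoke is registered against the same central lake,
        # so raw data stays centrally managed and accessible.
        self.zones.append(zone)

platform = DataPlatform(raw_data_lake="central-adls-raw")
platform.add_zone(LandingZone("sales", "sub-sales-001", ("DataFactory",)))
platform.add_zone(LandingZone("ml", "sub-ml-001", ("Databricks", "Functions")))

print([z.team for z in platform.zones])  # -> ['sales', 'ml']
```

The point of the sketch is the separation: the hub owns the data, while each spoke record carries its own subscription and an approved-services list, mirroring the autonomy-with-guardrails approach described above.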

Centralized Data Storage and Team Flexibility in Azure

By keeping centralized data access but decentralizing downstream pipelines, the new platform aimed to balance governance with autonomy. Teams can choose services meeting their needs, like Databricks or Azure Functions, while the core data corpus remains managed centrally. Each spoke bears responsibility for its own resource consumption and billing.

This federated separation of concerns avoids the contention between groups that often plagues monolithic platforms. It also permits flexibility: advanced teams can use cutting-edge techniques while less specialized groups rely on simpler self-service tools.

Choosing the Right Data Platform

Rather than taking a one-size-fits-all approach, Erwin’s architecture allows tailoring environments to a team’s use case and strengths. Whether it’s Data Factory for simple ingestions, Databricks for complex transformations, or Azure Functions for code-centric approaches, the goal is to align the platform with the team’s unique requirements.

Microsoft Fabric’s Limitations and Potential

Microsoft’s recently launched Fabric service seems well-aligned with many of the platform design decisions Erwin’s team ultimately arrived at independently. Fabric offers similar centralized data storage with support for spinning up isolated analytics environments connected to this data pool.

However, as a relatively new service, Fabric still lags in areas like cross-environment data access, CI/CD pipelines for infrastructure as code, and support for diverse data ingestion strategies. So while it shows long-term promise, Fabric isn’t yet a turnkey replacement for custom architectures like the one Erwin and his colleagues created. Depending on teams’ needs and priorities, migration may become more appealing as the service matures.

Automating Data Pipelines and Infrastructure for Multiple Teams

A major focus of the new platform is automation and self-service capabilities aimed at easing data analysts’ and engineers’ lives while enforcing security and governance guardrails. Teams can leverage templatized GitHub workflows to instantly spin up preconfigured cloud resources upon request. They can also contribute and modify centralized data pipeline definitions matching their use-case requirements.
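A minimal sketch of the idea behind those templatized workflows (the template, service list, and function names below are hypothetical, not the platform's actual code): a team's request is merged into an organization-approved template, so every environment starts from the same guardrails rather than from scratch.

```python
from string import Template

# Organization-approved template: teams fill in parameters, not policy.
WORKFLOW_TEMPLATE = Template("""\
name: provision-$team
on: workflow_dispatch
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy $service for team $team
        run: echo "deploying $service into $subscription"
""")

def render_workflow(team: str, service: str, subscription: str) -> str:
    # Guardrail: only services on the approved list can be requested,
    # enforced automatically instead of via human approval delays.
    approved = {"DataFactory", "Databricks", "Functions"}
    if service not in approved:
        raise ValueError(f"{service} is not an approved service")
    return WORKFLOW_TEMPLATE.substitute(
        team=team, service=service, subscription=subscription
    )

print(render_workflow("sales", "DataFactory", "sub-sales-001"))
```

Because the policy check lives in the rendering step, a non-compliant request fails fast at creation time, which is the "guardrails without lengthy approvals" property the platform aims for.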

Built-in automation ensures new environments meet organizational standards and best practices without imposing lengthy human approval delays. Audit logging provides oversight while limiting friction to productivity.

Using Azure Platform for Data Pipelines and Automation

Microsoft Azure provides ideal foundations for building sophisticated yet responsive analytics architectures like the one described. Modularizing workflows while still centralizing policy definition allows organizations to scale teams horizontally while maintaining coherence and oversight. Automation further eases cloud resource experimentation, making the trial-and-error process of perfecting environments fast and low-risk.

Erwin also underscores the importance of financial transparency in the model. With each team having its own subscription, it becomes straightforward to track costs, providing clarity on resource utilization. While the model incurs a slightly higher cost, it is justified by the benefits of enhanced flexibility, security, and efficiency.
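The cost-transparency point can be illustrated with a toy aggregation (the subscription names, services, and figures below are invented): because each team maps to exactly one subscription, attributing spend per team reduces to a simple group-by over the billing records.

```python
from collections import defaultdict

# Hypothetical daily cost records, as might be exported from a billing feed.
records = [
    {"subscription": "sub-sales-001", "service": "DataFactory", "cost": 12.40},
    {"subscription": "sub-ml-001", "service": "Databricks", "cost": 98.10},
    {"subscription": "sub-sales-001", "service": "Storage", "cost": 3.60},
]

def cost_per_subscription(rows):
    """Sum costs by subscription; one subscription per team means
    each total is directly a team's spend."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["subscription"]] += row["cost"]
    return dict(totals)

print(cost_per_subscription(records))
```

With a monolithic platform this mapping does not exist, and spend must instead be untangled from shared resources, which is exactly the opacity the per-subscription model avoids.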

If you want to view costs across multiple subscriptions, check out Serverless360’s Cost Analyzer feature.

To conclude, Erwin Stall’s experience in modernizing data platforms on Azure offers valuable insights for organizations aiming to strike a balance between centralized control and team autonomy. The podcast provides a blueprint for architecting modern data platforms, leveraging Azure’s capabilities to create agile, secure, and efficient solutions.