When it comes to architecture strategies there's no one-size-fits-all solution. Every team will have different use cases that will drive requirements such as security, environment isolation, and tools needed.
While working with one of my clients we started to dive into what a future architecture in Microsoft Fabric would look like for them. As we discussed their requirements I realized that a few potential issues were waiting for us in the fog.
Before we dive in, I'd like to recognize the awesome work of my friend, the data goblin himself, Kurt Buhler. Kurt helped create the Power BI usage scenario diagrams that can be found here:
A ton of work went into the design of the usage diagrams and I love the stylistic approach the team took. As such, I tried to keep the general format the same while extending some of the patterns to broader architecture.
If you haven't already, I highly recommend checking out Kurt's blog:
Understanding Medallion Patterns
Everyone has likely heard of medallion architecture by now. For those who haven't, medallion patterns have been around for years under many names (raw, validated, enriched, semi-curated, curated, etc.). While the exact naming matters less, the idea behind using zones typically stems from the need to separate data by state of readiness and by other security or governance requirements. This is, of course, an oversimplification for conceptual purposes.
Below is an example medallion pattern in Fabric.
Read/copy data from the source system using Spark via notebook, pipeline, or shortcut.
Use managed private endpoints or the on-premises data gateway when working with on-premises or firewall-protected source systems.
Store data in bronze lakehouse as close to raw form as possible.
Create a shortcut from bronze to silver lakehouse to begin the data enrichment and cleansing process.
Create a shortcut from silver to gold to persist in a dimensional model for enterprise analytics.
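Each zone in the flow above is addressable through a OneLake path, so a notebook in one workspace can read a shortcut from another. A minimal sketch, assuming the documented OneLake ABFS path convention (the workspace, lakehouse, and table names are hypothetical):

```python
def onelake_table_path(workspace: str, lakehouse: str, table: str) -> str:
    """Build the OneLake ABFS path for a Delta table in a Fabric lakehouse."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/Tables/{table}"
    )

# Example: a silver notebook reading a shortcut to a bronze table.
bronze = onelake_table_path("sales-bronze", "lh_bronze", "orders")
silver = onelake_table_path("sales-silver", "lh_silver", "orders")
# Inside a Fabric Spark notebook this would look something like:
# df = spark.read.format("delta").load(bronze)
# df_clean.write.format("delta").mode("overwrite").save(silver)
```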
We could follow a very similar pattern implemented with Azure data services.
Read/copy data from the source system using Spark via notebook or pipeline.
Use managed private endpoints or a self-hosted integration runtime when working with on-premises or firewall-protected source systems.
Store data in an Azure Data Lake Storage Gen2 bronze container as close to raw form as possible.
Write from the bronze container to the silver Delta Lake container to begin the data enrichment and cleansing process.
Write from the silver Delta Lake container to the gold Delta Lake container to persist the dimensional model for enterprise analytics.
Combining Lifecycle Management and Medallion Patterns
In addition to medallion patterns, we must also be aware of application lifecycle management (ALM) practices (dev., test, prod.).
An example of a simple ALM strategy for Azure data services is to use prod./non-prod. subscriptions with dev./test/prod. resource groups.
Additional steps for ALM include:
Integrate infrastructure as code (IaC) with source control repo.
Build validation, continuous integration (CI), and continuous deployment (CD) pipelines for code deployment.
Deploy to test using the release pipeline.
Deploy to prod. using the release pipeline.
Fabric Application Lifecycle Management (ALM)
In Fabric, we have deployment pipelines to enable code promotion to higher environments. The overall flow is quite similar with a few differences in how things are released.
Additional steps for Fabric ALM:
Sync dev. workspace with git.
Build validation, continuous integration (CI), and continuous deployment (CD) pipelines for code deployment.
The release pipeline triggers the Fabric deployment pipeline to deploy to test.
Release pipeline triggers Fabric deployment pipeline to deploy to prod.
The wonderful thing about Fabric deployment pipelines is the ability to sync a workspace directly to your git repo and use built-in pipelines to move artifacts, giving us integrated low-code IaC and CI/CD.
https://learn.microsoft.com/en-us/fabric/cicd/deployment-pipelines/intro-to-deployment-pipelines
Governance and Security Considerations
You may have picked up on this through the diagrams already, but a core concept for the remainder of the article is that artifact management in Fabric is centralized around a workspace.
In the documentation for implementing medallion lakehouse architecture in Fabric, there's one section that sparked my curiosity.
https://learn.microsoft.com/en-us/fabric/onelake/onelake-medallion-lakehouse-architecture
Conceptually, the above diagram makes sense per the rule of zone isolation. That said, the diagram is illustrated to show separation at the lakehouse level with all lakehouses residing in the same workspace. We'll come back to this in a few minutes, but this is an important piece of the puzzle.
If you continue reading you'll find the following statement:
The statement above contradicts the diagram and suggests that each medallion zone should be broken into its own workspace. Said differently, a bronze lakehouse should be located in a bronze workspace, and so on. The reason behind the recommendation can be found in the lakehouse access control documentation.
https://learn.microsoft.com/en-us/fabric/data-engineering/workspace-roles-lakehouse
Circling back to the idea of all lakehouse zones living in the same workspace, we can see that from a security and governance perspective, this presents our first, rather large issue.
To perform any activity other than "read" on a lakehouse, one must have an Admin, Member, or Contributor role in the workspace in which the lakehouse resides. Granting Admin, Member, or Contributor on the workspace grants the applicable permissions to all artifacts in the workspace. In other words, users have visibility into all lakehouse zones and the data within them.
Let's consider a scenario in which we have data with personally identifiable information (PII) such as a social security number in the bronze layer. As the data moves through silver we apply masking rules to the PII data. We want to enable our advanced business analysts and developers to perform tasks other than "read" on the data but we do not want them to see the underlying raw data. We would not be able to facilitate this requirement if the lakehouse zones were contained in a single workspace.
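To make the silver-layer masking concrete, here is a minimal sketch of the kind of rule you might apply. In practice this logic would run inside a Spark notebook (for example as a column expression or UDF); the function below is just an illustrative pure-Python version:

```python
def mask_ssn(ssn: str) -> str:
    """Mask all but the last four digits of a social security number."""
    digits = [c for c in ssn if c.isdigit()]
    if len(digits) != 9:
        return "*" * len(ssn)  # malformed values are fully masked
    return f"***-**-{''.join(digits[-4:])}"

# Bronze retains the raw value; silver stores only the masked form.
print(mask_ssn("123-45-6789"))  # ***-**-6789
```

Users with access to the silver workspace can then work freely with the masked column, while the raw value stays behind the bronze workspace's role boundary.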
Workspace Sprawl
The next consideration is workspace sprawl. If we adhere to the recommendation of isolating zones by workspace, the number of workspaces needed multiplies. For example, rather than having a single workspace with three separate lakehouses, we will now have three workspaces with one lakehouse each.
I'm sure your mind is already drifting in this direction, but what about handling lifecycle management?
Let's consider a scenario to justify following ALM for each medallion zone. To surface data in the bronze zone you will likely be copying data from a source system using a pipeline or notebook. Once the data is available in bronze you will create a shortcut to silver.
Thinking about the lifecycle:
Net new request for data is received.
Begin development of the pipeline or notebook to copy the data (dev.).
Test the copy process to ensure it meets requirements (test).
The copy process is considered stable and is placed on a schedule to ensure current data is available to Silver (prod.).
As you can see, by splitting the medallion zones by workspace we've tripled the number of needed workspaces in our pattern. With the increase in workspace count, the question becomes how do we manage deployments?
Artifact Deployment Considerations
Fabric deployment pipelines have been refactored quite a bit from their Power BI days, with one of the most significant changes being an increase in the number of supported "stages" from three to ten.
Theoretically, if we followed the recommendations above, one deployment pipeline would support our nine workspaces. However, if we wanted to include additional separation of artifacts or add another zone to our medallion pattern we would exceed the allowed number of workspaces for our pipeline.
Another consideration with deployment pipelines is that they're linear, meaning an artifact must be deployed through all stages sequentially.
In addition to the workspace quantity limit, we're also limited in that a workspace can only belong to one pipeline.
Since a workspace can belong to only one deployment pipeline, the idea of chaining multiple pipelines together in the Fabric UI is a non-starter. Instead, you would need to integrate with DevOps release pipelines to programmatically trigger a deployment pipeline release.
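Triggering a deployment pipeline from a DevOps release typically means a REST call. A minimal sketch, assuming the Power BI deployment pipelines `DeployAll` endpoint; the pipeline ID and token are placeholders, and the request is composed but not sent:

```python
import json
from urllib import request

def build_deploy_request(pipeline_id: str, token: str,
                         source_stage_order: int = 0) -> request.Request:
    """Compose (but do not send) a DeployAll call for a deployment pipeline."""
    url = f"https://api.powerbi.com/v1.0/myorg/pipelines/{pipeline_id}/deployAll"
    body = {
        "sourceStageOrder": source_stage_order,  # 0 = dev -> test
        "options": {
            "allowCreateArtifact": True,
            "allowOverwriteArtifact": True,
        },
    }
    return request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_deploy_request("00000000-0000-0000-0000-000000000000", "<token>")
# request.urlopen(req)  # executed inside the release pipeline task
```

A DevOps release stage would run a script like this with a service principal token, giving you the programmatic trigger described above.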
Enterprise Architecture Strategies
As I stated in the opening, every team is going to have its own set of requirements that will drive source code management and deployment strategies. Below are a few examples of potential enterprise patterns.
Single Workspace Medallion Pattern:
If security and governance at the lakehouse level aren't a concern, perhaps it doesn't make sense to split your medallion layers by workspace. In such a scenario, a potential architecture could look something like the following:
Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.
Read/copy data from the source system using Spark via notebook, pipeline, or shortcut.
Use managed private endpoints or on-premise data gateway when working with on-premise or firewall-enabled source systems.
Store data in the lakehouse as close to raw form as possible.
- (4b, optional) Copy data to the warehouse using a pipeline.
Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).
- (5b, optional) Transform and prepare data using the warehouse SQL analytics endpoint; create a shortcut from the warehouse to the lakehouse.
Define joins, create measures and calculation groups, and implement additional granular security within semantic models as an extension of lakehouse/warehouse Delta tables. Semantic models may be created with Direct Lake, Import, or DirectQuery connections.
Content creators create reports and dashboards for consumption.
Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.
Feature branches can be created for isolated development workspaces enabling multi-developer workloads.
Content creators clone remote repos to their local development environment to capture the latest working code version.
Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.
Completing a pull request triggers the Test / Validate pipeline, which performs automated tests to validate content before publishing.
The build pipeline is then triggered to prepare content for deployment.
Deployment to test and production is facilitated by the release pipeline.
Release to higher environments is gated by a release manager's approval(s).
The release pipeline performs deployment from dev. to test by triggering the Fabric / Power BI deployment pipeline.
Testing and QA are performed before being released to prod.
Release pipeline performs deployment to prod. by triggering the Fabric / Power BI deployment pipeline.
A workspace app is created and serves as the primary entry point for end-user consumption.
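The metadata-driven control framework in step 1 can start as something as simple as a control table of source definitions that the orchestration notebook iterates over. A minimal sketch with hypothetical fields and table names:

```python
# Each row of the control table describes one dataset to land in bronze.
control_table = [
    {"source": "erp.orders",    "target": "bronze_orders",    "enabled": True},
    {"source": "erp.customers", "target": "bronze_customers", "enabled": True},
    {"source": "crm.leads",     "target": "bronze_leads",     "enabled": False},
]

def plan_loads(rows):
    """Return the (source, target) pairs the orchestrator should run."""
    return [(r["source"], r["target"]) for r in rows if r["enabled"]]

for src, tgt in plan_loads(control_table):
    # In Fabric, each entry would invoke a copy pipeline or notebook,
    # with start/end times and row counts logged back to the framework.
    print(f"load {src} -> {tgt}")
```

Adding a new source then becomes a metadata change rather than new pipeline code, which keeps orchestration, refresh, and logging consistent across zones.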
Isolated Workspace Medallion Pattern:
However, for most teams, I believe the security and governance isolation conversation isn't something that can easily be ignored, and therefore zone isolation will be required. The overall change in architecture is quite significant, as the entry points for workloads shift.
Bronze key considerations:
Data will be read from source systems and written to the bronze layer.
The readiness of the data doesn't yet enable enterprise report development.
Note: workspaces for each medallion layer are now managed by individual deployment pipelines.
Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.
Read/copy data from the source system using Spark via notebook, pipeline, or shortcut.
Use managed private endpoints or the on-premises data gateway when working with on-premises or firewall-protected source systems.
Store data in the lakehouse as close to raw form as possible.
- (4b, optional) Copy data to the warehouse using a pipeline.
Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).
- (5b, optional) Transform and prepare data using the warehouse SQL analytics endpoint; create a shortcut from the warehouse to the lakehouse.
Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.
Feature branches can be created for isolated development workspaces enabling multi-developer workloads.
Content creators clone remote repos to their local development environment to capture the latest working code version.
Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.
Completing a pull request triggers the Test / Validate pipeline, which performs automated tests to validate content before publishing.
The build pipeline is then triggered to prepare content for deployment.
Deployment to test and production is facilitated by the release pipeline.
Release to higher environments is gated by a release manager's approval(s).
The release pipeline performs deployment from dev. to test by triggering the Fabric deployment pipeline.
Testing and QA are performed before being released to prod.
Release pipeline performs deployment to prod. by triggering the Fabric deployment pipeline.
Silver key considerations:
Data will be read from the bronze layer and written to the silver layer.
The readiness of the data doesn't yet enable enterprise report development.
Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.
Read data from bronze using a shortcut.
- (2b, optional) Read data from bronze using a pipeline.
Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).
- (3b, optional) Transform and prepare data using the warehouse SQL analytics endpoint; create a shortcut from the warehouse to the lakehouse.
Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.
Feature branches can be created for isolated development workspaces enabling multi-developer workloads.
Content creators clone remote repos to their local development environment to capture the latest working version.
Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.
Completing a pull request triggers the Test / Validate pipeline, which performs automated tests to validate content before publishing.
The build pipeline is then triggered to prepare content for deployment.
Deployment to test and production is facilitated by the release pipeline.
Release to higher environments is gated by a release manager's approval(s).
The release pipeline performs deployment from dev. to test by triggering the Fabric deployment pipeline.
Testing and QA are performed before being released to prod.
Release pipeline performs deployment to prod. by triggering the Fabric deployment pipeline.
Gold key considerations:
Data will be read from the silver layer and written to the gold layer.
The readiness of the data now enables enterprise report development.
End-user testing of reports will be needed.
Workspace applications will be used for report consumption.
Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.
Read data from Silver using a shortcut.
- (2b, optional) Read data from silver using a pipeline.
Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).
- (3b, optional) Transform and prepare data using the warehouse SQL analytics endpoint; create a shortcut from the warehouse to the lakehouse.
Define joins, create measures and calculation groups, and implement additional granular security within semantic models as an extension of lakehouse/warehouse Delta tables. Semantic models may be created with Direct Lake, Import, or DirectQuery connections.
Content creators create reports and dashboards for consumption.
Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.
Feature branches can be created for isolated development workspaces enabling multi-developer workloads.
Content creators clone remote repos to their local development environment to capture the latest working version.
Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.
Completing a pull request triggers the Test / Validate pipeline, which performs automated tests to validate content before publishing.
The build pipeline is then triggered to prepare content for deployment.
Deployment to test and production is facilitated by the release pipeline.
Release to higher environments is gated by a release manager's approval(s).
The release pipeline performs deployment from dev. to test by triggering the Fabric / Power BI deployment pipeline.
Testing and QA are performed before being released to prod.
Release pipeline performs deployment to prod. by triggering the Fabric / Power BI deployment pipeline.
A workspace app is created and serves as the primary entry point for end-user consumption.
By isolating the medallion zones into separate deployment pipelines, you gain more control over each environment, but you also introduce overhead by needing to build and manage additional DevOps artifacts.
Final Thoughts
Your decision of which architecture to deploy will be based on several things, one of which is the balance between security/governance and the overhead to maintain your system. Like most things when working with data, there's no one-size-fits-all solution.
If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.
You can find the diagrams in SVG format on GitHub: