Azure Data Factory (ADF) is a cloud-based data integration service that helps you create, schedule, and manage your data pipelines. It's a crucial component of Azure's data analytics platform.
Azure Data Factory is built on a scalable, managed, and secure architecture that supports a wide range of data sources and destinations. This architecture is designed to handle large volumes of data and complex data transformations.
The core components of ADF include Data Flows, Pipelines, and Datasets. Data Flows define data transformations, Pipelines group and orchestrate activities on a schedule, and Datasets describe the structure and location of the data that activities read and write.
Azure Data Factory's architecture is designed to be highly scalable and flexible, letting integration runtime capacity scale up or out as data volumes and workloads change.
Get Started
To begin building an Azure Data Factory (ADF) architecture, you'll need to create a new ADF instance in the Azure portal.
Start by navigating to the Azure portal and searching for "Azure Data Factory" in the search bar.
This will take you to the ADF page where you can create a new instance.
There are no fixed pricing tiers to pick from: Data Factory billing is consumption-based, charged per pipeline activity run plus the compute used for data movement and data flows, so small-scale data integration projects typically cost very little.
Next, create the data factory itself by giving it a globally unique name and choosing a resource group and region.
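If you prefer to script this step instead of clicking through the portal, the same factory can be created with the Azure management SDK for Python. The sketch below is a minimal example, assuming the azure-identity and azure-mgmt-datafactory packages are installed and you have permission to create resources; the subscription ID, resource group, factory name, and region are placeholder values.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"   # placeholder
resource_group = "rg-data-platform"     # placeholder
factory_name = "adf-demo-factory"       # placeholder; factory names are globally unique

# DefaultAzureCredential picks up Azure CLI, environment, or managed identity credentials.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory in the chosen region.
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(factory.provisioning_state)
```

The portal flow and this SDK call produce the same resource, so either route works for the rest of the walkthrough.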
Data Storage and Architecture
Azure Data Lake Storage Gen2 is a scalable and secure cloud storage service built for big data analytics, providing features such as fine-grained access control, tiered storage, and high availability.
It combines the capabilities of Azure Blob Storage with a hierarchical file system and multi-protocol access, making it an ideal choice for storing large volumes of structured and unstructured data.
In the example pipeline, data is organized into separate directories within the storage account for easier organization and management, and each DataFrame is written out as a single CSV file (a sketch of this directory layout appears after the considerations list below).
Azure Data Lake Storage Gen2 integrates seamlessly with other Azure services, enabling organizations to perform complex analytics, reporting, and machine learning tasks on their data.
Here are some key considerations for data storage and architecture:
- Handle errors and exceptions during the write operation.
- Consider partitioning the data into multiple files for better performance.
- Ensure proper access controls and permissions are set up for the Azure Data Lake Storage Gen2 account.
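To make the directory organization described above concrete, here is a small sketch using the azure-storage-file-datalake package. It assumes the caller has data-plane access to the account (for example, the Storage Blob Data Contributor role); the account, container, and directory names are illustrative placeholders.

```python
from azure.core.exceptions import ResourceExistsError
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://<storage-account>.dfs.core.windows.net"  # placeholder account
service = DataLakeServiceClient(account_url, credential=DefaultAzureCredential())

# One container (file system) per zone of the lake.
try:
    service.create_file_system("curated")
except ResourceExistsError:
    pass  # container already exists

fs = service.get_file_system_client("curated")

# One directory per dataset, so each DataFrame's CSV output has its own location.
for directory in ("sales/2024", "products/2024", "logs/router"):
    fs.create_directory(directory)
```

Keeping each dataset in its own directory also makes it easier to apply the fine-grained access controls mentioned above at the directory level.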
Azure Synapse Analytics is a cloud-based analytics service that brings together big data and data warehousing capabilities, allowing users to analyze large volumes of data with high speed and concurrency.
Factory Architecture
Azure Data Factory is a cloud-based data integration service that enables users to create, schedule, and orchestrate data workflows at scale. It supports a variety of data sources and destinations, allowing for seamless data movement across on-premises and cloud environments.
Azure Data Factory is used to create and schedule data-driven workflows, and it's particularly useful for ingesting data from multiple data stores into a central location for further processing.
Azure Data Factory is a managed service that automates the ELT pipeline, incrementally moving the latest OLTP data from the SQL Server database to Azure Synapse Analytics. The pipeline contains a set of activities that ingest and clean log data and then kick off a mapping data flow to analyze it.
The pipeline is made up of three types of activities: data transformation, data movement, and control activities. Data transformation activities process and transform data, while data movement activities move data from one place to another. Control activities manage the flow of data through the pipeline.
Here's a breakdown of the different types of activities in an Azure Data Factory pipeline:
- Data movement activities: copy data between supported source and sink data stores (for example, the copy activity).
- Data transformation activities: process and transform data using services such as mapping data flows, Azure Databricks, or HDInsight.
- Control activities: manage the flow of the pipeline, such as looping, branching, and executing other pipelines.
The pipeline can be scheduled and deployed independently, making it a convenient and efficient way to manage data workflows.
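As a rough illustration of how such a pipeline is defined programmatically, the sketch below creates a pipeline with a single copy (data movement) activity using the azure-mgmt-datafactory SDK. The datasets it references ("SqlSourceDS", "BlobSinkDS"), along with their linked services, are assumed to exist already, and exact model signatures can vary slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink, CopyActivity, DatasetReference, PipelineResource, SqlServerSource
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Data movement activity: read from the SQL Server dataset, write to Blob storage.
copy_activity = CopyActivity(
    name="CopySqlToBlob",
    inputs=[DatasetReference(reference_name="SqlSourceDS")],    # assumed dataset
    outputs=[DatasetReference(reference_name="BlobSinkDS")],    # assumed dataset
    source=SqlServerSource(),
    sink=BlobSink(),
)

pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(
    "rg-data-platform", "adf-demo-factory", "IngestLogsPipeline", pipeline
)
```

Transformation and control activities are added to the same activities list, which is how the pipeline manages them as a single deployable unit.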
Storing
Storing data is a crucial part of the data pipeline, and it's essential to handle it correctly to ensure data integrity and availability.
To write transformed data to Azure Data Lake Storage Gen2, write each DataFrame to its own CSV file in the target directory, treating the first row as headers (see the sketch after the best-practices list below).
It's good practice to handle errors and exceptions during the write operation, as noted in the best practices below. This helps prevent data loss and keeps the pipeline running smoothly.
Consider partitioning large datasets into multiple files for better performance, especially if you're dealing with massive amounts of data. This can significantly improve write times and reduce the load on your storage account.
Proper access controls and permissions are also essential to restrict unauthorized access to your Azure Data Lake Storage Gen2 account. This will help prevent data breaches and ensure only authorized users can access your data.
Here are some best practices to keep in mind when storing data:
- Handle errors and exceptions during the write operation.
- Partition large datasets into multiple files for better performance.
- Set up proper access controls and permissions for the storage account.
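Here is the kind of PySpark write this section describes, with the best practices above folded in. It's a minimal sketch that assumes a Spark session (for example, in Azure Databricks) already authenticated to the storage account; the account, container, and path names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_path = "abfss://raw@<storage-account>.dfs.core.windows.net/logs/"
target_path = "abfss://curated@<storage-account>.dfs.core.windows.net/logs/transformed"

df = spark.read.option("header", True).csv(source_path)

try:
    (
        df.repartition(8)            # partition large datasets into multiple files;
                                     # use .coalesce(1) instead for a single CSV file
          .write.mode("overwrite")
          .option("header", True)    # keep column names as the first row
          .csv(target_path)
    )
except Exception as err:             # handle errors during the write operation
    print(f"Write to {target_path} failed: {err}")
    raise
```

The repartition count is a tuning knob: more output files parallelize the write and later reads, at the cost of many small files if the dataset is small.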
Network Design
Network Design plays a crucial role in securing your data and architecture. You should use a next-generation firewall like Azure Firewall to secure network connectivity between your on-premises infrastructure and your Azure virtual network.
Deploying a self-hosted integration runtime (SHIR) on a virtual machine (VM) in your on-premises environment or in Azure is a great way to securely connect to on-premises data sources and perform data integration tasks in Data Factory. Consider deploying the VM in Azure as part of the shared support resource landing zone to simplify governance and security.
Machine learning-assisted data labeling doesn't support default storage accounts because they're secured behind a virtual network. Instead, create a dedicated storage account for machine learning-assisted data labeling, apply the labeling, and then secure that account behind the virtual network.
Private endpoints provide a private IP address from your virtual network to an Azure service, making the service accessible only from your virtual network or connected networks. This ensures a more secure and private connection. Private endpoints use Azure Private Link, which secures the connection to the platform as a service (PaaS) solution.
If your workload uses any resources that don't support private endpoints, you might be able to use service endpoints. However, we recommend using private endpoints for mission-critical workloads whenever possible.
Here are some key considerations for Network Design:
- Use a next-generation firewall like Azure Firewall to secure network connectivity.
- Deploy a self-hosted integration runtime (SHIR) on a virtual machine (VM) in your on-premises environment or in Azure.
- Create a storage account for machine learning-assisted data labeling and secure it behind the virtual network.
- Use private endpoints to secure connections to Azure services.
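To illustrate the private endpoint recommendation, the sketch below creates a private endpoint for the data factory with the azure-mgmt-network SDK. The subnet and factory resource IDs are placeholders, and the model names and sub-resource (group ID) value should be checked against your SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import (
    PrivateEndpoint, PrivateLinkServiceConnection, Subnet
)

network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Placeholder resource IDs for the data factory and the subnet that hosts private endpoints.
factory_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-data-platform"
    "/providers/Microsoft.DataFactory/factories/adf-demo-factory"
)
subnet_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-network"
    "/providers/Microsoft.Network/virtualNetworks/vnet-hub/subnets/snet-private-endpoints"
)

private_endpoint = PrivateEndpoint(
    location="eastus",
    subnet=Subnet(id=subnet_id),
    private_link_service_connections=[
        PrivateLinkServiceConnection(
            name="adf-pe-connection",
            private_link_service_id=factory_id,
            group_ids=["dataFactory"],   # the data factory sub-resource
        )
    ],
)

poller = network_client.private_endpoints.begin_create_or_update(
    "rg-network", "pe-adf-demo-factory", private_endpoint
)
print(poller.result().provisioning_state)
```

You would typically pair this with a private DNS zone so the factory's hostname resolves to the private IP from inside the virtual network.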
Azure Data Factory Pipeline
An Azure Data Factory pipeline is a logical group of activities used to coordinate a task. It's a managed service that automates data movement and data transformation.
The pipeline contains a set of activities that perform tasks such as ingesting and cleaning log data, and then kick off a mapping data flow to analyze it. This is done using the copy activity to copy data from the SQL server to Azure Blob storage, and a data flow activity or Databricks Notebook activity to process and transform the data.
A pipeline can be scheduled and deployed independently, which makes incremental loads efficient: each scheduled run only needs to process the changes since the previous run. It's a key component of the Azure Data Factory architecture.
Here are some key activities in an Azure Data Factory pipeline:
- Copy activity: copies data from the SQL server to Azure Blob storage
- Data flow activity: processes and transforms data from blob storage to Azure Synapse
- Databricks Notebook activity: processes and transforms data from blob storage to Azure Synapse
The pipeline manages activities in a set, allowing for more efficient data processing and transformation.
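To show how the scheduling mentioned above might look in code, here is a hedged sketch that attaches a daily schedule trigger to a pipeline with the azure-mgmt-datafactory SDK. The pipeline, factory, and resource group names reuse the placeholder values from the earlier sketches.

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Run once per day, starting shortly after the trigger is created.
recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime.now(timezone.utc) + timedelta(minutes=15),
)

trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="IngestLogsPipeline")
        )
    ],
)

adf_client.triggers.create_or_update(
    "rg-data-platform", "adf-demo-factory", "DailyTrigger", TriggerResource(properties=trigger)
)
```

A newly created trigger stays stopped until it's explicitly started, so the pipeline keeps running only on demand until then.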
Azure Cloud and Infrastructure
Azure Data Factory is a cloud orchestration engine that collects data from a variety of sources. It's a game-changer for companies looking to migrate their databases to the cloud.
By using Azure Data Factory, you can achieve tremendous cost savings, flexibility, scalability, and performance gains. This is especially true when migrating SQL server databases to the cloud.
The data factory pipeline plays a crucial role in this process: it calls a stored procedure to execute an SSIS job that's hosted on-premises, so existing SSIS packages keep running while the data moves to the cloud.
Blob storage is used to store files and data from the data factory. This is a great way to keep your data organized and easily accessible.
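As a rough sketch of the "call a stored procedure to run an on-premises SSIS job" step, the fragment below defines a stored procedure activity with the azure-mgmt-datafactory models. The linked service and procedure names are hypothetical; the linked service is assumed to reach the on-premises SQL Server through the self-hosted integration runtime.

```python
from azure.mgmt.datafactory.models import (
    LinkedServiceReference, SqlServerStoredProcedureActivity
)

# Stored procedure activity that starts the on-premises SSIS job.
# "OnPremSqlServerLS" is a hypothetical linked service to the on-premises SQL Server,
# and dbo.usp_StartSsisLoadJob is a hypothetical wrapper around the job-start call.
run_ssis_job = SqlServerStoredProcedureActivity(
    name="RunOnPremSsisJob",
    linked_service_name=LinkedServiceReference(reference_name="OnPremSqlServerLS"),
    stored_procedure_name="dbo.usp_StartSsisLoadJob",
)

# The activity can then be added to a PipelineResource alongside the copy and
# transformation activities shown earlier.
```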
Here are some examples of how Azure Data Factory can be used:
- Load network router logs into a database for analysis.
- Prepare employment data for analytical reporting.
- Load sales and product data into a data warehouse for sales forecasting.
- Automate loading of the data warehouse and operational data stores for accounting and finance.
- Automate the overall pipeline process.
Azure Data Factory is a powerful tool that can help you streamline your data migration process. With its hybrid ETL approach, you can easily integrate your on-premises and cloud-based systems.
Reliability and Operational Excellence
Reliability
Reliability is a top priority because it determines whether your application can meet the commitments you make to your customers.
This architecture meets the uplifted reliability requirements with the default Azure SLAs across the solution, so no additional high-availability or multi-regional uplift is needed.
The disaster recovery strategy has been uplifted to cover the full scope of platform services and the updated target metrics, but it must be tested regularly to make sure it remains fit for purpose.
Zone-redundancy features in the solution components protect against localized service problems; review the resiliency options that each service in the architecture supports.
Before selecting a region, confirm that it supports all required resources and redundancy requirements, because not all Azure services are available in every region.
Operational Excellence
Operational excellence is a crucial aspect of ensuring that your application runs smoothly in production. It involves evolving the operating model to account for the new domain model, stakeholders, governance structures, persona-based training, and RACI.
To achieve operational excellence, it's essential to extend the tagging strategy to account for the domain model. This means creating a more comprehensive and accurate system for tracking and organizing your application's components.
Developing a central register of nonfunctional requirements is also a key step. The register should capture software development best practices as a standard that any platform solution team can reference.
By integrating a robust testing framework into the continuous integration and continuous deployment practice, you can ensure that your application meets the required standards and is delivered with minimal errors.