Developing an Azure Data Factory pipeline with DevOps is a game-changer for data integration and management. This approach ensures that your pipeline is reliable, scalable, and continuously improved.
By integrating Azure Data Factory with DevOps practices, you can automate testing, deployment, and monitoring of your pipeline, reducing the risk of errors and downtime. This is achieved through tools like Azure DevOps, which provides a comprehensive platform for managing and automating your pipeline's lifecycle.
With Azure Data Factory, you can create and manage complex data pipelines using a user-friendly interface, without requiring extensive coding knowledge. This makes it an ideal choice for data professionals and developers alike.
Pipeline Creation
To create a pipeline, you'll need to start by creating a data factory. This involves opening Microsoft Edge or Google Chrome, selecting Create a resource > Integration > Data Factory, and filling out the Create Data Factory page with the required information, such as Azure Subscription, Resource Group, Region, and Name.
You can create a pipeline by following three main steps: create the linked service, create input and output datasets, and create the pipeline itself. These steps will guide you through building a pipeline with a copy activity.
To get started with pipeline creation, select Orchestrate on the home page, then specify CopyPipeline for Name in the General panel under Properties. Next, drag the Copy Data activity from the Activities toolbox to the pipeline designer surface, and specify CopyFromBlobToSql for Name.
Sample Copy Pipeline
When creating a copy pipeline, you'll want to focus on the activities section. In this section, there is only one activity whose type is set to Copy.
For the copy activity, you'll need to specify the input and output datasets. These datasets are defined in JSON and referenced in the activities section. Specifically, the input for the activity is set to InputDataset and the output is set to OutputDataset.
In the typeProperties section, you'll need to specify the source and sink types. For the sample pipeline, the source type is set to BlobSource and the sink type is set to SqlSink.
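To make this concrete, here is a minimal sketch of what the pipeline JSON could look like for this copy scenario, using the names from this tutorial and omitting optional settings such as policies and column mappings:

```json
{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "InputDataset", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "OutputDataset", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            }
        ]
    }
}
```

The inputs and outputs arrays reference the datasets by name, while the typeProperties block is where the BlobSource and SqlSink types are declared.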
Here are the key points to note when creating a copy pipeline:
- There is only one activity of type Copy in the activities section.
- The input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
- The source type is set to BlobSource and the sink type is set to SqlSink.
Create a Data Factory
To create a data factory, you need to open Microsoft Edge or Google Chrome, as the Data Factory UI is only supported in these two web browsers.
You'll then select Create a resource > Integration > Data Factory from the left menu. On the Create Data Factory page, choose the Azure Subscription where you want to create the data factory.
For the Resource Group, you can either select an existing one from the drop-down list or create a new one by selecting Create new and entering the name of the new resource group.
The data factory's location will also need to be specified, and only locations that are supported will be displayed in the drop-down list.
The name of the Azure data factory must be globally unique, so if you receive an error message, enter a different name for the data factory.
You'll also need to select the Version as V2 and configure Git later by selecting the Configure Git later check box.
Once you've completed these steps, select Review + create, and then Create after the validation is passed.
After the creation is finished, you'll see a notice in the Notifications center, and you can select Go to resource to navigate to the Data factory page.
To create a pipeline, you'll need to follow a few steps: create the linked service, create input and output datasets, and create a pipeline.
However, in this tutorial, you'll start with creating the pipeline, then create linked services and datasets when you need them to configure the pipeline.
To create a pipeline, you'll need to specify the Name as CopyPipeline in the General panel under Properties.
You'll then drag the Copy Data activity from the Activities toolbox to the pipeline designer surface and specify the Name as CopyFromBlobToSql.
Pipeline Components
A pipeline in Azure Data Factory (ADF) is made up of various components that work together to transform and move data. At its core, a pipeline can have multiple activities, which are the building blocks of a pipeline.
These activities can be chained together using activity dependencies, which determine whether subsequent activities run sequentially or in parallel based on the outcome of the activities they depend on. You can also use branching to control the flow of activities and make decisions based on data.
The activities in a pipeline can be executed on various compute services, such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. Additionally, you can use data flows to transform and enrich data, creating a reusable library of data transformation routines that can be executed in a scaled-out manner.
In terms of pipeline structure, a pipeline is defined in JSON format, which includes components such as activities, parameters, concurrency, and annotations. The activities section is particularly important, as it can have one or more activities defined within it, each with its own JSON element.
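As a rough illustration, the top-level pipeline JSON follows this shape (the pipeline name and values here are placeholders):

```json
{
    "name": "MyPipeline",
    "properties": {
        "description": "What the pipeline does",
        "activities": [],
        "parameters": {},
        "concurrency": 1,
        "annotations": []
    }
}
```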
Mapping Data Flows
Mapping data flows are a powerful way to transform data in Azure Data Factory (ADF). You can create reusable graphs of data transformation logic that are executed on a Spark cluster.
These graphs, known as Data Flows, enable data engineers to build and maintain complex data transformation processes without needing to understand Spark clusters or programming.
Data Flows can be used to transform any-sized data, and you can build up a library of reusable data transformation routines. You can execute these processes in a scaled-out manner from your ADF pipelines.
Data Factory will automatically manage the Spark cluster for you, spinning it up and down as needed. This means you don't have to worry about managing or maintaining clusters.
Here are some key benefits of using mapping data flows in ADF:
- Data Flows enable you to build and maintain complex data transformation processes without needing to understand Spark clusters or programming.
- Data Flows can be used to transform any-sized data.
- Data Factory will automatically manage the Spark cluster for you.
Multiple Activities
You can have multiple activities in a pipeline, which can run in parallel if they're not dependent on each other. This is a great way to automate complex workflows.
A pipeline can contain multiple activities (Data Factory enforces a documented limit on the number of activities per pipeline), and each activity can have its own dependencies and conditions that determine when it runs. You can use the "dependsOn" property to define activity dependencies, which is a crucial aspect of building robust pipelines.
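For example, a second activity can be made to wait for an earlier one to succeed by declaring a dependency on it. The sketch below shows only the dependency-related part of an activity definition; the activity names are placeholders and the rest of the Copy activity's settings are omitted:

```json
{
    "name": "SecondCopy",
    "type": "Copy",
    "dependsOn": [
        {
            "activity": "FirstCopy",
            "dependencyConditions": [ "Succeeded" ]
        }
    ]
}
```

The supported dependency conditions are Succeeded, Failed, Skipped, and Completed; activities with no dependency between them are free to run in parallel.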
The activities you can use in a pipeline fall into three broad types:
- Data movement activities, such as the Copy activity, which copy data between supported source and sink data stores.
- Data transformation activities, which transform data using compute services such as HDInsight, Spark, Data Lake Analytics, or SQL.
- Control activities, which orchestrate the flow of the pipeline.
Control activities are a crucial part of building pipelines, as they allow you to control the flow of your pipeline and make decisions based on conditions. Some common control activities include the "If Condition Activity" and the "For Each Activity".
Pipeline Building Blocks
Pipeline Components are the building blocks of a data pipeline. A pipeline is a series of activities that move and transform data from one place to another. To create a pipeline, you need to create a linked service, input and output datasets, and a pipeline.
A linked service is a connection to a data store, such as Azure Blob Storage or SQL Database. You can create a linked service by selecting + New in the Linked Services tab and following the prompts. For example, to create an Azure Blob Storage linked service, you need to enter the name, select the storage account, and test the connection.
Input and output datasets are used to define the structure of the data. You can create an input dataset by selecting + New in the Datasets tab and selecting the type of dataset you want to create. For example, to create an Azure Blob Storage dataset, you need to select Azure Blob Storage and then select the file path.
You can create a pipeline by selecting + New in the Pipelines tab and following the prompts. For example, to create a pipeline that copies data from Azure Blob Storage to SQL Database, you create a linked service, input and output datasets, and then add a Copy Data activity to the pipeline.
Here are the different types of pipeline components:
- Linked Service: A connection to a data store.
- Input Dataset: Defines the structure of the input data.
- Output Dataset: Defines the structure of the output data.
- Copy Data Activity: Copies data from one place to another.
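As an illustration, a delimited-text input dataset that points at a blob container could be defined roughly like this (the container, folder, and file names follow the copy tutorial, and the linked service name is a placeholder):

```json
{
    "name": "InputDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "adftutorial",
                "folderPath": "input",
                "fileName": "emp.txt"
            }
        }
    }
}
```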
Linked Services
Linked services are a crucial part of Azure Data Factory, defining the connection information needed to connect to external resources. They're essentially connection strings that allow Data Factory to interact with data sources and compute resources.
A linked service represents a data store or compute resource, and can include things like a SQL Server database, an Oracle database, a file share, or an Azure Blob storage account. For example, an Azure Storage linked service specifies a connection string to connect to the storage account, while the corresponding Azure Blob dataset specifies the blob container and folder that contain the data.
Linked services are used for two main purposes in Data Factory: to represent a data store, and to represent a compute resource that can host the execution of an activity. This means you can use linked services to connect to various data sources and execute activities on compute resources like HDInsight Hadoop clusters.
To update a linked service, you can go to Manage > Linked services, and then update the relevant linked service. For example, to update the Azure key vault linked service, you would go to Manage > Linked services, and then update the Azure key vault to connect to your subscription.
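For example, an Azure Blob Storage linked service is essentially a named connection string. A minimal sketch looks roughly like this, with placeholder account values; in practice you would store the secret in Azure Key Vault rather than inline:

```json
{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<account-key>"
        }
    }
}
```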
Pipeline Control
Pipeline control is a crucial aspect of working with Azure Data Factory pipelines. Control flow activities are used to manage the flow of activities in a pipeline.
Control activities have a top-level structure that includes name, description, type, and typeProperties. The name and type are required, while description and typeProperties are optional.
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching, and looping. Activities can be chained in a sequence using the dependsOn property, and branching can be achieved using the If Condition activity.
Azure Data Factory supports a rich set of control flow activities; the next section describes the most commonly used ones.
Control Activity
Control activities are the backbone of a pipeline, allowing you to manage the flow of data and activities. They enable you to make decisions, loop through data, and even wait for specific conditions to be met.
You can use control activities to add a value to an existing array variable with an Append Variable activity. This is especially useful when working with datasets that require incremental updates.
Control activities also enable you to define a repeating control flow with a For Each activity. This is similar to a Foreach looping structure in programming languages.
The If Condition activity allows you to branch based on a condition that evaluates to true or false. It provides the same functionality as an if statement in programming languages.
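A rough sketch of an If Condition activity is shown below. The expression and names are illustrative only; this one assumes a preceding Lookup activity named LookupRows, and each branch would hold its own list of activities:

```json
{
    "name": "CheckForRows",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@greater(activity('LookupRows').output.count, 0)",
            "type": "Expression"
        },
        "ifTrueActivities": [],
        "ifFalseActivities": []
    }
}
```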
Here are some of the control activities supported:
- Append Variable activity: adds a value to an existing array variable.
- Execute Pipeline activity: invokes another pipeline from the current pipeline.
- Filter activity: applies a filter expression to an input array.
- For Each activity: defines a repeating control flow that iterates over a collection.
- Get Metadata activity: retrieves the metadata of data in Azure Data Factory.
- If Condition activity: branches based on a condition that evaluates to true or false.
- Lookup activity: reads a record or value from an external source to reference in subsequent activities.
- Set Variable activity: sets the value of an existing variable.
- Until activity: implements a do-until loop that runs until an associated condition evaluates to true.
- Wait activity: pauses the pipeline for a specified period of time.
- Web activity: calls a custom REST endpoint from the pipeline.
- Webhook activity: calls an endpoint and waits for a callback before continuing.
Control activities also share a top-level structure that includes tags for name, description, type, typeProperties, and dependsOn. The name and type tags are required, while description, typeProperties, and dependsOn are optional.
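Putting this together, a simple Wait activity illustrates the top-level shape; the name, description, and 30-second wait are arbitrary values chosen for the sketch:

```json
{
    "name": "PauseBeforeNextStep",
    "description": "Pause briefly before the next activity runs",
    "type": "Wait",
    "dependsOn": [],
    "typeProperties": {
        "waitTimeInSeconds": 30
    }
}
```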
Trigger Manually
To manually trigger a pipeline, select Trigger on the toolbar and then select Trigger Now. This will initiate the pipeline run.
You'll see a pipeline run that is triggered by a manual trigger on the Monitor tab. This is where you can view activity details and rerun the pipeline.
To view activity runs associated with the pipeline run, select the CopyPipeline link under the PIPELINE NAME column. This will show you a list of associated activity runs.
For details about the copy operation, select the Details link (eyeglasses icon) under the ACTIVITY NAME column. You can also select All pipeline runs at the top to go back to the Pipeline Runs view.
To refresh the view, select Refresh. This will update the pipeline runs view with the latest information.
Verify that the pipeline run is successful by checking that two more rows are added to the emp table in the database. This confirms that the pipeline has executed correctly.
Triggers
A trigger is a unit of processing that determines when a pipeline run needs to be kicked off, and pipeline runs can be initiated in response to different types of events.
There are different types of triggers, including schedule triggers that run a pipeline on a specified schedule. For example, you can create a schedule trigger to run a pipeline every minute until a specified end datetime.
To set up a schedule trigger, you go to the Author tab, select New/Edit, and choose a trigger type. You then enter a name, update the start date, select a time zone, and set the recurrence to every minute. You also specify an end date and select the activated option.
A cost is associated with each pipeline run, so it's essential to set the end date appropriately. Once you've set up the trigger, you review the warning and select Save. Finally, you click Publish all to publish the change.
Here are the steps to create a schedule trigger:
- Go to the Author tab.
- Select New/Edit and choose a trigger type.
- Enter a name, update the start date, select a time zone, and set the recurrence to every minute.
- Specify an end date and select the activated option.
- Review the warning and select Save.
- Click Publish all to publish the change.
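Behind the UI, the trigger is also stored as JSON. A rough sketch of a schedule trigger that runs CopyPipeline every minute between a start and end time could look like this (the trigger name, dates, and time zone are placeholders):

```json
{
    "name": "RunEveryMinute",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 1,
                "startTime": "2025-01-01T00:00:00Z",
                "endTime": "2025-01-01T01:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyPipeline",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
```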
You can also trigger a pipeline manually by selecting Trigger on the toolbar and then selecting Trigger Now. This will kick off a pipeline run that you can view in the Monitor tab.
CI/CD
CI/CD is a crucial part of managing the pipeline lifecycle, allowing you to incrementally develop and deliver your ETL processes before publishing the finished product. This lets you refine raw data into a business-ready, consumable form and load it into various analytics engines.
You can use Azure DevOps and GitHub to support CI/CD of your data pipelines, which streamlines the development and delivery process. Data Factory offers full support for CI/CD, making it easier to manage and deploy your data pipelines.
To create a CI/CD pipeline in Azure DevOps, you'll need to create a pipeline and select the adf_publish branch as the source. This branch contains the ARMTemplateForFactory.json and ARMTemplateParametersForFactory.json files. You can then add a task to publish build artifacts, specifying the DEV Data Factory folder as the path to publish.
Here are the steps to create a CI/CD pipeline in Azure DevOps:
- Create a pipeline and select the adf_publish branch as the source
- Add a task to publish build artifacts, specifying the DEV Data Factory folder as the path to publish
- Enable continuous integration and set the branch specification to adf_publish
By following these steps, you can automate the deployment of your data pipelines and ensure that your ETL processes are always up-to-date and running smoothly.
Frequently Asked Questions
What is a pipeline in Azure Data Factory?
A pipeline in Azure Data Factory is a logical grouping of activities that work together to perform a specific task, such as data ingestion and processing. It's a sequence of steps that helps you manage and automate complex data workflows in Azure.
What is the difference between Azure Data Factory and Azure pipelines?
Azure Data Factory is a data integration service for building and orchestrating data pipelines, while Azure Pipelines (part of Azure DevOps) is a CI/CD service for building, testing, and deploying code. Understanding the difference between these two Azure services can help you streamline your data processing and workflow management.
Does Azure Data Factory do ETL?
Yes, Azure Data Factory supports ETL (Extract, Transform, Load) processes, allowing you to prepare and move data across your digital transformation initiatives. With its code-free interface, you can easily construct and orchestrate ETL pipelines.
What is the execute pipeline in Azure Data Factory?
The Execute Pipeline activity lets one pipeline invoke another (child) pipeline as part of its control flow. Note that parameters are passed to child pipelines as strings because of the way the payload is transferred between pipelines, which is worth keeping in mind when designing parameterized pipelines.
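For reference, here is a rough sketch of how an Execute Pipeline activity is defined; the child pipeline name and the sourceFolder parameter are made up for illustration:

```json
{
    "name": "RunChildPipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "ChildCopyPipeline",
            "type": "PipelineReference"
        },
        "parameters": {
            "sourceFolder": "input"
        },
        "waitOnCompletion": true
    }
}
```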