As you start working with Azure Data Factory Data Flow, you'll want to know the basics. Data Flow is a visual, drag-and-drop interface for building the data transformation logic that runs inside your pipelines, and it can handle complex transformations at scale.
To get started, you'll need to create a new Data Flow in the Azure portal. This will give you a blank canvas to begin building your pipeline.
The key to building a successful Data Flow is to understand the different types of data transformations you can perform. These include data mapping, filtering, and aggregation.
Data mapping is a crucial step in any data pipeline, and Data Flow makes it easy to create custom mappings using a visual interface.
Azure Data Factory Data Flow Basics
Azure Data Factory's Data Flow component is a powerful tool for building ETL pipelines. It offers a broad set of transformations for shaping data as it moves from source to sink.
Data Preview allows users to preview their data as it flows through the pipeline, making it easier to identify issues with the data transformation logic. This feature is incredibly useful for debugging and making changes before executing the pipeline.
Debugging in Data Flow enables users to step through the pipeline and inspect the data at each stage. This helps identify and fix issues before executing the pipeline.
Expressions in Data Flow allow users to create dynamic data transformation logic. This means users can apply different transformations to their data as it flows through the pipeline.
Data Flow transformations are key to building rich ETL pipelines. The broader the set of supported transformations, the more of your transformation logic you can express directly in the tool instead of writing custom code.
Creating and Configuring
To create parameters in a data flow, you'll need to click on the blank portion of the data flow canvas to see the general properties, and then select the Parameter tab in the settings pane.
You can add a new parameter by selecting the New option, and for each parameter, you must assign a name, select a type, and optionally set a default value.
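As a rough sketch, parameters defined this way appear in the data flow script that the UI generates. The parameter names and default values below are invented for illustration:

```
parameters{
	outputPath as string ('curated/movies'),
	minYear as integer (1990)
}
```

Inside any transformation expression, these would then be referenced as $outputPath and $minYear.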
To create a pipeline with a Data Flow activity, follow these steps:
- On the home page of Azure Data Factory, select Orchestrate.
- In the General tab for the pipeline, enter TransformMovies as the pipeline name.
- In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow activity from the pane to the pipeline canvas.
- In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow TransformMovies. Click Finish when done.
- In the top bar of the pipeline canvas, slide the Data Flow debug slider on.
Select
The SELECT transform is a powerful tool that helps you curate the fields in your data stream. It lets you rename fields, change mappings, drop fields you no longer need, and remove duplicate mappings.
Data processing can get messy, and streams tend to accumulate extra columns when you join, merge, split, or create calculated fields. The SELECT transform is how you clean up the stream afterward so you keep only what matters.
A tidy, curated stream is easier to work with and analyze, which makes the SELECT transform a routine step in most data flows.
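For illustration, a SELECT transform that keeps three columns and renames one might appear in the generated data flow script roughly like this; the stream and column names are invented for the example:

```
MoviesSource select(mapColumn(
		movieId,
		title = movieTitle,
		Rating
	),
	skipDuplicateMapInputs: true,
	skipDuplicateMapOutputs: true) ~> SelectAndRename
```

The two skipDuplicate options correspond to the checkboxes that drop duplicate input and output mappings.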
Create a Factory
To create a data factory, you'll need to open Microsoft Edge or Google Chrome, as the Data Factory UI is only supported in these web browsers.
You'll then select Create a resource > Integration > Data Factory from the left menu.
The name of the Azure data factory must be globally unique, so enter a name like ADFTutorialDataFactory. If you receive an error that the name is already taken, enter a different name.
Select the Azure subscription where you want to create the data factory.
For the Resource Group, you can either select an existing resource group or create a new one - just enter the name of the new group.
Select V2 as the version for the data factory.
Choose a location for the data factory from the drop-down list, but be aware that data stores and computes can be in other regions.
To create the data factory, simply select the Create button.
Once the creation is finished, navigate to the Data factory page by selecting Go to resource in the Notifications center.
Finally, select Author & Monitor to launch the Data Factory UI in a separate tab.
Working with Parameters
To add parameters to your data flow, click on the blank portion of the data flow canvas to see the general properties, and then select the Parameter tab to generate a new parameter. You must assign a name, select a type, and optionally set a default value for each parameter.
Parameters can be referenced in any data flow expression and always begin with a dollar sign ($). They are immutable inside the flow: their values are supplied when the data flow is executed and cannot be changed afterward. You'll find the list of available parameters inside the Expression Builder under the Parameters tab.
You can quickly add additional parameters by selecting New parameter and specifying the name and type. This is useful when working with complex data flows that require multiple input parameters.
For the inline source type, the linked service parameters are exposed in the data flow activity settings within the pipeline. For the dataset source type, the linked service parameters are exposed directly in the dataset configuration.
Once you've created a data flow with parameters, you can execute it from a pipeline with the Execute Data Flow Activity. After adding the activity to your pipeline canvas, you'll be presented with the available data flow parameters in the activity's Parameters tab.
Pipeline expression parameters allow you to reference system variables, functions, pipeline parameters, and variables, just as in other pipeline activities. When you click Pipeline expression, a side pane opens where you can enter an expression using the expression builder.
Data flow expression parameters can reference functions, other parameters, and any defined schema column throughout your data flow. This expression will be evaluated as is when referenced, so be careful not to pass in an invalid expression or reference a schema column that doesn't exist in that transformation.
A common pattern is to pass in a column name as a parameter value. If the column is defined in the data flow schema, you can reference it directly as a string expression. If the column isn't defined in the schema, use the byName() function to reference it.
If you want to map a string column based on a parameter that holds the column name, you can add a Derived Column transformation with the expression toString(byName($columnName)). Remember to cast the column to its appropriate type with a conversion function such as toString().
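As a concrete sketch of that pattern, assuming a string parameter named $columnName and an incoming stream called MoviesSource, the derived column could look like this in data flow script form:

```
MoviesSource derive(
		selectedValue = toString(byName($columnName))
	) ~> DerivedColumn1
```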
When assigning a pipeline expression parameter of type string, by default quotes will be added and the value will be evaluated as a literal. To read the parameter value as a data flow expression, check the expression box next to the parameter.
Here are some key differences between string literals and expressions:
- String literals: quotes are added and the value is evaluated as a literal.
- Expressions: the value is evaluated as a data flow expression.
In the pipeline expression language, system variables such as pipeline().TriggerTime and functions like utcNow() return timestamps as strings in format 'yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ'. To convert these into data flow parameters of type timestamp, use string interpolation to include the desired timestamp in a toTimestamp() function.
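For example, to pass the pipeline trigger time into a data flow parameter of type timestamp, the parameter value in the Execute Data Flow activity (with the expression box checked) might look like the following. The left(..., 23) trim and the format string are assumptions that match the trigger-time string format described above:

```
toTimestamp(left('@{pipeline().TriggerTime}', 23), 'yyyy-MM-dd\'T\'HH:mm:ss.SSS')
```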
Transformation Logic
Transformation Logic is a crucial aspect of Azure Data Factory Data Flow. You can design and build data transformation logic using a drag-and-drop interface with Mapping Data Flow.
Mapping Data Flow is a visual data transformation tool that allows users to create complex data transformations at scale. The flows run on fully managed, serverless Apache Spark compute provided by Azure Data Factory, which makes it well suited to well-defined, repeatable transformations.
Azure Data Factory provides a range of transformation components, including aggregations, joins, and filters, to manipulate data as it flows between sources and sinks.
- Aggregations: Used to perform operations such as sum, average, and count on data.
- Joins: Used to combine data from multiple sources based on a common column (see the sketch after this list).
- Filters: Used to limit the scope of data and process it conditionally.
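As a sketch of how one of these looks in the generated data flow script, here is a join between two hypothetical streams, Ratings and MovieDetails, on a shared movieId column:

```
Ratings, MovieDetails join(Ratings@movieId == MovieDetails@movieId,
	joinType:'inner',
	broadcast: 'auto') ~> JoinOnMovieId
```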
What Is Mapping Data Flow?
Mapping Data Flow is a visual data transformation tool that allows users to design, build, and execute data transformation logic using a drag-and-drop interface. Because the underlying compute is fully managed and serverless, users can perform complex data transformations at scale without managing any infrastructure.
Mapping Data Flow is based on a set of data transformation components that can be connected together to create a data flow pipeline. These components include sources, transformations, and sinks.
Sources are the starting point of a data flow pipeline in Mapping Data Flow, and they can include various types of data stores such as Azure SQL Database, Azure Blob Storage, and Azure Data Lake Storage.
Transformations are used to modify the data as it flows through the pipeline, and Azure Data Factory provides a range of transformation components such as aggregations, joins, and filters that can be used to manipulate data.
Sinks are used to load data into destinations such as Azure SQL Database, Azure Blob Storage, or Azure Data Lake Storage.
Here are the different types of components in a Mapping Data Flow pipeline:
- Source: Azure SQL Database, Azure Blob Storage, Azure Data Lake Storage
- Transformation: Aggregations, joins, filters
- Sink: Azure SQL Database, Azure Blob Storage, Azure Data Lake Storage
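Putting the three component types together, a minimal source-to-sink flow looks roughly like this in data flow script form; the column names are placeholders, and a real flow would add transformations in between:

```
source(output(
		movieId as string,
		title as string
	),
	allowSchemaDrift: true,
	validateSchema: false) ~> MoviesSource
MoviesSource sink(allowSchemaDrift: true,
	validateSchema: false) ~> MoviesSink
```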
Build Transformation Logic
Building transformation logic is a crucial step in data processing, and Azure Data Factory provides a range of tools to make it easier.
You can use the Derived Column transformation to create new calculated fields or update existing fields in a data stream. This is especially useful when you need to perform complex calculations or data manipulations.
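For instance, a Derived Column that adds an upper-cased title and a simple rating label might be scripted like this; the column names and the threshold are made up for the example:

```
MoviesSource derive(
		titleUpper = upper(title),
		ratingLabel = iif(toInteger(Rating) >= 4, 'high', 'low')
	) ~> DerivedColumn2
```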
Data preview and debugging are essential features in Azure Data Factory that allow you to test and refine your transformation logic before executing the pipeline.
To create dynamic data transformation logic, you can use expressions in Mapping Data Flow. These expressions can be used to manipulate data as it flows through the pipeline.
Here are some common expression functions you can use:
- toString() to convert a value to a string
- byName() to reference a column by its name
- conversion functions such as toInteger() and toDate() to cast a value to a specific data type
The Aggregate transform is another powerful tool in Azure Data Factory that allows you to perform complex aggregations and calculations on your data.
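Here is a sketch of an Aggregate transform that groups by year and computes an average rating and a row count, assuming the same hypothetical movie columns used in the earlier examples:

```
FilterYears aggregate(groupBy(year),
	AverageRating = avg(toInteger(Rating)),
	MovieCount = count()) ~> AggregateByYear
```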
You can use the Filter transformation to limit the scope of your data and process it conditionally. This is a common step in data processing pipelines.
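A Filter transform is just a boolean expression evaluated against each incoming row. For example, keeping a range of years might look like this (the year column is assumed):

```
MoviesSource filter(toInteger(year) >= 1990 && toInteger(year) <= 2000) ~> FilterYears
```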
The Lookup transform checks whether matching data already exists in a lookup stream and brings the matched columns into your stream, which makes it useful for driving UPSERT-type actions downstream.
The Exists transform is similar to the SQL EXISTS clause and can be used to compare data from one stream with data in another stream using one or multiple conditions.
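As a sketch, an Exists transform that keeps only rows from a hypothetical NewRecords stream that are not present in an ExistingRecords stream (the negate option turns it into a "does not exist" check) might be scripted like this:

```
NewRecords, ExistingRecords exists(NewRecords@movieId == ExistingRecords@movieId,
	negate: true,
	broadcast: 'auto') ~> NotYetLoaded
```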
When assigning a pipeline expression parameter of type string, make sure to check the expression box next to the parameter to read the parameter value as a data flow expression.
Remember that string interpolation is not supported in these data flow expressions, so concatenate the pieces into a single string value with the '+' operator, for example 'part one ' + $myParam + ' part two'.
Split
The Split transform (Conditional Split) in Azure Data Factory is a powerful tool for dividing a data stream into two or more streams based on matching conditions. This allows discrete types of data processing on each category of data.
You can split data based on the first matching criteria or all the matching criteria as desired, giving you flexibility in how you process your data.
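For illustration, a Conditional Split that routes movies released before 1980 into one stream and everything else into a default stream might be scripted like this; the condition and stream names are invented:

```
MoviesSource split(toInteger(year) < 1980,
	disjoint: false) ~> SplitByEra@(classics, everythingElse)
```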
Union
The Union transformation is a powerful tool in Azure Data Factory. It's equivalent to the UNION clause in SQL, allowing you to merge data from two data streams with identical or compatible schema into a single data stream.
To merge data, the schema from two streams can be mapped by name or ordinal position of the columns. This flexibility makes it easy to combine data from different sources.
The Union transformation is especially useful when the same kind of records arrives from two or more sources and you want to process them together.
By mapping the schemas of the incoming streams onto each other, you get a single data stream and one place to apply the rest of your transformation logic.
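A sketch of a Union that merges two hypothetical order streams, using the name-based mapping described above:

```
OnlineOrders, StoreOrders union(byName: true) ~> AllOrders
```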
Surrogate Key
In data warehousing scenarios, surrogate keys are used to provide a unique identifier for records.
Surrogate keys are particularly useful in slowly changing dimensions (SCD) where the business key can repeat as different versions of the same record are created.
Azure Data Factory provides a transform to generate these surrogate keys using the Surrogate Key transform.
This transform is designed to help you create unique identifiers for your records, making it easier to manage and analyze your data.
In essence, surrogate keys act as a substitute for the business key, ensuring that each record has a unique identifier.
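In data flow script form, the Surrogate Key transform (keyGenerate) adds an incrementing key column to the stream; the stream and column names below are placeholders:

```
DimCustomerPrep keyGenerate(output(CustomerSK as long),
	startAt: 1L) ~> AddSurrogateKey
```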
Sort
Sorting data is a crucial step in data transformation, especially when working with time-series datasets that require a specific order.
Sorting ensures that downstream steps which depend on row order, such as time-series processing, handle the data correctly.
Sorting data before loading it into a destination data repository is a good practice, allowing for efficient data management.
In Azure Data Factory, the sort transform is a useful tool for achieving this, enabling you to build a data flow that transforms data to the desired shape and size.
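A minimal Sort transform sketch that orders a stream ascending by an assumed year column:

```
AggregateByYear sort(asc(year, true)) ~> SortByYear
```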
Frequently Asked Questions
What is the difference between pipeline and data flow in Azure Data Factory?
In Azure Data Factory, Pipelines orchestrate and manage workflows, while Data Flows perform the data transformations within those workflows. A pipeline runs a data flow through the Execute Data Flow activity.
What is the difference between control flow and data flow in Azure Data Factory?
Control Flow in Azure Data Factory determines the path of execution for a pipeline, while Data Flow is used for data transformations, such as joining or splitting data. In essence, Control Flow steers the pipeline, while Data Flow shapes the data.
What is Azure Dataflow?
Azure Data Flow is a graphical data transformation tool that uses Apache Spark to process and transform data. It provides a drag-and-drop interface for designing and executing data transformation logic.