Azure Data Factory Copy Activity Tutorial and Guide

Azure Data Factory Copy Activity is a powerful tool for moving data between different sources and sinks. It's a crucial step in getting your data in one place.

The Copy Activity supports a wide range of data sources and sinks, including Azure Blob Storage, Azure Data Lake Storage, and SQL Server. You can also use it to copy data from on-premises sources to the cloud.

To get started with the Copy Activity, you'll need to create a pipeline in Azure Data Factory. This involves setting up a source dataset, a sink dataset, and a pipeline with a Copy Activity. The pipeline will handle the data copying process for you.

The Copy Activity has a lot of flexibility when it comes to data transformation. You can use it to perform data mapping, data type conversion, and even data validation. This means you can get your data into the right shape for analysis and use.

Prerequisites

To use Azure Data Factory's copy activity, you need to meet certain prerequisites.

If your data store is located inside an on-premises network or an Azure virtual network, you'll need to configure a self-hosted integration runtime to connect to it.

You can use the Azure Integration Runtime if your data store is a managed cloud data service.

To access an on-premises network with restricted access, add the Azure Integration Runtime IPs to the allow list in your firewall rules.

Alternatively, you can use the managed virtual network integration runtime feature in Azure Data Factory to access the on-premises network without installing a self-hosted integration runtime.

Azure Data Factory Setup

To set up Azure Data Factory, you'll first need to create a data factory. You can use your existing data factory or create a new one as described in Quickstart: Create a data factory by using the Azure portal.

Once the data factory exists, you create connections to your data from the Manage tab: browse to the Manage tab in your Azure Data Factory or Synapse workspace, select Linked Services, and then select New. This walks you through creating a new linked service, not a new factory.

Once you have a data factory set up, you can create linked services to connect to your data sources. You can create an Azure Data Lake Storage Gen2 linked service using the Azure portal UI, or a SQL Server linked service. To create a linked service, search for the relevant connector, configure the service details, test the connection, and create the new linked service.

Create Azure UI

To create linked services in Azure Data Factory or Synapse workspace, you'll want to start by browsing to the Manage tab. From there, select Linked Services and click New.

You can create a linked service for Azure Data Lake Storage Gen2 by searching for it in the Azure portal UI. Select the Azure Data Lake Storage Gen2 connector and configure the service details.

To test the connection, follow the steps outlined in the Azure portal UI. Once you've tested the connection, create the new linked service.
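
If you prefer to author the linked service as JSON rather than through the UI, a minimal sketch using account key authentication might look like this, with the account name and key as placeholders:

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<account-name>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<account-key>"
            }
        }
    }
}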

Alternatively, you can create a linked service for SQL Server by searching for SQL in the Azure portal UI. Select the SQL Server connector and configure the service details.

To create a linked service for SQL Server, you'll need to select Linked Services and click New in the Manage tab of your Azure Data Factory or Synapse workspace.
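
If you'd rather define it as JSON, a minimal SQL Server linked service might look like the following sketch. The connection string values are placeholders, and the connectVia block is only needed when the database sits behind a self-hosted integration runtime:

{
    "name": "SqlServerLinkedService",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<server-name>;Initial Catalog=<database-name>;Integrated Security=False;User ID=<user>;Password=<password>;"
        },
        "connectVia": {
            "referenceName": "<self-hosted-integration-runtime-name>",
            "type": "IntegrationRuntimeReference"
        }
    }
}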

Create a Factory

To create a data factory, you can use your existing data factory or create a new one as described in Quickstart: Create a data factory by using the Azure portal.

You can start by going to the Azure portal and following the steps outlined in the Quickstart guide.

Creating a new data factory is a straightforward process that can be completed in a few minutes.

Make sure to follow the instructions carefully to ensure that your data factory is set up correctly.

By following these steps, you'll be able to create a data factory that meets your needs and helps you manage your data effectively.

Principal

To set up service principal authentication in Azure Data Factory, you'll first need to register an application with the Microsoft identity platform. This will give you the client ID, which you'll use to define the linked service.

The type property must be set to AzureBlobFS for Azure Data Lake Storage Gen2 or AzureBlobStorage for Azure Blob Storage; both connectors support service principal authentication. You can also specify the Azure cloud environment, such as AzurePublic or AzureChina.

To use service principal authentication, you'll need to grant the service principal proper permission in Azure Blob Storage. You can do this by assigning an Azure role for access to blob and queue data.

The key properties for service principal authentication are the service principal ID (the application's client ID), the service principal key or certificate, and your Microsoft Entra tenant.

Note that service principal authentication isn't supported in Data Flow if your blob account enables soft delete, or if you access the blob storage through private endpoint using Data Flow.
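
As a sketch, an Azure Data Lake Storage Gen2 (AzureBlobFS) linked service using service principal authentication might look like the following; the URL, IDs, and secret are placeholders, and exact property names can vary between connector versions:

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<account-name>.dfs.core.windows.net",
            "servicePrincipalId": "<application-client-id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<client-secret>"
            },
            "tenant": "<tenant-id>",
            "azureCloudType": "AzurePublic"
        }
    }
}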

When setting up Azure Data Factory, it's essential to choose the right version for your needs.

The recommended version is Azure Data Factory V2, which is the latest and most feature-rich version available.

This version provides a more robust and scalable solution compared to the older V1 version.

You might keep existing pipelines on V1 for now if you haven't migrated them yet.

Keep in mind, however, that V1 is deprecated: it no longer receives new features, and Microsoft has announced its retirement, so plan to migrate.

If you're planning to use Azure Data Factory with Azure Synapse Analytics, V2 is the way to go.

This is because V2 provides better integration with Azure Synapse Analytics, making it easier to manage and monitor your data pipelines.

Review Settings and Deployment

To set up your Azure Data Factory pipeline, you need to review all settings and deployment carefully.

First, go to the Settings page and specify a name for your pipeline and its description. Select Next to use other default configurations.

Next, review all settings on the Summary page. This is your last chance to make any changes before deployment.

When you're satisfied, select Next to proceed to the Deployment complete page.

On this page, you can select Monitor to keep an eye on your pipeline after it's been created.

Configuration

To configure your Azure Data Factory copy activity, you'll need to define the connector configuration details, such as the SQL Server database connector properties. This includes specifying the schema, table, and table name for the source and sink datasets.

The dataset properties for a SQL Server dataset include the type, schema, table, and table name. The type must be set to SqlServerTable, and the schema and table names are required for the sink dataset but not for the source dataset. You can also use the tableName property for backward compatibility.

To copy data from and to a SQL Server database, you'll need to configure the copy activity properties, which include the source and sink dataset properties. For example, you can use the Binary copy checkbox to copy files as-is, and select the Binary format from the list of supported file formats.

Complete Configuration

To complete the configuration, you'll need to create a new connection for your source data. This involves selecting the linked service type, such as Azure Blob Storage, and specifying a name for your connection.

You'll also need to select your Azure subscription and storage account from the lists provided. Once you've done this, you can test the connection and create the new linked service.

When creating a SQL Server linked service, you'll need to browse to the Manage tab in your Azure Data Factory or Synapse workspace, select Linked Services, and click New. From there, you can search for SQL and select the SQL Server connector to configure the service details and test the connection.

To define a dataset for your SQL Server database, you'll need to specify the type as SqlServerTable, and provide the schema, table, and tableName properties. The schema and table properties are required for the sink, but not for the source.

These properties come together in the dataset's JSON definition.
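
As a sketch, a minimal SqlServerTable dataset might look like this; the dataset, linked service, schema, and table names are placeholders:

{
    "name": "SqlServerDataset",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": "SqlServerLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "schema": "dbo",
            "table": "Marketing"
        }
    }
}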

Custom Logic Writing

Custom Logic Writing is a powerful feature in data configuration that allows you to apply extra processing before inserting source data into the destination table. This is particularly useful when built-in copy mechanisms don't serve the purpose.

You can either load data into a staging table and then invoke a Stored Procedure activity, or invoke a stored procedure directly in the copy activity sink to apply the data; the latter approach takes advantage of table-valued parameters.

To use a stored procedure, you need to define a table type with the same name as sqlWriterTableType. The schema of the table type is the same as the schema returned by your input data. For example, in the Marketing table, you would create a table type called MarketingType with three columns: ProfileID, State, and Category.
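
For illustration, the MarketingType table type from this example could be created with a statement like the following; the varchar column types are assumptions, so match them to your actual source schema:

CREATE TYPE [dbo].[MarketingType] AS TABLE
(
    [ProfileID] varchar(256) NOT NULL,
    [State] varchar(256) NOT NULL,
    [Category] varchar(256) NOT NULL
);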

Defining a stored procedure is the next step. It should handle input data from your specified source and merge into the output table. The parameter name of the table type in the stored procedure is the same as tableName defined in the dataset. The stored procedure should also include a parameter for any additional data you want to pass to the procedure.

Here's a step-by-step guide to defining a stored procedure:

1. Define the stored procedure with the same name as sqlWriterStoredProcedureName.

2. The stored procedure should include a parameter for the table type and any additional data.

3. Use a MERGE statement to update or insert data into the destination table.

Here's an example of a stored procedure that performs an upsert into a table in the SQL Server database:

CREATE PROCEDURE spOverwriteMarketing @Marketing [dbo].[MarketingType] READONLY, @category varchar(256)
AS
BEGIN
    MERGE [dbo].[Marketing] AS target
    USING @Marketing AS source
    ON (target.ProfileID = source.ProfileID and target.Category = @category)
    WHEN MATCHED THEN
        UPDATE SET State = source.State
    WHEN NOT MATCHED THEN
        INSERT (ProfileID, State, Category)
        VALUES (source.ProfileID, source.State, source.Category);
END

Once you've defined the stored procedure, you can configure the SQL sink section in the copy activity as follows:

"sink": {

"type": "SqlSink",

"sqlWriterStoredProcedureName": "spOverwriteMarketing",

"storedProcedureTableTypeParameterName": "Marketing",

"sqlWriterTableType": "MarketingType",

"storedProcedureParameters": {

"category": {

"value": "ProductA"

}

}

}

Syntax Details

In the world of data integration, syntax details are crucial to getting your Copy activity up and running smoothly. For a Copy activity, the type property must be set to Copy.

The inputs property requires specifying the dataset that points to the source data, and it's worth noting that the Copy activity supports only a single input. You'll also need to specify the outputs property, which points to the sink data, and again, only a single output is supported.

The typeProperties property is where you configure the Copy activity, and it's a required field. This is where you'll specify the source and sink types, along with their corresponding properties. For more information on these properties, check out the "Copy activity properties" section in the connector article listed in Supported data stores and formats.

The required properties, then, are type, inputs, outputs, and typeProperties.
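
A skeletal Copy activity definition shows where each of these properties sits; the activity name, dataset names, and source and sink types below are placeholders:

{
    "name": "CopyFromBlobToSql",
    "type": "Copy",
    "inputs": [
        {
            "referenceName": "<source-dataset-name>",
            "type": "DatasetReference"
        }
    ],
    "outputs": [
        {
            "referenceName": "<sink-dataset-name>",
            "type": "DatasetReference"
        }
    ],
    "typeProperties": {
        "source": {
            "type": "BlobSource"
        },
        "sink": {
            "type": "SqlSink"
        }
    }
}

The source and sink types must match the datasets referenced in inputs and outputs; the connector articles list the exact type names to use.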

While the typeProperties property is required, some other properties have optional settings that can be useful in certain scenarios. For example, you can specify explicit column mappings from source to sink using the translator property, but this is not required.

Data Copy Options

Data Copy Options allow you to control how your files are copied. You can choose to preserve the file hierarchy, merge files, or flatten the hierarchy.

Preserve the file hierarchy by setting recursive to true and copyBehavior to preserveHierarchy. This will create the target folder with the same structure as the source. For example, if you have a source folder with a subfolder, the target folder will also have the same subfolder.

Merge files by setting copyBehavior to mergeFiles. This combines the contents of all files into one file with an autogenerated name. You can also flatten the hierarchy by setting copyBehavior to flattenHierarchy, which places every file in the first level of the target folder with an autogenerated name rather than preserving subfolders.

The copyBehavior setting controls which of these options applies: preserveHierarchy, flattenHierarchy, or mergeFiles.

Native Change Capture

Native change capture is a powerful feature in Azure Data Factory that allows you to automatically detect and extract changed data from SQL stores.

This feature is available for SQL Server, Azure SQL DB, and Azure SQL MI, making it a versatile tool for various data replication scenarios.

To use native change data capture, you enable it on the mapping data flow source transformation, which takes a single click.

The changed data, including row inserts, updates, and deletions, can be automatically detected and extracted by ADF mapping data flow.

With native change data capture, you can easily achieve data replication scenarios from SQL stores by appending a database as a destination store.

You can also compose any data transform logic in between to achieve incremental ETL scenarios from SQL stores.

Avoid renaming the pipeline or the activity: changing either name resets the checkpoint, so the next run either starts from the beginning or only picks up changes from that point onward.

If you do want to change the pipeline name or activity name, you can use your own Checkpoint key in the dataflow activity to achieve that.

Transformation

In data transformation, you can use various options to manipulate and process your data. You can use a common table expression (CTE) in a stored procedure, but not directly in the query mode of the source transformation.

To use a CTE, create a stored procedure containing a query such as: with CTE as (select 'test' as a) select * from CTE.

You can also use wildcards to read multiple files in a single source transformation in Azure Blob Storage. For example, you can use a wildcard pattern like ** to loop through each matching folder and file.

Credit: youtube.com, How to Copy & Reuse Transformation Steps on Another Table | Power BI / Power Query Tutorial

Wildcard paths can be used to process multiple files within a single flow. You can add multiple wildcard matching patterns with the plus sign. For example, you can use a wildcard like /data/sales/**/*.csv to get all .csv files under /data/sales.

Wildcard examples include:

  • * Represents any set of characters.
  • ** Represents recursive directory nesting.
  • ? Replaces one character.
  • [] Matches one of the characters inside the brackets.
  • /data/sales/**/*.csv Gets all .csv files under /data/sales.
  • /data/sales/20??/**/ Gets all files under any year folder whose name begins with 20 (for example, 2000 through 2099).
  • /data/sales/*/*/*.csv Gets .csv files two levels under /data/sales.
  • /data/sales/2004/*/12/[XY]1?.csv Gets all .csv files from December 2004 whose names start with X or Y followed by 1 and one more character (for example, X10.csv or Y1B.csv).
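
Putting these patterns to use in a mapping data flow, a source transformation with a wildcard path can be expressed in data flow script roughly as follows; the stream name is made up and the option list will vary with your dataset type:

source(allowSchemaDrift: true,
    validateSchema: false,
    ignoreNoFilesFound: false,
    wildcardPaths:['data/sales/**/*.csv']) ~> SalesFiles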

You can also use the Partition root path setting to define what the top level of the folder structure is. When you view the contents of your data via a data preview, you'll see that the service adds the resolved partitions found in each of your folder levels.

The Partition root path setting is useful when you have partitioned folders in your file source with a key=value format. For example, you can set a wildcard to include all paths that are the partitioned folders plus the leaf files that you want to read.

You can choose to do nothing with the source file after the data flow runs, delete the source file, or move the source file. If you're moving source files to another location post-processing, you can specify the "from" and "to" directories. For example, you can move files from /data/sales to /backup/priorSales.

Script Example

When you're working with data copy options, you'll often need to write scripts to handle the process. SQL Server is a common source type in these situations.

The associated data flow script for SQL Server as a source type is shown in the SQL Server source script example.

To use SQL Server as a sink type, you'll need to write a specific script, which is outlined in the SQL Server sink script example.

These scripts are crucial for making data copy options work smoothly, and understanding them can save you a lot of time and effort.
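
As a rough illustration of what such data flow scripts look like, a SQL Server source and sink might be expressed along these lines; the stream names, query, and option values are assumptions rather than a copy of the referenced examples:

source(allowSchemaDrift: true,
    validateSchema: false,
    isolationLevel: 'READ_UNCOMMITTED',
    query: 'select * from dbo.Marketing',
    format: 'query') ~> SqlServerSource
SqlServerSource sink(allowSchemaDrift: true,
    validateSchema: false,
    deletable: false,
    insertable: true,
    updateable: false,
    upsertable: false,
    format: 'table',
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> SqlServerSink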

Recursive and Behavioral Examples

When you're copying data, the options for recursive and copy behavior can be a bit confusing. The good news is that the behavior is predictable, and I'm here to break it down for you.

If you set recursive to true, the copy operation will traverse the entire folder structure. For example, if you have a folder with subfolders, the copy operation will include all the files and subfolders.

In the case of preserveHierarchy, the target folder will have the same structure as the source. So, if you have a folder with subfolders, the target folder will also have subfolders.

On the other hand, if you set recursive to false, the copy operation will only copy the top-level files and folders. Any subfolders or files within those subfolders will be left behind.

Here's a summary of the different combinations of recursive and copyBehavior values:

  • recursive true, preserveHierarchy: the target folder is created with the same structure as the source.
  • recursive true, flattenHierarchy: all source files land in the first level of the target folder with autogenerated names.
  • recursive true, mergeFiles: all source files are merged into one file with an autogenerated name.
  • recursive false, preserveHierarchy: only the files in the top level of the source folder are copied; subfolders are not picked up.
  • recursive false, flattenHierarchy: top-level files are copied into the first level of the target folder with autogenerated names; subfolders are not picked up.
  • recursive false, mergeFiles: top-level files are merged into one file with an autogenerated name; subfolders are not picked up.

As you can see, the copy behavior is determined by the combination of recursive and copyBehavior values. By understanding these options, you can choose the right approach for your data copy needs.

Preserve

You can choose to preserve file metadata along with the data during a copy. This is useful when copying files from Amazon S3, Azure Blob, or Azure Data Lake Storage Gen2 to Azure Data Lake Storage Gen2 or Azure Blob.

You can also preserve ACLs from Data Lake Storage Gen1/Gen2 when copying files to Gen2. This ensures that the POSIX access control lists are preserved along with the data.

Preserving metadata is crucial in data lake migration scenarios, where you can choose to preserve the metadata and ACLs along with the data using copy activity. This ensures that the data is copied accurately and the metadata is preserved.

In short, you can preserve two things during a copy: file metadata (file properties) and POSIX ACLs.

Note that preserving metadata and ACLs may require additional configuration and setup. It's essential to review the documentation and follow the recommended best practices to ensure accurate data preservation.

Parallel

Parallel copy lets the Copy activity read from the source and write to the sink over multiple concurrent connections at once, which can significantly speed up the transfer, especially for large datasets.

Using parallel data copy can be particularly useful for businesses that rely heavily on data-intensive operations, such as data analytics or scientific research.
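
In the Copy activity JSON, parallelism is usually tuned with the parallelCopies setting (and, for cloud-to-cloud copies, dataIntegrationUnits) under typeProperties; the values below are purely illustrative:

"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink" },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16
}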

Supported File Formats

Azure Data Factory supports a wide range of file formats that can be used for data copying. These formats include Avro, Binary, Delimited text, Excel, Iceberg (only for Azure Data Lake Storage Gen2), JSON, ORC, Parquet, and XML.

Binary format lets you copy files as-is between data stores, while delimited text and Excel cover common flat-file and spreadsheet sources. Avro, ORC, and Parquet are compact binary formats well suited to large analytical datasets, and JSON and XML handle semi-structured data. Iceberg is a special case: it's supported only for Azure Data Lake Storage Gen2.

Here is a list of all the supported file formats:

  • Avro format
  • Binary format
  • Delimited text format
  • Excel format
  • Iceberg format (only for Azure Data Lake Storage Gen2)
  • JSON format
  • ORC format
  • Parquet format
  • XML format

File Handling

Azure Data Factory's Copy activity supports a wide range of file formats, including Avro, Binary, Delimited text, Excel, Iceberg, JSON, ORC, Parquet, and XML. You can use the Copy activity to copy files as-is between two file-based data stores, or parse or generate files of a given format.

You can also add metadata tags to file-based sinks, such as Azure Storage, using pipeline parameters, system variables, functions, and variables. For binary file-based sinks, you can even add the Last Modified datetime of the source file using the keyword $$LASTMODIFIED.

The Copy activity also allows you to control the structure of the target folder using the recursive and copyBehavior settings. For example, if you set recursive to true and copyBehavior to preserveHierarchy, the target folder will be created with the same structure as the source. The full set of combinations is listed under Recursive and Behavioral Examples above.

File List Examples

You can use a file list path in the Copy activity source to specify a list of files to copy. This path points to a text file in the same data store.

The text file lists one file per line, with each entry given as a relative path to the folder path configured in the dataset. Create this file in the same data store as the files you want to copy, then reference it from the Copy activity source through the file list path setting, just as you would a folder path; the service then copies only the files named in the list.

The file list path works together with the dataset configuration, which specifies the container (the top-level folder in the data store) and the folder path (the path to the folder containing the files you want to copy). For example, suppose you have a dataset configured with the following settings:

  • Container: container
  • Folder path: FolderA
  • File list path: container/Metadata/FileListToCopy.txt

The file list path would point to a text file in the same data store containing a list of files to copy.
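
Under those settings, the source side of the Copy activity might reference the list roughly as follows; the DelimitedTextSource type shown is just one possible format-based source:

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "fileListPath": "container/Metadata/FileListToCopy.txt"
    }
}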

Preserve Metadata

You can choose to preserve file metadata along with data when copying files from Amazon S3/Azure Blob/Azure Data Lake Storage Gen2 to Azure Data Lake Storage Gen2/Azure Blob.

This feature is especially useful for preserving file properties during data migration.

To learn more, see Preserve metadata.

You can also preserve POSIX access control lists (ACLs) from Data Lake Storage Gen1/Gen2 when copying files to Gen2.

This allows you to maintain file permissions during the copy process.

Preserving metadata and ACLs in this way is especially valuable in data lake migration scenarios, where file properties and permissions need to survive the move from one storage type to another.

You can add metadata tags to file-based sinks, including Azure Storage-based sinks like Azure Data Lake Storage or Azure Blob Storage.

These metadata tags will appear as part of the file properties as Key-Value pairs.

You can also add metadata involving dynamic content using pipeline parameters, system variables, functions, and variables.

For binary file-based sinks, you have the option to add the Last Modified datetime of the source file using the keyword $$LASTMODIFIED.
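
As a sketch, a binary sink that adds both a static tag and the source file's Last Modified datetime might look like this; the metadata key names are made up:

"sink": {
    "type": "BinarySink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings",
        "metadata": [
            {
                "name": "department",
                "value": "sales"
            },
            {
                "name": "sourceLastModified",
                "value": "$$LASTMODIFIED"
            }
        ]
    }
}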

Blob and Azure

Azure Data Factory's copy activity allows you to copy data from various sources to Azure Blob Storage, including Azure Data Lake Storage Gen2. You can choose from several formats, such as Avro, Binary, Delimited text, Iceberg, JSON, ORC, and Parquet.

To specify the format, you'll need to set the type property under storeSettings to AzureBlobFSWriteSettings. For example, if you're using Parquet format, you'll set the type to AzureBlobFSWriteSettings and then specify the Parquet format.

Azure Blob Storage is a great destination for your data, especially if you're working with large files. You can use the copy activity to copy data from Azure Data Lake Storage Gen1 to Gen2, following best practices outlined in the documentation.

Here are the supported formats for Azure Data Lake Storage Gen2 as a sink type:

  • Avro format
  • Binary format
  • Delimited text format
  • Iceberg format
  • JSON format
  • ORC format
  • Parquet format

Shared Access Signature

You don't have to share your account access keys with a shared access signature. This URI encompasses all the information necessary for authenticated access to a storage resource.

The service now supports both service shared access signatures and account shared access signatures. This means you have more flexibility in granting access to your storage resources.

To use shared access signature authentication, set the type property to AzureBlobStorage (recommended for Blob Storage), AzureBlobFS (for Data Lake Storage Gen2), or the legacy AzureStorage type. This property is required.

You also need to specify the sasUri property, which is the shared access signature URI to the Storage resources such as blob or container. This property is also required.

The sasUri property should be marked as SecureString to store it securely. You can also put the SAS token in Azure Key Vault to use auto-rotation and remove the token portion.

These two properties, type and sasUri, are all that shared access signature authentication requires; see the sketch below.

A shared access signature URI to a blob allows the data factory or Synapse pipeline to access that particular blob. A shared access signature URI to a Blob storage container allows the data factory or Synapse pipeline to iterate through blobs in that container.
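
A minimal sketch of a Blob Storage linked service using SAS authentication, with the SAS URI left as a placeholder, might look like this:

{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "sasUri": {
                "type": "SecureString",
                "value": "<SAS URI of the Azure Storage resource>"
            }
        }
    }
}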

Blob as a

Blob as a sink type is a powerful feature in Azure. You can use it to store data in Azure Blob Storage.

To configure Blob Storage as a sink, you need to set the type property under storeSettings to AzureBlobStorageWriteSettings. This is a required property.

The copy behavior when copying files from a file-based data store can be defined using the copyBehavior property. The allowed values are PreserveHierarchy, FlattenHierarchy, and MergeFiles.

PreserveHierarchy preserves the file hierarchy in the target folder, while FlattenHierarchy places all files in the first level of the target folder. MergeFiles merges all files from the source folder into one file.

You can specify the block size, in megabytes, used to write data to block blobs using the blockSizeInMB property. The allowed value is between 4 MB and 100 MB.

The default block size is 100 MB, but you can explicitly specify a block size to optimize performance. Make sure that blockSizeInMB*50000 is large enough to store the data.

The maxConcurrentConnections property allows you to specify the upper limit of concurrent connections established to the data store during the activity run. This property is not required.

You can also set custom metadata when copying to Blob Storage using the metadata property. Each object under the metadata array represents an extra column. The name defines the metadata key name, and the value indicates the data value of that key.

These properties for Azure Blob Storage all live under storeSettings in the copy sink.
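
Pulling these together, a sink's storeSettings block for Blob Storage might be sketched as follows; the sink type, block size, connection limit, and metadata values are illustrative only:

"sink": {
    "type": "DelimitedTextSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings",
        "copyBehavior": "PreserveHierarchy",
        "blockSizeInMB": 8,
        "maxConcurrentConnections": 4,
        "metadata": [
            {
                "name": "project",
                "value": "adf-demo"
            }
        ]
    }
}

PreserveHierarchy here could be swapped for FlattenHierarchy or MergeFiles depending on the target layout you want.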

Azure Gen2

Azure Data Lake Storage Gen2 is a powerful storage solution that offers a range of benefits over its predecessor, Gen1. You can create an Azure Data Lake Storage Gen2 linked service using the Azure portal UI by browsing to the Manage tab, selecting Linked Services, and clicking New.

To configure the service details, you'll need to search for Azure Data Lake Storage Gen2 and select the Azure Data Lake Storage Gen2 connector. This will allow you to test the connection and create the new linked service.

Azure Data Lake Storage Gen2 supports a variety of formats for data sinks, including Avro, Binary, Delimited text, Iceberg, JSON, ORC, and Parquet. These formats provide flexibility in how you store and process your data.

For Data Lake Storage Gen2 as a sink, the storeSettings type is AzureBlobFSWriteSettings. Within storeSettings, the copyBehavior property lets you preserve the file hierarchy, flatten it, or merge files from the source folder, and the blockSizeInMB property specifies the block size used to write data to ADLS Gen2.
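
For comparison, an ADLS Gen2 sink writing Parquet might carry a storeSettings block like this sketch, with illustrative values:

"sink": {
    "type": "ParquetSink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings",
        "copyBehavior": "PreserveHierarchy",
        "blockSizeInMB": 8
    }
}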

Performance and Monitoring

The Copy activity monitoring experience shows you the copy performance statistics for each of your activity run. You can use this information to identify areas for improvement and optimize the performance of the Copy activity.

The Copy activity performance and scalability guide lists the performance values observed during testing, which can help you set realistic expectations for your data movement tasks. For example, copying many small files typically yields lower throughput than copying the same volume of data as a few large files.

To monitor the Copy activity, you can use both visual and programmatic methods, as described in the Monitor copy activity section. This allows you to track the status of your pipeline and activity runs in real-time.

Performance and Tuning

The Copy activity monitoring experience is a valuable tool for understanding how your data movement is performing. It shows you the copy performance statistics for each of your activity run.

If you're looking to improve the performance of your Copy activity, I recommend checking out the Copy activity performance and scalability guide. It lists key factors that affect the performance of data movement via the Copy activity.

For binary file copy scenarios, keep in mind that unless you use the resume capability described under Resume from Last Run, a copy activity rerun starts from the beginning rather than picking up where it left off.

The Copy activity performance and scalability guide also provides valuable insights into performance values observed during testing, which can help you optimize your Copy activity. By following these guidelines, you can significantly improve the performance of your data movement.

Monitoring

Monitoring is a crucial step in ensuring the performance of your pipeline. You can monitor the Copy activity run in the Azure Data Factory and Synapse pipelines both visually and programmatically.

To monitor visually, you can switch to the Monitor tab, where you'll see the status of the pipeline. Select Refresh to refresh the list, and click the link under Pipeline name to view activity run details or rerun the pipeline.

On the Activity runs page, you can select the Details link (eyeglasses icon) under the Activity name column for more details about the copy operation.

Resume from Last Run

Resume from last run is a feature that allows you to pick up where you left off after a failed copy activity run. This is especially useful when dealing with large files or complex data migrations.

You can leverage resume from last run in two ways: activity level retry and rerun from failed activity. With activity level retry, you can set a retry count on your copy activity, and if it fails, the next automatic retry will start from the last trial's failure point.

Rerun from failed activity is another option, which allows you to trigger a rerun from the failed activity in the ADF UI monitoring view or programmatically. If the failed activity is a copy activity, the pipeline will not only rerun from this activity, but also resume from the previous run's failure point.

Resume happens at the file level, so if your copy activity fails when copying a file, it will be re-copied in the next run. To ensure resume works properly, make sure not to change the copy activity settings between reruns.

Here are some key details to keep in mind:

  • Resume is supported for the following file-based connectors: Amazon S3, Amazon S3 Compatible Storage, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure Files, File System, FTP, Google Cloud Storage, HDFS, Oracle Cloud Storage, and SFTP.
  • Resume can resume from an arbitrary number of copied files for Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, and Google Cloud Storage, but only a limited number of files for other file-based connectors.
  • Self-hosted integration runtime version 5.43.8935.2 or above is required for resuming from last failed run via self-hosted integration runtime.

Additional Settings

When creating an Azure Data Factory pipeline, it's essential to review all settings and deployment. You can do this by specifying a name for the pipeline and its description on the Settings page.

To ensure you're on the right track, review all settings on the Summary page. This is where you can catch any errors or discrepancies before moving forward.

Once you've reviewed all settings, you can select Next to proceed to the Deployment complete page. Here, you can select Monitor to track the pipeline you've created.

To recap, the steps for reviewing all settings and deployment are:

  1. Specify a name for the pipeline and its description on the Settings page.
  2. Review all settings on the Summary page.
  3. Select Next to proceed to the Deployment complete page.
  4. Select Monitor to track the pipeline you've created.

Get Started

To get started with the Azure Data Factory Copy activity, you can use one of the following tools or SDKs:

  • The Copy Data tool
  • The Azure portal
  • The .NET SDK
  • The Python SDK
  • Azure PowerShell
  • The REST API
  • The Azure Resource Manager template

This is a great starting point, as it allows you to choose the tool that best fits your needs.

The Copy Data tool is a user-friendly option that can be accessed directly from the Azure Data Factory home page. Simply select the Ingest tile to start the Copy Data tool.

To begin with the Copy Data tool, you'll need to select the task type, which can be either a built-in copy task or a custom task. The built-in copy task is a good starting point, as it provides a simple and straightforward way to copy data.

Troubleshooting and Logs

You can log your copied file names to ensure data consistency between source and destination stores by reviewing the copy activity session logs.

This helps you verify that the data was successfully copied from source to destination store.

Session logs can be particularly useful for tracking the progress of your copy activities and identifying any issues that may have arisen during the process.

Session Log

Logging your session activity can be a lifesaver when troubleshooting issues. Reviewing the session log can help ensure that data is consistent between the source and destination store.

You can log your copied file names to verify successful copying. This is especially helpful for checking consistency between the two stores.

The session log can also help you identify any discrepancies or errors that occurred during the copy activity.

Blob Source Properties

Troubleshooting copy issues can feel overwhelming, especially with sources like Azure Blob Storage. The key is to understand the properties that control how the Copy activity reads data from the source.

The type property of the Copy activity source must be set to BlobSource, which is a crucial setting to get right. This ensures that the activity knows how to interact with the data in the blob.

Being aware of the recursive property can also help you avoid issues. If set to true, the activity will read data recursively from subfolders, but be aware that this might not create empty folders or subfolders at the sink if it's a file-based store.

If you're dealing with a large dataset, you might need to limit the number of concurrent connections established to the data store during the activity run. This can be done by setting the maxConcurrentConnections property, but it's not required by default.

In short, the properties to know here are type (BlobSource), recursive, and maxConcurrentConnections.
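
A minimal source block using these properties might look like the following sketch; the concurrency value is arbitrary:

"source": {
    "type": "BlobSource",
    "recursive": true,
    "maxConcurrentConnections": 4
}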

Frequently Asked Questions

What is copy data activity in ADF?

The Copy data activity in ADF is a data transfer tool that enables you to move data between on-premises and cloud-based data stores. It's a key feature of Azure Data Factory and Synapse pipelines, making data integration easier and more efficient.

What is copy behavior in Azure Data Factory?

In Azure Data Factory, copy behavior determines how file hierarchies are preserved when copying files from a file-based data store. By default, PreserveHierarchy preserves the original file hierarchy in the target folder.

Which activity in Azure Data Factory can be used to copy data from Azure Blob storage to Azure SQL?

To copy data from Azure Blob storage to Azure SQL, use the CopyActivity in Azure Data Factory. For more details on CopyActivity settings, see Copy activity in Azure Data Factory.

What is the difference between copy activity and data flow in ADF?

In ADF, a copy activity configures source and sink settings within a pipeline, whereas a data flow configures all settings in a separate interface, with the pipeline serving as a wrapper. This difference affects how you design and manage data integration workflows.
