Azure Synapse Studio Notebook is a powerful tool for data management, allowing you to create, execute, and manage notebooks in a centralized environment.
You can create notebooks from scratch or use templates, which can save you time and effort.
With Azure Synapse Studio Notebook, you can write code in several languages, including PySpark (Python), Scala, Spark SQL, .NET for Spark (C#), and R, and run it interactively.
This feature is particularly useful for data scientists and analysts who need to perform complex data analysis and modeling tasks.
To get started with Azure Synapse Studio Notebook, you'll need to create a workspace and a notebook, which can be done in just a few clicks.
You can also connect to various data sources, including Azure Synapse Analytics, Azure SQL Database, and Azure Blob Storage.
This flexibility makes Azure Synapse Studio Notebook an ideal choice for data management tasks that require access to multiple data sources.
Getting Started
To get started with Azure Synapse Studio Notebook, you'll need to create a new notebook or open an existing one.
The notebook is where you'll write and run your code, so it's essential to understand how to navigate and use it effectively.
You can create a new notebook by clicking on the "New Notebook" button in the top left corner of the Azure Synapse Studio interface.
To open an existing notebook, simply click on the notebook file you want to work with in the file explorer.
To start writing code, you'll need to select a language from the dropdown menu at the top of the notebook.
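Once the language is selected, you can run your first cell. Here's a minimal sketch assuming PySpark (Python), the default language; the `spark` session object is created for you when the first cell runs, which can take a few minutes while the Spark session starts:

```python
# A minimal first cell: confirm the notebook is attached to a running Spark session.
print("Hello from Azure Synapse Studio Notebook")
print(spark.version)  # the pre-created Spark session object
```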
Data Management
Data management is central to working in Azure Synapse Studio Notebook. You can load data from various sources such as Azure Data Lake Storage Gen2, Azure Blob Storage, and SQL pools.
To manage temporary data, the Azure Synapse connector creates a subdirectory with a unique format under the user-supplied tempDir location, which makes cleanup easier. You can set that location up to be deleted periodically.
You can also drop the whole container and create a new one with the same name, but this requires using a dedicated container for temporary data and finding a time window when no queries are running.
Bring Data
Data can be loaded from various sources, including Azure Data Lake Storage Gen2, Azure Blob Storage, and SQL pools. Azure Data Lake Storage Gen2 and Azure Blob Storage are both cloud-based storage options that offer scalability and reliability, and they provide a convenient way to access and manage data.
Loading data from these sources is a straightforward process, as shown in the examples below.
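Here's a minimal sketch of loading data with PySpark, assuming access to the storage accounts is already configured (for example through the workspace's linked services); the account, container, and file names are placeholders:

```python
# Read a Parquet file from Azure Data Lake Storage Gen2 (placeholder path).
adls_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/data/sample.parquet"
df = spark.read.parquet(adls_path)
df.show(10)

# Read a CSV file from Azure Blob Storage (placeholder path).
blob_path = "wasbs://<container>@<storage-account>.blob.core.windows.net/data/sample.csv"
df_csv = spark.read.csv(blob_path, header=True)
df_csv.show(10)
```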
Reference an Unpublished Notebook
Referencing an unpublished notebook can streamline your development workflow. You can enable the feature by selecting the appropriate checkbox on the Properties pane.
When Reference an Unpublished Notebook is disabled, a reference run always executes the published version of a notebook. When it's enabled, the reference run adopts the current version of the notebook that appears in the notebook UX.
In Live mode, enabling the setting runs the new version of a notebook, while disabling it runs the published version. In Git mode, enabling it runs the committed version of a notebook.
Here's a quick rundown of what happens in different scenarios:
- Live mode, disabled: the published version of the notebook runs.
- Live mode, enabled: the new (unpublished) version of the notebook runs.
- Git mode, disabled: the published version of the notebook runs.
- Git mode, enabled: the committed version of the notebook runs.
By enabling Reference an Unpublished Notebook, you can prevent the pollution of common libraries during development and debugging.
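For context, a notebook is referenced with the %run magic command. Here's a minimal sketch, assuming a notebook named Notebook1 in a folder named Folder1 and illustrative parameter names:

```
%run /Folder1/Notebook1 { "parameterInt": 1, "parameterString": "abc" }
```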
Spark Configuration
You can configure a Spark session in Azure Synapse Studio Notebook by selecting the gear icon at the top of the notebook, which takes you to the Configure session pane. From there, you can specify the timeout duration, the number of executors, and the size of executors to give to the current Spark session.
To make configuration changes take effect, you'll need to restart the Spark session, which will also clear all cached notebook variables. You can also create a configuration from the Apache Spark configuration or select an existing configuration, but for details on how to do that, you'll need to refer to the Manage Apache Spark configuration section.
You can also use the magic command `%%configure` to specify Spark session settings. This command is recommended to be run at the beginning of your notebook, and it allows you to set various Spark configuration properties in the "conf" body. However, be aware that some special Spark properties won't take effect in this way, including "spark.driver.cores", "spark.executor.cores", "spark.driver.memory", "spark.executor.memory", and "spark.executor.instances".
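Here's a sketch of what a %%configure cell can look like; the values are illustrative, and the field names follow the Livy session request body (driverMemory, driverCores, executorMemory, executorCores, numExecutors):

```
%%configure
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "28g",
    "executorCores": 4,
    "numExecutors": 4,
    "conf": {
        "spark.driver.maxResultSize": "10g"
    }
}
```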
Configuring a Spark Session
Configuring a Spark session is a crucial step in setting up your Apache Spark environment. You can configure a Spark session on the Configure session pane, which can be found by selecting the gear icon at the top of the notebook.
To configure a Spark session, you can specify the timeout duration, the number of executors, and the size of executors to give to the current Spark session. Restart the Spark session for configuration changes to take effect; doing so clears all cached notebook variables.
You can also create a configuration from the Apache Spark configuration or select an existing configuration. For details, refer to Manage Apache Spark configuration.
There are two ways to configure a Spark session: through the Configure session pane or with the %%configure magic command.
Here are some key points to keep in mind when using the magic command %%configure:
- You must use the standard Spark configuration properties in the "conf" body.
- Some special Spark properties won't take effect in the "conf" body, including "spark.driver.cores", "spark.executor.cores", "spark.driver.memory", "spark.executor.memory", and "spark.executor.instances".
- We recommend that you use the same value for driverMemory and executorMemory in %%configure, and also use the same value for driverCores and executorCores.
- You can use %%configure in Synapse pipelines, but if you don't set it in the first code cell, the pipeline run will fail because it can't restart the session.
Parameterized Session Configuration
You can use parameterized session configuration to replace values in the %%configure magic command with pipeline run parameters. This allows you to override default values by using an object.
The notebook uses the default value if you run the notebook in interactive mode directly or if the pipeline notebook activity doesn't provide a parameter that matches "activityParameterName". This is useful for testing and development.
In pipeline run mode, you can use the Settings tab of a pipeline notebook activity to configure settings. The pipeline notebook activity parameter must have the same name as activityParameterName in the notebook.
To change the session configuration, replace the default value with a new one in the pipeline notebook activity parameter. For example, passing 8 in the activity parameter replaces the default driverCores value in %%configure.
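Here's a minimal sketch of a parameterized %%configure cell following the pattern described above; the activity parameter names and default values are illustrative:

```
%%configure
{
    "driverCores": {
        "activityParameterName": "driverCoresFromNotebookActivity",
        "defaultValue": 4
    },
    "conf": {
        "livy.rsc.sql.num-rows": {
            "activityParameterName": "rows",
            "defaultValue": "2000"
        }
    }
}
```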
By using parameterized session configuration, you can make your notebooks more flexible and reusable. This is especially useful when working with large datasets or complex pipelines.
Temporary Data
Temporary data management is crucial in Azure Synapse Studio Notebook, and there are a few things to keep in mind.
You can't reference data or variables directly across different languages in a Synapse notebook, but you can use temporary tables as a workaround. This involves creating a temporary table in one language and then querying it in another.
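For example, here's a sketch of the workaround with a PySpark cell and a Spark SQL cell in the same session; the view name and data are illustrative:

```
%%pyspark
# Create a DataFrame in Python and expose it to other languages as a temporary view.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.createOrReplaceTempView("shared_temp_table")
```

```
%%sql
-- Query the same data from a Spark SQL cell in the same Spark session.
SELECT id, label FROM shared_temp_table
```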
Temporary data management is also important to avoid cluttering your storage container. The Azure Synapse connector doesn't delete temporary files, so you need to periodically delete them to keep your storage organized.
To facilitate data cleanup, you can set up periodic jobs to recursively delete subdirectories that are older than a given threshold. Alternatively, you can drop the whole container and create a new one with the same name.
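Here's a rough sketch of such a periodic cleanup, assuming the connector's first-level subdirectories under tempDir are named by date (adjust the parsing to the naming scheme you actually see) and that mssparkutils is available; the tempDir path and retention window are placeholders:

```python
from datetime import datetime, timedelta
from notebookutils import mssparkutils

# Placeholder tempDir location used for the connector's temporary data.
temp_dir = "abfss://<container>@<storage-account>.dfs.core.windows.net/synapse/tempdir"
cutoff = datetime.utcnow() - timedelta(days=7)  # placeholder retention threshold

for entry in mssparkutils.fs.ls(temp_dir):
    try:
        # Assumed naming scheme: one subdirectory per day, e.g. 2024-01-31/
        created = datetime.strptime(entry.name.rstrip("/"), "%Y-%m-%d")
    except ValueError:
        continue  # skip entries that don't match the assumed date format
    if created < cutoff:
        mssparkutils.fs.rm(entry.path, True)  # recursively delete the old subdirectory
```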
Temporary objects are also created behind the scenes when using the Azure Synapse connector. These objects include DATABASE SCOPED CREDENTIAL, EXTERNAL DATA SOURCE, EXTERNAL FILE FORMAT, and EXTERNAL TABLE, and they're automatically dropped at the end of the Spark job.
However, if the Spark driver process crashes or is forcefully restarted, or if the cluster is forcefully terminated or restarted, temporary objects might not be dropped. To identify and manually delete these objects, you can use queries like the following:
- `SELECT * FROM sys.database_scoped_credentials WHERE name LIKE 'tmp_databricks_%'`
- `SELECT * FROM sys.external_data_sources WHERE name LIKE 'tmp_databricks_%'`
- `SELECT * FROM sys.external_file_formats WHERE name LIKE 'tmp_databricks_%'`
- `SELECT * FROM sys.external_tables WHERE name LIKE 'tmp_databricks_%'`
By following these best practices, you can keep your temporary data under control and ensure a smooth experience in Azure Synapse Studio Notebook.
Analytics
Azure Synapse Studio Notebook integrates with Azure Synapse Analytics, and a pipeline's Synapse Notebook activity lets you run a notebook in your Azure Synapse Analytics workspace.
Select the new Synapse Notebook activity on the canvas, then open the Azure Synapse Analytics (Artifacts) tab to choose or create the linked service that will run the Notebook activity. This is a crucial step in setting up your analytics workflow.
Then switch to the Settings tab to select the notebook to run and configure its parameters.
Use Multiple Languages
If you're working with multiple languages in a Synapse notebook, you can use magic commands to switch between languages. The magic commands are: %%pyspark for Python, %%spark for Scala, %%sql for Spark SQL, %%csharp for .NET for Spark C#, and %%sparkr for R.
You can use these commands to write queries in different languages, and the primary language for the notebook is set to PySpark by default. For example, you can write a PySpark query by using the %%pyspark magic command, or a Spark SQL query by using the %%sql magic command.
The magic commands to switch cell languages are:
- %%pyspark — Python (PySpark)
- %%spark — Scala
- %%sql — Spark SQL
- %%csharp — .NET for Spark (C#)
- %%sparkr — R
By using these magic commands, you can easily switch between languages and write queries in the language of your choice.
Use IDE-Style IntelliSense
Using IDE-style IntelliSense in Synapse notebooks can significantly speed up your coding process. It brings features like syntax highlighting, error markers, and automatic code completion to the cell editor.
With IntelliSense, you can write code faster and identify issues more easily. The features are at different levels of maturity for different languages.
For PySpark (Python), all IntelliSense features are supported, including syntax highlight, error marker, syntax code completion, variable code completion, system function code completion, user function code completion, smart indent, and code folding.
Spark (Scala) also supports all IntelliSense features except smart indent. Spark SQL supports syntax highlight, error marker, and syntax code completion, but lacks variable code completion, system function code completion, user function code completion, smart indent, and code folding.
.NET for Spark (C#) supports all IntelliSense features, but requires an active Spark session for variable code completion, system function code completion, and user function code completion.
Here's a summary of IntelliSense support by language:
- PySpark (Python): all features supported.
- Spark (Scala): all features except smart indent.
- Spark SQL: syntax highlighting, error markers, and syntax code completion only.
- .NET for Spark (C#): all features, with variable, system function, and user function code completion requiring an active Spark session.
Analytics Properties
When working with Azure Synapse Analytics, it's essential to understand the properties involved in the Notebook activity.
To name the activity in the pipeline, you'll need to provide a unique name. This is a required field, so don't forget to fill it in.
The type of activity is always SynapseNotebook, so you don't have to worry about selecting that. However, you will need to specify the name of the notebook you want to run in Azure Synapse Analytics.
You can also choose to specify a Spark pool, but this is not required. If you do use one, make sure to select the correct pool.
The parameter field is also optional, but if you need to pass parameters to your notebook, this is where you would do it.
Analytics Activity Definition
The Azure Synapse Analytics Notebook activity definition is a crucial part of any pipeline that runs a notebook. The activity is defined with JSON in the pipeline definition.
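Here is a sketch of what such a definition can look like. It only illustrates the general shape; the names, Spark pool, and parameter are placeholders, and the authoritative schema is in the Azure Data Factory documentation on transforming data with a Synapse notebook:

```json
{
    "name": "MyNotebookActivity",
    "description": "Runs a Synapse notebook from the pipeline",
    "type": "SynapseNotebook",
    "linkedServiceName": {
        "referenceName": "MySynapseWorkspace",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebook": {
            "referenceName": "MyNotebook",
            "type": "NotebookReference"
        },
        "sparkPool": {
            "referenceName": "MySparkPool",
            "type": "BigDataPoolReference"
        },
        "parameters": {
            "myParameter": {
                "value": "abc",
                "type": "string"
            }
        }
    }
}
```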
The name of the activity in the pipeline is a required property. It's a simple text field where you can enter a name for your activity. The description property is not required, but it's a good idea to include it to provide context about what the activity does.
Here are the properties you need to define for an Azure Synapse Analytics Notebook activity:
- name (required): the name of the activity in the pipeline.
- description (optional): text describing what the activity does.
- type (required): must be set to SynapseNotebook.
- notebook (required): the name of the notebook to run in Azure Synapse Analytics.
- spark pool (optional): the Spark pool the notebook runs on.
- parameters (optional): values to pass to the notebook.
Variable Explorer
The variable explorer is a powerful tool that helps you keep track of your variables in a Synapse notebook. It's a table that lists variables in the current Spark session for PySpark cells, including columns for variable name, type, length, and value.
Selecting each column header sorts the variables in the table, making it easy to find what you need. The table automatically updates as you define new variables in your code cells.
To open or hide the variable explorer, simply select the Variables button on the notebook command bar.
Read Cell Value
Reading cell values is a useful step in the analytics process. A pipeline activity can read a notebook cell's output value; for details, refer to Transform data by running a Synapse notebook.
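A common pattern is to return a value from the notebook with mssparkutils.notebook.exit and then read it from the Notebook activity's output in the pipeline (the referenced article shows the exact expression). The returned value below is illustrative:

```python
from notebookutils import mssparkutils

# Return a value to the calling pipeline activity at the end of the notebook run.
result = {"rows_processed": 42}  # illustrative value
mssparkutils.notebook.exit(str(result))
```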
Passing Dynamic Parameters
When working with dynamic parameters in Synapse, you'll need to use the mssparkutils utilities, because the %run magic command doesn't directly support passing dynamically computed parameters to the notebook being run.
You can run a different notebook in Synapse while supplying dynamic parameters with mssparkutils.notebook.run, passing the notebook path, a timeout, and a dictionary of parameters, as sketched below.
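Here's a sketch of the call; the notebook path, timeout, and parameter names are placeholders:

```python
from notebookutils import mssparkutils

# Run another notebook in the same workspace, passing dynamically computed parameters.
exit_value = mssparkutils.notebook.run(
    "/path/to/notebook",                       # placeholder notebook path
    90,                                        # timeout in seconds
    {"param1": "value1", "param2": "value2"}   # parameters available in the target notebook
)
print(exit_value)  # whatever the target notebook passed to mssparkutils.notebook.exit
```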
For more information, refer to the Microsoft documentation on Microsoft Spark Utilities, which is a great resource for learning more about how to work with Spark in Synapse.
Frequently Asked Questions
What are notebooks in Azure Synapse?
In Azure Synapse, notebooks are interactive web interfaces for creating files that combine live code, visualizations, and narrative text. They provide a dynamic way to explore and analyze data.
Can all notebooks in an Azure Synapse Studio workspace be saved at once?
Yes, all notebooks in an Azure Synapse Studio workspace can be saved at once by selecting the "Publish all" button on the workspace command bar. This saves changes to every notebook in the workspace with a single click.
What are Azure Data Studio notebooks?
Azure Data Studio notebooks are based on Jupyter notebooks with a customized front-end, tailored to fit within the Azure Data Studio experience. They provide a powerful interactive environment for data analysis and visualization.
Sources
- https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks
- https://docs.databricks.com/en/connect/external-systems/synapse-analytics.html
- https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/quickstart-apache-spark-notebook.md
- https://learn.microsoft.com/en-us/azure/data-factory/transform-data-synapse-notebook
- https://stackoverflow.com/questions/78007888/pyspark-run-magic-command-in-synapse-notebook-pass-dynamic-parameter