"It is important to maintain a conda dependency file and/or MLstudio environment. \r\n",
"\r\n",
"Every user of the workspace will use their own compute instance, with conda files and environments it is easy to install dependencies on these different compute instances."
"Create an experiment to track the runs in your notebook. A workspace can have muliple experiments. We will create an experiment specifically for training our model."
"By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU or CPU support. "
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"# list compute targets\r\n",
"print(ws.compute_targets.keys())"
],
"outputs": [],
"execution_count": null,
"metadata": {
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1646660707664
}
}
},
{
"cell_type": "markdown",
"source": [
"We will use our cluster. It is better to keep the training compute (which probably has better specs) seperate from the notebook compute. This ensure a lower cost (only use heavy compute in the place where it is needed) and a central compute instance for every user of the workspace."
"To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train.py` in the directory you just created. "
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"%%writefile $script_folder/train.py\r\n",
"\r\n",
"import os\r\n",
"import argparse\r\n",
"import joblib\r\n",
"\r\n",
"from azureml.core import Run\r\n",
"from azureml.core import Dataset as DatasetAzure\r\n",
"Anything written in this directory is automatically uploaded into your workspace. You'll access your model from this directory later in the tutorial."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "markdown",
"source": [
"### Configure the training job\n",
"\n",
"Create a **ScriptRunConfig** object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on. Configure the **ScriptRunConfig** by specifying:\n",
"\n",
"* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution. \n",
"* The compute target. In this case you will use the \"cpu-cluster\"\n",
"* The training script name, train.py\n",
"* An environment that contains the libraries needed to run the script\n",
"* Arguments required from the training script. "
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"from azureml.core import ScriptRunConfig\r\n",
"\r\n",
"args = ['--image-folder', image_dataset.as_mount(), # it is also possible to download image dataset on compute (as_download(), because mounting load files at the time of processing, it is usually faster than download.)\r\n",
"Run the experiment by submitting the ScriptRunConfig object. And you can navigate to Azure portal to monitor the run.\r\n",
"\r\n",
"Since the call is asynchronous, it returns a **Preparing** or **Running** state as soon as the job is started."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"run = exp.submit(config=src)\r\n",
"run"
],
"outputs": [],
"execution_count": null,
"metadata": {
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1646662413760
}
}
},
{
"cell_type": "markdown",
"source": [
"\r\n",
"## Monitor a remote run\r\n",
"\r\n",
"Here is what's happening while you wait:\r\n",
"\r\n",
"- **Image creation**: A Docker image is created matching the Python environment specified by the Azure ML environment. The image is built and stored in the ACR (Azure Container Registry) associated with your workspace. Image creation and uploading takes **about 5 minutes**. \r\n",
"\r\n",
" This stage happens once for each Python environment since the container is cached for subsequent runs. During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs. If you prebuild the image this step will be much quicker.\r\n",
"\r\n",
"- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. Scaling typically takes **about 5 minutes.**\r\n",
"\r\n",
"- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the files in the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.\r\n",
"\r\n",
"- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.\r\n",
"\r\n",
"\r\n",
"You can check the progress of a running job in multiple ways. This workshop uses a Jupyter widget it is also possible to use the `wait_for_completion` method. \r\n",
"\r\n",
"### Jupyter widget\r\n",
"\r\n",
"Watch the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"from azureml.widgets import RunDetails\r\n",
"RunDetails(run).show()"
],
"outputs": [],
"execution_count": null,
"metadata": {
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1646662414044
}
}
},
{
"cell_type": "markdown",
"source": [
"By the way, if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run)."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "markdown",
"source": [
"## View Experiment\n",
"In the left-hand menu in Azure Machine Learning Studio, select __Experiments__ and then select your experiment. An experiment is a grouping of many runs from a specified script or piece of code. Information for the run is stored under that experiment. If the name doesn't exist when you submit an experiment, if you select your run you will see various tabs containing metrics, logs, explanations, etc.\n",
"\n",
"## Register model\n",
"\n",
"The last step in the training script wrote the file `outputs/model.pth` in a directory named `outputs` in the VM of the cluster where the job is executed. `outputs` is a special directory in that all content in this directory is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace. Hence, the model file is now also available in your workspace.\n",
"\n",
"You can see files associated with that run."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"print(run.get_file_names())"
],
"outputs": [],
"execution_count": null,
"metadata": {
"gather": {
"logged": 1646400935407
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "markdown",
"source": [
"Register the model in the workspace so that you (or other collaborators) can later query, examine, and deploy this model."
"In machine learning, models are trained to predict unknown labels for new data based on correlations between known labels and features found in the training data. Depending on the algorithm used, you may need to specify hyperparameters to configure how the model is trained. For example, the logistic regression algorithm uses a regularization rate hyperparameter to counteract overfitting; and deep learning techniques for convolutional neural networks (CNNs) use hyperparameters like learning rate to control how weights are adjusted during training, and batch size to determine how many data items are included in each training batch.\r\n",
"\r\n",
"The choice of hyperparameter values can significantly affect the resulting model, making it important to select the best possible values for your particular data and predictive performance goals.\r\n",
"\r\n",
"Hyperparameter tuning is accomplished by training the multiple models, using the same algorithm and training data but different hyperparameter values. The resulting model from each training run is then evaluated to determine the performance metric for which you want to optimize (for example, accuracy), and the best-performing model is selected.\r\n",
"\r\n",
"In Azure Machine Learning, you achieve this through an experiment that consists of a hyperdrive run, which initiates a child run for each hyperparameter combination to be tested. Each child run uses a training script with parameterized hyperparameter values to train a model, and logs the target performance metric achieved by the trained model."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "markdown",
"source": [
"## Defining a search space\r\n",
"\r\n",
"The set of hyperparameter values tried during hyperparameter tuning is known as the search space. The definition of the range of possible values that can be chosen depends on the type of hyperparameter.\r\n",
"\r\n",
"### Discrete hyperparameters\r\n",
"\r\n",
"Some hyperparameters require discrete values - in other words, you must select the value from a particular set of possibilities. You can define a search space for a discrete parameter using a choice from a list of explicit values, which you can define as a Python list (choice([10,20,30])), a range (choice(range(1,10))), or an arbitrary set of comma-separated values (choice(30,50,100))\r\n",
"\r\n",
"### Continuous hyperparameters\r\n",
"\r\n",
"Some hyperparameters are continuous - in other words you can use any value along a scale. To define a search space for these kinds of value, you can use any of the following distribution types:\r\n",
"\r\n",
"- normal\r\n",
"- uniform\r\n",
"- lognormal\r\n",
"- loguniform\r\n",
"\r\n",
"### Defining a search space\r\n",
"\r\n",
"To define a search space for hyperparameter tuning, create a dictionary with the appropriate parameter expression for each named hyperparameter. For example, the following search space indicates that the learning rate hyperparameter can have the value 5e-5 or 4e-5. The learning_rate hyperparameter can also have any value from a normal distribution with a mean of 5e-5 and a standard deviation of 1e-5.\r\n",
"The specific values used in a hyperparameter tuning run depend on the type of sampling used.\r\n",
"\r\n",
"### Grid sampling\r\n",
"\r\n",
"Grid sampling can only be employed when all hyperparameters are discrete, and is used to try every possible combination of parameters in the search space.\r\n",
"\r\n",
"### Random sampling\r\n",
"\r\n",
"Random sampling is used to randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values.\r\n",
"\r\n",
"### Bayesian sampling\r\n",
"\r\n",
"Bayesian sampling chooses hyperparameter values based on the Bayesian optimization algorithm, which tries to select parameter combinations that will result in improved performance from the previous selection. \r\n",
"With a sufficiently large hyperparameter search space, it could take many iterations (child runs) to try every possible combination. Typically, you set a maximum number of iterations, but this could still result in a large number of runs that don't result in a better model than a combination that has already been tried.\r\n",
"\r\n",
"To help prevent wasting time, you can set an early termination policy that abandons runs that are unlikely to produce a better result than previously completed runs. The policy is evaluated at an evaluation_interval you specify, based on each time the target performance metric is logged. You can also set a delay_evaluation parameter to avoid evaluating the policy until a minimum number of iterations have been completed.\r\n",
"\r\n",
"## Bandit policy\r\n",
"\r\n",
"You can use a bandit policy to stop a run if the target performance metric underperforms the best run so far by a specified margin.\r\n",
"\r\n",
"## Median stopping policy\r\n",
"\r\n",
"A median stopping policy abandons runs where the target performance metric is worse than the median of the running averages for all runs.\r\n",
"\r\n",
"## Truncation selection policy\r\n",
"\r\n",
"A truncation selection policy cancels the lowest performing X% of runs at each evaluation interval based on the truncation_percentage value you specify for X."
"This example applies the policy for every iteration after the first one, and abandons runs where the reported target metric is 0.2 or more worse than the best performing run after the same number of intervals.\r\n",
"\r\n",
"You can also apply a bandit policy using a slack factor, which compares the performance metric as a ratio rather than an absolute value."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "markdown",
"source": [
"## Running hyperparameter tuning\r\n",
"\r\n",
"To run a hyperdrive experiment, you need to create a training script just the way you would do for any other training experiment, except that your script must:\r\n",
"\r\n",
"Include an argument for each hyperparameter you want to vary.\r\n",
"Log the target performance metric. This enables the hyperdrive run to evaluate the performance of the child runs it initiates, and identify the one that produces the best performing model.\r\n",
"\r\n",
"We will use the previous training script and use the 'valid score per epoch' as a tracking metric."
%% Cell type:markdown id: tags:
It is important to maintain a conda dependency file and/or an Azure ML Studio environment.
Every user of the workspace will use their own compute instance; with conda files and environments it is easy to install dependencies on these different compute instances.
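The environment itself can be created from such a conda specification; a minimal sketch (the environment name and conda file name are assumptions for this workshop) looks like this:
``` python
from azureml.core import Environment

env_name = 'training-env'  # assumed environment name
# create an Environment object from a conda dependency file
env = Environment.from_conda_specification(name=env_name, file_path='environment.yml')
```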
%% Cell type:code id: tags:
``` python
from azureml.core import Environment

# We can directly build the environment - this will create a new Docker
# image in Azure Container Registry (ACR), and directly 'bake in' our dependencies
# from the conda definition. When we later use the Environment, all AML will need to
# do is pull the image for the environment, thus saving the time of a potentially
# long-running conda environment creation.
if env_name not in Environment.list(workspace=ws):  # reconstructed condition: build only if not registered yet
    build = env.build(workspace=ws)
    build.wait_for_completion(show_output=True)
else:
    # load existing environment
    env = Environment.get(workspace=ws, name=env_name)
```
%% Cell type:markdown id: tags:
## Create experiment
Create an experiment to track the runs in your notebook. A workspace can have multiple experiments. We will create an experiment specifically for training our model.
%% Cell type:code id: tags:
``` python
from azureml.core import Experiment

experiment_name = 'train_model_name'
exp = Experiment(workspace=ws, name=experiment_name)
```
%% Cell type:markdown id: tags:
### Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU or CPU support.
%% Cell type:code id: tags:
``` python
# list compute targets
print(ws.compute_targets.keys())
```
%% Cell type:markdown id: tags:
We will use our cluster. It is better to keep the training compute (which probably has better specs) separate from the notebook compute. This ensures a lower cost (heavy compute is only used where it is needed) and a central compute instance for every user of the workspace.
%% Cell type:code id: tags:
``` python
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose compute target. Look at compute tab -> clusters for options OR look at list in cell above.
compute_name = "cpu-cluster"

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    print("found compute target: " + compute_name)
else:
    print("Compute not found, create compute in compute tab (cluster) with subnet in advanced settings if working in production subscription.")
```
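If the cluster does not exist and you prefer to create it from code rather than the Studio compute tab, a sketch along the following lines could replace the print in the else branch; the VM size and node counts here are assumptions:
``` python
from azureml.core.compute import AmlCompute, ComputeTarget

# provision a small CPU cluster (adjust vm_size / node counts to your subscription)
provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2',
                                                            min_nodes=0,
                                                            max_nodes=4)
compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
compute_target.wait_for_completion(show_output=True)
```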
%% Cell type:markdown id: tags:
## Import Data
Before you train a model, you need to understand the data you're using to train it. In this section we will:
* Load datasets created in the ETL.ipynb notebook
* Display some sample images
Let's connect to the dataset by mounting it on the compute, as sketched below. It is also possible to download it onto the compute.
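A sketch of reconnecting to the registered image dataset follows; the registered dataset name is an assumption, while the `image_dataset` variable is what the training configuration below expects:
``` python
from azureml.core import Dataset

# retrieve the dataset registered by the ETL.ipynb notebook - the name is an assumption
image_dataset = Dataset.get_by_name(ws, name='image_dataset')
```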
For this task, you submit the job to run on the remote training cluster you set up earlier. To submit a job you:
* Create a directory
* Create a training script
* Create a script run configuration
* Submit the job
### Create a directory
Create a directory to deliver the necessary code from your computer to the remote resource.
%% Cell type:code id: tags:
``` python
import os

script_folder = os.path.join(os.getcwd(), "scripts")
os.makedirs(script_folder, exist_ok=True)
```
%% Cell type:markdown id: tags:
### Create a training script
To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train.py` in the directory you just created.
Anything written in this directory is automatically uploaded into your workspace. You'll access your model from this directory later in the tutorial.
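The full training script is not reproduced here; the skeleton below sketches its overall shape (argument parsing, metric logging, and saving to `outputs/`), mirroring the `--image-folder` argument, the 'valid score per epoch' metric and the `outputs/model.pth` file referenced elsewhere in this notebook. The model definition and training loop are omitted and would need to be filled in:
``` python
%%writefile $script_folder/train.py

import os
import argparse

from azureml.core import Run

# parse the arguments passed in by the ScriptRunConfig
parser = argparse.ArgumentParser()
parser.add_argument('--image-folder', type=str, dest='image_folder',
                    help='mounted folder containing the training images')
parser.add_argument('--learning-rate', type=float, dest='learning_rate', default=5e-5,
                    help='learning rate (varied later by the hyperdrive run)')
args = parser.parse_args()

# get the run context so metrics are logged to the workspace
run = Run.get_context()

# ... build and train the model on the data under args.image_folder (omitted) ...
# log the metric that is tracked (and later optimized by hyperdrive), e.g.:
# run.log('valid score per epoch', valid_score)

# anything saved to ./outputs is uploaded to the run automatically
os.makedirs('outputs', exist_ok=True)
# torch.save(model.state_dict(), os.path.join('outputs', 'model.pth'))
```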
%% Cell type:markdown id: tags:
### Configure the training job
Create a **ScriptRunConfig** object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on. Configure the **ScriptRunConfig** by specifying:
* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
* The compute target. In this case you will use the "cpu-cluster"
* The training script name, train.py
* An environment that contains the libraries needed to run the script
* Arguments required by the training script.
%% Cell type:code id: tags:
``` python
from azureml.core import ScriptRunConfig

# mounting is usually faster than as_download(), since files are loaded only when processed
args = ['--image-folder', image_dataset.as_mount()]

# configure the run with the directory, script, arguments, compute target and environment described above
src = ScriptRunConfig(source_directory=script_folder, script='train.py', arguments=args,
                      compute_target=compute_target, environment=env)
```
%% Cell type:markdown id: tags:
Run the experiment by submitting the ScriptRunConfig object. You can navigate to the Azure portal to monitor the run.
Since the call is asynchronous, it returns a **Preparing** or **Running** state as soon as the job is started.
%% Cell type:code id: tags:
``` python
run = exp.submit(config=src)
run
```
%% Cell type:markdown id: tags:
## Monitor a remote run
Here is what's happening while you wait:
- **Image creation**: A Docker image is created matching the Python environment specified by the Azure ML environment. The image is built and stored in the ACR (Azure Container Registry) associated with your workspace. Image creation and uploading takes **about 5 minutes**. This stage happens once for each Python environment since the container is cached for subsequent runs. During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs. If you prebuild the image this step will be much quicker.
- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. Scaling typically takes **about 5 minutes**.
- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the files in the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.
- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.
You can check the progress of a running job in multiple ways. This workshop uses a Jupyter widget, but it is also possible to use the `wait_for_completion` method.
### Jupyter widget
Watch the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.
%% Cell type:code id: tags:
``` python
from azureml.widgets import RunDetails
RunDetails(run).show()
```
%% Cell type:markdown id: tags:
By the way, if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).
%% Cell type:markdown id: tags:
## View Experiment
In the left-hand menu in Azure Machine Learning Studio, select __Experiments__ and then select your experiment. An experiment is a grouping of many runs from a specified script or piece of code. Information for the run is stored under that experiment. If the name doesn't exist when you submit an experiment, a new experiment is automatically created with that name. If you select your run you will see various tabs containing metrics, logs, explanations, etc.
## Register model
The last step in the training script wrote the file `outputs/model.pth` in a directory named `outputs` in the VM of the cluster where the job is executed. `outputs` is a special directory in that all content in this directory is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace. Hence, the model file is now also available in your workspace.
You can see files associated with that run.
%% Cell type:code id: tags:
``` python
print(run.get_file_names())
```
%% Cell type:markdown id: tags:
Register the model in the workspace so that you (or other collaborators) can later query, examine, and deploy this model.
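A minimal sketch of registering the trained model from the run is shown below; the model name is an assumption for this workshop, while the file path matches the `outputs/model.pth` written by the training script:
``` python
# register the model file that the run uploaded to its outputs directory
model = run.register_model(model_name='trained_model',  # assumed name
                           model_path='outputs/model.pth')
print(model.name, model.version)
```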
## Hyperparameter tuning
In machine learning, models are trained to predict unknown labels for new data based on correlations between known labels and features found in the training data. Depending on the algorithm used, you may need to specify hyperparameters to configure how the model is trained. For example, the logistic regression algorithm uses a regularization rate hyperparameter to counteract overfitting; and deep learning techniques for convolutional neural networks (CNNs) use hyperparameters like learning rate to control how weights are adjusted during training, and batch size to determine how many data items are included in each training batch.
The choice of hyperparameter values can significantly affect the resulting model, making it important to select the best possible values for your particular data and predictive performance goals.
Hyperparameter tuning is accomplished by training multiple models, using the same algorithm and training data but different hyperparameter values. The resulting model from each training run is then evaluated to determine the performance metric for which you want to optimize (for example, accuracy), and the best-performing model is selected.
In Azure Machine Learning, you achieve this through an experiment that consists of a hyperdrive run, which initiates a child run for each hyperparameter combination to be tested. Each child run uses a training script with parameterized hyperparameter values to train a model, and logs the target performance metric achieved by the trained model.
%% Cell type:markdown id: tags:
## Defining a search space
The set of hyperparameter values tried during hyperparameter tuning is known as the search space. The definition of the range of possible values that can be chosen depends on the type of hyperparameter.
### Discrete hyperparameters
Some hyperparameters require discrete values - in other words, you must select the value from a particular set of possibilities. You can define a search space for a discrete parameter using a choice from a list of explicit values, which you can define as a Python list (`choice([10,20,30])`), a range (`choice(range(1,10))`), or an arbitrary set of comma-separated values (`choice(30,50,100)`).
### Continuous hyperparameters
Some hyperparameters are continuous - in other words, you can use any value along a scale. To define a search space for these kinds of values, you can use any of the following distribution types:
- normal
- uniform
- lognormal
- loguniform
### Defining a search space
To define a search space for hyperparameter tuning, create a dictionary with the appropriate parameter expression for each named hyperparameter. For example, the following search space indicates that the learning rate hyperparameter can have the value 5e-5 or 4e-5. The learning_rate hyperparameter can also have any value from a normal distribution with a mean of 5e-5 and a standard deviation of 1e-5.
%% Cell type:code id: tags:
``` python
from azureml.train.hyperdrive import choice, normal

param_space = {
    '--learning-rate': choice(5e-5, 4e-5)
    # '--learning_rate': normal(5e-5, 1e-5)
}
```
%% Cell type:markdown id: tags:
## Configuring sampling
The specific values used in a hyperparameter tuning run depend on the type of sampling used.
### Grid sampling
Grid sampling can only be employed when all hyperparameters are discrete, and is used to try every possible combination of parameters in the search space.
### Random sampling
Random sampling is used to randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values.
### Bayesian sampling
Bayesian sampling chooses hyperparameter values based on the Bayesian optimization algorithm, which tries to select parameter combinations that will result in improved performance from the previous selection.
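As an illustration, the search space defined above could be sampled as in the following sketch; random sampling is shown, with the alternatives commented out:
``` python
from azureml.train.hyperdrive import GridParameterSampling, RandomParameterSampling, BayesianParameterSampling

# random sampling over the param_space dictionary defined earlier
param_sampling = RandomParameterSampling(param_space)
# param_sampling = GridParameterSampling(param_space)      # discrete hyperparameters only
# param_sampling = BayesianParameterSampling(param_space)  # note: does not support early termination policies
```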
## Early termination
With a sufficiently large hyperparameter search space, it could take many iterations (child runs) to try every possible combination. Typically, you set a maximum number of iterations, but this could still result in a large number of runs that don't result in a better model than a combination that has already been tried.
To help prevent wasting time, you can set an early termination policy that abandons runs that are unlikely to produce a better result than previously completed runs. The policy is evaluated at an evaluation_interval you specify, based on each time the target performance metric is logged. You can also set a delay_evaluation parameter to avoid evaluating the policy until a minimum number of iterations have been completed.
## Bandit policy
You can use a bandit policy to stop a run if the target performance metric underperforms the best run so far by a specified margin.
## Median stopping policy
A median stopping policy abandons runs where the target performance metric is worse than the median of the running averages for all runs.
## Truncation selection policy
A truncation selection policy cancels the lowest performing X% of runs at each evaluation interval based on the truncation_percentage value you specify for X.
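As an illustration, here is a minimal bandit policy sketch; the interval and slack values are the ones described in the next sentence:
``` python
from azureml.train.hyperdrive import BanditPolicy

# evaluate the policy at every interval, with an absolute slack of 0.2
early_termination_policy = BanditPolicy(evaluation_interval=1, slack_amount=0.2)
```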
This example applies the policy for every iteration after the first one, and abandons runs where the reported target metric is 0.2 or more worse than the best performing run after the same number of intervals.
You can also apply a bandit policy using a slack factor, which compares the performance metric as a ratio rather than an absolute value.
%% Cell type:markdown id: tags:
## Running hyperparameter tuning
To run a hyperdrive experiment, you need to create a training script just as you would for any other training experiment, except that your script must:
- Include an argument for each hyperparameter you want to vary.
- Log the target performance metric. This enables the hyperdrive run to evaluate the performance of the child runs it initiates, and identify the one that produces the best performing model.
We will use the previous training script and use the 'valid score per epoch' as a tracking metric.
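Putting the pieces together, a hyperdrive run could be configured and submitted roughly as follows; the sampling and early termination objects are the ones sketched above, and the run budget values are assumptions:
``` python
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

# combine the script run configuration, the search space sampling and the early termination policy
hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination_policy,
                                     primary_metric_name='valid score per epoch',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,  # assuming a higher score is better
                                     max_total_runs=8,        # assumed budget
                                     max_concurrent_runs=2)   # assumed concurrency

hyperdrive_run = exp.submit(config=hyperdrive_config)
RunDetails(hyperdrive_run).show()
```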