%% Cell type:markdown id: tags:
![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/NotebookVM/tutorials/quickstart-ci/AzureMLin10mins.png)
%% Cell type:markdown id: tags:
# Setup
It is important to maintain a conda dependency file and/or an ML studio environment.
Every user of the workspace works on their own compute instance; with conda files and registered environments it is easy to install the same dependencies on each of those compute instances.
%% Cell type:code id: tags:
``` python
!conda env update -n workshop_env --file conda-notebook.yml
!conda activate workshop_env
!python -m ipykernel install --user --name=workshop_env --display-name=workshop_env
```
%% Cell type:markdown id: tags:
Refresh the page and change the kernel to workshop_env.
%% Cell type:markdown id: tags:
Connect to the workspace so that later Azure ML commands run against it.
%% Cell type:code id: tags:
``` python
from azureml.core import Workspace
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')
```
%% Cell type:markdown id: tags:
## Create training environment
We will use the environment created in **ETL.ipynb**
%% Cell type:code id: tags:
``` python
from azureml.core import Environment

new_update_env = False
env_name = 'workshop-training-env'
# pathing in notebook folder
conda_path = 'conda-training.yml'

if new_update_env:
    # create new environment
    env = Environment.from_conda_specification(name=env_name, file_path=conda_path)
    env.register(workspace=ws)
    # We can directly build the environment - this will create a new Docker
    # image in Azure Container Registry (ACR), and directly 'bake in' our dependencies
    # from the conda definition. When we later use the Environment, all AML will need to
    # do is pull the image for the environment, thus saving the time of a potentially
    # long-running conda environment creation.
    build = env.build(workspace=ws)
    build.wait_for_completion(show_output=True)
else:
    # load existing environment
    env = Environment.get(workspace=ws, name=env_name)
```
%% Cell type:markdown id: tags:
## Create experiment
Create an experiment to track the runs in your notebook. A workspace can have multiple experiments. We will create an experiment specifically for training our model.
%% Cell type:code id: tags:
``` python
from azureml.core import Experiment
experiment_name = 'train_model_name'
exp = Experiment(workspace=ws, name=experiment_name)
```
%% Cell type:markdown id: tags:
### Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU or CPU support.
%% Cell type:code id: tags:
``` python
# list compute targets
print(ws.compute_targets.keys())
```
%% Cell type:markdown id: tags:
We will use our cluster. It is better to keep the training compute (which typically has better specs) separate from the notebook compute. This ensures lower cost (heavy compute is only used where it is needed) and a central training cluster shared by every user of the workspace.
%% Cell type:code id: tags:
``` python
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose compute target. Look at compute tab -> clusters for options OR look at list in cell above.
compute_name = "cpu-cluster"
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    print("found compute target: " + compute_name)
else:
    print("Compute not found, create compute in compute tab (cluster) with subnet in advanced settings if working in production subscription.")
```
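%% Cell type:markdown id: tags:
If the cluster does not exist yet, it can also be provisioned from the SDK instead of the portal. The cell below is a minimal sketch, assuming a `STANDARD_D2_V2` VM size and example node counts; adjust these (and add a subnet in the advanced settings for a production subscription) to your own situation.
%% Cell type:code id: tags:
``` python
from azureml.core.compute import AmlCompute, ComputeTarget

create_cluster = False  # set to True only if the cluster does not exist yet
if create_cluster:
    # placeholder VM size and node counts - adapt to your subscription and quota
    provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                                min_nodes=0,
                                                                max_nodes=4)
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    compute_target.wait_for_completion(show_output=True)
```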
%% Cell type:markdown id: tags:
## Import Data
Before you train a model, you need to understand the data you're using to train it. In this section we will:
* Load the datasets created in the ETL.ipynb notebook
* Display some sample images
Let's connect to the dataset by mounting it on the compute. It is also possible to download it onto the compute (a download sketch follows the cell below).
%% Cell type:code id: tags:
``` python
from azureml.core import Dataset
# get dataset by name
image_dataset = Dataset.get_by_name(ws, "img_files_name")
labels_dataset = Dataset.get_by_name(ws, "labels_name")
# mount datasets on compute
image_mount = image_dataset.mount()
image_mount.start()
image_mount_folder = image_mount.mount_point
# load dataset as pandas frame
labels_pandas = labels_dataset.to_pandas_dataframe()
```
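%% Cell type:markdown id: tags:
If mounting is not convenient for your compute, the same image dataset can also be downloaded. A minimal sketch, assuming an example target folder `./data/images`:
%% Cell type:code id: tags:
``` python
# download the image files onto the compute instead of mounting them
downloaded_files = image_dataset.download(target_path='./data/images', overwrite=True)
print(len(downloaded_files), 'files downloaded')
```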
%% Cell type:markdown id: tags:
### Take a look at the data
Let's look at the pandas dataframe with labels.
%% Cell type:code id: tags:
``` python
labels_pandas.head()
```
%% Cell type:markdown id: tags:
Load the image files, then use `matplotlib` to plot 3 sample images from the dataset with their labels above them.
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
import glob
from skimage import io

train = glob.glob(image_mount_folder + "/*")

# now let's show some images from the training set.
count = 0
sample_size = 3
plt.figure(figsize=(20, 4))
for name in train[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.axhline("")
    plt.axvline("")
    # get image id from filename
    image = name.split("/")[-1].split(".")[0]
    # get label from pandas frame
    label_row = labels_pandas.loc[labels_pandas['image_id'] == image]
    label = label_row.columns[(label_row == 1).iloc[0]][0]
    # plot with text
    plt.text(0, 0, label, horizontalalignment="left", verticalalignment="top", fontsize=18, backgroundcolor='white')
    image = io.imread(name)
    plt.imshow(image)
plt.show()
```
%% Cell type:code id: tags:
``` python
# stop mount point
image_mount.stop()
```
%% Cell type:markdown id: tags:
## Train on a remote cluster
For this task, you submit the job to run on the remote training cluster you set up earlier. To submit a job you:
* Create a directory
* Create a training script
* Create a script run configuration
* Submit the job
### Create a directory
Create a directory to deliver the necessary code from your computer to the remote resource.
%% Cell type:code id: tags:
``` python
import os
script_folder = os.path.join(os.getcwd(), "scripts")
os.makedirs(script_folder, exist_ok=True)
```
%% Cell type:markdown id: tags:
### Create a training script
To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train.py` in the directory you just created.
%% Cell type:code id: tags:
``` python
%%writefile $script_folder/train.py
import os
import argparse
import joblib
from azureml.core import Run
from azureml.core import Dataset as DatasetAzure
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import cv2
import albumentations as A
from albumentations.pytorch import ToTensorV2
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix
import numpy as np
import seaborn as sns


# model dataset
class PlantDataset(Dataset):
    def __init__(self, df, mount_folder, transforms=None):
        self.df = df
        self.mount_folder = mount_folder
        self.transforms = transforms

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, idx):
        image_src = self.mount_folder + '/' + self.df.loc[idx, 'image_id'] + '.jpg'
        image = cv2.imread(image_src, cv2.IMREAD_COLOR)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        labels = self.df.loc[idx, ['healthy', 'multiple_diseases', 'rust', 'scab']].values
        labels = torch.from_numpy(labels.astype(np.int8))
        labels = labels.unsqueeze(-1)
        if self.transforms:
            transformed = self.transforms(image=image)
            image = transformed['image']
        return image, labels


# custom model class
class PlantModel(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.backbone = torchvision.models.resnet18(pretrained=True)
        in_features = self.backbone.fc.in_features
        self.logit = nn.Linear(in_features, num_classes)

    def forward(self, x):
        batch_size, C, H, W = x.shape
        x = self.backbone.conv1(x)
        x = self.backbone.bn1(x)
        x = self.backbone.relu(x)
        x = self.backbone.maxpool(x)
        x = self.backbone.layer1(x)
        x = self.backbone.layer2(x)
        x = self.backbone.layer3(x)
        x = self.backbone.layer4(x)
        x = F.adaptive_avg_pool2d(x, 1).reshape(batch_size, -1)
        x = F.dropout(x, 0.25, self.training)
        x = self.logit(x)
        return x


# custom cross entropy class
class DenseCrossEntropy(nn.Module):
    def __init__(self):
        super(DenseCrossEntropy, self).__init__()

    def forward(self, logits, labels):
        logits = logits.float()
        labels = labels.float()
        logprobs = F.log_softmax(logits, dim=-1)
        loss = -labels * logprobs
        loss = loss.sum(-1)
        return loss.mean()


# function for collecting input arguments
def get_runtime_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--image-folder', type=str)
    parser.add_argument('--labels', type=str)
    parser.add_argument('--size', type=int)
    parser.add_argument('--split', type=float)
    parser.add_argument('--batch-size', type=int)
    parser.add_argument('--num-workers', type=int)
    parser.add_argument('--num-classes', type=int)
    parser.add_argument('--learning-rate', type=float)
    parser.add_argument('--epochs', type=int)
    args = parser.parse_args()
    return args


# main training routine
def main():
    args = get_runtime_args()
    # A run represents a single trial of an experiment. Runs are used to monitor the asynchronous execution of a trial,
    # log metrics and store output of the trial, and to analyze results and access artifacts generated by the trial.
    run = Run.get_context()

    # define a training image transformer
    transforms_train = A.Compose([
        A.RandomResizedCrop(height=args.size, width=args.size, p=1.0),
        A.Flip(),
        A.ShiftScaleRotate(rotate_limit=1.0, p=0.8),
        # Pixels
        A.OneOf([
            A.IAAEmboss(p=1.0),
            A.IAASharpen(p=1.0),
            A.Blur(p=1.0),
        ], p=0.5),
        # Affine
        A.OneOf([
            A.ElasticTransform(p=1.0),
            A.IAAPiecewiseAffine(p=1.0),
        ], p=0.5),
        A.Normalize(p=1.0),
        ToTensorV2(p=1.0),
    ])
    # define a validation image transformer
    transforms_valid = A.Compose([
        A.Resize(height=args.size, width=args.size, p=1.0),
        A.Normalize(p=1.0),
        ToTensorV2(p=1.0),
    ])

    # get labels input dataset by id
    ws = run.experiment.workspace
    label_dataset = DatasetAzure.get_by_id(ws, id=args.labels)
    # get image mount folder
    image_mount_folder = args.image_folder
    # convert label dataset to pandas for ease of use
    labels_pandas = label_dataset.to_pandas_dataframe()

    # split dataset in train and validation
    train, valid = train_test_split(labels_pandas, test_size=args.split)
    # reset indexes
    train = train.reset_index(drop=True)
    valid = valid.reset_index(drop=True)

    # get Datasets
    dataset_train = PlantDataset(df=train, mount_folder=image_mount_folder, transforms=transforms_train)
    dataset_valid = PlantDataset(df=valid, mount_folder=image_mount_folder, transforms=transforms_valid)
    # wrap datasets in dataloaders
    dataloader_train = DataLoader(dataset_train, batch_size=args.batch_size, num_workers=args.num_workers, shuffle=True)
    dataloader_valid = DataLoader(dataset_valid, batch_size=args.batch_size, num_workers=args.num_workers, shuffle=False)

    # load model
    model = PlantModel(num_classes=args.num_classes)
    # set parameters for model training
    criterion = DenseCrossEntropy()
    plist = [{'params': model.parameters(), 'lr': args.learning_rate}]
    optimizer = optim.Adam(plist)

    # start training
    train_loss = []
    valid_loss = []
    valid_score = []
    # epoch loop
    for epoch in range(args.epochs):
        print(' Epoch {}/{}'.format(epoch + 1, args.epochs))
        print(' ' + ('-' * 20))

        model.train()
        tr_loss = 0
        for step, batch in enumerate(dataloader_train):
            images = batch[0]
            labels = batch[1]
            outputs = model(images)
            loss = criterion(outputs, labels.squeeze(-1))
            loss.backward()
            tr_loss += loss.item()
            optimizer.step()
            optimizer.zero_grad()

        # Validate
        model.eval()
        val_loss = 0
        val_preds = None
        val_labels = None
        for step, batch in enumerate(dataloader_valid):
            images = batch[0]
            labels = batch[1]
            if val_labels is None:
                val_labels = labels.clone().squeeze(-1)
            else:
                val_labels = torch.cat((val_labels, labels.squeeze(-1)), dim=0)
            with torch.no_grad():
                outputs = model(images)
                loss = criterion(outputs, labels.squeeze(-1))
                val_loss += loss.item()
                preds = torch.softmax(outputs, dim=1).data.cpu()
                if val_preds is None:
                    val_preds = preds
                else:
                    val_preds = torch.cat((val_preds, preds), dim=0)

        # update metrics
        train_loss.append(tr_loss / len(dataloader_train))
        valid_loss.append(val_loss / len(dataloader_valid))
        valid_score.append(roc_auc_score(val_labels, val_preds, average='macro'))

    # Create confusion matrix with last epoch
    val_labels = np.argmax(val_labels, axis=1)
    val_preds = np.argmax(val_preds, axis=1)
    cf_matrix = confusion_matrix(val_labels, val_preds)
    # make plot of matrix
    ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')
    ax.set_title('Seaborn Confusion Matrix with labels\n\n')
    ax.set_xlabel('\nPredicted Values')
    ax.set_ylabel('Actual Values')
    # Tick labels - list must be in alphabetical order
    ax.xaxis.set_ticklabels(['healthy', 'multiple_diseases', 'rust', 'scab'])
    ax.yaxis.set_ticklabels(['healthy', 'multiple_diseases', 'rust', 'scab'])
    fig = ax.get_figure()

    # log results to ML studio
    run.log_list(name='train loss per epoch', value=train_loss)
    run.log_list(name='valid loss per epoch', value=valid_loss)
    run.log_list(name='valid score per epoch', value=valid_score)
    run.log_image(name='confusion matrix last epoch', plot=fig)

    # copying to the "outputs" directory automatically uploads it to Azure ML
    output_dir = './outputs/'
    os.makedirs(output_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(output_dir, 'model.pth'))


if __name__ == "__main__":
    main()
```
%% Cell type:markdown id: tags:
Notice how the script gets data and saves models:
+ The training script reads input arguments:
- parser.add_argument(..)
+ The training script saves your model state dictionary into a directory named outputs. <br/>
`torch.save(model.state_dict(), os.path.join(output_dir, 'model.pth'))`<br/>
Anything written in this directory is automatically uploaded into your workspace. You'll access your model from this directory later in the tutorial.
%% Cell type:markdown id: tags:
### Configure the training job
Create a **ScriptRunConfig** object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on. Configure the **ScriptRunConfig** by specifying:
* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
* The compute target. In this case you will use the "cpu-cluster"
* The training script name, train.py
* An environment that contains the libraries needed to run the script
* Arguments required by the training script.
%% Cell type:code id: tags:
``` python
from azureml.core import ScriptRunConfig
args = ['--image-folder', image_dataset.as_mount(),  # it is also possible to download the image dataset onto the compute with as_download(); because mounting loads files only at processing time, it is usually faster than downloading
        '--labels', labels_dataset.as_named_input('labels_name'),
        '--size', 512,
        '--split', 0.2,
        '--batch-size', 4,
        '--epochs', 3,
        '--num-workers', 0,
        '--num-classes', 4,
        '--learning-rate', 5e-5]

src = ScriptRunConfig(source_directory="./",
                      script='scripts/train.py',
                      arguments=args,
                      compute_target=compute_target,
                      environment=env)
```
%% Cell type:markdown id: tags:
### Submit the job to the cluster
Run the experiment by submitting the ScriptRunConfig object. You can then navigate to the Azure portal to monitor the run.
Since the call is asynchronous, it returns a **Preparing** or **Running** state as soon as the job is started.
%% Cell type:code id: tags:
``` python
run = exp.submit(config=src)
run
```
%% Cell type:markdown id: tags:
## Monitor a remote run
Here is what's happening while you wait:
- **Image creation**: A Docker image is created matching the Python environment specified by the Azure ML environment. The image is built and stored in the ACR (Azure Container Registry) associated with your workspace. Image creation and uploading takes **about 5 minutes**.
This stage happens once for each Python environment since the container is cached for subsequent runs. During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs. If you prebuild the image this step will be much quicker.
- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. Scaling typically takes **about 5 minutes.**
- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the files in the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.
- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.
You can check the progress of a running job in multiple ways. This workshop uses a Jupyter widget; it is also possible to use the `wait_for_completion` method (see the sketch after the widget cell below).
### Jupyter widget
Watch the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.
%% Cell type:code id: tags:
``` python
from azureml.widgets import RunDetails
RunDetails(run).show()
```
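%% Cell type:markdown id: tags:
As an alternative to the widget, a minimal sketch of the blocking `wait_for_completion` call, which streams the logs into the notebook until the run finishes:
%% Cell type:code id: tags:
``` python
# block the notebook until the remote run completes and stream its logs
run.wait_for_completion(show_output=True)
```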
%% Cell type:markdown id: tags:
By the way, if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).
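%% Cell type:markdown id: tags:
From the SDK, cancelling is a single call on the run object (shown commented out so it is not triggered accidentally):
%% Cell type:code id: tags:
``` python
# cancel the submitted run if you no longer need it
# run.cancel()
```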
%% Cell type:markdown id: tags:
## View Experiment
In the left-hand menu in Azure Machine Learning Studio, select __Experiments__ and then select your experiment. An experiment is a grouping of many runs from a specified script or piece of code. If the experiment name doesn't exist when you submit a run, a new experiment is created automatically. When you select your run you will see various tabs containing metrics, logs, explanations, etc.
## Register model
The last step in the training script wrote the file `outputs/model.pth` in a directory named `outputs` on the VM of the cluster where the job is executed. `outputs` is a special directory in that all content in this directory is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace. Hence, the model file is now also available in your workspace.
You can see files associated with that run.
%% Cell type:code id: tags:
``` python
print(run.get_file_names())
```
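%% Cell type:markdown id: tags:
The trained weights can also be pulled back from the run's `outputs` directory, for example to inspect the model locally before registering it. A minimal sketch (the local file name is just an example):
%% Cell type:code id: tags:
``` python
# download the saved state dictionary from the run to the notebook compute
run.download_file(name='outputs/model.pth', output_file_path='model.pth')
```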
%% Cell type:markdown id: tags:
Register the model in the workspace so that you (or other collaborators) can later query, examine, and deploy this model.
%% Cell type:code id: tags:
``` python
# register model
model = run.register_model(model_name="workshop_training_name", model_path='outputs/model.pth')
print(model.name, model.id, model.version, sep='\t')
```
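%% Cell type:markdown id: tags:
Collaborators can later retrieve the registered model by name. A minimal sketch, assuming the model name used above:
%% Cell type:code id: tags:
``` python
from azureml.core.model import Model

# look up the latest registered version and download its files locally
registered_model = Model(workspace=ws, name="workshop_training_name")
model_path = registered_model.download(target_dir='.', exist_ok=True)
print(model_path)
```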
%% Cell type:markdown id: tags:
## Hyperparameter tuning
In machine learning, models are trained to predict unknown labels for new data based on correlations between known labels and features found in the training data. Depending on the algorithm used, you may need to specify hyperparameters to configure how the model is trained. For example, the logistic regression algorithm uses a regularization rate hyperparameter to counteract overfitting; and deep learning techniques for convolutional neural networks (CNNs) use hyperparameters like learning rate to control how weights are adjusted during training, and batch size to determine how many data items are included in each training batch.
The choice of hyperparameter values can significantly affect the resulting model, making it important to select the best possible values for your particular data and predictive performance goals.
Hyperparameter tuning is accomplished by training multiple models, using the same algorithm and training data but different hyperparameter values. The resulting model from each training run is then evaluated on the performance metric you want to optimize (for example, accuracy), and the best-performing model is selected.
In Azure Machine Learning, you achieve this through an experiment that consists of a hyperdrive run, which initiates a child run for each hyperparameter combination to be tested. Each child run uses a training script with parameterized hyperparameter values to train a model, and logs the target performance metric achieved by the trained model.
%% Cell type:markdown id: tags:
## Defining a search space
The set of hyperparameter values tried during hyperparameter tuning is known as the search space. The definition of the range of possible values that can be chosen depends on the type of hyperparameter.
### Discrete hyperparameters
Some hyperparameters require discrete values - in other words, you must select the value from a particular set of possibilities. You can define a search space for a discrete parameter using a choice from a list of explicit values, which you can define as a Python list (`choice([10,20,30])`), a range (`choice(range(1,10))`), or an arbitrary set of comma-separated values (`choice(30,50,100)`).
### Continuous hyperparameters
Some hyperparameters are continuous - in other words, you can use any value along a scale. To define a search space for these kinds of values, you can use any of the following distribution types:
- normal
- uniform
- lognormal
- loguniform
### Defining a search space
To define a search space for hyperparameter tuning, create a dictionary with the appropriate parameter expression for each named hyperparameter. For example, the following search space indicates that the `--learning-rate` hyperparameter can have the value 5e-5 or 4e-5. The commented-out alternative instead draws the learning rate from a normal distribution with a mean of 5e-5 and a standard deviation of 1e-5.
%% Cell type:code id: tags:
``` python
from azureml.train.hyperdrive import choice, normal
param_space = {
    '--learning-rate': choice(5e-5, 4e-5)
    # '--learning-rate': normal(5e-5, 1e-5)
}
```
%% Cell type:markdown id: tags:
## Configuring sampling
The specific values used in a hyperparameter tuning run depend on the type of sampling used.
### Grid sampling
Grid sampling can only be employed when all hyperparameters are discrete, and is used to try every possible combination of parameters in the search space.
### Random sampling
Random sampling is used to randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values.
### Bayesian sampling
Bayesian sampling chooses hyperparameter values based on the Bayesian optimization algorithm, which tries to select parameter combinations that will result in improved performance from the previous selection.
%% Cell type:code id: tags:
``` python
from azureml.train.hyperdrive import GridParameterSampling
param_sampling = GridParameterSampling(param_space)
```
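%% Cell type:markdown id: tags:
The cell above uses grid sampling. As a sketch of the alternatives, random or Bayesian sampling can be configured on the same search space (note that Bayesian sampling does not support an early termination policy, so you would omit the policy from the HyperDriveConfig in that case):
%% Cell type:code id: tags:
``` python
from azureml.train.hyperdrive import RandomParameterSampling, BayesianParameterSampling

# random sampling works with a mix of discrete and continuous parameter expressions
random_sampling = RandomParameterSampling(param_space)

# Bayesian sampling picks new combinations based on how previous runs performed
bayesian_sampling = BayesianParameterSampling(param_space)
```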
%% Cell type:markdown id: tags:
## Configuring early termination
With a sufficiently large hyperparameter search space, it could take many iterations (child runs) to try every possible combination. Typically, you set a maximum number of iterations, but this could still result in a large number of runs that don't result in a better model than a combination that has already been tried.
To help prevent wasting time, you can set an early termination policy that abandons runs that are unlikely to produce a better result than previously completed runs. The policy is evaluated at an evaluation_interval you specify, based on each time the target performance metric is logged. You can also set a delay_evaluation parameter to avoid evaluating the policy until a minimum number of iterations have been completed.
### Bandit policy
You can use a bandit policy to stop a run if the target performance metric underperforms the best run so far by a specified margin.
### Median stopping policy
A median stopping policy abandons runs where the target performance metric is worse than the median of the running averages for all runs.
### Truncation selection policy
A truncation selection policy cancels the lowest performing X% of runs at each evaluation interval, based on the truncation_percentage value you specify for X.
%% Cell type:code id: tags:
``` python
from azureml.train.hyperdrive import BanditPolicy
early_termination_policy = BanditPolicy(slack_amount=0.2,
                                        evaluation_interval=1,
                                        delay_evaluation=1)
```
%% Cell type:markdown id: tags:
This example applies the policy for every iteration after the first one, and abandons runs where the reported target metric is 0.2 or more worse than the best performing run after the same number of intervals.
You can also apply a bandit policy using a slack factor, which compares the performance metric as a ratio rather than an absolute value.
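%% Cell type:markdown id: tags:
For comparison, the other two policies can be configured in a similar way; the intervals and percentage below are illustrative values only.
%% Cell type:code id: tags:
``` python
from azureml.train.hyperdrive import MedianStoppingPolicy, TruncationSelectionPolicy

# abandon runs whose metric is worse than the median of the running averages
median_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

# cancel the lowest performing 20% of runs at each evaluation interval
truncation_policy = TruncationSelectionPolicy(truncation_percentage=20,
                                              evaluation_interval=1,
                                              delay_evaluation=5)
```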
%% Cell type:markdown id: tags:
## Running hyperparameter tuning
To run a hyperdrive experiment, you need to create a training script just as you would for any other training experiment, except that your script must:
* Include an argument for each hyperparameter you want to vary.
* Log the target performance metric. This enables the hyperdrive run to evaluate the performance of the child runs it initiates, and identify the one that produces the best performing model.
We will reuse the previous training script and use 'valid score per epoch' as the tracking metric.
%% Cell type:code id: tags:
``` python
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal
hyperdrive = HyperDriveConfig(run_config=src,
                              hyperparameter_sampling=param_sampling,
                              policy=early_termination_policy,
                              primary_metric_name='valid score per epoch',
                              primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                              max_total_runs=6,
                              max_concurrent_runs=4)

experiment = Experiment(workspace=ws, name='workshop_hyperparameter_tuning_name')
hyperdrive_run = experiment.submit(config=hyperdrive)
hyperdrive_run
```
%% Cell type:markdown id: tags:
To retrieve the best performing run, you can use the following code:
%% Cell type:code id: tags:
``` python
best_run = hyperdrive_run.get_best_run_by_primary_metric()
```
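%% Cell type:markdown id: tags:
To see which hyperparameter values the best child run used and what it scored, you can inspect its metrics and details. A minimal sketch; the exact layout of the details dictionary can differ between SDK versions:
%% Cell type:code id: tags:
``` python
if best_run is not None:
    # metrics logged by the best child run (includes 'valid score per epoch')
    print(best_run.get_metrics())
    # arguments the child run was started with, i.e. the sampled hyperparameters
    print(best_run.get_details()['runDefinition']['arguments'])
```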
%% Cell type:markdown id: tags:
Now we can register the best performing model.
%% Cell type:code id: tags:
``` python
# register model
model = best_run.register_model(model_name="workshop_hyper_model_name", model_path='outputs/model.pth')
print(model.name, model.id, model.version, sep='\t')
```