It is important to maintain a conda dependency file and/or an Azure ML Studio environment.
Every user of the workspace will use their own compute instance; with conda files and environments it is easy to install the same dependencies on each of these compute instances.
For each conda environment we can set up a kernel so that a notebook can use that environment:
- Open the terminal (the terminal opens in your account folder)
- Activate the environment, for example `conda activate <env_name>`
- Register the environment as a kernel: `pip install ipykernel`, then `python -m ipykernel install --user --name <env_name> --display-name "Python (<env_name>)"`
%% Cell type:markdown id: tags:
# Environments
Azure ML environments are an encapsulation of the environment where your machine learning training happens. They define Python packages, environment variables, Docker settings and other attributes in a declarative fashion. Environments are versioned: you can update them and retrieve old versions to revisit and review your work.
Environments allow you to:
- Encapsulate the dependencies of your training process, such as Python packages and their versions.
- Reproduce the Python environment of your local computer in a remote run on a VM or ML Compute cluster.
- Reproduce your experimentation environment in a production setting.
- Revisit and audit the environment in which an existing model was trained.

Environment, compute target and training script together form the run configuration: the full specification of a training run.
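%% Cell type:markdown id: tags:
As an illustration, the pieces can be combined as follows (a minimal sketch: `train.py`, the `cpu-cluster` compute target and the `env` object are assumptions; `env` is created further down in this notebook):
%% Cell type:code id: tags:
``` python
from azureml.core import ScriptRunConfig

# A minimal sketch of a run configuration: a training script, a compute
# target and an environment. 'train.py' and 'cpu-cluster' are hypothetical
# names; `env` is the Environment object created later in this notebook.
src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      compute_target='cpu-cluster',
                      environment=env)
```
%% Cell type:markdown id: tags: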
With the following code we can list the currently available environments.
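%% Cell type:code id: tags:
``` python
from azureml.core import Environment

# List the names of all environments available in the workspace
# (Environment.list returns a dictionary of names to Environment objects,
# including the curated environments provided by Azure ML).
for name in Environment.list(workspace=ws):
    print(name)
```
%% Cell type:markdown id: tags:
If the environment is not registered yet, we create it from the conda dependency file and build it; otherwise we load the existing version.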
%% Cell type:code id: tags:
``` python
# `env_name` is assumed to be defined earlier in the notebook.
if env_name not in Environment.list(workspace=ws):
    # Create the environment from the conda dependency file
    # ('environment.yml' is a placeholder path for that file).
    env = Environment.from_conda_specification(name=env_name, file_path='environment.yml')
    # We can directly build the environment - this will create a new Docker
    # image in Azure Container Registry (ACR) and directly 'bake in' our
    # dependencies from the conda definition. When we later use the
    # Environment, all AML will need to do is pull the image for the
    # environment, thus saving the time of a potentially long-running
    # conda environment creation.
    build = env.build(workspace=ws)
    build.wait_for_completion(show_output=True)
else:
    # load the existing environment
    env = Environment.get(workspace=ws, name=env_name)
```
%% Cell type:markdown id: tags:
# Datastores
You can view and manage datastores in Azure Machine Learning Studio, or you can use the Azure Machine Learning SDK. For example, the following code lists the names of each datastore in the workspace.
%% Cell type:code id: tags:
``` python
from azureml.core import Datastore

for ds_name in ws.datastores:
    print(ds_name)
```
%% Cell type:markdown id: tags:
You can get a reference to any datastore by using the Datastore.get() method as shown here:
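%% Cell type:code id: tags:
``` python
# A minimal sketch: retrieve a registered datastore by name.
# 'plant_data_store' is a hypothetical datastore name used for illustration.
plant_datastore = Datastore.get(ws, datastore_name='plant_data_store')
```
%% Cell type:markdown id: tags: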
The workspace always includes a default datastore (initially, this is the built-in workspaceblobstore datastore), which you can retrieve by using the get_default_datastore() method of a Workspace object, like this:
%% Cell type:code id: tags:
``` python
default_store = ws.get_default_datastore()
```
%% Cell type:markdown id: tags:
When planning for datastores, consider the following guidelines:
- When using Azure blob storage, premium level storage may provide improved I/O performance for large datasets. However, this option will increase cost and may limit replication options for data redundancy.
- When working with data files, although CSV format is very common, Parquet format generally results in better performance.
%% Cell type:markdown id: tags:
# Datasets
Datasets are versioned packaged data objects that can be easily consumed in experiments and pipelines. Datasets are the recommended way to work with data, and are the primary mechanism for advanced Azure Machine Learning capabilities like data labeling and data drift monitoring.
Datasets are typically based on files in a datastore, though they can also be based on URLs and other sources. You can create the following types of dataset:
- **Tabular:** The data is read from the dataset as a table. You should use this type of dataset when your data is consistently structured and you want to work with it in common tabular data structures, such as Pandas dataframes.
- **File:** The dataset presents a list of file paths that can be read as though from the file system. Use this type of dataset when your data is unstructured, or when you need to process the data at the file level (for example, to train a convolutional neural network from a set of image files).
Our data consists of two datasets:
- **labels_short.csv**: a CSV file with the labels of the images. We will create a Tabular dataset from this CSV.
- **images**: a folder with images of plants with diseases. We will create a File dataset from this folder.
%% Cell type:markdown id: tags:
To create a tabular dataset using the SDK, use the from_delimited_files method of the Dataset.Tabular class:
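%% Cell type:code id: tags:
``` python
from azureml.core import Dataset

# A minimal sketch, assuming 'labels_short.csv' sits at the root of the
# default datastore; the path and the dataset name are assumptions.
tab_ds = Dataset.Tabular.from_delimited_files(path=(default_store, 'labels_short.csv'))
tab_ds = tab_ds.register(workspace=ws, name='plant_labels')
```
%% Cell type:markdown id: tags: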
Datasets can be versioned, enabling you to track historical versions of datasets that were used in experiments, and reproduce those experiments with data in the same state.
You can create a new version of a dataset by registering it with the same name as a previously registered dataset and specifying the create_new_version property:
%% Cell type:code id: tags:
``` python
file_ds_filtered = file_ds_filtered.register(workspace=ws,
                                             name='img_files_name',
                                             description='images of plant disease dataset filtered on Train name',
                                             create_new_version=True)
```
%% Cell type:markdown id: tags:
We have selected only images with **Train_** in the name, since we only know the labels for these images.
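%% Cell type:markdown id: tags:
For reference, such a filtered File dataset can be created with a glob pattern (a minimal sketch; the 'images' folder path is an assumption):
%% Cell type:code id: tags:
``` python
from azureml.core import Dataset

# Keep only files whose names start with 'Train_' by using a glob pattern.
# The 'images' folder path on the default datastore is an assumption.
file_ds_filtered = Dataset.File.from_files(path=(default_store, 'images/Train_*'))
```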
%% Cell type:markdown id: tags:
You can retrieve a specific version of a dataset by specifying the version parameter in the get_by_name method of the Dataset class.
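%% Cell type:code id: tags:
``` python
from azureml.core import Dataset

# A minimal sketch: retrieve version 1 of the file dataset registered above.
img_ds_v1 = Dataset.get_by_name(workspace=ws, name='img_files_name', version=1)
```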