It is important to maintain a conda dependency file and/or an Azure ML Studio environment.
Every user of the workspace will use their own compute instance; with conda files and environments it is easy to install the same dependencies on each of these compute instances.
For each conda environment we can set up a kernel so that a notebook can use that environment:
- Open the terminal (the terminal opens in your account folder)
- Activate the environment, for example `conda activate <env_name>`
- Register the environment as a kernel: `pip install ipykernel`, then `python -m ipykernel install --user --name <env_name> --display-name "Python (<env_name>)"`
%% Cell type:markdown id: tags:
# Environments
Azure ML environments are an encapsulation of the environment where your machine learning training happens. They define Python packages, environment variables, Docker settings and other attributes in a declarative fashion. Environments are versioned: you can update them and retrieve old versions to revisit and review your work.
Environments allow you to:
- Encapsulate the dependencies of your training process, such as Python packages and their versions.
- Reproduce the Python environment of your local computer in a remote run on a VM or ML Compute cluster.
- Reproduce your experimentation environment in a production setting.
- Revisit and audit the environment in which an existing model was trained.

Environment, compute target and training script together form the run configuration: the full specification of a training run.
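%% Cell type:markdown id: tags:
As an illustration, the pieces can be combined as follows (a minimal sketch: `train.py`, the `cpu-cluster` compute target and the `env` object are assumptions; `env` is created further down in this notebook):
%% Cell type:code id: tags:
``` python
from azureml.core import ScriptRunConfig

# A minimal sketch of a run configuration: a training script, a compute
# target and an environment. 'train.py' and 'cpu-cluster' are hypothetical
# names; `env` is the Environment object created later in this notebook.
src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      compute_target='cpu-cluster',
                      environment=env)
```
%% Cell type:markdown id: tags: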
With the following code we can list the currently available environments.
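%% Cell type:code id: tags:
``` python
from azureml.core import Environment

# List the names of all environments available in the workspace
# (Environment.list returns a dictionary of names to Environment objects,
# including the curated environments provided by Azure ML).
for name in Environment.list(workspace=ws):
    print(name)
```
%% Cell type:markdown id: tags:
If the environment is not registered yet, we create it from the conda dependency file and build it; otherwise we load the existing version.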
%% Cell type:code id: tags:
``` python
# `env_name` is assumed to be defined earlier in the notebook.
if env_name not in Environment.list(workspace=ws):
    # Create the environment from the conda dependency file
    # ('environment.yml' is a placeholder path for that file).
    env = Environment.from_conda_specification(name=env_name, file_path='environment.yml')
    # We can directly build the environment - this will create a new Docker
    # image in Azure Container Registry (ACR) and directly 'bake in' our
    # dependencies from the conda definition. When we later use the
    # Environment, all AML will need to do is pull the image for the
    # environment, thus saving the time of a potentially long-running
    # conda environment creation.
    build = env.build(workspace=ws)
    build.wait_for_completion(show_output=True)
else:
    # load the existing environment
    env = Environment.get(workspace=ws, name=env_name)
```
%% Cell type:markdown id: tags:
# Datastores
You can view and manage datastores in Azure Machine Learning Studio, or you can use the Azure Machine Learning SDK. For example, the following code lists the names of each datastore in the workspace.
%% Cell type:code id: tags:
``` python
from azureml.core import Datastore

for ds_name in ws.datastores:
    print(ds_name)
```
%% Cell type:markdown id: tags:
You can get a reference to any datastore by using the Datastore.get() method as shown here:
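%% Cell type:code id: tags:
``` python
# A minimal sketch: retrieve a registered datastore by name.
# 'plant_data_store' is a hypothetical datastore name used for illustration.
plant_datastore = Datastore.get(ws, datastore_name='plant_data_store')
```
%% Cell type:markdown id: tags: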
The workspace always includes a default datastore (initially, this is the built-in workspaceblobstore datastore), which you can retrieve by using the get_default_datastore() method of a Workspace object, like this:
%% Cell type:code id: tags:
``` python
default_store = ws.get_default_datastore()
```
%% Cell type:markdown id: tags:
When planning for datastores, consider the following guidelines:
- When using Azure blob storage, premium level storage may provide improved I/O performance for large datasets. However, this option will increase cost and may limit replication options for data redundancy.
- When working with data files, although CSV format is very common, Parquet format generally results in better performance.
%% Cell type:markdown id: tags:
# Datasets
Datasets are versioned packaged data objects that can be easily consumed in experiments and pipelines. Datasets are the recommended way to work with data, and are the primary mechanism for advanced Azure Machine Learning capabilities like data labeling and data drift monitoring.
Datasets are typically based on files in a datastore, though they can also be based on URLs and other sources. You can create the following types of dataset:
- **Tabular:** The data is read from the dataset as a table. You should use this type of dataset when your data is consistently structured and you want to work with it in common tabular data structures, such as Pandas dataframes.
- **File:** The dataset presents a list of file paths that can be read as though from the file system. Use this type of dataset when your data is unstructured, or when you need to process the data at the file level (for example, to train a convolutional neural network from a set of image files).
Our data consists of two datasets:
- **labels_short.csv**: a CSV file with the labels of the images. We will create a Tabular dataset from this CSV.
- **images**: a folder with images of plants with diseases. We will create a File dataset from this folder.
%% Cell type:markdown id: tags:
To create a tabular dataset using the SDK, use the from_delimited_files method of the Dataset.Tabular class:
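%% Cell type:code id: tags:
``` python
from azureml.core import Dataset

# A minimal sketch, assuming 'labels_short.csv' sits at the root of the
# default datastore; the path and the dataset name are assumptions.
tab_ds = Dataset.Tabular.from_delimited_files(path=(default_store, 'labels_short.csv'))
tab_ds = tab_ds.register(workspace=ws, name='plant_labels')
```
%% Cell type:markdown id: tags: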
Datasets can be versioned, enabling you to track historical versions of datasets that were used in experiments, and reproduce those experiments with data in the same state.
You can create a new version of a dataset by registering it with the same name as a previously registered dataset and specifying the create_new_version property:
%% Cell type:code id: tags:
``` python
file_ds_filtered = file_ds_filtered.register(workspace=ws,
                                             name='img_files_name',
                                             description='images of plant disease dataset filtered on Train name',
                                             create_new_version=True)
```
%% Cell type:markdown id: tags:
We have selected only images with **Train_** in the name, since we only know the labels for these images.
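%% Cell type:markdown id: tags:
For reference, such a filtered File dataset can be created with a glob pattern (a minimal sketch; the 'images' folder path is an assumption):
%% Cell type:code id: tags:
``` python
from azureml.core import Dataset

# Keep only files whose names start with 'Train_' by using a glob pattern.
# The 'images' folder path on the default datastore is an assumption.
file_ds_filtered = Dataset.File.from_files(path=(default_store, 'images/Train_*'))
```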
%% Cell type:markdown id: tags:
You can retrieve a specific version of a dataset by specifying the version parameter in the get_by_name method of the Dataset class.
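%% Cell type:code id: tags:
``` python
from azureml.core import Dataset

# A minimal sketch: retrieve version 1 of the file dataset registered above.
img_ds_v1 = Dataset.get_by_name(workspace=ws, name='img_files_name', version=1)
```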