diff --git a/ETL.ipynb b/ETL.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..adf5377964d847b0f1fec3c2d7eaac3101ce5848 --- /dev/null +++ b/ETL.ipynb @@ -0,0 +1,681 @@ +{ + "cells": [
+ { + "cell_type": "markdown", + "source": [ + "# Setup\r\n", + "\r\n", + "It is important to maintain a conda dependency file and/or an Azure ML Studio environment.\r\n", + "\r\n", + "Every user of the workspace works on their own compute instance; with conda files and environments it is easy to install the same dependencies on each of these compute instances." + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + },
+ { + "cell_type": "code", + "source": [ + "!conda env update -n workshop_env --file conda-notebook.yml\r\n", + "!python -m ipykernel install --user --name=workshop_env --display-name=workshop_env" + ], + "outputs": [], + "execution_count": null, + "metadata": {} + },
+ { + "cell_type": "markdown", + "source": [ + "Refresh the page and change the kernel to workshop_env." + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + },
+ { + "cell_type": "markdown", + "source": [ + "Connect to the workspace so that the Azure ML commands below run against it." + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + },
+ { + "cell_type": "code", + "source": [ + "from azureml.core import Workspace\r\n", + "\r\n", + "ws = Workspace.from_config()\r\n", + "print(f'WS name: {ws.name}\\nRegion: {ws.location}\\nSubscription id: {ws.subscription_id}\\nResource group: {ws.resource_group}')" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646659144055 + } + } + },
+ { + "cell_type": "markdown", + "source": [ + "## Setup environment\r\n", + "\r\n", + "Azure ML environments are an encapsulation of the environment where your machine learning training happens. They define Python packages, environment variables, Docker settings and other attributes in a declarative fashion. Environments are versioned: you can update them and retrieve old versions to revisit and review your work.\r\n", + "\r\n", + "Environments allow you to:\r\n", + "\r\n", + "- Encapsulate the dependencies of your training process, such as Python packages and their versions.\r\n", + "- Reproduce the Python environment of your local computer in a remote run on a VM or an ML Compute cluster.\r\n", + "- Reproduce your experimentation environment in a production setting.\r\n", + "- Revisit and audit the environment in which an existing model was trained.\r\n", + "\r\n", + "Together, the environment, compute target and training script form the run configuration: the full specification of a training run.\r\n", + "\r\n", + "With the following code we can list the currently available environments."
+ ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "from azureml.core import Environment\r\n", + "\r\n", + "envs = Environment.list(workspace=ws)\r\n", + "\r\n", + "for env in envs:\r\n", + " print(\"Name\",env)\r\n", + " if envs[env].python.conda_dependencies is not None:\r\n", + " print(\"packages\", envs[env].python.conda_dependencies.serialize_to_string())" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646659352537 + } + } + }, + { + "cell_type": "markdown", + "source": [ + "We can create our own environment or load an existing environment." + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "new_update_env = False\r\n", + "env_name='workshop-training-env'\r\n", + "# pathing in notebook folder\r\n", + "conda_path='conda-training.yml'\r\n", + "\r\n", + "if new_update_env:\r\n", + " # create new environment\r\n", + " env = Environment.from_conda_specification(name=env_name, file_path=conda_path)\r\n", + " env.register(workspace=ws)\r\n", + " # We can directly build the environment - this will create a new Docker \r\n", + " # image in Azure Container Registry (ACR), and directly 'bake in' our dependencies \r\n", + " # from the conda definition. When we later use the Environment, all AML will need to \r\n", + " # do is pull the image for environment, thus saving the time for potentially a \r\n", + " # long-running conda environment creation.\r\n", + " build = env.build(workspace=ws)\r\n", + " build.wait_for_completion(show_output=True)\r\n", + "else:\r\n", + " # load existing environment\r\n", + " env = Environment.get(workspace=ws, name=env_name)" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "markdown", + "source": [ + "# Datastores\r\n", + "\r\n", + "You can view and manage datastores in Azure Machine Learning Studio, or you can use the Azure Machine Learning SDK. For example, the following code lists the names of each datastore in the workspace." 
+ ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "from azureml.core import Datastore\r\n", + "\r\n", + "for ds_name in ws.datastores:\r\n", + " print(ds_name)" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646659467722 + } + } + }, + { + "cell_type": "markdown", + "source": [ + "You can get a reference to any datastore by using the Datastore.get() method as shown here:\r\n", + "\r\n" + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "store = Datastore.get(ws, datastore_name='datastore_name')" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646659596301 + } + } + }, + { + "cell_type": "markdown", + "source": [ + "The workspace always includes a default datastore (initially, this is the built-in workspaceblobstore datastore), which you can retrieve by using the get_default_datastore() method of a Workspace object, like this:" + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "default_store = ws.get_default_datastore()" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646659609540 + } + } + }, + { + "cell_type": "markdown", + "source": [ + "When planning for datastores, consider the following guidelines:\r\n", + "\r\n", + "- When using Azure blob storage, premium level storage may provide improved I/O performance for large datasets. However, this option will increase cost and may limit replication options for data redundancy.\r\n", + "- When working with data files, although CSV format is very common, Parquet format generally results in better performance." + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "markdown", + "source": [ + "# Datasets\r\n", + "\r\n", + "Datasets are versioned packaged data objects that can be easily consumed in experiments and pipelines. Datasets are the recommended way to work with data, and are the primary mechanism for advanced Azure Machine Learning capabilities like data labeling and data drift monitoring.\r\n", + "\r\n", + "Datasets are typically based on files in a datastore, though they can also be based on URLs and other sources. You can create the following types of dataset:\r\n", + "\r\n", + "- **Tabular:** The data is read from the dataset as a table. You should use this type of dataset when your data is consistently structured and you want to work with it in common tabular data structures, such as Pandas dataframes.\r\n", + "- **File:** The dataset presents a list of file paths that can be read as though from the file system. Use this type of dataset when your data is unstructured, or when you need to process the data at the file level (for example, to train a convolutional neural network from a set of image files).\r\n", + "\r\n", + "Our data consists of two datasets:\r\n", + "- **labels_short.csv**: csv with labels of images. 
+ { + "cell_type": "markdown", + "source": [ + "# Datasets\r\n", + "\r\n", + "Datasets are versioned, packaged data objects that can be easily consumed in experiments and pipelines. Datasets are the recommended way to work with data, and are the primary mechanism for advanced Azure Machine Learning capabilities like data labeling and data drift monitoring.\r\n", + "\r\n", + "Datasets are typically based on files in a datastore, though they can also be based on URLs and other sources. You can create the following types of dataset:\r\n", + "\r\n", + "- **Tabular:** The data is read from the dataset as a table. You should use this type of dataset when your data is consistently structured and you want to work with it in common tabular data structures, such as Pandas dataframes.\r\n", + "- **File:** The dataset presents a list of file paths that can be read as though from the file system. Use this type of dataset when your data is unstructured, or when you need to process the data at the file level (for example, to train a convolutional neural network from a set of image files).\r\n", + "\r\n", + "Our data consists of two datasets:\r\n", + "- **labels_short.csv**: a CSV file with the labels of the images. We will create a Tabular dataset from this CSV.\r\n", + "- **images**: a folder with images of plants with diseases. We will create a File dataset from this folder." + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + },
+ { + "cell_type": "markdown", + "source": [ + "To create a tabular dataset using the SDK, use the from_delimited_files method of the Dataset.Tabular class, like this:" + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + },
+ { + "cell_type": "code", + "source": [ + "from azureml.core import Dataset\r\n", + "\r\n", + "csv_paths = [(store, 'labels_short.csv')]\r\n", + "tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)\r\n", + "tab_ds = tab_ds.register(workspace=ws, name='labels_name', description='labels of plant disease dataset', create_new_version=True)" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646659730056 + } + } + },
+ { + "cell_type": "markdown", + "source": [ + "After creating the dataset, the code registers it in the workspace with the name labels_name." + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + },
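+ { + "cell_type": "markdown", + "source": [ + "Before moving on, the registered labels can be sanity-checked with a small preview. This is a minimal sketch using the take method, which pulls only the first rows instead of loading the whole table:" + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + },
+ { + "cell_type": "code", + "source": [ + "# Preview the first rows of the registered tabular dataset\r\n", + "preview_df = tab_ds.take(3).to_pandas_dataframe()\r\n", + "print(preview_df)" + ], + "outputs": [], + "execution_count": null, + "metadata": {} + },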
+ ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "# Load the workspace from the saved config file\r\n", + "ws = Workspace.from_config()\r\n", + "\r\n", + "# Get a dataset from the workspace datasets collection\r\n", + "ds_labels = ws.datasets['labels_name']\r\n", + "\r\n", + "# Get a dataset by name from the datasets class\r\n", + "ds_images = Dataset.get_by_name(ws, 'img_files_name')" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646660040765 + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Datasets can be versioned, enabling you to track historical versions of datasets that were used in experiments, and reproduce those experiments with data in the same state.\r\n", + "\r\n", + "You can create a new version of a dataset by registering it with the same name as a previously registered dataset and specifying the create_new_version property:" + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "file_ds_filtered = Dataset.File.from_files(path=(store, 'images/Train*.jpg'))\r\n", + "file_ds_filtered = file_ds_filtered.register(workspace=ws, name='img_files_name', description='images of plant disease dataset filtered on Train name', create_new_version=True)" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646137713677 + } + } + }, + { + "cell_type": "markdown", + "source": [ + "We have selected only images with **Train_** in the name since we only know the labels of these images." 
+ ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "markdown", + "source": [ + "You can retrieve a specific version of a dataset by specifying the version parameter in the get_by_name method of the Dataset class.\r\n", + "\r\n" + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "img_ds = Dataset.get_by_name(workspace=ws, name='img_files_name', version=2)" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646660267853 + } + } + }, + { + "cell_type": "markdown", + "source": [ + "You can read data directly from a tabular dataset by converting it into a Pandas or Spark dataframe:\r\n", + "\r\n" + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "df = tab_ds.to_pandas_dataframe()\r\n", + "# code to work with dataframe goes here, for example:\r\n", + "print(df.head())" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646660275291 + } + } + }, + { + "cell_type": "markdown", + "source": [ + "When working with a file dataset, you can use the to_path() method to return a list of the file paths encapsulated by the dataset:\r\n", + "\r\n" + ], + "metadata": { + "nteract": { + "transient": { + "deleting": false + } + } + } + }, + { + "cell_type": "code", + "source": [ + "for file_path in img_ds.to_path():\r\n", + " print(file_path)" + ], + "outputs": [], + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": false, + "outputs_hidden": false + }, + "nteract": { + "transient": { + "deleting": false + } + }, + "gather": { + "logged": 1646660293091 + } + } + } + ], + "metadata": { + "kernelspec": { + "name": "workshop_env", + "language": "python", + "display_name": "workshop_env" + }, + "language_info": { + "name": "python", + "version": "3.6.9", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "pygments_lexer": "ipython3", + "nbconvert_exporter": "python", + "file_extension": ".py" + }, + "kernel_info": { + "name": "workshop_env" + }, + "nteract": { + "version": "nteract-front-end@1.0.0" + }, + "microsoft": { + "host": { + "AzureML": { + "notebookHasBeenCompleted": true + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file