How to Update an Azure ML Dataset with a New pandas DataFrame and How to Revert to a Specific Version if Needed

Question:

Is there a way to update an existing Azure ML Dataset using a pandas DataFrame and update its version? The default Dataset is stored in a blob as a CSV file. How can we approach this?

Also, let’s say we want to change the latest version to another version.

[Screenshot: version list of the Dataset, with version 2 shown as the latest]

Above we see that version 2 is the latest, but I want to change the latest to version 1 so that anyone who reads the Dataset gets version 1. I don’t want to specify a version explicitly to retrieve it.

Asked By: L_Jay


Answers:

Regarding your first question, here are two methods to update your Azure ML dataset with a new version using a CSV file stored in Blob Storage:

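Method 1 (SDK v2, azure-ai-ml):

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connect to the workspace described in a local config.json
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

blob_url = 'https://sampleazurestorage.blob.core.windows.net/data/my-sample-data.csv'

my_dataset = Data(
    path=blob_url,
    # A single CSV file is a URI_FILE asset; MLTABLE expects a folder containing an MLTable file
    type=AssetTypes.URI_FILE,
    description="a description for your dataset",
    name="dataset_name",
    version='<new_version>'
)

# Register the asset under the given name and version
ml_client.data.create_or_update(my_dataset)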
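
To check which version a reader will receive after registering, the asset can be read back either by explicit version or through the "latest" label. The following is a minimal sketch, assuming `ml_client` is an `azure.ai.ml.MLClient` as in Method 1; `get_data_asset` is a hypothetical helper name, not part of the SDK.

```python
def get_data_asset(ml_client, name, version=None):
    """Fetch a registered data asset by explicit version, or the latest one.

    `ml_client` is assumed to be an azure.ai.ml.MLClient (see Method 1).
    """
    if version is None:
        # No version given: ask the service for whatever "latest" resolves to
        return ml_client.data.get(name=name, label="latest")
    # Versions are strings in SDK v2, so normalise ints such as 1 to "1"
    return ml_client.data.get(name=name, version=str(version))


# Usage (requires a real workspace connection):
# asset = get_data_asset(ml_client, "dataset_name")
# print(asset.version, asset.path)
```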

Method 2 (SDK v1, azureml-core):

from azureml.core import Dataset, Workspace

# Load the workspace from a local config.json
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

blob_url = 'https://sampleazurestorage.blob.core.windows.net/data/my-sample-data.csv'

# CSV data is tabular, so use Dataset.Tabular (Dataset.File has no from_delimited_files)
my_dataset = Dataset.Tabular.from_delimited_files(path=blob_url)
my_dataset.register(
    workspace=ws,
    name="dataset_name",
    description="a description for your dataset",
    create_new_version=True  # add a new version under the existing name
)

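If you want to update the dataset using a pandas DataFrame (SDK v1):

my_df = ...  # the variable that contains the new dataset in a DataFrame
# register_pandas_dataframe (marked experimental in azureml-core) uploads the
# DataFrame to the datastore and registers it in a single call
my_dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=my_df,
    target=datastore,  # the default datastore from "Method 2"
    name="dataset_name",
    description="a description for your dataset"
)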
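
With the v2 SDK (azure-ai-ml), a comparable approach is to write the DataFrame to a local CSV file and register that file as a new version of the data asset. The following is a minimal sketch under stated assumptions: `ml_client` is an `MLClient` as in Method 1, the helper names and the `./upload` folder are hypothetical, and omitting `version` is assumed to let the service assign the next version number.

```python
import os


def save_dataframe_as_csv(df, directory, name):
    """Write `df` to <directory>/<name>.csv and return the file path."""
    os.makedirs(directory, exist_ok=True)
    csv_path = os.path.join(directory, f"{name}.csv")
    df.to_csv(csv_path, index=False)  # leave out the pandas index column
    return csv_path


def register_dataframe(ml_client, df, name, description=None):
    """Upload `df` as a CSV file and register it under `name`."""
    # Imported lazily so the helper above works without the Azure SDK installed
    from azure.ai.ml.entities import Data
    from azure.ai.ml.constants import AssetTypes

    csv_path = save_dataframe_as_csv(df, "./upload", name)  # hypothetical folder
    asset = Data(
        path=csv_path,  # a local path is uploaded by create_or_update
        type=AssetTypes.URI_FILE,
        name=name,
        description=description,
        # version is omitted on the assumption that the service assigns the next one
    )
    return ml_client.data.create_or_update(asset)


# Usage (requires a real workspace connection):
# register_dataframe(ml_client, my_df, "dataset_name", "a description for your dataset")
```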

Regarding your second question:

Above we see that version 2 is the latest, but I want to change the latest to version 1

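This is not possible: ‘latest’ always points to the most recently registered version of the dataset with the given name. If you want readers to get version 1’s data by default, register that data again as a new version (set the version parameter of the Data class in the "Method 1" code snippet), so that it becomes the latest. Otherwise, request the specific version explicitly when retrieving the dataset.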
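
A possible workaround for making an older version the default again is to register its data once more under the next version number, so that ‘latest’ resolves to it. The following is a minimal sketch, assuming `ml_client` is an `MLClient` as in Method 1 and that the old asset's stored path can be reused as the path of a new asset; `next_version` and `promote_version` are hypothetical helper names, not SDK functions.

```python
def next_version(existing_versions):
    """Return the next integer version as a string, ignoring non-numeric versions."""
    numeric = [int(v) for v in existing_versions if str(v).isdigit()]
    return str(max(numeric, default=0) + 1)


def promote_version(ml_client, name, version_to_restore):
    """Re-register an older version's data as the newest version of `name`."""
    # Imported lazily so next_version can be used without the Azure SDK installed
    from azure.ai.ml.entities import Data

    old = ml_client.data.get(name=name, version=str(version_to_restore))
    existing = [asset.version for asset in ml_client.data.list(name=name)]
    restored = Data(
        name=name,
        version=next_version(existing),
        path=old.path,  # assumption: the stored path is valid for a new asset
        type=old.type,
        description=f"Copy of version {version_to_restore}",
    )
    return ml_client.data.create_or_update(restored)


# Usage (requires a real workspace connection):
# promote_version(ml_client, "dataset_name", 1)  # version 1's data becomes version 3
```

The original versions stay untouched, so version 2 remains available to anyone who requests it explicitly.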

Answered By: msamsami