Is there a way to log the descriptive stats of a dataset using MLflow?

Question:

Is there a way to log the descriptive stats of a dataset using MLflow? If so, could you please share the details?

Asked By: Naga Budigam


Answers:

Generally speaking, you can log arbitrary output from your code using the mlflow.log_artifact() function. From the docs:

mlflow.log_artifact(local_path, artifact_path=None)
Log a local file or directory as an artifact of the currently active run.

Parameters:
local_path – Path to the file to write.
artifact_path – If provided, the directory in artifact_uri to write to.

As an example, say you have your statistics in a pandas dataframe, stat_df.

## Write csv from stats dataframe
stat_df.to_csv('dataset_statistics.csv')

## Log CSV to MLflow
mlflow.log_artifact('dataset_statistics.csv')

This will show up under the artifacts section of this MLflow run in the Tracking UI. If you explore the docs further you’ll see that you can also log an entire directory and the objects therein. In general, MLflow provides you a lot of flexibility – anything you write to your file system you can track with MLflow. Of course that doesn’t mean you should. 🙂
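If you do not have a statistics dataframe yet, pandas can build one for you. The following is only a minimal sketch: the toy data, the column names, and the "statistic" index label are made up for illustration, and the MLflow call is left as a comment because it needs an active run.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "weight": [10.0, 20.0, 60.0]})

# describe() gives count, mean, std, min, quartiles and max per numeric column.
stat_df = df.describe()

# index_label names the first CSV column, which holds the statistic names.
stat_df.to_csv("dataset_statistics.csv", index_label="statistic")

# Inside an active MLflow run the file can then be logged as shown above:
# mlflow.log_artifact("dataset_statistics.csv")
```

By default describe() only covers numeric columns; passing include="all" also summarizes the non-numeric ones.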

Answered By: Raphael K

You can also log the statistics as an HTML file, which the MLflow UI then displays as an (ugly) table.

import seaborn as sns
import mlflow

mlflow.start_run()
df_iris = sns.load_dataset("iris")
df_iris.describe().to_html("iris.html")
mlflow.log_artifact("iris.html",
                    "stat_descriptive")
mlflow.end_run()
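If you would rather compare the statistics across runs than look at a table, one option is to flatten the describe() output into a dictionary of scalar values and log those as metrics. This is only a sketch under assumptions: stats_to_metrics is a hypothetical helper, the "column_statistic" naming scheme is arbitrary, and the toy data is made up. The "%" in the quartile labels is replaced because MLflow metric names do not accept that character.

```python
import pandas as pd


def stats_to_metrics(df: pd.DataFrame) -> dict:
    """Flatten df.describe() into {"<column>_<statistic>": value}."""
    stats = df.describe()
    metrics = {}
    for column in stats.columns:
        for stat_name, value in stats[column].items():
            # Quartile labels such as "25%" contain "%", so rename them to "25pct".
            key = f"{column}_{stat_name}".replace("%", "pct")
            metrics[key] = float(value)
    return metrics


# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({"height": [1.0, 2.0, 3.0]})
metrics = stats_to_metrics(df)

# Inside an active MLflow run:
# mlflow.log_metrics(metrics)
```

The metrics then appear in the metrics section of the run rather than under the artifacts.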


Answered By: Adrien Pacifico

As the other answers point out, MLflow can upload any local file. A better practice, though, is to write to a temporary file and upload from there.

The advantages over the accepted answer are that no files are left behind and that parallel runs do not clash over the same file name.

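import tempfile
import mlflow
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    fname = tmpdir + '/bits_corr_matrix.csv'
    np.savetxt(fname, corr_matrix, delimiter=',')  # corr_matrix: an existing 2-D array
    mlflow.log_artifact(fname)  # must run before tmpdir is deleted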
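
The same pattern works for a pandas statistics table. Below is a rough sketch, not a definitive implementation: log_stats_via_tempdir and its parameters are hypothetical names, and the logging function is passed in as an argument so that the helper can be tried out without a tracking server.

```python
import os
import tempfile

import pandas as pd


def log_stats_via_tempdir(df, log_artifact, filename="dataset_statistics.csv"):
    """Write df.describe() to a temporary CSV and pass its path to log_artifact."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, filename)
        df.describe().to_csv(path)
        log_artifact(path)  # must happen before the directory is removed
    return path  # the path no longer exists at this point
```

Inside an active run you would call it as log_stats_via_tempdir(df, mlflow.log_artifact). Each call gets its own temporary directory, so nothing is left on disk afterwards.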
Answered By: Maciej S.