How to transform my CSV file into this scikit-learn dataset format


Sorry if I don’t use the right terminology here. I have a CSV file with my own data, and I first need to transform it into another format so that another Python script can load it. An example of the target format is shown below; it’s a subset of the Iris dataset, which the example loads with:

from sklearn import datasets
data = datasets.load_iris()

Which gives me (I truncated some parts to keep it readable):

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, 0, ... 2, 2, 2]),
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': 'Iris Plants Database\n====================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n    :Summary Statistics:\n\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%[email protected])\n    :Date: July, 1988\n\nThis is a copy of UCI ML iris datasets.\n famous Iris database, first used by Sir R.A Fisher\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\nReferences\n----------\n   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...\n',
 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']}

I can produce the first ‘data’ array and the second ‘target’ one. But I’m struggling with the last part of the file containing, I believe, some dictionary tags like ‘target_names’, ‘feature_names’, ‘mean’ and some more.
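For reference, the object returned by load_iris is dictionary-like, so the available keys can be listed directly. This is only a small sketch for inspecting the structure; the exact set of keys may differ between scikit-learn versions, and the variable name iris is just an illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object behaves like a dict, so its keys can be listed.
# The exact set of keys can vary between scikit-learn versions.
print(sorted(iris.keys()))

# Each key is reachable both as a dictionary item and as an attribute.
print(iris['feature_names'])
print(iris.target_names)
```

Seeing the actual keys makes it easier to decide which ones the downstream code really needs.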

I need these tags in the rest of the code which can be found here:

And the dataset info is here:

Ideally looking for a piece of code to generate this format from my csv file.

My code so far:

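from numpy import genfromtxt

# skip_header=1 skips the header row, which would otherwise be parsed as NaN
data = genfromtxt('myfile.csv', delimiter=',', skip_header=1)
features = data[:, :3]  # first three columns are the features
targets = data[:, 3]    # fourth column is the target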

myfile.csv is just random numbers in 4 columns with headers and a few rows, just to test.

Asked By: Hugues



OK, I found a way to do this with the help of this post:
How to create my own datasets using in scikit-learn?

My iris.csv file looks like this:

....(150 rows)

And here is the code to transform this .csv file into the format I described in my original post:

import csv

import numpy as np
from sklearn.utils import Bunch  # sklearn.datasets.base is no longer importable in recent scikit-learn releases

def load_my_dataset():
    with open('iris.csv') as csv_file:
        data_file = csv.reader(csv_file)
        temp = next(data_file)  # read and discard the header row
        n_samples = 150  # number of data rows, don't count header
        n_features = 4  # number of columns for features, don't count target column
        feature_names = ['f1', 'f2', 'f3', 'f4']  # adjust accordingly
        target_names = ['t1', 't2', 't3']  # adjust accordingly
        data = np.empty((n_samples, n_features))
        target = np.empty((n_samples,), dtype=int)

        # fill the arrays row by row; the last column holds the integer class label
        for i, sample in enumerate(data_file):
            data[i] = np.asarray(sample[:-1], dtype=np.float64)
            target[i] = np.asarray(sample[-1], dtype=int)

    return Bunch(data=data, target=target, feature_names=feature_names, target_names=target_names)

data = load_my_dataset()

I agree the code could be smarter, but it works. You just need to adapt:

  • your file name
  • number of data rows, without counting header
  • number of columns for features, don’t count last target column
  • list feature names
  • list target names
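As a rough sketch of how the hard-coded values above could be avoided, the row count, column count, and feature names can be read from the file itself. This assumes the CSV has a header row and that the last column holds integer class labels. The function name load_csv_as_bunch and the class_N naming for target_names are hypothetical choices for illustration, and the demo file contains made-up numbers.

```python
import csv
import os
import tempfile

import numpy as np
from sklearn.utils import Bunch


def load_csv_as_bunch(path):
    """Load a CSV with a header row and an integer label in the last column (assumed layout)."""
    with open(path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        header = next(reader)                  # first row holds the column names
        rows = [row for row in reader if row]  # skip blank lines

    # Sizes are inferred from the rows instead of being hard-coded.
    data = np.array([row[:-1] for row in rows], dtype=np.float64)
    target = np.array([row[-1] for row in rows], dtype=int)

    # Hypothetical naming scheme: one name per distinct integer label.
    target_names = np.array([f"class_{c}" for c in np.unique(target)])
    return Bunch(data=data, target=target,
                 feature_names=header[:-1], target_names=target_names)


# Demo on a tiny throwaway file (made-up numbers, not real measurements).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("f1,f2,f3,f4,label\n5.1,3.5,1.4,0.2,0\n6.5,3.0,5.2,2.0,2\n")
demo = load_csv_as_bunch(tmp.name)
os.remove(tmp.name)

print(demo.data.shape, demo.feature_names, demo.target)
```

If the real class names are known, they can be passed in place of the generated class_N labels.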
Answered By: Hugues

Following this thread and the scikit-learn website, utils.Bunch is not a required input format. A list of Python strings is enough. I’m still working my way through this, but a pandas DataFrame should work too.
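A minimal sketch of the pandas route, under a few assumptions: the column name label, the inline CSV text (standing in for pd.read_csv('myfile.csv')), and the choice of KNeighborsClassifier are all illustrative and not from the original question. The point is only that an estimator can be fitted on a DataFrame and a Series without building a Bunch first.

```python
import io

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for pd.read_csv('myfile.csv'); the numbers are made up.
csv_text = """f1,f2,f3,label
1.0,2.0,3.0,0
1.1,2.1,2.9,0
5.0,6.0,7.0,1
5.2,5.9,7.1,1
"""
df = pd.read_csv(io.StringIO(csv_text))

X = df.drop(columns="label")  # feature columns, kept as a DataFrame
y = df["label"]               # target column, kept as a Series

# The estimator is fitted on the DataFrame directly; no Bunch is involved.
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Whether this is sufficient depends on the rest of the code: if it reads keys such as target_names or feature_names from the dataset object, a Bunch (or a plain dict with those keys) is still needed.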

Answered By: Simone