Loading Common Voice data with Hugging Face load_dataset gives an error

Question:

I am working on a voice dataset using the Facebook Hugging Face transformer, but I am unable to load the data from Common Voice:

from datasets import load_dataset, load_metric
common_voice_train = load_dataset("common_voice", "id", split="train+validation")
common_voice_test = load_dataset("common_voice", "id", split="test")

It gives the following error:

Couldn’t find file locally at common_voice/common_voice.py, or remotely at https://raw.githubusercontent.com/huggingface/datasets/1.4.1/datasets/common_voice/common_voice.py.

The file was picked from the master branch on github instead at https://raw.githubusercontent.com/huggingface/datasets/master/datasets/common_voice/common_voice.py.

How can I fix this problem?

Asked By: amad durrani


Answers:

You are using the lightweight Hugging Face datasets library to load the Common Voice dataset. The second argument ("id" in your code) is the builder configuration name, and it must be set to the code of the language you want. For instance, if you want to load the English dataset from the Common Voice corpus, the builder configuration name is en.

You can check the configuration name for each language on the Common Voice repository, where it appears as a prefix next to the version.
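If you would rather check the configuration names from code, here is a minimal sketch. It assumes that get_dataset_config_names from the datasets library can reach the dataset over the network; list_common_voice_configs and resolve_config are hypothetical helper names used only for illustration.

```python
def list_common_voice_configs(dataset_path):
    """Return the builder configuration names published for a dataset.

    Needs network access; depending on the dataset and your datasets
    version, it may also need Hub authentication.
    """
    # Imported lazily so the pure helper below works without the library.
    from datasets import get_dataset_config_names
    return get_dataset_config_names(dataset_path)


def resolve_config(language_code, available):
    """Return language_code if it is a known configuration name, else raise."""
    if language_code not in available:
        raise ValueError(
            f"{language_code!r} is not a configuration name; "
            f"choose one of {sorted(available)}"
        )
    return language_code


# Example usage (needs network access):
# configs = list_common_voice_configs("common_voice")
# language = resolve_config("id", configs)
```

Validating the code before calling load_dataset turns a mistyped language code into a readable error instead of a failed download.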

Answered By: Asim Bakhshi

You have to give the full path of the dataset, together with the language id, such as "en" for English. For example:

common_voice_train = load_dataset("mozilla-foundation/common_voice_15_0", "en", split="train+validation")

Here I used "common_voice_15_0" as an example. Choose the version that is compatible with your setup.
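To make switching versions easier, the path can be built from the version number. This is only a sketch: the "mozilla-foundation/common_voice_<major>_<minor>" naming pattern is inferred from the example path above, and common_voice_repo_id and load_common_voice are hypothetical helper names. Access to these datasets may also require accepting their terms on the Hub and logging in.

```python
def common_voice_repo_id(version):
    """Build a dataset path such as 'mozilla-foundation/common_voice_15_0' from '15.0'.

    The naming pattern is an assumption inferred from the example path.
    """
    major, minor = version.split(".")
    return f"mozilla-foundation/common_voice_{major}_{minor}"


def load_common_voice(version, language, split):
    """Load one language of a given Common Voice version (needs network access)."""
    # Imported lazily so the path helper above works without the library.
    from datasets import load_dataset
    return load_dataset(common_voice_repo_id(version), language, split=split)


# Example usage (needs network access, and possibly Hub authentication):
# common_voice_train = load_common_voice("15.0", "id", "train+validation")
# common_voice_test = load_common_voice("15.0", "id", "test")
```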

Answered By: mahmoud khaled