Create and Push a Dataset to Hugging Face

Recently I was going over the process of fine-tuning a Stable Diffusion model and came across instruction tuning. Instruction tuning is a method of fine-tuning the model by giving it an instruction text prompt, an original image, and a desired output image.

To learn more about instruction tuning, you can read this blog from Hugging Face.

Now, to fine-tune the model using the instruction tuning method, we need a dataset on Hugging Face that can be fetched automatically during training. In this blog I will show you how to upload your image data, or any other type of data, to Hugging Face.

Step 1: Creating the dataset on Hugging Face

Start by creating your dataset on Hugging Face: click your profile in the top right corner, and then click New Dataset.
After creating your dataset, go to the Settings tab, make sure the dataset visibility is set to public, and enable Access Requests.

Step 2: Create the dataset Structure

To add data to your dataset, start by logging in to Hugging Face using huggingface-cli login
Now install the datasets library using pip install datasets

For this blog, I will show you how to add an image-text pair dataset. My dataset consists of original input images and desired output images. I started by putting my input and output images in separate folders.

Start by creating a list of (input path, output path) tuples, one per image pair.

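import os

input_dir = "/path/to/input_images"
output_dir = "/path/to/output_images"

data_paths = []

# os.listdir returns names in arbitrary order, so sort both listings to keep
# pairs aligned (this assumes matching files sort to the same position).
for input_img, output_img in zip(sorted(os.listdir(input_dir)), sorted(os.listdir(output_dir))):
    data_paths.append((os.path.join(input_dir, input_img), os.path.join(output_dir, output_img)))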
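
The loop above pairs files purely by their position in the two listings, so a single missing or extra file shifts every pair after it. As a hedged alternative, here is a minimal sketch that pairs files by name instead. It assumes each input image and its output image share the same filename (ignoring the extension), which is my assumption and not something the dataset above guarantees. The helper name pair_by_stem is hypothetical.

```python
from pathlib import Path


def pair_by_stem(input_dir, output_dir):
    """Pair files from two folders whose names match, ignoring the extension."""
    # Map each file stem (name without extension) to its full path.
    inputs = {p.stem: p for p in Path(input_dir).iterdir() if p.is_file()}
    outputs = {p.stem: p for p in Path(output_dir).iterdir() if p.is_file()}

    # Keep only stems present in both folders, in a stable sorted order.
    shared = sorted(inputs.keys() & outputs.keys())
    # Stems that appear in only one of the two folders have no partner.
    unmatched = sorted(inputs.keys() ^ outputs.keys())

    pairs = [(str(inputs[s]), str(outputs[s])) for s in shared]
    return pairs, unmatched
```

The returned pairs list has the same shape as data_paths, and unmatched tells you which files were left out so you can fix the folders before building the dataset.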

Now we will create a generator function that yields our image pairs.

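def data_generator(data_paths):
    # Return a zero-argument generator function, which is what we pass to Dataset.from_generator.
    def fn():
        for data_path in data_paths:
            yield {
                "input_image": {"path": data_path[0]},
                "instruct_prompt": "instruction prompt",  # placeholder prompt text
                "output_image": {"path": data_path[1]},
            }

    return fn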

The format of each yielded record should match the structure you want in your Hugging Face dataset.
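
Since a malformed record only shows up once the dataset is being built, it can help to check the records first. The following is a rough sketch, not part of the datasets library: the helper find_bad_records and the EXPECTED_KEYS constant are names I made up, and the check assumes the three keys used in this blog.

```python
import os

# Keys each record is expected to have, matching the generator in this blog.
EXPECTED_KEYS = {"input_image", "instruct_prompt", "output_image"}


def find_bad_records(records):
    """Return (index, reason) for every record that does not fit the expected shape."""
    problems = []
    for i, record in enumerate(records):
        # Iterating a dict yields its keys, so set(record) is the set of keys.
        if set(record) != EXPECTED_KEYS:
            problems.append((i, "unexpected keys"))
            continue
        # Both image entries must point at a file that exists on disk.
        for key in ("input_image", "output_image"):
            if not os.path.isfile(record[key]["path"]):
                problems.append((i, f"missing file for {key}"))
    return problems
```

You could call it as find_bad_records(data_generator(data_paths)()) and only continue when it returns an empty list.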

Now we use the datasets library to create the dataset.

from datasets import Dataset, Features, Value
from datasets import Image as ImageFeature

data = data_generator(data_paths)
ds = Dataset.from_generator(
    data,
    features=Features(
        input_image=ImageFeature(),
        instruct_prompt=Value("string"),
        output_image=ImageFeature(),
    ),
)

We pass the generator function and specify the type of each feature.

Step 3: Push the data to Hugging Face

Now we simply push the data to Hugging Face.

ds.push_to_hub("{username}/{dataset_name}", token="{hf_token}")
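
Pasting the token straight into the script makes it easy to leak by accident, for example when the script is committed to a repository. One possible way around that, sketched below, is to read the token from an environment variable. The helper name push_with_env_token and the default variable name HF_TOKEN are my own choices for illustration; the only library call used is the same ds.push_to_hub shown above.

```python
import os


def push_with_env_token(ds, repo_id, env_var="HF_TOKEN"):
    """Push `ds` to `repo_id`, reading the access token from an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    # Same call as before; only the source of the token changes.
    ds.push_to_hub(repo_id, token=token)
```

After exporting the variable in your shell, you would call push_with_env_token(ds, "{username}/{dataset_name}").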

Congrats 🎉, we have successfully created a dataset on Hugging Face.

Thanks for reading!!