Create and Push dataset to Hugging Face
Author: Shivansh Yadav (@_shivansh_13)
Recently I was going over the process of fine-tuning a Stable Diffusion model and came across instruction tuning. Instruction tuning is a method of fine-tuning the model by giving it an instruction text prompt, an original image, and a desired output image.
To learn more about instruction tuning, you can read this blog from Hugging Face.
Now, to fine-tune the model using the instruction tuning method, we need a dataset on Hugging Face that can be fetched automatically during training. In this blog I will show you how to upload your image data, or any other type of data, to Hugging Face.
Step 1: Creating the dataset on Hugging Face
Start by creating your dataset on Hugging Face: click your profile in the top right corner, then click New Dataset.
After creating your dataset, go to the Settings tab, make sure the dataset visibility is set to public, and enable Access Requests.
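If you would rather script this step than click through the website, the repository can also be created from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and you are already authenticated (for example with the login command from Step 2). The helper name `create_dataset_repo` and its `api` parameter are hypothetical and only here for illustration. The sketch does not enable Access Requests, so that part still happens in the Settings tab.

```python
def create_dataset_repo(repo_id, api=None):
    """Create a public dataset repository on the Hub.

    `api` is any object with a `create_repo` method. By default a real
    `huggingface_hub.HfApi` client is used; it is imported lazily so this
    helper can be defined without the package installed.
    """
    if api is None:
        from huggingface_hub import HfApi  # assumes `pip install huggingface_hub`
        api = HfApi()
    return api.create_repo(
        repo_id=repo_id,
        repo_type="dataset",  # a dataset repo, not a model repo
        private=False,        # public visibility, as in the Settings tab
        exist_ok=True,        # do not fail if the repo already exists
    )


# Example call (placeholders, as elsewhere in this post):
# create_dataset_repo("{username}/{dataset_name}")
```

Passing the client in as `api` is just a convenience so the helper can be tried with a stand-in object before touching the real Hub.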
Step 2: Create the dataset structure
- To add data to your dataset, start by logging in to Hugging Face using
  huggingface-cli login
- Now install the datasets library using
  pip install datasets
For this blog, I will show you how to add an image-text pair dataset. My dataset consists of original input images and desired output images. I started by putting my input and output images in separate folders.
Start by creating a list of path pairs, one for each input/output image pair.
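import os

# os.listdir returns names in arbitrary order, so sort both listings to keep
# each input image aligned with its output image. This assumes the two
# folders use matching file names.
data_paths = []
for input_img, output_img in zip(sorted(os.listdir("/path/to/input_images")), sorted(os.listdir("/path/to/output_images"))):
    data_paths.append((f"/path/to/input_images/{input_img}", f"/path/to/output_images/{output_img}"))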
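Pairing by position only works if both folders list their files in the same order and contain the same number of images. If your input and output images share a file name (ignoring the extension), a slightly safer option is to pair them by name. This is a minimal sketch under that assumption; `pair_by_stem` is a hypothetical helper, not part of any library, and files without a counterpart are silently skipped.

```python
from pathlib import Path


def pair_by_stem(input_dir, output_dir):
    """Return sorted (input_path, output_path) string pairs for files whose
    names match once the extension is removed."""
    inputs = {p.stem: p for p in Path(input_dir).iterdir() if p.is_file()}
    outputs = {p.stem: p for p in Path(output_dir).iterdir() if p.is_file()}
    # Keep only the stems that exist in both folders, in a stable order.
    shared = sorted(inputs.keys() & outputs.keys())
    return [(str(inputs[stem]), str(outputs[stem])) for stem in shared]


# Example call (same placeholder folders as above):
# data_paths = pair_by_stem("/path/to/input_images", "/path/to/output_images")
```

The result has the same shape as `data_paths`, a list of `(input, output)` tuples, so it can be used in the rest of the steps unchanged.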
Now we will create a generator function that yields our image pairs.
def data_generator(data_paths):
    def fn():
        for data_path in data_paths:
            yield {
                "input_image": {"path": data_path[0]},
                "instruct_prompt": "instruction prompt",
                "output_image": {"path": data_path[1]},
            }
    return fn
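The generator above yields the same placeholder prompt for every pair. If each pair has its own instruction, one option is to look the prompt up while yielding. This is a sketch only: the `prompts` dictionary (input image path to prompt) and the `default_prompt` argument are hypothetical names I am introducing for illustration.

```python
def data_generator_with_prompts(data_paths, prompts, default_prompt="instruction prompt"):
    """Return a zero-argument generator function, like data_generator, but
    with a per-pair prompt looked up by input image path."""
    def fn():
        for input_path, output_path in data_paths:
            yield {
                "input_image": {"path": input_path},
                "instruct_prompt": prompts.get(input_path, default_prompt),
                "output_image": {"path": output_path},
            }
    return fn


# Quick check without touching any image files: the generator only builds
# dictionaries, so the paths do not need to exist yet.
pairs = [("in/001.png", "out/001.png"), ("in/002.png", "out/002.png")]
gen = data_generator_with_prompts(pairs, {"in/001.png": "make it snowy"})
first = next(gen())
print(first["instruct_prompt"])  # make it snowy
```

The outer function returns `fn` rather than calling it because the dataset is built from a callable, which lets the library start the generator itself.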
The format of the yielded data should match the structure you want in your Hugging Face dataset.
Now we use the datasets library to create the dataset.
from datasets import Dataset, Features
from datasets import Image as ImageFeature
from datasets import Value
data = data_generator(data_paths)
ds = Dataset.from_generator(
    data,
    features=Features(
        input_image=ImageFeature(),
        instruct_prompt=Value("string"),
        output_image=ImageFeature(),
    ),
)
We pass the generator function and specify the type of each feature.
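Since the yielded keys have to line up with the declared features, it can help to check the records before building the dataset, because a mismatch or a missing file otherwise tends to show up only partway through the build. Below is a rough pre-flight check using only the standard library. `check_records` is a hypothetical helper, and it assumes image columns are yielded as `{"path": ...}` dictionaries like in this post.

```python
import os


def check_records(gen_fn, expected_columns, image_columns=()):
    """Run the generator once and collect human-readable problems."""
    problems = []
    expected = set(expected_columns)
    for i, record in enumerate(gen_fn()):
        if set(record) != expected:
            problems.append(f"record {i}: columns {sorted(record)} != {sorted(expected)}")
            continue  # skip the file checks, the columns may be missing
        for col in image_columns:
            path = record[col]["path"]
            if not os.path.isfile(path):
                problems.append(f"record {i}: {col} file not found: {path}")
    return problems


# Example call with the names used in this post:
# problems = check_records(
#     data,
#     ["input_image", "instruct_prompt", "output_image"],
#     image_columns=["input_image", "output_image"],
# )
# An empty list means every record has the expected columns and files.
```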
Step 3: Push the data to Hugging Face
Now we simply push the data to Hugging Face.
ds.push_to_hub("{username}/{dataset_name}", token="{hf_token}")
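Hardcoding a token in a script makes it easy to leak by accident, so one alternative is to read it from an environment variable. The sketch below assumes the token lives in a variable named `HF_TOKEN` (the name is configurable here, and both helper names are hypothetical). When no token is passed, `push_to_hub` should fall back to the credentials saved by `huggingface-cli login`. The second helper reloads the dataset as a sanity check; it needs network access and assumes the data was pushed as the default `train` split.

```python
import os


def push_with_env_token(ds, repo_id, env_var="HF_TOKEN"):
    """Push `ds` to the Hub, taking the token from an environment variable.

    If the variable is unset, token=None is passed and the locally saved
    login credentials are expected to be used instead.
    """
    token = os.environ.get(env_var)
    return ds.push_to_hub(repo_id, token=token)


def load_back(repo_id, split="train"):
    """Reload the pushed dataset to confirm the upload (needs network access)."""
    from datasets import load_dataset  # lazy import; assumes `datasets` is installed
    return load_dataset(repo_id, split=split)


# Example calls (placeholders, as above):
# push_with_env_token(ds, "{username}/{dataset_name}")
# print(load_back("{username}/{dataset_name}"))
```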
Congo 🎉, we have successfully created a dataset on Hugging Face.
Thanks for reading!!