Dataset Preparation

This guide walks you through a simple, step-by-step process for preparing your dataset for fine-tuning in our no-code Finetuner UI.

Once you have chosen an LLM and are on the Dataset Preparation page:

  1. Click the Select a Task dropdown and choose the type of task you are training for.

  2. Click the Choose HuggingFace Dataset dropdown. You can pick a pre-specified HuggingFace dataset, specify your own HuggingFace dataset by selecting the "other" option, or select a dataset uploaded through the 'Managed Dataset' service.

If you select a pre-specified HuggingFace dataset, no dataset preparation is required; simply click "Next" to proceed.

If you chose the "other" option and imported a HuggingFace dataset through a link, or selected a dataset from 'My Datasets':

Scroll to the Prompt Configuration window and replace the placeholders inside the square brackets with the names of the columns you want to use for fine-tuning.

For example, if our dataset looks like this:

We will replace [instruction column name] and [response column name] with [prompt] and [completion] respectively, i.e. the names of the target columns in our dataset.
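The substitution described above can be sketched in a few lines of Python. This is only an illustration: the template text below is a hypothetical stand-in (the actual wording in the Prompt Configuration window may differ), and the helper name fill_placeholders is made up for this example. Only the two placeholder names and the two column names come from the example above.

```python
# Hypothetical template; the real text in the Prompt Configuration
# window may be worded differently.
TEMPLATE = (
    "### Instruction: [instruction column name]\n"
    "### Response: [response column name]"
)


def fill_placeholders(template: str, columns: dict) -> str:
    """Swap each bracketed placeholder for the bracketed column name."""
    for placeholder, column in columns.items():
        template = template.replace(f"[{placeholder}]", f"[{column}]")
    return template


updated = fill_placeholders(
    TEMPLATE,
    {
        "instruction column name": "prompt",
        "response column name": "completion",
    },
)
print(updated)
```

The square brackets stay in place; only the text inside them changes.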

Our updated Prompt Configuration window would look like this after making the changes:

And we are done! Our fine-tuning solution will take care of the rest.

Simply click "Next" and finalize your fine-tuning job request.

Other examples for reference

Classification:

Let's train an LLM to classify spam SMS messages using a HuggingFace dataset.

First, we select a task called "Text Classification".

We can then choose the "other" option to specify our own HuggingFace dataset.

Let us specify a dataset called "sms_spam" for text classification:

This is what the "sms_spam" dataset looks like, with the input text in the 'sms' column and the classification label in the 'label' column:

This is how we specify the target columns in the dataset preparation window (we replaced the placeholders in square brackets with the target column names of our dataset):
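To build intuition for what the bracketed column names do, here is a minimal sketch that renders one training example from a single row. It assumes that each [column] in the template is replaced by that row's value, which is our reading of the UI and not a documented description of the Finetuner's internals. The template wording, the sample row, and the helper name render_row are all hypothetical; only the column names 'sms' and 'label' come from the dataset described above.

```python
import re


def render_row(template: str, row: dict) -> str:
    """Replace every [column] in the template with row[column]."""

    def lookup(match):
        column = match.group(1)
        if column not in row:
            raise KeyError(f"column {column!r} not found in row")
        return str(row[column])

    # Match any text between a pair of square brackets.
    return re.sub(r"\[([^\[\]]+)\]", lookup, template)


# Hypothetical template and made-up row, for illustration only.
TEMPLATE = "Message: [sms]\nLabel: [label]"
row = {"sms": "You have won a free prize, reply now!", "label": 1}

rendered = render_row(TEMPLATE, row)
print(rendered)
```

A misspelled column name raises a KeyError here, which is the kind of mistake worth catching before you submit a job.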

Summary Generation with custom dataset:

First, we select a task called "Summary Generator" and then either choose a pre-specified dataset or choose the "other" option to specify our own HuggingFace dataset.

Let us specify a dataset called "xsum" for summary generation:

This is what the "xsum" dataset looks like, with the 'document' column containing the content to be summarized and the 'summary' column containing the summary to be generated:

This is how we specify the target columns in the dataset preparation window (we replaced the placeholders in square brackets with the target column names of our dataset):

That's all you need to do to prepare your dataset for fine-tuning on MonsterAPI.

Once you have confirmed that all the column names match your chosen HuggingFace dataset, click "Next" to proceed to the next step, where you'll define the hyperparameters.
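If you prefer to double-check column names outside the UI, the confirmation step above can be approximated with a small script. This is a sketch under stated assumptions: the template strings and the helper name missing_columns are hypothetical, and the column list is typed in by hand from the "xsum" example above.

```python
import re


def missing_columns(template: str, dataset_columns: list) -> list:
    """Return bracketed names in the template that are not dataset columns."""
    referenced = re.findall(r"\[([^\[\]]+)\]", template)
    return [name for name in referenced if name not in dataset_columns]


# Column names taken from the "xsum" example in this guide.
XSUM_COLUMNS = ["document", "summary"]

# Hypothetical templates: one correct, one with a typo in a column name.
good = missing_columns("Summarize: [document]\nSummary: [summary]", XSUM_COLUMNS)
bad = missing_columns("Summarize: [document]\nSummary: [summry]", XSUM_COLUMNS)

print(good)
print(bad)
```

An empty list means every bracketed name matches a column. If you have the HuggingFace datasets library installed, the column_names attribute of a loaded split gives you the list to compare against instead of typing it by hand.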

If you have any questions, don't hesitate to reach out to us at support@monsterapi.ai
