Finetune

Dataset

```python
Dataset(path: Path | str | None)
```

A dataset for finetuning LLMs to generate synthetic data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `path` | `Path \| str \| None` | The path to the dataset on disk. If `None`, the dataset is stored in a file inside a newly created temporary directory. | *required* |

append

```python
append(task: BaseTask, data: RelationalData) -> None
```

Append data to the dataset.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `task` | `BaseTask` | The task to append the data for. | *required* |
| `data` | `RelationalData` | The data to be appended. | *required* |

load

```python
load() -> Dataset
```

Load the dataset as a Hugging Face `datasets.Dataset`.

Returns:

| Type | Description |
|------|-------------|
| `Dataset` | The Hugging Face `datasets.Dataset`. |

load_prompt_completion

```python
load_prompt_completion(
    prompt_template: str | Callable[[DatasetElem], str],
    eos_token: str = "",
    tokenizer: PreTrainedTokenizer | None = None,
    tokenize: bool = False,
    completion_key: str = "out",
    max_length: int | None = None,
    enforce_max_len: bool = True,
) -> Dataset
```

Load the dataset as a Hugging Face dataset, map it to prompt-completion format, and (optionally) tokenize it.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `prompt_template` | `str \| Callable[[DatasetElem], str]` | Either a string to be formatted into the prompts, or a callable returning the prompt from the JSON representation of a `DatasetElem`. If a string, it may contain the fields of `DatasetElem` as keys. | *required* |
| `eos_token` | `str` | An optional EOS token appended to the end of the completion. Should not be given if `tokenizer` is provided. | `''` |
| `tokenizer` | `PreTrainedTokenizer \| None` | A Hugging Face tokenizer. If given, its EOS token is appended to the completion, and it is used to optionally tokenize the dataset. | `None` |
| `tokenize` | `bool` | Whether to tokenize the dataset. If `True`, the trainer will skip the tokenization step. | `False` |
| `completion_key` | `str` | The field of `DatasetElem` to be used as the completion. | `'out'` |
| `max_length` | `int \| None` | The maximum length, in tokens, of a single example. | `None` |
| `enforce_max_len` | `bool` | Whether to raise an error when the token count of a single example exceeds `max_length`. If `False`, only a warning is issued. | `True` |

Returns:

| Type | Description |
|------|-------------|
| `Dataset` | The Hugging Face `datasets.Dataset` in prompt-completion form. |
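To make the `prompt_template` and `eos_token` semantics concrete, here is a minimal sketch in plain Python of how one element could be mapped to prompt-completion form. The element fields (`inp`, `out`) and the helper name `to_prompt_completion` are illustrative assumptions, not part of this library's API; the real `DatasetElem` fields may differ.

```python
def to_prompt_completion(elem, prompt_template, completion_key="out", eos_token=""):
    # A string template is formatted with the element's fields as keys;
    # a callable receives the element and returns the prompt directly.
    if callable(prompt_template):
        prompt = prompt_template(elem)
    else:
        prompt = prompt_template.format(**elem)
    # The completion is the chosen field, with the optional EOS appended.
    completion = elem[completion_key] + eos_token
    return {"prompt": prompt, "completion": completion}

elem = {"inp": "three primes", "out": "2, 3, 5"}

# String template: element fields are available as format keys.
ex1 = to_prompt_completion(elem, "List {inp}.", eos_token="</s>")

# Callable template: full control over prompt construction.
ex2 = to_prompt_completion(elem, lambda e: f"Q: {e['inp']}\nA:")
```

When a `tokenizer` is passed to `load_prompt_completion`, its own EOS token plays the role of `eos_token` here, which is why the two should not be combined.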

FtConfig

```python
FtConfig(
    dataset_path: str | Path | None,
    prompt_template: str | Callable[[DatasetElem], str],
    enforce_max_len: bool = True,
    valid_frac: Annotated[float, Field(gt=0, lt=1)] = 0.1,
    early_stop: NonNegativeInt = 0,
    dump_n_examples: NonNegativeInt = 10,
    best_ckpt: str = "best",
)
```

Finetune configuration (a pydantic config model).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset_path` | `str \| Path \| None` | The path to the dataset. If the dataset has to be created and the path is `None`, the dataset is saved in a temporary directory. | *required* |
| `prompt_template` | `str \| Callable[[DatasetElem], str]` | The template for the prompt. It may contain the fields of `DatasetElem` as keys. | *required* |
| `enforce_max_len` | `bool` | Whether to enforce the model's maximum length during dataset loading (in prompt-completion format). | `True` |
| `valid_frac` | `Annotated[float, Field(gt=0, lt=1)]` | Fraction of the data to be used for validation. | `0.1` |
| `early_stop` | `NonNegativeInt` | Early stopping patience. | `0` |
| `dump_n_examples` | `NonNegativeInt` | Number of dataset examples to dump for inspection. | `10` |
| `best_ckpt` | `str` | Name of the best checkpoint. | `'best'` |
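To illustrate what `valid_frac` means in practice, the sketch below splits a list of examples, holding out roughly that fraction for validation. This is plain Python with no library dependencies; the tail-split strategy shown here is an assumption, and the real implementation may shuffle or stratify first.

```python
import math

def split(examples, valid_frac=0.1):
    # Hold out roughly valid_frac of the examples for validation
    # (at least one, since valid_frac is constrained to (0, 1)).
    n_valid = max(1, math.floor(len(examples) * valid_frac))
    return examples[:-n_valid], examples[-n_valid:]

# With 100 examples and the default valid_frac=0.1,
# 90 examples train and 10 validate.
train, valid = split(list(range(100)), valid_frac=0.1)
```

The validation split is what `early_stop` monitors: training stops once the validation metric fails to improve for that many evaluations.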

finetune

```python
finetune(
    cfg_ft: FtConfig,
    cfg_model: ModelConfig,
    cfg_sft: SFTConfig,
    unsloth: bool = False,
) -> None
```

Finetune an LLM using a given `Dataset`.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `cfg_ft` | `FtConfig` | The finetune configuration. | *required* |
| `cfg_model` | `ModelConfig` | The model configuration. | *required* |
| `cfg_sft` | `SFTConfig` | The trainer configuration. | *required* |
| `unsloth` | `bool` | Whether to use Unsloth to perform the finetuning. If `False`, the standard Hugging Face finetuning is used. | `False` |