Finetune

Dataset

```python
Dataset(path: Path | str | None)
```

A dataset for finetuning LLMs to generate synthetic data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `path` | `Path \| str \| None` | The path to the dataset on disk. If `None`, the dataset is stored in a file inside a newly created temporary directory. | *required* |

append

```python
append(task: BaseTask, data: RelationalData) -> None
```

Append data to the dataset.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `task` | `BaseTask` | The task to append the data for. | *required* |
| `data` | `RelationalData` | The data to be appended. | *required* |

load

```python
load() -> Dataset
```

Load the dataset as a Hugging Face `datasets.Dataset`.

Returns:

| Type | Description |
|------|-------------|
| `Dataset` | The Hugging Face `datasets.Dataset`. |

load_prompt_completion

```python
load_prompt_completion(
    prompt_template: str | Callable[[DatasetElem], str],
    eos_token: str = "",
    tokenizer: PreTrainedTokenizer | None = None,
    tokenize: bool = False,
    completion_key: str = "out",
    max_length: int | None = None,
    enforce_max_len: bool = True,
) -> Dataset
```

Load the dataset as a Hugging Face dataset, map it to prompt-completion format, and (optionally) tokenize it.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `prompt_template` | `str \| Callable[[DatasetElem], str]` | Either a string to be formatted into the prompts, or a callable returning the prompt from the JSON representation of a `DatasetElem`. If a string, it may contain the fields of `DatasetElem` as keys. | *required* |
| `eos_token` | `str` | An optional EOS token appended to the end of the completion. Should not be given if `tokenizer` is provided. | `''` |
| `tokenizer` | `PreTrainedTokenizer \| None` | A Hugging Face tokenizer. If given, its EOS token is appended to the completion, and it is used to optionally tokenize the dataset. | `None` |
| `tokenize` | `bool` | Whether to tokenize the dataset. If `True`, the trainer will skip the tokenization step. | `False` |
| `completion_key` | `str` | The field of `DatasetElem` to be used as the completion. | `'out'` |
| `max_length` | `int \| None` | The maximum length, in tokens, of a single example. | `None` |
| `enforce_max_len` | `bool` | Whether to raise an error when the token count of a single example exceeds `max_length`. If `False`, only a warning is issued. | `True` |

Returns:

| Type | Description |
|------|-------------|
| `Dataset` | The Hugging Face `datasets.Dataset` in prompt-completion form. |
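To make the `prompt_template` and `eos_token` semantics concrete, here is a minimal sketch in plain Python of how one element could be mapped to prompt-completion form. The element fields (`inp`, `out`) and the helper name `to_prompt_completion` are illustrative assumptions, not part of this library's API; the real `DatasetElem` fields may differ.

```python
def to_prompt_completion(elem, prompt_template, completion_key="out", eos_token=""):
    # A string template is formatted with the element's fields as keys;
    # a callable receives the element and returns the prompt directly.
    if callable(prompt_template):
        prompt = prompt_template(elem)
    else:
        prompt = prompt_template.format(**elem)
    # The completion is the chosen field, with the optional EOS appended.
    completion = elem[completion_key] + eos_token
    return {"prompt": prompt, "completion": completion}

elem = {"inp": "three primes", "out": "2, 3, 5"}

# String template: element fields are available as format keys.
ex1 = to_prompt_completion(elem, "List {inp}.", eos_token="</s>")

# Callable template: full control over prompt construction.
ex2 = to_prompt_completion(elem, lambda e: f"Q: {e['inp']}\nA:")
```

When a `tokenizer` is passed to `load_prompt_completion`, its own EOS token plays the role of `eos_token` here, which is why the two should not be combined.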

FtConfig

```python
FtConfig(
    dataset_path: str | Path | None,
    prompt_template: str | Callable[[DatasetElem], str],
    enforce_max_len: bool = True,
    valid_frac: Annotated[float, Field(gt=0, lt=1)] = 0.1,
    early_stop: NonNegativeInt = 0,
    dump_n_examples: NonNegativeInt = 10,
    best_ckpt: str = "best",
)
```

Finetune configuration (a pydantic config model).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset_path` | `str \| Path \| None` | The path to the dataset. If the dataset has to be created and the path is `None`, the dataset is saved in a temporary directory. | *required* |
| `prompt_template` | `str \| Callable[[DatasetElem], str]` | The template for the prompt. It may contain the fields of `DatasetElem` as keys. | *required* |
| `enforce_max_len` | `bool` | Whether to enforce the model's maximum length during dataset loading (in prompt-completion format). | `True` |
| `valid_frac` | `Annotated[float, Field(gt=0, lt=1)]` | Fraction of the data to be used for validation. | `0.1` |
| `early_stop` | `NonNegativeInt` | Early stopping patience. | `0` |
| `dump_n_examples` | `NonNegativeInt` | Number of dataset examples to dump for inspection. | `10` |
| `best_ckpt` | `str` | Name of the best checkpoint. | `'best'` |
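To illustrate what `valid_frac` means in practice, the sketch below splits a list of examples, holding out roughly that fraction for validation. This is plain Python with no library dependencies; the tail-split strategy shown here is an assumption, and the real implementation may shuffle or stratify first.

```python
import math

def split(examples, valid_frac=0.1):
    # Hold out roughly valid_frac of the examples for validation
    # (at least one, since valid_frac is constrained to (0, 1)).
    n_valid = max(1, math.floor(len(examples) * valid_frac))
    return examples[:-n_valid], examples[-n_valid:]

# With 100 examples and the default valid_frac=0.1,
# 90 examples train and 10 validate.
train, valid = split(list(range(100)), valid_frac=0.1)
```

The validation split is what `early_stop` monitors: training stops once the validation metric fails to improve for that many evaluations.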

finetune

```python
finetune(
    cfg_ft: FtConfig,
    cfg_model: ModelConfig,
    cfg_sft: SFTConfig,
    unsloth: bool = False,
) -> None
```

Finetune an LLM using a given `Dataset`.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `cfg_ft` | `FtConfig` | The finetune configuration. | *required* |
| `cfg_model` | `ModelConfig` | The model configuration. | *required* |
| `cfg_sft` | `SFTConfig` | The trainer configuration. | *required* |
| `unsloth` | `bool` | Whether to use Unsloth to perform the finetuning. If `False`, the standard Hugging Face finetuning is used. | `False` |