Finetune
Dataset
Dataset to finetune LLMs on generating synthetic data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str \| None` | The path to the dataset on disk. If None, the dataset will be stored to a file in a newly created temporary directory. | required |
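For example, a minimal construction sketch (the import path of Dataset depends on the package layout and is omitted here):

```python
from pathlib import Path

# Dataset is the class documented above; import it from wherever the package exposes it.
ds = Dataset(path=Path("data/finetune_dataset"))  # store the dataset file at an explicit path

tmp_ds = Dataset(path=None)  # store it in a file inside a newly created temporary directory
```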
append
append(task: BaseTask, data: RelationalData) -> None
Append data to the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `BaseTask` | The task to append the data for. | required |
| `data` | `RelationalData` | The data to be appended. | required |
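A hedged sketch of populating the dataset; `task` and `data` are assumed to be existing BaseTask and RelationalData instances (their construction is library-specific and not shown):

```python
# task: BaseTask        -- the task the data was generated for
# data: RelationalData  -- the relational data to append for that task
ds.append(task, data)   # append the data for this task to the dataset on disk
```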
load
load() -> Dataset
Load the dataset as a Hugging Face datasets.Dataset.
Returns:
| Type | Description |
|---|---|
| `Dataset` | The Hugging Face `datasets.Dataset`. |
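Once populated, the dataset can be materialized, for example:

```python
hf_ds = ds.load()  # a datasets.Dataset
print(len(hf_ds), hf_ds.column_names)
```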
load_prompt_completion
load_prompt_completion(
prompt_template: str | Callable[[DatasetElem], str],
eos_token: str = "",
tokenizer: PreTrainedTokenizer | None = None,
tokenize: bool = False,
completion_key: str = "out",
max_length: int | None = None,
enforce_max_len: bool = True,
) -> Dataset
Load the dataset as a Hugging Face dataset, map it to prompt-completion format and (optionally) tokenize it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt_template` | `str \| Callable[[DatasetElem], str]` | Either a string to be formatted to get the prompts, or a callable returning the prompt from the JSON representation of a dataset element. | required |
| `eos_token` | `str` | An optional EOS token to add to the end of the completion. Should not be given if the tokenizer is provided. | `''` |
| `tokenizer` | `PreTrainedTokenizer \| None` | A Hugging Face tokenizer. If given, its EOS token is added at the end of the completion, and it is used to optionally tokenize the dataset. | `None` |
| `tokenize` | `bool` | Whether to tokenize the dataset. If True, the trainer will skip the tokenization step. | `False` |
| `completion_key` | `str` | The field of the dataset element to use as the completion. | `'out'` |
| `max_length` | `int \| None` | The maximum length in tokens of a single example. | `None` |
| `enforce_max_len` | `bool` | Whether to raise an error when the number of tokens of a single example exceeds `max_length`. | `True` |
Returns:
| Type | Description |
|---|---|
| `Dataset` | The Hugging Face dataset in prompt-completion format. |
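A hedged usage sketch; the prompt-template key `table_spec` is hypothetical and stands in for whatever fields your dataset elements actually expose:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # any Hugging Face tokenizer

hf_ds = ds.load_prompt_completion(
    prompt_template="Generate synthetic rows for:\n{table_spec}\n",  # "table_spec" is a hypothetical field
    tokenizer=tokenizer,    # its EOS token is appended to each completion
    tokenize=True,          # pre-tokenize so the trainer can skip that step
    max_length=2048,
    enforce_max_len=False,  # do not raise if an example exceeds max_length
)
```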
FtConfig
FtConfig(
dataset_path: str | Path | None,
prompt_template: str | Callable[[DatasetElem], str],
enforce_max_len: bool = True,
valid_frac: Annotated[float, Field(gt=0, lt=1)] = 0.1,
early_stop: NonNegativeInt = 0,
dump_n_examples: NonNegativeInt = 10,
best_ckpt: str = "best",
)
Finetune configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_path` | `str \| Path \| None` | The path to the dataset. If the dataset has to be created and the path is set to None, the dataset is saved in a temporary directory. | required |
| `prompt_template` | `str \| Callable[[DatasetElem], str]` | The template for the prompt. It may contain as keys the fields of the dataset elements. | required |
| `enforce_max_len` | `bool` | Whether to enforce the max length coming from the model during dataset loading (in the prompt-completion format). | `True` |
| `valid_frac` | `Annotated[float, Field(gt=0, lt=1)]` | Fraction of the data to be used for validation. | `0.1` |
| `early_stop` | `NonNegativeInt` | Early stopping patience. | `0` |
| `dump_n_examples` | `NonNegativeInt` | Number of dataset examples to dump for inspection. | `10` |
| `best_ckpt` | `str` | Name of the best checkpoint. | `'best'` |
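A possible configuration with illustrative values (the `{table_spec}` key is hypothetical and must match the fields of your dataset elements):

```python
cfg_ft = FtConfig(
    dataset_path="data/finetune_dataset",
    prompt_template="Generate synthetic rows for:\n{table_spec}\n",
    valid_frac=0.2,     # hold out 20% of the data for validation
    early_stop=3,       # early-stopping patience
    dump_n_examples=5,  # dump 5 examples for inspection
)
```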
finetune
finetune
finetune(
cfg_ft: FtConfig,
cfg_model: ModelConfig,
cfg_sft: SFTConfig,
unsloth: bool = False,
) -> None
Finetune an LLM using a given Dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cfg_ft` | `FtConfig` | The finetune configuration. | required |
| `cfg_model` | `ModelConfig` | The model configuration. | required |
| `cfg_sft` | `SFTConfig` | The trainer configuration. | required |
| `unsloth` | `bool` | Whether to use Unsloth. | `False` |
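A sketch of wiring the pieces together; treating ModelConfig and SFTConfig as TRL's classes is an assumption based only on their names, so adjust the imports to the actual package:

```python
from trl import ModelConfig, SFTConfig

cfg_model = ModelConfig(model_name_or_path="meta-llama/Llama-3.2-1B")
cfg_sft = SFTConfig(
    output_dir="runs/synthetic-data-ft",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

# cfg_ft is the FtConfig built above.
finetune(cfg_ft, cfg_model, cfg_sft, unsloth=False)
```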